Spark Study Notes

14. Spark Learning (Python version): Building a Machine Learning Workflow

2018-09-02 马淑
Preparation

In Spark 2.0 and later, the pyspark interactive shell automatically creates a SparkSession object named spark. When you need to create one manually, build it from SparkSession.builder, as in the following snippet:
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()
pyspark.ml depends on the numpy package. The Python 3 that ships with Ubuntu does not include numpy, so install it with the following command:
sudo pip3 install numpy
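
To confirm that numpy is now importable from Python 3, you can run:
python3 -c "import numpy; print(numpy.__version__)"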

Implementation

Create the program file in vim and enter the following code:
mashu@mashu-Inspiron-5458:/usr/local/spark/python_code/ML$ vim ml_pipelines.py

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Create SparkSession
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
        (0, "a b c d e spark", 1.0),
        (1, "b d", 0.0),
        (2, "spark f g h", 1.0),
        (3, "hadoop mapreduce", 1.0)],
        ["id", "text", "label"])

# Define the pipeline stages: tokenizer, hashingTF and lr
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)

# Build pipeline
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to the training documents, producing a PipelineModel
model = pipeline.fit(training)

# Prepare test documents
test = spark.createDataFrame([
        (4, "spark i j k"),
        (5, "l m n"),
        (6, "spark hadoop spark"),
        (7, "apache hadoop")],
        ["id","text"])

# Predict on the test documents and print the results
prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
        rid, text, prob, pred = row
        print("(%d,%s)-->prob=%s,prediction=%f" % (rid, text, str(prob), pred))

Save and quit vim, then run the program. The output is:

mashu@mashu-Inspiron-5458:/usr/local/spark/python_code/ML$ python ml_pipelines.py

(some warning messages omitted)

(4,spark i j k)-->prob=[0.08271837966143641,0.9172816203385635],prediction=1.000000
(5,l m n)-->prob=[0.26514807025650067,0.7348519297434992],prediction=1.000000
(6,spark hadoop spark)-->prob=[0.001869208706242662,0.9981307912937574],prediction=1.000000
(7,apache hadoop)-->prob=[0.029108488540555488,0.9708915114594445],prediction=1.000000
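
If you want to inspect what was learned, the fitted LogisticRegression stage is the last element of model.stages. A minimal sketch that could be appended to the script above (coefficients and intercept are standard attributes of the fitted model):

# Inspect the fitted LogisticRegression stage (the last pipeline stage)
lrModel = model.stages[-1]
print("coefficients: %s" % str(lrModel.coefficients))
print("intercept: %f" % lrModel.intercept)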

Because the training set here is very small, these predictions are not reliable; with more training data to learn from, prediction accuracy would improve significantly.
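
Beyond adding data, hyperparameter tuning can also help. Below is a minimal sketch using pyspark.ml's CrossValidator; the grid values and numFolds=2 are arbitrary choices, and with this four-row toy set the folds are too small to be meaningful, so in practice you would run this on a larger training DataFrame:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Search over the feature dimension of hashingTF and the regularization strength of lr
paramGrid = ParamGridBuilder() \
        .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
        .addGrid(lr.regParam, [0.1, 0.01]) \
        .build()

# Cross-validation picks the best parameter combination for the whole pipeline
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=2)
cvModel = cv.fit(training)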
