Running a Decision Tree Classifier with Spark MLlib (with Code)


1. Dataset overview:

This dataset contains the features shown in the image below; the goal is to use these features to decide whether to hire a candidate.

Code:

# Quick look at the dataset
import getpass
import pandas as pd

username = getpass.getuser()
Hire_data = pd.read_csv('/Users/{0}/Documents/MLCourse/PastHires.csv'.format(username))
Hire_data.head()

Dataset sample (image). Based on the parsing code in section 3.2, the columns are, in order: years of experience, currently employed (Y/N), number of previous employers, level of education (BS/MS/PhD), top-tier school (Y/N), interned (Y/N), and hired (Y/N).

2. Install the required packages (please install Anaconda3 with Python 3 beforehand; see: Anaconda installation)

2.1 Install the pydotplus package (used for visualizing the decision tree)

Installation: in a Terminal, run conda install pydotplus
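
To check that pydotplus (and the Graphviz binaries it relies on) installed correctly, the sketch below renders a tiny hand-written DOT graph to a PNG. This is only a sanity check with made-up node labels, not part of the Spark job:

# Render a toy DOT graph to PNG to verify the pydotplus install
import pydotplus

dot_data = """
digraph Tree {
    0 [label="Interned == Y?"];
    1 [label="Hire"];
    2 [label="Don't hire"];
    0 -> 1;
    0 -> 2;
}
"""
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png('pydotplus_check.png')  # creates pydotplus_check.png in the working directory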

2.2 Install PySpark (used for running machine learning jobs on Spark)

Installation: in a Terminal, run pip install pyspark

2.3 Install Pandas

Installation: in a Terminal, run pip install pandas

2.4 Install NumPy

Installation: in a Terminal, run pip install numpy
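
Once everything is installed, a quick import check (an optional step, not strictly required) confirms all four packages are available:

# Verify that all four packages import cleanly
import pydotplus
import pyspark
import pandas
import numpy

print('pyspark', pyspark.__version__)
print('pandas', pandas.__version__)
print('numpy', numpy.__version__)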


3. Build a decision tree model and predict the result

3.1 Label and feature columns

The Hired column serves as the label; all other columns serve as features.
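
To make that split concrete, here is a minimal sketch (with made-up values) of what one training record becomes after the conversion in section 3.2: the label is the Hired column mapped to 1/0, and the features are the six remaining columns packed into one vector.

from pyspark.mllib.regression import LabeledPoint
from numpy import array

# Hypothetical candidate: 10 years of experience, currently employed,
# 4 previous employers, BS degree, not a top-tier school, interned; hired (label = 1)
example = LabeledPoint(1, array([10, 1, 4, 1, 0, 1]))
print(example.label)     # 1.0
print(example.features)  # [10.0,1.0,4.0,1.0,0.0,1.0]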

3.2 Implementation:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
from pyspark import SparkConf, SparkContext
from numpy import array
import pandas as pd
import getpass

username = getpass.getuser()

# Initialize a Spark context:
conf = SparkConf().setMaster("local").setAppName("SparkDecisionTree")
sc = SparkContext(conf=conf)

# binary and mapEducation convert non-numeric values to numbers for easier processing;
# createLabeledPoints returns the label column plus the required feature columns.
def binary(YN):
    if (YN == 'Y'):
        return 1
    else:
        return 0

def mapEducation(degree):
    if (degree == 'BS'):
        return 1
    elif (degree == 'MS'):
        return 2
    elif (degree == 'PhD'):
        return 3
    else:
        return 0

# Convert a list of raw fields from our CSV file to a
# LabeledPoint that MLlib can use. All data must be numerical...
def createLabeledPoints(fields):
    yearsExperience = int(fields[0])
    employed = binary(fields[1])
    previousEmployers = int(fields[2])
    educationLevel = mapEducation(fields[3])
    topTier = binary(fields[4])
    interned = binary(fields[5])
    hired = binary(fields[6])
    return LabeledPoint(hired, array([yearsExperience, employed,
        previousEmployers, educationLevel, topTier, interned]))

# Reading a local file through the SparkContext creates an RDD
rawData = sc.textFile('/Users/{0}/Documents/MLCourse/PastHires.csv'.format(username))

# Grab the first row, which contains the field names
header = rawData.first()

# Keep every row except the header; the result is still an RDD
rawData = rawData.filter(lambda x: x != header)

# Split each row into fields on commas
csvData = rawData.map(lambda x: x.split(","))

# Build the training set
trainingData = csvData.map(createLabeledPoints)

# Build the test set; each number corresponds to a column:
# 10 years of experience, currently employed, 3 previous employers,
# a BS degree, not from a top-tier school, never interned
testCandidates = [array([10, 1, 3, 1, 0, 0])]

# Turn the test set into an RDD (see the Spark documentation on RDDs if needed)
testData = sc.parallelize(testCandidates)

# Train a decision tree model on the training set.
# categoricalFeaturesInfo maps feature index -> number of categories: 1:2 means
# feature 1 (the employed? column) has two categories, because that column only
# contains Y and N; likewise 3:4 (education level has four possible values),
# 4:2 (top-tier school?), and 5:2 (interned?).
model = DecisionTree.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={1: 2, 3: 4, 4: 2, 5: 2},
                                     impurity='gini', maxDepth=5, maxBins=32)

# Run the test set through the decision tree model
predictions = model.predict(testData)

# Print the final prediction
print('Hire prediction:')
results = predictions.collect()
for result in results:
    print(result)

The prediction is Hire.

# Print how the decision tree works internally
print('Learned classification tree model:')
print(model.toDebugString())

Decision tree internals (the output of toDebugString)
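
As a side note, DecisionTreeModel.predict also accepts a single feature vector on the driver, which is handy for spot-checking one candidate without building an RDD. A small sketch, reusing the model and array from the code above:

# Spot-check one candidate directly (same feature layout as testCandidates)
single = model.predict(array([10, 1, 3, 1, 0, 0]))
print(single)  # 1.0 means hire, 0.0 means don't hire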

Reference: https://www.udemy.com/course/data-science-and-machine-learning-with-python-hands-on/

If you repost this article, please cite the reference. Finally, I hope this summary helps you. Happy learning :)
