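Naive Bayes Multiclass Classification with Spark MLlib
Reference book: Machine Learning with Spark by Nick Pentreath (Chinese edition: 《Spark机器学习》, 人民邮电出版社)
Experimental data
Using the Wine dataset (https://archive.ics.uci.edu/ml/datasets/Wine), we run a programming experiment on the Spark platform with a naive Bayes model and analyze the results.
A look at the data: the first column is the class label, which takes three values (1, 2 and 3); all remaining columns are numeric.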

The wine.names file states:
Missing Attribute Values:
None
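So the dataset has no missing data, and no imputation is needed.
Data processing
Convert the data so that every feature value is nonnegative (>= 0)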
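import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Path to the local copy of wine.data
val s = "\\作业\\wine.data"
val rawData = sc.textFile(s)
val records = rawData.map(line => line.split(","))
records.first()
val nbData = records.map { r =>
  val trimmed = r.map(_.toDouble)
  // The first column is the class label
  val label = trimmed(0).toInt
  // Clamp negative values to 0.0, since MLlib's multinomial naive Bayes requires nonnegative features
  val features = trimmed.slice(1, r.size).map(d => if (d < 0) 0.0 else d)
  LabeledPoint(label, Vectors.dense(features))
}
nbData.foreach(println)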
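The clamping step above silently changes any negative value, so it can be useful to know whether it touches the data at all. The following is a minimal plain-Scala sketch (no Spark needed) of the same parsing and clamping logic, plus a counter for negative values. The object name `WineParsing` and its method names are hypothetical helpers introduced only for illustration; they are not part of MLlib or of the original experiment.

```scala
// Hypothetical helper names; plain Scala, no Spark required.
object WineParsing {
  /** Splits one comma-separated line into (label, raw feature values). */
  def parseLine(line: String): (Double, Array[Double]) = {
    val values = line.split(",").map(_.trim.toDouble)
    (values.head, values.tail)
  }

  /** Replaces every negative value with 0.0, mirroring the clamping above. */
  def clampNonNegative(features: Array[Double]): Array[Double] =
    features.map(d => if (d < 0) 0.0 else d)

  /** Counts how many feature values clamping would change. */
  def countNegatives(rows: Seq[Array[Double]]): Int =
    rows.map(_.count(_ < 0)).sum
}
```

If `countNegatives` returns 0 for the parsed rows, the clamping is a no-op and the model sees the raw feature values.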

Running the prediction experiment
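import org.apache.spark.mllib.classification.NaiveBayes

val numData = nbData.count
// Train on the full dataset
val nbModel = NaiveBayes.train(nbData)
// Count the points whose predicted label matches the true label
val nbTotalCorrect = nbData.map { point =>
  if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum
// Print the first 100 predictions
val ex = nbData.map(x => nbModel.predict(x.features))
println(ex.take(100).mkString(","))
// Accuracy = correct predictions / total number of points
val nbAccuracy = nbTotalCorrect / numData
println(nbAccuracy)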
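Printed output: the first 100 predictions
Printed output: the prediction accuracy
Because we predict on the same data the model was trained on, and the original data is ordered so that the first records all belong to class 1, some wrong predictions can be spotted by eye. The prediction accuracy is 0.8707865168539326.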
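An accuracy measured on the training data itself is likely to be optimistic. One common alternative is to hold out part of the data for testing. The following is a minimal sketch under stated assumptions: `HoldOutEvaluation`, `holdOutAccuracy`, the 70/30 split and the seed are hypothetical choices made for illustration and were not part of the original experiment, so the number it produces will differ from the one reported above.

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper; assumes Spark MLlib (RDD-based API) is on the classpath.
object HoldOutEvaluation {
  /** Trains on a random fraction of the data and returns accuracy on the rest. */
  def holdOutAccuracy(data: RDD[LabeledPoint],
                      trainFraction: Double = 0.7, // assumed split ratio
                      seed: Long = 42L): Double = { // assumed seed, for repeatability
    val Array(train, test) =
      data.randomSplit(Array(trainFraction, 1.0 - trainFraction), seed)
    val model = NaiveBayes.train(train)
    // Count held-out points whose predicted label equals the true label
    val correct = test.filter(p => model.predict(p.features) == p.label).count()
    correct.toDouble / test.count()
  }
}
```

In spark-shell this could be called as `HoldOutEvaluation.holdOutAccuracy(nbData)`. Since the records are ordered by class, a random split (rather than taking the first rows) keeps all three classes represented in the training set.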

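Next we evaluate with the built-in library. Here we use MulticlassMetrics, a class in the org.apache.spark.mllib.evaluation package; its API is documented at http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/MulticlassMetrics.html
We use the accuracy and the per-class F1 score to judge the quality of the model.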
import org.apache.spark.mllib.evaluation.MulticlassMetrics

val nbMetrics = Seq(nbModel).map { model =>
  // Pair each prediction with its true label
  val scoreAndLabels = nbData.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new MulticlassMetrics(scoreAndLabels)
  // Overall accuracy plus the F1 score of classes 1, 2 and 3
  (model.getClass.getSimpleName, metrics.accuracy, metrics.fMeasure(1.0), metrics.fMeasure(2.0), metrics.fMeasure(3.0))
}
nbMetrics.foreach { case (m, pr, f1, f2, f3) =>
  println(f"$m, Accuracy: ${pr * 100.0}%2.4f%%, label1Fvalue ${f1 * 100.0}%2.4f%%,label2Fvalue ${f2 * 100.0}%2.4f%%,label3Fvalue ${f3 * 100.0}%2.4f%%")
}

NaiveBayesModel, Accuracy: 87.0787%, label1Fvalue 90.4348%,label2Fvalue 86.5248%,label3Fvalue 84.0000%
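The accuracy matches the value 0.8707865168539326 computed by counting correct predictions above. Among the F1 scores, label1 (class 1.0) is the highest, so the model predicts class 1.0 best.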
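To make the per-class F1 score concrete, here is a minimal plain-Scala sketch of how it can be computed from (prediction, label) pairs: precision and recall for one class, combined by their harmonic mean. The object `PerClassF1` and its methods are hypothetical names for illustration and are not MLlib's implementation, although `fMeasure(label)` reports the same quantity (F-measure with beta = 1).

```scala
// Hypothetical helper; plain Scala, no Spark required.
object PerClassF1 {
  /** F1 score of one class, from (prediction, trueLabel) pairs. */
  def f1(pairs: Seq[(Double, Double)], label: Double): Double = {
    val tp = pairs.count { case (p, l) => p == label && l == label } // true positives
    val fp = pairs.count { case (p, l) => p == label && l != label } // false positives
    val fn = pairs.count { case (p, l) => p != label && l == label } // false negatives
    val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    // Harmonic mean of precision and recall; 0.0 when both are zero
    if (precision + recall == 0.0) 0.0
    else 2 * precision * recall / (precision + recall)
  }

  /** Fraction of pairs whose prediction equals the true label. */
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    pairs.count { case (p, l) => p == l }.toDouble / pairs.size
}
```

A class with a lower F1 score than the others is one where the model either misses more of its true members (low recall) or assigns the label too often to other classes (low precision).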