
Spark MLlib Basic Statistics

2020-01-27  spraysss

Correlation

Calculating the correlation between two series of data is a common operation in statistics. spark.ml provides the flexibility to calculate pairwise correlations among many series; the supported correlation methods are currently Pearson's and Spearman's correlation.

Example
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

/**
 * Correlation example
 */

object CorrelationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CorrelationExample").master("local[16]").getOrCreate()
    import spark.implicits._
    /**
     * 1.0  0.0 0.0 -2.0
     * 4.0  5.0 0.0 3.0
     * 6.0  7.0 0.0 8.0
     * 9.0  0.0 0.0 1.0
     */
    // Correlation

    val data = Seq(
      // sparse vector: v[0]=1.0, v[3]=-2.0
      Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
      Vectors.dense(4.0, 5.0, 0.0, 3.0),
      Vectors.dense(6.0, 7.0, 0.0, 8.0),
      Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
    )


    val df = data.map(Tuple1.apply).toDF("features")

    val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
    println(s"Pearson correlation matrix:\n $coeff1")

    val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
    println(s"Spearman correlation matrix:\n $coeff2")

    spark.stop()
  }
}

Vectors.sparse creates a sparse vector
Vectors.dense creates a dense vector
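The two constructors describe the same data in different storage layouts. A minimal sketch of the equivalence (runnable in a Scala REPL with spark-mllib on the classpath; the values mirror the first row of the example above):

import org.apache.spark.ml.linalg.Vectors

// size 4, with non-zero entries only at indices 0 and 3
val sv = Vectors.sparse(4, Seq((0, 1.0), (3, -2.0)))
// the same vector written out with its zeros
val dv = Vectors.dense(1.0, 0.0, 0.0, -2.0)

println(sv.toDense)   // [1.0,0.0,0.0,-2.0]
println(dv.toSparse)  // (4,[0,3],[1.0,-2.0])
println(sv == dv)     // true: vectors compare by value, not by storage format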

Output
...
Pearson correlation matrix:
1.0                   0.055641488407465814  NaN  0.4004714203168137  
0.055641488407465814  1.0                   NaN  0.9135958615342522  
NaN                   NaN                   1.0  NaN                 
0.4004714203168137    0.9135958615342522    NaN  1.0    

...

Spearman correlation matrix:
1.0                  0.10540925533894532  NaN  0.40000000000000174  
0.10540925533894532  1.0                  NaN  0.9486832980505141   
NaN                  NaN                  1.0  NaN                  
0.40000000000000174  0.9486832980505141   NaN  1.0      

The result is a matrix M in which M[i][j] is the correlation coefficient between the i-th and j-th columns of the data (i.e., between two of the feature series, not between two input rows). Taking the Pearson output matrix as an example, its first row

1.0                   0.055641488407465814  NaN  0.4004714203168137

means that the Pearson correlation of the first column with itself is 1, with the second column is 0.055641488407465814, with the third column is undefined (NaN, because the third column is constantly 0.0 and has zero variance), and with the fourth column is 0.4004714203168137.
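If a single coefficient is needed rather than the whole matrix, the returned org.apache.spark.ml.linalg.Matrix can be indexed directly. A minimal sketch, assuming coeff1 from the example above:

// Pearson correlation between column 0 and column 3 (0-based indices)
println(coeff1(0, 3))        // 0.4004714203168137

// NaN marks an undefined correlation; column 2 is constant, so its variance is 0
println(coeff1(0, 2).isNaN)  // true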

Hypothesis testing

Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant, i.e., whether or not it occurred by chance. spark.ml currently supports Pearson's Chi-squared (χ²) tests for independence.

ChiSquareTest conducts Pearson’s independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared statistic is computed. All label and feature values must be categorical.

Example

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.sql.SparkSession

/**
 * An example for Chi-square hypothesis testing.
 * Run with
 * {{{
 * bin/run-example ml.ChiSquareTestExample
 * }}}
 * Pearson's Chi-squared test
 */
object ChiSquareTestExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ChiSquareTestExample").master("local[16]").getOrCreate()
    import spark.implicits._

    val data = Seq(
      (0.0, Vectors.dense(0.5, 10.0)),
      (0.0, Vectors.dense(1.5, 20.0)),
      (1.0, Vectors.dense(1.5, 30.0)),
      (0.0, Vectors.dense(3.5, 30.0)),
      (0.0, Vectors.dense(3.5, 40.0)),
      (1.0, Vectors.dense(3.5, 40.0))
    )

    val df = data.toDF("label", "features")
    val chi = ChiSquareTest.test(df, "features", "label").head
    println(s"pValues = ${chi.getAs[Vector](0)}")
    println(s"degreesOfFreedom ${chi.getSeq[Int](1).mkString("[", ",", "]")}")
    println(s"statistics ${chi.getAs[Vector](2)}")

    spark.stop()
  }
}
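To make the contingency matrix mentioned above concrete: for each feature, ChiSquareTest counts how often every (feature value, label) combination occurs. A minimal hand-rolled sketch of that table for the first feature, reusing the spark session and implicits from the example above (the column names are only illustrative):

// the first feature column paired with the label, written out by hand
val pairs = Seq(
  (0.0, 0.5), (0.0, 1.5), (1.0, 1.5),
  (0.0, 3.5), (0.0, 3.5), (1.0, 3.5)
).toDF("label", "feature0")

// one row per label value, one column per feature value;
// these counts are what the Chi-squared statistic is computed from
pairs.stat.crosstab("label", "feature0").show()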

Summarizer

We provide vector column summary statistics for Dataframe through Summarizer. Available metrics are the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.


import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

object SummarizerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SummarizerExample").master("local[16]").getOrCreate()

    import Summarizer._
    import spark.implicits._

    val data = Seq(
      (Vectors.dense(2.0, 3.0, 5.0), 1.0),
      (Vectors.dense(4.0, 6.0, 7.0), 2.0)
    )

    val df = data.toDF("features", "weight")

    val (meanVal, varianceVal) = df.select(metrics("mean", "variance")
      .summary($"features", $"weight").as("summary"))
      .select("summary.mean", "summary.variance")
      .as[(Vector, Vector)].first()

    println(s"with weight: mean = ${meanVal}, variance = ${varianceVal}")

    val (meanVal2, varianceVal2) = df.select(mean($"features"), variance($"features"))
      .as[(Vector, Vector)].first()

    println(s"without weight: mean = ${meanVal2}, sum = ${varianceVal2}")

    spark.stop()
  }
}
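The example above only reads mean and variance; the other metrics listed in the quote (max, min, number of nonzeros, count) are requested the same way. A minimal sketch, assuming the same spark session, imports, and df as above:

val extra = df.select(
  Summarizer.metrics("max", "min", "numNonZeros", "count")
    .summary($"features").as("s"))
  .select("s.max", "s.min", "s.numNonZeros", "s.count")

extra.show(truncate = false)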