PySpark NoteBook-9:GLM

2018-01-11 7125messi

GLM and Data Preparation

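Summary: We explore how to prepare data for analysis in PySpark, in particular how to set up the independent variables. We then fit a GLM on the prepared data and show how to inspect the results.

Main operations used: Pipeline, StringIndexer, OneHotEncoder, VectorAssembler, GeneralizedLinearRegression, fit, build_indep_vars, summarizer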

from IPython.display import Image
Image(filename='stata_reg.PNG')


df = spark.read.csv('s3://ui-spark-social-science-public/data/diamonds.csv', inferSchema=True, header=True, sep=',')

df = df[['carat', 'clarity', 'price']]

from pyspark.sql.functions import log
df = df.withColumn('lprice', log('price'))
df.show(5)
+-----+-------+-----+------------------+
|carat|clarity|price|            lprice|
+-----+-------+-----+------------------+
| 0.23|    SI2|  326| 5.786897381366708|
| 0.21|    SI1|  326| 5.786897381366708|
| 0.23|    VS1|  327|5.7899601708972535|
| 0.29|    VS2|  334| 5.811140992976701|
| 0.31|    SI2|  335| 5.814130531825066|
+-----+-------+-----+------------------+

from pyspark.ml.regression import GeneralizedLinearRegression

glm = GeneralizedLinearRegression(family='gaussian', 
                                  link='identity', 
                                  labelCol='lprice', 
                                  featuresCol='indep_vars', 
                                  fitIntercept=True)

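# df must already contain the 'indep_vars' feature vector column
# (carat plus the clarity dummies, e.g. as produced by build_indep_vars) before fitting
model = glm.fit(df)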

model.coefficients
> DenseVector([2.0808, 0.7221, 0.8172, 0.5682, 0.8555, 0.9341, 0.9195, 0.9979])

model.intercept
> 5.356165273724909
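Because the label is log(price) and the link is the identity, a fitted value is simply the intercept plus the matching coefficients, and exponentiating it gives a rough price on the original scale. The sketch below is plain Python rather than Spark. It copies the intercept and the rounded carat and SI2 coefficients printed above, and it assumes coefficient positions map to names in the order listed by summarizer.param_crosswalk. The helper name predict_lprice is made up for illustration.

```python
import math

# Values copied from the printed model output (coefficients are rounded to 4 places).
INTERCEPT = 5.356165273724909
COEF = {"carat": 2.0808, "SI2": 0.5682}  # assumed name-to-position pairing


def predict_lprice(carat, clarity_level=None):
    """Linear predictor on the log-price scale.

    clarity_level=None stands for the omitted reference clarity level,
    whose dummy columns are all zero.
    """
    lprice = INTERCEPT + COEF["carat"] * carat
    if clarity_level is not None:
        lprice += COEF[clarity_level]
    return lprice


lprice_hat = predict_lprice(0.23, "SI2")  # same carat and clarity as the first row of df.show(5)
price_hat = math.exp(lprice_hat)          # back-transform to the price scale
print(round(lprice_hat, 4), round(price_hat, 2))
```

The back-transformed value is only approximate: exponentiating a predicted log price does not give the expected price exactly, so treat it as a quick sanity check rather than a calibrated prediction.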

model.summary.tValues
[573.8145328923426,
 51.24311447314843,
 57.736801306869914,
 40.069982524765486,
 59.53629044537795,
 63.12762399645197,
 60.46906625364304,
 60.69082330168219,
 372.05231564539275]

model.summary.coefficientStandardErrors
[0.0036263332330212492,
 0.014091572711412436,
 0.014154515091987462,
 0.014180739640881894,
 0.014369486052126049,
 0.01479661778476952,
 0.015205659565030371,
 0.01644307646451729,
 0.014396269149497595]
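Spark's documentation for the GLM training summary states that, when fitIntercept is True, the last element of tValues and coefficientStandardErrors belongs to the intercept, whereas model.coefficients excludes the intercept. One way to confirm which value goes with which parameter is to recompute each t-value as the estimate divided by its standard error. The sketch below does this in plain Python with the numbers printed above. The variable names are illustrative, and the small differences arise because the printed coefficients are rounded to four places.

```python
import math

# Numbers copied from the model output above.
coefficients = [2.0808, 0.7221, 0.8172, 0.5682, 0.8555, 0.9341, 0.9195, 0.9979]
intercept = 5.356165273724909
std_errors = [
    0.0036263332330212492, 0.014091572711412436, 0.014154515091987462,
    0.014180739640881894, 0.014369486052126049, 0.01479661778476952,
    0.015205659565030371, 0.01644307646451729, 0.014396269149497595,
]
t_values = [
    573.8145328923426, 51.24311447314843, 57.736801306869914,
    40.069982524765486, 59.53629044537795, 63.12762399645197,
    60.46906625364304, 60.69082330168219, 372.05231564539275,
]

# Append the intercept last so the estimates follow the summary's ordering.
estimates = coefficients + [intercept]
recomputed = [est / se for est, se in zip(estimates, std_errors)]

# Print estimate, standard error, recomputed t and reported t side by side.
for est, se, t_new, t_old in zip(estimates, std_errors, recomputed, t_values):
    print(f"{est:8.4f} {se:8.4f} {t_new:10.4f} {t_old:10.4f}")
```

With this ordering the recomputed and reported t-values agree closely in every position, so the same pairing can be used to cross-check any summary table row by row.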

summarizer.summarize(model)
---------------------------------------------------------
               |   Coef    Std Err    T Stat    P Val   
---------------------------------------------------------
      intercept|  5.3562    0.0144   372.0523    0.0    
          carat|  2.0808    0.0036   573.8145    0.0    
            SI1|  0.7221    0.0141   51.2431     0.0    
            VS2|  0.8172    0.0142   57.7368     0.0    
            SI2|  0.5682    0.0142    40.07      0.0    
            VS1|  0.8555    0.0144   59.5363     0.0    
           VVS2|  0.9341    0.0148   63.1276     0.0    
           VVS1|  0.9195    0.0152   60.4691     0.0    
             IF|  0.9979    0.0164   60.6908     0.0    
---------------------------------------------------------

summarizer.param_crosswalk
{0: 'carat',
 1: u'SI1',
 2: u'VS2',
 3: u'SI2',
 4: u'VS1',
 5: u'VVS2',
 6: u'VVS1',
 7: u'IF'}
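The crosswalk maps each position in the feature vector to a variable name. By default, StringIndexer assigns indices in order of descending label frequency, and OneHotEncoder drops the last category so that it acts as the reference level. Whether build_indep_vars relies on those defaults is an assumption here, since its source is not shown. The sketch below imitates that logic in plain Python on made-up toy data. The function names are hypothetical, and ties are broken alphabetically for determinism.

```python
from collections import Counter


def frequency_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot_drop_last(label, index):
    """Dummy vector with one slot per label except the last-indexed one."""
    width = len(index) - 1
    vec = [0.0] * width
    pos = index[label]
    if pos < width:  # the last label is the reference level: all zeros
        vec[pos] = 1.0
    return vec


# Toy clarity column (not the real data): SI1 x3, VS2 x2, I1 x1.
clarity = ["SI1", "VS2", "SI1", "I1", "SI1", "VS2"]
index = frequency_index(clarity)

# Position 0 is the numeric variable; the retained dummies follow in index order.
crosswalk = {0: "carat"}
crosswalk.update({i + 1: label for label, i in index.items() if i < len(index) - 1})

print(index)
print(crosswalk)
print(one_hot_drop_last("VS2", index), one_hot_drop_last("I1", index))
```

In this toy example the rarest label, I1, gets no column of its own, so every dummy coefficient is read as a difference from that omitted level.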
