PySpark NoteBook-9: GLM
2018-01-11
7125messi
GLM and Data Preparation
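Summary: We explore how to prepare data for analysis in PySpark, in particular how to set up the independent variables. We then test the prepared data with the GLM function and show how to view the results.
Main operations used: Pipeline, StringIndexer, OneHotEncoder, VectorAssembler, GeneralizedLinearRegression, fit, build_indep_vars, summarizer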
from IPython.display import Image
Image(filename='stata_reg.PNG')
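# Load the diamonds data, reading the header row and inferring column types
df = spark.read.csv('s3://ui-spark-social-science-public/data/diamonds.csv', inferSchema=True, header=True, sep=',')
# Keep only the columns used in the regression
df = df[['carat', 'clarity', 'price']]
from pyspark.sql.functions import log
# The dependent variable is the natural log of price
df = df.withColumn('lprice', log('price'))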
df.show(5)
+-----+-------+-----+------------------+
|carat|clarity|price| lprice|
+-----+-------+-----+------------------+
| 0.23| SI2| 326| 5.786897381366708|
| 0.21| SI1| 326| 5.786897381366708|
| 0.23| VS1| 327|5.7899601708972535|
| 0.29| VS2| 334| 5.811140992976701|
| 0.31| SI2| 335| 5.814130531825066|
+-----+-------+-----+------------------+
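The GLM call that follows reads its features from a single vector column named `indep_vars`, so the categorical `clarity` column has to be turned into numbers first. The following is a minimal plain-Python sketch of the encoding that StringIndexer and OneHotEncoder perform with their default settings (labels indexed by descending frequency, last category dropped), followed by the concatenation that VectorAssembler does. The function names and the toy rows are hypothetical and only illustrate the idea; this is not the notebook's `build_indep_vars` helper.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot_drop_last(index, n_labels):
    """One-hot encode into n_labels - 1 slots; the last index becomes all zeros."""
    vec = [0.0] * (n_labels - 1)
    if index < n_labels - 1:
        vec[index] = 1.0
    return vec

def build_rows(rows):
    """Return one feature list per row: [carat, clarity dummies...]."""
    label_index = fit_string_index([clarity for _, clarity in rows])
    features = [
        [carat] + one_hot_drop_last(label_index[clarity], len(label_index))
        for carat, clarity in rows
    ]
    return features, label_index

# Hypothetical toy rows (carat, clarity), not taken from the diamonds file
toy_rows = [(0.23, 'SI1'), (0.21, 'SI1'), (0.23, 'SI1'),
            (0.29, 'VS2'), (0.31, 'VS2'), (0.30, 'I1')]
features, label_index = build_rows(toy_rows)
for row, feats in zip(toy_rows, features):
    print(row, '->', feats)
```

The least frequent label gets the last index and is encoded as all zeros, so it acts as the baseline category. This is why a column with k levels contributes k - 1 coefficients to the model.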
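from pyspark.ml.regression import GeneralizedLinearRegression
# Gaussian family with identity link is ordinary linear regression on lprice.
# 'indep_vars' must already exist in df as a vector column holding carat and
# the encoded clarity dummies (the data preparation step, e.g. build_indep_vars).
glm = GeneralizedLinearRegression(family='gaussian',
                                  link='identity',
                                  labelCol='lprice',
                                  featuresCol='indep_vars',
                                  fitIntercept=True)
model = glm.fit(df)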
model.coefficients
> DenseVector([2.0808, 0.7221, 0.8172, 0.5682, 0.8555, 0.9341, 0.9195, 0.9979])
model.intercept
> 5.356165273724909
model.summary.tValues
[573.8145328923426,
51.24311447314843,
57.736801306869914,
40.069982524765486,
59.53629044537795,
63.12762399645197,
60.46906625364304,
60.69082330168219,
372.05231564539275]
model.summary.coefficientStandardErrors
[0.0036263332330212492,
0.014091572711412436,
0.014154515091987462,
0.014180739640881894,
0.014369486052126049,
0.01479661778476952,
0.015205659565030371,
0.01644307646451729,
0.014396269149497595]
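The two lists above have nine entries while `model.coefficients` has eight. When `fitIntercept=True`, Spark appends the intercept's statistic as the last element of `tValues` and `coefficientStandardErrors`. The sketch below pairs each estimate with its standard error under that ordering and checks that estimate / standard error reproduces the reported t-value. The values are copied from the outputs in this notebook, and the names come from `summarizer.param_crosswalk`; the variable names are chosen here for illustration.

```python
# Values copied from the notebook outputs
coefficients = [2.0808, 0.7221, 0.8172, 0.5682, 0.8555, 0.9341, 0.9195, 0.9979]
intercept = 5.356165273724909
std_errors = [0.0036263332330212492, 0.014091572711412436, 0.014154515091987462,
              0.014180739640881894, 0.014369486052126049, 0.01479661778476952,
              0.015205659565030371, 0.01644307646451729, 0.014396269149497595]
t_values = [573.8145328923426, 51.24311447314843, 57.736801306869914,
            40.069982524765486, 59.53629044537795, 63.12762399645197,
            60.46906625364304, 60.69082330168219, 372.05231564539275]

# Feature names in coefficient order, with the intercept placed last
names = ['carat', 'SI1', 'VS2', 'SI2', 'VS1', 'VVS2', 'VVS1', 'IF', 'intercept']
estimates = coefficients + [intercept]

rows = list(zip(names, estimates, std_errors, t_values))
for name, est, se, t in rows:
    print(f"{name:>9} | coef={est:.4f}  se={se:.4f}  t={t:.4f}  coef/se={est / se:.4f}")

# Largest relative gap between coef/se and the reported t-value
# (small but nonzero because the printed coefficients are rounded)
max_rel_gap = max(abs(est / se - t) / t for _, est, se, t in rows)
```

With this ordering the first standard error (0.0036) and t-value (573.81) belong to `carat`, and the last pair (0.0144, 372.05) belongs to the intercept.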
summarizer.summarize(model)
---------------------------------------------------------
         |      Coef   Std Err    T Stat     P Val
---------------------------------------------------------
intercept|    5.3562    0.0144  372.0523       0.0
    carat|    2.0808    0.0036  573.8145       0.0
      SI1|    0.7221    0.0141   51.2431       0.0
      VS2|    0.8172    0.0142   57.7368       0.0
      SI2|    0.5682    0.0142     40.07       0.0
      VS1|    0.8555    0.0144   59.5363       0.0
     VVS2|    0.9341    0.0148   63.1276       0.0
     VVS1|    0.9195    0.0152   60.4691       0.0
       IF|    0.9979    0.0164   60.6908       0.0
---------------------------------------------------------
summarizer.param_crosswalk
{0: 'carat',
1: u'SI1',
2: u'VS2',
3: u'SI2',
4: u'VS1',
5: u'VVS2',
6: u'VVS1',
7: u'IF'}
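The crosswalk maps each position in `model.coefficients` to a readable name. Because the label is the log of price, a coefficient b corresponds to a proportional price change of exp(b) - 1 for a one-unit increase in that variable. For the clarity dummies this is relative to the omitted clarity level, which does not appear in the crosswalk. A minimal sketch, with the values copied from the outputs above and variable names chosen here for illustration:

```python
import math

# Copied from summarizer.param_crosswalk and model.coefficients
param_crosswalk = {0: 'carat', 1: 'SI1', 2: 'VS2', 3: 'SI2',
                   4: 'VS1', 5: 'VVS2', 6: 'VVS1', 7: 'IF'}
coefficients = [2.0808, 0.7221, 0.8172, 0.5682, 0.8555, 0.9341, 0.9195, 0.9979]

# Attach a name to every coefficient
named = {param_crosswalk[i]: b for i, b in enumerate(coefficients)}

# Log-level model: exp(b) - 1 is the proportional change in price
pct_effect = {name: math.exp(b) - 1 for name, b in named.items()}

for name, b in named.items():
    print(f"{name:>6}: coef={b:.4f}  price change={pct_effect[name]:+.1%}")
```

For `carat` the unit is a full carat, which is a large step for this data, so its percentage figure is best read alongside the typical carat range rather than in isolation.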