03 - linear regression

2017-07-15  本文已影响0人  西瓜三茶

Pisa Tower Example

import pandas
import matplotlib.pyplot as plt

pisa = pandas.DataFrame({"year": range(1975, 1988), 
                         "lean": [2.9642, 2.9644, 2.9656, 2.9667, 2.9673, 2.9688, 2.9696, 
                                  2.9698, 2.9713, 2.9717, 2.9725, 2.9742, 2.9757]})

print(pisa)

plt.scatter(pisa["year"], pisa["lean"])
plt.show()

增加一列截距项

import statsmodels.api as sm

y = pisa.lean # target
X = pisa.year  # features
X = sm.add_constant(X)  # add a column of 1's as the constant term

# OLS -- Ordinary Least Squares Fit
linear = sm.OLS(y, X)
# fit model
linearfit = linear.fit()
print (linearfit.summary())

计算residual value

# Our predicted values of y
yhat = linearfit.predict(X)
print(yhat)

residuals = yhat - y
print (residuals)

#plot residuals and if it looks similar to a bell curve then we will accept the assumption of normality
plt.hist(residuals, bins = 5)
plt.show()

3个summary of squares

Sum of Square Error (SSE)  
Regression Sum of Squares (RSS) 
Total Sum of Squares (TSS)   # total variance

SSE = np.sum((y.values-yhat)**2)
RSS = np.sum((y.mean()-yhat)**2)
TSS = np.sum((y.values-yhat.mean())**2)

TSS=RSS+SSE

R_squared = 1 - SSE / TSS = RSS / TSS
# what the percentage of variation in the target variable is explained by our model.

Show params for the regression model

print("\n",linearfit.params)

The variance of each of the coefficients is an important and powerful measure.

# Compute SSE
SSE = np.sum((y.values - yhat)**2)
# Compute variance in X
xvar = np.sum((pisa.year - pisa.year.mean())**2)
# Compute variance in b1 
s2b1 = SSE / ((y.shape[0] - 2) * xvar)

pdf and cdf

from scipy.stats import t

# 100 values between -3 and 3
x = np.linspace(-3,3,100)

# Compute the pdf with 3 degrees of freedom
tdist3 = t.pdf(x=x, df=3)
tdist30 = t.pdf(x=x, df=30)
            
plt.plot(x, tdist3)
plt.plot(x, tdist30)

计算系数的significance

The variable s2b1 is in memory.  The variance of beta_1
tstat = linearfit.params["year"] / np.sqrt(s2b1)

t检验

# At the 95% confidence interval for a two-sided t-test we must use a p-value of 0.975
pval = 0.975

# The degrees of freedom
df = pisa.shape[0] - 2

# The probability to test against
p = t.cdf(tstat, df=df)

beta1_test = p > pval
上一篇 下一篇

猜你喜欢

热点阅读