Solutions to ISLR
Chapter 2
1,
a) Better. With a very large sample size, a more flexible method can fit the underlying relationship more closely without overfitting.
b) Worse. With only a small number of observations, a more flexible statistical method will tend to overfit.
c) Better. A highly non-linear relationship needs a flexible method to capture it; an inflexible method would have high bias.
d) Worse. When the variance of the error terms is very high, a more flexible method will chase the noise and overfit.
2,
a) Regression (inference): n = 500 firms in the US; p = 3 (profit, number of employees, industry).
b) Classification (prediction): n = 20 previously launched products; p = 13 (price, marketing budget, competition price, and ten other variables).
c) Regression (prediction): n = 52 weeks of 2012; p = 3 (% change in the US market, % change in the British market, % change in the German market).
3,
a) The Bayes (irreducible) error is a horizontal line parallel to the x-axis, since it is a constant that does not depend on the method. The training error decreases steadily as flexibility increases, because a more flexible method fits the training data more and more closely. The test error first decreases and then rises again (a U shape), bottoming out near the Bayes error line, because beyond that point the method starts to overfit. The (squared) bias starts high and decreases with flexibility, since a more flexible method makes fewer assumptions about the form of f. The variance starts low and increases steadily, because the more closely a method fits one training set, the more its fit will change when the training data change.
b) As explained in a).
4,
a) i. recognizing the digit in an image: prediction; ii. judging from historical data whether a project will succeed: prediction; iii. a recommendation system that assigns users to categories: prediction.
b) i. estimating a function from the given data to understand how the predictors affect the response: inference; ii. predicting future stock prices: prediction; iii. estimating the price of a house next year: prediction.
c) i. finding types (strains) of a virus; ii. identifying geographical boundaries between groups of people holding different political beliefs; iii. grouping users by their taste in books.
5,
Advantages: a very flexible approach can capture complicated, non-linear relationships and fit the data more closely (lower bias). Disadvantages: it overfits easily when data are scarce, has high variance, and requires more computation to estimate its parameters.
A non-linear, complicated model calls for a more flexible approach; otherwise a less flexible approach is preferred.
6,
Parametric statistical learning methods:
pros: simpler, faster, and need less data; cons: constrained to an assumed functional form, limited complexity, and a potentially poor fit if the assumed form is wrong.
Non-parametric statistical learning methods:
pros: flexible, powerful, and often better performance; cons: need more data, are slower, and are prone to overfitting.
7,
a) 1: 3; 2: 2; 3: √10 ≈ 3.16; 4: √5 ≈ 2.24; 5: √2 ≈ 1.41; 6: √3 ≈ 1.73
b) Green, since Obs.5 is the closest point to the test data.
c) Red, since Obs. 5, 6, and 2 are the three closest points to the test point, and two of them (2 and 6) are Red.
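A quick R check of the distances and the two predictions above (the data are typed in from the exercise table; this sketch is mine, not part of the original answer):
X <- matrix(c( 0, 3, 0,
               2, 0, 0,
               0, 1, 3,
               0, 1, 2,
              -1, 0, 1,
               1, 1, 1), ncol = 3, byrow = TRUE)
Y <- c("Red", "Red", "Red", "Green", "Green", "Red")
d <- sqrt(rowSums(X^2))    # Euclidean distances to the test point (0, 0, 0)
round(d, 2)                # 3.00 2.00 3.16 2.24 1.41 1.73
Y[order(d)[1]]             # K = 1: nearest neighbor is Obs. 5 -> "Green"
table(Y[order(d)[1:3]])    # K = 3: Obs. 5, 6, 2 -> two Red, one Green -> "Red"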
9,
a) str(auto) # all variables are quantitative except name and horsepower
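If Auto.csv is read in directly, horsepower is imported as a factor because of the "?" entries that mark missing values. A minimal loading sketch (assuming Auto.csv is in the working directory) that keeps horsepower numeric:
auto <- read.csv("Auto.csv", na.strings = "?", stringsAsFactors = TRUE)
auto <- na.omit(auto)   # drop the rows with missing horsepower
str(auto)               # horsepower is now numeric; only name remains a factor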
b) summary(auto[, -c(4, 9)]) # the Min. and Max. rows give the range of each quantitative predictor
c) sapply(auto[, -c(4, 9)], mean) # mean of each quantitative predictor
sapply(auto[, -c(4, 9)], sd)      # standard deviation of each quantitative predictor
mpg cylinders displacement weight acceleration
7.8258039 1.7015770 104.3795833 847.9041195 2.7499953
year origin
3.6900049 0.8025495
d)
sapply(auto[-c(10:85), -c(4, 9)], range) # same statistics with observations 10 through 85 removed
sapply(auto[-c(10:85), -c(4, 9)], mean)
sapply(auto[-c(10:85), -c(4, 9)], sd)
Chapter 3
1,
The null hypotheses are that TV, radio, and newspaper each have no effect on sales (i.e., their coefficients are zero) when the other predictors are held fixed.
Conclusion: the p-values for TV and radio are essentially zero, so we reject the null hypotheses for them and conclude that both have an effect on sales; the p-value for newspaper is large, so we cannot reject the hypothesis that newspaper has no effect on sales.
2,
The KNN classifier is typically used to solve classification problems (those with a qualitative response) by identifying the neighborhood of x0 and then estimating the conditional probability P(Y = j | X = x0) for class j as the fraction of points in the neighborhood whose response values equal j. The KNN regression method is used to solve regression problems (those with a quantitative response) by again identifying the neighborhood of x0 and then estimating f(x0) as the average of all the training responses in the neighborhood.
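A minimal one-dimensional illustration of both estimates described above, using simulated data and K = 3 (my own sketch, not from the book):
set.seed(1)
x <- runif(50)
y_num <- sin(2 * pi * x) + rnorm(50, sd = 0.1)   # a quantitative response
y_cls <- ifelse(y_num > 0, "Up", "Down")         # a qualitative response
x0 <- 0.3                                        # the test point
nbrs <- order(abs(x - x0))[1:3]                  # indices of the K = 3 nearest neighbors
mean(y_num[nbrs])                                # KNN regression: average of the neighbors' responses
table(y_cls[nbrs]) / 3                           # KNN classifier: estimated class probabilities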
3,
a) iii is right
b) 50 + 20(4.0) + 0.07(110) + 35(1) + 0.01(4.0)(110) − 10(4.0)(1) = 137.1, i.e., a predicted starting salary of about $137,100.
c) False. The magnitude of a coefficient by itself says little about the evidence for an interaction effect; we should look at the p-value (or t-statistic) of the interaction term.
4,
a) The training RSS of the cubic regression will be lower (or at least no higher), since its extra flexibility lets it fit the training data, including the noise, more closely.
b) The opposite holds for the test RSS: the linear regression will have the lower test RSS, since the true relationship is linear and the cubic model overfits.
c) The cubic regression will again have the lower training RSS, because of its higher flexibility, regardless of the true relationship.
d) There is not enough information to tell for the test RSS; it depends on how far from linear the true relationship is.
6,
Since b0 = ave(y) − b1 * ave(x), plugging x = ave(x) into the fitted line gives
f(ave(x)) = b0 + b1 * ave(x) = ave(y) − b1 * ave(x) + b1 * ave(x) = ave(y),
so the least squares line always passes through the point (ave(x), ave(y)).
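A quick numerical check with simulated data (my own example):
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)
predict(fit, data.frame(x = mean(x)))   # the fitted value at ave(x) ...
mean(y)                                 # ... equals ave(y)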
7,
a)
library(ISLR)   # for the Auto data set
attach(Auto)
lm.fit = lm(mpg ~ horsepower)
summary(lm.fit)
i.
Since the p-value for the horsepower coefficient is essentially zero (highly significant), we conclude that there is a relationship between the predictor and the response.
ii.
R^2 is 0.6059, which indicates that about 60.59% of the variability in mpg can be explained by horsepower.
iii.
negative
iv.
predict(lm.fit, data.frame(horsepower=98), interval="confidence")
fit lwr upr
1 24.46708 23.97308 24.96108
predict(lm.fit, data.frame(horsepower=98), interval="prediction")
fit lwr upr
1 24.46708 14.8094 34.12476
(As expected, the prediction interval is wider than the confidence interval.)
b)
plot(horsepower,mpg)
abline(lm.fit)
10,
c) Sales = 13.0434689 + (−0.0544588) × Price + (−0.0219162) × Urban + 1.2005727 × US + ε,
with Urban = 1 if the store is in an urban location and 0 if not, and US = 1 if the store is in the US and 0 if not.
d) Price and US
f) The model explains about 23.93% of the variance in Sales.
g)
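(The post does not show how fit2 was created; presumably it is the reduced model from part (e), something like:)
library(ISLR)                                   # for the Carseats data set
fit2 <- lm(Sales ~ Price + US, data = Carseats)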
confint(fit2)
2.5 % 97.5 %
(Intercept) 11.79032020 14.27126531
Price -0.06475984 -0.04419543
USYes 0.69151957 1.70776632
11,
c) We obtain the same value for the t-statistic and consequently the same value for the corresponding p-value. Both results in (a) and (b) reflect the same line created in (a). In other words, y = 2x + ε could also be written x = 0.5(y − ε).
15,
c)
# collect the slope coefficient from each of the simple regressions fitted in part (a)
simple.reg <- vector("numeric", 0)
simple.reg <- c(simple.reg, fit.zn$coefficient[2])
simple.reg <- c(simple.reg, fit.indus$coefficient[2])
simple.reg <- c(simple.reg, fit.chas$coefficient[2])
simple.reg <- c(simple.reg, fit.nox$coefficient[2])
simple.reg <- c(simple.reg, fit.rm$coefficient[2])
simple.reg <- c(simple.reg, fit.age$coefficient[2])
simple.reg <- c(simple.reg, fit.dis$coefficient[2])
simple.reg <- c(simple.reg, fit.rad$coefficient[2])
simple.reg <- c(simple.reg, fit.tax$coefficient[2])
simple.reg <- c(simple.reg, fit.ptratio$coefficient[2])
simple.reg <- c(simple.reg, fit.black$coefficient[2])
simple.reg <- c(simple.reg, fit.lstat$coefficient[2])
simple.reg <- c(simple.reg, fit.medv$coefficient[2])
# collect the multiple regression coefficients from part (b), dropping the intercept
mult.reg <- vector("numeric", 0)
mult.reg <- c(mult.reg, fit.all$coefficients)
mult.reg <- mult.reg[-1]
# plot each predictor's simple regression slope against its multiple regression coefficient
plot(simple.reg, mult.reg, col = "red")
Chapter 4
4,
a)
for x in [0.05, 0.95], we use the observations in [x − 0.05, x + 0.05], i.e., 10% of them;
for x < 0.05, we use [0, x + 0.05], i.e., (100x + 5)% of them;
for x > 0.95, we use [x − 0.05, 1], i.e., (105 − 100x)% of them;
so the average fraction of the available observations used to make the prediction is
10% × 0.9 + 7.5% × 0.05 + 7.5% × 0.05 = 9% + 0.375% + 0.375% = 9.75%
(7.5% is the average of (100x + 5)% over [0, 0.05], and likewise of (105 − 100x)% over [0.95, 1]).
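A quick simulation check of the 9.75% figure (my own sketch, not part of the original answer):
set.seed(1)
x <- runif(1e6)                                # uniformly distributed test points
mean(pmin(x + 0.05, 1) - pmax(x - 0.05, 0))    # average fraction of [0, 1] within 0.05 of x; ≈ 0.0975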
b)
9.75% × 9.75% = 0.0975^2 ≈ 0.95% (0.950625%)
c)
9.75%^100 ≈ 0%
d)
lim (p → ∞) (9.75%)^p = 0, so for large p essentially none of the training observations are near any given test observation.
e)
The side length l of the hypercube must satisfy l^p = 0.1, so l = 0.1^(1/p):
p = 1: l = 0.1;
p = 2: l = 0.1^(1/2) ≈ 0.316;
p = 100: l = 0.1^(1/100) ≈ 0.977, i.e., the "neighborhood" spans almost the entire range of every feature.
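The side lengths can be computed directly as a quick check:
p <- c(1, 2, 100)
round(0.1^(1 / p), 3)   # 0.100 0.316 0.977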
5,
a) QDA, LDA.
b) QDA, QDA.
c) It depends on the true boundary: if it is linear, LDA may still perform better. In general, though, with a large training set QDA's higher variance matters less, so we expect QDA's test performance relative to LDA to improve as the sample size grows.
d) False. If the Bayes decision boundary is linear, QDA's extra flexibility only adds variance without reducing bias, so LDA will generally achieve a test error rate at least as good as QDA's, especially with a small training set.
6,
a)
P = exp(-6 + 0.05 * 40 + 1 * 3.5) / (exp(-6 + 0.05 * 40 + 1 * 3.5) + 1) = 0.3775.
b)
-6 + 0.05 * x + 3.5 * 1 = 0 --> x = 50
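Both parts can be checked directly in R:
exp(-6 + 0.05 * 40 + 1 * 3.5) / (1 + exp(-6 + 0.05 * 40 + 1 * 3.5))   # part (a): 0.3775
(6 - 3.5) / 0.05                                                      # part (b): 50 hours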
7,
Using Bayes' theorem with prior probabilities 0.8 (dividend) and 0.2 (no dividend), and X ~ N(10, 36) or N(0, 36) respectively (the common normal constant cancels):
P(dividend | X = 4) = 0.8 * exp(-(4 - 10)^2 / 72) / (0.8 * exp(-(4 - 10)^2 / 72) + 0.2 * exp(-(4 - 0)^2 / 72)) ≈ 0.752
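The same calculation in R (the shared normal constant 1/sqrt(2*pi*36) cancels, so it is omitted):
num <- 0.8 * exp(-(4 - 10)^2 / (2 * 36))        # prior * density for the "dividend" class at X = 4
den <- num + 0.2 * exp(-(4 - 0)^2 / (2 * 36))   # plus prior * density for the "no dividend" class
num / den                                       # ≈ 0.752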
8,
KNN with K = 1 has a 0% training error rate, so if its average of training and test error rates is 18%, its test error rate must be 2 × 18% − 0% = 36%. Logistic regression has a 30% test error rate, and test error is what matters for new observations, so we should prefer logistic regression.
9,
a)
p(x) / (1 - p(x)) = 0.37 --> p(x) = 0.37 / 1.37 ≈ 0.27
b)
odds = 0.16 / (1 - 0.16) ≈ 0.19
10,
b) Yes, Lag2, since its p-value is less than 0.05.
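(The logistic regression object lm.fit used below is not shown in the post; it is presumably the full model from part (b), something like:)
library(ISLR)                                    # for the Weekly data set
attach(Weekly)
lm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
              data = Weekly, family = binomial)
summary(lm.fit)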
c)
pred = predict(lm.fit, type="response")
lm.pred = rep("Down", length(pred))
lm.pred[pred>0.5] = "Up"
table(lm.pred, Direction)
We may conclude that the percentage of correct predictions on the training data is (54+557)/1089, which is equal to 56.1065197%. In other words, 43.8934803% is the training error rate, which is often overly optimistic. We could also say that for weeks when the market goes up, the model is right 92.0661157% of the time (557/(48+557)). For weeks when the market goes down, the model is right only 11.1570248% of the time (54/(54+430)).
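For reference, the confusion matrix implied by the counts quoted above (the output of table(lm.pred, Direction)) would be:
        Direction
lm.pred  Down  Up
   Down    54  48
   Up     430 557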
Chapter 5
The solutions to Chapter 5 were lost due to my own mistake on CNBlog. The Chapter 5 exercises are not complicated, so I will not rewrite them.