机器学习应用建议(一)
决定下一步做什么
这里我们以房价预测为例,假设我们需要使用线性回归模型对房价进行预测,其代价函数J(θ)如下图所示。当我们将已经训练好的模型用来预测房价时,我们发现有较大的误差,那么我们下一步应该怎么做?
我们可能会想到以下几种方法:
- 获取更多的样本;
- 尝试减少特征变量的数量;
- 尝试获取更多的特征变量;
- 尝试增加多项式特征;
- 尝试减小正则化参数λ的值;
- 尝试增大正则化参数λ的值;
......
这些方法可能有用也可能没用,我们更不应该在实际应用中随机选择上述方法。还有一点要说明的是:上述方法中的任意一个方法,在具体实践过程中都可能转变为一个为期半年甚至时间更长的项目。因此,我们需要引入机器学习诊断法来帮助我们决定下一步该做什么才是有效的方法。
Question:
Which of the following statements about diagnostics are true? Check all that apply.
A. It's hard to tell what will work to improve a learning algorithm, so the best approach is to with gut feeling and just see what works.
B. Diagnostics can give guidance as to what might be more fruitful things to try to improve a learning algorithm.
C. Diagnostics can be time-consuming to implement and try, but they can still be a very good use of your time.
D. A diagnostics can sometimes rule out certain courses of action (changes to your learning algorithm) as being unlikely to improve its performance significantly.
我们不难选出B,C和D这三个正确答案。
假设评估
对于我们之前所提及的欠拟合问题和过拟合问题,我们是通过画图的方法来检验的。如若训练集中有较多的特征变量时,我们就无法将函数图给呈现出来。
对此,我们可以将数据集分为训练集和测试集两个部分,其中训练集占数据集的70%,测试集占数据集的30%。首先,我们利用训练集将代价函数J(θ)最小化,然后利用测试集测试模型误差,从而出模型是否出现欠拟合或过拟合问题。
注:如果数据集有一定规律,则要从数据集中随机选取70%的样本作为训练集,30%的样本作为测试集。
线性回归模型
- 利用训练集将代价函数J(θ)最小化,得到此时参数θ的值
- 利用测试集计算误差:
逻辑回归模型
- 利用训练集将代价函数J(θ)最小化,得到此时参数θ的值
- 利用测试集计算误差:
除此之外,对于逻辑回归模型我们还能计算误分类率来帮助我们理解逻辑回归模型的误差。
误分类率因此,我们可将误差测试的表达式改写为:
Question:
Suppose an implementation of linear regression (without regularization) is badly overfitting the training set. In this case, we would expect:
A. The training error J(θ) to be low and the test error Jtest(θ) to be high
B. The training error J(θ) to be low and the test error Jtest(θ) to be low
C. The training error J(θ) to be high and the test error Jtest(θ) to be low
D. The training error J(θ) to be high and the test error Jtest(θ) to be high
我们不难选出A这个正确答案。
补充笔记
Evaluating a Hypothesis
Once we have done some trouble shooting for errors in our predictions by:
- Getting more training examples
- Trying smaller sets of features
- Trying additional features
- Trying polynomial features
- Increasing or decreasing λ
We can move on to evaluate our new hypothesis.
A hypothesis may have a low error for the training examples but still be inaccurate (because of overfitting). Thus, to evaluate a hypothesis, given a dataset of training examples, we can split up the data into two sets: a training set and a test set. Typically, the training set consists of 70 % of your data and the test set is the remaining 30 %.
The new procedure using these two sets is then:
- Learn Θ and minimize Jtrain(Θ) using the training set
- Compute the test set error Jtest(Θ)
The test set error
- For linear regression:
- For classification ~ Misclassification error (aka 0/1 misclassification error):
This gives us a binary 0 or 1 error result based on a misclassification. The average test error for the test set is:
This gives us the proportion of the test data that was misclassified.
模型选择以及训练集、交叉验证集和测试集的划分
假设我们要在以下的多项式模型中选择一个合适的模型:
- (d = 1) hθ(x) = θ0 + θ1x
- (d = 2) hθ(x) = θ0 + θ1x + θ2x2
- (d = 3) hθ(x) = θ0 + θ1x + θ2x2 + θ3x3
...... - (d = 10) hθ(x) = θ0 + θ1x + θ2x2 + θ3x3 + ... + θ10x10
其中,参数d表示多项式的次数。对于这种情况,我们使用将数据集划分为训练集和测试集的方法。
上图中,我们假设d=5时其测试误差最小。但这时我们只是找到了一个对于测试集非常拟合的模型,我们无法判断其实际泛化误差是否完美。
因此,我们不能再将数据集只分为两部分,训练集和测试集。对此,我们引入交叉验证集(Validation Set)。我们将数据集分为三个部分,训练集(60%)、交叉验证集(20%)和测试集(20%)。
对于上例,我们需要计算三部分数据集的误差。
最终,我们可得到d=4时测试误差最小。
Question:
Consider the model selection procedure where we choose the degree of polynomial using a cross validation set. For the final model (with parameters θ), we might generally expect JCV(θ) To be lower than Jtest(θ) because:
A. An extra parameter (d, the degree of the polynomial) has been fit to the cross validation set.
B. An extra parameter (d, the degree of the polynomial) has been fit to the test set.
C. The cross validation set is usually smaller than the test set.
D. The cross validation set is usually larger than the test set.
终上所述,我们不难选出A这一正确答案。
补充笔记
Model Selection and Train/Validation/Test Sets
Just because a learning algorithm fits a training set well, that does not mean it is a good hypothesis. It could over fit and as a result your predictions on the test set would be poor. The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than the error on any other data set.
Given many models with different polynomial degrees, we can use a systematic approach to identify the 'best' function. In order to choose the model of your hypothesis, you can test each degree of polynomial and look at the error result.
One way to break down our dataset into the three sets is:
- Training set: 60%
- Cross validation set: 20%
- Test set: 20%
We can now calculate three separate error values for the three different sets using the following method:
- Optimize the parameters in Θ using the training set for each polynomial degree.
- Find the polynomial degree d with the least error using the cross validation set.
- Estimate the generalization error using the test set with Jtest(Θ(d)), (d = theta from polynomial with lower error);
This way, the degree of the polynomial d has not been trained using the test set.