[ML] 6 Advice for Applying Machi

2018-11-11 反复练习的阿离很笨吧

Evaluating a learning algorithm

machine learning diagnostics

When a trained model shows a large error on test samples, how should the algorithm be improved?

Evaluating a Hypothesis

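How can we quantify how good a learned hypothesis function is?
Introduce a training set and a test set.
Evaluating on the training set is inaccurate. The hypothesis should be run on a test set it has not seen during learning, which gives the generalization error.
error analysis: The test set error $J_{test}(\Theta)$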
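
A minimal sketch of computing $J_{test}(\Theta)$ on held-out data. It assumes linear regression with a squared-error cost, synthetic data, and a 70/30 split; none of these are fixed by the notes above.

```python
import numpy as np

def squared_error_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((X @ theta - y)^2), with no regularization term."""
    m = len(y)
    residual = X @ theta - y
    return float(residual @ residual) / (2 * m)

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(-1, 1, size=m)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=m)  # synthetic data (assumption)
X = np.column_stack([np.ones(m), x])               # add the intercept column

# Shuffle, then hold out 30% as a test set (split ratio is an assumption).
idx = rng.permutation(m)
train_idx, test_idx = idx[:70], idx[70:]

# Fit theta on the training set only (ordinary least squares).
theta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

j_train = squared_error_cost(theta, X[train_idx], y[train_idx])
j_test = squared_error_cost(theta, X[test_idx], y[test_idx])
print(f"J_train = {j_train:.4f}, J_test = {j_test:.4f}")
```

The point is that `theta` never sees the rows in `test_idx`, so `j_test` is an estimate of the generalization error rather than a measure of how well the training data was memorized.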

Model Selection and Train/Validation/Test Sets

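Suppose we want to determine which polynomial degree best fits a given data set, or how to choose the right features for a learning algorithm, or which value of the regularization parameter λ to use. These are called model selection problems.
Because a selection step still follows training, and the final result is only obtained after that selection, the data set has to be split into three parts when handling model selection problems.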
One way to break down our dataset into the three sets is:
Training set: 60%
Cross validation set: 20% (newly added, used for model selection)
Test set: 20%

We can now calculate three separate error values for the three different sets using the following method:

Optimize the parameters in Θ using the training set for each polynomial degree.
Find the polynomial degree d with the least error using the cross validation set.
Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$, where $\Theta^{(d)}$ is the parameter vector learned for the polynomial degree d with the lowest cross validation error.
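
The three steps above can be sketched roughly as follows. The synthetic data, the degree range 1 to 8, and the use of `np.polyfit` as the training step are assumptions made for illustration.

```python
import numpy as np

def cost(coeffs, x, y):
    """Unregularized squared-error cost 1/(2m) * sum((h(x) - y)^2)."""
    residual = np.polyval(coeffs, x) - y
    return float(residual @ residual) / (2 * len(y))

rng = np.random.default_rng(0)
m = 200
x = rng.uniform(-1, 1, size=m)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.1, size=m)  # synthetic (assumption)

# 60% / 20% / 20% split into training, cross validation, and test sets.
idx = rng.permutation(m)
train, cv, test = idx[:120], idx[120:160], idx[160:]

# Step 1: fit the parameters on the training set for each polynomial degree.
degrees = range(1, 9)
fits = {d: np.polyfit(x[train], y[train], d) for d in degrees}

# Step 2: pick the degree with the lowest cross validation error.
cv_errors = {d: cost(fits[d], x[cv], y[cv]) for d in degrees}
best_d = min(cv_errors, key=cv_errors.get)

# Step 3: estimate the generalization error on the untouched test set.
j_test = cost(fits[best_d], x[test], y[test])
print(f"best degree d = {best_d}, J_test = {j_test:.4f}")
```

Because `best_d` was chosen using the cross validation set, the cross validation error is an optimistic estimate for that degree; the test set is kept aside so that `j_test` stays unbiased by the selection.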

Bias vs. Variance

When an algorithm predicts poorly, there are usually two causes, underfitting and overfitting, which correspond to bias vs. variance.
Comparing $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ tells us whether the algorithm is overfitting or underfitting:
High bias (underfitting): both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ will be high. Also, $J_{CV}(\Theta) \approx J_{train}(\Theta)$
High variance (overfitting): $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be much greater than $J_{train}(\Theta)$
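
The two cases above can be turned into a rough rule of thumb. This is only a sketch: the function name `diagnose` and the thresholds `high_error` and `gap_ratio` are hypothetical and depend entirely on the problem and the scale of the error.

```python
def diagnose(j_train, j_cv, high_error=0.5, gap_ratio=2.0):
    """Rough bias/variance diagnosis from training and CV error.

    high_error and gap_ratio are hypothetical thresholds chosen for
    illustration; sensible values depend on the problem and error scale.
    """
    # Both errors high: the model fits neither set well.
    if j_train >= high_error and j_cv >= high_error:
        return "high bias (underfitting)"
    # Training error low, CV error much greater than training error.
    if j_train < high_error and j_cv > gap_ratio * j_train:
        return "high variance (overfitting)"
    return "no clear problem"

print(diagnose(0.9, 1.0))
print(diagnose(0.05, 0.8))
print(diagnose(0.05, 0.06))
```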

Regularization and Bias/Variance

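Regularization can effectively prevent overfitting, but how does it relate to the bias and variance of the algorithm?
The effect of the regularization parameter $\lambda$ on overfitting and underfitting: a large $\lambda$ pushes the model toward underfitting (high bias), while a small $\lambda$ leaves it prone to overfitting (high variance).
Today I finally understood regularization!
Why is the regularization term left out for cv and test?
Because regularization is used in the training stage to reduce overfitting, whereas in the test stage computing $J_{test}(\Theta)$ is simply computing J; unlike training, there is no need to find the minimum of J.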
quick and dirty
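
A minimal sketch of the point above: $\lambda$ appears only when fitting, while $J_{train}$ and $J_{CV}$ are evaluated without the regularization term. It assumes regularized linear regression on degree-8 polynomial features, solved in closed form; the data and the candidate `lambdas` are made up for illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Minimize 1/(2m)*sum((X@theta - y)^2) + lam/(2m)*sum(theta[1:]^2)."""
    reg = lam * np.eye(X.shape[1])
    reg[0, 0] = 0.0  # do not penalize the intercept
    return np.linalg.solve(X.T @ X + reg, X.T @ y)

def plain_cost(theta, X, y):
    """Evaluation cost: squared error only, no regularization term."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=60)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=60)  # synthetic data (assumption)
X = poly_features(x, degree=8)
X_train, y_train = X[:30], y[:30]
X_cv, y_cv = X[30:], y[30:]

lambdas = [0.0, 0.001, 0.01, 0.1, 1.0, 10.0]  # candidate values (assumption)
results = {}
for lam in lambdas:
    theta = fit_regularized(X_train, y_train, lam)  # lambda is used here only
    results[lam] = (plain_cost(theta, X_train, y_train),
                    plain_cost(theta, X_cv, y_cv))
    print(f"lambda={lam:<6} J_train={results[lam][0]:.4f} J_cv={results[lam][1]:.4f}")

# Choose lambda by the unregularized cross validation error.
best_lam = min(results, key=lambda lam: results[lam][1])
print("best lambda:", best_lam)
```

Evaluating with `plain_cost` keeps the errors for different values of $\lambda$ comparable: if the penalty were included, a larger $\lambda$ would inflate the reported error regardless of how well the model predicts.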

Learning Curves

Summary

Machine Learning System Design

Building a Spam Classifier

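Evaluate on the cross validation set with a chosen error metric (classification accuracy)
Manually inspect the misclassified examples
Work out which category each one belongs to
Then optimize accordingly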
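
The manual error analysis above can be supported by a small tally like the following. The `cv_results` records and the category names are hypothetical; in practice the category is assigned by hand while reading each misclassified email.

```python
from collections import Counter

# Hypothetical cross validation results: (true label, predicted label, category).
# Label 1 means spam, 0 means not spam.
cv_results = [
    (1, 1, "pharma"),
    (1, 0, "phishing"),
    (1, 0, "phishing"),
    (0, 0, "personal"),
    (1, 0, "replica"),
    (0, 1, "newsletter"),
    (1, 0, "phishing"),
    (0, 0, "personal"),
]

# Keep only the examples the classifier got wrong.
errors = [(y, p, cat) for y, p, cat in cv_results if y != p]
accuracy = 1 - len(errors) / len(cv_results)

# Count the mistakes per hand-assigned category.
by_category = Counter(cat for _, _, cat in errors)

print(f"CV accuracy: {accuracy:.3f}")
for cat, n in by_category.most_common():
    print(f"{cat}: {n} misclassified")
```

The category with the most mistakes is a reasonable place to spend effort first, for example by designing features aimed at that kind of email.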

Handling Skewed Data

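When the classes are skewed,
classification accuracy stops being useful:
it does not really measure how well the algorithm performs.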
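
A small sketch of why accuracy misleads on skewed classes. The data set (5 positives out of 1000) and the always-negative "classifier" are made up for illustration.

```python
# Hypothetical skewed data set: 1000 examples, only 5 positives (0.5%).
y_true = [1] * 5 + [0] * 995

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Fraction of the actual positives that were detected (commonly called recall).
positives = sum(y_true)
found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = found / positives

print(f"accuracy = {accuracy:.3f}")
print(f"positives found = {found} of {positives}")
```

The accuracy here is 99.5% even though not a single positive example is detected, so a high accuracy on its own says little when one class is rare.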

Using Large Data Sets

https://blog.csdn.net/sundy0808/article/details/78919646