Kaggle Competition: Instant Gratification

2019-06-02

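This competition is a binary classification problem on tabular data. For the exact meaning of the task, see the problem description on the official Kaggle site.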

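Points to note in this competition:

Split the dataset into 512 sub-datasets by the column wheezy-copper-turtle-magic;
The data follow a Gaussian distribution, so remember to use Quadratic Discriminant Analysis (QDA);
The third point is the usual routine: do CV and stacking. (It turns out that QDA fitted on the same dataset always gives the same result, so one option is to resample the dataset and then use QDA for bagging.)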

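I reasoned this way because there is no ready-made tool for this apart from what sklearn offers out of the box, so I had to write my own class.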
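The resample-then-bag idea can be sketched as follows. This is only a minimal sketch on synthetic Gaussian data, not the competition code: the toy data, `n_estimators`, `max_samples`, and `reg_param=0.1` are assumptions chosen for illustration, and sklearn's `BaggingClassifier` stands in for a hand-written class.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier

rng = np.random.RandomState(0)
# Synthetic stand-in for one subset: two Gaussian classes with different spread
X0 = rng.normal(0.0, 1.0, size=(200, 5))
X1 = rng.normal(1.0, 2.0, size=(200, 5))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# QDA is deterministic: refitting on identical data gives identical predictions
p_a = QuadraticDiscriminantAnalysis(reg_param=0.1).fit(X, y).predict_proba(X)[:, 1]
p_b = QuadraticDiscriminantAnalysis(reg_param=0.1).fit(X, y).predict_proba(X)[:, 1]
same = np.allclose(p_a, p_b)

# Bagging: each QDA member is fitted on a different bootstrap sample of the rows
bag = BaggingClassifier(
    QuadraticDiscriminantAnalysis(reg_param=0.1),
    n_estimators=10,
    max_samples=0.8,
    random_state=0,
)
bag.fit(X, y)
p_bag = bag.predict_proba(X)[:, 1]  # averaged probability of class 1
```

Because every member sees a different sample, the averaged prediction is no longer identical to that of a single QDA fit, which is the point of resampling before bagging.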

This post draws on the excellent work of Chris Deotte.
Below I briefly explain how it works.

1 Splitting the dataset

The split is done on the wheezy-copper-turtle-magic field. It is straightforward, so I will not go into detail and only give the code:

for i in tqdm_notebook(range(512)):
    train2 = train[train['wheezy-copper-turtle-magic']==i]
    test2 = test[test['wheezy-copper-turtle-magic']==i]

We already know that wheezy-copper-turtle-magic takes values from 0 to 511, so the loop simply iterates over that range.
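The same split can also be sketched with `groupby`. This is an illustrative alternative on a toy frame, not the code from the original kernel: the frame has only 4 magic values instead of 512, and the `feature` column and the `subsets` dictionary are hypothetical names.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
magic = 'wheezy-copper-turtle-magic'
# Toy frame with 4 magic values instead of 512 (assumption for brevity)
train = pd.DataFrame({
    magic: rng.randint(0, 4, size=40),
    'feature': rng.normal(size=40),
    'target': rng.randint(0, 2, size=40),
})

# groupby yields one (value, sub-frame) pair per distinct magic value
subsets = {i: part.reset_index(drop=True) for i, part in train.groupby(magic)}
total_rows = sum(len(part) for part in subsets.values())
```

Note that `groupby` only visits values that actually occur, whereas a `range(512)` loop yields an empty frame for an absent value. Resetting the index lets positional fold indices be used with `.loc` on each sub-frame.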

2 Quadratic Discriminant Analysis

For a detailed explanation of the theory, see the reference.

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

Before running QDA, features whose variance is too small need to be removed.

# data holds the train2 rows followed by the test2 rows of the current subset
data2 = StandardScaler().fit_transform(VarianceThreshold(threshold=1.5).fit_transform(data[cols]))
train4 = data2[:train2.shape[0]]
test4 = data2[train2.shape[0]:]

The threshold here is set to 1.5. It has little influence on the result, which stays stable at about 96% accuracy.
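To make the effect of the threshold concrete, here is a minimal sketch on synthetic data. The data are an assumption: 3 informative columns with large variance and 7 noise columns with variance close to 1, so a threshold of 1.5 should keep only the informative ones. The `reg_param` value is borrowed from the QDA call in the stacking code of this post.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 300
y = rng.randint(0, 2, size=n)
# 3 informative columns with large spread, 7 noise columns with variance near 1
informative = rng.normal(0.0, 3.0, size=(n, 3)) + 2.0 * y[:, None]
noise = rng.normal(0.0, 1.0, size=(n, 7))
X = np.hstack([informative, noise])

selector = VarianceThreshold(threshold=1.5)
X_sel = selector.fit_transform(X)                 # drops columns with variance at or below 1.5
X_scaled = StandardScaler().fit_transform(X_sel)  # zero mean, unit variance per column

clf = QuadraticDiscriminantAnalysis(reg_param=0.111).fit(X_scaled, y)
kept = int(selector.get_support().sum())          # number of surviving columns
```

The selection has to happen before scaling: after `StandardScaler` every column has unit variance, so a variance threshold would no longer separate anything.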

3 CV and Stacking

With the two steps above, a fairly high ranking is basically assured. To really push up the leaderboard, though, CV and stacking are needed.

First, CV.

3.1 CV

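CV is done with StratifiedKFold. The question is which n_splits value to choose: if it is too large, the model tends to overfit and runs very slowly; if it is too small, it tends to underfit. Two results are produced here, because two scripts can be submitted.

The first script uses n_splits = 10.

The second script uses n_splits = 20. (20 is generally considered better than 10, but how it will turn out on the private LB is unknown.)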
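The two settings can be compared with a small out-of-fold loop. This is a sketch under assumptions: the data are synthetic, `run_cv` is a hypothetical helper, and AUC is used as an illustrative score. It mirrors the `oof_*` and `pred_te_*` pattern of the stacking code in this post.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(0)
y = np.array([0] * 150 + [1] * 150)
# Class 1 has twice the spread of class 0, which QDA can pick up
X = rng.normal(0.0, 1.0, size=(300, 4)) * (1.0 + y[:, None])
X_test = rng.normal(0.0, 1.0, size=(50, 4))

def run_cv(n_splits):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    oof = np.zeros(len(X))            # out-of-fold predictions for the training rows
    pred_te = np.zeros(len(X_test))   # test predictions averaged over the folds
    for train_index, test_index in skf.split(X, y):
        clf = QuadraticDiscriminantAnalysis(reg_param=0.111)
        clf.fit(X[train_index], y[train_index])
        oof[test_index] = clf.predict_proba(X[test_index])[:, 1]
        pred_te += clf.predict_proba(X_test)[:, 1] / skf.n_splits
    return oof, pred_te

oof_10, pred_10 = run_cv(10)
oof_20, pred_20 = run_cv(20)
auc_10 = roc_auc_score(y, oof_10)
auc_20 = roc_auc_score(y, oof_20)
```

With more folds each model trains on a larger share of the rows but the loop fits more models, which is where the extra running time comes from.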

3.2 Stacking

The stacking models are:

QDA
NuSVC
SVC
LR
KNN
MLP

QDA uses variance-based feature selection; the other models use PCA for dimensionality reduction.
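A minimal sketch of the PCA branch, on synthetic data. The number of components, the scaling step after PCA, and the toy data are all assumptions; only the KNN hyperparameters are taken from the code in this post.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
y = np.array([0] * 100 + [1] * 100)
X = rng.normal(0.0, 1.0, size=(200, 20))
# Make the first 5 columns high-variance and class-dependent
X[:, :5] = X[:, :5] * 3.0 + 2.0 * y[:, None]

# Keep 5 principal components (assumed value), then standardize them
X_pca = PCA(n_components=5, random_state=0).fit_transform(X)
X_pca = StandardScaler().fit_transform(X_pca)

clf = KNeighborsClassifier(n_neighbors=17, p=2.9)
clf.fit(X_pca, y)
train_acc = clf.score(X_pca, y)  # training accuracy, only a sanity check
```

Unlike the variance threshold, PCA does not pick original columns; it projects the data onto the directions of largest variance, so the resulting features are linear combinations of the inputs.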

clf = NuSVC(probability=True, kernel='poly', degree=4, gamma='auto', random_state=4, nu=0.59, coef0=0.053)
clf.fit(train3[train_index,:],train2.loc[train_index]['target'])
oof_svnu[idx1[test_index]] = clf.predict_proba(train3[test_index,:])[:,1]
pred_te_svnu[idx2] += clf.predict_proba(test3)[:,1] / skf.n_splits
        
clf = neighbors.KNeighborsClassifier(n_neighbors=17, p=2.9)
clf.fit(train3[train_index,:],train2.loc[train_index]['target'])
oof_knn[idx1[test_index]] = clf.predict_proba(train3[test_index,:])[:,1]
pred_te_knn[idx2] += clf.predict_proba(test3)[:,1] / skf.n_splits
        
clf = linear_model.LogisticRegression(solver='saga',penalty='l1',C=0.1)
clf.fit(train3[train_index,:],train2.loc[train_index]['target'])
oof_lr[idx1[test_index]] = clf.predict_proba(train3[test_index,:])[:,1]
pred_te_lr[idx2] += clf.predict_proba(test3)[:,1] / skf.n_splits
        
clf = neural_network.MLPClassifier(random_state=3,  activation='relu', solver='lbfgs', tol=1e-06, hidden_layer_sizes=(250, ))
clf.fit(train3[train_index,:],train2.loc[train_index]['target'])
oof_mlp[idx1[test_index]] = clf.predict_proba(train3[test_index,:])[:,1]
pred_te_mlp[idx2] += clf.predict_proba(test3)[:,1] / skf.n_splits
        
clf = svm.SVC(probability=True, kernel='poly', degree=4, gamma='auto', random_state=42)
clf.fit(train3[train_index,:],train2.loc[train_index]['target'])
oof_svc[idx1[test_index]] = clf.predict_proba(train3[test_index,:])[:,1]
pred_te_svc[idx2] += clf.predict_proba(test3)[:,1] / skf.n_splits
        
clf = QuadraticDiscriminantAnalysis(reg_param=0.111)
clf.fit(train4[train_index,:],train2.loc[train_index]['target'])
oof_qda[idx1[test_index]] = clf.predict_proba(train4[test_index,:])[:,1]
pred_te_qda[idx2] += clf.predict_proba(test4)[:,1] / skf.n_splits

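The code above trains six classifiers independently. Their results are then combined according to each classifier's classification ability.

Finally, an SVM classifier performs the final stacking.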
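The final step can be sketched as a second-level SVM trained on the base models' out-of-fold predictions. This is a sketch under stated assumptions, not the original stacking code: the base predictions are simulated by the hypothetical helper `fake_preds` (noisy copies of the label), and the rbf kernel and 5 folds are illustrative choices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n_train, n_test = 300, 60
y = rng.randint(0, 2, size=n_train)
y_test = rng.randint(0, 2, size=n_test)

# Hypothetical stand-in for base-model predictions: noisy copies of the label.
# Real ones would be arrays such as oof_qda / pred_te_qda from the CV loop.
def fake_preds(labels, noise):
    return np.clip(labels + rng.normal(0.0, noise, size=len(labels)), 0.0, 1.0)

oof = np.column_stack([fake_preds(y, s) for s in (0.4, 0.5, 0.6)])
pred_te = np.column_stack([fake_preds(y_test, s) for s in (0.4, 0.5, 0.6)])

# Second-level SVM on the stacked predictions, itself cross-validated
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_stack = np.zeros(n_train)
pred_stack = np.zeros(n_test)
for train_index, valid_index in skf.split(oof, y):
    meta = SVC(probability=True, kernel='rbf', gamma='auto', random_state=42)
    meta.fit(oof[train_index], y[train_index])
    oof_stack[valid_index] = meta.predict_proba(oof[valid_index])[:, 1]
    pred_stack += meta.predict_proba(pred_te)[:, 1] / skf.n_splits

stack_auc = roc_auc_score(y, oof_stack)
```

Training the meta-model on out-of-fold predictions rather than in-sample predictions matters: in-sample predictions are overly optimistic and would make the second level trust the base models too much.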