Model Ensembling
Code: https://github.com/SimonLliu/DGB_AlphaTeam
In this competition, we started by looking at the model-ensembling utilities built into sklearn and mlxtend.classifier:
https://blog.csdn.net/LAW_130625/article/details/78573736
mlxtend:
sclf = StackingCVClassifier(classifiers=[xgb, rfc, etc], meta_classifier=lr, use_probas=True, n_folds=3, verbose=3)
sklearn:
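from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='soft')
params = {'lr__C': [1.0, 100.0], 'rf__n_estimators': [20, 200]}
grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid = grid.fit(iris.data, iris.target)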
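As the examples above show, the ensembling tools built into sklearn require the combined model to be trained again. Since our base models were already trained, this pushed us to develop two ensembling methods of our own that work directly on their outputs: probability summation and class voting.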
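To make the difference between the two methods concrete, here is a minimal sketch for a single sample. The probabilities are made up for illustration, and the names `probs`, `argmax`, `soft_choice` and `hard_choice` are hypothetical, not taken from the competition code. It shows that the two methods can disagree: one very confident model can outweigh two mildly confident ones under probability summation, but not under class voting.

```python
# Hypothetical class-probability outputs of three already-trained models
# for ONE sample with three classes. The numbers are made up.
probs = [
    [0.90, 0.10, 0.00],  # model A: very confident in class 0
    [0.40, 0.60, 0.00],  # model B: slightly prefers class 1
    [0.45, 0.55, 0.00],  # model C: slightly prefers class 1
]


def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Probability summation: add the probabilities per class, then take the argmax.
summed = [sum(col) for col in zip(*probs)]
soft_choice = argmax(summed)

# Class voting: each model casts one vote for its own most likely class.
votes = [argmax(p) for p in probs]
hard_choice = max(set(votes), key=votes.count)

print("probability summation picks class", soft_choice)
print("class voting picks class", hard_choice)
```

Neither method needs any further training; both only consume predictions that the individual models have already produced.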
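1. Probability summation
1) Load the model and predict probabilities
import joblib

# x_test (the test features) and df_test (the test DataFrame with an 'id' column) are prepared beforehand.
clf2 = joblib.load("lr(c40).pkl")
y_test = clf2.predict_proba(x_test)
# Store each sample's probability vector as a list in a single column.
df_test['proba'] = y_test.tolist()
df_result = df_test.loc[:, ['id', 'proba']]
df_result.to_csv('result_proba_lg.csv', index=False)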
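One detail of this step is easy to miss: a list written into a CSV cell comes back as a string, so it has to be parsed when the file is read again. The sketch below illustrates this with the standard library only. The rows are made up, an in-memory buffer stands in for the result file, and `ast.literal_eval` is used here as one safe way to parse the cell; all names are illustrative assumptions.

```python
import ast
import csv
import io

# Made-up rows standing in for the 'id' and 'proba' columns.
rows = [(1, [0.2, 0.8]), (2, [0.7, 0.3])]

buffer = io.StringIO()  # in-memory stand-in for the CSV file on disk
writer = csv.writer(buffer)
writer.writerow(["id", "proba"])
writer.writerows(rows)  # each list is written via str(), e.g. "[0.2, 0.8]"

buffer.seek(0)
loaded = list(csv.DictReader(buffer))

raw_cell = loaded[0]["proba"]             # a str, not a list
parsed_cell = ast.literal_eval(raw_cell)  # parsed back into a list of floats

print(type(raw_cell).__name__, raw_cell)
print(type(parsed_cell).__name__, parsed_cell)
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` when the file contents are not fully trusted.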
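2) Load the probabilities and add them up
import ast
import numpy as np
import pandas as pd

lg_df = pd.read_csv('result_proba_lg.csv')

def series2arr(series):
    # Each CSV cell holds a stringified list of probabilities; parse it back into an array.
    res = []
    for row in series:
        res.append(np.array(ast.literal_eval(row)))
    return np.array(res)

lg_prob_arr = series2arr(lg_df['proba'])
# svm_prob_arr, kn_prob_arr and nb_prob_arr are built the same way from the other models' saved probabilities.
final_prob = svm_prob_arr + lg_prob_arr + kn_prob_arr + nb_prob_arr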
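3) Predict the final classes
y_class = [np.argmax(row) for row in final_prob]
df_test['proba'] = y_class
# argmax returns a 0-based index, so add 1 to get the class label (this assumes the labels are 1, 2, 3, ...).
df_test['proba'] = df_test['proba'] + 1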
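The three steps can be condensed into a small vectorized sketch. It assumes NumPy is available, uses made-up probabilities for two hypothetical models on two samples, and assumes the class labels are the consecutive integers 1, 2, 3, which is what the `+1` shift implies.

```python
import numpy as np

# Made-up probability arrays (2 samples x 3 classes) for two hypothetical models.
lg_prob_arr = np.array([[0.2, 0.5, 0.3],
                        [0.6, 0.3, 0.1]])
svm_prob_arr = np.array([[0.1, 0.3, 0.6],
                         [0.5, 0.4, 0.1]])

# Element-wise sum of the per-class probabilities.
final_prob = lg_prob_arr + svm_prob_arr

# Pick the most likely class per sample; shift the 0-based index to a 1-based label.
y_class = np.argmax(final_prob, axis=1) + 1

print(y_class.tolist())
```

If the labels were not consecutive integers starting at 1, indexing a fitted sklearn classifier's `classes_` attribute with the argmax would be the more general way to map column positions back to labels.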
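2. Class voting
1) Combine the per-model prediction arrays column-wise
# res_l holds one prediction list per model (10 models here); a_l[i] collects all their votes for sample i.
a_l = []
for i in range(len(res_l[0])):
    a_l.append([res_l[j][i] for j in range(10)])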
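The same regrouping can be sketched with `zip`, which works for any number of models without hard-coding the count. The predictions below are made up, and three models stand in for the ten used in the competition.

```python
# Made-up class predictions from three hypothetical models on four samples.
res_l = [
    [1, 2, 3, 1],  # model 0
    [1, 2, 2, 1],  # model 1
    [2, 2, 3, 1],  # model 2
]

# zip(*res_l) walks the lists in parallel, yielding one tuple of votes per sample.
a_l = [list(sample_votes) for sample_votes in zip(*res_l)]

print(a_l)
```

Each inner list of `a_l` now holds every model's vote for one sample, which is the shape the voting step expects.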
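2) Vote
from collections import Counter

def voting(class_l):
    # max_idx is the index of the best-performing single model; it must be defined before calling.
    final_class = []
    c_l = []
    for row in class_l:
        c = Counter(row)
        c_v_set = set(c.values())
        # Vote counts differ: take the class with the most votes
        if len(c_v_set) > 1:
            res = max(c, key=c.get)
        else:  # Vote counts are all equal: take the best model's prediction
            res = row[max_idx]
        final_class.append(res)
        c_l.append(c)
    return final_class, c_l

final_class, c = voting(a_l)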
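Note that `voting` falls back to the best model only when every class received the same number of votes. A row whose top two classes are tied while a third class trails (for example 2-2-1) still goes through `max`, which returns whichever tied class appears first. A possible variant that also consults the best model in that case is sketched below. This is not the authors' code: `vote_with_tiebreak` and `best_idx` are hypothetical names, and the rule of preferring the best model only when its prediction is among the tied leaders is an assumption.

```python
from collections import Counter


def vote_with_tiebreak(row, best_idx):
    """Majority vote for one sample; ties for first place go to the best model.

    row      -- the predictions of all models for one sample
    best_idx -- position in `row` of the best single model (assumed known)
    """
    ranked = Counter(row).most_common()
    top_count = ranked[0][1]
    # All classes that share the highest vote count.
    leaders = [label for label, n in ranked if n == top_count]
    if len(leaders) == 1:
        return leaders[0]
    # Tie for first place: use the best model's prediction if it is a leader,
    # otherwise keep the first leader encountered.
    best_pred = row[best_idx]
    return best_pred if best_pred in leaders else leaders[0]


print(vote_with_tiebreak([1, 1, 2], best_idx=2))
print(vote_with_tiebreak([2, 1, 2, 1, 3], best_idx=1))
```

For rows with a clear majority or with all counts equal, this variant returns the same result as `voting`; it only differs on partial ties.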