
大师兄's Data Analysis Study Notes (17): Classification Models (3)

2022-07-15  superkmi

Previous: 大师兄's Data Analysis Study Notes (16): Classification Models (2)

III. Decision Trees

1. Information Gain

H(X) = -\sum_i p_i \log(p_i)

I(X,Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)
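To make the two formulas concrete, here is a minimal pandas/NumPy sketch (the toy feature/label arrays are made up for illustration) that computes the label entropy H(Y) and the information gain I(X,Y) = H(Y) - H(Y|X) of a candidate split:

import numpy as np
import pandas as pd

def entropy(labels):
    # H(X) = -sum(p_i * log2(p_i)) over the empirical label distribution
    p = pd.Series(labels).value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    # I(X,Y) = H(Y) - H(Y|X): reduction in label entropy after splitting on the feature
    frame = pd.DataFrame({"x": feature, "y": labels})
    h_y = entropy(frame["y"])
    # H(Y|X): label entropy within each feature value, weighted by group size
    h_y_given_x = sum(len(g) / len(frame) * entropy(g["y"]) for _, g in frame.groupby("x"))
    return h_y - h_y_given_x

x = ["a", "a", "b", "b"]  # toy feature
y = [0, 0, 1, 1]          # toy label
print(information_gain(x, y))  # 1.0 bit: this split removes all label uncertainty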

2. Information Gain Ratio
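For reference, a standard definition (assumed here; it is the criterion used by C4.5) normalizes the information gain by the entropy of the splitting attribute:

GainRatio(X,Y) = \frac{I(X,Y)}{H(X)}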
3. Gini Coefficient
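Again a standard definition assumed here (the CART impurity measure), with p_i the proportion of class i at a node:

Gini = 1 - \sum_i{p_i^2}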
4. Problems with Decision Trees
4.1 Splitting on continuous values
4.2 Running out of rules
4.3 Overfitting
  • Pre-pruning: before/while the tree is built, constrain it in advance, e.g. require a minimum number of samples per leaf node or set a maximum tree depth (see the sketch after this list).
  • Post-pruning: after the tree has been built, prune branches whose sample counts are highly skewed.
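As an illustration of pre-pruning with scikit-learn (the parameter values below are illustrative assumptions, not taken from this note), the constraints are passed directly to DecisionTreeClassifier:

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: constrain the tree while it is being grown.
pruned_clf = DecisionTreeClassifier(
    criterion="entropy",   # split on information gain
    max_depth=4,           # cap on tree depth (illustrative value)
    min_samples_leaf=20,   # minimum samples per leaf node (illustrative value)
)
# scikit-learn also offers cost-complexity post-pruning via the ccp_alpha parameter.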
5. Code Implementation
import os
import pandas as pd
import numpy as np
import pydotplus
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz

models = []
models.append(("DecisionTreeGini", DecisionTreeClassifier()))  # Gini decision tree (default criterion)
models.append(("DecisionTreeEntropy", DecisionTreeClassifier(criterion="entropy")))  # information-gain decision tree

# HR employee attrition dataset; JobLevel is the single feature, JobSatisfaction the label
df = pd.read_csv(os.path.join(".", "data", "WA_Fn-UseC_-HR-Employee-Attrition.csv"))
# 60% train / 20% validation / 20% test
X_tt, X_validation, Y_tt, Y_validation = train_test_split(df.JobLevel, df.JobSatisfaction, test_size=0.2)
X_train, X_test, Y_train, Y_test = train_test_split(X_tt, Y_tt, test_size=0.25)

for clf_name, clf in models:
    # a single feature must be reshaped to a 2-D (n_samples, 1) array
    clf.fit(np.array(X_train).reshape(-1, 1), np.array(Y_train))
    xy_lst = [(X_train, Y_train), (X_validation, Y_validation), (X_test, Y_test)]
    for i, (X_part, Y_part) in enumerate(xy_lst):
        Y_pred = clf.predict(np.array(X_part).reshape(-1, 1))
        print(i)  # 0 = train, 1 = validation, 2 = test
        print(clf_name, "-ACC", accuracy_score(Y_part, Y_pred))
        print(clf_name, "-REC", recall_score(Y_part, Y_pred, average="macro"))
        print(clf_name, "-F1", f1_score(Y_part, Y_pred, average="macro"))
        print("=" * 40)
    # export the fitted tree to PDF, once per model
    dot_data = export_graphviz(clf, out_file=None, filled=True, rounded=True, special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_pdf(f"graph_{clf_name}.pdf")
[Figure: Gini decision tree]
[Figure: information-gain decision tree]
