Machine Learning (7): Decision Trees
1. Basic Principles
1.1 Ideas and Procedure
A decision tree is a common method for both classification and regression in machine learning; it is a discriminative model with a tree-shaped structure. A decision tree can be viewed as a mutually exclusive and complete set of if-then rules. It can also be read as the conditional probability distribution of the class given the features: such a distribution partitions the feature space into disjoint cells (or regions), and assigning a class distribution to each cell yields a conditional distribution. This distribution can be written as $P(Y \mid X)$, where $X$ is the random variable representing the features and $Y$ is the random variable representing the class.
Decision tree learning essentially induces a set of classification rules from the training set, rules that should conflict as little as possible with the training data while still generalizing well. The loss function is usually a regularized maximum-likelihood function, and learning aims to minimize this loss. The learning algorithm typically proceeds recursively: choose the best feature, split the training data on that feature so that each resulting subset is classified as well as possible, and repeat on the subsets. The basic procedure is shown below:
[Figure: basic procedure for generating a decision tree]
1.2 Feature Selection
Feature selection, the key step in decision tree learning, means choosing features that actually have discriminative power on the training data. The most common selection criteria are information gain and the information gain ratio.
The information gain is defined as $g(D, A) = H(D) - H(D \mid A)$, where $A$ is a feature, $D$ is the training set, $H(D)$ is the entropy of $D$, and $H(D \mid A)$ is the conditional entropy of $D$ given $A$.
The entropy of a random variable $X$ is $H(X) = -\sum_i p_i \log p_i$ with $p_i = P(X = x_i)$, and the conditional entropy $H(Y \mid X) = \sum_i p_i\, H(Y \mid X = x_i)$ is the expectation, over $X$, of the entropy of the conditional distribution of $Y$ given $X = x_i$.
The information gain ratio is the information gain divided by the entropy of the training set with respect to the values of feature $A$: $g_R(D, A) = \dfrac{g(D, A)}{H_A(D)}$, where $H_A(D) = -\sum_k \frac{|D_k|}{|D|} \log_2 \frac{|D_k|}{|D|}$ and $D_k$ is the subset of $D$ on which $A$ takes its $k$-th value.
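To make these definitions concrete, here is a small self-contained sketch (written for this note, not part of the original) that computes the entropy, conditional entropy, information gain, and gain ratio for the 'house' feature of the loan dataset used in Section 2.1; the expected values in the final comment are my own calculation:

from math import log2
from collections import Counter

# (house, label) pairs: the 6 house owners are all approved; of the 9 non-owners, 3 are approved.
samples = [(1, 'yes')] * 6 + [(0, 'yes')] * 3 + [(0, 'no')] * 6

def entropy(items):
    n = len(items)
    return -sum(c / n * log2(c / n) for c in Counter(items).values())

values = [x for x, _ in samples]       # values of feature A = 'house'
classes = [y for _, y in samples]      # class labels

H_D = entropy(classes)                                            # H(D)
H_D_A = sum(values.count(v) / len(samples)                        # H(D|A)
            * entropy([y for x, y in samples if x == v])
            for v in set(values))
H_A_D = entropy(values)                                           # H_A(D)
gain = H_D - H_D_A                                                # g(D, A)
gain_ratio = gain / H_A_D                                         # g_R(D, A)
print(round(H_D, 3), round(gain, 3), round(gain_ratio, 3))        # roughly 0.971 0.42 0.433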
1.3 Pruning
Pruning is the main tool decision trees have against overfitting. Concretely, pruning removes some subtrees or leaf nodes from an already grown tree and turns the corresponding root or parent node into a new leaf, which simplifies the model. Pruning comes in two flavors, pre-pruning and post-pruning. Pre-pruning evaluates each node before splitting it during tree growth; if the split does not improve the generalization performance of the tree, growth stops there and the node is marked as a leaf. Post-pruning first grows a complete tree from the training set and then examines the non-leaf nodes bottom-up; if replacing the subtree rooted at a node with a leaf improves generalization performance, the subtree is replaced by that leaf.
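As a rough illustration of both ideas in code (added here; the original text stops at the description), scikit-learn's DecisionTreeClassifier supports pre-pruning through constructor arguments such as max_depth and min_samples_leaf, and a form of post-pruning through cost-complexity pruning (ccp_alpha). The parameter values below are arbitrary choices made purely for demonstration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Pre-pruning: stop growing early by limiting depth and leaf size.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it back via cost-complexity pruning.
# ccp_alpha=0.01 is arbitrary; in practice it is chosen on a validation set.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post.fit(X_train, y_train)

print('pre-pruned  test accuracy :', pre.score(X_test, y_test))
print('post-pruned test accuracy :', post.score(X_test, y_test))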
2. Algorithm Implementation
2.1 Manual Implementation
1) Module imports and data generation
import pickle
import operator
import matplotlib.pyplot as plt
from math import log
def createDataSet():
    # Each row is [age, work, house, credit, label]; the label says whether the loan is granted.
    dataSet = [[0, 0, 0, 0, 'no'],
               [0, 0, 0, 1, 'no'],
               [0, 1, 0, 1, 'yes'],
               [0, 1, 1, 0, 'yes'],
               [0, 0, 0, 0, 'no'],
               [1, 0, 0, 0, 'no'],
               [1, 0, 0, 1, 'no'],
               [1, 1, 1, 1, 'yes'],
               [1, 0, 1, 2, 'yes'],
               [1, 0, 1, 2, 'yes'],
               [2, 0, 1, 2, 'yes'],
               [2, 0, 1, 1, 'yes'],
               [2, 1, 0, 1, 'yes'],
               [2, 1, 0, 2, 'yes'],
               [2, 0, 0, 0, 'no']]
    labels = ['age', 'work', 'house', 'credit']    # feature names
    return dataSet, labels

def splitData(data, axis, value):
    # Return the subset of rows whose feature `axis` equals `value`, with that column removed.
    newData = []
    for vec in data:
        if vec[axis] == value:
            newData.append(vec[:axis] + vec[axis+1:])
    return newData
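A quick sanity check of splitData (my own addition, not in the original): splitting on the 'house' feature (column 2) with value 1 should keep only the six house-owning applicants, all labelled 'yes', with that column removed.

data, labels = createDataSet()
print(splitData(data, 2, 1))
# expected: [[0, 1, 0, 'yes'], [1, 1, 1, 'yes'], [1, 0, 2, 'yes'],
#            [1, 0, 2, 'yes'], [2, 0, 2, 'yes'], [2, 0, 1, 'yes']]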
2) Computing the entropy of the class labels
def calcEntropy(data):
    row = len(data)
    label = {}
    for vec in data:
        current_label = vec[-1]
        if current_label not in label.keys():
            label[current_label] = 0
        label[current_label] += 1
    entropy = 0
    for key in label:
        prob = float(label[key])/row
        entropy -= prob*log(prob,2)
    return entropy
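For the 15-sample dataset above (9 'yes' and 6 'no' labels) the entropy should come out at roughly 0.971 bits, which is a handy check on the implementation:

data, labels = createDataSet()
print(calcEntropy(data))    # roughly 0.971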
3) Choosing the best feature
def chooseFeature(data):
    features = len(data[0])-1              # number of features (the last column is the label)
    entropy = calcEntropy(data)            # H(D)
    best_info_gain = 0.0
    best_feature = -1
    for i in range(features):
        feature_list = set([example[i] for example in data])
        temp = 0.0
        for value in feature_list:
            subdata = splitData(data,i,value)
            prob = len(subdata)/len(data)
            temp += prob*calcEntropy(subdata)   # weighted entropy of each subset -> H(D|A)
        info_gain = entropy - temp              # g(D, A) = H(D) - H(D|A)
        if info_gain > best_info_gain:
            best_info_gain = info_gain
            best_feature = i
    return best_feature
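By my own calculation (these numbers are not stated in the original), the information gains on the loan data are roughly 0.083 (age), 0.324 (work), 0.420 (house), and 0.363 (credit), so chooseFeature should return index 2, the 'house' feature:

data, labels = createDataSet()
best = chooseFeature(data)
print(best, labels[best])    # expected: 2 house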
4) Finding the majority class label
def majorityClass(class_list):
    class_count = {}
    for vote in class_list:
        if vote not in class_count.keys():
            class_count[vote] = 0
        class_count[vote] += 1
    sort_class = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sort_class[0][0]
5) Building the decision tree
def createTree(data, labels, features):
    class_list = [example[-1] for example in data]
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]                 # all samples share one class: return it as a leaf
    if len(data[0]) == 1 or len(labels) == 0:
        return majorityClass(class_list)     # no features left: return the majority class
    best_feature = chooseFeature(data)
    best_label = labels[best_feature]
    features.append(best_label)              # record the features actually used, in order
    tree = {best_label: {}}
    del(labels[best_feature])
    feature_list = set([example[best_feature] for example in data])
    for value in feature_list:
        sublabels = labels[:]
        tree[best_label][value] = createTree(splitData(data, best_feature, value), sublabels, features)
    return tree
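Building the tree on the loan data should produce the nested dict below (my expectation, easy to verify by running the code):

data, labels = createDataSet()
feature_labels = []
print(createTree(data, labels, feature_labels))
# expected: {'house': {0: {'work': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
print(feature_labels)    # expected: ['house', 'work'], the features used, in order of selection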
6) Counting the leaves and the depth of the tree
def numLeafs(tree):
    leafs = 0
    string = next(iter(tree))
    tree_dict = tree[string]
    for key in tree_dict.keys():
        if type(tree_dict[key]).__name__ == 'dict':
            leafs += numLeafs(tree_dict[key])
        else:
            leafs += 1
    return leafs

def treeDepth(tree):
    max_depth = 0
    string = next(iter(tree))
    tree_dict = tree[string]
    for key in tree_dict.keys():
        if type(tree_dict[key]).__name__ == 'dict':
            depth = 1 + treeDepth(tree_dict[key])
        else:
            depth = 1
        if depth > max_depth:
            max_depth = depth
    return max_depth
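Assuming the tree from the previous step, these helpers should report 3 leaves and a depth of 2:

tree = {'house': {0: {'work': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
print(numLeafs(tree), treeDepth(tree))    # expected: 3 2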
7) Plotting the tree, including nodes and edge annotations
def plotNode(node_txt, centerPt, parent, node_type):
    arrow = dict(arrowstyle='<-')
    createPlot.ax1.annotate(node_txt, xy=parent, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va='center', ha='center', bbox=node_type, arrowprops=arrow)

def plotText(cntr, parent, txt):
    x_mid = (parent[0] - cntr[0]) / 2.0 + cntr[0]
    y_mid = (parent[1] - cntr[1]) / 2.0 + cntr[1]
    createPlot.ax1.text(x_mid, y_mid, txt, va='center', ha='center', rotation=30)

def createPlot(tree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(numLeafs(tree))
    plotTree.totalD = float(treeDepth(tree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(tree, (0.5, 1.0), '')
    plt.show()

def plotTree(tree, parent, txt):
    decision_node = dict(boxstyle='sawtooth', fc='0.8')
    leaf_node = dict(boxstyle='round4', fc='0.8')
    leafs = numLeafs(tree)
    depth = treeDepth(tree)
    string = next(iter(tree))
    cntr = (plotTree.xOff + (1.0 + float(leafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotText(cntr, parent, txt)
    plotNode(string, cntr, parent, decision_node)
    tree_dict = tree[string]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in tree_dict.keys():
        if type(tree_dict[key]).__name__ == 'dict':
            plotTree(tree_dict[key], cntr, str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(tree_dict[key], (plotTree.xOff, plotTree.yOff), cntr, leaf_node)
            plotText((plotTree.xOff, plotTree.yOff), cntr, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD
8) Classifying with the decision tree and storing it
def classify(tree, labels, vec):
    string = next(iter(tree))
    tree_dict = tree[string]
    index = labels.index(string)
    for key in tree_dict.keys():
        if vec[index] == key:
            if type(tree_dict[key]).__name__ == 'dict':
                class_label = classify(tree_dict[key], labels, vec)
            else:
                class_label = tree_dict[key]
    return class_label

def storeTree(tree, filename):
    with open(filename, 'wb') as f:
        pickle.dump(tree, f)
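The original code only stores the tree; a matching loader (my own addition, mirroring the pickle call above) could look like this:

def loadTree(filename):
    # Load a decision tree previously saved with storeTree.
    with open(filename, 'rb') as f:
        return pickle.load(f)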
9) Main program
if __name__ == '__main__':
    data, labels = createDataSet()
    feature_labels = []
    tree = createTree(data, labels, feature_labels)
    createPlot(tree)
    test_vec = [0, 1]                  # [house, work] = [0, 1]: no house, but has a job
    result = classify(tree, feature_labels, test_vec)
    if result == 'yes':
        print('lending')
    if result == 'no':
        print('no lending')
The figure below is a schematic of classification with the tree produced above and shows how the classification tree operates.
[Figure: classification with the learned decision tree]
2.2 Using the sklearn Library
The mglearn library can be used to illustrate the idea of a decision tree on an animal-classification example:
import mglearn
mglearn.plots.plot_animal_tree()
[Figure: animal classification example]
Using DecisionTreeClassifier from sklearn on the breast cancer dataset gives the following results:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
import graphviz
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,stratify=cancer.target,random_state=42)
tree = DecisionTreeClassifier(max_depth=4,random_state=0)
tree.fit(X_train,y_train)
print('accuracy on training set:{:.3f}'.format(tree.score(X_train,y_train)))
print('accuracy on test set:{:.3f}'.format(tree.score(X_test,y_test)))
accuracy on training set:0.988
accuracy on test set:0.951
The tree's decision process can be visualized as follows:
export_graphviz(tree,out_file='tree.dot',class_names=['malignant','benign'],feature_names=cancer.feature_names,impurity=False,filled=True)
with open('tree.dot') as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
[Figure: decision tree learned on the breast cancer dataset]
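As an alternative that avoids the external graphviz dependency (a sketch added here, assuming scikit-learn 0.21 or later and the tree and cancer objects fitted above), sklearn.tree.plot_tree draws the same tree with matplotlib; the figure size is an arbitrary choice:

from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(tree, class_names=['malignant', 'benign'],
          feature_names=list(cancer.feature_names),
          impurity=False, filled=True, ax=ax)
plt.show()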
Similarly, the feature importances of the fitted tree can be inspected:
n_features = cancer.data.shape[1]
plt.barh(range(n_features),tree.feature_importances_,align='center')
plt.yticks(np.arange(n_features),cancer.feature_names)
plt.xlabel('Feature importance')
plt.ylabel('Feature')
plt.show()
[Figure: feature importance distribution]
3. Further Discussion
(Information) entropy, joint entropy, conditional entropy, relative entropy, and mutual information
Entropy measures the uncertainty of a random variable. Let $X$ be a discrete random variable with probability mass function $p(x) = P(X = x)$; its entropy is
$$H(X) = -\sum_x p(x) \log p(x).$$
When the logarithm has base 2 the unit is the bit; when the base is $e$ the unit is the nat.
For a pair of random variables $(X, Y)$ with joint distribution $p(x, y)$, the joint entropy is
$$H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y).$$
Similarly, the conditional entropy is
$$H(Y \mid X) = \sum_x p(x)\, H(Y \mid X = x) = -\sum_x \sum_y p(x, y) \log p(y \mid x).$$
In addition, the chain rule $H(X, Y) = H(X) + H(Y \mid X)$ holds. Proof: since $p(x, y) = p(x)\, p(y \mid x)$,
$$H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y) = -\sum_{x, y} p(x, y) \log p(x) - \sum_{x, y} p(x, y) \log p(y \mid x) = H(X) + H(Y \mid X).$$
Relative entropy, also called the Kullback-Leibler (KL) divergence, measures the "distance" between two distributions: if the true distribution is $p$ and the assumed distribution is $q$, then
$$D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}.$$
It is closely related to, but distinct from, the cross entropy $H(p, q) = -\sum_x p(x) \log q(x)$, since $H(p, q) = H(p) + D(p \,\|\, q)$.
Mutual information measures how much information one random variable carries about another; equivalently, it is the reduction in the uncertainty of one variable once the other is known:
$$I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.$$
Relation between entropy and mutual information: $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$. Proof:
$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x \mid y)}{p(x)} = -\sum_{x, y} p(x, y) \log p(x) + \sum_{x, y} p(x, y) \log p(x \mid y) = H(X) - H(X \mid Y),$$
and the second form follows by symmetry in $X$ and $Y$.
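These identities are easy to check numerically. The sketch below (added for this note) uses an arbitrary made-up 2x2 joint distribution; any strictly positive table that sums to 1 would do:

import numpy as np

# An arbitrary joint distribution p(x, y) over two binary variables (rows: x, columns: y).
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)    # marginal distribution of X
p_y = p_xy.sum(axis=0)    # marginal distribution of Y

def H(p):
    # Shannon entropy (in bits) of an array of probabilities summing to 1.
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x)
p_y_given_x = p_xy / p_x[:, None]
H_y_given_x = -np.sum(p_xy * np.log2(p_y_given_x))

# Mutual information from its definition
I_xy = np.sum(p_xy * np.log2(p_xy / (p_x[:, None] * p_y[None, :])))

print(np.isclose(H(p_xy), H(p_x) + H_y_given_x))    # chain rule H(X,Y) = H(X) + H(Y|X)
print(np.isclose(I_xy, H(p_y) - H_y_given_x))       # I(X;Y) = H(Y) - H(Y|X)

Both checks should print True.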
"I urge you to drain one more cup of wine; west of the Yang Pass you will meet no old friends." — Wang Wei, "Seeing Yuan Er Off on a Mission to Anxi"