
2020 Machine Learning: Decision Trees (3)

2020-03-19  zidea

CART

Loss = \min_{j,s} \left[ \min_{c_1} \sum_{x^{(i)} \in R_1(j,s)} L(y^{(i)}, c_1) + \min_{c_2} \sum_{x^{(i)} \in R_2(j,s)} L(y^{(i)}, c_2) \right] \tag{1}

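\hat{y} = \frac{1}{|R_m|} \sum_{x^{(i)} \in R_m} y^{(i)}

On dimension j of the variable x we pick a point s that splits the data along that dimension into two regions, R_1 and R_2. \hat{y} is the mean of y^{(i)} over the region R_m (m = 1, 2) that x falls in. The split point is chosen so that objective (1) above is minimized. The distance between \hat{y} and y^{(i)} can be measured by the absolute error or by the squared error.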

\sum_i (y^{(i)} - \hat{y})^2
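
The split and the squared-error distance described above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `split_loss`, its arguments, and the toy data are assumptions and do not appear in the original notebook.

```python
import numpy as np

def split_loss(X, y, j, s):
    """Squared-error loss of splitting on feature j at threshold s (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    left = X[:, j] <= s           # samples in R_1(j, s)
    right = ~left                 # samples in R_2(j, s)
    loss = 0.0
    for mask in (left, right):
        if mask.any():            # an empty region contributes nothing
            c = y[mask].mean()    # constant prediction for the region: the mean of y
            loss += np.sum((y[mask] - c) ** 2)
    return loss

# Toy data made up for illustration: four samples, two features
X_demo = np.array([[1.0, 5.0], [2.0, 6.0], [3.0, 7.0], [4.0, 8.0]])
y_demo = np.array([1.0, 1.0, 3.0, 3.0])

print(split_loss(X_demo, y_demo, j=0, s=2.5))  # both regions are constant, so the loss is 0.0
print(split_loss(X_demo, y_demo, j=0, s=1.5))  # a worse split gives a larger loss
```

A smaller value means the two region means fit the targets more closely, which is the quantity the outer minimization over (j, s) compares.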

Loss = \min_{j,s} \left[ \min_{c_1} \sum_{x^{(i)} \in R_1(j,s)} (y^{(i)} - c_1)^2 + \min_{c_2} \sum_{x^{(i)} \in R_2(j,s)} (y^{(i)} - c_2)^2 \right]

\begin{cases} R_1(j,s) = \{ x | x_j \le s \} \\ R_2(j,s) = \{ x | x_j > s \} \end{cases}

c_m = \frac{1}{N_m} \sum_{x^{(i)} \in R_m(j,s)} y^{(i)}, \quad m = 1,2
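
For the squared-error case, the choice of c_m as the region mean follows from setting the derivative of the inner objective to zero. A short derivation, with N_m taken to be the number of samples in R_m(j,s):

```latex
\frac{\partial}{\partial c_m} \sum_{x^{(i)} \in R_m(j,s)} (y^{(i)} - c_m)^2
  = -2 \sum_{x^{(i)} \in R_m(j,s)} (y^{(i)} - c_m) = 0
\;\Longrightarrow\;
N_m c_m = \sum_{x^{(i)} \in R_m(j,s)} y^{(i)}
\;\Longrightarrow\;
c_m = \frac{1}{N_m} \sum_{x^{(i)} \in R_m(j,s)} y^{(i)}
```

If the absolute error were used instead, the minimizing constant would be the median of the region rather than its mean.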

\begin{cases} R_1 = \{ 0 \} & c_1 = 0\\ R_2 = \{ 1,2,\dots,9 \} & c_2 = 5 \end{cases}
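
The constants in this small example can be checked numerically. The sketch below assumes that the elements listed in R_1 and R_2 are themselves the target values y (the source does not state the targets explicitly), so each c_m is just the mean of its set.

```python
import numpy as np

# Assumption: the set elements double as the target values y
R_1 = np.array([0])
R_2 = np.arange(1, 10)   # 1, 2, ..., 9

c_1 = R_1.mean()
c_2 = R_2.mean()
print(c_1, c_2)  # 0.0 5.0
```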

import matplotlib.pyplot as plt
import numpy as np
# Jupyter magic: render plots inline (omit outside a notebook)
%matplotlib inline
# Ten noisy samples from the line y = 2x + 3
np.random.seed(0)
X = np.random.normal(0,1,10)
y = X * 2 + 3 + np.random.normal(0,0.5,10)
plt.scatter(X,y)
# Candidate split point s = 0
plt.axvline(x=0,color='r',linestyle='dashed')
[Figure: scatter plot of X against y with a dashed red vertical line at x = 0]
# Split at s = 0: R_1 holds the samples with x < 0, R_2 the rest
n_left = np.count_nonzero(X < 0)
X_R_1,X_R_2 = np.split(np.sort(X),[n_left])

X_R_1,X_R_2
(array([-0.97727788, -0.15135721, -0.10321885]),
 array([0.40015721, 0.4105985 , 0.95008842, 0.97873798, 1.76405235,
        1.86755799, 2.2408932 ]))
# Select y by the region its x falls in (sorting y on its own would break the x-y pairing),
# then sort within each region for display
y_R_1,y_R_2 = np.sort(y[X < 0]),np.sort(y[X >= 0])
y_R_1,y_R_2

(array([1.2122814 , 2.59470645, 2.95009615]),
 array([3.39414913, 4.52745117, 5.33799483, 5.64721637, 6.60012648,
        6.9570476 , 7.54262391]))
# c_1 and c_2 are the means of y in each region
c_1 = 1/y_R_1.size * np.sum(y_R_1)
c_2 = 1/y_R_2.size * np.sum(y_R_2)
c_1
2.2523613342314497
plt.scatter(X,y)
plt.axvline(x=0,color='r',linestyle='dashed')
plt.axhline(y=c_1,color='b',linestyle='dashed')
plt.axhline(y=c_2,color='b',linestyle='dashed')
plt.grid()
[Figure: the same scatter plot with the split at x = 0 and dashed blue horizontal lines at y = c_1 and y = c_2]
y_R_2 - c_2
array([-2.32108079, -1.18777876, -0.3772351 , -0.06801356,  0.88489655,
        1.24181767,  1.82739398])

Loss(s) = \sum_{x^{(i)} \in R_1(s)} (y^{(i)} - c_1)^2 + \sum_{x^{(i)} \in R_2(s)} (y^{(i)} - c_2)^2

L = np.sum((y_R_1 - c_1 )**2) + np.sum((y_R_2 - c_2)**2)
L
14.29548867988851
# Try a second candidate split point, s = 0.55
plt.scatter(X,y)
plt.axvline(x=0.55,color='r',linestyle='dashed')
[Figure: scatter plot of X against y with a dashed red vertical line at x = 0.55]
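# Split at s = 0.55: R_1 holds the samples with x < 0.55, R_2 the rest
n_left = np.count_nonzero(X < 0.55)
n_left
5
X_R_1,X_R_2 = np.split(np.sort(X),[n_left])
# Select y by the region its x falls in, then sort within each region
y_R_1,y_R_2 = np.sort(y[X < 0.55]),np.sort(y[X >= 0.55])
c_1 = 1/y_R_1.size * np.sum(y_R_1)
c_2 = 1/y_R_2.size * np.sum(y_R_2)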

plt.scatter(X,y)
plt.axvline(x=0.55,color='r',linestyle='dashed')
plt.axhline(y=c_1,color='b',linestyle='dashed')
plt.axhline(y=c_2,color='b',linestyle='dashed')
plt.grid()
[Figure: the same scatter plot with the split at x = 0.55 and dashed blue horizontal lines at y = c_1 and y = c_2]
L = np.sum((y_R_1 - c_1 )**2) + np.sum((y_R_2 - c_2)**2)
L
9.179537778528713
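
The two thresholds tried above (0 and 0.55) were picked by hand. A minimal sketch of the exhaustive search over one feature follows. It assumes that candidate thresholds are the midpoints between consecutive sorted X values, which is a common convention rather than something the notebook specifies, and the helper names `sse`, `split_loss_1d` and `best_split_1d` are hypothetical.

```python
import numpy as np

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty array)."""
    return float(np.sum((values - values.mean()) ** 2)) if values.size else 0.0

def split_loss_1d(X, y, s):
    """Loss of splitting 1-D data at threshold s: R_1 = {x <= s}, R_2 = {x > s}."""
    left = X <= s
    return sse(y[left]) + sse(y[~left])

def best_split_1d(X, y):
    """Try every midpoint between consecutive sorted X values; return (s, loss)."""
    xs = np.sort(X)
    candidates = (xs[:-1] + xs[1:]) / 2
    losses = [split_loss_1d(X, y, s) for s in candidates]
    k = int(np.argmin(losses))
    return float(candidates[k]), losses[k]

# Same data as in the cells above
np.random.seed(0)
X = np.random.normal(0, 1, 10)
y = X * 2 + 3 + np.random.normal(0, 0.5, 10)

print(split_loss_1d(X, y, 0.0))    # loss for the split at x = 0
print(split_loss_1d(X, y, 0.55))   # loss for the split at x = 0.55
print(best_split_1d(X, y))         # best threshold found and its loss
```

Because the regions are selected with a boolean mask on X, each y stays paired with its own x, so the function also works when y is not monotonic in x.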
