Python Data Analysis and Machine Learning 23 - Decision Tree Project in Practice

2022-07-21  只是甲

1. Dataset Introduction

We will use an official sklearn dataset: the California housing dataset.

Code:

from sklearn.datasets import fetch_california_housing

# Load the dataset (downloaded on first use)
housing = fetch_california_housing()
print("#######################################")
print(housing.DESCR)        # full dataset description
print("#######################################")
print(housing.data.shape)   # (n_samples, n_features)
print("#######################################")
print(housing.data[0:10])   # first ten rows of the feature matrix

Output:

#######################################
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297

#######################################
(20640, 8)
#######################################
[[ 8.32520000e+00  4.10000000e+01  6.98412698e+00  1.02380952e+00
   3.22000000e+02  2.55555556e+00  3.78800000e+01 -1.22230000e+02]
 [ 8.30140000e+00  2.10000000e+01  6.23813708e+00  9.71880492e-01
   2.40100000e+03  2.10984183e+00  3.78600000e+01 -1.22220000e+02]
 [ 7.25740000e+00  5.20000000e+01  8.28813559e+00  1.07344633e+00
   4.96000000e+02  2.80225989e+00  3.78500000e+01 -1.22240000e+02]
 [ 5.64310000e+00  5.20000000e+01  5.81735160e+00  1.07305936e+00
   5.58000000e+02  2.54794521e+00  3.78500000e+01 -1.22250000e+02]
 [ 3.84620000e+00  5.20000000e+01  6.28185328e+00  1.08108108e+00
   5.65000000e+02  2.18146718e+00  3.78500000e+01 -1.22250000e+02]
 [ 4.03680000e+00  5.20000000e+01  4.76165803e+00  1.10362694e+00
   4.13000000e+02  2.13989637e+00  3.78500000e+01 -1.22250000e+02]
 [ 3.65910000e+00  5.20000000e+01  4.93190661e+00  9.51361868e-01
   1.09400000e+03  2.12840467e+00  3.78400000e+01 -1.22250000e+02]
 [ 3.12000000e+00  5.20000000e+01  4.79752705e+00  1.06182380e+00
   1.15700000e+03  1.78825348e+00  3.78400000e+01 -1.22250000e+02]
 [ 2.08040000e+00  4.20000000e+01  4.29411765e+00  1.11764706e+00
   1.20600000e+03  2.02689076e+00  3.78400000e+01 -1.22260000e+02]
 [ 3.69120000e+00  5.20000000e+01  4.97058824e+00  9.90196078e-01
   1.55100000e+03  2.17226891e+00  3.78400000e+01 -1.22250000e+02]]
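
The raw NumPy array above is hard to read without column names. For easier inspection, it can be wrapped in a pandas DataFrame with the feature names attached; the sketch below is illustrative (the variable names df and MedHouseVal are our own choices, not part of the dataset):

import pandas as pd
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()

# Wrap the feature matrix in a DataFrame so columns carry their names
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df["MedHouseVal"] = housing.target  # target: median house value in $100,000s

print(df.head())      # first five rows with readable column headers
print(df.describe())  # per-feature summary statistics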

2. Building a Decision Tree with sklearn

Code:

from sklearn.datasets import fetch_california_housing
from sklearn import tree
import pydotplus
from IPython.display import Image

# Load the dataset
housing = fetch_california_housing()
#print(housing.DESCR)
#print(housing.data.shape)

# Build a regression tree with a maximum depth of 2
dtr = tree.DecisionTreeRegressor(max_depth = 2)
# fit takes X and y; here we use only columns 6 and 7: latitude and longitude
dtr.fit(housing.data[:, [6, 7]], housing.target)

# Export the fitted tree in Graphviz dot format
dot_data = \
    tree.export_graphviz(
        dtr,  # the fitted decision tree model
        out_file = None,
        feature_names = housing.feature_names[6:8],  # names of the two features used
        filled = True,
        impurity = False,
        rounded = True
    )

# Render the dot source to an image
graph = pydotplus.graph_from_dot_data(dot_data)
graph.get_nodes()[7].set_fillcolor("#FFF2DD")  # highlight one node
Image(graph.create_png())  # display inline in Jupyter
graph.write_png("dtr_white_background.png")

Output:

(Figure: the exported depth-2 regression tree, split on Latitude and Longitude.)
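
If pydotplus or Graphviz is not installed, recent sklearn versions (0.21+) can render the same tree directly with matplotlib via tree.plot_tree. A minimal sketch, assuming the dtr and housing objects from the code above:

import matplotlib.pyplot as plt
from sklearn import tree

# Draw the fitted regressor without any Graphviz dependency
fig, ax = plt.subplots(figsize=(10, 6))
tree.plot_tree(
    dtr,
    feature_names=housing.feature_names[6:8],  # Latitude, Longitude
    filled=True,
    impurity=False,
    rounded=True,
    ax=ax,
)
plt.show()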

3. Parameter Tuning

3.1 Tree Model Parameters

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestRegressor

# Load the dataset
housing = fetch_california_housing()
#print(housing.DESCR)
#print(housing.data.shape)

# Split into training and test sets (10% held out for testing)
data_train, data_test, target_train, target_test = \
    train_test_split(housing.data, housing.target, test_size=0.1, random_state=42)

# Use GridSearchCV to cross-validate and find the best parameter combination
tree_param_grid = {'min_samples_split': [3, 6, 9], 'n_estimators': [10, 50, 100]}
grid = GridSearchCV(RandomForestRegressor(), param_grid=tree_param_grid, cv=5)
grid.fit(data_train, target_train)

print(grid.best_params_, grid.best_score_)
print("###########################################")
print(grid.cv_results_)
print("******************************************")

# Retrain the model with the best parameters found above
rfr = RandomForestRegressor(min_samples_split=3, n_estimators=100, random_state=42)
rfr.fit(data_train, target_train)
print(rfr.score(data_test, target_test))

print("###########################################")
# Print the feature importances, highest first
result_1 = pd.Series(rfr.feature_importances_, index=housing.feature_names).sort_values(ascending=False)
print(result_1)

Output:

{'min_samples_split': 3, 'n_estimators': 100} 0.8068921554242904
###########################################
{'mean_fit_time': array([0.8004458 , 4.03023052, 7.95025473, 0.74344239, 3.7716157 ,
       7.44682593, 0.71324077, 3.56500397, 7.17021012]), 'std_fit_time': array([0.00475824, 0.07281077, 0.04690861, 0.00344107, 0.0298989 ,
       0.07583201, 0.00172054, 0.01466244, 0.06738456]), 'mean_score_time': array([0.00940046, 0.04360251, 0.08440475, 0.00740047, 0.03520207,
       0.0692039 , 0.00680041, 0.03140182, 0.06140351]), 'std_score_time': array([0.00079997, 0.00119998, 0.00079999, 0.00048992, 0.00146967,
       0.0014698 , 0.00040007, 0.00101995, 0.00048998]), 'param_min_samples_split': masked_array(data=[3, 3, 3, 6, 6, 6, 9, 9, 9],
             mask=[False, False, False, False, False, False, False, False,
                   False],
       fill_value='?',
            dtype=object), 'param_n_estimators': masked_array(data=[10, 50, 100, 10, 50, 100, 10, 50, 100],
             mask=[False, False, False, False, False, False, False, False,
                   False],
       fill_value='?',
            dtype=object), 'params': [{'min_samples_split': 3, 'n_estimators': 10}, {'min_samples_split': 3, 'n_estimators': 50}, {'min_samples_split': 3, 'n_estimators': 100}, {'min_samples_split': 6, 'n_estimators': 10}, {'min_samples_split': 6, 'n_estimators': 50}, {'min_samples_split': 6, 'n_estimators': 100}, {'min_samples_split': 9, 'n_estimators': 10}, {'min_samples_split': 9, 'n_estimators': 50}, {'min_samples_split': 9, 'n_estimators': 100}], 'split0_test_score': array([0.79160023, 0.81033663, 0.81080664, 0.79252505, 0.80765124,
       0.8117226 , 0.79482236, 0.80654121, 0.81009268]), 'split1_test_score': array([0.78890171, 0.79928278, 0.80141166, 0.78598424, 0.79600627,
       0.79906143, 0.77847628, 0.80139533, 0.80106729]), 'split2_test_score': array([0.78774617, 0.80061737, 0.80477793, 0.7863239 , 0.80015027,
       0.80422593, 0.78908726, 0.7983616 , 0.80104587]), 'split3_test_score': array([0.78962335, 0.80730398, 0.8103788 , 0.791012  , 0.80609438,
       0.81097377, 0.79964151, 0.80805967, 0.81052746]), 'split4_test_score': array([0.786275  , 0.80545325, 0.8070847 , 0.7935861 , 0.80582379,
       0.80788122, 0.78913747, 0.80539517, 0.80880244]), 'mean_test_score': array([0.78882944, 0.80459911, 0.80689216, 0.7898864 , 0.80314543,
       0.80677326, 0.79023322, 0.80395074, 0.80630735]), 'std_test_score': array([0.00178956, 0.00412521, 0.00352203, 0.00315705, 0.00438431,
       0.0046761 , 0.00707542, 0.00356223, 0.00432444]), 'rank_test_score': array([9, 4, 1, 8, 6, 2, 7, 5, 3]), 'split0_train_score': array([0.95768184, 0.96868382, 0.96984789, 0.94577487, 0.95695691,
       0.95793227, 0.93473427, 0.94331768, 0.94445909]), 'split1_train_score': array([0.95861403, 0.96806912, 0.96997044, 0.94650039, 0.95641239,
       0.95734931, 0.93327575, 0.9443791 , 0.94508244]), 'split2_train_score': array([0.9597377 , 0.96897156, 0.97018228, 0.94723203, 0.9570711 ,
       0.95760447, 0.93364428, 0.94375518, 0.94538129]), 'split3_train_score': array([0.9594541 , 0.96937818, 0.96999027, 0.94439822, 0.95580838,
       0.95731595, 0.93525466, 0.94336581, 0.94493027]), 'split4_train_score': array([0.95786358, 0.96852439, 0.96974745, 0.94598214, 0.95641961,
       0.9580851 , 0.93313765, 0.94409264, 0.94537141]), 'mean_train_score': array([0.95867025, 0.96872541, 0.96994767, 0.94597753, 0.95653368,
       0.95765742, 0.93400932, 0.94378208, 0.9450449 ]), 'std_train_score': array([0.00082276, 0.00043807, 0.00014658, 0.00093621, 0.00045204,
       0.0003075 , 0.00083757, 0.0004105 , 0.00033985])}
******************************************
0.8090829049653158
###########################################
MedInc        0.524257
AveOccup      0.137947
Latitude      0.090622
Longitude     0.089414
HouseAge      0.053970
AveRooms      0.044443
Population    0.030263
AveBedrms     0.029084
dtype: float64
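
The raw cv_results_ dictionary above is hard to read. It can be flattened into a DataFrame and reduced to the columns that matter; a minimal sketch, assuming the fitted grid object from the code above:

import pandas as pd

# Flatten the grid-search results and keep only the interesting columns
cv_df = pd.DataFrame(grid.cv_results_)
summary = cv_df[["param_min_samples_split", "param_n_estimators",
                 "mean_test_score", "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))

This makes it easy to confirm at a glance that min_samples_split=3 with n_estimators=100 ranks first, matching best_params_ above.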

References:

  1. https://study.163.com/course/introduction.htm?courseId=1003590004#/courseDetail?tab=1