Kaggle Learn Topic 2: Introduction to Machine Learning

2019-05-28  Python解决方案

Lesson 1: How Models Work


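1. Take house price prediction as an example. Given historical house prices for a city and information about individual listings, the task is to predict future prices in that city. The simplest method to start with is a decision tree.

[Figure: 简单决策树.png (a simple decision tree)]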
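
To make the idea concrete, a decision tree can be read as nested if/else rules that route a house to a leaf, and the leaf holds the predicted price. The sketch below is purely illustrative: the function name, the split thresholds, and the leaf prices are made-up values, not taken from any dataset or from the course.

```python
def predict_price(bedrooms: int, lot_size_sqm: float) -> int:
    """Toy decision tree; every threshold and price here is hypothetical."""
    if bedrooms <= 2:
        # Smaller homes: split once more on lot size
        if lot_size_sqm <= 800:
            return 146_000
        return 188_000
    # Larger homes: split on a different lot-size threshold
    if lot_size_sqm <= 1100:
        return 170_000
    return 233_000


small_home = predict_price(bedrooms=2, lot_size_sqm=500)
large_home = predict_price(bedrooms=4, lot_size_sqm=1500)
print(small_home, large_home)
```

In a real model the thresholds and leaf values are not written by hand; they are learned from the training data when the model is fitted.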

2. The general workflow:


Lesson 2: Basic Data Exploration


1. Use pandas to inspect the data and get familiar with it.

import pandas as pd

2. The most important data type in pandas is the DataFrame; the other one to know is the Series.

# save filepath to variable for easier access
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'

# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 

# print a summary of the data in Melbourne data
melbourne_data.describe()
               Rooms          Price       Distance       Postcode       Bedroom2       Bathroom            Car       Landsize   BuildingArea      YearBuilt      Lattitude     Longtitude  Propertycount
count   13580.000000   1.358000e+04   13580.000000   13580.000000   13580.000000   13580.000000   13518.000000   13580.000000    7130.000000    8205.000000   13580.000000   13580.000000   13580.000000
mean        2.937997   1.075684e+06      10.137776    3105.301915       2.914728       1.534242       1.610075     558.416127     151.967650    1964.684217     -37.809203     144.995216    7454.417378
std         0.955748   6.393107e+05       5.868725      90.676964       0.965921       0.691712       0.962634    3990.669241     541.014538      37.273762       0.079260       0.103916    4378.581772
min         1.000000   8.500000e+04       0.000000    3000.000000       0.000000       0.000000       0.000000       0.000000       0.000000    1196.000000     -38.182550     144.431810     249.000000
25%         2.000000   6.500000e+05       6.100000    3044.000000       2.000000       1.000000       1.000000     177.000000      93.000000    1940.000000     -37.856822     144.929600    4380.000000
50%         3.000000   9.030000e+05       9.200000    3084.000000       3.000000       1.000000       2.000000     440.000000     126.000000    1970.000000     -37.802355     145.000100    6555.000000
75%         3.000000   1.330000e+06      13.000000    3148.000000       3.000000       2.000000       2.000000     651.000000     174.000000    1999.000000     -37.756400     145.058305   10331.000000
max        10.000000   9.000000e+06      48.100000    3977.000000      20.000000       8.000000      10.000000  433014.000000   44515.000000    2018.000000     -37.408530     145.526350   21650.000000

Lesson 3: Explore Your Own Data


The goal of this lesson is to improve your ability to read data files and interpret summary statistics.
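
One statistic worth learning to read is `count`. In the Melbourne summary from Lesson 2, `BuildingArea` has a count of 7130 while most columns have 13580, which suggests missing values. The sketch below shows the same effect on a small made-up DataFrame (the name `toy` and all of its values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing value
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
    "YearBuilt": [1900.0, 1970.0, np.nan, 2010.0],
})

summary = toy.describe()

# "count" only includes non-missing values,
# so (number of rows - count) is the number of missing values per column
missing_per_column = len(toy) - summary.loc["count"]
print(missing_per_column)

# The same numbers computed directly
print(toy.isnull().sum())
```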


Lesson 4: Your First Machine Learning Model


Preprocess the data

import pandas as pd

melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns

Handle missing data

# dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)
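
As a small illustration of what `dropna(axis=0)` does, the sketch below applies it to a made-up DataFrame (the name `toy` and its values are assumptions). Note that the surviving rows keep their original index labels, which is why row labels in later output need not be consecutive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two incomplete rows
toy = pd.DataFrame({
    "Price": [1_000_000.0, 850_000.0, np.nan, 1_200_000.0],
    "Car": [1.0, np.nan, 2.0, 2.0],
})

# axis=0 drops rows (not columns) that contain at least one missing value
cleaned = toy.dropna(axis=0)

print(len(toy), "rows before,", len(cleaned), "rows after")
print(cleaned.index.tolist())  # original labels of the rows that remain
```

Dropping rows is the simplest option but can discard a lot of data; whether that is acceptable depends on the dataset.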

Ways to select a subset of the data:
1) Dot notation, which we use to select the "prediction target"
2) Selecting with a column list, which we use to select the "features"

Select the prediction target

y = melbourne_data.Price

Select the features

melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
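
The two selection styles return different pandas types: a single column comes back as a Series, while a list of columns comes back as a DataFrame. A minimal sketch on made-up data (the names `toy`, `y_toy`, and `X_toy` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1.0, 2.0, 1.0],
    "Price": [1_035_000.0, 1_465_000.0, 1_600_000.0],
})

# Dot notation returns a single column as a Series
y_toy = toy.Price

# Bracket notation with one name is equivalent to dot notation
same_column = toy["Price"]

# A list of column names returns a DataFrame
X_toy = toy[["Rooms", "Bathroom"]]

print(type(y_toy).__name__, y_toy.shape)
print(type(X_toy).__name__, X_toy.shape)
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute; bracket notation works for any name.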

Inspect the training data

X.describe()
X.head()

Build the model

The steps to building and using a model are: define the model, fit it to the training data, predict, and evaluate the predictions.

A decision tree example

from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

random_state ensures you get the same results in each run.

With the model built, first try it on the first five rows of the training data.

print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

The results are as follows:

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]

Lesson 5: Exercise: Your First Machine Learning Model


# Code you have previously used to load data
import pandas as pd

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

home_data = pd.read_csv(iowa_file_path)

# print the list of columns in the dataset to find the name of the prediction target
home_data.columns

y=home_data.SalePrice

# Create the list of features below
feature_names = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd']

# select data corresponding to features in feature_names
X = home_data[feature_names]

# Review data
# print description or statistics from X
print(X.describe())

# print the top few lines
print(X.head())

from sklearn.tree import DecisionTreeRegressor
#specify the model. 
#For model reproducibility, set a numeric value for random_state when specifying the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit the model
iowa_model.fit(X,y)

predictions = iowa_model.predict(X.head())
print(predictions)

Lesson 6: Model Validation


There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE).

from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
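
For each house the error is the actual price minus the predicted price; MAE takes the absolute value of each error and averages them. The sketch below computes it by hand on three made-up prices and compares the result with `mean_absolute_error` (the lists `actual` and `predicted` are assumptions for illustration).

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted prices
actual = [150_000, 200_000, 310_000]
predicted = [100_000, 220_000, 300_000]

# Absolute error per house, then the average
errors = [abs(a - p) for a, p in zip(actual, predicted)]
manual_mae = sum(errors) / len(errors)

print(errors)
print(manual_mae)
print(mean_absolute_error(actual, predicted))
```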

First split the data into training data and validation data.

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
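
Why the split matters can be seen by scoring one model twice: once on the rows it was trained on and once on held-out rows. The sketch below does this on synthetic data (the data-generating formula and all `*_demo` names are assumptions for illustration, not the Melbourne data). An unrestricted tree can memorise its training rows, so the in-sample score tends to look far better than the validation score.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: price depends on size plus random noise
rng = np.random.default_rng(0)
size = rng.uniform(50, 300, size=500)
price = 3_000 * size + rng.normal(0, 50_000, size=500)
X_demo = size.reshape(-1, 1)

train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, price, random_state=0
)

tree = DecisionTreeRegressor(random_state=0)
tree.fit(train_X_demo, train_y_demo)

# Score on the training rows (in-sample) and on the held-out rows
in_sample_mae = mean_absolute_error(train_y_demo, tree.predict(train_X_demo))
validation_mae = mean_absolute_error(val_y_demo, tree.predict(val_X_demo))

print(f"In-sample MAE:  {in_sample_mae:,.0f}")
print(f"Validation MAE: {validation_mae:,.0f}")
```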

Lesson 7: Exercise: Model Validation



Lesson 8: Underfitting and Overfitting


[Figure: 过拟合与欠拟合.png (underfitting vs. overfitting)]

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Output:

Max leaf nodes: 5       Mean Absolute Error:  347380
Max leaf nodes: 50      Mean Absolute Error:  258171
Max leaf nodes: 500     Mean Absolute Error:  243495
Max leaf nodes: 5000    Mean Absolute Error:  254983
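
In the output above, 500 leaf nodes gives the lowest validation MAE: fewer leaves underfit and more leaves start to overfit. Picking the best value can be automated by storing the scores in a dict and taking the minimum. The sketch below does this on synthetic data (the data-generating formula, the helper `validation_mae`, and the `*_demo` names are assumptions for illustration), so which candidate wins here says nothing about the Melbourne data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a non-linear target plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(1000, 2))
y_demo = (
    50_000 * np.sin(X_demo[:, 0])
    + 20_000 * X_demo[:, 1]
    + rng.normal(0, 10_000, size=1000)
)

train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=0
)


def validation_mae(max_leaf_nodes):
    """Fit a tree with the given size limit and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X_demo, train_y_demo)
    return mean_absolute_error(val_y_demo, model.predict(val_X_demo))


candidates = [5, 50, 500, 5000]
scores = {n: validation_mae(n) for n in candidates}

# The key whose score is smallest
best_tree_size = min(scores, key=scores.get)

print(scores)
print("Best max_leaf_nodes:", best_tree_size)
```

Once the size is chosen, a common next step is to refit the model with that setting on all of the data before making real predictions.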

Lesson 9: Exercise: Underfitting and Overfitting



Lesson 10: Random Forests


from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
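
A random forest builds many decision trees and averages their predictions. The sketch below checks this on synthetic data by averaging the individual trees by hand and comparing with `predict` (the data-generating formula, the `*_demo` names, and `n_estimators=25` are assumptions for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: a linear target plus noise
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = (
    30_000 * X_demo[:, 0]
    + 10_000 * X_demo[:, 1]
    + rng.normal(0, 20_000, size=300)
)

forest = RandomForestRegressor(n_estimators=25, random_state=1)
forest.fit(X_demo, y_demo)

new_houses = X_demo[:5]
forest_preds = forest.predict(new_houses)

# Average the predictions of the individual fitted trees by hand
per_tree_preds = np.array([tree.predict(new_houses) for tree in forest.estimators_])
manual_average = per_tree_preds.mean(axis=0)

print(np.allclose(forest_preds, manual_average))
```

Averaging over many trees tends to smooth out the mistakes of any single tree, which is why a forest often works reasonably well without tuning.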

Lesson 11: Exercise: Random Forests


