Learn专题二——机器学习简介
2019-05-28 本文已影响0人
Python解决方案
第一课《模型如何工作》
1.以放假预测为例。在收集了某市放假的历史数据及不同房源信息后,对该市的未来房价做出预测。首先想到的是最简单的方法决策树。

2.遵循的大致流程:
- Loading Raw Data
- Obeserving Data
- Preprocessing Data
- Splitting Data
- Fitting /Training
- Predict
- Validating
- Evaluating
- Modifing and Tuning
- Retraining
第二课《简单的数据探索》
1.通过pandas对数据进行观察,熟悉数据
import pandas as pd
2.pandas最重要的数据类型Dataframes,还有一个Series
# save filepath to variable for easier access
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path)
# print a summary of the data in Melbourne data
melbourne_data.describe()
Rooms Price Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt Lattitude Longtitude Propertycount
count 13580.000000 1.358000e+04 13580.000000 13580.000000 13580.000000 13580.000000 13518.000000 13580.000000 7130.000000 8205.000000 13580.000000 13580.000000 13580.000000
mean 2.937997 1.075684e+06 10.137776 3105.301915 2.914728 1.534242 1.610075 558.416127 151.967650 1964.684217 -37.809203 144.995216 7454.417378
std 0.955748 6.393107e+05 5.868725 90.676964 0.965921 0.691712 0.962634 3990.669241 541.014538 37.273762 0.079260 0.103916 4378.581772
min 1.000000 8.500000e+04 0.000000 3000.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1196.000000 -38.182550 144.431810 249.000000
25% 2.000000 6.500000e+05 6.100000 3044.000000 2.000000 1.000000 1.000000 177.000000 93.000000 1940.000000 -37.856822 144.929600 4380.000000
50% 3.000000 9.030000e+05 9.200000 3084.000000 3.000000 1.000000 2.000000 440.000000 126.000000 1970.000000 -37.802355 145.000100 6555.000000
75% 3.000000 1.330000e+06 13.000000 3148.000000 3.000000 2.000000 2.000000 651.000000 174.000000 1999.000000 -37.756400 145.058305 10331.000000
max 10.000000 9.000000e+06 48.100000 3977.000000 20.000000 8.000000 10.000000 433014.000000 44515.000000 2018.000000 -37.408530 145.526350 21650.000000
第三课《探索自己的数据》
本课目的提高你读取数据文件及理解数据统计的能力
第四课《第一个机器学习模型》
对数据进行预处理
import pandas as pd
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data.columns
处理缺失数据
# dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)
选择部分数据的方法
1)Dot notation, which we use to select the "prediction target"
2)Selecting with a column list, which we use to select the "features"
选择预测目标
y = melbourne_data.Price
选择特征
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
观察training data
X.describe()
X.head()
建立模型
The steps to building and using a model are:
- Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
- Fit: Capture patterns from provided data. This is the heart of modeling.
- Predict: Just what it sounds like.
- Evaluate: Determine how accurate the model's predictions are.
一个决策树的例子
from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)
# Fit model
melbourne_model.fit(X, y)
random_state ensures you get the same results in each run.
模型建立后,先用训练数据的前五个来测试一下
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))
结果如下
Making predictions for the following 5 houses:
Rooms Bathroom Landsize Lattitude Longtitude
1 2 1.0 156.0 -37.8079 144.9934
2 3 2.0 134.0 -37.8093 144.9944
4 4 1.0 120.0 -37.8072 144.9941
6 3 2.0 245.0 -37.8024 144.9993
7 2 1.0 256.0 -37.8060 144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]
第五课《练习:第一个机器学习模型》
# Code you have previously used to load data
import pandas as pd
# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
home_data = pd.read_csv(iowa_file_path)
# print the list of columns in the dataset to find the name of the prediction target
home_data.columns
y=home_data.SalePrice
# Create the list of features below
feature_names = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd']
# select data corresponding to features in feature_names
X = home_data[feature_names]
# Review data
# print description or statistics from X
print(X.describe())
# print the top few lines
print(X.head())
from sklearn.tree import DecisionTreeRegressor
#specify the model.
#For model reproducibility, set a numeric value for random_state when specifying the model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit the model
iowa_model.fit(X,y)
predictions = iowa_model.predict(X.head())
print(predictions)
step_4.check()
第六课《模型验证》
There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE).
from sklearn.metrics import mean_absolute_error
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
先把数据进行分割,得到训练数据、验证数据
from sklearn.model_selection import train_test_split
# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)
# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
第七课《练习:模型验证》
略
第八课《过拟合与欠拟合》

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
model.fit(train_X, train_y)
preds_val = model.predict(val_X)
mae = mean_absolute_error(val_y, preds_val)
return(mae)
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, my_mae))
输出结果
Max leaf nodes: 5 Mean Absolute Error: 347380
Max leaf nodes: 50 Mean Absolute Error: 258171
Max leaf nodes: 500 Mean Absolute Error: 243495
Max leaf nodes: 5000 Mean Absolute Error: 254983
第九课《练习:过拟合与欠拟合》
略
第十课《随机森林》
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
第十一课《练习:随机森林》
略