Chapter 2 End-to-End Machine Learning Project

2018-07-20  闫_锋

Main Steps:

  1. Look at the big picture.
  1. Select a Performance Measure

RMSE is more sensitive to outliers than MAE (squaring amplifies large errors). But when outliers are exponentially rare (as in a bell-shaped distribution), RMSE performs very well and is generally preferred.
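As a quick, hypothetical illustration of that sensitivity (the numbers below are made up):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([200000, 250000, 300000, 350000])
y_pred = np.array([210000, 240000, 310000, 500000])  # the last prediction is a large outlier error

mae = mean_absolute_error(y_true, y_pred)             # 45000.0 -- every error weighted equally
rmse = np.sqrt(mean_squared_error(y_true, y_pred))    # ~75498  -- dominated by the single outlier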

  1. Check the Assumptions
  1. Get the Data

Automating Data Collection:

import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # create the local directory, download the tarball, and extract housing.csv into it
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

Loading Data:

import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
  1. Take a Quick Look at the Data Structure
    head()
    info()
    value_counts()
    describe()
    hist()
housing = load_housing_data()

%matplotlib inline  # only in a Jupyter notebook
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()

->
Working with preprocessed attributes is common in Machine Learning, and it is not necessarily a problem, but you should try to understand how the data was computed.
(While reading the histograms, think about what each attribute means and what range of values it covers.)

->
Feature Scaling: the attributes have very different scales, so feature scaling will be needed later.
->
Tail Heavy: many histograms are tail-heavy (they extend much farther to the right of the median than to the left), which some algorithms find harder to learn from; these attributes may need transforming later.

  1. Create a Test Set
    Data Snooping bias (looking at the test data too closely while creating the test set, which biases later modeling choices)

If you use the row index as a unique identifier, new data must only be appended to the end of the dataset, and no row may ever be deleted.
Otherwise, use the most stable features to build a unique identifier (e.g., longitude and latitude).

housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

Stratified sampling (preserve the proportion of each stratum of the dataset, so that the test set is representative of the whole population)

housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

housing["income_cat"].value_counts() / len(housing)
  1. Discover and Visualize the Data to Gain Insights
corr_matrix = housing.corr()

>>>corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value 1.000000
median_income 0.687170
total_rooms 0.135231
housing_median_age 0.114220
households 0.064702
total_bedrooms 0.047865
population -0.026699
longitude -0.047279
latitude -0.142826
Name: median_house_value, dtype: float64

scatter_matrix

from pandas.plotting import scatter_matrix  # pandas.tools.plotting in very old pandas versions

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
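The most promising attribute is median_income, so it is worth zooming in on its scatterplot against the target, for example:

housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)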

Attribute combinations:

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

>>> corr_matrix = housing.corr()
>>> corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value 1.000000
median_income 0.687170
rooms_per_household 0.199343
total_rooms 0.135231
housing_median_age 0.114220
households 0.064702
total_bedrooms 0.047865
population_per_household -0.021984
population -0.026699
longitude -0.047279
latitude -0.142826
bedrooms_per_room -0.260070
Name: median_house_value, dtype: float64
  1. Prepare the Data for Machine Learning Algorithms
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
from sklearn.preprocessing import Imputer  # renamed to sklearn.impute.SimpleImputer in Scikit-Learn 0.20+

imputer = Imputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)  # the median can only be computed on numerical attributes

imputer.fit(housing_num)

>>> imputer.statistics_
array([ -118.51 , 34.26 , 29. , 2119. , 433. , 1164. , 408. , 3.5414])
>>> housing_num.median().values
array([ -118.51 , 34.26 , 29. , 2119. , 433. , 1164. , 408. , 3.5414])


X = imputer.transform(housing_num)

housing_tr = pd.DataFrame(X, columns=housing_num.columns)

Scikit-Learn Design: a small, consistent API built around estimators (fit()), transformers (transform(), fit_transform()) and predictors (predict()).

Text and Categorical Attributes

>>> from sklearn.preprocessing import LabelEncoder
>>> encoder = LabelEncoder()
>>> housing_cat = housing["ocean_proximity"]
>>> housing_cat_encoded = encoder.fit_transform(housing_cat)
>>> housing_cat_encoded
array([1, 1, 4, ..., 1, 0, 3])

OneHotEncoder

>>> from sklearn.preprocessing import OneHotEncoder
>>> encoder = OneHotEncoder()
>>> housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
>>> housing_cat_1hot
<16513x5 sparse matrix of type '<class 'numpy.float64'>'
with 16513 stored elements in Compressed Sparse Row format>

Both transformations in one shot (text categories straight to one-hot vectors):

>>> from sklearn.preprocessing import LabelBinarizer
>>> encoder = LabelBinarizer()
>>> housing_cat_1hot = encoder.fit_transform(housing_cat)
>>> housing_cat_1hot
array([[0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1],
       ...,
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0]])
  1. Custom Transformers
    Attribute combinations (a sketch of such a transformer is shown below)
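The pipelines later in these notes use a CombinedAttributesAdder transformer that is never defined here; a minimal sketch of what it could look like, assuming the column order of the numerical housing attributes:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# column indices of the attributes being combined (assumed order of housing_num)
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X, y=None):
        # append the combined attributes as extra columns of the NumPy array
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]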

  2. Feature Scaling
    Scaling the target values is generally not required

min-max scaling: MinMaxScaler (rescales values to a 0-1 range)
standardization: StandardScaler (no specific range; much less affected by outliers)

Fit the scalers to the training data only, not to the full dataset (which includes the test set). Only then use them to transform the training set, the test set, and any new data.
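A minimal sketch of this fit-on-training-only rule with StandardScaler (variable names are illustrative):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from the training set only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics on the test set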

  1. Transformation Pipelines
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)

FeatureUnion

from sklearn.pipeline import FeatureUnion

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    # note: recent Scikit-Learn versions do not allow LabelBinarizer inside a Pipeline;
    # a custom wrapper or OneHotEncoder is needed there
    ('label_binarizer', LabelBinarizer()),
])
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
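The two sub-pipelines above also rely on a custom DataFrameSelector transformer that is not defined in these notes; a minimal sketch (it simply selects the named columns of the DataFrame and returns them as a NumPy array):

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values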

>>> housing_prepared = full_pipeline.fit_transform(housing)
>>> housing_prepared
array([[ 0.73225807, -0.67331551, 0.58426443, ..., 0. ,
0. , 0. ],
[-0.99102923, 1.63234656, -0.92655887, ..., 0. ,
0. , 0. ],
[...]
>>> housing_prepared.shape
(16513, 17)
  1. Select and Train a Model
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

tree_reg = DecisionTreeRegressor()
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)  # the scores are negative MSE, so negate them before the square root
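The display_scores helper used below is not defined in these notes; a minimal sketch matching the printed output:

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())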

Ensemble Learning

>>> from sklearn.ensemble import RandomForestRegressor
>>> forest_reg = RandomForestRegressor()
>>> forest_reg.fit(housing_prepared, housing_labels)
>>> [...]
>>> forest_rmse
22542.396440343684
>>> display_scores(forest_rmse_scores)
Scores: [ 53789.2879722 50256.19806622 52521.55342602 53237.44937943
52428.82176158 55854.61222549 52158.02291609 50093.66125649
53240.80406125 52761.50852822]
Mean: 52634.1919593
Standard deviation: 1576.20472269
  1. Fine-Tune Your Model

Grid search over the RandomForestRegressor hyperparameters with GridSearchCV (or RandomizedSearchCV when the search space is large).
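A minimal sketch of such a grid search (the parameter grid values are illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

grid_search.best_params_      # best hyperparameter combination found
grid_search.best_estimator_   # the corresponding fitted model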

  1. Evaluate Your System on the Test Set
    The performance will usually be slightly worse than what you measured using cross-validation if you did a lot of hyperparameter tuning (because your system ends up fine-tuned to perform well on the validation data, and will likely not perform as well on unknown data).
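A minimal sketch of the final evaluation, assuming the tuned model and pipeline from the steps above (names are illustrative):

from sklearn.metrics import mean_squared_error

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)  # transform only; never fit on the test set
final_predictions = final_model.predict(X_test_prepared)

final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))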