About OrdinalEncoder and OneHotEncoder

2019-04-29  SeekerLinJunYu

OrdinalEncoder, OneHotEncoder, and get_dummies can all convert discrete categorical features into features represented by numbers, but the three do not behave identically.

Does not expand the number of features

OrdinalEncoder (its usage and effect are essentially the same as LabelEncoder's, so LabelEncoder is not covered separately here; the practical difference is that LabelEncoder is meant for a 1-D array of labels, while OrdinalEncoder encodes a 2-D array of features column by column)
In: from sklearn import preprocessing
In: MSSubClass_data = train_df.MSSubClass.astype(str)    # converting the feature to string first is a required step before anything else
In: label_encoder = preprocessing.LabelEncoder()
In: MSSubClass_data_encoded = label_encoder.fit_transform(MSSubClass_data)
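For completeness, a minimal sketch of the same step with OrdinalEncoder itself, continuing from the snippet above (reusing MSSubClass_data); unlike LabelEncoder, which takes a 1-D label array, OrdinalEncoder expects a 2-D feature array, hence the reshape:
In: ordinal_encoder = preprocessing.OrdinalEncoder()
In: MSSubClass_ordinal = ordinal_encoder.fit_transform(MSSubClass_data.values.reshape(-1, 1))   # 2-D input required
In: ordinal_encoder.categories_    # the learned category-to-integer mapping for each column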

Expands the number of features

OneHotEncoder
>>> genders = ['female', 'male']
>>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
>>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
>>> enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # feature
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) 
OneHotEncoder(categorical_features=None,
       categories=[...],
       dtype=<... 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
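To map that 10-column row back to the original categories, a small sketch continuing the example above (OneHotEncoder also provides inverse_transform):
>>> enc.inverse_transform([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])   # recovers [['female', 'from Asia', 'uses Chrome']]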
The most basic usage:
In: enc = preprocessing.OneHotEncoder()
In: result = enc.fit_transform(MSSubClass_data.values.reshape(-1,1))
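Note that by default fit_transform returns a SciPy sparse matrix rather than a modified DataFrame; a small sketch, continuing from the snippet above, of how the result can be inspected:
In: result                      # SciPy sparse matrix, one 0/1 column per category
In: result.toarray()[:5]        # densify a few rows to look at them
In: enc.categories_             # which original category each new column stands for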
get_dummies
The most basic usage:
In: all_df = pd.get_dummies(all_df, columns=['MSSubClass'], prefix='MSSubClass')    # replaces the MSSubClass column with its dummy columns
[Figures: the MSSubClass column before and after the transformation]
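Because the original screenshots are not reproduced here, a tiny self-contained sketch (with made-up values) shows the expansion that the idiomatic call above performs:
In: demo = pd.DataFrame({'MSSubClass': ['20', '60', '20', '120']})   # made-up toy data
In: pd.get_dummies(demo, columns=['MSSubClass'], prefix='MSSubClass')
The single MSSubClass column is replaced by one indicator column per category (MSSubClass_120, MSSubClass_20, MSSubClass_60), which is the before/after expansion the screenshots showed.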

How can these Encoder methods be used to modify the original dataset?

While writing this article, the question that troubled me most was how to use the Encoder interfaces to make targeted changes to the original dataset. I later found a piece of code on the Scikit-Learn website that answers this fairly well, so it is reproduced below:

from __future__ import print_function

import pandas as pd
import numpy as np


from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV


np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(                  # this step applies separate transforms to the numeric and the categorical columns
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))</pre>

Source code: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py
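If the goal is only to obtain the transformed dataset rather than to train a model, the ColumnTransformer can also be used on its own; a small sketch based on the variables defined in the code above:

# Fit and apply only the preprocessing step to inspect the transformed matrix.
X_train_transformed = preprocessor.fit_transform(X_train)
print(X_train_transformed.shape)   # rows x (2 scaled numeric columns + one-hot columns)

# After clf.fit(...), the fitted transformer is also reachable inside the pipeline:
fitted_preprocessor = clf.named_steps['preprocessor']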
