Feature Engineering

2019-10-13  陈文瑜

Data preprocessing methods

Dimensionality reduction module: decomposition
Data preprocessing module: preprocessing
Missing-value imputation module: impute
Feature selection module: feature_selection

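Making data dimensionless (feature scaling)

Normalization (min-max scaling) first centers the data by subtracting the minimum and then rescales it by the range (max - min). MinMaxScaler's feature_range parameter defaults to (0, 1), so every feature is mapped onto the interval [0, 1].
Because the minimum and maximum define the scale, the result is extremely sensitive to outliers.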

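from sklearn.preprocessing import MinMaxScaler
import pandas as pd
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
# Normalize each column into [0, 1]
# (feature_range must be a tuple; recent scikit-learn versions reject a list)
scaler = MinMaxScaler(feature_range=(0, 1))
result = scaler.fit_transform(data)
# Recover the original data
scaler.inverse_transform(result)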
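
To make the scaling rule concrete, here is a minimal NumPy sketch of the min-max formula x' = (x - min) / (max - min), applied column by column to the same data. The helper name min_max_scale is made up for illustration and is not part of scikit-learn. The second call illustrates the outlier sensitivity mentioned above.

```python
import numpy as np

def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Linearly rescale each column of X into feature_range (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    low, high = feature_range
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Assumes max > min in every column; a constant column would divide by zero.
    unit = (X - col_min) / (col_max - col_min)
    return unit * (high - low) + low

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaled = min_max_scale(data)
print(scaled)

# A single extreme value squeezes all the other points toward the lower bound.
with_outlier = min_max_scale([[1.0], [2.0], [3.0], [4.0], [1000.0]])
print(with_outlier.ravel())
```

In the second example the four ordinary values all end up below 0.01, which is why min-max scaling is usually avoided when a feature contains extreme values.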

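After standardization (z-score scaling), each feature has mean 0 and variance 1. Standardization is a linear transformation, so it does not change the shape of the distribution: the result is standard normal only if the original feature was already normally distributed.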

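from sklearn.preprocessing import StandardScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
# Standardize the data
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
x_std = scaler.fit_transform(data)
# After scaling, each column has mean 0 and standard deviation 1
x_std.mean(axis=0)
x_std.std(axis=0)
# Per-column mean and variance of the original data, learned during fit
scaler.mean_
scaler.var_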
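
The same computation can be sketched by hand with NumPy: z = (x - mean) / std, using the population standard deviation (ddof=0), which is what StandardScaler uses. The second half is an illustrative check, on a small made-up skewed sample, that z-scoring leaves the skewness of a feature unchanged.

```python
import numpy as np

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

col_mean = data.mean(axis=0)
col_var = data.var(axis=0)  # population variance (ddof=0)
z = (data - col_mean) / np.sqrt(col_var)

print(col_mean, col_var)              # per-column mean and variance of the input
print(z.mean(axis=0), z.std(axis=0))  # approximately 0 and exactly 1 per column

def skewness(x):
    """Third standardized moment; 0 for a symmetric distribution."""
    centered = (x - x.mean()) / x.std()
    return (centered ** 3).mean()

# Made-up right-skewed sample: standardizing it does not remove the skew.
skewed = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 10.0])
skewed_z = (skewed - skewed.mean()) / skewed.std()
print(skewness(skewed), skewness(skewed_z))
```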

Handling missing values

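Parameters of SimpleImputer:
missing_values: the placeholder that marks a missing value (default np.nan)
strategy: mean, median, most_frequent, or constant (default mean)
fill_value: the replacement value used when strategy is constant
copy: whether to work on a copy of the data (default True)
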
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.read_csv(r"./train.csv", index_col=0)
# Inspect the data
data.head()
data.info()

# Extract each column as a 2-D array, then fill its missing values
Age = data.loc[:, "Age"].values.reshape(-1, 1)
imp_median = SimpleImputer(strategy="median")  # fill with the median
data.loc[:, "Age"] = imp_median.fit_transform(Age).ravel()

Embarked = data.loc[:, "Embarked"].values.reshape(-1, 1)
imp_most = SimpleImputer(strategy="most_frequent")  # fill with the mode
data.loc[:, "Embarked"] = imp_most.fit_transform(Embarked).ravel()

# Confirm that Age and Embarked no longer contain missing values
data.info()
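
Since train.csv is not included here, the following sketch runs the same strategies on small made-up arrays so the effect of each one is visible. The values and the "Unknown" fill value are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical stand-ins for three columns of the dataset.
age = np.array([[22.0], [np.nan], [30.0], [40.0]])
embarked = np.array([["S"], [np.nan], ["S"], ["C"]], dtype=object)
cabin = np.array([["C85"], [np.nan], [np.nan], ["E46"]], dtype=object)

# median: the middle of the observed values 22, 30, 40
age_filled = SimpleImputer(strategy="median").fit_transform(age)

# most_frequent: the mode, which also works for string columns
embarked_filled = SimpleImputer(strategy="most_frequent").fit_transform(embarked)

# constant: every missing entry is replaced by fill_value
cabin_filled = SimpleImputer(strategy="constant", fill_value="Unknown").fit_transform(cabin)

print(age_filled.ravel())
print(embarked_filled.ravel())
print(cabin_filled.ravel())
```

SimpleImputer expects 2-D input and returns a 2-D array, which is why the article reshapes each column with reshape(-1, 1) before fitting.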

Encoding features and labels as numbers

from sklearn.preprocessing import LabelEncoder
y = data.iloc[:, -1]  # treat the last column as the label
le = LabelEncoder()
data.iloc[:, -1] = le.fit_transform(y)  # encode the labels as integers
le.classes_  # inspect the distinct classes
data.head(10)
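
As a small self-contained illustration of what LabelEncoder does, the sketch below encodes a made-up list of labels. The classes are sorted, each label is replaced by its position in that sorted list, and inverse_transform maps the integers back.

```python
from sklearn.preprocessing import LabelEncoder

ports = ["S", "C", "Q", "S", "C"]  # hypothetical label values

le = LabelEncoder()
codes = le.fit_transform(ports)

print(le.classes_)  # the sorted distinct labels
print(codes)        # each label replaced by its index in classes_

restored = le.inverse_transform(codes)
print(restored)
```

LabelEncoder takes 1-D input because it is meant for the target column, not for feature matrices.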

# For features, use preprocessing.OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
data_ = data.copy()
# Column 3 is Sex; OrdinalEncoder expects 2-D input, hence the 3:4 slice
OrdinalEncoder().fit(data_.iloc[:, 3:4]).categories_
data_.iloc[:, 3:4] = OrdinalEncoder().fit_transform(data_.iloc[:, 3:4])
data_.head()
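
A minimal sketch of OrdinalEncoder on a made-up two-column feature matrix follows. Unlike LabelEncoder it takes 2-D input and learns one sorted category list per column.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feature matrix: one row per sample, columns are Sex and Embarked.
X = np.array(
    [["male", "S"], ["female", "C"], ["female", "Q"], ["male", "S"]],
    dtype=object,
)

enc = OrdinalEncoder()
codes = enc.fit_transform(X)

print(enc.categories_)  # one array of sorted categories per column
print(codes)            # each value replaced by its index within its column

restored = enc.inverse_transform(codes)
```

The resulting integers imply an ordering (0 < 1 < 2), so this encoding fits best when the categories are genuinely ordered or when the downstream model does not rely on distances between codes.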
from sklearn.preprocessing import OneHotEncoder
X = data.iloc[:, 3:4]
enc = OneHotEncoder(categories='auto').fit(X)

result = enc.transform(X).toarray()
# Take a look at the result
pd.DataFrame(result)
enc.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

# Reuse data's index so that concat aligns the rows correctly
newdata = pd.concat([data, pd.DataFrame(result, index=data.index)], axis=1)
newdata.drop(["Sex"], axis=1, inplace=True)
newdata.columns = ["Survived", "Pclass", "Name", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked", "x0_female", "x0_male"]

from sklearn.preprocessing import Binarizer
data_2 = data.copy()
X = data_2.iloc[:, 4].values.reshape(-1, 1)  # column 4 is Age
# Values greater than the threshold become 1, all others become 0
transformer = Binarizer(threshold=30).fit_transform(X)
transformer
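
To spell out the thresholding rule, this short sketch applies Binarizer to a few made-up ages and compares the result with the equivalent NumPy expression. A value exactly equal to the threshold maps to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

ages = np.array([[22.0], [30.0], [30.5], [54.0]])  # hypothetical values

flags = Binarizer(threshold=30).fit_transform(ages)

# Equivalent rule written by hand: strictly greater than the threshold gives 1.
manual = (ages > 30).astype(float)

print(flags.ravel())
print(np.array_equal(flags, manual))
```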

Parameters of KBinsDiscretizer: n_bins, encode, strategy

from sklearn.preprocessing import KBinsDiscretizer
X = data.iloc[:, 4].values.reshape(-1, 1)

est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
est.fit_transform(X)
# Check which bin codes were produced
set(est.fit_transform(X).ravel())
# {0.0, 1.0, 2.0}

est = KBinsDiscretizer(n_bins=3, encode='onehot', strategy='uniform')
est.fit_transform(X).toarray()
# The remaining steps are the same as above
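
As a rough illustration of the uniform strategy, the sketch below bins a few made-up ages into three equal-width intervals and prints the learned bin edges. With values spanning 0 to 90, the edges fall at 0, 30, 60, and 90.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[0.0], [10.0], [29.0], [30.0], [45.0], [60.0], [90.0]])  # hypothetical

est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
codes = est.fit_transform(ages)

print(est.bin_edges_[0])  # equal-width edges between the minimum and maximum
print(codes.ravel())      # bin index of every value

# encode="onehot" returns a sparse matrix with one column per bin.
onehot = KBinsDiscretizer(n_bins=3, encode="onehot", strategy="uniform").fit_transform(ages).toarray()
print(onehot.shape)
```

A value that sits exactly on an inner edge (30 or 60 here) is assigned to the bin on its right, and the maximum value goes into the last bin.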