
Feature Engineering

2018-10-27  c6ad47dbfc82

The quality of the data and features determines the upper bound of machine learning; models and algorithms merely approximate that bound.

The Four Steps of Feature Engineering

Data Cleaning


Data sampling

Outliers (and missing-value handling)

Below we simulate some data to walk through common missing-value and outlier handling.

import pandas as pd

df = pd.DataFrame({"A":["a0","a1","a1","a2","a3","a4"],"B":["b0","b1","b2","b2","b3",None],
                 "C":[1,2,None,3,4,5],"D":[0.1,10.2,11.4,8.9,9.1,12],"E":[10,19,32,25,8,None],
                 "F":["f0","f1","g2","f3","f4","f5"]})

df

# boolean mask marking the missing values
df.isnull()

# drop every row that contains a missing value
df.dropna()

# drop only the rows where a specific column (B) is null
df.dropna(subset=["B"])

# flag rows whose value in A repeats an earlier row
df.duplicated(["A"])

# with two columns, a row is flagged only when the (A, B) pair repeats
df.duplicated(["A","B"])

# drop rows duplicated in A, keeping the first occurrence
df.drop_duplicates(["A"],keep="first")

# fill every missing value in the DataFrame with the mean of column E
df.fillna(df["E"].mean())

# fill the missing values in E by linear interpolation
df["E"].interpolate()

# the None is interpolated to 2.5, the midpoint of its neighbors 1 and 4
pd.Series([1,None,4,5,20]).interpolate()


# remove outliers using the quartile (IQR) rule
upper_q = df["D"].quantile(0.75)
lower_q = df["D"].quantile(0.25)
q_int = upper_q - lower_q
k = 1.5

df[(df["D"]>lower_q-q_int*k) & (df["D"]<upper_q+q_int*k)]

# keep rows whose F value starts with "f"
df[df["F"].str.startswith("f")]

Feature Preprocessing

  • Feature selection
  • Feature transformation
    exponential/log transforms, discretization, smoothing, normalization (standardization), numericalization, regularization
  • Feature dimensionality reduction
  • Feature derivation
Feature Selection

Feature selection is a common approach to data reduction.
import numpy as np
import pandas as pd
import scipy.stats as ss

df = pd.DataFrame({"A":ss.norm.rvs(size=10),"B":ss.norm.rvs(size=10),"C":ss.norm.rvs(size=10)
                  ,"D":np.random.randint(low=0,high=2,size=10)})

df

from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
X = df.loc[:,["A","B","C"]]
y = df.loc[:,"D"]

from sklearn.feature_selection import SelectKBest,RFE,SelectFromModel

# filter method: score each feature independently and keep the best k
skb = SelectKBest(k=2)
skb.fit(X,y)
skb.transform(X)

# wrapper method: recursively eliminate features using a linear SVR
rfe = RFE(estimator=SVR(kernel="linear"),n_features_to_select=2,step=1)
rfe.fit_transform(X,y)

# embedded method: keep features whose learned importance exceeds the threshold
sfm = SelectFromModel(estimator=DecisionTreeRegressor(),threshold=0.1)
sfm.fit_transform(X,y)
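The three selectors above correspond to the filter (SelectKBest), wrapper (RFE), and embedded (SelectFromModel) approaches. Each fitted selector exposes `get_support()`, which reports exactly which columns were kept — a minimal sketch (the fixed `y` here is an assumption added for reproducibility):

```python
import numpy as np
import pandas as pd
import scipy.stats as ss
from sklearn.feature_selection import SelectKBest

np.random.seed(0)
df = pd.DataFrame({"A": ss.norm.rvs(size=10),
                   "B": ss.norm.rvs(size=10),
                   "C": ss.norm.rvs(size=10)})
y = pd.Series([0, 1] * 5)          # fixed binary target for reproducibility
X = df[["A", "B", "C"]]

skb = SelectKBest(k=2).fit(X, y)
mask = skb.get_support()           # boolean mask over the input columns
kept = X.columns[mask].tolist()    # names of the two retained features
```

The same `get_support()` call works on a fitted RFE or SelectFromModel instance.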

Feature Transformation

lst = [6,8,10,15,16,24,25,40,67]

# equal-depth (equal-frequency) binning
pd.qcut(lst,q=3)

pd.qcut(lst,q=3,labels=["low","medium","high"])

# equal-width binning
pd.cut(lst,bins=3)

pd.cut(lst,bins=3,labels=["low","medium","high"])

from sklearn.preprocessing import MinMaxScaler,StandardScaler

# rescale values linearly into [0, 1]
MinMaxScaler().fit_transform(np.array([1,4,10,20,30]).reshape(-1,1))

# standardize to zero mean and unit variance
StandardScaler().fit_transform(np.array([1,1,1,1,0,0,0,0]).reshape(-1,1))

StandardScaler().fit_transform(np.array([1,0,0,0,0,0,0,0]).reshape(-1,1))
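MinMaxScaler and StandardScaler rescale each feature (column); the regularization (正规化) listed earlier instead rescales each sample (row) to unit norm. A minimal sketch with sklearn's Normalizer:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1.0, 1.0],
              [3.0, 4.0]])
X_l1 = Normalizer(norm="l1").fit_transform(X)  # each row sums to 1
X_l2 = Normalizer(norm="l2").fit_transform(X)  # each row gets Euclidean length 1
```

For example, the row [3, 4] has l2 norm 5, so it becomes [0.6, 0.8].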

from sklearn.preprocessing import LabelEncoder,OneHotEncoder

# "high" maps to 0 because LabelEncoder sorts the class labels alphabetically
LabelEncoder().fit_transform(np.array(["low","medium","high","low"]))

LabelEncoder().fit_transform(np.array(["up","down","down"]))

le = LabelEncoder()
lb_tran_f = le.fit_transform(np.array(["Red","Yellow","Blue","Green"]))

# one-hot encode the label-encoded values (older sklearn versions require numeric input)
oht_encoder = OneHotEncoder().fit(lb_tran_f.reshape(-1,1))

oht_encoder.transform(LabelEncoder().fit_transform(np.array(["Red","Yellow","Blue","Green"])).reshape(-1,1)).toarray()
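As a side note, sklearn 0.20 and later let OneHotEncoder consume string categories directly, so the LabelEncoder detour above is no longer required; pandas' get_dummies does the same in one call. A minimal sketch:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = np.array(["Red", "Yellow", "Blue", "Green"]).reshape(-1, 1)
# strings in, one-hot matrix out -- no LabelEncoder step needed
onehot = OneHotEncoder().fit_transform(colors).toarray()

# the pandas equivalent
dummies = pd.get_dummies(pd.Series(["Red", "Yellow", "Blue", "Green"]))
```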
Feature Derivation

Feature derivation builds new, reasonable and effective features on top of the existing ones.
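For instance, a ratio of two existing features can itself become a feature. A minimal sketch using two columns from an HR-style table (the derived column name `hours_per_project` is a hypothetical illustration, not part of the original dataset):

```python
import pandas as pd

# derive a workload ratio from two existing features
df = pd.DataFrame({"average_monthly_hours": [160, 220, 310],
                   "number_project": [2, 4, 5]})
df["hours_per_project"] = df["average_monthly_hours"] / df["number_project"]
```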

Feature preprocessing for the HR dataset

import pandas as pd
from sklearn.preprocessing import MinMaxScaler,StandardScaler,OneHotEncoder,LabelEncoder
from sklearn.decomposition import PCA

# map salary levels to ordinal integers; unknown values fall back to 0
d = dict([("low",0),("medium",1),("high",2)])
def map_salary(s):
    return d.get(s,0)

# sl: satisfaction_level --- False: MinMaxScaler; True: StandardScaler
# le: last_evaluation --- False: MinMaxScaler; True: StandardScaler
def hr_preprocessing(sl=False,le=False,npr=False,amh=False,tsc=False,wa=False,pl5=False,dp=False,slr=False,lower_d=False,ld_n=1):
    df = pd.read_csv("./data/HR.csv")

    # clean the data
    df = df.dropna(subset=["satisfaction_level","last_evaluation"])
    df = df[(df["satisfaction_level"]<=1) & (df["salary"]!="nme")]
    
    # pull out the label column
    label = df["left"]
    df = df.drop("left",axis=1)
    
    # feature selection
    # feature processing: mainly scaling/standardization plus one-hot and label encoding
    scaler_list = [sl,le,npr,amh,tsc,wa,pl5]
    column_list = ["satisfaction_level","last_evaluation","number_project",
                  "average_monthly_hours","time_spend_company","Work_accident",
                  "promotion_last_5years"]
    for i in range(len(scaler_list)):
        if not scaler_list[i]:
            df[column_list[i]] = MinMaxScaler().fit_transform(df[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]  
        else:
            df[column_list[i]] = StandardScaler().fit_transform(df[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
            
    scaler_list = [slr,dp]
    column_list = ["salary","department"]
    for i in range(len(scaler_list)):
        if not scaler_list[i]:
            if column_list[i] == "salary":
                df[column_list[i]] = [map_salary(s) for s in df["salary"].values]
            else:
                df[column_list[i]] = LabelEncoder().fit_transform(df[column_list[i]].values)
            df[column_list[i]] = MinMaxScaler().fit_transform(df[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
        else:
            # applying sklearn's OneHotEncoder to a DataFrame directly is clumsy, so pandas' get_dummies is used instead
            df = pd.get_dummies(df,columns=[column_list[i]])
            
    # dimensionality reduction (return the label too, so it is not lost)
    if lower_d:
        return PCA(n_components=ld_n).fit_transform(df.values),label
    return df,label