Note: always be clear about what is being normalized: the features or the samples.
Standardization: mean removal and variance scaling
Note: the test set must go through the same preprocessing as the training set (standardization, data transformation, etc.).
from sklearn import preprocessing
def scale(X,axis=0,with_mean=True,with_std=True,copy=True)
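Note: many scikit-learn estimators assume that all features are centered around zero and have variance in the same order. By default these operations (such as mean removal) are applied to features, i.e. along axis=0, so take care if your data is laid out differently! The formula is (X - X_mean) / X_std, computed separately for each attribute (column).
X: {array-like, sparse matrix}. An array or matrix; one-dimensional data used to be accepted as well (but since version 0.19 one-dimensional data raises an error!).
axis: int, default 0. The axis along which the means and standard deviations are computed. If 0, each feature (column) is standardized independently; if 1, each sample (row) is standardized.
with_mean: boolean, default True. Centers the data so that the mean is 0.
with_std: boolean, default True. Scales the data to unit variance (variance 1).
This standardization is equivalent to z-score standardization (zero-mean normalization).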
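To make the role of axis concrete, the formula (X - X_mean) / X_std can be checked by hand against preprocessing.scale. This is only a small sketch; the variable names are made up for illustration, and it relies on np.std using the population standard deviation (ddof=0) by default.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# axis=0: z-score each feature (column) separately
manual_cols = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_cols = preprocessing.scale(X, axis=0)

# axis=1: z-score each sample (row) separately
manual_rows = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
scaled_rows = preprocessing.scale(X, axis=1)

print(np.allclose(manual_cols, scaled_cols))
print(np.allclose(manual_rows, scaled_rows))
```

With axis=1 it is the rows, not the columns, that end up with zero mean, which is why it matters to know whether features or samples are being scaled.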
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
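# Workaround for 1-D data: wrap each value p in its own row to form a column, scale it, then flatten back to 1-D
cn = preprocessing.scale([[p] for _, _, p in cn]).reshape(-1)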
The scaled data has zero mean and unit variance (variance 1):
>>> X_scaled.mean(axis=0)
array([ 0., 0., 0.])
>>> X_scaled.std(axis=0)
array([ 1., 1., 1.])
The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set. This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline:
>>> scaler = preprocessing.StandardScaler().fit(X)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_
array([ 1. ..., 0. ..., 0.33...])
>>> scaler.scale_
array([ 0.81..., 0.81..., 1.24...])
>>> scaler.transform(X)
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
The scaler instance can then be used on new data to transform it the same way it did on the training set:
>>> scaler.transform([[-1., 1., 0.]])
array([[-2.44..., 1.22..., -0.26...]])
It is possible to disable either centering or scaling by passing with_mean=False or with_std=False to the constructor of StandardScaler. [StandardScaler]
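As a quick illustration of those two switches, the sketch below compares each variant with the corresponding manual computation. The variable names are chosen here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# with_mean=False: no centering, each column is only divided by its standard deviation
scale_only = StandardScaler(with_mean=False).fit_transform(X)

# with_std=False: only the column means are subtracted, no rescaling
center_only = StandardScaler(with_std=False).fit_transform(X)

print(np.allclose(scale_only, X / X.std(axis=0)))
print(np.allclose(center_only, X - X.mean(axis=0)))
```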
[Standardization, or mean removal and variance scaling]
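import os
import numpy as np
from sklearn import preprocessing

# DIR, train_file, test_file, train_file1 and test_file1 are defined elsewhere in the script.
def preprocess():
    # `or 0` appears to be a manual switch: set it to 1 to force the scaled files to be rebuilt
    if not os.path.exists(os.path.join(DIR, train_file1)) or not os.path.exists(os.path.join(DIR, test_file1)) or 0:
        xy = np.loadtxt(os.path.join(DIR, train_file), delimiter=',', dtype=float)
        x, y = xy[:, 0:-1], xy[:, -1]
        # Fit the scaler on the training features only
        scaler = preprocessing.StandardScaler().fit(x)
        # column_stack appends the 1-D label vector as the last column
        # (np.hstack would fail on a 2-D array plus a 1-D array)
        xy = np.column_stack([scaler.transform(x), y])
        np.savetxt(os.path.join(DIR, train_file1), xy, fmt='%.7f')

        # Reuse the training-set statistics on the test set
        x_test = np.loadtxt(os.path.join(DIR, test_file), delimiter=',', dtype=float)
        x_test = scaler.transform(x_test)
        np.savetxt(os.path.join(DIR, test_file1), x_test, fmt='%.7f')
    else:
        print('data loading...')
        xy = np.loadtxt(os.path.join(DIR, train_file1), dtype=float)
        x_test = np.loadtxt(os.path.join(DIR, test_file1), dtype=float)
    return xy[:, 0:-1], xy[:, -1], x_test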
A pipeline can simplify this process (see Pipeline and FeatureUnion: combining estimators; translated article: http://www.voidcn.com/blog/mmc2015/article/p-3379231.html):
>>> from sklearn.pipeline import make_pipeline
>>> clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
>>> cross_validation.cross_val_score(clf, iris.data, iris.target, cv=cv)
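The snippet above relies on names defined elsewhere (svm, iris, cv) and on the sklearn.cross_validation module, which was removed in scikit-learn 0.20 in favor of sklearn.model_selection. A self-contained sketch for recent versions might look like the following; cv=5 is an assumption made here, since the original value of cv is not shown.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# The scaler is re-fitted on the training part of every fold,
# so the held-out part never leaks into the mean/std estimates.
clf = make_pipeline(StandardScaler(), svm.SVC(C=1))

scores = cross_val_score(clf, iris.data, iris.target, cv=5)  # cv=5 is an assumed value
print(scores.shape)
```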
min_max_scaler = preprocessing.MinMaxScaler()
X_minMax = min_max_scaler.fit_transform(X)
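For reference, with the default feature_range=(0, 1) MinMaxScaler rescales each column to (X - X_min) / (X_max - X_min). The sketch below checks this by hand, assuming X is the same 3x3 example array used earlier.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
X_minMax = min_max_scaler.fit_transform(X)

# Manual per-column min-max scaling to the range [0, 1]
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(X_minMax, manual))
```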
sklearn.preprocessing.robust_scale(X, axis=0, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)
Center to the median and component-wise scale according to the interquartile range.
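A small sketch of what this means with the default quantile_range=(25.0, 75.0): subtract the column median and divide by the interquartile range. The toy data with one large outlier is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import robust_scale

# One feature with an obvious outlier (illustrative data)
X = np.array([[1.], [2.], [3.], [4.], [100.]])

median = np.median(X, axis=0)
q25, q75 = np.percentile(X, [25, 75], axis=0)

# Manual robust scaling: center on the median, divide by the IQR
manual = (X - median) / (q75 - q25)
scaled = robust_scale(X)

print(np.allclose(manual, scaled))
```

Because the median and the quartiles barely move when the outlier changes, the bulk of the data keeps a sensible scale, which is the usual reason to prefer this over mean/variance scaling on data with outliers.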
Constructs a transformer from an arbitrary callable.
from sklearn.preprocessing import FunctionTransformer
import numpy as np

def scalerFunc(x, maxv, minv, THRESHOLD=200):
    '''
    :param x: (n_samples, n_features)!!
    '''
    label = x >= THRESHOLD
    result = 0.5 * (1 + (x - THRESHOLD) * (label / (maxv - THRESHOLD) + (label - 1) / (minv - THRESHOLD)))
    # print(result)
    return result

x = np.array([100, 150, 201, 250, 300]).reshape(-1, 1)
scaler = FunctionTransformer(func=scalerFunc, kw_args={'maxv': x.max(), 'minv': x.min()}).fit(x)
print(scaler.transform(x))

[[ 0.   ]
 [ 0.25 ]
 [ 0.505]
 [ 0.75 ]
 [ 1.   ]]
Note: the extra arguments of the custom function are passed through kw_args of FunctionTransformer, which is a dict whose keys must be strings.
[preprocessing.FunctionTransformer([func, ...])]
[sklearn.preprocessing: Preprocessing and Normalization]
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples. This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.
def normalize(X,norm='l2',axis=1,copy=True)
>>> X = [[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> X_normalized
array([[ 0.40..., -0.40..., 0.81...],
[ 1. ..., 0. ..., 0. ...],
[ 0. ..., 0.70..., -0.70...]])
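One way to see why unit-norm samples are handy for dot-product similarity: after L2 normalization, the plain dot product between two rows is their cosine similarity. A minimal sketch, with variable names chosen for illustration:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

X_normalized = normalize(X, norm='l2')  # axis=1 by default: each row (sample) gets unit norm

# Every row now has Euclidean length 1
row_norms = np.linalg.norm(X_normalized, axis=1)

# Dot products of normalized rows ...
cosine = X_normalized @ X_normalized.T

# ... match the explicit cosine-similarity formula on the raw rows
norms = np.linalg.norm(X, axis=1)
expected = (X @ X.T) / np.outer(norms, norms)

print(np.allclose(row_norms, 1.0))
print(np.allclose(cosine, expected))
```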
>>> X = [[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]]
>>> binarizer = preprocessing.Binarizer().fit(X) # fit does nothing
>>> binarizer
Binarizer(copy=True, threshold=0.0)
>>> binarizer.transform(X)
array([[ 1., 0., 1.],
[ 1., 0., 0.],
[ 0., 1., 0.]])
[Encoding categorical features]
Imputation of missing values
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[ 4. 2. ]
[ 6. 3.666...]
[ 7. 6. ]]
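The Imputer class used above was removed from sklearn.preprocessing in scikit-learn 0.22; its replacement is sklearn.impute.SimpleImputer, which always imputes column-wise and has no axis parameter. A sketch of the same example for newer versions:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Learn the per-column means from the training data, ignoring NaN entries
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

# Fill the missing entries of new data with the learned column means
X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_imputed = imp.transform(X)
print(X_imputed)
```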
That said, I prefer to use pandas for this kind of data processing [pandas notes: advanced pandas features].
[Imputation of missing values]
[Generating polynomial features]
