imblearn usage notes

2020-04-14, by 走在成长的道路上

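Machine learning projects often run into imbalanced class distributions in the sample data. The imblearn package can resample the data in that case; install it with pip install imbalanced-learn.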

The resamplers in imblearn generally expect the input x to be a 2D structure; otherwise an error is raised.

Imbalance analysis

In data-driven operations, imbalanced sample distributions come up frequently.

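Types of sampling

Sampling is a relatively simple and commonly used way to deal with imbalanced sample distributions. It comes in two forms: over-sampling and under-sampling.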

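Over-sampling

Over-sampling (also called up-sampling) balances the classes by increasing the number of minority-class samples. The most direct approach is to duplicate minority samples to create additional records; its drawback is that it can lead to overfitting when the samples have few features. Improved over-sampling methods add random noise or perturbed data to the minority class, or generate new synthetic samples by some rule, as the SMOTE algorithm does.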
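The simplest variant described above, duplicating minority samples, can be sketched in plain Python. This is only an illustration of the idea and not how imblearn implements it; the function name random_over_sample and the toy data are assumptions made for the example.

```python
import random
from collections import Counter


def random_over_sample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        # Rows that belong to the current class.
        rows = [row for row, lab in zip(X, y) if lab == label]
        # Draw with replacement to fill the gap up to the majority size.
        for _ in range(target - count):
            X_res.append(rng.choice(rows))
            y_res.append(label)
    return X_res, y_res


# Hypothetical toy data: 4 samples of class "a", 2 of class "b".
X = [[0.1], [0.2], [0.3], [0.4], [5.0], [5.1]]
y = ["a", "a", "a", "a", "b", "b"]
X_res, y_res = random_over_sample(X, y)
print(Counter(y_res))  # both classes now have 4 samples
```

Because the added rows are exact copies, the model sees the same minority points several times, which is where the overfitting risk comes from.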

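Under-sampling

Under-sampling (also called down-sampling) balances the classes by reducing the number of majority-class samples. The most direct approach is to drop majority samples at random to shrink that class; the drawback is that some important information in the majority class is lost.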
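Random under-sampling as described above can be sketched the same way. Again this is an illustrative stand-in rather than imblearn's code; random_under_sample and the toy data are hypothetical names chosen for the example.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    target = min(Counter(y).values())
    X_res, y_res = [], []
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        # Sample without replacement; the rows left out are the lost information.
        for row in rng.sample(rows, target):
            X_res.append(row)
            y_res.append(label)
    return X_res, y_res


# Hypothetical toy data: 4 samples of class "a", 2 of class "b".
X = [[0.1], [0.2], [0.3], [0.4], [5.0], [5.1]]
y = ["a", "a", "a", "a", "b", "b"]
X_res, y_res = random_under_sample(X, y)
print(Counter(y_res))  # both classes now have 2 samples
```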

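Overall, over-sampling and under-sampling are better suited to large, imbalanced datasets, and the first method (over-sampling) is the more widely used of the two.

Implementation

This article uses the open Weibo four-mood dataset simplifyweibo_4_moods.csv as sample data. All preprocessing steps are shown below: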

import sys
import os
import pandas as pd

import jieba

from keras_preprocessing.sequence import pad_sequences
# the same classes are also available in keras itself
from keras_preprocessing.text import Tokenizer
# model evaluation tools
from sklearn import metrics

# xgboost / lightgbm models
import xgboost as xgb
import lightgbm as lbm

# voting across multiple models
from sklearn.ensemble import VotingClassifier
from collections import Counter
# resampling
from imblearn.under_sampling import RandomUnderSampler
sys.path.extend([os.path.dirname(os.getcwd())])

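dataset = pd.read_csv("data/samples/simplifyweibo_4_moods.csv", header=0)
moods = {0: 'joy', 1: 'anger', 2: 'disgust', 3: 'sadness'}
x = dataset['review'].values

def transform(a):
    return moods[a]

# convert the numeric labels to text labels
y = dataset['label'].apply(transform).values

# segment each review with jieba and join the tokens with spaces,
# because Tokenizer's split argument must be a separator string, not a function
x_cut = [' '.join(jieba.cut(text)) for text in x]
data_tokenizer = Tokenizer()
data_tokenizer.fit_on_texts(x_cut)
data_seq = data_tokenizer.texts_to_sequences(x_cut)
# pad or truncate every sequence to a length of 200
data_seq = pad_sequences(data_seq, maxlen=200)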

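1. SMOTE sampling

from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN, SMOTETomek

print('Original dataset shape %s' % Counter(y))
# build the SMOTE model
smote = SMOTE()
# over-sample; the result goes into new variables so the original data stays available
x_smote, y_smote = smote.fit_resample(data_seq, y)
print('resampled dataset shape %s' % Counter(y_smote))

# SMOTE + ENN: over-sample with SMOTE, then clean with Edited Nearest Neighbours
smote_enn = SMOTEENN()
x_smote_enn, y_smote_enn = smote_enn.fit_resample(data_seq, y)
print('resampled dataset shape %s' % Counter(y_smote_enn))

# SMOTE + Tomek: over-sample with SMOTE, then remove Tomek links
smote_tomek = SMOTETomek()
x_smote_tomek, y_smote_tomek = smote_tomek.fit_resample(data_seq, y)
print('resampled dataset shape %s' % Counter(y_smote_tomek))

SMOTE generates new synthetic minority samples by rule: each one is interpolated between a minority sample and one of its nearest minority-class neighbours. SMOTEENN and SMOTETomek then clean the over-sampled data with Edited Nearest Neighbours and Tomek links respectively.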
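The interpolation step behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions (Euclidean distance, a uniformly random base point, the hypothetical name smote_sketch), not imblearn's implementation.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment from base to neighbour.
        lam = rng.random()
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical 2D minority samples.
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [1.2, 2.0]]
synthetic = smote_sketch(minority, n_new=5)
print(len(synthetic))  # 5 new points, all lying between existing minority points
```

Since every synthetic point lies on a segment between two real minority points, the new samples fill in the minority region instead of repeating existing rows.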

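2. Random sampling

from imblearn.over_sampling import RandomOverSampler

print('Original dataset shape %s' % Counter(y))
# every class is randomly under-sampled down to the size of the smallest class
rus = RandomUnderSampler()
x_rus, y_rus = rus.fit_resample(data_seq, y)
print('resampled dataset shape %s' % Counter(y_rus))

# smaller classes are randomly re-drawn with replacement up to the size of the largest class
ros = RandomOverSampler(random_state=0)
x_ros, y_ros = ros.fit_resample(data_seq, y)
print('resampled dataset shape %s' % Counter(y_ros))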

After resampling with RandomUnderSampler, every class contains as many samples as the smallest class.

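3. ADASYN sampling

from imblearn.over_sampling import ADASYN

print('Original dataset shape %s' % Counter(y))
adasyn = ADASYN()
x_adasyn, y_adasyn = adasyn.fit_resample(data_seq, y)
print('resampled dataset shape %s' % Counter(y_adasyn))

SMOTE and ADASYN synthesize new minority samples with the same underlying approach, interpolating between a sample and its nearest neighbours. ADASYN differs in that it generates more samples around minority points that have many majority-class neighbours.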
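What sets ADASYN apart is how it decides where to generate samples. The sketch below illustrates only that weighting step, assuming Euclidean distance and simple rounding; adasyn_allocation and the toy data are hypothetical, and imblearn's implementation differs in detail.

```python
import math


def adasyn_allocation(X, y, minority_label, n_new, k=3):
    """Split n_new synthetic samples across minority points by local difficulty."""
    minority_idx = [i for i, lab in enumerate(y) if lab == minority_label]
    weights = []
    for i in minority_idx:
        # The k nearest neighbours of X[i] among all other samples.
        nearest = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        # Share of neighbours from other classes: higher means harder to learn.
        weights.append(sum(y[j] != minority_label for j in nearest) / k)
    total = sum(weights)
    if total == 0:
        return {i: 0 for i in minority_idx}
    return {i: round(n_new * w / total) for i, w in zip(minority_idx, weights)}


# Hypothetical 1D data: one minority point (index 6) sits inside the majority
# cluster, the other four form their own cluster far away.
X = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [0.45], [3.0], [3.1], [3.2], [3.3]]
y = ["maj"] * 6 + ["min"] * 5
allocation = adasyn_allocation(X, y, "min", n_new=10)
print(allocation)  # index 6 receives all 10 samples, the isolated cluster none
```

Once the counts are fixed, each synthetic point is produced by the same interpolation that SMOTE uses.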

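4. Prototype generation

from imblearn.under_sampling import ClusterCentroids

print('Original dataset shape %s' % Counter(y))
cc = ClusterCentroids(random_state=0)
x_cc, y_cc = cc.fit_resample(data_seq, y)
print('resampled dataset shape %s' % Counter(y_cc))

ClusterCentroids under-samples by prototype generation: the samples of each under-sampled class are replaced by K-Means cluster centroids instead of being drawn at random from the original samples.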
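The centroid idea can be sketched with a small K-Means loop in plain Python. This is an illustration only: kmeans_centroids is a hypothetical helper, and the initial centroids are taken from evenly spaced input rows to keep the example deterministic, whereas real K-Means implementations use smarter initialisation.

```python
import math


def kmeans_centroids(points, k, n_iter=20):
    """Return k centroids found by a basic K-Means loop."""
    # Deterministic start: evenly spaced rows of the input (an assumption of this sketch).
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its closest centroid.
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


# Hypothetical data: 6 majority samples in two groups, 2 minority samples.
majority = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [4.0, 4.0], [4.2, 3.9], [3.9, 4.1]]
minority = [[2.0, 2.0], [2.1, 1.9]]
# Replace the majority samples with as many centroids as there are minority samples.
reduced_majority = kmeans_centroids(majority, k=len(minority))
print(reduced_majority)  # two generated prototypes, one per group of majority samples
```

The resulting prototypes are new points and generally do not coincide with any original sample, which is the difference from random under-sampling.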

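5. Nearest-neighbour under-sampling

This family applies nearest-neighbour rules to edit the dataset, finding and removing samples that do not agree well with their neighbourhood. Each sample in a class being under-sampled is checked against a selection criterion and removed if it fails. A sample stays in the dataset when most (kind_sel='mode') or all (kind_sel='all') of its nearest neighbours belong to the same class as the sample.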
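The editing rule can be sketched in plain Python as follows. It is a simplified illustration, assuming Euclidean distance and a simple majority vote in place of the mode; enn_sketch and the toy data are hypothetical and not part of imblearn.

```python
import math
from collections import Counter


def enn_sketch(X, y, target_label, k=3, kind_sel="all"):
    """Drop samples of target_label whose k nearest neighbours disagree with them."""
    X_res, y_res = [], []
    for i, (row, label) in enumerate(zip(X, y)):
        keep = True
        if label == target_label:  # only the class being under-sampled is edited
            nearest = sorted(
                (j for j in range(len(X)) if j != i),
                key=lambda j: math.dist(row, X[j]),
            )[:k]
            same = sum(y[j] == label for j in nearest)
            if kind_sel == "all":
                keep = same == k      # every neighbour must share the label
            else:
                keep = same > k / 2   # a majority of neighbours is enough
        if keep:
            X_res.append(row)
            y_res.append(label)
    return X_res, y_res


# Hypothetical 1D data: "neg" is the majority class. The point 0.4 sits on the
# class border and the point 2.05 is surrounded by "pos" samples.
X = [[0.0], [0.1], [0.2], [0.3], [0.4], [2.05], [0.55], [2.0], [2.1], [2.2]]
y = ["neg"] * 6 + ["pos"] * 4
_, y_all = enn_sketch(X, y, "neg", kind_sel="all")
_, y_mode = enn_sketch(X, y, "neg", kind_sel="mode")
print(Counter(y_all))   # 4 neg, 4 pos: border and noisy points are both removed
print(Counter(y_mode))  # 5 neg, 4 pos: only the noisy point is removed
```

The stricter 'all' criterion removes more samples than 'mode', because a single disagreeing neighbour is enough to drop a point.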

from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.under_sampling import RepeatedEditedNearestNeighbours
from imblearn.under_sampling import AllKNN
from imblearn.over_sampling import KMeansSMOTE

print('Original dataset shape %s' % Counter(y))
# these cleaners are deterministic; recent imblearn releases do not accept random_state here
enn = EditedNearestNeighbours()
x_enn, y_enn = enn.fit_resample(data_seq, y)
print('resampled dataset shape %s' % Counter(y_enn))

# apply EditedNearestNeighbours repeatedly
renn = RepeatedEditedNearestNeighbours()
x_renn, y_renn = renn.fit_resample(data_seq, y)
print('resampled dataset shape %s' % Counter(y_renn))

# AllKNN increases the number of nearest neighbours at each iteration
allknn = AllKNN()
x_allknn, y_allknn = allknn.fit_resample(data_seq, y)
print('resampled dataset shape %s' % Counter(y_allknn))

# KMeans + SMOTE (an over-sampler, listed here for comparison)
sm = KMeansSMOTE(random_state=0)
x_ksm, y_ksm = sm.fit_resample(data_seq, y)
print('resampled dataset shape %s' % Counter(y_ksm))

Building an imbalanced sample

from sklearn.datasets import load_iris
from imblearn.datasets import make_imbalance
iris = load_iris()

# request a specific number of samples per class
ratio = {0: 20, 1: 30, 2: 40}
x_imb, y_imb = make_imbalance(iris.data, iris.target, sampling_strategy=ratio)
sorted(Counter(y_imb).items())
# Out[37]: [(0, 20), (1, 30), (2, 40)]

# classes that are not listed keep all of their samples
ratio = {0: 10}
x_imb, y_imb = make_imbalance(iris.data, iris.target, sampling_strategy=ratio)
sorted(Counter(y_imb).items())
# Out[38]: [(0, 10), (1, 50), (2, 50)]

# a custom function that computes the target counts can also be passed
def ratio_multiplier(y):
    multiplier = {0: 0.5, 1: 0.7, 2: 0.95}
    target_stats = Counter(y)
    for key, value in target_stats.items():
        target_stats[key] = int(value * multiplier[key])
    return target_stats
x_imb, y_imb = make_imbalance(iris.data, iris.target, sampling_strategy=ratio_multiplier)
sorted(Counter(y_imb).items())
# Out[39]: [(0, 25), (1, 35), (2, 47)]

Common issues

  1. Installing the imblearn package upgrades sklearn by default, which can cause sklearn2pmml to fail with the following error:
Standard output is empty
Standard error:
Apr 15, 2020 9:21:53 AM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Apr 15, 2020 9:21:53 AM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 17 ms.
Apr 15, 2020 9:21:53 AM org.jpmml.sklearn.Main run
INFO: Converting..
Apr 15, 2020 9:21:53 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: The transformer object (Python class sklearn.ensemble._voting.VotingClassifier) is not a supported Transformer
    at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
    at sklearn.pipeline.Pipeline$1.apply(Pipeline.java:121)
    at sklearn.pipeline.Pipeline$1.apply(Pipeline.java:112)
    at com.google.common.collect.Lists$TransformingRandomAccessList.get(Lists.java:599)
    at sklearn.TransformerUtil.getHead(TransformerUtil.java:35)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:189)
    at org.jpmml.sklearn.Main.run(Main.java:145)
    at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.Transformer
    at java.lang.Class.cast(Class.java:3369)
    at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
    ... 7 more

Exception in thread "main" java.lang.IllegalArgumentException: The transformer object (Python class sklearn.ensemble._voting.VotingClassifier) is not a supported Transformer
    at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
    at sklearn.pipeline.Pipeline$1.apply(Pipeline.java:121)
    at sklearn.pipeline.Pipeline$1.apply(Pipeline.java:112)
    at com.google.common.collect.Lists$TransformingRandomAccessList.get(Lists.java:599)
    at sklearn.TransformerUtil.getHead(TransformerUtil.java:35)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:189)
    at org.jpmml.sklearn.Main.run(Main.java:145)
    at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.Transformer
    at java.lang.Class.cast(Class.java:3369)
    at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
    ... 7 more

Upgrading sklearn2pmml to a version that matches the new sklearn resolves this: pip install --upgrade git+https://github.com/jpmml/sklearn2pmml.git
