Python Data Analysis and Machine Learning 18 - Logistic Regression Project in Practice 2 - Sample Im...

2022-07-19  只是甲

1. The Impact of Imbalanced Samples

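The sample data contains about 284,000 normal transactions (284,315) but only 492 fraudulent ones, roughly 0.17% of all records. With a gap this large, a model can do well on the majority class while missing most of the fraud, so its performance on the class we actually care about is poor.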
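
To make the effect concrete, here is a minimal sketch of why plain accuracy hides the problem. It assumes a hypothetical baseline that labels every transaction as normal and uses the class counts of this dataset (284,315 normal, 492 fraudulent); the function name `baseline_metrics` is illustrative only.

```python
def baseline_metrics(n_normal, n_fraud):
    """Accuracy and recall of a baseline that predicts 'normal' for every record."""
    true_negatives = n_normal   # every normal record is labeled correctly
    false_negatives = n_fraud   # every fraud record is missed
    true_positives = 0          # the baseline never predicts fraud

    accuracy = (true_positives + true_negatives) / (n_normal + n_fraud)
    recall = true_positives / (true_positives + false_negatives)
    return accuracy, recall


accuracy, recall = baseline_metrics(n_normal=284315, n_fraud=492)
print(f"accuracy = {accuracy:.4f}, recall = {recall:.4f}")
```

The baseline scores above 99.8% accuracy while catching no fraud at all, which is why recall is the more informative metric for this dataset.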

The following example shows the problem.
Code:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix,recall_score,classification_report

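# Main function: K-fold cross-validation over several values of C
def printing_Kfold_scores(x_train_data, y_train_data):
    # The first argument is the number of folds the training set is split into.
    # A single train/validation split can be unrepresentative; several folds allow cross-validation.
    fold = KFold(5, shuffle=False)

    # Regularization parameter C
    # In scikit-learn, C is the inverse of the regularization strength: a smaller C penalizes large weights more.
    # Regularization helps limit overfitting, where a model does well on the training set but poorly on unseen data.
    c_param_range = [0.01, 0.1, 1, 10, 100]

    # One row per candidate value of C
    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range
    # KFold.split yields a pair of index arrays per fold: training = indices[0], validation = indices[1]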

    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('Regularization parameter C: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        # KFold.split generates the indices that divide the data into training and validation parts
        for iteration, indices in enumerate(fold.split(y_train_data),start=1):
            # Penalty term
            # l2 adds the squared weights to the loss: loss + 1/2*power(w,2)
            # l1 adds the absolute weights: loss + |w|; the default lbfgs solver does not support it (use liblinear or saga)
            # Build the model with the chosen parameters
            lr = LogisticRegression(C=c_param, penalty='l2')

            # Fit on the training part only, so both X and y are indexed with indices[0]
            lr.fit(x_train_data[indices[0], :], y_train_data[indices[0], :].ravel())
            # Print the coefficients
            #print(lr.coef_)

            # Predict on the validation part, indexed with indices[1]
            y_pred_undersample = lr.predict(x_train_data[indices[1], :])

            # Evaluate the predictions; recall_score takes the true labels first, then the predictions
            recall_acc = recall_score(y_train_data[indices[1], :], y_pred_undersample)
            # Keep each fold's score so they can be averaged
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)


        # After all folds have run, compute the mean score
        results_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall: ', np.mean(recall_accs))
        print('')

    #print(results_table['Mean recall score'])
    # Pick the C_parameter whose 'Mean recall score' is highest
    r_index = results_table['Mean recall score'].astype(float).idxmax()
    best_c = results_table['C_parameter'].loc[r_index]


    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best C parameter =  ', best_c)
    print('*********************************************************************************')

    return best_c


data = pd.read_csv("E:/file/creditcard.csv")

# Standardize the transaction amount to zero mean and unit variance
# Without scaling, Amount spans a far larger range than V1-V28 and would dominate the regularized model
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
# Drop the columns that are not used for now
data = data.drop(['Time','Amount'],axis=1)

X = data.values[:, data.columns != 'Class']
y = data.values[:, data.columns == 'Class']

# Split into training and test sets
# The test share is 0.3 here and can be adjusted as needed
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

# Call the function on the training data
best_c = printing_Kfold_scores(X_train,y_train)

Test output:

E:\python\数据分析_new4\数据分析\Scripts\python.exe E:/python/数据分析/机器学习/回归/logic7.py
-------------------------------------------
Regularization parameter C:  0.01
-------------------------------------------

Iteration  1 : recall score =  0.5373134328358209
Iteration  2 : recall score =  0.6164383561643836
Iteration  3 : recall score =  0.6666666666666666
Iteration  4 : recall score =  0.6
Iteration  5 : recall score =  0.5

Mean recall:  0.5840836911333742

-------------------------------------------
Regularization parameter C:  0.1
-------------------------------------------

Iteration  1 : recall score =  0.5522388059701493
Iteration  2 : recall score =  0.6164383561643836
Iteration  3 : recall score =  0.7166666666666667
Iteration  4 : recall score =  0.6153846153846154
Iteration  5 : recall score =  0.5625

Mean recall:  0.612645688837163

-------------------------------------------
Regularization parameter C:  1
-------------------------------------------

Iteration  1 : recall score =  0.5522388059701493
Iteration  2 : recall score =  0.6164383561643836
Iteration  3 : recall score =  0.7333333333333333
Iteration  4 : recall score =  0.6153846153846154
Iteration  5 : recall score =  0.575

Mean recall:  0.6184790221704963

-------------------------------------------
Regularization parameter C:  10
-------------------------------------------

Iteration  1 : recall score =  0.5522388059701493
Iteration  2 : recall score =  0.6164383561643836
Iteration  3 : recall score =  0.7333333333333333
Iteration  4 : recall score =  0.6153846153846154
Iteration  5 : recall score =  0.575

Mean recall:  0.6184790221704963

-------------------------------------------
Regularization parameter C:  100
-------------------------------------------

Iteration  1 : recall score =  0.5522388059701493
Iteration  2 : recall score =  0.6164383561643836
Iteration  3 : recall score =  0.7333333333333333
Iteration  4 : recall score =  0.6153846153846154
Iteration  5 : recall score =  0.575

Mean recall:  0.6184790221704963

*********************************************************************************
Best C parameter =   1.0
*********************************************************************************

Process finished with exit code 0

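Conclusion:
Because normal transactions vastly outnumber fraudulent ones, even the best model reaches a mean recall of only about 0.618, far below what we need.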
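
As a reminder of what this number measures, recall is TP / (TP + FN): the share of actual fraud cases that the model flags. A minimal sketch with made-up counts (the 62/38 split is hypothetical, chosen only to land near the observed score):

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN); returns 0.0 when there are no positive samples."""
    positives = true_positives + false_negatives
    return true_positives / positives if positives else 0.0


# Hypothetical validation fold with 100 fraud cases, 62 of them detected
print(recall(true_positives=62, false_negatives=38))
```

A recall near 0.62 means that roughly 38 of every 100 fraudulent transactions go undetected.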

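2. Methods for Handling Imbalanced Samples

2.1 Weighting

  1. Class weight
    Weights are attached to classes: a class with many samples gets a lower weight, and a class with few samples gets a higher weight.

  2. Sample weight
    Weights are attached to individual samples: each sample in a large class gets a lower weight, and each sample in a small class gets a higher weight.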
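
A minimal sketch of how such weights might be derived from class counts. It uses the heuristic n_samples / (n_classes * class_count), which is the formula scikit-learn documents for `class_weight='balanced'`; the helper name `balanced_class_weights` is an assumption made for illustration, and the per-sample weights are obtained simply by looking up each sample's class.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Class counts of the credit card dataset: 0 = normal, 1 = fraud
labels = [0] * 284315 + [1] * 492

class_weights = balanced_class_weights(labels)
# Sample weights: every sample inherits the weight of its class
sample_weights = [class_weights[label] for label in labels]

print(class_weights)
```

With these counts the fraud class is weighted several hundred times more heavily than the normal class. In scikit-learn, passing `class_weight='balanced'` to `LogisticRegression` applies the same idea, and per-sample weights can be supplied through the `sample_weight` argument of `fit`.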

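2.2 Sampling

  1. Oversampling (upsampling)
    Resample the minority class until its size is on the same order as the majority class.

  2. Undersampling (downsampling or subsampling)
    Subsample the majority class until its size is on the same order as the minority class.

  3. Synthetic samples (SMOTE)
    Generate new minority-class samples by interpolating between existing minority samples and their nearest neighbors. This reduces the distortion that plain oversampling or undersampling introduces into the sample distribution.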
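
The three strategies can be sketched on toy data. This is a simplified illustration, not the imbalanced-learn implementation: the helper names are hypothetical, the points are made up, and the SMOTE-style step interpolates toward the single nearest minority neighbor, whereas the real algorithm picks at random among k nearest neighbors.

```python
import random


def random_oversample(minority, target_size, rng):
    """Duplicate minority samples (drawn with replacement) up to target_size."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


def random_undersample(majority, target_size, rng):
    """Keep a random subset of the majority class, drawn without replacement."""
    return rng.sample(majority, target_size)


def smote_like(minority, n_new, rng):
    """Create synthetic points on the segment between a sample and its nearest minority neighbor."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbor = min(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


rng = random.Random(0)
majority = [(float(i), float(i % 7)) for i in range(20)]  # 20 toy "normal" points
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.5)]           # 3 toy "fraud" points

oversampled = random_oversample(minority, len(majority), rng)
undersampled = random_undersample(majority, len(minority), rng)
smote_minority = minority + smote_like(minority, len(majority) - len(minority), rng)

print(len(oversampled), len(undersampled), len(smote_minority))
```

Oversampling only repeats existing points and undersampling discards most of the majority class, while the SMOTE-style step adds new points that lie between real minority samples.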

3. Examples

3.1 Undersampling

Code:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load and preprocess the dataset
data = pd.read_csv("E:/file/creditcard.csv")
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time','Amount'],axis=1)

X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

# Count the fraudulent transactions and collect their index labels
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Index labels of the normal transactions
normal_indices = data[data.Class == 0].index

# Randomly draw as many normal transactions as there are fraudulent ones, without replacement
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)

# Combine the fraud indices with the sampled normal indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Select the undersampled rows; the indices are labels, so use .loc rather than .iloc
under_sample_data = data.loc[under_sample_indices,:]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

# Print the class proportions after undersampling
print('Share of normal samples:', len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print('Share of fraud samples:', len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print('Total samples after undersampling:', len(under_sample_data))

# Split the full dataset: X holds the features, y the labels, test_size is the test share,
# and random_state fixes the seed so the split is the same on every run
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

print('Original training set size:', len(X_train))
print('Original test set size:', len(X_test))
print('Original total samples:', len(X_train) + len(X_test))

# Split the undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size = 0.3, random_state = 0)
print("")

print('Undersampled training set size:', len(X_train_undersample))
print('Undersampled test set size:', len(X_test_undersample))
print('Undersampled total samples:', len(X_train_undersample) + len(X_test_undersample))

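Test output:

Share of normal samples: 0.5
Share of fraud samples: 0.5
Total samples after undersampling: 984
Original training set size: 199364
Original test set size: 85443
Original total samples: 284807

Undersampled training set size: 688
Undersampled test set size: 296
Undersampled total samples: 984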
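
The set sizes in this output can be checked by hand. The sketch below assumes the test share is rounded up to a whole sample and the remainder goes to training, which is consistent with every figure reported here; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) when the test share is rounded up to a whole sample."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


print(split_sizes(284807, 0.3))  # full dataset
print(split_sizes(984, 0.3))     # undersampled dataset
```

Note how small the undersampled training set is: fewer than 700 rows remain out of roughly 200,000, which is the price of balancing the classes by discarding data.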

3.2 SMOTE

Code:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Load and preprocess the dataset
data = pd.read_csv("E:/file/creditcard.csv")
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
credit_cards = data.drop(['Time','Amount'],axis=1)

# Build X and y from the cleaned frame so that Time and Amount are excluded
X = credit_cards.values[:, credit_cards.columns != 'Class']
y = credit_cards.values[:, credit_cards.columns == 'Class']

# Features are every column except the label; 'Class' is not the last column,
# so it is dropped by name rather than by position
features=credit_cards.drop('Class', axis=1)
labels=credit_cards['Class']

features_train, features_test, labels_train, labels_test = train_test_split(features,
                                                                            labels,
                                                                            test_size=0.2,
                                                                            random_state=0)

# Oversample the minority class, using the training set only
oversampler=SMOTE(random_state=0)
os_features,os_labels=oversampler.fit_resample(features_train,labels_train)

os_features = pd.DataFrame(os_features)
os_labels = pd.Series(os_labels, name='Class')

# Print the class proportions after SMOTE
print('Share of normal samples:', (os_labels == 0).sum() / len(os_labels))
print('Share of fraud samples:', (os_labels == 1).sum() / len(os_labels))
print('Total samples after SMOTE:', len(os_features))

# Split the full dataset: X holds the features, y the labels, test_size is the test share,
# and random_state fixes the seed so the split is the same on every run
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

print('Original training set size:', len(X_train))
print('Original test set size:', len(X_test))
print('Original total samples:', len(X_train) + len(X_test))

# Split the SMOTE-resampled dataset
# This test part contains synthetic samples; features_test and labels_test remain the untouched hold-out set
X_train_smote_sample, X_test_smote_sample, y_train_smote_sample, y_test_smote_sample = train_test_split(
    os_features, os_labels, test_size = 0.3, random_state = 0)
print("")

print('SMOTE training set size:', len(X_train_smote_sample))
print('SMOTE test set size:', len(X_test_smote_sample))
print('SMOTE total samples:', len(X_train_smote_sample) + len(X_test_smote_sample))

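Test output:

Share of normal samples: 0.5
Share of fraud samples: 0.5
Total samples after SMOTE: 454908
Original training set size: 199364
Original test set size: 85443
Original total samples: 284807

SMOTE training set size: 318435
SMOTE test set size: 136473
SMOTE total samples: 454908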
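
The total of 454,908 follows from how SMOTE balances classes. Assuming the default sampling strategy, which grows the minority class until it matches the majority class, the resampled training set holds twice the number of normal transactions. A short arithmetic check, where the training-set size assumes the 0.2 test share is rounded up to a whole sample:

```python
import math

n_total = 284807
n_train = n_total - math.ceil(0.2 * n_total)  # training rows before SMOTE

smote_total = 454908                      # total reported in the output above
n_normal_train = smote_total // 2         # both classes have equal size after SMOTE
n_fraud_train = n_train - n_normal_train  # fraud rows originally in the training set

print(n_train, n_normal_train, n_fraud_train)
```

Under these assumptions, SMOTE adds 227,063 synthetic fraud records on top of the 391 real ones in the training set.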
