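Credit Card Fraud Detection (Machine Learning)

2019-03-24 Radiance_sty

Exercise: Credit Card Fraud Detection

The dataset used here has already been filtered, because the raw data involves private information. This does not stop us from building and testing a predictive model.

Importing the data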

import pandas as pd
import matplotlib.pyplot as plt

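# Note: pandas truncates wide or long output by default; uncomment the options below as needed
# pd.set_option('display.max_columns', None)  # show all columns
# pd.set_option('display.max_rows', None)  # show all rows
# pd.set_option('display.max_colwidth', 100)  # maximum displayed width of a value: 100 (default 50)
# pd.set_option('display.width', 1000)  # wrap output only beyond a width of 1000 characters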

# import data
data = pd.read_csv('creditcard.csv')
print(data.head())

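Output:

The output shows the first 5 rows. They do not reveal much on their own: the columns are already-extracted features, which makes modeling convenient.

After loading the data, split it into two classes: 0 (normal) and 1 (anomalous). Note that normal samples vastly outnumber anomalous ones. In the Class column, 0 means the transaction was not fraudulent and 1 means it was.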

# import data
data = pd.read_csv('creditcard.csv')
print(data.head())

count_classes = data['Class'].value_counts().sort_index()   # number of rows for each class value
count_classes.plot(kind='bar')

# Plot
plt.title('Fraud class histogram')          # title
plt.xlabel('Class')                         # x-axis label
plt.xticks(rotation=45)                     # rotate the x tick labels by 45 degrees
plt.ylabel('Frequency')                     # y-axis label

plt.show()

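Output:

Handling class imbalance
The bar chart shows that the two classes are heavily imbalanced. Oversampling or undersampling can even out the distribution so that classes 0 and 1 have the same number of samples before further analysis. In addition, the Amount column spans a wide range of values, and many learning algorithms implicitly treat features with larger values as more important, so it should be normalized or standardized.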
```
from sklearn.preprocessing import StandardScaler

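# Standardize Amount to zero mean and unit variance
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
# fit_transform fits the scaler and transforms the data in one step
# Note: call '.values' before reshape, because a pandas Series has no reshape method

data = data.drop(['Time', 'Amount'], axis=1)            # drop features that are no longer needed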
print(data.head())
```

Output:

normAmount now replaces the Amount column and holds the standardized values.
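To make the standardization step concrete, here is a minimal sketch on made-up amounts (not values from creditcard.csv). It applies the z-score formula by hand with NumPy. StandardScaler uses the same formula with the population standard deviation, so the result should have mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical transaction amounts, chosen only for illustration
amounts = np.array([1.0, 5.0, 20.0, 100.0, 2500.0])

# z-score: subtract the mean, then divide by the population standard deviation
standardized = (amounts - amounts.mean()) / amounts.std()

print(standardized.round(3))
print("mean:", round(float(standardized.mean()), 6))
print("std: ", round(float(standardized.std()), 6))
```

After this transformation the large outlier still stands out, but the column is on a scale comparable to the other features.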

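Data processing
Because there are far fewer anomalous samples than normal ones, the data is undersampled so that both classes end up with the same number of samples.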

import numpy as np
import pandas as pd

data = pd.read_csv('creditcard.csv')

# Undersample so that class 0 (normal) has as few samples as class 1 (fraud)
# Note: ix is no longer available in pandas; use loc or iloc instead
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

# Count the fraud samples and collect their indices as an array
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Indices of the normal samples
normal_indices = data[data.Class == 0].index

# Randomly pick number_records_fraud normal indices without replacement, as an array
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)

# Merge the two sets of indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# The undersampled dataset
under_sample_data = data.iloc[under_sample_indices]

# Features and labels of the undersampled dataset
X_under_samples = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_under_samples = under_sample_data.loc[:, under_sample_data.columns == 'Class']

# Share of normal transactions
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
# Share of fraud transactions
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
# Total number of samples
print("Total number of transactions in resampled data: ", len(under_sample_data))

Output:
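The same undersampling idea can also be written with pandas alone. The sketch below is an alternative, not the method used in this exercise. It runs on a small made-up frame (the column V1 and the 95/5 split are assumptions for illustration) and fixes random_state so that the draw is reproducible.

```python
import pandas as pd

# Toy frame standing in for creditcard.csv: 95 normal rows and 5 fraud rows
toy = pd.DataFrame({
    "V1": range(100),
    "Class": [0] * 95 + [1] * 5,
})

fraud = toy[toy["Class"] == 1]
# Draw as many normal rows as there are fraud rows, without replacement
normal_sampled = toy[toy["Class"] == 0].sample(n=len(fraud), random_state=0)

# Combine both parts and shuffle so the classes are interleaved
balanced = pd.concat([fraud, normal_sampled]).sample(frac=1, random_state=0)

print(balanced["Class"].value_counts())
```

Because sample draws without replacement by default, no normal row can appear twice in the balanced frame.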

Cross-validation

from sklearn.model_selection import train_test_split

# Split the data into a training set and a test set; 30% is held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train) + len(X_test))

# Undersampled dataset
X_train_undersample, X_test_unsersample, y_train_undersample, y_test_undersample = train_test_split(X_under_samples, y_under_samples, test_size = 0.3, random_state = 0)

print('')
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_unsersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_unsersample))

Output:

For more on cross-validation, see: https://www.cnblogs.com/sddai/p/5696834.html
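One detail worth knowing when splitting imbalanced data: a plain random split can leave the test set with a different fraud rate than the training set. The sketch below is an optional variation, not part of the original exercise. It uses the stratify argument of train_test_split on made-up labels (1000 samples, 2% fraud) to keep the class ratio identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 1000 samples with a 2% "fraud" rate (hypothetical numbers)
y_toy = np.array([0] * 980 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy preserves the class proportions in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

print("fraud rate in train:", y_tr.mean())
print("fraud rate in test: ", y_te.mean())
```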

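Model evaluation:
Accuracy: because the classes are so imbalanced, a model can reach very high accuracy and still fail to detect a single anomalous sample, so recall is commonly used as the evaluation metric instead.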
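A small worked example shows why accuracy is misleading here. The numbers are made up: 990 normal and 10 fraudulent transactions, and a trivial "model" that predicts normal for everything.

```python
import numpy as np

# Hypothetical ground truth: 990 normal (0) and 10 fraud (1) transactions
y_true_toy = np.array([0] * 990 + [1] * 10)
# A useless "model" that predicts normal for every transaction
y_pred_toy = np.zeros_like(y_true_toy)

accuracy = (y_true_toy == y_pred_toy).mean()

tp = np.sum((y_true_toy == 1) & (y_pred_toy == 1))  # frauds caught
fn = np.sum((y_true_toy == 1) & (y_pred_toy == 0))  # frauds missed
recall = tp / (tp + fn)

print("accuracy:", accuracy)  # 0.99
print("recall:  ", recall)    # 0.0
```

The accuracy looks excellent even though not a single fraud case was caught, which is exactly what recall exposes.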
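Regularization penalty: L2 regularization. Almost every machine learning loss function has an extra term appended to it. The two most common choices are called L1 regularization and L2 regularization, and both can be viewed as penalty terms on the loss function. "Penalty" here means placing constraints on some of the parameters in the loss function.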
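In symbols, and up to constant factors that vary between textbooks, the two penalized objectives can be written as follows. The notation is chosen here for illustration and does not come from the original exercise: L(w) is the unregularized loss, w the weight vector, and λ the penalty strength.

```latex
J_{L1}(w) = L(w) + \lambda \sum_i |w_i|,
\qquad
J_{L2}(w) = L(w) + \lambda \sum_i w_i^2
```

In scikit-learn's LogisticRegression, the parameter C is the inverse of the regularization strength, so a smaller C means a stronger penalty.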

# Modeling; Recall = TP / (TP + FN)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, classification_report
from sklearn.model_selection import KFold, cross_val_score
# Note: in current scikit-learn versions KFold is imported from model_selection

import warnings
warnings.filterwarnings("ignore")
# Note: the code below emits warnings; the line above suppresses them

# K-fold cross-validation
def printing_Kfold_scores(x_train_data,y_train_data):
    fold = KFold(5, shuffle=True)                      # 5 folds, shuffled before splitting

    c_param_range = [0.01,0.1,1,10,100]         # candidate C values (control the penalty strength)

    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter','Mean recall score'])
    results_table['C_parameter'] = c_param_range
    results_table['C_parameter'] = c_param_range

    j = 0
    # Each iteration uses a different C value; the best one is selected at the end
    for c_param in c_param_range:
        print('-'*30)
        print('C parameter: ', c_param)
        print('-'*30)
        print('')

        recall_accs = []
        # Cross-validation
        for iteration,indices in enumerate(fold.split(x_train_data)):

            # Build the logistic regression model with the current C
            # (liblinear is one of the solvers that supports the L1 penalty)
            lr = LogisticRegression(C = c_param, penalty='l1', solver='liblinear')

            # Fit the model on the training folds
            lr.fit(x_train_data.iloc[indices[0],:], y_train_data.iloc[indices[0],:].values.ravel())

            # Predict on the held-out fold of the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)

            # Compute recall and append it to the list
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values, y_pred_undersample)

            recall_accs.append(recall_acc)
            print('Iteration', iteration,': recall score = ',recall_acc)

        # The mean recall is the metric we want to record
        results_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']
    # Note: cast the column with .astype('float64') before calling idxmax()

    print('*'*30)
    print('Best model to choose from cross validation is with C parameter =', best_c)
    print('*'*30)
    return best_c

best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)
print(best_c)

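Output:

The evaluation shows that recall on the undersampled data is satisfactory, but too many normal transactions are flagged as fraud, more than is acceptable.

Confusion matrix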

import itertools
import matplotlib.pyplot as plt

# Plot a confusion matrix as an annotated heat map
def plot_confusion_matrix(cm, classes,title='Confusion matrix',cmap=plt.cm.Blues):

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
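Before plotting, it helps to know how to read the matrix. For binary labels 0 and 1, scikit-learn's confusion_matrix puts true labels on the rows and predicted labels on the columns, so the layout is [[TN, FP], [FN, TP]]. The sketch below uses made-up labels to show this.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = fraud, 0 = normal
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 0, 1, 0, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm_demo.ravel()

print(cm_demo)
print("missed frauds (FN):", fn)
print("false alarms (FP): ", fp)
print("recall:", tp / (tp + fn))
```

The bottom row is the one that matters for recall: it holds the missed frauds and the detected frauds.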

Confusion matrix on the full dataset

lr = LogisticRegression(C = best_c, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_test.values)

# Confusion matrix on the full test set
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

print('Recall metric in the testing dataset: ', cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot the non-normalized confusion matrix
class_name = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_name, title='Confusion matrix')
plt.show()

Output:

The model misses many of the fraud cases.

Confusion matrix on the undersampled dataset

lr = LogisticRegression(C = best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersamples = lr.predict(X_test_unsersample.values)

# Confusion matrix on the undersampled test set
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersamples)
np.set_printoptions(precision=2)

print('Recall metric in the testing dataset: ', cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot the non-normalized confusion matrix
class_name = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_name, title='Confusion matrix')
plt.show()

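Output:

Here 9 fraud cases are missed, and 18 normal transactions are wrongly flagged as fraud.
So far we have used the default threshold of 0.5 on the sigmoid output. What happens to the results if we choose the threshold ourselves?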

lr = LogisticRegression(C = 0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_unsersample.values)
# predict_proba returns class probabilities instead of hard labels

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

plt.figure(figsize=(10,10))

# Compare the predicted fraud probability against each threshold
j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i

    plt.subplot(3,3,j)
    j += 1

    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold > %s' %i)
plt.show()

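Output:

The plots show that for thresholds from 0.1 to 0.3 recall is 1, which indicates the criterion is too strict: with such a low threshold, transactions are flagged as fraud very readily. As the threshold increases, the model becomes more and more lenient. Choose the threshold that suits your practical requirements.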
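The trade-off can be reproduced on a handful of made-up probabilities. The sketch below is illustrative only: the values in proba and labels are assumptions, not model output. It counts recall and false alarms for three thresholds.

```python
import numpy as np

# Hypothetical predicted fraud probabilities and true labels (1 = fraud)
proba = np.array([0.05, 0.15, 0.35, 0.45, 0.55, 0.65, 0.85, 0.95])
labels = np.array([0, 0, 0, 1, 0, 1, 1, 1])

def recall_and_false_alarms(threshold):
    """Return (recall, number of false positives) for a given cut-off."""
    flagged = proba > threshold
    tp = np.sum(flagged & (labels == 1))
    fn = np.sum(~flagged & (labels == 1))
    fp = np.sum(flagged & (labels == 0))
    return float(tp / (tp + fn)), int(fp)

for t in (0.1, 0.5, 0.9):
    r, n_fp = recall_and_false_alarms(t)
    print(f"threshold {t}: recall = {r:.2f}, false alarms = {n_fp}")
```

On this toy data a low threshold catches every fraud case but raises three false alarms, while a high threshold raises none and misses most of the fraud.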

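Oversampling: in data analysis we generally want as many valid samples as possible. When the two classes differ in size, oversampling evens them out by generating additional samples for the smaller class (here, fraud) until it is as large as the bigger one.
SMOTE: an algorithm that synthesizes additional minority-class samples
Reference: https://www.cnblogs.com/Determined22/p/5772538.html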
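The core idea of SMOTE is to create a new point on the line segment between a minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified illustration of that interpolation step in NumPy, not the imblearn implementation. The helper name smote_like_sample, the toy points, and k=2 are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class (fraud) samples with two features
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.2]])

def smote_like_sample(X, k=2):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(X))
    # Euclidean distance from X[i] to every sample
    dist = np.linalg.norm(X - X[i], axis=1)
    # Position 0 of the sorted order is X[i] itself, so skip it
    neighbours = np.argsort(dist)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return X[i] + lam * (X[j] - X[i])

synthetic = np.array([smote_like_sample(minority) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies between two existing fraud samples, so the new data stays inside the region already occupied by the minority class.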

import pandas as pd
import numpy as np
import warnings
import itertools
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,recall_score
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression

from imblearn.over_sampling import SMOTE

warnings.filterwarnings("ignore")
# Note: the code below emits warnings; the line above suppresses them

# Load and split the data
credit_card = pd.read_csv('creditcard.csv')

# Modeling; Recall = TP / (TP + FN)
def printing_Kfold_scores(x_train_data,y_train_data):
    fold = KFold(5, shuffle=True)

    c_param_range = [0.01,0.1,1,10,100]         # candidate C values (control the penalty strength)

    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter','Mean recall score'])
    results_table['C_parameter'] = c_param_range

    j = 0
    # Each iteration uses a different C value; the best one is selected at the end
    for c_param in c_param_range:
        print('-'*30)
        print('C parameter: ', c_param)
        print('-'*30)
        print('')

        recall_accs = []
        # Cross-validation
        for iteration,indices in enumerate(fold.split(x_train_data)):

            # Build the logistic regression model with the current C
            # (liblinear is one of the solvers that supports the L1 penalty)
            lr = LogisticRegression(C = c_param, penalty='l1', solver='liblinear')

            # Fit the model on the training folds
            lr.fit(x_train_data.iloc[indices[0],:], y_train_data.iloc[indices[0],:].values.ravel())

            # Predict on the held-out fold of the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)

            # Compute recall and append it to the list
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values, y_pred_undersample)

            recall_accs.append(recall_acc)
            print('Iteration', iteration,': recall score = ',recall_acc)

        # The mean recall is the metric we want to record
        results_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']
    # Note: cast the column with .astype('float64') before calling idxmax()

    print('*'*30)
    print('Best model to choose from cross validation is with C parameter =', best_c)
    print('*'*30)
    return best_c

columns = credit_card.columns
feathres_columns = columns.delete(len(columns)-1)

feathres = credit_card[feathres_columns]
labels = credit_card['Class']

feathres_train, feathres_test, labels_train, labels_test = train_test_split(
    feathres, labels, test_size=0.2, random_state=0)

# Apply SMOTE to the training set only
oversample = SMOTE(random_state=0)
os_feathres, os_labels = oversample.fit_resample(feathres_train, labels_train)
print(len(os_labels[os_labels == 1]))

# Cross-validation
os_feathres = pd.DataFrame(os_feathres)
os_labels = pd.DataFrame(os_labels)
best_c = printing_Kfold_scores(os_feathres, os_labels)

Output:

Confusion matrix

# Plot a confusion matrix as an annotated heat map
def plot_confusion_matrix(cm, classes,title='Confusion matrix',cmap=plt.cm.Blues):

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

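Output:

As the figure shows, the oversampling results on the test set are clearly better than the undersampling results.

Summary

With undersampling, recall can reach a high level, but the false-positive rate is also high: the number of transactions predicted as the rare event (fraud) rises noticeably. Oversampling avoids this problem.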
