(26) Project in Practice | Transaction Data Anomaly Detection (Part 1) - Python Data Analysis

2018-06-01, by 努力奋斗的durian

Original article. Last updated: 2018-06-01

1. Case background and goal
2. Solutions for imbalanced samples
3. Undersampling strategy

Course material: the practice file creditcard.csv used here can be downloaded with the link and password below.
Link: https://pan.baidu.com/s/1APgU4cTAaM9zb8_xAIc41Q Password: xgg7

1. Case background and goal

This lesson is mainly about applying linear regression and logistic regression. First, let's look at the first 5 rows of creditcard.csv.

import pandas as pd 
data=pd.read_csv("creditcard.csv")
print(data.head())

The output is:

   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...         V21       V22       V23       V24  \
0  0.098698  0.363787  ...   -0.018307  0.277838 -0.110474  0.066928   
1  0.085102 -0.255425  ...   -0.225775 -0.638672  0.101288 -0.339846   
2  0.247676 -1.514654  ...    0.247998  0.771679  0.909412 -0.689281   
3  0.377436 -1.387024  ...   -0.108300  0.005274 -0.190321 -1.175575   
4 -0.270533  0.817739  ...   -0.009431  0.798278 -0.137458  0.141267   

        V25       V26       V27       V28  Amount  Class  
0  0.128539 -0.189115  0.133558 -0.021053  149.62      0  
1  0.167170  0.125895 -0.008983  0.014724    2.69      0  
2 -0.327642 -0.139097 -0.055353 -0.059752  378.66      0  
3  0.647376 -0.221929  0.062723  0.061458  123.50      0  
4 -0.206010  0.502292  0.219422  0.215153   69.99      0  

[5 rows x 31 columns]

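As the output shows, the preview contains 5 rows and 31 columns:

Time: according to the public description of this dataset, this is the number of seconds elapsed between each transaction and the first transaction in the dataset. It is not very informative for this exercise.
V1, V2, V3 ... V28 are features 1 to 28. They carry values but no stated meaning, so this is not the raw data. To protect user privacy, the original data was transformed (compressed) before release, leaving the anonymized, ready-to-use features V1 ... V28.
Amount is the transaction amount. It varies over a much wider range than the V features, so this column will be preprocessed shortly.
Class: 0 marks a normal sample and 1 marks an abnormal sample. The data is usually split into two parts: X, the feature data, and y, the label data, which here is the Class column.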

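Note: when you receive a dataset, be clear about the task you are solving. Here the task is credit card fraud detection. The data contains both normal and problematic transactions, so we can frame it as class 0 (normal) versus class 1 (abnormal). Classifying each sample as 0 or 1 is a binary classification problem.

How do we build a model from these already extracted features? We will build one with logistic regression.

First, look at how the samples are distributed. Normal records are usually common and abnormal ones rare: typically something like 99.9% of the data is normal and only about 0.1% involves fraud or some other anomaly. So in this kind of detection task the vast majority of samples are normal and only a tiny fraction are abnormal.

Let's check how large the ratio between the two classes actually is.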

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("creditcard.csv")
# Count the rows in each class, ordered by class label (0, then 1)
count_classes = data['Class'].value_counts().sort_index()
count_classes.plot(kind="bar")
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()

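The output is a bar chart of the number of samples per class.

The chart makes the imbalance obvious: there are roughly 280,000 normal samples (class 0) and very few abnormal samples (class 1), only a few hundred.

Points to know here:
1) value_counts(sort=True, ascending=False, normalize=False, bins=None, dropna=True)

value_counts returns a Series whose index holds the unique values and whose values are their frequencies, sorted by count in descending order. The top-level form pd.value_counts(values, ...) is deprecated in recent pandas releases in favor of the Series method, data['Class'].value_counts().
Here it counts the distinct values in the 'Class' column: how many 0s there are and how many 1s.

2) pandas can also draw simple plots: count_classes.plot(kind="bar")

This draws a bar chart.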
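To make the behavior of value_counts concrete, here is a minimal sketch on a made-up label column (the `labels` Series is an assumption standing in for data['Class'], not the real data). It shows the absolute counts and, with normalize=True, the class proportions.

```python
import pandas as pd

# Toy labels standing in for data['Class']: 8 normal (0) and 2 abnormal (1).
labels = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 1, 0], name="Class")

# Absolute counts per class, ordered by class label rather than by count.
counts = labels.value_counts().sort_index()

# normalize=True returns proportions instead of counts.
ratios = labels.value_counts(normalize=True).sort_index()

print(counts)
print(ratios)
```

On the real data, the normalized form is a quick way to read off how small the share of class 1 is.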

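2. Solutions for imbalanced samples

What should we do when the classes are imbalanced, and in this dataset extremely imbalanced? There are two solutions, oversampling and undersampling. These are the two most commonly used approaches for imbalanced samples.

Concepts involved here:

Undersampling: when the data is imbalanced and we want it balanced, make class 0 and class 1 equally small. The counts above show about 280,000 samples of class 0 and only a few hundred of class 1, so we reduce class 0 to a few hundred as well. Undersampling makes both classes equally scarce.
Oversampling: with about 280,000 samples of class 0 and only a few hundred of class 1, class 0 is plentiful and class 1 is rare. We generate additional class-1 samples until there are as many of them as class-0 samples.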
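The two definitions can be sketched on toy labels as follows. This is only an illustration under stated assumptions: the label array `y` is made up, and the oversampling shown is plain random duplication of class-1 rows (drawing with replacement), which is a simpler stand-in for the "generate new samples" idea described above, not a data-generation algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Toy labels: 95 normal samples (0) and 5 abnormal samples (1).
y = np.array([0] * 95 + [1] * 5)
idx_normal = np.flatnonzero(y == 0)
idx_fraud = np.flatnonzero(y == 1)

# Undersampling: keep only as many class-0 rows as there are class-1 rows.
under_idx = np.concatenate([
    idx_fraud,
    rng.choice(idx_normal, size=len(idx_fraud), replace=False),
])

# Random oversampling: draw class-1 rows with replacement until both classes match.
over_idx = np.concatenate([
    idx_normal,
    rng.choice(idx_fraud, size=len(idx_normal), replace=True),
])

print(np.bincount(y[under_idx]))  # class counts after undersampling
print(np.bincount(y[over_idx]))   # class counts after oversampling
```

Undersampling shrinks the dataset to twice the minority count, while oversampling grows it to twice the majority count.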

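Which of the two approaches works a little better?

As mentioned earlier, the values in the Amount column vary widely, and some are very small. Before building a machine learning model, we should first make sure the features are on comparable scales. Take V28 and Amount as an example: in the preview above, the V28 values lie roughly within [-1, 1], while Amount ranges far more widely. A learning algorithm can be misled into treating features with large values as more important and features with small values as less important. We therefore bring the features onto comparable scales so the algorithm is not misled.

We have no reason to claim that either V28 or Amount is the more important feature, so Amount should be normalized or standardized. Rescaling Amount to a range such as [0, 1] or [-1, 1] would work, and so would standardizing it to zero mean and unit variance. The sklearn library provides a ready-made preprocessing module for this.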

The example code is:

from sklearn.preprocessing import StandardScaler
import pandas as pd 

data=pd.read_csv("creditcard.csv")
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time','Amount'],axis=1)
print(data.head())

The output is:

         V1        V2        V3        V4        V5        V6        V7  \
0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9       V10     ...           V21       V22       V23  \
0  0.098698  0.363787  0.090794     ...     -0.018307  0.277838 -0.110474   
1  0.085102 -0.255425 -0.166974     ...     -0.225775 -0.638672  0.101288   
2  0.247676 -1.514654  0.207643     ...      0.247998  0.771679  0.909412   
3  0.377436 -1.387024 -0.054952     ...     -0.108300  0.005274 -0.190321   
4 -0.270533  0.817739  0.753074     ...     -0.009431  0.798278 -0.137458   

        V24       V25       V26       V27       V28  Class  normAmount  
0  0.066928  0.128539 -0.189115  0.133558 -0.021053      0    0.244964  
1 -0.339846  0.167170  0.125895 -0.008983  0.014724      0   -0.342475  
2 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752      0    1.160686  
3 -1.175575  0.647376 -0.221929  0.062723  0.061458      0    0.140534  
4  0.141267 -0.206010  0.502292  0.219422  0.215153      0   -0.073403  

[5 rows x 30 columns]

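As the output shows, the feature data is now prepared.

Points to know here:

StandardScaler() is the standardization class.
fit_transform fits the scaler to the data and returns the transformed data.
The 'Amount' column of data is passed to StandardScaler().fit_transform() as a single-column 2-D array. In current pandas a Series has no reshape method, so the reshape goes through the underlying NumPy array: data['Amount'].values.reshape(-1, 1).
reshape() works as follows:
An array of shape (2, 3), that is 2 rows and 3 columns, becomes shape (3, 2) after reshape(-1, 2). There are 2*3 = 6 values in total, and with the column count fixed at 2, NumPy computes the row count for the -1 automatically. Here the new array has exactly 1 column, so we let NumPy work out the number of rows.
data = data.drop(['Time','Amount'],axis=1): the Time column is not used, and Amount has been converted into 'normAmount', so both columns are dropped.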
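The reshape and standardization steps can be checked on a handful of numbers. In this sketch the `amount` array is an assumption (five values copied from the preview, not the full column); it shows the shape produced by reshape(-1, 1) and that StandardScaler matches the manual formula (x - mean) / std with the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few amounts standing in for data['Amount'].
amount = np.array([149.62, 2.69, 378.66, 123.50, 69.99])

# reshape(-1, 1): exactly one column, row count inferred from the 5 values.
column = amount.reshape(-1, 1)
print(column.shape)

# A (2, 3) array reshaped with (-1, 2) becomes (3, 2).
reshaped = np.arange(6).reshape(2, 3).reshape(-1, 2)
print(reshaped.shape)

# StandardScaler subtracts the column mean and divides by the
# population standard deviation (ddof=0), which is NumPy's default.
scaled = StandardScaler().fit_transform(column)
manual = (column - column.mean()) / column.std()
print(np.allclose(scaled, manual))
```

Note that standardized values are not confined to [-1, 1]; they only have zero mean and unit variance.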

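3. Undersampling strategy

Next we balance the samples with both undersampling and oversampling. We start with undersampling: make class 0 and class 1 equally small. How do we do that?

Step 1:

First split the data into two parts, X and y.

X is the feature data. The leading : (colon) takes all rows, and data.columns != 'Class' takes every column except 'Class'. Together they give the feature data.
For y, the : (colon) again takes all rows, and data.columns == 'Class' takes only the 'Class' column, which serves as the label.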

X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

The output of X is:

         V1        V2        V3        V4        V5        V6        V7  \
0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9       V10     ...           V20       V21       V22  \
0  0.098698  0.363787  0.090794     ...      0.251412 -0.018307  0.277838   
1  0.085102 -0.255425 -0.166974     ...     -0.069083 -0.225775 -0.638672   
2  0.247676 -1.514654  0.207643     ...      0.524980  0.247998  0.771679   
3  0.377436 -1.387024 -0.054952     ...     -0.208038 -0.108300  0.005274   
4 -0.270533  0.817739  0.753074     ...      0.408542 -0.009431  0.798278   

        V23       V24       V25       V26       V27       V28  normAmount  
0 -0.110474  0.066928  0.128539 -0.189115  0.133558 -0.021053    0.244964  
1  0.101288 -0.339846  0.167170  0.125895 -0.008983  0.014724   -0.342475  
2  0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752    1.160686  
3 -0.190321 -1.175575  0.647376 -0.221929  0.062723  0.061458    0.140534  
4 -0.137458  0.141267 -0.206010  0.502292  0.219422  0.215153   -0.073403  

[5 rows x 29 columns]

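y is a single-column DataFrame containing the Class column.

Points to know here:
loc indexes by label, for example df.loc['a'], or df.loc[1] when 1 is a row label. iloc indexes by integer position only: df.iloc[0] is valid, while df.iloc['a'] is an error.
ix was a hybrid of the two, which made it ambiguous when the row labels were themselves integers such as 1. It was deprecated in pandas 0.20 and removed in pandas 1.0, so code that calls .ix no longer runs on current versions.
To avoid mistakes, the advice is:
- 1. When indexing by row position, use iloc.
- 2. When indexing by label, use loc.
- 3. Do not use ix.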
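The difference between label and position is easiest to see when the integer labels are deliberately out of order. The small frame below is made up for illustration (its column names merely mimic the dataset); it also shows the boolean column mask used for the X / y split.

```python
import pandas as pd

# Integer row labels that deliberately differ from the row positions.
df = pd.DataFrame({"V1": [0.1, 0.2, 0.3], "Class": [0, 1, 0]}, index=[2, 0, 1])

by_label = df.loc[0, "Class"]   # row whose label is 0 (the second row)
by_position = df.iloc[0, 1]     # first row, second column

# Boolean masks over the column names select features and label.
features = df.loc[:, df.columns != "Class"]
labels = df.loc[:, df.columns == "Class"]

print(by_label, by_position)
print(list(features.columns), list(labels.columns))
```

With a default 0..n-1 index the two lookups coincide, which is why mixing them up often goes unnoticed until the index changes.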

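Step 2:

Next, make class 0 as small as class 1. Because the two classes are extremely imbalanced, we first need to know how many class-1 samples there are.

Count the samples with Class == 1.
fraud_indices = np.array(data[data.Class == 1].index) extracts the index labels of the samples with Class == 1.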

number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

The value of number_records_fraud is: 492
The value of fraud_indices is:

[   541    623   4920   6108   6329   6331   6334   6336   6338   6427
   6446   6472   6529   6609   6641   6717   6719   6734   6774   6820
   6870   6882   6899   6903   6971   8296   8312   8335   8615   8617
......
 258403 261056 261473 261925 262560 262826 263080 263274 263324 263877
 268375 272521 274382 274475 275992 276071 276864 279863 280143 280149
 281144 281674]
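As a small illustration of the boolean filter used in this step, the sketch below runs the same two expressions on a five-row toy frame (the frame `toy` is an assumption made up for the example).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"V1": [0.5, -1.2, 0.3, 2.0, -0.7],
                    "Class": [0, 1, 0, 0, 1]})

# toy.Class == 1 is a boolean mask; indexing with it keeps the matching rows.
fraud_rows = toy[toy.Class == 1]

n_fraud = len(fraud_rows)                # number of class-1 samples
fraud_idx = np.array(fraud_rows.index)   # their index labels as a NumPy array

print(n_fraud)
print(fraud_idx)
```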

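Step 3:

Next we need to select class-0 rows at random, so first extract the index labels of the samples with Class == 0. The random selection is then made from these labels.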

normal_indices = data[data.Class == 0].index

The value of normal_indices is:

Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            284797, 284798, 284799, 284800, 284801, 284802, 284803, 284804,
            284805, 284806],
           dtype='int64', length=284315)

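Step 4:

Randomly select from the index labels of the samples with Class == 0.

np.random.choice(normal_indices, number_records_fraud, replace = False) performs the random selection. The first argument is the pool to select from, the second is how many items to select, and the third says whether to sample with replacement. Here it is False, so no index is picked twice.
random_normal_indices = np.array(random_normal_indices) takes the selected index values and converts them to an array.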

random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)

The value of random_normal_indices is:

[116549   8246  58928 195831  24378 155470   2928 196485   1263 154569
 124579  73439 134119 166158 115654   6207 125430 272877 172843 227905
  98385  64250  42926 282628 103786  20825 232324   2970  38334 121091
......
  33946 109597  26740 209036 233898 109418 167046 153046  30701 251630
  95018  59052 231435 212688  47047 201302  82762  30588 132329 228857
 227541  53627]

Because the selection is random, another run gives a different random_normal_indices, for example:

[156959  95974 213308 194594 161193  85394   8111 251709 225081  78327
 284065 140468  41842  50913 237823 278143 221350 153415 167686 102469
  56223 143083 251190 144994 172825 261161 237031 122866 174099  13173
......
 147350 109878  13833  33751 284747 254240  49823  43045  42083 173280
 204803  33121 247500 205138 107235 150955 107497 124359 133047 106351
 250517 109529]
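The effect of the replace argument can be checked directly. This sketch uses NumPy's newer Generator API (np.random.default_rng) instead of the legacy np.random.choice call used in the article, with a fixed seed so runs are repeatable; the `candidates` array is a made-up stand-in for normal_indices.

```python
import numpy as np

rng = np.random.default_rng(42)    # seed chosen arbitrarily for repeatability
candidates = np.arange(100, 120)   # 20 values standing in for normal_indices

# Without replacement: every selected value is distinct.
without = rng.choice(candidates, size=5, replace=False)

# With replacement: drawing 50 from a pool of 20 must repeat some values.
with_repl = rng.choice(candidates, size=50, replace=True)

print(len(np.unique(without)))
print(len(np.unique(with_repl)))

# Asking for more distinct items than the pool holds is an error.
try:
    rng.choice(candidates, size=len(candidates) + 1, replace=False)
except ValueError as err:
    print("ValueError:", err)
```

Fixing the seed is optional, but it makes the undersampled set reproducible, which the two differing outputs above show is otherwise not the case.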

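Step 5:

Next, merge the two NumPy arrays. One holds the index labels of the class-1 samples and the other holds the selected index labels of the class-0 samples. We put the two groups together.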

under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

The value of under_sample_indices is:

[   541    623   4920   6108   6329   6331   6334   6336   6338   6427
   6446   6472   6529   6609   6641   6717   6719   6734   6774   6820
   6870   6882   6899   6903   6971   8296   8312   8335   8615   8617
......
 216734 133129  61895 276555  36672  59561 199743 216279 277491  50834
  78442  28713 128264 202477 202553 111711 239937 113963 273641  37351
 136787 192156 225688 239756]

How np.concatenate() works:

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
np.concatenate((a, b), axis=0)
    array([[1, 2],
           [3, 4],
           [5, 6]])
np.concatenate((a, b.T), axis=1)
    array([[1, 2, 5],
           [3, 4, 6]])
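The example above uses 2-D arrays. The index arrays in this step are 1-D, where concatenation simply appends one array to the other. A minimal sketch, using a few index values taken from the outputs shown earlier purely as illustration:

```python
import numpy as np

fraud_idx = np.array([541, 623, 4920])           # a few class-1 index labels
normal_idx = np.array([116549, 8246, 58928])     # a few sampled class-0 labels

# For 1-D inputs the result is a single 1-D array, first array first.
combined = np.concatenate([fraud_idx, normal_idx])

print(combined)
print(combined.shape)
```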

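Step 6:

With the class-1 and class-0 index labels merged, use them to select the corresponding rows from the original data.

under_sample_data is the data after undersampling. It can be split again into two parts, X_undersample and y_undersample, which hold the features and the label respectively.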

# The collected values are index labels, so select with .loc
under_sample_data = data.loc[under_sample_indices, :]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

The output of under_sample_data is:

              V1        V2         V3        V4        V5        V6  \
541    -2.312227  1.951992  -1.609851  3.997906 -0.522188 -1.426545   
623    -3.043541 -3.157307   1.088463  2.288644  1.359805 -1.064823   
4920   -2.303350  1.759247  -0.359745  2.330243 -0.821628 -0.075788   
......   
224240 -2.706990  0.263766   0.881609  4.615918 -0.637164  1.079700   
99902  -1.284746  0.320887   0.374802 -0.664525  0.433880 -0.653042   
266861 -0.732211  0.644729   1.441944 -0.919455  0.099726  0.190200   

               V7        V8        V9        V10     ...           V21  \
541     -2.537387  1.391657 -2.770089  -2.772272     ...      0.517232   
623      0.325574 -0.067794 -0.270953  -0.838587     ...      0.661696   
4920     0.562320 -0.399147 -0.238253  -1.525412     ...     -0.294166   
..... 
224240   1.342084 -0.001033 -2.052804   2.006310     ...     -0.664714   
99902    1.839785 -0.300557 -0.448643  -0.547502     ...     -0.387583   
266861   0.331085  0.057795  0.726765   0.194685     ...      0.150547   

             V22       V23       V24       V25       V26       V27       V28  \
541    -0.035049 -0.465211  0.320198  0.044519  0.177840  0.261145 -0.143276   
623     0.435477  1.375966 -0.293803  0.279798 -0.145362 -0.252773  0.035764   
4920   -0.932391  0.172726 -0.087330 -0.156114 -0.542628  0.039566 -0.153029   
.....   
224240 -1.410437 -0.740022  0.002338 -0.443197 -0.127586 -0.006104 -0.351641   
99902  -1.019331  0.438280 -0.492716  0.063672  0.038212  0.248317  0.105015   
266861  0.851754 -0.286176 -0.427786 -0.323472  0.528626  0.173683  0.050960   

        Class  normAmount  
541         1   -0.353229  
623         1    1.761758  
4920        1    0.606031  
.....
224240      0    1.082084  
99902       0    0.418201  
266861      0   -0.316447  

[984 rows x 30 columns]

The output of X_undersample is:

              V1        V2         V3        V4        V5        V6  \
541    -2.312227  1.951992  -1.609851  3.997906 -0.522188 -1.426545   
623    -3.043541 -3.157307   1.088463  2.288644  1.359805 -1.064823   
.....
244538 -0.454182  1.086067  -2.044538 -1.541637  2.704735  3.207766   
160568  1.994042 -1.245946  -0.750254 -1.051604 -0.787568  0.144334   
161729  2.063641 -0.699814  -1.287431 -1.702303 -0.772118 -1.765008   

               V7        V8        V9        V10     ...           V20  \
541     -2.537387  1.391657 -2.770089  -2.772272     ...      0.126911   
623      0.325574 -0.067794 -0.270953  -0.838587     ...      2.102339   
4920     0.562320 -0.399147 -0.238253  -1.525412     ...     -0.430022   
......
244538   0.045364  1.256304 -0.136446  -0.176591     ...      0.064808   
160568  -1.005092  0.108401 -0.166269   0.835084     ...      0.191989   
161729  -0.018834 -0.374259  2.478122  -1.316819     ...     -0.286771   

             V21       V22       V23       V24       V25       V26       V27  \
541     0.517232 -0.035049 -0.465211  0.320198  0.044519  0.177840  0.261145   
623     0.661696  0.435477  1.375966 -0.293803  0.279798 -0.145362 -0.252773   
4920   -0.294166 -0.932391  0.172726 -0.087330 -0.156114 -0.542628  0.039566   
......
244538  0.377806  1.175611 -0.031673  0.734231 -0.681844 -0.191997  0.509245   
160568  0.023440 -0.178660  0.281183  0.255687 -0.481238 -0.519921 -0.012572   
161729  0.149171  0.853296  0.014144  0.448867  0.360000 -0.665084  0.042658   

             V28  normAmount  
541    -0.143276   -0.353229  
623     0.035764    1.761758  
4920   -0.153029    0.606031  
......
244538  0.313375   -0.349671  
160568 -0.032987    0.026589  
161729 -0.053884   -0.313448  

[984 rows x 29 columns]

The output of y_undersample is:

        Class
541         1
623         1
4920        1
......
126998      0
147121      0
22781       0

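Step 7:

Print a summary. The output shows that the undersampled data has 984 rows, 50% of them class 1 and 50% class 0. Undersampling has turned the imbalanced 0/1 data into balanced data. It comes at a cost, though: the class-0 rows are chosen at random, so most of the available data is never used.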

print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

The output is:

Percentage of normal transactions:  0.5
Percentage of fraud transactions:  0.5
Total number of transactions in resampled data:  984
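Steps 2 to 6 can also be wrapped into one helper. The function below is a sketch under stated assumptions, not the article's code: the name `undersample`, its parameters, and the synthetic frame `toy` are all made up for illustration, and it uses a seeded NumPy Generator so the result is reproducible.

```python
import numpy as np
import pandas as pd


def undersample(df, label_col="Class", minority=1, seed=0):
    """Keep all minority rows plus an equal number of randomly chosen other rows."""
    rng = np.random.default_rng(seed)
    minority_idx = df.index[df[label_col] == minority].to_numpy()
    majority_idx = df.index[df[label_col] != minority].to_numpy()
    # Draw as many majority labels as there are minority rows, without repeats.
    picked = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([minority_idx, picked])
    # The values in keep are index labels, so select with .loc.
    return df.loc[keep]


# Synthetic stand-in for the credit card data: 20 abnormal, 980 normal rows.
toy = pd.DataFrame({
    "V1": np.random.default_rng(1).normal(size=1000),
    "Class": [1] * 20 + [0] * 980,
})

balanced = undersample(toy)
print(len(balanced))
print(balanced["Class"].value_counts(normalize=True))
```

On this toy frame 960 of the 980 normal rows are discarded, which makes the cost mentioned in step 7 easy to see.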

The complete undersampling code is:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("creditcard.csv")
# Standardize Amount; .values is needed because a Series has no reshape method
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)

# Split into features X and label y (.ix was removed from pandas; use .loc)
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

# Count the class-1 samples and collect their index labels
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Randomly pick the same number of class-0 samples, without replacement
normal_indices = data[data.Class == 0].index
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)
print(random_normal_indices)

# Merge both sets of index labels and select those rows
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
under_sample_data = data.loc[under_sample_indices, :]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

Think about it: could undersampling have hidden problems? Since so little of the data is kept, some problems are bound to arise.