Small Pitfalls in Data Preprocessing
2019-03-08
Flamingjay
Today's task is to read data from a network intrusion detection dataset and preprocess it.
First, let's look at what the data looks like:
0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,150,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal,20
0,udp,other,SF,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,1,0.00,0.00,0.00,0.00,0.08,0.15,0.00,255,1,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal,15
0,tcp,private,S0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,123,6,1.00,1.00,0.00,0.00,0.05,0.07,0.00,255,26,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,neptune,19
0,tcp,http,SF,232,8153,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,5,5,0.20,0.20,0.00,0.00,1.00,0.00,0.00,30,255,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01,normal,21
0,tcp,http,SF,199,420,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,30,32,0.00,0.00,0.00,0.00,1.00,0.00,0.09,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal,21
0,tcp,private,REJ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,121,19,0.00,0.00,1.00,1.00,0.16,0.06,0.00,255,19,0.07,0.07,0.00,0.00,0.00,0.00,1.00,1.00,neptune,21
Naturally, the first step is to convert the text fields into numbers, and at the same time consolidate the labels. The dataset contains many fine-grained attack classes, but in practice we may only care about the major categories, such as DoS. The mapping to major categories is listed here:
attack_dict = {
'normal': 'normal',
'back': 'DoS', 'land': 'DoS', 'neptune': 'DoS', 'pod': 'DoS', 'smurf': 'DoS', 'teardrop': 'DoS',
'mailbomb': 'DoS', 'apache2': 'DoS', 'processtable': 'DoS', 'udpstorm': 'DoS',
'ipsweep': 'Probe', 'nmap': 'Probe', 'portsweep': 'Probe', 'satan': 'Probe', 'mscan': 'Probe', 'saint': 'Probe',
'ftp_write': 'R2L', 'guess_passwd': 'R2L', 'imap': 'R2L', 'multihop': 'R2L', 'phf': 'R2L', 'spy': 'R2L',
'warezclient': 'R2L', 'warezmaster': 'R2L', 'sendmail': 'R2L', 'named': 'R2L', 'snmpgetattack': 'R2L',
'snmpguess': 'R2L', 'xlock': 'R2L', 'xsnoop': 'R2L', 'worm': 'R2L',
'buffer_overflow': 'U2R', 'loadmodule': 'U2R', 'perl': 'U2R', 'rootkit': 'U2R',
'httptunnel': 'U2R', 'ps': 'U2R', 'sqlattack': 'U2R', 'xterm': 'U2R'
}
To accomplish the above, perform the following steps:
- Add column names to the dataframe
The CSV file does not come with column names, which is inconvenient in many places. To make operating on the dataframe easier, first add the column names.
import pandas as pd

kdd_train = pd.read_csv('NSLTrain.csv', header=None)
header_names = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment',
'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root',
'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login',
'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate',
'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'attack_type', 'success_pred']
kdd_train.columns = header_names
kdd_test = pd.read_csv('NSLTest.csv', header=None)
kdd_test.columns = header_names
- Merge the two dataframes; both must have column names
To make the later one-hot encoding easier, put the training set and test set together. Two dataframes can be merged with the concat function.
kdd = pd.concat([kdd_train, kdd_test])
Note that if you assign column names only to kdd_train and leave kdd_test without them, the data will badly misalign after concatenation, because kdd_test's column names default to the integers 0, 1, 2, ..., m; concat aligns on the union of column labels, so the two frames end up in disjoint columns full of NaN. Also note that concat defaults to axis=0, stacking along the row direction.
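A toy sketch of the pitfall (the two-column frames below are made up for illustration): concat aligns on column labels, so a named frame and an unnamed one do not line up.

```python
import pandas as pd

named = pd.DataFrame([[1, 2]], columns=['a', 'b'])
unnamed = pd.DataFrame([[3, 4]])      # columns default to 0, 1

# union of columns: a, b, 0, 1 — each row only fills half of them
bad = pd.concat([named, unnamed])
print(bad.shape)                      # (2, 4)
print(bad.isna().sum().sum())         # 4 NaN cells out of 8

unnamed.columns = ['a', 'b']          # give both frames the same names
good = pd.concat([named, unnamed])
print(good.shape)                     # (2, 2), no NaN
```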
- The apply function needs the assignment in front
Next, replace the labels, using the apply function together with the attack_dict key-value pairs. Remember that you must assign the result back — apply returns a new Series and does not modify the column in place. Assign, assign, assign...
kdd['attack_type'] = kdd['attack_type'].apply(lambda x: attack_dict[x] if (x in attack_dict.keys()) else x)
- Drop some attributes
Some attributes hold the same value for every row and carry no information for later steps, so we selectively remove them, here using the drop function.
kdd.drop(['success_pred', 'num_outbound_cmds'], axis=1, inplace=True)
Note that drop can be used in two ways, depending on whether the original data is modified.
If modifying the original in place is acceptable, pass inplace=True and do not assign the result — drop then operates directly on the original dataframe (and returns None), as shown above.
If not, drop returns a new dataframe, and you must capture it in an assignment. Assigning back to the same name, kdd = kdd.drop('success_pred', axis=1), does work — the name simply rebinds to the new dataframe — but the original data is then no longer reachable under that name. If you still need the unmodified version, assign to a new variable instead, e.g. kdd_dropped = kdd.drop('success_pred', axis=1).
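A minimal sketch of the two drop variants, on a made-up three-column frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Variant 1: in place — drop mutates df directly and returns None
df.drop('c', axis=1, inplace=True)
print(df.columns.tolist())    # ['a', 'b']

# Variant 2: no inplace — drop returns a new frame; the assignment is required
df2 = df.drop('b', axis=1)
print(df.columns.tolist())    # ['a', 'b'] — df itself is untouched
print(df2.columns.tolist())   # ['a']
```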
- Label shape conversion
The label is usually a single column; pulled out and converted to numpy, it is typically a vector of shape (m,), which is best reshaped into an (m, 1) array.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
kdd_label = kdd['attack_type'].values # convert to numpy
kdd_label = le.fit_transform(kdd_label)
kdd_label = kdd_label.reshape(-1, 1)
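A minimal sketch of the shape change. LabelEncoder.fit_transform returns a flat (m,) integer vector; here we mimic its output with a plain numpy array (the values are made up):

```python
import numpy as np

encoded = np.array([0, 2, 1, 2])   # stand-in for LabelEncoder output
print(encoded.shape)               # (4,) — a flat vector

column = encoded.reshape(-1, 1)    # -1 lets numpy infer m
print(column.shape)                # (4, 1) — a column array
```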
Those are the small pitfalls I stepped into during this experiment. If you've hit other ones, haha, you're welcome to share.