Unit 8: Spam SMS Detection

2018-06-11, by 巴拉巴拉_9515

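Chapter 8 of 《Web安全之深度学习实战》 (Web Security: Deep Learning in Practice), on spam SMS detection, presents at least four ways to extract features from text. Each one turns the messages into a vocabulary and feature vectors that are then used to train a model.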

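1. Feature extraction

(1) Term frequency table

Use CountVectorizer to count how often each word occurs in each SMS. The result is a vocabulary (word to column index) and a matrix of per-message word counts.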
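A minimal sketch of this step with scikit-learn, assuming three made-up messages rather than the book's SMS dataset. The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages, invented for illustration (not taken from the SMS dataset).
messages = [
    "win a free prize now",
    "are you free for lunch",
    "free free prize",
]

vectorizer = CountVectorizer()
# Sparse matrix: one row per message, one column per vocabulary word.
counts = vectorizer.fit_transform(messages)

# vocabulary_ maps each word to its column index in the count matrix.
vocabulary = vectorizer.vocabulary_
free_column = vocabulary["free"]
free_counts = counts[:, free_column].toarray().ravel().tolist()

print(sorted(vocabulary))
print(free_counts)  # how many times "free" occurs in each message
```

Note that the default token pattern only keeps tokens of two or more characters, so the single-letter word "a" does not enter the vocabulary.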

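(2) Weighting

In a large text corpus, some words occur very often but carry little meaning (in English, for example, "the", "a" and "is"). If the raw counts are fed directly to a classifier, these very frequent terms overshadow the frequencies of rarer but more informative terms.
TfidfTransformer takes the counts produced by CountVectorizer and re-weights them as TF-IDF values.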
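The re-weighting can be sketched as follows, again on invented messages. With scikit-learn's default settings (smooth_idf=True), a word that occurs in every message gets the lowest possible IDF of 1.0, and rarer words get larger values.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy messages, invented for illustration: "call" occurs in all of them.
messages = [
    "call me now",
    "call now to win",
    "call mum",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)

# Defaults: smooth_idf=True and L2 normalisation of each row.
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)

# Pair each word with the IDF weight of its column.
idf = {word: float(transformer.idf_[col])
       for word, col in vectorizer.vocabulary_.items()}

for word in ("call", "now", "mum"):
    print(word, round(idf[word], 3))
```

The frequent word "call" ends up with a smaller weight than "now", which in turn weighs less than "mum", the rarest of the three.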

(3) Adding n-grams

In CountVectorizer, enable 3-grams and set token_pattern=r'\b\w+\b', so that every run of three consecutive words becomes one feature. Part of the resulting vocabulary:

{'u dun wan': 852,
'customer service representative': 323,
'service representative freephone': 742,
'representative freephone 0808': 720,
'won guaranteed ...m': 667,
'nokia 7250i win': 586,
······
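A small sketch of how such a vocabulary can be produced. It assumes the 3-gram setting corresponds to ngram_range=(3, 3), which the text does not spell out, and it uses two invented messages. The token pattern keeps single-character tokens such as "u", which the default pattern would drop.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages, invented for illustration.
messages = [
    "please call customer service representative now",
    "i can call u now",
]

# Assumption: "3-gram" means only trigrams, i.e. ngram_range=(3, 3).
# The raw string matters: without r'', '\b' would be a backspace character.
vectorizer = CountVectorizer(ngram_range=(3, 3), token_pattern=r"\b\w+\b")
vectorizer.fit(messages)

trigrams = sorted(vectorizer.vocabulary_)
for trigram in trigrams:
    print(trigram)
```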

(4) Vocabulary processing

Use VocabularyProcessor to build a vocabulary and convert each text into a sequence of word IDs.
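The idea behind this step can be illustrated without the library. The sketch below is a plain-Python stand-in, not VocabularyProcessor itself: the helper names build_vocabulary and to_id_sequence are hypothetical, and reserving ID 0 for padding and unknown words is an assumption made for the example.

```python
def build_vocabulary(texts):
    """Give each distinct word an ID starting at 1 (0 is reserved, see below)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def to_id_sequence(text, vocab, max_length):
    """Map words to IDs, then truncate or pad with 0 to a fixed length."""
    ids = [vocab.get(word, 0) for word in text.lower().split()]  # 0 = unknown word
    ids = ids[:max_length]
    return ids + [0] * (max_length - len(ids))                   # 0 = padding


# Toy messages, invented for illustration.
messages = ["free prize now", "call me now please"]
vocab = build_vocabulary(messages)
sequences = [to_id_sequence(m, vocab, max_length=5) for m in messages]
print(sequences)
```

Fixed-length ID sequences like these are the usual input for embedding layers and sequence models, whereas the count and TF-IDF features above discard word order.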

2. SMS classification

(1) Naive Bayes classification

A naive Bayes classifier takes only a few lines of code.

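from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB

def do_nb_wordbag(x_train, x_test, y_train, y_test):
    # GaussianNB needs dense arrays; convert sparse features with .toarray() first.
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))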

do_nb_wordbag(x_train, x_test, y_train, y_test)

             precision    recall  f1-score   support
avg / total       0.90      0.79      0.82      2230

[[1471  453]
 [  22  284]]

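The report shows a weighted-average precision of 0.90 for naive Bayes. Overall accuracy, read from the confusion matrix, is lower: (1471 + 284) / 2230, or about 79%.
Pros and cons: fast.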
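The 0.90 in the report and the confusion matrix can be reconciled with a short calculation. The sketch below recomputes accuracy and support-weighted precision from the matrix printed above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes. The helper name summarize is hypothetical.

```python
def summarize(confusion):
    """Return (accuracy, support-weighted precision) for a confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[k][k] for k in range(n_classes)) / total

    weighted_precision = 0.0
    for k in range(n_classes):
        predicted_k = sum(row[k] for row in confusion)  # column sum
        support_k = sum(confusion[k])                   # row sum
        precision_k = confusion[k][k] / predicted_k if predicted_k else 0.0
        weighted_precision += precision_k * support_k / total
    return accuracy, weighted_precision


# Naive Bayes confusion matrix from the run above.
nb_confusion = [[1471, 453], [22, 284]]
accuracy, weighted_precision = summarize(nb_confusion)
print(round(accuracy, 2), round(weighted_precision, 2))
```

The weighted precision rounds to 0.90 and the accuracy to 0.79, which matches the precision and recall columns of the report. The same helper can be applied to the other confusion matrices in this post.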

(2) SVM classification

The training code for a support vector machine (SVM) is just as simple.

from sklearn import metrics, svm

def do_svm_doc2vec(x_train, x_test, y_train, y_test):
    print("SVM and doc2vec")
    # SVC with default settings (RBF kernel).
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

'precision', 'predicted', average, warn_for)
avg / total 0.74 0.86 0.80 2230
0.862780269058296

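Pros and cons: slower than naive Bayes. Other sources report that SVM detects spam SMS better than naive Bayes, but the fit here is poor. The accuracy of 0.8628 is exactly the share of the majority class in the test set (1924 of 2230 messages, per the confusion matrices of the other models), and the truncated warning in the output mentions 'precision' and 'predicted'. Both suggest that the classifier assigned every message to the majority class.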

(3) Random forest classification

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

def do_rf_word2vec(x_train, x_test, y_train, y_test):
    print("rf and word2vec")
    # Random forest with 50 trees.
    clf = RandomForestClassifier(n_estimators=50)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

0.9789237668161435
[[1921 3]
[ 44 262]]

Pros: in this run random forest was fast (its running time falls between naive Bayes and SVM) and detected spam well, with an accuracy of 97.9%.

(4) XGBoost classification

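import xgboost as xgb
from sklearn import metrics
from sklearn.metrics import classification_report

def do_xgboost_word2vec(x_train, x_test, y_train, y_test):
    print("xgboost and word2vec")
    # Gradient-boosted trees with XGBoost's default hyperparameters.
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))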

             precision    recall  f1-score   support
avg / total       0.96      0.95      0.95      2230

[[1923    1]
 [ 101  205]]

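Pros and cons: relatively slow (slower even than SVM). Weighted-average precision is 0.96 and accuracy, read from the confusion matrix, is about 95.4% ((1923 + 205) / 2230). Overall the classification is fairly good, although 101 of the 306 minority-class messages are misclassified.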

(5) MLP classification

from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.neural_network import MLPClassifier

def do_dnn_wordbag(x_train, x_test, y_train, y_test):
    print("MLP and wordbag")
    # Small multilayer perceptron: two hidden layers with 5 and 2 units.
    clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                        hidden_layer_sizes=(5, 2), random_state=1)
    print(clf)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

             precision    recall  f1-score   support
avg / total       0.97      0.97      0.97      2230

[[1889   35]
 [  27  279]]

Pros and cons: fast (about the same running time as random forest), with an accuracy of 97%.

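(6) Neural network classification

The model takes too long to run, and the computer's fan spins at full speed every time, so I left it out.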

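3. Model comparison

Naive Bayes: fast; weighted precision 0.90, accuracy about 79% (from the confusion matrix).
SVM: slower; accuracy 86%.
Random forest: fairly fast (between naive Bayes and SVM); accuracy 97.9%.
XGBoost: relatively slow (slower even than SVM); weighted precision 0.96, accuracy about 95% (from the confusion matrix).
MLP: fast (about the same running time as random forest); accuracy about 97%.
Neural network: too slow to run.

Random forest or MLP is therefore the better choice for detecting spam SMS at scale.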
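Since the comparison rests on both speed and accuracy, it helps to measure the two side by side in one loop. The sketch below does this on synthetic data from make_classification, used as a stand-in because the SMS features are not reproduced here; the class imbalance of roughly 86% to 14% is an assumption chosen to resemble the test set above, and the timings it prints say nothing about the original experiment.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption: about 86% of samples in the majority class).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.86], random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

models = {
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "mlp": MLPClassifier(solver="lbfgs", alpha=1e-5,
                         hidden_layer_sizes=(5, 2), random_state=1),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    elapsed = time.perf_counter() - start  # training plus prediction time
    results[name] = (accuracy_score(y_test, y_pred), elapsed)

for name, (accuracy, seconds) in results.items():
    print(f"{name}: accuracy={accuracy:.3f}, time={seconds:.3f}s")
```

On imbalanced data, accuracy alone can flatter a model that mostly predicts the majority class, so printing the confusion matrix for each model alongside these numbers is a sensible addition.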

4. Summary

The author has published the code for this case study on GitHub.
