
Tianchi Learning Competition - NLP News Text Classification (3/6): Word Vectors + Machine Learning Models

2020-07-24  粉红狐狸_dhf

1 Understanding the Problem

2 Data Analysis

3 Word Vectors + Machine Learning Models

Word vectors: a general way of representing text as numbers or vectors that a computer can compute with. Converting variable-length text into a fixed-length space is the first step of text classification.
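As a quick illustration (a toy corpus of made-up tokens, not the competition data), here is a minimal sketch of how two texts of different lengths both end up as count vectors of the same fixed length:

from sklearn.feature_extraction.text import CountVectorizer

toy = ['57 44 66 56', '57 44 66 56 12 12 89']   # two texts of different lengths
vec = CountVectorizer()
X = vec.fit_transform(toy)
print(vec.get_feature_names_out())   # the shared fixed vocabulary
print(X.toarray())                   # both rows have the same length

(On scikit-learn versions before 1.0, use vec.get_feature_names() instead.)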

3.1 Count Vectors + RidgeClassifier

The ridge regressor has a classifier variant, RidgeClassifier; this classifier is sometimes called a least-squares support vector machine with a linear kernel.
An introduction to how RidgeClassifier works
An introduction to RidgeClassifier's parameters
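To make the least-squares description concrete, here is a rough sketch of the binary case (the helper ridge_classify is hypothetical; the real class also handles multiclass targets, class weights, and several solvers):

import numpy as np
from sklearn.linear_model import Ridge

def ridge_classify(X_train, y_train, X_test):
    # encode the two classes as -1 / +1 and solve a ridge *regression*
    classes = np.unique(y_train)
    t = np.where(y_train == classes[1], 1.0, -1.0)
    reg = Ridge(alpha=1.0).fit(X_train, t)
    # the sign of the regression output gives the predicted class
    return np.where(reg.predict(X_test) >= 0, classes[1], classes[0])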

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

# load the training set (tab-separated)
df = pd.read_csv(r'E:\jupyter_lab\天池\新闻文本分类\data\train\train_set.csv', sep='\t', encoding='utf8')
df.head()
# keep the 3000 most frequent tokens as the vocabulary
CountVec = CountVectorizer(max_features=3000)
train_text = CountVec.fit_transform(df.text)

# hold out 30% of the data for validation
x_train, x_val, y_train, y_val = train_test_split(train_text, df.label, test_size=0.3, random_state=0)

clf = RidgeClassifier()
clf.fit(x_train, y_train)

# evaluate with macro-averaged F1
val_pre = clf.predict(x_val)
score_f1 = f1_score(y_val, val_pre, average='macro')

print('CountVectorizer + RidgeClassifier : %.4f' % score_f1)

3.2 TF-IDF + RidgeClassifier

An introduction to TfidfVectorizer's parameters and attributes
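Under the hood, scikit-learn's default TF-IDF weighting (smooth_idf=True, L2 normalization) can be reproduced by hand; a minimal sketch on a toy corpus:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy = ['57 44 66', '57 57 12']
counts = CountVectorizer().fit_transform(toy).toarray().astype(float)

n = counts.shape[0]                       # number of documents
df_t = (counts > 0).sum(axis=0)           # document frequency of each term
idf = np.log((1 + n) / (1 + df_t)) + 1    # smoothed idf
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalize each row

print(np.allclose(tfidf, TfidfVectorizer().fit_transform(toy).toarray()))   # True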

from sklearn.feature_extraction.text import TfidfVectorizer
%%time
tfidf = TfidfVectorizer(ngram_range=(1,3),max_features=3000)
train_text_tfidf = tfidf.fit_transform(df.text)
Warning: this step takes a very long time to run!!

Split the dataset:

x_train_tfidf,x_val_tfidf,y_train_tfidf,y_val_tfidf = train_test_split(train_text_tfidf,df.label,test_size=0.3,random_state=0 )
%%time
clf = RidgeClassifier()
clf.fit(x_train_tfidf,y_train_tfidf)

val_pre_tfidf = clf.predict(x_val_tfidf)
score_f1_tfidf = f1_score(y_val_tfidf,val_pre_tfidf,average='macro')

print('TF-IDF + RidgeClassifier : %.4f' %score_f1_tfidf )

3.3 Count Vectors | TFIDF + MultinomialNB

Naive Bayes classification (NBC) is based on Bayes' theorem together with the assumption that features are conditionally independent. Given a training set, and taking independence between feature words as the premise, it learns the joint probability distribution from input to output; then, for an input X, the learned model outputs the Y that maximizes the posterior probability. MultinomialNB implements the naive Bayes algorithm for multinomially distributed data.
Naive Bayes classification (understanding the algorithm)
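A minimal sketch of that decision rule on dense count features (an illustration with Laplace smoothing; the real MultinomialNB accepts sparse input and is far more efficient):

import numpy as np

def multinomial_nb_predict(X_train, y_train, X_test, alpha=1.0):
    # predict argmax_y [ log P(y) + sum_i x_i * log theta_{y,i} ]
    classes = np.unique(y_train)
    log_prior = np.log([np.mean(y_train == c) for c in classes])
    log_theta = np.vstack([
        np.log((X_train[y_train == c].sum(axis=0) + alpha) /
               (X_train[y_train == c].sum() + alpha * X_train.shape[1]))
        for c in classes
    ])
    joint = X_test @ log_theta.T + log_prior   # unnormalized log-posterior
    return classes[joint.argmax(axis=1)]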

from sklearn.naive_bayes import MultinomialNB

Count vectors:

%%time
clf = MultinomialNB()
clf.fit(x_train,y_train)

val_pre_CountVec_NBC = clf.predict(x_val)
score_f1_CountVec_NBC = f1_score(y_val,val_pre_CountVec_NBC,average='macro')

print('CountVec + MultinomialNB : %.4f' %score_f1_CountVec_NBC )
TF-IDF:
%%time
clf = MultinomialNB()
clf.fit(x_train_tfidf,y_train_tfidf)

val_pre_tfidf_NBC = clf.predict(x_val_tfidf)
score_f1_tfidf_NBC = f1_score(y_val_tfidf,val_pre_tfidf_NBC,average='macro')

print('TF-IDF + MultinomialNB : %.4f' %score_f1_tfidf_NBC )
Comparing the results of the models:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

scores = [score_f1,score_f1_tfidf,score_f1_CountVec_NBC,score_f1_tfidf_NBC]
x_ticks = np.arange(4)
x_ticks_label = ['CountVec_RidgeClassifier','tfidf_RidgeClassifier','CountVec_NBC','tfidf_NBC']
plt.plot(x_ticks,scores)
plt.xticks(x_ticks, x_ticks_label, fontsize=8)  # set the tick-label font size
plt.ylabel('F1_score')
plt.show()
Summary: we ran the different classifiers on both kinds of word vectors; overall, TF-IDF vectors are somewhat more effective than count vectors, so from here on we experiment with TF-IDF only. After consulting other references, we go straight to the better-performing models.

3.4 TF-IDF + LinearSVC

How SVMs work
The difference between svm.LinearSVC and svm.SVC
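In short, LinearSVC (liblinear) scales well to large, sparse problems like this one, while SVC (libsvm) supports arbitrary kernels but trains in roughly quadratic time in the number of samples. A sketch of roughly comparable configurations (not identical: LinearSVC defaults to squared hinge loss and one-vs-rest, SVC to hinge loss and one-vs-one):

from sklearn.svm import LinearSVC, SVC

clf_fast = LinearSVC(C=1.0)              # liblinear: fast on large sparse data
clf_slow = SVC(kernel='linear', C=1.0)   # libsvm: kernels, but slow on many samples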

from sklearn.svm import LinearSVC
%%time
clf = LinearSVC()
clf.fit(x_train_tfidf,y_train_tfidf)

val_pre_tfidf_LSVC = clf.predict(x_val_tfidf)
score_f1_tfidf_LSVC = f1_score(y_val_tfidf,val_pre_tfidf_LSVC,average='macro')

print('TF-IDF + LinearSVC : %.4f' %score_f1_tfidf_LSVC )

3.5 TF-IDF + RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier
%%time
clf = RandomForestClassifier()
clf.fit(x_train_tfidf,y_train_tfidf)

val_pre_tfidf_RFC = clf.predict(x_val_tfidf)
score_f1_tfidf_RFC = f1_score(y_val_tfidf,val_pre_tfidf_RFC,average='macro')
Warning: this takes an extremely long time to run!
print('TF-IDF + RandomForestClassifier : %.4f' %score_f1_tfidf_RFC )
Model comparison
Comparing the results, LinearSVC wins outright on both running time and model accuracy. RandomForestClassifier fits second best, but its running time is simply too long. Next comes RidgeClassifier, and last MultinomialNB. There are other machine learning models we have not tried, but the ceiling for machine learning models here should be around 0.92-0.93. Next we tune the TfidfVectorizer parameters and see how much improvement that brings.

3.6 Tuning TfidfVectorizer's parameters

(1) ngram_range=(1,2), max_features=3000

%%time
tfidf_N2 = TfidfVectorizer(ngram_range=(1,2),max_features=3000)
train_text_tfidf_N2 = tfidf_N2.fit_transform(df.text)

x_train_tfidf,x_val_tfidf,y_train_tfidf,y_val_tfidf = train_test_split(train_text_tfidf_N2,df.label,test_size=0.3,random_state=0 )

clf = LinearSVC()
clf.fit(x_train_tfidf,y_train_tfidf)

val_pre_tfidf_LSVC = clf.predict(x_val_tfidf)
score_f1_tfidf_LSVC_N2 = f1_score(y_val_tfidf,val_pre_tfidf_LSVC,average='macro')

print('TF-IDF_N2 + LinearSVC : %.4f' %score_f1_tfidf_LSVC_N2 )

(2) ngram_range=(1,2), max_features=4000

%%time
tfidf_mf4000 = TfidfVectorizer(ngram_range=(1,2),max_features=4000)
train_text_tfidf_mf4000 = tfidf_mf4000.fit_transform(df.text)

x_train_tfidf,x_val_tfidf,y_train_tfidf,y_val_tfidf = train_test_split(train_text_tfidf_mf4000,df.label,test_size=0.3,random_state=0 )

clf = LinearSVC()
clf.fit(x_train_tfidf,y_train_tfidf)

val_pre_tfidf_LSVC = clf.predict(x_val_tfidf)
score_f1_tfidf_LSVC_mf4000 = f1_score(y_val_tfidf,val_pre_tfidf_LSVC,average='macro')

print('TF-IDF_mf4000 + LinearSVC : %.4f' % score_f1_tfidf_LSVC_mf4000)

(3) ngram_range=(1,3), max_features=4000

%%time
tfidf_N3_mf4000 = TfidfVectorizer(ngram_range=(1,3),max_features=4000)
train_text_tfidf_N3_mf4000 = tfidf_N3_mf4000.fit_transform(df.text)
Warning: this takes a very long time to run!
x_train_tfidf,x_val_tfidf,y_train_tfidf,y_val_tfidf = train_test_split(train_text_tfidf_N3_mf4000,df.label,test_size=0.3,random_state=0 )

clf = LinearSVC()
clf.fit(x_train_tfidf,y_train_tfidf)

val_pre_tfidf_LSVC = clf.predict(x_val_tfidf)
score_f1_tfidf_LSVC_N3_mf4000 = f1_score(y_val_tfidf,val_pre_tfidf_LSVC,average='macro')

print('TF-IDF_N3_mf4000 + LinearSVC : %.4f' % score_f1_tfidf_LSVC_N3_mf4000)
Parameter tuning summary
Since Chinese words mostly consist of two characters, I tried ngram_range=(1,2).
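Instead of editing parameters by hand, the same search can be written more systematically; a sketch using Pipeline and GridSearchCV (the grid mirrors the manual trials above, and this will be very slow on the full training set):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'tfidf__max_features': [3000, 4000],
}
search = GridSearchCV(pipe, grid, scoring='f1_macro', cv=3, n_jobs=-1)
search.fit(df.text, df.label)
print(search.best_params_, search.best_score_)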
