[Axu Machine Learning in Practice] [11] Text Classification in Practice: Using a Naive Bayes Model
2022-11-20
阿旭123
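The [Axu Machine Learning in Practice] series introduces machine learning algorithms and models together with hands-on case studies. Likes and follows are welcome; let's learn and discuss together.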
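This article shows how to use a naive Bayes model to classify messages as spam or ham. For the theory behind naive Bayes and its variants, see my previous article, "[Axu Machine Learning in Practice] [10] Naive Bayes: Principles and a Comparison of Three Variants (Gaussian, Multinomial, and Bernoulli Naive Bayes)".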
Text classification in practice
Reading the text data
import pandas as pd
# sep sets the field delimiter (tab here); header=None because the file has no header row
sms = pd.read_csv("../data/SMSSpamCollection",sep="\t",header=None)
sms
index | 0 | 1 |
---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... |
1 | ham | Ok lar... Joking wif u oni... |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
3 | ham | U dun say so early hor... U c already then say... |
4 | ham | Nah I don't think he goes to usf, he lives aro... |
5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
6 | ham | Even my brother is not like to speak with me. ... |
7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
8 | spam | WINNER!! As a valued network customer you have... |
9 | spam | Had your mobile 11 months or more? U R entitle... |
10 | ham | I'm gonna be home soon and i don't want to tal... |
11 | spam | SIX chances to win CASH! From 100 to 20,000 po... |
12 | spam | URGENT! You have won a 1 week FREE membership ... |
13 | ham | I've been searching for the right words to tha... |
14 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
15 | spam | XXXMobileMovieClub: To use your credit, click ... |
16 | ham | Oh k...i'm watching here:) |
17 | ham | Eh u remember how 2 spell his name... Yes i di... |
18 | ham | Fine if that’s the way u feel. That’s the way ... |
19 | spam | England v Macedonia - dont miss the goals/team... |
20 | ham | Is that seriously how you spell his name? |
21 | ham | I‘m going to try for 2 months ha ha only joking |
22 | ham | So ü pay first lar... Then when is da stock co... |
23 | ham | Aft i finish my lunch then i go str down lor. ... |
24 | ham | Ffffffffff. Alright no way I can meet up with ... |
25 | ham | Just forced myself to eat a slice. I'm really ... |
26 | ham | Lol your always so convincing. |
27 | ham | Did you catch the bus ? Are you frying an egg ... |
28 | ham | I'm back & we're packing the car now, I'll... |
29 | ham | Ahhh. Work. I vaguely remember that! What does... |
... | ... | ... |
5542 | ham | Armand says get your ass over to epsilon |
5543 | ham | U still havent got urself a jacket ah? |
5544 | ham | I'm taking derek & taylor to walmart, if I... |
5545 | ham | Hi its in durban are you still on this number |
5546 | ham | Ic. There are a lotta childporn cars then. |
5547 | spam | Had your contract mobile 11 Mnths? Latest Moto... |
5548 | ham | No, I was trying it all weekend ;V |
5549 | ham | You know, wot people wear. T shirts, jumpers, ... |
5550 | ham | Cool, what time you think you can get here? |
5551 | ham | Wen did you get so spiritual and deep. That's ... |
5552 | ham | Have a safe trip to Nigeria. Wish you happines... |
5553 | ham | Hahaha..use your brain dear |
5554 | ham | Well keep in mind I've only got enough gas for... |
5555 | ham | Yeh. Indians was nice. Tho it did kane me off ... |
5556 | ham | Yes i have. So that's why u texted. Pshew...mi... |
5557 | ham | No. I meant the calculation is the same. That ... |
5558 | ham | Sorry, I'll call later |
5559 | ham | if you aren't here in the next <#> hou... |
5560 | ham | Anything lor. Juz both of us lor. |
5561 | ham | Get me out of this dump heap. My mom decided t... |
5562 | ham | Ok lor... Sony ericsson salesman... I ask shuh... |
5563 | ham | Ard 6 like dat lor. |
5564 | ham | Why don't you wait 'til at least wednesday to ... |
5565 | ham | Huh y lei... |
5566 | spam | REMINDER FROM O2: To get 2.50 pounds free call... |
5567 | spam | This is the 2nd time we have tried 2 contact u... |
5568 | ham | Will ü b going to esplanade fr home? |
5569 | ham | Pity, * was in mood for that. So...any other s... |
5570 | ham | The guy did some bitching but I acted like i'd... |
5571 | ham | Rofl. Its true to its name |
5572 rows × 2 columns
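One thing worth checking when reading raw text this way is how the parser treats double quotes inside messages. The sketch below uses a hypothetical three-line sample (not taken from the data set) in the same tab-separated layout. The column names `label` and `text` and the `quoting=csv.QUOTE_NONE` option are my own additions for illustration, not part of the original code.

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same layout as the file: label<TAB>message.
# The first message opens a double quote that only closes on the second line.
raw = 'ham\t"Hello there\nspam\tWin cash now"\nham\tSee you\n'

# Default quoting: everything between the two quotes is read as one field,
# so the first two lines collapse into a single row.
default_df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# QUOTE_NONE treats double quotes as ordinary characters;
# names= gives the two columns readable labels instead of 0 and 1.
literal_df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["label", "text"],
    quoting=csv.QUOTE_NONE,
)

print(len(default_df), len(literal_df))
print(literal_df["label"].value_counts())
```

If the row count looks lower than expected for a text file, comparing the two calls like this is a quick way to see whether quoting is the cause.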
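Extracting features and labels
# Column 1 holds the message text (kept as a one-column DataFrame)
data = sms[[1]]
# Column 0 holds the ham/spam labels; scikit-learn expects a 1-D y, so take a Series
target = sms[0]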
data.shape
(5572, 1)
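Converting the text to a sparse matrix
Text data is usually vectorized before modeling: the words in each string are mapped to floating-point weights, and the result is stored as a sparse matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
# This transformer turns a collection of strings into a sparse TF-IDF matrix
tf = TfidfVectorizer()
# Fit: learn the vocabulary, i.e. tell tf which words occur in the messages
tf.fit(data[1])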
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
stop_words=None, strip_accents=None, sublinear_tf=False,
token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
vocabulary=None)
# Transform: turn the 5572 messages into a 5572 x (vocabulary size) sparse matrix
data = tf.transform(data[1])
data
# The result is a 5572 x 8713 sparse matrix: the 5572 messages contain 8713 distinct tokens
<5572x8713 sparse matrix of type '<class 'numpy.float64'>'
with 74169 stored elements in Compressed Sparse Row format>
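To make the fit/transform step easier to picture, here is a minimal sketch on a hypothetical three-message corpus (the strings are made up, not from the SMS data). It only uses `TfidfVectorizer` with its default settings, as in the code above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, not taken from the SMS data set
docs = ["free cash prize", "call me later", "free call now"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # fit and transform in one step

# One row per message, one column per distinct token
print(X.shape)

# Tokens in column order (columns are assigned alphabetically)
print(sorted(vec.vocabulary_))

# Dense view of the weights; with the default norm='l2' every row has unit length
dense = X.toarray()
print(dense.round(2))
print(np.linalg.norm(dense, axis=1))
```

Words that appear in several messages (here `free` and `call`) get a lower weight than words that appear in only one, which is the point of the IDF part.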
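Training the model
from sklearn.naive_bayes import BernoulliNB
# b_NB is never defined in the original; BernoulliNB is assumed here from the variable name
b_NB = BernoulliNB()
b_NB.fit(data,target)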
message = ["Confidence doesn't need any specific reason. If you're alive , you should feel 100 percent confident.",
"Avis is only NO.2 in rent a cars.SO why go with us?We try harder.",
"SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info"
]
Prediction
# Convert message to a sparse matrix with the already-fitted vectorizer
x_test = tf.transform(message)
b_NB.predict(x_test)
array(['ham', 'ham', 'spam'],
dtype='<U4')
b_NB.score(data,target)
0.98815506101938266
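The score above is measured on the same messages the model was trained on, so it is likely to be optimistic. A minimal sketch of a held-out evaluation follows. The toy messages, the 25% test split, and the use of `make_pipeline` are my assumptions for illustration; on the real data you would pass the message column and labels instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages standing in for the SMS data (not from the data set)
texts = [
    "win cash prize now", "free entry win cash", "urgent claim your free prize",
    "free cash call now", "claim prize urgent reply", "win free membership now",
    "see you at lunch", "call me when home", "are you coming tonight",
    "ok talk later", "home soon see you", "lunch at home later",
]
labels = ["spam"] * 6 + ["ham"] * 6

# Hold out a quarter of the messages; stratify keeps the ham/spam ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits the vectorizer on the training messages only,
# so no vocabulary information leaks in from the test messages
model = make_pipeline(TfidfVectorizer(), BernoulliNB())
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))
```

With only a dozen toy messages the numbers themselves mean little; the point is the pattern of splitting first and fitting the vectorizer inside the pipeline.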
Using multinomial naive Bayes
from sklearn.naive_bayes import MultinomialNB
m_NB = MultinomialNB()
m_NB.fit(data,target)
m_NB.score(data,target)
0.97613065326633164
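Using Gaussian naive Bayes
from sklearn.naive_bayes import GaussianNB
g_NB = GaussianNB()
# GaussianNB requires a dense array, so convert the sparse matrix with toarray()
g_NB.fit(data.toarray(),target)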
g_NB.score(data.toarray(),target)
0.94149318018664752
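Calling `toarray()` for the Gaussian model has a memory cost that is easy to overlook. The rough arithmetic below uses the shape and stored-element count reported for the matrix earlier in this article; the 32-bit index size for the CSR format is my assumption.

```python
rows, cols = 5572, 8713   # matrix shape reported earlier
nnz = 74169               # stored (non-zero) elements reported earlier

# Dense float64 array: 8 bytes for every cell, zero or not
dense_bytes = rows * cols * 8

# CSR: one float64 value and one column index per stored element,
# plus a row-pointer array of length rows + 1 (32-bit indices assumed)
csr_bytes = nnz * 8 + nnz * 4 + (rows + 1) * 4

print(f"dense:   {dense_bytes / 1e6:.0f} MB")
print(f"CSR:     {csr_bytes / 1e6:.2f} MB")
print(f"density: {nnz / (rows * cols):.4%}")
```

Under these assumptions the dense array needs roughly 388 MB while the sparse matrix stays under 1 MB, which is fine for this data set but worth keeping in mind for larger corpora.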
If this was helpful, a like and a follow are much appreciated!
More practical content is on the way…