Chinese Text Processing: Word Segmentation, Punctuation Removal, Stop-Word Removal, and Part-of-Speech Tagging

2021-02-02 香菜那么好吃为什么不吃香菜

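This post uses Python to implement basic natural language processing for Chinese text: word segmentation, punctuation removal, stop-word removal, and part-of-speech (POS) tagging and filtering.

Each section below walks through the implementation of one step. At the end, the whole text-processing pipeline is wrapped into a TextProcess class.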

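Jieba word segmentation

jieba is a solid Chinese word segmentation library. Install it first with pip install jieba.

jieba offers three segmentation modes:

Full mode: scans out every word that can be formed from the sentence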

jieba.cut(text, cut_all=True)

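Accurate mode: cuts the sentence as precisely as possible; suited to text analysis

jieba.cut(text, cut_all=False)  # default mode

Search engine mode: builds on accurate mode by further splitting long words; suited to search-engine tokenization

jieba.cut_for_search(text)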

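[Figure: segmentation results of the three modes]

To learn more about the three modes, see the jieba documentation (jieba分词). Since my goal is text analysis, I use the default accurate mode.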

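Some words, such as "吃鸡", tend to be split by jieba into "吃" and "鸡" even though we want them kept together. The fix is to load a custom dictionary, dict.txt. Create this file, add "吃鸡" to it, and run the following code:

import jieba

file_userDict = 'dict.txt'  # custom dictionary
jieba.load_userdict(file_userDict)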

[Figure: segmentation before and after loading the custom dictionary]

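Part-of-speech tagging

Segmenting with posseg yields pairs, each made up of a word and a flag (its POS tag), which you can read with a for loop. For the meaning of each Chinese POS tag, see the POS tag table (词性标注表).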

import jieba.posseg as pseg
sentence = "酒店就在海边，去鼓浪屿很方便。"
words_pair = pseg.cut(sentence)
result = " ".join(["{0}/{1}".format(word, flag) for word, flag in words_pair])
print(result)

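Building on this, we can filter by part of speech and keep only words with particular tags. First, list the tags to keep in tag_filter. Then, for each word in the tagged sentence, add it to the result list if its tag matches. Here only nouns and verbs are kept.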

import jieba.posseg as pseg

filtered = []  # not named "list", which would shadow the built-in type
sentence = "人们宁愿去关心一个蹩脚电影演员的吃喝拉撒和鸡毛蒜皮，而不愿了解一个普通人波涛汹涌的内心世界"
tag_filter = ['n', 'v']  # POS tags to keep (exact match only)
seg_result = pseg.cut(sentence)  # yields pair objects with .word and .flag attributes
filtered.append(" ".join(s.word for s in seg_result if s.flag in tag_filter))
print("POS filtering done")
print(filtered)
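Note that an exact match on 'n' and 'v' drops more specific tags such as 'nr' (person name), 'ns' (place name), or 'vn' (verbal noun). If those should be kept as well, one option is to match on the tag prefix. The sketch below uses a hypothetical helper, filter_by_pos, and hand-written (word, flag) pairs so that it runs without jieba. The pairs are illustrative, not actual jieba output. The pairs yielded by pseg.cut unpack the same way.

```python
def filter_by_pos(pairs, keep_prefixes=("n", "v")):
    """Keep words whose POS flag starts with any of the given prefixes."""
    # str.startswith accepts a tuple, so keep_prefixes must be a tuple
    return [word for word, flag in pairs if flag.startswith(keep_prefixes)]


# Hand-written (word, flag) pairs for illustration only
pairs = [
    ("酒店", "n"),
    ("在", "p"),
    ("海边", "s"),
    ("去", "v"),
    ("鼓浪屿", "ns"),
    ("方便", "a"),
]

print(filter_by_pos(pairs))          # ['酒店', '去', '鼓浪屿']
print(filter_by_pos(pairs, ("n",)))  # ['酒店', '鼓浪屿']
```

With jieba installed, the same helper can be called as filter_by_pos(pseg.cut(sentence)).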

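Stop-word removal

Removing stop words requires a stop-word list. Common choices are the HIT (Harbin Institute of Technology) stop-word list and the Baidu stop-word list.

Before removing stop words, load the stop-word list with load_stopword(). Then load the custom dictionary as shown earlier and segment the sentence. For each word in the segmented sentence, check whether it is in the stop-word list; if not, append it to outstr, with words separated by spaces.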

import jieba

# Load the stop-word list
def load_stopword():
    # Your own Chinese stop-word list, one word per line
    with open('hit_stopwords.txt', encoding='utf-8') as f_stop:
        # strip() removes leading and trailing whitespace, including the newline
        sw = {line.strip() for line in f_stop}  # a set makes membership tests fast
    return sw

# Segment a Chinese sentence and remove stop words
def seg_word(sentence):
    file_userDict = 'dict.txt'  # custom dictionary
    jieba.load_userdict(file_userDict)

    sentence_seged = jieba.cut(sentence.strip())
    stopwords = load_stopword()
    outstr = ''
    for word in sentence_seged:
        # Skip stop words and tab characters
        if word not in stopwords and word != '\t':
            outstr += word
            outstr += " "
    print(outstr)
    return outstr

if __name__ == '__main__':
    sentence = "人们宁愿去关心一个蹩脚电影演员的吃喝拉撒和鸡毛蒜皮，而不愿了解一个普通人波涛汹涌的内心世界"
    seg_word(sentence)
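When many sentences are processed, it is usually cheaper to load the stop words once and reuse them than to reread the file on every call. The sketch below separates the two concerns; load_stopwords and remove_stopwords are hypothetical helper names, and the tiny stop-word list and pre-segmented tokens are illustrative stand-ins for a real stop-word file and real jieba output.

```python
import io


def load_stopwords(lines):
    """Build a stop-word set from an iterable of lines (for example, an open file)."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Drop stop words and whitespace-only tokens, preserving order."""
    return [t for t in tokens if t.strip() and t not in stopwords]


# io.StringIO stands in for open('hit_stopwords.txt', encoding='utf-8')
stopwords = load_stopwords(io.StringIO("的\n了\n和\n，\n"))

# Pre-segmented tokens, written by hand for illustration
tokens = ["人们", "宁愿", "去", "关心", "电影", "演员", "的", "吃喝拉撒", "和", "鸡毛蒜皮", "\t"]

print(" ".join(remove_stopwords(tokens, stopwords)))
# 人们 宁愿 去 关心 电影 演员 吃喝拉撒 鸡毛蒜皮
```

Returning a list instead of a space-joined string leaves the choice of output format to the caller.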

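Punctuation removal

Import the re module, define the punctuation characters to strip as a regular-expression character class, and use re.sub() to replace every match with an empty string.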

import re

sentence = "+蚂=蚁！花!呗/期?免,息★.---《平凡的世界》：了*解一（#@）个“普通人”波涛汹涌的内心世界！"
sentenceClean = []
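# Raw string so that backslashes reach the regex engine unchanged; "-" is escaped so it is not read as a range
remove_chars = r'[·’!"#$%&\'()＃！（）*+,\-./:;<=>?@，：?￥★、…．＞【】［］《》？“”‘’\[\\\]^_`{|}~]+'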
string = re.sub(remove_chars, "", sentence)
sentenceClean.append(string)
print(sentence)
print(sentenceClean)
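A hand-maintained character class can miss marks that were not anticipated. As an alternative sketch (not the approach used above), the standard-library unicodedata module can classify each character, so that anything in a punctuation (P*) or symbol (S*) category is dropped. The helper name strip_punctuation is hypothetical.

```python
import unicodedata


def strip_punctuation(text):
    """Drop every character whose Unicode category is punctuation (P*) or symbol (S*)."""
    return "".join(
        ch for ch in text if unicodedata.category(ch)[0] not in ("P", "S")
    )


sentence = "+蚂=蚁！花!呗/期?免,息★.---《平凡的世界》：了*解一（#@）个“普通人”波涛汹涌的内心世界！"
print(strip_punctuation(sentence))
# 蚂蚁花呗期免息平凡的世界了解一个普通人波涛汹涌的内心世界
```

This is broader than the regex: it also removes currency signs, math operators, and other symbols, so the explicit character class remains the better fit when only specific marks should go.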