Setting up jieba Chinese word segmentation in Python 3

2020-05-25  时间煮菜

Installing jieba

  1. Install jieba in a virtual environment or in your local Python environment:
pip3 install jieba

Configuring jieba

  1. Basic usage of jieba Chinese word segmentation:
>>> import jieba
>>> text = '撒糖屑曲奇饼干'
>>> jieba.cut(text, cut_all=True)
<generator object Tokenizer.cut at 0x000001EFB2587E48>
>>> res = jieba.cut(text, cut_all=True)
>>> for val in res:
...     print(val)
...
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\Public\Documents\Wondershare\CreatorTemp\jieba.cache
Loading model cost 1.405 seconds.
Prefix dict has been built successfully.
撒
糖
屑
曲奇
曲奇饼
饼干
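To make the output above easier to read, here is a toy sketch of what full mode (`cut_all=True`) does conceptually: it emits every dictionary word it can find, and falls back to single characters where nothing matches. This is a simplified illustration and not jieba's actual code; the function name `full_mode_cut` and the three-word dictionary are assumptions made up for the example.

```python
def full_mode_cut(text, words):
    """Toy full-mode segmentation: yield every dictionary word found in text.

    `words` is a hypothetical mini-dictionary (a non-empty set of strings),
    standing in for jieba's real prefix dictionary.
    """
    max_len = max(map(len, words))
    covered_to = -1  # index of the last character covered by an emitted token
    for start in range(len(text)):
        # Collect all dictionary words (2+ characters) beginning at `start`,
        # shortest first.
        last_end = min(len(text), start + max_len)
        matches = [text[start:end]
                   for end in range(start + 2, last_end + 1)
                   if text[start:end] in words]
        if matches:
            for word in matches:
                yield word
            covered_to = max(covered_to, start + len(matches[-1]) - 1)
        elif start > covered_to:
            # No word starts here and no earlier word covers this character.
            yield text[start]
            covered_to = start


words = {'曲奇', '曲奇饼', '饼干'}  # hypothetical dictionary
print(list(full_mode_cut('撒糖屑曲奇饼干', words)))
# ['撒', '糖', '屑', '曲奇', '曲奇饼', '饼干']
```

Like `jieba.cut`, the sketch is a generator, which is why the session above has to loop over the result (or wrap it in `list(...)`) before any words appear.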
  2. Create a new file, ChineseAnalyzer.py, with the following contents:
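import jieba
from whoosh.analysis import Tokenizer, Token


class ChineseTokenizer(Tokenizer):
    """Whoosh tokenizer that segments Chinese text with jieba (full mode)."""

    def __call__(self, value, positions=False, chars=False,
                 keeporiginal=False, removestops=True,
                 start_pos=0, start_char=0, mode='', **kwargs):
        # A single Token object is reused and re-yielded for every word.
        t = Token(positions, chars, removestops=removestops, mode=mode,
                  **kwargs)
        seglist = jieba.cut(value, cut_all=True)
        for w in seglist:
            t.original = t.text = w
            t.boost = 1.0
            # find() returns the first occurrence of w, so a word that
            # appears more than once in value always gets the same offset.
            offset = value.find(w)
            if positions:
                t.pos = start_pos + offset
            if chars:
                t.startchar = start_char + offset
                t.endchar = start_char + offset + len(w)
            yield t


def ChineseAnalyzer():
    """Return the analyzer used by the modified Haystack backend."""
    return ChineseTokenizer()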
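One caveat with the tokenizer above: it locates each word with `value.find(w)`, which always returns the first occurrence, so a word that appears twice in the text gets the same offsets both times. The sketch below shows one possible way to track offsets by scanning forward instead. The helper `token_offsets` is hypothetical, and it assumes tokens arrive in non-decreasing start order with same-start tokens shortest first, which I believe matches full-mode output but have not verified for every jieba version. The token list is written by hand for illustration, not produced by jieba.

```python
def token_offsets(text, tokens):
    """Return (token, start, end) triples, scanning forward through text.

    Assumption: tokens arrive in non-decreasing start order, and tokens that
    share a start position arrive shortest first.
    """
    result = []
    prev_start, prev_end = 0, 0
    for tok in tokens:
        start = text.find(tok, prev_start)
        # Same start as the previous token without extending past it:
        # this must be a later occurrence, so search one character further.
        if start == prev_start and start + len(tok) <= prev_end:
            start = text.find(tok, prev_start + 1)
        if start == -1:
            raise ValueError(f"token {tok!r} not found at or after offset {prev_start}")
        end = start + len(tok)
        result.append((tok, start, end))
        prev_start, prev_end = start, end
    return result


text = '饼干曲奇饼干'
tokens = ['饼干', '曲奇', '曲奇饼', '饼干']  # hand-written, hypothetical full-mode output
naive = [text.find(tok) for tok in tokens]
tracked = [start for _, start, _ in token_offsets(text, tokens)]
print(naive)    # [0, 2, 2, 0]
print(tracked)  # [0, 2, 2, 4]
```

If overlapping full-mode tokens are not required, jieba's own `jieba.tokenize` yields `(word, start, end)` tuples directly in its default and search modes, which avoids the offset bookkeeping altogether.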
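  3. In the same directory (the haystack/backends package, judging by the relative import and the ENGINE path used in the following steps), make a copy of whoosh_backend.py and name it whoosh_cn_backend.py.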

  4. Edit whoosh_cn_backend.py as follows:

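# Add at line 26 (the exact line number may differ between django-haystack versions):
from .ChineseAnalyzer import ChineseAnalyzer

# Find analyzer=StemmingAnalyzer() near line 163 and change it to analyzer=ChineseAnalyzer(),
# so that the line reads:
schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=ChineseAnalyzer(), field_boost=field_class.boost, sortable=True)

[Screenshot: switching the analyzer to ChineseAnalyzer]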
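The copy-and-edit steps can also be scripted. The following is only a sketch under stated assumptions: the function `make_cn_backend` is hypothetical, it assumes `StemmingAnalyzer` is imported on an unindented `from`/`import` line in whoosh_backend.py, and it assumes the analyzer is written exactly as `analyzer=StemmingAnalyzer()`. If either assumption fails for your django-haystack version, edit the file by hand as described above.

```python
import shutil
from pathlib import Path

IMPORT_LINE = "from .ChineseAnalyzer import ChineseAnalyzer\n"


def make_cn_backend(backends_dir):
    """Copy whoosh_backend.py to whoosh_cn_backend.py and swap the analyzer.

    `backends_dir` is the directory that contains whoosh_backend.py.
    The original file is left untouched.
    """
    backends_dir = Path(backends_dir)
    src = backends_dir / "whoosh_backend.py"
    dst = backends_dir / "whoosh_cn_backend.py"
    shutil.copyfile(src, dst)

    patched = []
    import_added = False
    for line in dst.read_text(encoding="utf-8").splitlines(keepends=True):
        # Assumption: StemmingAnalyzer is imported on an unindented line;
        # the new relative import is placed right before it.
        if (not import_added
                and line.startswith(("from ", "import "))
                and "StemmingAnalyzer" in line):
            patched.append(IMPORT_LINE)
            import_added = True
        patched.append(line.replace("analyzer=StemmingAnalyzer()",
                                    "analyzer=ChineseAnalyzer()"))
    if not import_added:
        raise RuntimeError("StemmingAnalyzer import not found; patch the file by hand")
    dst.write_text("".join(patched), encoding="utf-8")
    return dst
```

Calling `make_cn_backend(path_to_backends_dir)` returns the path of the new file. ChineseAnalyzer.py still has to sit in the same directory for the relative import to resolve.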
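  5. Edit the project's settings.py file:
# Full-text search configuration
HAYSTACK_CONNECTIONS = {
    'default': {
        # Use the Whoosh engine (the stock backend is kept here for reference)
        # 'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        'ENGINE': 'haystack.backends.whoosh_cn_backend.WhooshEngine',
        # Where the index files are stored: a directory under the project root, created automatically
        'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
    }
}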
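The ENGINE value is a dotted path, so its module part has to match the name of the copied backend file. As a quick sanity check, this small sketch maps an ENGINE string to the file it refers to; the helper name `engine_module_file` is hypothetical and not part of Haystack.

```python
from pathlib import PurePosixPath


def engine_module_file(engine):
    """Split 'pkg.module.ClassName' into ('pkg/module.py', 'ClassName')."""
    module, _, class_name = engine.rpartition('.')
    rel_path = PurePosixPath(*module.split('.')).with_suffix('.py')
    return str(rel_path), class_name


rel_path, class_name = engine_module_file(
    'haystack.backends.whoosh_cn_backend.WhooshEngine')
print(rel_path)    # haystack/backends/whoosh_cn_backend.py
print(class_name)  # WhooshEngine
```

If the file named by the first value does not exist in the installed haystack package, Django will not be able to import the engine.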
  6. Rebuild the index:
python3 manage.py rebuild_index

jieba search results (screenshot)

The search now returns matching results.