Setting up jieba Chinese word segmentation in Python 3
2020-05-25 · 时间煮菜
Installing jieba
- Install jieba in your virtual environment (or system-wide):
pip3 install jieba
Configuring jieba segmentation
- Using jieba for Chinese word segmentation:
>>> import jieba
>>> str = '撒糖屑曲奇饼干'
>>> jieba.cut(str, cut_all=True)
<generator object Tokenizer.cut at 0x000001EFB2587E48>
>>> res = jieba.cut(str, cut_all=True)
>>> for val in res:
...     print(val)
...
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\Public\Documents\Wondershare\CreatorTemp\jieba.cache
Loading model cost 1.405 seconds.
Prefix dict has been built successfully.
撒
糖
屑
曲奇
曲奇饼
饼干
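The session above uses full mode (cut_all=True), which emits every dictionary word found in the text, including overlapping ones. Depending on your needs, precise mode or search-engine mode may be a better fit; a short comparison using the same sample string:

import jieba

text = '撒糖屑曲奇饼干'

# Precise mode (the default): a non-overlapping segmentation; lcut returns a list
print(jieba.lcut(text))

# Full mode: every dictionary word found in the text, as in the session above
print(jieba.lcut(text, cut_all=True))

# Search-engine mode: precise mode plus extra splits of longer words, handy for indexing
print(jieba.lcut_for_search(text))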
- Go to the haystack backends directory inside your virtual environment:
Lib/site-packages/haystack/backends
or in the system-wide install, e.g.
C:\Python3\Lib\site-packages\haystack\backends
- Create a new file named
ChineseAnalyzer.py
with the following content:
# haystack/backends/ChineseAnalyzer.py
import jieba
from whoosh.analysis import Tokenizer, Token


class ChineseTokenizer(Tokenizer):
    def __call__(self, value, positions=False, chars=False,
                 keeporiginal=False, removestops=True,
                 start_pos=0, start_char=0, mode='', **kwargs):
        # A single Token object is reused for every word jieba produces
        t = Token(positions, chars, removestops=removestops, mode=mode,
                  **kwargs)
        # Segment the field value with jieba in full mode
        seglist = jieba.cut(value, cut_all=True)
        for w in seglist:
            t.original = t.text = w
            t.boost = 1.0
            if positions:
                t.pos = start_pos + value.find(w)
            if chars:
                t.startchar = start_char + value.find(w)
                t.endchar = start_char + value.find(w) + len(w)
            yield t


def ChineseAnalyzer():
    return ChineseTokenizer()
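Before wiring this into haystack you can sanity-check the analyzer by calling it directly; whoosh analyzers are callables that yield Token objects. A minimal sketch, assuming the file was saved under haystack/backends as described above:

from haystack.backends.ChineseAnalyzer import ChineseAnalyzer

analyzer = ChineseAnalyzer()
# Ask for positions and character offsets so pos/startchar/endchar get populated
for token in analyzer('撒糖屑曲奇饼干', positions=True, chars=True):
    print(token.text, token.startchar, token.endchar)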
- In the same directory, copy
whoosh_backend.py
and name the copy whoosh_cn_backend.py (the sketch below shows one way to do this from Python)
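If you would rather not locate the backends directory by hand, the copy can also be made from Python; a small sketch, assuming django-haystack is importable in the active environment:

import os
import shutil

import haystack.backends

# Find the installed haystack backends directory and copy the whoosh backend
backends_dir = os.path.dirname(haystack.backends.__file__)
src = os.path.join(backends_dir, 'whoosh_backend.py')
dst = os.path.join(backends_dir, 'whoosh_cn_backend.py')
shutil.copyfile(src, dst)
print('copied', src, '->', dst)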
- Edit
whoosh_cn_backend.py
as follows:
# Around line 26, add the import:
from .ChineseAnalyzer import ChineseAnalyzer
# Around line 163, find the schema_fields line that uses analyzer=StemmingAnalyzer()
# and replace StemmingAnalyzer() with ChineseAnalyzer(), so the line reads:
schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=ChineseAnalyzer(), field_boost=field_class.boost, sortable=True)
Switching the project to the Chinese backend
- In the project's settings.py, change the haystack engine to
whoosh_cn_backend
:
# Full-text search configuration
HAYSTACK_CONNECTIONS = {
    'default': {
        # use the whoosh engine (the Chinese-aware copy, not the default backend)
        # 'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        'ENGINE': 'haystack.backends.whoosh_cn_backend.WhooshEngine',
        # path to the index files; kept under the project root and generated automatically
        'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
    }
}
- Rebuild the search index (a quick verification sketch follows):
python3 manage.py rebuild_index
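After rebuilding, one way to confirm that Chinese terms were actually tokenized by jieba is to open the whoosh index directly and sample its lexicon. A sketch, assuming haystack's default document field name 'text' and the 'whoosh_index' path configured above:

from whoosh.index import open_dir

# Open the index directory created by rebuild_index
ix = open_dir('whoosh_index')
with ix.searcher() as searcher:
    reader = searcher.reader()
    # Print a sample of indexed terms; multi-character Chinese words
    # (rather than single characters only) indicate jieba is in effect
    for term in list(reader.lexicon('text'))[:20]:
        print(term.decode('utf-8'))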