Indexing and Searching Files with Elasticsearch
2018-09-20
LI木水
1. Elasticsearch needs a plugin to index files
ES version | Plugin | Reference |
---|---|---|
Before ES 5.0 | mapper-attachments | https://qbox.io/blog/index-attachments-files-elasticsearch-mapper |
ES 5.0 and later | ingest-attachment | https://qbox.io/blog/how-to-index-attachments-and-files-to-elasticsearch-5-0-using-ingest-api , https://www.elastic.co/guide/en/elasticsearch/plugins/5.6/using-ingest-attachment.html |
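My original ES cluster ran version 2.3.5, so I first tried to install the 2.3.5 build of mapper-attachments. The installation failed because the downloaded plugin reported that it was built for ES 2.0. It seems to install successfully on an ES 2.4 cluster, but please test that yourself. Since I also wanted to upgrade to ES 5.x, I went with ES 5.6.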
2. Install the plugin
bin/elasticsearch-plugin install ingest-attachment
3. Create an attachment pipeline
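Note: the properties array selects which fields to extract. This pipeline requests "content", "title", "author", "keywords", "date", "content_length", and "content_type"; see the ingest-attachment documentation linked above for the full list of supported values.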
curl -XPUT 'http://localhost:19200/_ingest/pipeline/attachment?pretty' -H 'Content-Type: application/json' -d '{
  "description" : "Extract attachment information encoded in Base64 with UTF-8 charset",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "content", "title", "author", "keywords", "date", "content_length", "content_type" ]
      }
    }
  ]
}'
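Before indexing real documents, the pipeline can be checked with the ingest simulate API. The Python sketch below builds such a request using only the standard library. The helper names (build_simulate_body, build_simulate_request) are made up for illustration, and the host and port are simply the ones used in the curl commands in this post. It only constructs the request; sending it needs a running cluster with the pipeline already created.

```python
import base64
import json
import urllib.request

# Same host/port as the curl examples in this post; adjust for your cluster.
ES_URL = "http://localhost:19200"


def build_simulate_body(text):
    """Wrap Base64-encoded text in the document list expected by _simulate."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return {"docs": [{"_source": {"data": encoded}}]}


def build_simulate_request(pipeline, body, base_url=ES_URL):
    """Create (but do not send) a POST to /_ingest/pipeline/<name>/_simulate."""
    url = "{}/_ingest/pipeline/{}/_simulate".format(base_url, pipeline)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


body = build_simulate_body("hello attachment")
request = build_simulate_request("attachment", body)

# To run it against a live cluster:
#   with urllib.request.urlopen(request) as resp:
#       print(resp.read().decode("utf-8"))
```

If the pipeline works, the response should contain an attachment object with the extracted fields for each simulated document.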
4. Create the index test
curl -XPUT 'http://localhost:19200/test/' -H 'Content-Type: application/json' -d '{
  "settings":{
    "index":{
      "number_of_shards":1,
      "number_of_replicas":1
    }
  }
}'
5. Create the mapping
curl -XPUT 'http://localhost:19200/test/_mapping/document/' -H 'Content-Type: application/json' -d '
{
  "document": {
    "_source": {
      "excludes": [
        "data",
        "attachment.content"
      ]
    },
    "properties": {
      "filename": {
        "type": "text"
      },
      "attachment": {
        "properties": {
          "date": {
            "type": "date"
          },
          "content_type": {
            "type": "text",
            "fields": {
              "keyword": {
                "ignore_above": 256,
                "type": "keyword"
              }
            }
          },
          "author": {
            "type": "text",
            "fields": {
              "keyword": {
                "ignore_above": 256,
                "type": "keyword"
              }
            }
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "ignore_above": 256,
                "type": "keyword"
              }
            }
          },
          "content": {
            "type": "text"
          },
          "content_length": {
            "type": "long"
          }
        }
      },
      "data": {
        "type": "binary",
        "store": false
      },
      "filePath": {
        "type": "keyword"
      },
      "downloadTimes": {
        "type": "long"
      },
      "source": {
        "type": "keyword"
      },
      "type": {
        "type": "keyword"
      },
      "uploadTime": {
        "type": "date"
      },
      "viewTimes": {
        "type": "long"
      },
      "fileType": {
        "type": "keyword"
      }
    }
  }
}'
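Notes:
1. To index the content field without storing it, exclude it from _source. Otherwise every query returns the full extracted text, which is expensive for large files. Setting store: false alone does not help, because store already defaults to false and the value is still kept in _source.
"_source": {
  "excludes": [
    "data",
    "attachment.content"
  ]
},
2. type: "keyword" is for exact-match search.
"source": {
  "type": "keyword"
}
3. Since ES 5 the string type has been replaced by text (analyzed, for full-text search) and keyword (not analyzed), so content uses text.
"content": {
  "type": "text"
}
4. data holds the Base64 encoding of the original document and is mapped as binary. It never needs to be returned, so it is excluded from _source as well.
"data": {
  "type": "binary",
  "store": false
}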
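To make the effect of the _source excludes concrete, here is a small Python model of what happens to a document. This is only a simplified illustration and not how Elasticsearch implements it: the function name apply_source_excludes and the sample document are invented, and only exact dotted paths are handled (no wildcards). The point it shows is that the excluded fields are still indexed and searchable, while the _source that comes back in search hits no longer carries them.

```python
import copy


def apply_source_excludes(doc, excludes):
    """Return a copy of doc with the dotted paths in excludes removed.

    Simplified model for illustration: exact paths only, no wildcards.
    """
    result = copy.deepcopy(doc)
    for path in excludes:
        parts = path.split(".")
        node = result
        # Walk down to the parent object of the field to remove.
        for key in parts[:-1]:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict):
            node.pop(parts[-1], None)
    return result


# Hypothetical document as it looks after the attachment processor has run.
indexed = {
    "filename": "report.txt",
    "data": "aGVsbG8=",
    "attachment": {"content": "hello", "content_type": "text/plain"},
}

returned = apply_source_excludes(indexed, ["data", "attachment.content"])
print(returned)
```

The large Base64 payload and the extracted text are gone from the returned document, while the small metadata fields remain.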
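6. Index data
Note: data is the Base64 encoding of the original document. When indexing through the Java API, read the file content into a Base64 string and put it in the data field.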
curl -XPUT 'http://localhost:19200/test/document/test_id2?pipeline=attachment&pretty' -H 'Content-Type: application/json' -d '{
  "source":"测试",
  "filename":"测试文档",
  "data": "UWJveCBlbmFibGVzIGxhdW5jaGluZyBzdXBwb3J0ZWQsIGZ1bGx5LW1hbmFnZWQsIFJFU1RmdWwgRWxhc3RpY3NlYXJjaCBTZXJ2aWNlIGluc3RhbnRseS4g"
}'
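The same step can be scripted: read the file as bytes, Base64-encode it, and send it through the attachment pipeline. The following is a minimal Python sketch of that idea rather than the Java API code mentioned in the note. The helper names (encode_bytes, encode_file, build_index_request) are hypothetical, and the URL layout mirrors the curl command above. The request is only built here, not sent.

```python
import base64
import json
import urllib.parse
import urllib.request

# Same host/port as the curl examples in this post; adjust for your cluster.
ES_URL = "http://localhost:19200"


def encode_bytes(raw):
    """Base64-encode raw bytes and return an ASCII string for the data field."""
    return base64.b64encode(raw).decode("ascii")


def encode_file(path):
    """Read a file in binary mode and return its Base64 encoding."""
    with open(path, "rb") as handle:
        return encode_bytes(handle.read())


def build_index_request(doc_id, filename, source, data_b64, base_url=ES_URL):
    """Create (but do not send) a PUT that indexes one document via the pipeline."""
    url = "{}/test/document/{}?pipeline=attachment".format(
        base_url, urllib.parse.quote(doc_id, safe="")
    )
    doc = {"source": source, "filename": filename, "data": data_b64}
    return urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


payload = encode_bytes(b"Qbox")
request = build_index_request("test_id2", "测试文档", "测试", payload)

# For a real file, use encode_file("/path/to/file") instead of encode_bytes,
# then send the request with urllib.request.urlopen(request).
```

Because the whole file travels inside one JSON body, very large files make for very large requests, so it may be worth checking the cluster's HTTP request size limit before indexing them this way.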
7. Search
curl -XPOST 'http://localhost:19200/test/document/_search?pretty' -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "attachment.content": "Qbox"
          }
        },
        {
          "term": {
            "source": "测试"
          }
        }
      ]
    }
  }
}'
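The query above combines a phrase match on the extracted text with an exact match on the keyword field source. As a rough sketch, the same body can be built in Python and the hits reduced to the fields of interest. The function names and the sample response are hand-written for illustration; the sample is shaped like a _search response but is not real cluster output.

```python
import json


def build_search_body(phrase, source):
    """Phrase match on the extracted content plus exact match on source."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"attachment.content": phrase}},
                    {"term": {"source": source}},
                ]
            }
        }
    }


def summarize_hits(response):
    """Return (id, filename) pairs from a parsed _search response."""
    return [
        (hit["_id"], hit["_source"].get("filename"))
        for hit in response["hits"]["hits"]
    ]


search_body = build_search_body("Qbox", "测试")
print(json.dumps(search_body, ensure_ascii=False, indent=2))

# Hand-written sample, not real output. Note that _source carries neither
# data nor attachment.content, because the mapping excludes them.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "test_id2", "_source": {"source": "测试", "filename": "测试文档"}}
        ]
    }
}
print(summarize_hits(sample_response))
```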
Other references:
https://www.elastic.co/guide/en/elasticsearch/plugins/5.6/using-ingest-attachment.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/binary.html