News Search Implementation -- Elasticsearch

2019-03-04 爱修仙的道友

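1. Requirements analysis

Search could be implemented with the database's fuzzy matching (the LIKE keyword), but this is very inefficient: a pattern with a leading wildcard such as '%python%' generally cannot use an ordinary index, so every row has to be scanned.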

# Fuzzy query
# LIKE
# % matches any number of arbitrary characters
# _ matches exactly one arbitrary character
SELECT * FROM users WHERE username LIKE '%python%' AND is_delete = 0;
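
To make the two wildcards concrete, here is a minimal sketch that runs the same kind of LIKE query against an in-memory SQLite database. The table contents and the helper name find_users are made up for illustration; a real project would query its own database.

```python
import sqlite3

# In-memory database with a few illustrative rows (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, is_delete INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("python_fan", 0), ("i_love_python", 0), ("pythonista", 1), ("java_dev", 0)],
)


def find_users(pattern):
    """Return the usernames of non-deleted users matching a LIKE pattern."""
    rows = conn.execute(
        "SELECT username FROM users "
        "WHERE username LIKE ? AND is_delete = 0 ORDER BY username",
        (pattern,),
    )
    return [name for (name,) in rows]


print(find_users("%python%"))  # % matches any run of characters
print(find_users("java_de_"))  # each _ matches exactly one character
```

Every extra column to search needs another LIKE clause joined with OR, which is where this approach becomes awkward.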

Searching across several fields with LIKE is cumbersome,
so a search engine is used to provide full-text search.

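2. How a search engine works

A search engine does not query the database directly.
It preprocesses the data in the database once and builds a separate index structure,
similar to the lookup index pages at the front of a dictionary.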
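
The "separate index structure" is usually an inverted index: a map from each word to the documents that contain it. The following is a minimal sketch of the idea, assuming whitespace-separated English text; the function names and sample documents are illustrative and are not what Elasticsearch runs internally.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, word):
    """Look a word up in the index instead of scanning every document."""
    return index.get(word.lower(), [])


docs = {1: "Python web development", 2: "Learning Python", 3: "Django news search"}
index = build_inverted_index(docs)
print(search(index, "python"))
```

The preprocessing cost is paid once when the index is built; each lookup afterwards is a dictionary access rather than a scan of all rows.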

3. Elasticsearch

Open source
A popular first choice among search engines
Built on the open-source library Lucene
Operated through a REST API


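When a search engine builds an index it has to tokenize the text. Tokenization splits a sentence into individual characters or words, and those tokens become the keywords of the sentence. Elasticsearch's default analyzer splits Chinese text into single characters rather than words, so the elasticsearch-analysis-ik plugin is needed for proper Chinese word segmentation. (jieba is another tool dedicated to Chinese word segmentation.)
Whoosh is another search engine, worth exploring on your own.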
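
To show why a dictionary matters for Chinese text, here is a toy forward-maximum-matching segmenter. It is only a sketch of one classic dictionary-based technique, not the algorithm used by elasticsearch-analysis-ik or jieba; the vocabulary and the function name are invented for the example.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedily take the longest vocabulary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary; real analyzers ship much larger ones.
vocab = {"搜索", "引擎", "搜索引擎", "中文", "分词"}
print(forward_max_match("中文搜索引擎分词", vocab))
```

Characters that are not covered by the vocabulary fall back to single-character tokens, which is roughly what happens to all Chinese text when no dictionary-based analyzer is installed.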

4. Installing Elasticsearch with Docker

a. Pull the image

sudo docker image pull delron/elasticsearch-ik:2.4.6-1.0

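b. Upload the elasticsearch.zip file from the Baidu cloud drive to the home directory of the virtual machine and unzip it. On line 54 of elasticsearch/config/elasticsearch.yml in the VM, change the IP address to 0.0.0.0 (the machine's own addresses, for example 172.17.0.1, can be listed by running ip a on the command line), and change the port from the default 9200 to 8002.

# In Xshell, use the rz command to upload elasticsearch.zip to the VM's home directory, then unzip it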
unzip elasticsearch.zip
cd ~/elasticsearch/config
vim elasticsearch.yml

# network.host: 172.18.168.123
network.host: 0.0.0.0
#
# Set a custom port for HTTP:
#
http.port: 8002  # with NAT networking this port must be mapped to the host

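c. Create and run the Docker container
/home/pyvip/elasticsearch/config is the absolute path printed by pwd inside the config directory. It must be exact, and it is the only part of the command you need to change.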

docker run -dti --restart=always --network=host --name=elasticsearch -v /home/pyvip/elasticsearch/config:/usr/share/elasticsearch/config delron/elasticsearch-ik:2.4.6-1.0

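Run docker container ls -a in the current directory; if the elasticsearch container shows a status of Up, the installation succeeded.

d. Then run

curl 127.0.0.1:8002  # network.host 0.0.0.0 accepts connections on any interface; 8002 must match the port set above

# Output like the following means the environment is ready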

{
  "name" : "Hardshell",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "XYrjFhRARKGa4blqNuYZ4A",
  "version" : {
    "number" : "2.4.6",
    "build_hash" : "5376dca9f70f3abef96a77f4bb22720ace8240fd",
    "build_timestamp" : "2017-07-18T12:17:44Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.4"
  },
  "tagline" : "You Know, for Search"
}
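
The same check can be scripted. The sketch below parses the JSON returned by the root endpoint and pulls out the two version fields; the function names are made up for this example, and the URL assumes the port 8002 configured above. fetch_banner is defined but not called here, since it needs the container to be running.

```python
import json
from urllib.request import urlopen


def parse_banner(body):
    """Extract (Elasticsearch version, Lucene version) from the root endpoint's JSON."""
    info = json.loads(body)
    return info["version"]["number"], info["version"]["lucene_version"]


def fetch_banner(url="http://127.0.0.1:8002", timeout=3):
    """Fetch the root endpoint; raises URLError if the server is unreachable."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_banner(resp.read().decode("utf-8"))


# Trimmed copy of the response shown above, used so the example runs offline.
sample = '{"version": {"number": "2.4.6", "lucene_version": "5.5.4"}, "tagline": "You Know, for Search"}'
print(parse_banner(sample))
```

Checking the version matters because the elasticsearch Python client installed in the next step is pinned to the 2.x series.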

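e. Activate the project's virtual environment and install the packages Django uses to talk to Elasticsearch

# Activate the project's virtual environment first
pip install django-haystack  # installing from a PyPI mirror is an option if the default index is slow
pip install elasticsearch==2.4.1

f. Add the following configuration to settings.py: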

INSTALLED_APPS = [
    'haystack',
]
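# This project has only one Elasticsearch endpoint; several could be configured in the same way as multiple Redis databases
ELASTICSEARCH_DSL = {
    'default': {
        'hosts': '127.0.0.1:8002'
    },
}

# Haystack is the Django module dedicated to talking to search engines. As with the database settings, switching to another backend such as Whoosh only requires changing ENGINE, which keeps the backend pluggable.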
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
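        'URL': 'http://127.0.0.1:8002/',  # address of the server running Elasticsearch; the default port is 9200, 8002 is the port set above
        'INDEX_NAME': 'dj_pre_class',  # name of the index Elasticsearch creates; must be lowercase (think of it as the title of a dictionary)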
    },
}

# Number of results shown per page
HAYSTACK_SEARCH_RESULTS_PER_PAGE = 5
# Update the index automatically whenever the database changes (similar to a subscription)
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'

5. Backend implementation

# Create the following class in apps/news/search_indexes.py (the file name must be search_indexes.py)


from haystack import indexes
# from haystack import site

from .models import News

# Naming convention: <ModelName>Index
class NewsIndex(indexes.SearchIndex, indexes.Indexable):
    """
    Search index class for the News model
    """
    # The text field must be defined first; it is the document field Haystack and Elasticsearch search on
    text = indexes.CharField(document=True, use_template=True)  # document=True, use_template=True is the fixed form
    # Map search engine fields to model fields
    id = indexes.IntegerField(model_attr='id')
    title = indexes.CharField(model_attr='title')
    digest = indexes.CharField(model_attr='digest')
    content = indexes.CharField(model_attr='content')
    image_url = indexes.CharField(model_attr='image_url')
    # comments = indexes.IntegerField(model_attr='comments')

    # The method name is fixed
    def get_model(self):
        """Return the model class to be indexed
        """
        return News

    # The method name is fixed
    def index_queryset(self, using=None):
        """Return the queryset of objects to be indexed
        """
        # Indexing every row at once may cause a timeout error
        # return self.get_model().objects.filter(is_delete=False,)
        # For the first build, restrict the tags with tag_id__in (tags in this project are 1 to 6)
        # return self.get_model().objects.filter(is_delete=False, tag_id__in=[1, 2, 3, 4, 5, 6])
        # If errors still occur, index a small subset first, then widen the filter and run update_index
        return self.get_model().objects.filter(is_delete=False, tag_id=1)

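# views.py
from django.conf import settings
from django.core.paginator import EmptyPage, PageNotAnInteger, Paginator
from django.shortcuts import render
from haystack.views import SearchView as _SearchView

from . import models  # assumes HotNews is defined in this app's models.py


# Method and attribute names are fixed because they override members of the parent class
class NewsSearchView(_SearchView):
    # Template file
    template = 'news/search.html'

    # Override the response: if the q parameter is empty, return the hot news from
    # the HotNews model; otherwise search for data matching q
    def create_response(self):
        # e.g. search/?q=python&page=10
        kw = self.request.GET.get('q', '')
        # kw is empty: no search term was given
        if not kw:
            # Flag telling the template to show hot news instead of search results
            show_all = True
            hot_news = models.HotNews.objects.select_related('news'). \
                only('news__title', 'news__image_url', 'news__id'). \
                filter(is_delete=False).order_by('priority', '-news__clicks')

            paginator = Paginator(hot_news, settings.HAYSTACK_SEARCH_RESULTS_PER_PAGE)
            try:
                # Pass the raw value so that Paginator itself raises PageNotAnInteger;
                # wrapping it in int() would raise ValueError before Paginator sees it
                page = paginator.page(self.request.GET.get('page', 1))
            except PageNotAnInteger:
                # page is not an integer: return the first page
                page = paginator.page(1)
            except EmptyPage:
                # page is below 1 or beyond the last page: return the last page
                page = paginator.page(paginator.num_pages)
            return render(self.request, self.template, locals())
        # kw is present: return the search results
        else:
            # show_all is absent from the parent's context, so the template's
            # "if not show_all" branch renders the search results
            return super(NewsSearchView, self).create_response()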
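
The page-number fallback rules in create_response can be checked without Django. The following is a plain-Python sketch of those rules (non-integer input falls back to the first page, out-of-range input to the last page); resolve_page_number is a hypothetical helper written for this illustration, not part of Django or Haystack.

```python
def resolve_page_number(raw, num_pages):
    """Return a valid 1-based page number following the view's fallback rules."""
    try:
        number = int(raw)
    except (TypeError, ValueError):
        return 1                      # not an integer: first page
    if number < 1 or number > num_pages:
        return max(num_pages, 1)      # out of range: last page
    return number


for raw in ("3", "abc", "99", "0"):
    print(raw, "->", resolve_page_number(raw, 5))
```

Note that a page number of 0 counts as out of range, so it lands on the last page rather than the first.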

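# Create the file templates/search/indexes/news/news_text.txt (the file name is <lowercase model name>_text.txt inside a directory named after the app)
# Required layout of the search template:
# Level 1: search
# Level 2: indexes
# Level 3: the current app name (the app that owns the index class)
# Level 4: <lowercase model name>_text.txt (the template)
# Content: {{ object.<field to search> }}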

{{ object.title }}
{{ object.digest }}
{{ object.content }}
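
The effect of this template is that the text document field becomes the listed model fields joined together, so a query on text searches all of them at once. Here is a rough plain-Python approximation of that rendering step, using a stand-in object instead of a real News instance; build_document and the sample values are assumptions made for the example.

```python
from types import SimpleNamespace

# Mirrors the three lines of news_text.txt.
TEMPLATE_FIELDS = ("title", "digest", "content")


def build_document(obj, fields=TEMPLATE_FIELDS):
    """Join the chosen attributes with newlines, one per template line."""
    return "\n".join(str(getattr(obj, name)) for name in fields)


# Stand-in for a News model instance (hypothetical data).
news = SimpleNamespace(
    title="Python 3 released",
    digest="Short summary",
    content="Full body",
    image_url="/a.png",
)
document = build_document(news)
print(document)
```

Fields that are left out of the template, such as image_url here, are stored in the index but do not take part in full-text matching on text.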

# In apps/news/urls.py

urlpatterns = [
    # This old-style Haystack SearchView is passed as a callable instance, so there is no .as_view()
    path('search/', views.NewsSearchView(), name='news_search'),

]

# Run the following commands in the VM to build the index
python manage.py rebuild_index
python manage.py update_index

# Command details
python manage.py --help
[haystack]
    build_solr_schema
    clear_index   clear the index
    haystack_info  show Haystack information
    rebuild_index  rebuild the index from scratch
    update_index  update the index (must be run manually the first time)

python manage.py rebuild_index --help  (show usage for the command)
    Includes an option for indexing in parallel:
    -k WORKERS, --workers WORKERS
      Allows for the use multiple workers to parallelize
          indexing. Requires multiprocessing.

cat /proc/cpuinfo | grep processor | wc -l  (count the system's CPU cores)
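
The core count can also be read from Python and fed into the -k option from the help text. This is a small sketch; rebuild_command is a hypothetical helper that only builds the argument list and does not run anything.

```python
import os


def rebuild_command(workers=None):
    """Build the rebuild_index command line with the -k option from the help text."""
    if workers is None:
        workers = os.cpu_count() or 1   # cpu_count() may return None
    return ["python", "manage.py", "rebuild_index", "-k", str(workers)]


print(" ".join(rebuild_command()))
```

os.cpu_count() reports logical processors, which is the same quantity the grep over /proc/cpuinfo counts.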

6. Frontend code (note: the pagination bar only lists the first ten page numbers; to be extended later)


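{% extends 'base/base.html' %}
{# The static tag used below must be loaded in this template; loads are not inherited from the base template #}
{% load static %}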

{% block title %}
  searchPage
{% endblock %}

<!-- css link start -->
{% block link %}
  <link rel="stylesheet" href="{% static 'css/news/search.css' %}">
{% endblock %}
<!-- css link end -->


<!-- main-contain start  -->
{% block main_contain %}

<div class="main-contain ">
                   <!-- search-box start -->
                   <div class="search-box">
                       <form action="" style="display: inline-flex;">

                           <input type="search" placeholder="Enter search terms" name="q" class="search-control">


                           <input type="submit" value="Search" class="search-btn">
                       </form>
                       <!-- Float, vertical-align, or flex can be used for the layout here -->
                   </div>
                   <!-- search-box end -->
                   <!-- content start -->
                   <div class="content">
                       <!-- search-list start -->
                        {% if not show_all %}
                          <div class="search-result-list">
                            <h2 class="search-result-title">
                              Search results: <span style="font-weight: 700;color: #ff6620;">{{ paginator.num_pages }}</span> page(s)
                            </h2>
                            <ul class="news-list">
                              {# Load Haystack's built-in highlight tag #}
                              {% load highlight %}
{# page is the paginated result set; page.object_list is the list of results on this page, each wrapping a model instance #}
                              {% for one_news in page.object_list %}
                                <li class="news-item clearfix">
                                  <a href="{% url 'news:news_detail' one_news.id %}" class="news-thumbnail" target="_blank">
                          {# .object. is required here #}
                                  <img src="{{ one_news.object.image_url }}">
                                  </a>
                                  <div class="news-content">
                                    <h4 class="news-title">
         {# There must be a space between 'news:news_detail' and one_news.id #}
                                      <a href="{% url 'news:news_detail' one_news.id %}">
                {# Highlight syntax: highlight xxx with query. The field is already defined in search_indexes.py, so .object. is not needed here #}
                                        {% highlight one_news.title with query %}
                                      </a>
                                    </h4>
                                    <p class="news-details">{% highlight one_news.digest with query %}</p>
                                    <div class="news-other">
                                      <span class="news-type">{{ one_news.object.tag.name }}</span>
                                      <span class="news-time">{{ one_news.object.update_time }}</span>
                                      <span
                                          class="news-author">{% highlight one_news.object.author.username with query %}

                                      </span>
                                    </div>
                                  </div>
                                </li>
                              {% endfor %}


                            </ul>
                          </div>

                        {% else %}

                          <div class="news-contain">
                            <div class="hot-recommend-list">
                              <h2 class="hot-recommend-title">Hot recommendations</h2>
                              <ul class="news-list">

                                {% for one_hotnews in page.object_list %}

                                  <li class="news-item clearfix">
                                   {# No .object. is needed here: this page comes from our own Paginator and was not processed by the Elasticsearch backend #}
                                      <a href="#" class="news-thumbnail">
                                      <img src="{{ one_hotnews.news.image_url }}">
                                    </a>
                                    <div class="news-content">
                                      <h4 class="news-title">
                                        <a href="{% url 'news:news_detail' one_hotnews.news.id %}">{{ one_hotnews.news.title }}</a>
                                      </h4>
                                      <p class="news-details">{{ one_hotnews.news.digest }}</p>
                                      <div class="news-other">
                                        <span class="news-type">{{ one_hotnews.news.tag.name }}</span>
                                        <span class="news-time">{{ one_hotnews.update_time }}</span>
                                        <span class="news-author">{{ one_hotnews.news.author.username }}</span>
                                      </div>
                                    </div>
                                  </li>

                                {% endfor %}


                              </ul>
                            </div>
                          </div>

                        {% endif %}

                       <!-- search-list end -->
                       <!-- news-contain start -->

                    {# Pagination navigation #}
                     <div class="page-box" id="pages">
                       <div class="pagebar" id="pageBar">
                          <a class="a1">{{ page.paginator.count }} items</a>
                         {# URL of the previous page #}
                           {# Check whether a previous page exists #}
                         {% if page.has_previous %}
                           {% if query %}
                           {# The page parameter name must match the one used in views.py #}
                             <a href="{% url 'news:news_search' %}?q={{ query }}&amp;page={{ page.previous_page_number }}"
                                class="prev">Previous</a>
                           {% else %}
                             <a href="{% url 'news:news_search' %}?page={{ page.previous_page_number }}" class="prev">Previous</a>
                           {% endif %}
                         {% endif %}
                         {# List the page URLs #}
{# page.paginator.page_range|slice:":10" takes the range of page numbers and keeps the first ten via the slice filter #}
                         {% for num in page.paginator.page_range|slice:":10" %}
                           {% if num == page.number %}
                             <span class="sel">{{ page.number }}</span>
                           {% else %}
                             {% if query %}
                               <a href="{% url 'news:news_search' %}?q={{ query }}&amp;page={{ num }}"
                                  target="_self">{{ num }}</a>
                             {% else %}
                               <a href="{% url 'news:news_search' %}?page={{ num }}" target="_self">{{ num }}</a>
                             {% endif %}
                           {% endif %}
                         {% endfor %}

                        {# If there are more than 10 pages, print two dots followed by a link to the last page #}
                         {% if page.paginator.num_pages > 10 %}
                           ..

                           {% if query %}
                             <a href="{% url 'news:news_search' %}?q={{ query }}&amp;page={{ page.paginator.num_pages }}"
                                target="_self">{{ page.paginator.num_pages }}</a>
                           {% else %}
                             <a href="{% url 'news:news_search' %}?page={{ page.paginator.num_pages }}"
                                target="_self">{{ page.paginator.num_pages }}</a>
                           {% endif %}

                         {% endif %}

                         {# URL of the next page #}
                         {% if page.has_next %}
                           {% if query %}
                             <a href="{% url 'news:news_search' %}?q={{ query }}&amp;page={{ page.next_page_number }}"
                                class="next">Next</a>
                           {% else %}
                             <a href="{% url 'news:news_search' %}?page={{ page.next_page_number }}" class="next">Next</a>
                           {% endif %}
                         {% endif %}
                       </div>
                     </div>

                     <!-- news-contain end -->
                   </div>
                   <!-- content end -->
               </div>
{% endblock %}
<!-- main-contain  end -->


{% block script %}

  <script src="{% static 'js/common.js' %}"></script>
{% endblock %}
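
Since the pagination bar above only lists the first ten page numbers, one possible extension is a sliding window centred on the current page. The following is a sketch of that calculation in plain Python; page_window is a hypothetical helper whose result could be passed to the template from the view in place of the slice filter.

```python
def page_window(current, num_pages, width=10):
    """Return up to `width` consecutive page numbers that include `current`."""
    start = max(1, current - width // 2)
    end = min(num_pages, start + width - 1)
    start = max(1, end - width + 1)   # shift left when close to the last page
    return list(range(start, end + 1))


print(page_window(1, 30))
print(page_window(15, 30))
print(page_window(30, 30))
```

Near either end the window is shifted rather than shortened, so the bar keeps the same number of links whenever enough pages exist.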

/* Add the following code to static/css/news/search.css: */

/* === current index start === */
#pages {
    padding: 32px 0 10px;
}

.page-box {
    text-align: center;
    /*font-size: 14px;*/
}

#pages a.prev, #pages a.next {
    width: 56px;
    padding: 0
}

#pages a {
    display: inline-block;
    height: 26px;
    line-height: 26px;
    background: #fff;
    border: 1px solid #e3e3e3;
    text-align: center;
    color: #333;
    padding: 0 10px
}

#pages .sel {
    display: inline-block;
    height: 26px;
    line-height: 26px;
    background: #0093E9;
    border: 1px solid #0093E9;
    color: #fff;
    text-align: center;
    padding: 0 10px
}

.highlighted {
    color: coral;
    font-weight: bold;
}
/* === current index end === */
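
The .highlighted rule styles the markup that the highlight template tag wraps around matched words. As a rough illustration of that wrapping step only, here is a simplified sketch in plain Python; it is not Haystack's actual highlighter, which also trims the text to an excerpt, and the function name is chosen for this example.

```python
import html
import re


def highlight(text, query, css_class="highlighted"):
    """Wrap each case-insensitive occurrence of a query word in a <span>."""
    words = [re.escape(w) for w in query.split()]
    if not words:
        return html.escape(text)
    pattern = re.compile("(" + "|".join(words) + ")", re.IGNORECASE)
    parts = pattern.split(text)
    # With one capturing group, the odd-indexed parts are the matches.
    return "".join(
        f'<span class="{css_class}">{html.escape(p)}</span>' if i % 2 else html.escape(p)
        for i, p in enumerate(parts)
    )


print(highlight("Learn Python today", "python"))
```

The text is escaped piece by piece after splitting, so a query term cannot accidentally match inside an HTML entity.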