Solr Basics

2020-04-27, by 拼搏男孩

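1. Introduction

Solr is a top-level Apache open-source project: a full-text search server written in Java and built on Lucene.

Compared with using Lucene directly, Solr offers richer query syntax, is extensible and configurable, and adds performance optimizations on top of Lucene.

How does Solr implement full-text search?

Indexing flow: a Solr client (a browser or a Java program) sends a POST request to the Solr server. The request body is an XML document containing Field data and related information, and Solr uses that document to maintain the index (adding, deleting, and updating entries).

Search flow: a Solr client (a browser or a Java program) sends a GET request to the Solr server, and the server returns an XML document.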
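
The two flows can be sketched with Python's standard library. This is only an illustration, not a Solr client API: the base URL `http://localhost:8080/solr`, the core name `demo_core`, and the helper names `build_add_xml` and `build_select_url` are assumptions. The code only builds the request body and URL; it does not contact a server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed values for illustration; adjust them to your own deployment.
SOLR_BASE = "http://localhost:8080/solr"
CORE = "demo_core"


def build_add_xml(fields):
    """Build the XML body of an index (add) request from a dict of field values."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_select_url(query):
    """Build the GET URL of a search request that asks for an XML response."""
    return f"{SOLR_BASE}/{CORE}/select?" + urlencode({"q": query, "wt": "xml"})


# Indexing: this body would be POSTed to <SOLR_BASE>/<CORE>/update.
body = build_add_xml({"id": "1", "name": "phone"})
# Searching: this URL would be requested with GET.
url = build_select_url("id:1")
print(body)
print(url)
```

A real client would send `body` with a `Content-Type: text/xml` header and follow it with a commit before the document becomes searchable.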

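2. Integrating Solr with Tomcat

Solr ships with Jetty as its web container. If you want to run it on Tomcat instead, Solr has to be integrated with Tomcat. This article is based on Solr 8.5 and Tomcat 9; the steps differ between Solr and Tomcat versions, so check that your versions match mine before proceeding. The steps are as follows.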

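2.1 Download Solr and Tomcat

Solr can be downloaded from its official site: https://lucene.apache.org/solr/downloads.html. Files ending in src.tgz contain the source code, files ending in tgz are for Linux, and files ending in zip are for Windows, so pick the one that matches your operating system.

Tomcat can likewise be downloaded from its official site: http://tomcat.apache.org/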

2.2 Copy the webapp

After extracting both archives, copy the webapp folder found at server/solr-webapp under the Solr directory into Tomcat's webapps directory and rename it to solr.

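2.3 Copy the JARs

Copy all JARs under solr-8.5.1\server\lib\ext, plus the following JARs from solr-8.5.1\server\lib, into apache-tomcat-9\webapps\solr\WEB-INF\lib: every JAR whose name starts with metrics, gmetric4j-1.0.7.jar, and every JAR whose name starts with http2.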
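
The JAR selection in step 2.3 can also be scripted. The following is a minimal sketch, assuming the directory layout described above; the function name `copy_solr_jars` is hypothetical, and the pattern `gmetric4j*.jar` is a loosened stand-in for the exact file gmetric4j-1.0.7.jar.

```python
import shutil
from pathlib import Path


def copy_solr_jars(solr_dir, tomcat_dir):
    """Copy the JARs named in step 2.3 into the Solr webapp under Tomcat."""
    lib = Path(solr_dir) / "server" / "lib"
    target = Path(tomcat_dir) / "webapps" / "solr" / "WEB-INF" / "lib"
    target.mkdir(parents=True, exist_ok=True)

    # Everything in lib/ext, plus the metrics*, gmetric4j* and http2* JARs from lib.
    jars = list((lib / "ext").glob("*.jar"))
    for pattern in ("metrics*.jar", "gmetric4j*.jar", "http2*.jar"):
        jars.extend(lib.glob(pattern))

    for jar in jars:
        shutil.copy2(jar, target / jar.name)
    return sorted(jar.name for jar in jars)


# Example call (paths are placeholders):
# copy_solr_jars("D:/software/solr/solr-8.5.1", "D:/software/solr/apache-tomcat-9")
```

The function returns the names of the copied files so the result can be checked against the list in the step above.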

2.4 Copy log4j2.xml

Under tomcat\webapps\solr\WEB-INF, create a classes folder and copy log4j2-console.xml and log4j2.xml from solr-8.5.1\server\resources into it.

2.5 Edit web.xml

Edit web.xml under apache-tomcat-9\webapps\solr\WEB-INF.

Add the following configuration:

<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-value>D:/software/solr/solr_home</env-entry-value>
  <env-entry-type>java.lang.String</env-entry-type>
</env-entry>

solr_home is the directory created in step 2.7 below.

Comment out the following configuration:

  <!-- Get rid of error message -->
  <!-- <security-constraint>
    <web-resource-collection>
      <web-resource-name>Disable TRACE</web-resource-name>
      <url-pattern>/</url-pattern>
      <http-method>TRACE</http-method>
    </web-resource-collection>
    <auth-constraint/>
  </security-constraint>
  <security-constraint>
    <web-resource-collection>
      <web-resource-name>Enable everything but TRACE</web-resource-name>
      <url-pattern>/</url-pattern>
      <http-method-omission>TRACE</http-method-omission>
    </web-resource-collection>
  </security-constraint> -->

2.6 Edit log4j2.xml

Edit apache-tomcat-9\webapps\solr\WEB-INF\classes\log4j2.xml and replace every ${sys:solr.log.dir} with a real path of your choice. I used the logs directory under Tomcat:

<RollingRandomAccessFile
        name="MainLogFile"
        fileName="D:/software/solr/apache-tomcat-solr/logs/solr.log"
        filePattern="D:/software/solr/apache-tomcat-solr/logs/solr.log.%i" >
      <PatternLayout>
        <Pattern>
          %maxLen{%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p (%t) [%X{collection} %X{shard} %X{replica} %X{core}] %c{1.} %m%notEmpty{ =>%ex{short}}}{10240}%n
        </Pattern>
      </PatternLayout>
      <Policies>
        <OnStartupTriggeringPolicy />
        <SizeBasedTriggeringPolicy size="32 MB"/>
      </Policies>
      <DefaultRolloverStrategy max="10"/>
    </RollingRandomAccessFile>

    <RollingRandomAccessFile
        name="SlowLogFile"
        fileName="D:/software/solr/apache-tomcat-solr/logs/solr_slow_requests.log"
        filePattern="D:/software/solr/apache-tomcat-solr/logs/solr_slow_requests.log.%i" >
      <PatternLayout>
        <Pattern>
          %maxLen{%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p (%t) [%X{collection} %X{shard} %X{replica} %X{core}] %c{1.} %m%notEmpty{ =>%ex{short}}}{10240}%n
        </Pattern>
      </PatternLayout>
      <Policies>
        <OnStartupTriggeringPolicy />
        <SizeBasedTriggeringPolicy size="32 MB"/>
      </Policies>
      <DefaultRolloverStrategy max="10"/>
    </RollingRandomAccessFile>
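
Replacing every occurrence of the placeholder by hand is easy to get wrong, so here is a minimal sketch of the same edit as a script. The function name `set_log_dir` is hypothetical, and the sketch assumes log4j2.xml is UTF-8 encoded.

```python
from pathlib import Path

PLACEHOLDER = "${sys:solr.log.dir}"


def set_log_dir(log4j2_path, log_dir):
    """Replace every ${sys:solr.log.dir} in log4j2.xml with a fixed directory."""
    path = Path(log4j2_path)
    text = path.read_text(encoding="utf-8")
    # Write the directory with forward slashes, as in the snippet above.
    updated = text.replace(PLACEHOLDER, Path(log_dir).as_posix())
    path.write_text(updated, encoding="utf-8")
    # Report how many placeholders were replaced.
    return text.count(PLACEHOLDER)


# Example call (paths are placeholders):
# set_log_dir("webapps/solr/WEB-INF/classes/log4j2.xml",
#             "D:/software/solr/apache-tomcat-solr/logs")
```

A return value of 0 means the file contained no placeholder, which usually indicates it was already edited or the wrong file was passed in.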

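2.7 Create the Solr Home (managed via CoreAdmin)

Create a solr_home folder at the same level as the solr directory.
Copy all files and folders under solr-8.5.1\server\solr\ into solr_home.
Copy the contrib and dist folders from solr-8.5.1 into solr_home.
Create a demo_core folder under solr_home, then copy the conf folder from solr_home\configsets\sample_techproducts_configs\ into solr_home\demo_core.

Edit solr_home\demo_core\conf\solrconfig.xml so that the <lib> entries resolve to the contrib and dist folders copied into solr_home: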

<lib dir="${solr.install.dir:../}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../}/dist/" regex="solr-cell-\d.*\.jar" />

  <lib dir="${solr.install.dir:../}/contrib/clustering/lib/" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../}/dist/" regex="solr-clustering-\d.*\.jar" />

  <lib dir="${solr.install.dir:../}/contrib/langid/lib/" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../}/dist/" regex="solr-langid-\d.*\.jar" />

  <lib dir="${solr.install.dir:../}/dist/" regex="solr-ltr-\d.*\.jar" />

  <lib dir="${solr.install.dir:../}/contrib/velocity/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../}/dist/" regex="solr-velocity-\d.*\.jar" />

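3. Integrating the IKAnalyzer Chinese tokenizer

The project is hosted on GitHub: https://github.com/magese/ik-analyzer-solr

Put the JAR into the webapp/WEB-INF/lib/ directory of the Jetty or Tomcat instance that serves Solr.
Put the 5 configuration files from the resources directory into the webapp/WEB-INF/classes/ directory of the same Jetty or Tomcat instance: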
```
① IKAnalyzer.cfg.xml
② ext.dic
③ stopword.dic
④ ik.conf
⑤ dynamicdic.txt
```

Configure Solr's managed-schema to add the IK tokenizer, for example:

<!-- IK tokenizer -->
<fieldType name="text_ik" class="solr.TextField">
  <analyzer type="index">
      <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false" conf="ik.conf"/>
      <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
      <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true" conf="ik.conf"/>
      <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

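4. The managed-schema file

The managed-schema file under solr_home/demo_core/conf mainly defines Fields and FieldTypes. A Field is comparable to a column in a database table, and a FieldType is the type of a field. The file ships with many predefined fields and field types.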

4.1 field

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

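The line above defines a field named id of type string, a type that the default schema already defines.

name: the name of the field (user-defined)
type: the type of the field
indexed: whether the field is indexed
  true: the analyzed terms are added to the index, which is what makes the field searchable
  false: the field is not indexed, so it cannot be searched
stored: whether the original value is stored
  true: the field's content is stored with the document so it can be returned and displayed in search results
  false: the content is not stored, so search results cannot return the field's value
required: whether the field is mandatory
multiValued: whether the field can hold multiple values; set it to true when a single Field must store several values, for example data gathered from several related records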
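
To make the attributes concrete, the sketch below parses the field definition shown above and reports what each flag allows. It is only an illustration: `describe_field` is a hypothetical helper, and an attribute that is absent is reported as `None` rather than guessed, since its effective value would come from the field type.

```python
import xml.etree.ElementTree as ET

FIELD_XML = ('<field name="id" type="string" indexed="true" stored="true" '
             'required="true" multiValued="false" />')


def describe_field(xml_text):
    """Summarize a <field> definition; None means the attribute is not set on the field."""
    attrs = ET.fromstring(xml_text).attrib

    def flag(key):
        value = attrs.get(key)
        return None if value is None else value == "true"

    return {
        "name": attrs["name"],
        "type": attrs["type"],
        "searchable": flag("indexed"),    # indexed: the field can be queried
        "returnable": flag("stored"),     # stored: the value can be returned in results
        "required": flag("required"),
        "multi_valued": flag("multiValued"),
    }


summary = describe_field(FIELD_XML)
print(summary)
```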

4.2 dynamicField

As the name suggests, dynamicField defines a dynamic field, for example:

<dynamicField name="*_i" type="pint" indexed="true" stored="true"/>

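This dynamic field matches every field whose name ends in "_i", so such fields can be indexed as pint without being declared one by one.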
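
The matching rule can be illustrated with a small lookup. This is a sketch of the idea, not Solr's implementation: the dictionaries and `resolve_type` are hypothetical, and `fnmatchcase` accepts more glob syntax than a dynamicField name does.

```python
from fnmatch import fnmatchcase

# Explicit fields and dynamic-field patterns, mirroring the schema snippets above.
EXPLICIT_FIELDS = {"id": "string"}
DYNAMIC_FIELDS = {"*_i": "pint"}


def resolve_type(field_name):
    """Return the field type for a name: explicit fields win, then dynamic patterns."""
    if field_name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[field_name]
    for pattern, field_type in DYNAMIC_FIELDS.items():
        if fnmatchcase(field_name, pattern):
            return field_type
    return None  # no explicit field and no pattern matched


print(resolve_type("stock_i"))  # pint
print(resolve_type("id"))       # string
print(resolve_type("stock_s"))  # None
```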

4.3 uniqueKey

The unique key. A managed-schema file may define only one:

<uniqueKey>id</uniqueKey>

Here id is the name of a field already defined with a field tag, and that field has required set to true.

4.4 copyField

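Copy fields. One way to think about them: suppose we want a single query to find products whose category or product name contains "phone". Copying both fields into one text field lets that single query cover both: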

<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
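
What the two copyField rules do at index time can be mimicked in a few lines. This is a conceptual sketch only: `COPY_FIELDS` and `apply_copy_fields` are hypothetical names, documents are modeled as dicts of value lists, and the sample product is made up for illustration.

```python
# (source, dest) pairs mirroring the copyField rules above.
COPY_FIELDS = [("cat", "text"), ("name", "text")]


def apply_copy_fields(doc):
    """Return a copy of doc with every copyField source appended to its destination."""
    indexed = {name: list(values) for name, values in doc.items()}
    for source, dest in COPY_FIELDS:
        indexed.setdefault(dest, []).extend(doc.get(source, []))
    return indexed


doc = {"cat": ["electronics", "phone"], "name": ["Acme phone X"]}
indexed = apply_copy_fields(doc)
print(indexed["text"])  # ['electronics', 'phone', 'Acme phone X']
```

Because text is declared with stored="false", it can be searched but its combined content is not returned in results; the original cat and name fields still are.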

4.5 fieldType

   <field name="pre" type="preanalyzed" indexed="true" stored="true"/>
   <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
   <field name="name" type="text_general" indexed="true" stored="true"/>
   <field name="manu" type="text_gen_sort" indexed="true" stored="true" omitNorms="true" multiValued="false"/>
   <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

   <field name="weight" type="pfloat" indexed="true" stored="true"/>
   <field name="price"  type="pfloat" indexed="true" stored="true"/>
   <field name="popularity" type="pint" indexed="true" stored="true" />
   <field name="inStock" type="boolean" indexed="true" stored="true" />

   <field name="store" type="location" indexed="true" stored="true"/>

As the code above shows, Solr has at least the pfloat, pint, boolean, and string types. These names are themselves aliases that map to built-in Solr classes:

    <fieldType name="pint" class="solr.IntPointField" docValues="true"/>
    <fieldType name="pfloat" class="solr.FloatPointField" docValues="true"/>
    <fieldType name="plong" class="solr.LongPointField" docValues="true"/>
    <fieldType name="pdouble" class="solr.DoublePointField" docValues="true"/>
    
    <fieldType name="pints" class="solr.IntPointField" docValues="true" multiValued="true"/>
    <fieldType name="pfloats" class="solr.FloatPointField" docValues="true" multiValued="true"/>
    <fieldType name="plongs" class="solr.LongPointField" docValues="true" multiValued="true"/>
    <fieldType name="pdoubles" class="solr.DoublePointField" docValues="true" multiValued="true"/>

Custom types can also be defined:

  <fieldType name="text_ik" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false" conf="ik.conf"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true" conf="ik.conf"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

The above defines a new type, text_ik, which in effect gives a name to solr.TextField combined with the analyzers declared for it.

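name: the name of the field type

class: the Solr class that backs the field type

analyzer: specifies an analyzer

type: index or query, selecting the analyzer used at index time or at query time, respectively

tokenizer: specifies the tokenizer

filter: specifies a filter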

5. Client query syntax

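q: the query string and the most important parameter, for example q=id:1; the default is q=*:*.
fl: the fields to return, separated by commas or spaces. Field names are case-sensitive. Example: fl=id,title,sort
start: the offset of the first result to return, typically used for paging; starts at 0 by default.
rows: the maximum number of results to return; defaults to 10 and is combined with start for paging.
sort: the sort order, for example id desc sorts by "id" in descending order.
wt (writer type): the output format, such as xml, json, or php.
fq (filter query): an optional filter that restricts the results of q to those that also satisfy the fq condition. For example, q=id:1&fq=sort:[1 TO 5] finds documents whose id is 1 and whose sort value is between 1 and 5.
df: the default field to search; it is usually already configured.
qt (query type): the type of handler used to process the request; usually left unset, in which case the default is standard.
indent: whether to indent the response; off by default and enabled with indent=true|on. It is generally only needed when debugging json, php, phps, or ruby output.
version: the version of the query syntax; it is best left unset so the server applies its default.
*: a wildcard meaning "all"; for example, *:* matches every value in every field, that is, all documents.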
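
Several of these parameters are usually combined in one request. The sketch below builds such a URL with the standard library; the base URL, the core name `demo_core`, and the helper `build_query_url` are assumptions for illustration, and the field names come from the examples above.

```python
from urllib.parse import urlencode


def build_query_url(base, core, **params):
    """Build a /select URL, percent-encoding every parameter value."""
    return f"{base}/{core}/select?{urlencode(params)}"


url = build_query_url(
    "http://localhost:8080/solr", "demo_core",
    q="*:*",                 # match all documents
    fq="sort:[1 TO 5]",      # keep only those with sort between 1 and 5
    fl="id,title,sort",      # fields to return
    start=10, rows=10,       # second page of 10 results
    sort="id desc",          # order by id, descending
    wt="json", indent="true",
)
print(url)
```

Encoding matters here because values such as `sort:[1 TO 5]` contain spaces, colons, and brackets that are not safe to place in a URL as-is.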