
Pitfalls of Bulk Index Generation in Solr

2018-02-27  tinyMonkey

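Solr provides a way to generate indexes in bulk, and it is mentioned throughout the documentation. I needed exactly this, and so began a painful journey through one pitfall after another.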

Lucene version problem

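The Lucene version problem is what started it all. I had previously used hbase-indexer to build indexes in bulk. hbase-indexer uses the solr-6.4.1 client, my Solr was solr-6.3.0, and everything worked. Later I moved to HDP, whose bundled Solr is solr-5.5.2, and the index merge step failed with a Lucene version error: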

18/02/11 16:38:06 ERROR mr.GoLive: Error sending live merge command
java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.1.236.66:8886/solr: Could not load codec 'Lucene62'.  Did you forget to add lucene-backward-codecs.jar?
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.ngdata.hbaseindexer.mr.GoLive.goLive(GoLive.java:130)
    at com.ngdata.hbaseindexer.mr.HBaseMapReduceIndexerTool.runIndexingPipeline(HBaseMapReduceIndexerTool.java:541)
    at com.ngdata.hbaseindexer.mr.HBaseMapReduceIndexerTool.run(HBaseMapReduceIndexerTool.java:241)
    at com.ngdata.hbaseindexer.mr.HBaseMapReduceIndexerTool.run(HBaseMapReduceIndexerTool.java:120)
    at com.ngdata.hbaseindexer.mr.HBaseMapReduceIndexerTool.run(HBaseMapReduceIndexerTool.java:110)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at com.ngdata.hbaseindexer.mr.HBaseMapReduceIndexerTool.main(HBaseMapReduceIndexerTool.java:104)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.1.236.66:8886/solr: Could not load codec 'Lucene62'.  Did you forget to add lucene-backward-codecs.jar?
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:593)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:262)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:251)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:166)
    at com.ngdata.hbaseindexer.mr.GoLive$1.call(GoLive.java:100)
    at com.ngdata.hbaseindexer.mr.GoLive$1.call(GoLive.java:89)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
18/02/11 16:38:06 INFO mr.GoLive: Live merging of index shards into Solr cluster took 0.894 secs

The error suggests checking whether Solr ships lucene-backward-codecs.jar. It does, and the version of that jar in Solr is lucene-backward-codecs-5.5.2.jar.
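The jar being present does not help here, because lucene-backward-codecs supplies codecs for older index formats, while 'Lucene62' is newer than a 5.5.2 server. The following is a minimal sketch of that comparison. It assumes codec names follow the pattern "Lucene" + one major digit + minor digits, which does not hold for every historical codec name, and the class and method names are hypothetical. Being "not newer" is only a necessary condition for readability, not a guarantee.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: compare a codec name such as "Lucene62" with a server's Lucene version. */
final class CodecVersionCheck {

    // Assumption: "Lucene" + one major digit + minor digits, e.g. "Lucene62" -> 6.2.
    private static final Pattern CODEC = Pattern.compile("Lucene(\\d)(\\d+)");

    /** Returns {major, minor} parsed from the codec name. */
    static int[] codecVersion(String codecName) {
        Matcher m = CODEC.matcher(codecName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized codec name: " + codecName);
        }
        return new int[] {Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))};
    }

    /** True when the codec version is not newer than the server version (e.g. "5.5.2"). */
    static boolean codecIsNotNewer(String codecName, String serverVersion) {
        int[] codec = codecVersion(codecName);
        String[] parts = serverVersion.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return codec[0] < major || (codec[0] == major && codec[1] <= minor);
    }

    public static void main(String[] args) {
        // Index written by a 6.x client, merged into a 5.5.2 server: the failing case above.
        System.out.println(codecIsNotNewer("Lucene62", "5.5.2")); // false
        // The same index against the solr-6.3.0 setup that worked.
        System.out.println(codecIsNotNewer("Lucene62", "6.3.0")); // true
    }
}
```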

My next guess was that the Lucene version could be set through some configuration option. After a long search, I found the only place that configures it, in solrconfig.xml:

 <luceneMatchVersion>5.5.2</luceneMatchVersion>

However, changing it had no effect. Reading further into the Solr source code turned up the following:

Solr's mapreduce package does not support CSV files

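Solr provides its own class for bulk index generation, org.apache.solr.hadoop.MapReduceIndexerTool. However, I found that the default Mapper class is org.apache.solr.hadoop.morphline.MorphlineMapper, which is meant for parsing standalone text files and clearly did not meet my needs. Passing a mapper class directly to MapReduceIndexerTool did not work either, because it requires a variety of parameters. So the only option was to write a custom Mapper and reimplement MapReduceIndexerTool.java alongside it.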
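The core of such a custom Mapper is turning one CSV line into named fields. The following is only a sketch of that step using the standard library, not the code from this project: the class name, the field names, and the quoting rules (double quotes, with "" as an escaped quote) are assumptions. In a real job this logic would sit inside the Mapper's map method and fill a Solr document instead of a plain map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: convert one CSV line into a field-name -> value map. */
final class CsvLineToDoc {

    /** Splits a CSV line, honoring double-quoted fields and "" as an escaped quote. */
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // escaped quote inside a quoted field
                    i++;
                } else if (c == '"') {
                    inQuotes = false;    // closing quote
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;         // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());  // last field has no trailing comma
        return fields;
    }

    /** Pairs the values of one line with the (hypothetical) schema field names. */
    static Map<String, String> toDocument(String[] fieldNames, String line) {
        List<String> values = splitCsvLine(line);
        if (values.size() != fieldNames.length) {
            throw new IllegalArgumentException(
                "Expected " + fieldNames.length + " fields but got " + values.size());
        }
        Map<String, String> doc = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            doc.put(fieldNames[i], values.get(i));
        }
        return doc;
    }

    public static void main(String[] args) {
        String[] fieldNames = {"id", "name", "remark"}; // hypothetical field names
        System.out.println(toDocument(fieldNames, "1,\"Smith, John\",ok"));
    }
}
```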

MapReduceIndexerTool depends on the Solr client code

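I originally assumed that, for different Solr versions, changing the Solr dependency in pom.xml would be enough to resolve the codec version mismatch. But the solrj code that MapReduceIndexerTool uses differs between versions, so two separate programs are unavoidable to bulk-generate indexes for the two Solr versions.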

Solr config directory not found

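In Solr's bulk indexing flow, the mapper turns the data into Solr docs, and SolrReducer.java merely serializes those docs. The index is actually built while org.apache.solr.hadoop.SolrOutputFormat writes its output, where an EmbeddedSolrServer is created to do the indexing. While that EmbeddedSolrServer was being created, the Solr config files could not be found. The cause lies in org.apache.solr.hadoop.SolrRecordWriter: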

  public static EmbeddedSolrServer createEmbeddedSolrServer(Path solrHomeDir, FileSystem fs, Path outputShardDir)
      throws IOException {

    ...
       SolrCore core = container.create("core1", ImmutableMap.of(CoreDescriptor.CORE_DATADIR, dataDirStr));
    ...
 
  }

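The create method here automatically looks for the Solr collection's configuration files under dataDirStr/core1, whereas the configuration pulled from ZooKeeper is placed under dataDirStr, so it cannot be found. I suspect the author tested with a hard-coded solrHomeDir that happened to contain a matching core1. Changing the code as follows solved the problem.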

  public static EmbeddedSolrServer createEmbeddedSolrServer(Path solrHomeDir, FileSystem fs, Path outputShardDir)
      throws IOException {

    ...
       SolrCore core = container.create("core1", Paths.get(solrHomeDir.toString()), ImmutableMap.of(CoreDescriptor.CORE_DATADIR, dataDirStr));
    ...
  }

⚠️: The code above is from solr-5.5.2. The same problem exists in solr-6.3.0, although the error message is different.
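The path mismatch behind this fix can be illustrated with plain java.nio paths. This is a sketch under the assumption that, without an explicit instance path, the core's directory is derived by appending the core name to a root directory, and that the config lives in conf/solrconfig.xml beneath it. The class, the method names, and the /tmp/solr_home path are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: where the config is looked up with and without an explicit instance path. */
final class InstanceDirResolution {

    /** Assumed default: the instance directory is <root>/<coreName>. */
    static Path defaultInstanceDir(Path root, String coreName) {
        return root.resolve(coreName);
    }

    /** With an explicit instance path, the core name no longer affects the location. */
    static Path explicitInstanceDir(Path instancePath, String coreName) {
        return instancePath;
    }

    /** Standard layout: solrconfig.xml sits in the conf directory of the instance dir. */
    static Path solrConfigPath(Path instanceDir) {
        return instanceDir.resolve("conf").resolve("solrconfig.xml");
    }

    public static void main(String[] args) {
        Path unpacked = Paths.get("/tmp/solr_home"); // hypothetical: where the config was placed
        // Default lookup points one level too deep, so the config is not found.
        System.out.println(solrConfigPath(defaultInstanceDir(unpacked, "core1")));
        // Passing the directory explicitly points at the real location.
        System.out.println(solrConfigPath(explicitInstanceDir(unpacked, "core1")));
    }
}
```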

LockFactory problem during TreeMerge

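If index generation includes a TreeMerge phase (the number of shards is smaller than the number of reducers in the first reduce, so a second index-merge pass is needed), it triggers lock contention. This can be fixed by changing how the directory is created in org.apache.solr.hadoop.TreeMergeOutputFormat.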

          Directory mergedIndex = new HdfsDirectory(workDir, NoLockFactory.INSTANCE, context.getConfiguration(), HdfsDirectory.DEFAULT_BUFFER_SIZE);
//        Directory mergedIndex = new HdfsDirectory(workDir, context.getConfiguration());
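Whether a job reaches the TreeMerge phase at all depends on the reducer and shard counts. The sketch below estimates how many merge rounds are needed, assuming each round merges up to `fanout` partial indexes into one until the shard count is reached. The `fanout` parameter and the rounding are assumptions of this sketch, not taken from the tool's source, and the class name is hypothetical.

```java
/** Sketch: estimate how many tree-merge rounds follow the first reduce phase. */
final class TreeMergeRounds {

    /**
     * @param reducers number of reducers, i.e. partial indexes after the first reduce
     * @param shards   number of shards the final index must have
     * @param fanout   assumed maximum number of partial indexes merged into one per round
     */
    static int mergeRounds(int reducers, int shards, int fanout) {
        if (reducers < 1 || shards < 1 || fanout < 2) {
            throw new IllegalArgumentException("reducers and shards must be >= 1, fanout >= 2");
        }
        int rounds = 0;
        long partitions = reducers;
        while (partitions > shards) {
            long merged = (partitions + fanout - 1) / fanout; // ceiling division
            partitions = Math.max(shards, merged);            // never go below the shard count
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        System.out.println(mergeRounds(4, 4, 2)); // 0: shards == reducers, no TreeMerge
        System.out.println(mergeRounds(8, 2, 2)); // 2: 8 -> 4 -> 2
    }
}
```

Only jobs where this count is greater than zero would go through TreeMergeOutputFormat and hit the lock problem described above.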
    

Generated indexes are not merged

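The same data was bulk-indexed four times, and the index files that ended up on HDFS are listed below. No merge of the index took place, and a search returns four documents with the same ID.
⚠️ Bulk indexing is therefore suitable for incremental or full indexing, but not for bulk updates of documents that are already indexed.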

drwxr-xr-x   - infra-solr hdfs          0 2018-01-31 10:49 /user/infra-solr/mrsolr/core_node5
drwxr-xr-x   - infra-solr hdfs          0 2018-01-31 10:49 /user/infra-solr/mrsolr/core_node5/data
drwxr-xr-x   - infra-solr hdfs          0 2018-02-03 21:29 /user/infra-solr/mrsolr/core_node5/data/index
-rwxr-xr-x   3 infra-solr hdfs        100 2018-02-03 21:15 /user/infra-solr/mrsolr/core_node5/data/index/_7.fdt
-rwxr-xr-x   3 infra-solr hdfs         83 2018-02-03 21:15 /user/infra-solr/mrsolr/core_node5/data/index/_7.fdx
-rwxr-xr-x   3 infra-solr hdfs        244 2018-02-03 21:15 /user/infra-solr/mrsolr/core_node5/data/index/_7.fnm
-rwxr-xr-x   3 infra-solr hdfs        489 2018-02-03 21:15 /user/infra-solr/mrsolr/core_node5/data/index/_7.si
-rwxr-xr-x   3 infra-solr hdfs        110 2018-02-03 21:15 /user/infra-solr/mrsolr/core_node5/data/index/_7_Lucene50_0.doc
-rwxr-xr-x   3 infra-solr hdfs        178 2018-02-03 21:15 /user/infra-solr/mrsolr/core_node5/data/index/_7_Lucene50_0.tim
-rwxr-xr-x   3 infra-solr hdfs        102 2018-02-03 21:15 /user/infra-solr/mrsolr/core_node5/data/index/_7_Lucene50_0.tip
-rwxr-xr-x   3 infra-solr hdfs         73 2018-02-03 21:15 /user/infra-solr/mrsolr/core_node5/data/index/_7_Lucene54_0.dvd
-rwxr-xr-x   3 infra-solr hdfs        118 2018-02-03 21:15 /user/infra-solr/mrsolr/core_node5/data/index/_7_Lucene54_0.dvm
-rwxr-xr-x   3 infra-solr hdfs        100 2018-02-03 21:20 /user/infra-solr/mrsolr/core_node5/data/index/_8.fdt
-rwxr-xr-x   3 infra-solr hdfs         83 2018-02-03 21:20 /user/infra-solr/mrsolr/core_node5/data/index/_8.fdx
-rwxr-xr-x   3 infra-solr hdfs        496 2018-02-03 21:20 /user/infra-solr/mrsolr/core_node5/data/index/_8.fnm
-rwxr-xr-x   3 infra-solr hdfs        489 2018-02-03 21:20 /user/infra-solr/mrsolr/core_node5/data/index/_8.si
-rwxr-xr-x   3 infra-solr hdfs        110 2018-02-03 21:20 /user/infra-solr/mrsolr/core_node5/data/index/_8_Lucene50_0.doc
-rwxr-xr-x   3 infra-solr hdfs        244 2018-02-03 21:20 /user/infra-solr/mrsolr/core_node5/data/index/_8_Lucene50_0.tim
-rwxr-xr-x   3 infra-solr hdfs        148 2018-02-03 21:20 /user/infra-solr/mrsolr/core_node5/data/index/_8_Lucene50_0.tip
-rwxr-xr-x   3 infra-solr hdfs         82 2018-02-03 21:20 /user/infra-solr/mrsolr/core_node5/data/index/_8_Lucene54_0.dvd
-rwxr-xr-x   3 infra-solr hdfs        179 2018-02-03 21:20 /user/infra-solr/mrsolr/core_node5/data/index/_8_Lucene54_0.dvm
-rwxr-xr-x   3 infra-solr hdfs        100 2018-02-03 21:24 /user/infra-solr/mrsolr/core_node5/data/index/_9.fdt
-rwxr-xr-x   3 infra-solr hdfs         83 2018-02-03 21:24 /user/infra-solr/mrsolr/core_node5/data/index/_9.fdx
-rwxr-xr-x   3 infra-solr hdfs        496 2018-02-03 21:24 /user/infra-solr/mrsolr/core_node5/data/index/_9.fnm
-rwxr-xr-x   3 infra-solr hdfs        489 2018-02-03 21:24 /user/infra-solr/mrsolr/core_node5/data/index/_9.si
-rwxr-xr-x   3 infra-solr hdfs        110 2018-02-03 21:24 /user/infra-solr/mrsolr/core_node5/data/index/_9_Lucene50_0.doc
-rwxr-xr-x   3 infra-solr hdfs        244 2018-02-03 21:24 /user/infra-solr/mrsolr/core_node5/data/index/_9_Lucene50_0.tim
-rwxr-xr-x   3 infra-solr hdfs        148 2018-02-03 21:24 /user/infra-solr/mrsolr/core_node5/data/index/_9_Lucene50_0.tip
-rwxr-xr-x   3 infra-solr hdfs         82 2018-02-03 21:24 /user/infra-solr/mrsolr/core_node5/data/index/_9_Lucene54_0.dvd
-rwxr-xr-x   3 infra-solr hdfs        179 2018-02-03 21:24 /user/infra-solr/mrsolr/core_node5/data/index/_9_Lucene54_0.dvm
-rwxr-xr-x   3 infra-solr hdfs        100 2018-02-03 21:29 /user/infra-solr/mrsolr/core_node5/data/index/_a.fdt
-rwxr-xr-x   3 infra-solr hdfs         83 2018-02-03 21:29 /user/infra-solr/mrsolr/core_node5/data/index/_a.fdx
-rwxr-xr-x   3 infra-solr hdfs        496 2018-02-03 21:29 /user/infra-solr/mrsolr/core_node5/data/index/_a.fnm
-rwxr-xr-x   3 infra-solr hdfs        489 2018-02-03 21:29 /user/infra-solr/mrsolr/core_node5/data/index/_a.si
-rwxr-xr-x   3 infra-solr hdfs        110 2018-02-03 21:29 /user/infra-solr/mrsolr/core_node5/data/index/_a_Lucene50_0.doc
-rwxr-xr-x   3 infra-solr hdfs        244 2018-02-03 21:29 /user/infra-solr/mrsolr/core_node5/data/index/_a_Lucene50_0.tim
-rwxr-xr-x   3 infra-solr hdfs        148 2018-02-03 21:29 /user/infra-solr/mrsolr/core_node5/data/index/_a_Lucene50_0.tip
-rwxr-xr-x   3 infra-solr hdfs         82 2018-02-03 21:29 /user/infra-solr/mrsolr/core_node5/data/index/_a_Lucene54_0.dvd
-rwxr-xr-x   3 infra-solr hdfs        179 2018-02-03 21:29 /user/infra-solr/mrsolr/core_node5/data/index/_a_Lucene54_0.dvm
-rwxr-xr-x   3 infra-solr hdfs        289 2018-02-03 21:24 /user/infra-solr/mrsolr/core_node5/data/index/segments_b
-rwxr-xr-x   3 infra-solr hdfs        351 2018-02-03 21:29 /user/infra-solr/mrsolr/core_node5/data/index/segments_c
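The listing shows four separate segments (_7, _8, _9, _a), one per run. The toy model below, which is plain Java and not Lucene, illustrates why that produces four hits for one ID: attaching a segment as-is never looks at the IDs already present, whereas a normal update removes older copies first. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of an index made of segments, each mapping document ID -> content. */
final class AppendOnlyIndexDemo {

    private final List<Map<String, String>> segments = new ArrayList<>();

    /** Bulk path: attach a finished segment without checking existing IDs. */
    void mergeSegment(Map<String, String> segment) {
        segments.add(new HashMap<>(segment));
    }

    /** Update path: delete older copies of the ID, then add the new version. */
    void updateDocument(String id, String content) {
        for (Map<String, String> segment : segments) {
            segment.remove(id);
        }
        Map<String, String> fresh = new HashMap<>();
        fresh.put(id, content);
        segments.add(fresh);
    }

    /** Returns every stored version of the given ID across all segments. */
    List<String> search(String id) {
        List<String> hits = new ArrayList<>();
        for (Map<String, String> segment : segments) {
            if (segment.containsKey(id)) {
                hits.add(segment.get(id));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        AppendOnlyIndexDemo index = new AppendOnlyIndexDemo();
        for (int run = 1; run <= 4; run++) {
            Map<String, String> segment = new HashMap<>();
            segment.put("1", "run " + run);
            index.mergeSegment(segment);      // same ID, four bulk runs
        }
        System.out.println(index.search("1").size()); // 4 hits for one ID
        index.updateDocument("1", "latest");
        System.out.println(index.search("1").size()); // 1 hit after a real update
    }
}
```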
