
[Repost] Analysis of the Lucene Indexing Process

2018-05-10 · 囧雪啥都不知道

Original source: the series "Lucene学习总结之四: Lucene索引过程分析" (Lucene Learning Summary, Part 4: Analysis of the Lucene Indexing Process), parts (1) through (4). This post collects excerpts from that series for reference.

Lucene's index creation process includes, besides writing terms (Term) into the inverted index and ultimately into Lucene's index files, the analysis step (Analyzer) and segment merging (merge segment).

These two parts are not covered this time; they will be analyzed in later articles.

Lucene's indexing process is described in many blogs and articles. A recommended read is "Annotated Lucene" (its Chinese title appears to be "Lucene源码剖析"), which is quite good.

The best way to really understand how Lucene builds its index files is to step through the code in a debugger while following the articles. Not only does this give you the most detailed and accurate picture of the indexing process (descriptions always drift a little, but the code never lies), it also lets you study some of Lucene's excellent implementation techniques and reuse them in your own work; Lucene is, after all, one of the better open-source projects.

The indexing process described here is that of Lucene 3.0.0.

I. Architecture of the Indexing Process

Indexing in Lucene 3.0 goes through a very complex process: the various pieces of information are analyzed, processed and written in different objects. To support multithreading, each thread creates its own set of similarly structured objects, and to improve efficiency some of these object sets are reused, which makes the indexing process even more complex.

The indexing process is, in essence, a walk along the indexing chain shown in the figure below. Each node of the chain is responsible for indexing a different part of the document's information; once the whole chain has been traversed, the document has been fully processed. The initial indexing chain is called the basic indexing chain.

To support multithreading, so that several threads can process documents concurrently, each thread builds its own chain hierarchy on top of the basic indexing chain and can therefore work independently. This per-thread hierarchy is called the thread indexing chain.

To improve efficiency, and because fields with the same name are processed in a similar way and use roughly the same caches, the objects involved are not recreated for every document a thread processes; they are reused instead. Each field therefore also gets its own chain hierarchy, called the field indexing chain. Each node of a field indexing chain is created by the corresponding node of the thread indexing chain via addFields.

Once a document has been fully processed, the various pieces of information have to be written to the index files. This write phase is synchronous rather than multithreaded: the information is written to the index files by walking down the basic indexing chain node by node.
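
To make the chain idea concrete, here is a deliberately simplified, hypothetical sketch of the pattern. The names below are invented for illustration and are not the actual Lucene 3.0 chain nodes (the real ones include classes such as DocFieldProcessor, DocInverter and FreqProxTermsWriter):

// Hypothetical sketch of the indexing-chain pattern only; not actual Lucene classes.
interface ChainNode {
    void processField(String field, String text);  // consume one field of the current document
    void flush();                                   // write whatever this node has buffered to the index files
}

// A node that inverts field text into in-memory postings, then hands the field to the next node.
class InvertedTermsNode implements ChainNode {
    private final ChainNode next;   // e.g. a stored-fields or norms writer further down the chain

    InvertedTermsNode(ChainNode next) { this.next = next; }

    public void processField(String field, String text) {
        // ... tokenize `text` and add each term to the in-memory posting lists ...
        if (next != null) next.processField(field, text);   // continue down the chain
    }

    public void flush() {
        // ... write the buffered posting lists to the index files ...
        if (next != null) next.flush();                      // flushing also walks the whole chain
    }
}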

This process is analyzed in detail below.

(Figure: the indexing process)

II. The Indexing Process in Detail

1. Creating the IndexWriter object

Code:

IndexWriter writer = new IndexWriter(
        FSDirectory.open(INDEX_DIR),                   // Directory in which the index files are stored
        new StandardAnalyzer(Version.LUCENE_CURRENT),  // Analyzer used to tokenize field text
        true,                                          // create = true: build a new index, overwriting an existing one
        IndexWriter.MaxFieldLength.LIMITED);           // index at most the first 10,000 terms of each field

The IndexWriter object mainly holds the following kinds of information:

Information stored in the SegmentInfos object:
+ With an index folder that contains three segments, the SegmentInfos object looks like this:

segmentInfos    SegmentInfos  (id=37)    
    capacityIncrement    0    
    counter    3    
    elementCount    3    
    elementData    Object[10]  (id=68)    
        [0]    SegmentInfo  (id=166)    
            delCount    0    
            delGen    -1    
            diagnostics    HashMap<K,V>  (id=170)    
            dir    SimpleFSDirectory  (id=171)    
            docCount    2    
            docStoreIsCompoundFile    false    
            docStoreOffset    -1    
            docStoreSegment    null    
            files    ArrayList<E>  (id=173)    
            hasProx    true    
            hasSingleNormFile    true    
            isCompoundFile    1    
            name    "_0"    
            normGen    null    
            preLockless    false    
            sizeInBytes    635    
        [1]    SegmentInfo  (id=168)    
            delCount    0    
            delGen    -1    
            diagnostics    HashMap<K,V>  (id=177)    
            dir    SimpleFSDirectory  (id=171)    
            docCount    2    
            docStoreIsCompoundFile    false    
            docStoreOffset    -1    
            docStoreSegment    null    
            files    ArrayList<E>  (id=178)    
            hasProx    true    
            hasSingleNormFile    true    
            isCompoundFile    1    
            name    "_1"    
            normGen    null    
            preLockless    false    
            sizeInBytes    635    
        [2]    SegmentInfo  (id=169)    
            delCount    0    
            delGen    -1    
            diagnostics    HashMap<K,V>  (id=180)    
            dir    SimpleFSDirectory  (id=171)    
            docCount    2    
            docStoreIsCompoundFile    false    
            docStoreOffset    -1    
            docStoreSegment    null    
            files    ArrayList<E>  (id=214)    
            hasProx    true    
            hasSingleNormFile    true    
            isCompoundFile    1    
            name    "_2"    
            normGen    null    
            preLockless    false    
            sizeInBytes    635     
    generation    4    
    lastGeneration    4    
    modCount    3    
    pendingSegnOutput    null    
    userData    HashMap<K,V>  (id=146)    
    version    1263044890832   
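
If you want to inspect this structure outside the debugger, a minimal sketch along the following lines reads the newest segments_N file and lists its segments. This assumes the Lucene 3.0 API (SegmentInfos.read(Directory)) and the public name/docCount fields visible in the dump above:

import java.io.File;
import org.apache.lucene.index.SegmentInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DumpSegmentInfos {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File(args[0]));  // path of the index folder
    SegmentInfos infos = new SegmentInfos();
    infos.read(dir);                                       // loads the latest generation, e.g. segments_4
    System.out.println("generation=" + infos.getGeneration()
        + " version=" + infos.getVersion()
        + " segments=" + infos.size());
    for (int i = 0; i < infos.size(); i++) {
      SegmentInfo si = infos.info(i);                      // one entry per segment: _0, _1, _2, ...
      System.out.println("  " + si.name + " docCount=" + si.docCount);
    }
    dir.close();
  }
}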

About the IndexFileDeleter:

(1) When the IndexWriter is created

IndexWriter writer = new IndexWriter(
        FSDirectory.open(indexDir),
        new StandardAnalyzer(Version.LUCENE_CURRENT),
        true,
        IndexWriter.MaxFieldLength.LIMITED);
writer.setMergeFactor(3);   // merge whenever 3 segments accumulate at the same level

At this point the only tracked commit point is the new segments_1 file, and the reference counts are as follows:

refCounts HashMap<K,V> (id = 101)
    size  1
    table HashMap$Entry<K,V>[16] (id = 105)
        [8] HashMap$Entry<K,V> (id = 110)
            key  "segments_1"
            value IndexFileDeleter$RefCount (id = 38)
                count 1

(2) When the first segment is added

indexDocs(writer, docDir);   // add one Document per file under docDir (helper from the example program)
writer.commit();             // flush the buffered documents as a new segment and write a new segments_N

At first the segment is not written as a compound file, so the reference counts look like this:

refCounts HashMap<K,V> (id = 101)
    size 9
    table HashMap$Entry<K,V>[16] (id = 105)
        [1] HashMap$Entry<K,V> (id = 129)
            key "_0.tis"
            value IndexFileDeleter$RefCount (id = 138)
                count 1
        [3] HashMap$Entry<K,V> (id = 130)
            key "_0.fnm"
            value IndexFileDeleter$RefCount  (id=141)    
                count 1
        [4] HashMap$Entry<K,V> (id = 134)
            key "_0.tii"
            value IndexFileDeleter$RefCount  (id=142)    
                count 1
        [8] HashMap$Entry<K,V> (id = 135)
            key "_0.frq"
            value IndexFileDeleter$RefCount  (id=143)    
                count 1
        [10] HashMap$Entry<K,V> (id = 136)
            key "_0.fdx"
            value IndexFileDeleter$RefCount  (id=144)    
                count 1
        [13] HashMap$Entry<K,V> (id = 139)
            key "_0.prx"
            value IndexFileDeleter$RefCount  (id=145)    
                count 1
        [14] HashMap$Entry<K,V> (id = 140)
            key "_0.fdt"
            value IndexFileDeleter$RefCount  (id=146)    
                count 1

The files are then merged into a compound file, which is added to the reference counts:

refCounts    HashMap<K,V>  (id=101)     
    size    10    
    table    HashMap$Entry<K,V>[16]  (id=105)     
        [1]    HashMap$Entry<K,V>  (id=129)     
            key    "_0.tis"     
            value    IndexFileDeleter$RefCount  (id=138)    
                count    1     
        [2]    HashMap$Entry<K,V>  (id=154)     
            key    "_0.cfs"     
            value    IndexFileDeleter$RefCount  (id=155)    
                count    1     
        [3]    HashMap$Entry<K,V>  (id=130)     
            key    "_0.fnm"     
            value    IndexFileDeleter$RefCount  (id=141)    
                count    1     
        [4]    HashMap$Entry<K,V>  (id=134)     
            key    "_0.tii"     
            value    IndexFileDeleter$RefCount  (id=142)    
                count    1     
        [8]    HashMap$Entry<K,V>  (id=135)     
            key    "_0.frq"     
            value    IndexFileDeleter$RefCount  (id=143)    
                count    1     
        [10]    HashMap$Entry<K,V>  (id=136)     
            key    "_0.fdx"     
            value    IndexFileDeleter$RefCount  (id=144)    
                count    1     
        [13]    HashMap$Entry<K,V>  (id=139)     
            key    "_0.prx"     
            value    IndexFileDeleter$RefCount  (id=145)    
                count    1     
        [14]    HashMap$Entry<K,V>  (id=140)     
            key    "_0.fdt"     
            value    IndexFileDeleter$RefCount  (id=146)    
                count    1   

IndexFileDeleter.decRef() is then used to delete the files [_0.nrm, _0.tis, _0.fnm, _0.tii, _0.frq, _0.fdx, _0.prx, _0.fdt]:

refCounts    HashMap<K,V>  (id=101)     
    size    2    
    table    HashMap$Entry<K,V>[16]  (id=105)     
        [2]    HashMap$Entry<K,V>  (id=154)     
            key    "_0.cfs"     
            value    IndexFileDeleter$RefCount  (id=155)    
                count    1     
        [8]    HashMap$Entry<K,V>  (id=110)     
            key    "segments_1"     
            value    IndexFileDeleter$RefCount  (id=38)    
                count    1    

Then a new segments_2 is created:

refCounts    HashMap<K,V>  (id=77)     
    size    3    
    table    HashMap$Entry<K,V>[16]  (id=84)     
        [2]    HashMap$Entry<K,V>  (id=87)     
            key    "_0.cfs"     
            value    IndexFileDeleter$RefCount  (id=91)    
                count    3     
        [8]    HashMap$Entry<K,V>  (id=89)     
            key    "segments_1"     
            value    IndexFileDeleter$RefCount  (id=62)    
                count    0     
        [9]    HashMap$Entry<K,V>  (id=90)     
            key    "segments_2"    
            next    null    
            value    IndexFileDeleter$RefCount  (id=93)    
                count    1     

IndexFileDeleter.decRef() then deletes the segments_1 file:

refCounts    HashMap<K,V>  (id=77)     
    size    2    
    table    HashMap$Entry<K,V>[16]  (id=84)     
        [2]    HashMap$Entry<K,V>  (id=87)     
            key    "_0.cfs"     
            value    IndexFileDeleter$RefCount  (id=91)    
                count    2     
        [9]    HashMap$Entry<K,V>  (id=90)     
            key    "segments_2"     
            value    IndexFileDeleter$RefCount  (id=93)    
                count    1   
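
What the dumps above show is plain reference counting: a file's count is incremented whenever a commit point (a segments_N file) references it, decremented when that commit point is dropped, and the file is physically deleted once the count reaches zero. A hypothetical, stripped-down sketch of this bookkeeping (not the actual IndexFileDeleter code):

import java.util.HashMap;
import java.util.Map;

class FileRefCounter {
  private final Map<String, Integer> refCounts = new HashMap<String, Integer>();

  // Called for every file referenced by a newly written commit point.
  void incRef(String fileName) {
    Integer c = refCounts.get(fileName);
    refCounts.put(fileName, c == null ? 1 : c + 1);
  }

  // Called for every file of a commit point that is being discarded.
  void decRef(String fileName) {
    int c = refCounts.get(fileName) - 1;
    if (c == 0) {
      refCounts.remove(fileName);
      deleteFile(fileName);            // nothing references the file any more: delete it
    } else {
      refCounts.put(fileName, c);
    }
  }

  private void deleteFile(String fileName) {
    // ... the real implementation deletes the file from the index directory ...
  }
}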

(3) Adding the second segment

indexDocs(writer, docDir); 
writer.commit();

(4) Adding the third segment. Because mergeFactor is 3, this triggers a segment merge.

indexDocs(writer, docDir); 
writer.commit();

First, just as for the other segments, _2.cfs and segments_4 are generated.

At the same time, a thread is created to perform the segment merge in the background (ConcurrentMergeScheduler$MergeThread.run()).

The reference counts at this point:

refCounts    HashMap<K,V>  (id=84)     
    size    5    
    table    HashMap$Entry<K,V>[16]  (id=98)     
        [2]    HashMap$Entry<K,V>  (id=112)     
            key    "_0.cfs"     
            value    IndexFileDeleter$RefCount  (id=117)    
                count    1     
        [4]    HashMap$Entry<K,V>  (id=113)     
            key    "_3.cfs"     
            value    IndexFileDeleter$RefCount  (id=118)    
                count    1     
        [12]    HashMap$Entry<K,V>  (id=114)     
            key    "_1.cfs"     
            value    IndexFileDeleter$RefCount  (id=119)    
                count    1     
        [13]    HashMap$Entry<K,V>  (id=115)     
            key    "_2.cfs"     
            value    IndexFileDeleter$RefCount  (id=120)    
                count    1     
        [15]    HashMap$Entry<K,V>  (id=116)     
            key    "segments_4"     
            value    IndexFileDeleter$RefCount  (id=121)    
                count    1   
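
ConcurrentMergeScheduler is the default merge scheduler in Lucene 3.0, so the background merge thread appears without any extra configuration. For completeness, a sketch of setting the relevant knobs explicitly on the writer used above (standard IndexWriter setters in 3.0):

writer.setMergeFactor(3);                                  // merge once 3 segments pile up at the same level
writer.setMergeScheduler(new ConcurrentMergeScheduler());  // run merges in background threads (the default)
writer.setUseCompoundFile(true);                           // pack each segment into a single .cfs file (also the default)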

(5) Closing the writer

writer.close();   // commits pending changes, waits for running merges to finish, and releases write.lock

The segments that were merged away are then deleted via IndexFileDeleter.decRef().

About SimpleFSLock, which provides synchronization between JVMs:

// The abstract Lock class
public abstract class Lock {

  public static long LOCK_POLL_INTERVAL = 1000;

  public static final long LOCK_OBTAIN_WAIT_FOREVER = -1;

  public abstract boolean obtain() throws IOException;

  public boolean obtain(long lockWaitTimeout) throws LockObtainFailedException, IOException {

    boolean locked = obtain();

    if (lockWaitTimeout < 0 && lockWaitTimeout != LOCK_OBTAIN_WAIT_FOREVER) 
      throw new IllegalArgumentException("...");

    long maxSleepCount = lockWaitTimeout / LOCK_POLL_INTERVAL;

    long sleepCount = 0;

    while (!locked) {

      if (lockWaitTimeout != LOCK_OBTAIN_WAIT_FOREVER && sleepCount++ >= maxSleepCount) { 
        throw new LockObtainFailedException("Lock obtain timed out."); 
      } 
      try { 
        Thread.sleep(LOCK_POLL_INTERVAL); 
      } catch (InterruptedException ie) { 
        throw new ThreadInterruptedException(ie); 
      } 
      locked = obtain(); 
    } 
    return locked; 
  }

  public abstract void release() throws IOException;

  public abstract boolean isLocked() throws IOException;

}

// The abstract LockFactory class

public abstract class LockFactory {

  public abstract Lock makeLock(String lockName);

  abstract public void clearLock(String lockName) throws IOException; 
}

// The SimpleFSLock implementation

class SimpleFSLock extends Lock {

  File lockFile; 
  File lockDir;

  public SimpleFSLock(File lockDir, String lockFileName) { 
    this.lockDir = lockDir; 
    lockFile = new File(lockDir, lockFileName); 
  }

  @Override 
  public boolean obtain() throws IOException {

    if (!lockDir.exists()) {

      if (!lockDir.mkdirs()) 
        throw new IOException("Cannot create directory: " + lockDir.getAbsolutePath());

    } else if (!lockDir.isDirectory()) {

      throw new IOException("Found regular file where directory expected: " + lockDir.getAbsolutePath()); 
    }

    return lockFile.createNewFile();

  }

  @Override 
  public void release() throws LockReleaseFailedException {

    if (lockFile.exists() && !lockFile.delete()) 
      throw new LockReleaseFailedException("failed to delete " + lockFile);

  }

  @Override 
  public boolean isLocked() {

    return lockFile.exists();

  }

}

// The SimpleFSLockFactory implementation

public class SimpleFSLockFactory extends FSLockFactory {

  public SimpleFSLockFactory(String lockDirName) throws IOException {

    setLockDir(new File(lockDirName));

  }

  @Override 
  public Lock makeLock(String lockName) {

    if (lockPrefix != null) {

      lockName = lockPrefix + "-" + lockName;

    }

    return new SimpleFSLock(lockDir, lockName);

  }

  @Override 
  public void clearLock(String lockName) throws IOException {

    if (lockDir.exists()) {

      if (lockPrefix != null) {

        lockName = lockPrefix + "-" + lockName;

      }

      File lockFile = new File(lockDir, lockName);

      if (lockFile.exists() && !lockFile.delete()) {

        throw new IOException("Cannot delete " + lockFile);

      }

    }

  }

}
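
IndexWriter relies on exactly this mechanism to enforce single-writer access: when it is opened it obtains the "write.lock" lock in the index directory, and any other IndexWriter, including one in another JVM, fails to open until the lock is released. A minimal sketch of obtaining the same lock by hand (Lucene 3.0 API; IndexWriter.WRITE_LOCK_NAME is the constant "write.lock"):

import java.io.File;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Lock;

public class WriteLockDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File(args[0]));         // the index folder
    Lock writeLock = dir.makeLock(IndexWriter.WRITE_LOCK_NAME);  // "write.lock", created by the directory's LockFactory (e.g. SimpleFSLockFactory above)
    if (writeLock.obtain()) {                                    // non-blocking attempt
      try {
        System.out.println("got write.lock - no IndexWriter is open on this index");
      } finally {
        writeLock.release();                                     // always release, otherwise the index stays locked
      }
    } else {
      System.out.println("write.lock is held by another IndexWriter (possibly in another JVM)");
    }
    dir.close();
  }
}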

2. Creating the Document object and adding Fields
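
The original series continues from here. As a quick preview of this step, adding a document with the Lucene 3.0 API looks roughly like the following, assuming the writer from above and a File named file; the field names "path" and "contents" are just examples, similar to what the indexDocs() helper does for each file:

Document doc = new Document();
doc.add(new Field("path", file.getPath(),
                  Field.Store.YES, Field.Index.NOT_ANALYZED));  // stored, indexed as a single untokenized term
doc.add(new Field("contents", new FileReader(file)));           // tokenized by the analyzer, indexed, not stored
writer.addDocument(doc);                                        // runs the document through the indexing chain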
