"Lucene in Action" notes: Getting to know Lucene

2019-02-24 Devops_cheers

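Lucene is a high-performance, scalable information retrieval library. It is a mature library written in Java, and it offers a simple yet powerful core API: with it you can build a search service without a deep understanding of how full-text indexing and searching work internally. In recent years Lucene has become the most popular open-source information retrieval library.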

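Example program

0. Program description

Below we use Lucene to build a small program that indexes the contents of every file ending in .txt under a given directory. With it I can, for example, find out which .txt files contain the word "lucene".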

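1. Building the index

The code contains a simple class named Indexer. When Indexer finishes, it leaves behind a set of index files for its sister program, Searcher, to use. Indexer builds an index over all files ending in .txt in a given directory.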

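import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

public class Indexer {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      throw new IllegalArgumentException("Usage: java " +
        Indexer.class.getName()
        + " <index dir> <data dir>");
    }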
    String indexDir = args[0];
    String dataDir = args[1];
    long start = System.currentTimeMillis();
    Indexer indexer = new Indexer(indexDir);
    int numIndexed;
    try {
      numIndexed = indexer.index(dataDir, new TextFilesFilter());
    } finally {
       indexer.close();
    }
    long end = System.currentTimeMillis();
    System.out.println("Indexing " + numIndexed + " files took " +
    (end - start) + " milliseconds");
  }

  private IndexWriter writer;

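  // Opens (and, because create == true, overwrites) the index in indexDir.
  public Indexer(String indexDir) throws IOException {
    Directory dir = FSDirectory.open(new File(indexDir));
    writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                              true,
                              IndexWriter.MaxFieldLength.UNLIMITED);
  }

  // Closes the writer so that all pending changes are flushed to the index.
  public void close() throws IOException {
    writer.close();
  }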

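  // Indexes every readable, non-hidden regular file in dataDir that the
  // filter accepts, and returns the number of documents in the index.
  public int index(String dataDir, FileFilter filter) throws Exception {
    File[] files = new File(dataDir).listFiles();
    for (File f: files) {
      if(!f.isDirectory() &&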
        !f.isHidden() &&
        f.exists() &&
        f.canRead() &&
        (filter == null || filter.accept(f))) {
        indexFile(f);
      }
    }
    return writer.numDocs();
  }

  private static class TextFilesFilter implements FileFilter {
    public boolean accept (File path) {
      return path.getName().toLowerCase().endsWith(".txt");
    }
  }
  
  protected Document  getDocument(File f) throws Exception {
    Document doc = new Document();
    // The file body is read through a Reader: it is analyzed and indexed.
    doc.add(new Field("contents", new FileReader(f)));
    doc.add(new Field("filename",  f.getName(),
                  Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("fullpath", f.getCanonicalPath(),
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    return doc;
  }
  
  private void indexFile(File f) throws Exception {
    System.out.println("Indexing " + f.getCanonicalPath());
    Document doc = getDocument(f);
    writer.addDocument(doc);
  }
}

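Indexer takes two arguments: the directory where the index will be stored, and the directory holding the .txt files to be indexed. The program first creates an IndexWriter, the class that accepts documents and eventually writes the index files into indexDir. It then walks all files under dataDir, keeps those ending in .txt, and builds a Document from the contents of each one. Finally it calls addDocument on the IndexWriter instance to add the Document to the index.
This step leaves a set of index files in indexDir. The Searcher program below reads these files to carry out the actual search.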
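To make "building an index" a little more concrete, here is a minimal sketch of the underlying idea, an inverted index that maps each word to the files containing it. It uses only the JDK. The class ToyInvertedIndex and its methods are names made up for illustration, and the tokenization rule (split on anything that is not a letter or digit, then lowercase) is an assumption that only loosely resembles what an analyzer does. It is not Lucene's API or its on-disk format.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy illustration only: this is NOT Lucene's index format or API.
class ToyInvertedIndex {
    // term -> sorted set of file names that contain the term
    private final Map<String, TreeSet<String>> postings = new TreeMap<>();

    // Splits the text on non-letter/digit characters and records each
    // lower-cased token against the given file name.
    void addDocument(String fileName, String contents) {
        for (String token : contents.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty token when the text starts with a delimiter
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(fileName);
        }
    }

    // Returns a copy of the set of files containing the term (empty if none).
    TreeSet<String> filesContaining(String term) {
        TreeSet<String> files = postings.get(term.toLowerCase(Locale.ROOT));
        return files == null ? new TreeSet<String>() : new TreeSet<>(files);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("a.txt", "Lucene is a search library");
        index.addDocument("b.txt", "Notes on Apache Lucene");
        index.addDocument("c.txt", "Nothing relevant here");
        System.out.println(index.filesContaining("lucene")); // prints [a.txt, b.txt]
    }
}
```

Looking a word up is then a single map access instead of a scan over every file, which is the main reason to build the index ahead of time.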

2. Searching the index

The code below shows how to run a query against the Lucene index.

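import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

public class Searcher {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      throw new IllegalArgumentException("Usage: java " + Searcher.class.getName() + " <index dir> <query>");
    }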
    String indexDir = args[0];
    String q = args[1];
    search(indexDir, q);
  }
  
  public static void search (String indexDir, String q) throws IOException, ParseException {
    Directory dir = FSDirectory.open(new File(indexDir));
    IndexSearcher is = new IndexSearcher(dir);
    QueryParser parser = new QueryParser(Version.LUCENE_30,
                          "contents",  new StandardAnalyzer(Version.LUCENE_30));
    Query query = parser.parse(q);
    long start = System.currentTimeMillis();
    // Ask for the 10 highest-scoring hits.
    TopDocs hits = is.search(query, 10);
    long end = System.currentTimeMillis();
    System.err.println("Found " + hits.totalHits + " document(s) (in " + (end - start) + " milliseconds) that matched query '" + q + "':");
    for(ScoreDoc scoreDoc : hits.scoreDocs){
      Document doc = is.doc(scoreDoc.doc);
      System.out.println(doc.get("fullpath"));
    }
    is.close();
  }
}

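Searcher is just as simple as Indexer. It first uses Lucene's IndexSearcher class to open the directory where the index is stored (indexDir), then uses a QueryParser to parse the search keywords into a Query object. Finally it calls the search method of the IndexSearcher with that query, which returns the set of matching documents, and closes the IndexSearcher once the search is done.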
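The idea of "return the best N matches" can be sketched with the JDK alone. Everything here is an assumption made for illustration: ToySearcher and Hit are made-up names, the score is simply how many times the term occurs in a document, and the code scans in-memory strings instead of reading an index. Lucene's real scoring is considerably more involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy illustration only: raw term counts stand in for a relevance score.
class ToySearcher {
    // Hypothetical stand-in for one search hit: a file name plus its score.
    static final class Hit {
        final String fileName;
        final int score;

        Hit(String fileName, int score) {
            this.fileName = fileName;
            this.score = score;
        }
    }

    // Counts occurrences of the term in each document and returns at most
    // topN hits, highest score first.
    static List<Hit> search(Map<String, String> docs, String term, int topN) {
        String wanted = term.toLowerCase(Locale.ROOT);
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            int count = 0;
            for (String token : e.getValue().toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (token.equals(wanted)) {
                    count++;
                }
            }
            if (count > 0) {
                hits.add(new Hit(e.getKey(), count));
            }
        }
        // Sort by score descending; break ties by file name so the order is stable.
        Collections.sort(hits, new Comparator<Hit>() {
            @Override
            public int compare(Hit a, Hit b) {
                if (a.score != b.score) {
                    return Integer.compare(b.score, a.score);
                }
                return a.fileName.compareTo(b.fileName);
            }
        });
        return new ArrayList<>(hits.subList(0, Math.min(topN, hits.size())));
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "Lucene notes");
        docs.put("b.txt", "Lucene, Lucene and more Lucene");
        docs.put("c.txt", "unrelated text");
        for (Hit h : search(docs, "lucene", 10)) {
            System.out.println(h.fileName + " score=" + h.score);
        }
        // prints:
        // b.txt score=3
        // a.txt score=1
    }
}
```

The topN parameter plays the same role as the second argument of IndexSearcher.search(query, n): the caller sees only the highest-ranked results, however many documents matched.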

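Core classes

The Indexer example relies on the following key classes:

IndexWriter: creates a new index, or adds, deletes, or updates entries in an existing one
Directory: describes where the Lucene index is stored
Analyzer: the analyzer, which extracts words (tokens) from the content
Document: a document object, which is a collection of fields (Field)
Field: one field of a document, such as the title or the body; a search can target a specific field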
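The relationship between Document and Field, and the difference between a stored field and one that is only indexed, can be sketched as follows. ToyDocument and ToyField are hypothetical classes written for this note, not Lucene types. The sketch mirrors the Indexer above, where "fullpath" is added with Field.Store.YES while "contents" is fed in through a Reader and, as far as I understand the Lucene 3.0 API, is indexed but not stored.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical classes, not Lucene's Document/Field API.
class ToyDocument {
    static final class ToyField {
        final String name;
        final String value;
        final boolean stored;   // can the original value be read back later?
        final boolean analyzed; // is the value split into words before indexing?

        ToyField(String name, String value, boolean stored, boolean analyzed) {
            this.name = name;
            this.value = value;
            this.stored = stored;
            this.analyzed = analyzed;
        }
    }

    private final List<ToyField> fields = new ArrayList<>();

    void add(ToyField field) {
        fields.add(field);
    }

    // Returns the value of the first stored field with this name, or null
    // when there is no such field or it was not stored.
    String get(String name) {
        for (ToyField f : fields) {
            if (f.name.equals(name) && f.stored) {
                return f.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        doc.add(new ToyField("contents", "Lucene is a search library", false, true));
        doc.add(new ToyField("fullpath", "/data/a.txt", true, false));
        System.out.println(doc.get("fullpath")); // prints /data/a.txt
        System.out.println(doc.get("contents")); // prints null
    }
}
```

This is why Searcher prints doc.get("fullpath") for each hit: the path was stored along with the document, so it can be read back from a search result.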

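The search side relies on the following key classes:

IndexSearcher: takes a Query object, runs the search, and returns the matching results
Term: the basic unit of search; like a Field, it pairs a field name with a word
Query: the query class; TermQuery is one of its subclasses
TermQuery: the simplest query type Lucene provides
TopDocs: a simple container of pointers to the top N ranked search results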
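Because a Term pairs a field name with a word, the same word in two different fields counts as two different terms. The sketch below shows that idea with plain JDK collections. ToyTermIndex and ToyTerm are made-up names, and the lookup method is only a rough analogue of what a term query does, under the assumption that matching means an exact (field, text) match.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

// Toy illustration only: hypothetical classes, not Lucene's Term/TermQuery.
class ToyTermIndex {
    // A term is a (field, text) pair; equals/hashCode let it serve as a map key.
    static final class ToyTerm {
        final String field;
        final String text;

        ToyTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ToyTerm)) {
                return false;
            }
            ToyTerm other = (ToyTerm) o;
            return field.equals(other.field) && text.equals(other.text);
        }

        @Override
        public int hashCode() {
            return Objects.hash(field, text);
        }
    }

    // term -> sorted set of document ids containing exactly that term
    private final Map<ToyTerm, TreeSet<String>> postings = new HashMap<>();

    void add(String field, String text, String docId) {
        postings.computeIfAbsent(new ToyTerm(field, text), k -> new TreeSet<>()).add(docId);
    }

    // Exact-match lookup: both the field and the text must be equal.
    TreeSet<String> lookup(ToyTerm term) {
        TreeSet<String> docs = postings.get(term);
        return docs == null ? new TreeSet<String>() : new TreeSet<>(docs);
    }

    public static void main(String[] args) {
        ToyTermIndex index = new ToyTermIndex();
        index.add("contents", "lucene", "a.txt");
        index.add("contents", "lucene", "b.txt");
        index.add("filename", "lucene.txt", "b.txt");
        System.out.println(index.lookup(new ToyTerm("contents", "lucene"))); // prints [a.txt, b.txt]
        System.out.println(index.lookup(new ToyTerm("filename", "lucene"))); // prints []
    }
}
```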
