Exploring Lucene’s Indexing Code: Part 1

Next: Exploring Lucene’s Indexing Code: Part 2

While I have mucked around quite a bit in the search side code of Lucene, I am much less familiar with the hardcore indexing side (I’m talking the hardcore code, casual users need not apply – unless your interested). I’d like to learn more about Lucene’s indexing code, but its not so easy to wrap my mind around on first glance. I’m not looking to be an expert right away, but to have an overview understanding of the lower level details involved in constructing a Lucene index. In instances like this, I find its best to start from a high level and work my way in, hopefully understanding the overall process, and then each of the pieces.

From our use of Lucene, we know that the indexing code must center around the IndexWriter class. To help me get a handle on what IndexWriter does, I am going to trace a few key methods from a very simple Lucene test application that simply adds one small document to an index with an IndexWriter and then closes that IndexWriter. I’m going to limit my trace to IndexWriter methods, just so the info is digestible, and only key methods to start with. We don’t want to get too bogged down in the details – methods will change over time anyway. The idea is to get somewhat of an overview of the underlying process.

Also, we should remember our user knowledge of a Lucene index. The index is made up of 1-n segments. Each segment contains a number of Documents. A Lucene Document contains a number of Fields, which is just a field name, value, and attributes. New segments are written as we add Documents to the index, and segments are merged over time based on certain criteria. The fewer segments you have to search over, the better the performance. Searches on the whole index roll over each segment and an optimize will merge all segments down to one.

The Test Code

Directory directory = new RAMDirectory();
Analyzer analyzer = new SimpleAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
 
String doc = "a b c d e";
Document d = new Document();
d.add(new Field("contents", doc, Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(d);
writer.close();

And Now The Trace:

After static variable initialization, the IndexWriter instance is initialized:

 *Enter:void IndexWriter.init(Directory, Analyzer, boolean, boolean, IndexDeletionPolicy, boolean, int, DocumentsWriter.IndexingChain, IndexCommit) :1
 *Exit:void IndexWriter.init(Directory, Analyzer, boolean, boolean, IndexDeletionPolicy, boolean, int, DocumentsWriter.IndexingChain, IndexCommit) :1

Then we add the document and close the IndexWriter.

 *Enter:void IndexWriter.addDocument(Document) :1
 *Exit:void IndexWriter.addDocument(Document) :1

We certainly want to dig deeper into IndexWriter.addDocument, because we know a lot of the interesting stuff happens there. But before that, we can get a nice idea of the close process, which begins on line 5.

 *Enter:void IndexWriter.close() :1
  *Enter:void IndexWriter.close(boolean) :2
   // another thread may be closing
   *Enter:boolean IndexWriter.shouldClose() :3
   *Exit:boolean IndexWriter.shouldClose() :3
   // close doc writer, flush / maybe merge / commit / close everything
   *Enter:void IndexWriter.closeInternal(boolean) :3
    *Enter:boolean IndexWriter.doFlush(boolean, boolean) :4
    *Exit:boolean IndexWriter.doFlush(boolean, boolean) :4
    *Enter:void IndexWriter.maybeMerge() :4
    *Exit:void IndexWriter.maybeMerge() :4
    // either abort merges or wait for merges
    *Enter:void IndexWriter.finishMerges(boolean) :4
    *Exit:void IndexWriter.finishMerges(boolean) :4
    // commit all pending adds and deletes, sync index files
    *Enter:void IndexWriter.commit(long) :4
     *Enter:void IndexWriter.startCommit(long, String) :5
     *Exit:void IndexWriter.startCommit(long, String) :5
     *Enter:void IndexWriter.finishCommit() :5
      *Enter:void IndexWriter.setRollbackSegmentInfos(SegmentInfos) :6
      *Exit:void IndexWriter.setRollbackSegmentInfos(SegmentInfos) :6
     *Exit:void IndexWriter.finishCommit() :5
    *Exit:void IndexWriter.commit(long) :4
   *Exit:void IndexWriter.closeInternal(boolean) :3
  *Exit:void IndexWriter.close(boolean) :2
 *Exit:void IndexWriter.close() :1

So we haven’t gotten far, but at the same time we are at the end of our test code. We see that IndexWriter needs a bit of init, and will flush, maybe merge, and then commit upon closing if we add some documents. We have a sense for the overall process that we are inspecting, and in Part 2, we can start to dig into what happens in the IndexWriter.addDocument call.

Next: Exploring Lucene’s Indexing Code: Part 2

Update:

Got a request for the AspectJ code that was used to make the traces, so here is a sample aspect. Keep in mind that this particular sample is not thread safe (shared StringBuilder), so if you are tracing multiple threads, it needs a bit of work. This will get you started though. If you install the Eclipse AspectJ module, add the AspectJ nature to the Lucene project in eclipse (right click on the project and look at the menu for it), put the aspect somewhere with the Lucene source files (call it Trace.aj – you can also put the aspect directly in your Java file), and then build and run, you should be all set to play around. There are also command line programs and Ant tasks if you prefer to go that route – its just a bit more setup.

// the following aspect will print entry/exit stamps for all public methods
// that begin with org.apache.lucene and all private methods of IndexWriter (just to give an example)
public aspect Trace {
 
  private StringBuilder sb = new StringBuilder();
 
  pointcut traceableCalls() : traceMethods();
 
  pointcut traceMethods() :
    execution(public * org.apache.lucene..*(..) ) || execution(private * org.apache.lucene.IndexWriter..*(..) );
 
  /**
   * log method entry.
   */
  before() : traceableCalls() {
    sb.append(" ");
    int indent = sb.length();
    String enterLine = sb.toString() + "*Enter:"
        + thisJoinPoint.getSignature().toString() + " indent:" + indent;
    System.out.println(enterLine + "n");
  }
 
  /**
   * log method exit.
   */
  after() : traceableCalls(){
    int indent = sb.length();
    String exitLine = sb.toString() + "*Exit:"
        + thisJoinPoint.getSignature().toString() + " indent:" + indent
        + " thread:" + Thread.currentThread().getId();
    sb.setLength(sb.length() - 1);
    System.out.println(exitLine + "n");
  }
}

7 Responses to “Exploring Lucene’s Indexing Code: Part 1”

  1. Interesting stuff, Mark.

    How did you generate those traces? That would be a very handy tool to have around…

    March 5, 2009 07:04Stephen Green

  2. Hey Stephen,

    I use AspectJ to get the traces. That allows me to very simply weave code into the Lucene code base. Its a very simple aspect that simply injects a trace before and after selected method calls. Every ‘before’ I increase the indent and every ‘after’ I decrease it. Fairly simple, but pretty effective. AspectJ has some command line tools, but I just used the Eclipse integration – only took about 5 minutes to setup – you simply drop the aspect file into the Lucene project and run it.

    March 5, 2009 08:57Mark Miller

  3. Can you can export that from Eclipse and share?

    March 5, 2009 14:33Mike Smith

  4. No problem Mike – I’ve added a sample aspect to the post.

    March 5, 2009 15:18Mark Miller

  5. [...] Previous: Exploring Lucenes Indexing Code: Part 1 [...]

    March 18, 2009 13:37Lucid Imagination » Exploring Lucenes Indexing Code: Part 2

  6. StringBuilder (as of Java 1.5) has replaced StringBuffer as the class used by the compiler to implement “string + string + string”. StringBuffer is thread-safe (and thus much much slower) and can be used in this example.

    (StringBuffer is like Hashtable and Vector: thread-safe doodads from original Java because of course everyone will write multi-threaded code with this great new language that makes threaded code easy and fun.)

    March 22, 2009 17:51Lance Norskog

  7. The issue with the shared StringBuilder is that the indenting (which the StringBuilder is used for), won’t work correctly if the builder/buffer is shared across threads. That is why I didn’t just use a StringBuffer – I didn’t want to lull anyone into thinking the code was written to work with multiple threads – the indenting would have to be tracked per thread.

    March 23, 2009 02:24Mark Miller

Leave a Reply