
March 4, 2009

Exploring Lucene’s Indexing Code: Part 1

Posted by Mark Miller

Next: Exploring Lucene’s Indexing Code: Part 2

While I have mucked around quite a bit in the search side of Lucene’s code, I am much less familiar with the hardcore indexing side (I’m talking about the hardcore code; casual users need not apply, unless you’re interested). I’d like to learn more about Lucene’s indexing code, but it’s not so easy to wrap my mind around at first glance. I’m not looking to become an expert right away, just to gain an overview of the lower-level details involved in constructing a Lucene index. In instances like this, I find it’s best to start from a high level and work my way in, hopefully understanding the overall process first, and then each of the pieces.

From our use of Lucene, we know that the indexing code must center around the IndexWriter class. To help me get a handle on what IndexWriter does, I am going to trace a few key methods from a very simple Lucene test application that simply adds one small document to an index with an IndexWriter and then closes that IndexWriter. I’m going to limit my trace to IndexWriter methods, just so the info is digestible, and only key methods to start with. We don’t want to get too bogged down in the details – methods will change over time anyway. The idea is to get somewhat of an overview of the underlying process.

Also, we should remember what we already know about a Lucene index as users. The index is made up of one or more segments. Each segment contains a number of Documents. A Lucene Document contains a number of Fields, each of which is just a field name, a value, and some attributes. New segments are written as we add Documents to the index, and segments are merged over time based on certain criteria. The fewer segments you have to search over, the better the performance. A search over the whole index visits each segment in turn, and an optimize merges all segments down to one.
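
To make that picture concrete, here is a toy sketch of the segment model in plain Java. It is only an illustration under heavy simplifying assumptions: ToyIndex and all of its methods are made-up names, a "document" is just a string, and none of this is how Lucene actually stores or merges segments.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical class, NOT Lucene's implementation.
final class ToyIndex {
  // Each segment is a batch of documents; a document is just its text here.
  private final List<List<String>> segments = new ArrayList<List<String>>();
  private List<String> buffer = new ArrayList<String>();

  // Added documents are buffered until the next flush.
  void addDocument(String doc) {
    buffer.add(doc);
  }

  // Flushing turns the buffered documents into a new segment.
  void flush() {
    if (!buffer.isEmpty()) {
      segments.add(buffer);
      buffer = new ArrayList<String>();
    }
  }

  // A search visits every segment in turn, so more segments means more work.
  List<String> search(String term) {
    List<String> hits = new ArrayList<String>();
    for (List<String> segment : segments) {
      for (String doc : segment) {
        if (doc.contains(term)) {
          hits.add(doc);
        }
      }
    }
    return hits;
  }

  // "Optimize" merges all segments down to one.
  void optimize() {
    List<String> merged = new ArrayList<String>();
    for (List<String> segment : segments) {
      merged.addAll(segment);
    }
    segments.clear();
    if (!merged.isEmpty()) {
      segments.add(merged);
    }
  }

  int segmentCount() {
    return segments.size();
  }
}
```

Adding two documents with a flush after each leaves two segments; calling optimize() brings that down to one without changing what search() returns.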

The Test Code

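import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

Directory directory = new RAMDirectory();
Analyzer analyzer = new SimpleAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
 
// add one small document, then close the writer
String doc = "a b c d e";
Document d = new Document();
d.add(new Field("contents", doc, Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(d);
writer.close();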

And Now The Trace:

After static variable initialization, the IndexWriter instance is initialized:

 *Enter:void IndexWriter.init(Directory, Analyzer, boolean, boolean, IndexDeletionPolicy, boolean, int, DocumentsWriter.IndexingChain, IndexCommit) :1
 *Exit:void IndexWriter.init(Directory, Analyzer, boolean, boolean, IndexDeletionPolicy, boolean, int, DocumentsWriter.IndexingChain, IndexCommit) :1

Then we add the document and close the IndexWriter.

 *Enter:void IndexWriter.addDocument(Document) :1
 *Exit:void IndexWriter.addDocument(Document) :1

We certainly want to dig deeper into IndexWriter.addDocument, because we know a lot of the interesting stuff happens there. But before that, we can get a nice idea of the close process, which begins on line 5 of the trace (after the two init lines and the two addDocument lines above).

 *Enter:void IndexWriter.close() :1
  *Enter:void IndexWriter.close(boolean) :2
   // another thread may be closing
   *Enter:boolean IndexWriter.shouldClose() :3
   *Exit:boolean IndexWriter.shouldClose() :3
   // close doc writer, flush / maybe merge / commit / close everything
   *Enter:void IndexWriter.closeInternal(boolean) :3
    *Enter:boolean IndexWriter.doFlush(boolean, boolean) :4
    *Exit:boolean IndexWriter.doFlush(boolean, boolean) :4
    *Enter:void IndexWriter.maybeMerge() :4
    *Exit:void IndexWriter.maybeMerge() :4
    // either abort merges or wait for merges
    *Enter:void IndexWriter.finishMerges(boolean) :4
    *Exit:void IndexWriter.finishMerges(boolean) :4
    // commit all pending adds and deletes, sync index files
    *Enter:void IndexWriter.commit(long) :4
     *Enter:void IndexWriter.startCommit(long, String) :5
     *Exit:void IndexWriter.startCommit(long, String) :5
     *Enter:void IndexWriter.finishCommit() :5
      *Enter:void IndexWriter.setRollbackSegmentInfos(SegmentInfos) :6
      *Exit:void IndexWriter.setRollbackSegmentInfos(SegmentInfos) :6
     *Exit:void IndexWriter.finishCommit() :5
    *Exit:void IndexWriter.commit(long) :4
   *Exit:void IndexWriter.closeInternal(boolean) :3
  *Exit:void IndexWriter.close(boolean) :2
 *Exit:void IndexWriter.close() :1

So we haven’t gotten far, but at the same time we are at the end of our test code. We see that IndexWriter needs a bit of init, and will flush, maybe merge, and then commit upon closing if we add some documents. We have a sense for the overall process that we are inspecting, and in Part 2, we can start to dig into what happens in the IndexWriter.addDocument call.
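
As a rough summary of the close trace, the ordering can be written down as a skeleton in plain Java. This is only a sketch of the call order seen above: ToyCloseSequence is a made-up class, the method names are borrowed from the trace without their real parameters, and the bodies are placeholders that record the step rather than anything Lucene actually does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical skeleton: mirrors only the ORDER of calls in the trace.
// The bodies are placeholders, not Lucene's logic.
final class ToyCloseSequence {
  final List<String> steps = new ArrayList<String>();
  private boolean closed = false;

  void close() {
    if (shouldClose()) {
      closeInternal();
    }
  }

  // Guard so that only the first caller performs the close.
  private boolean shouldClose() {
    steps.add("shouldClose");
    if (closed) {
      return false;
    }
    closed = true;
    return true;
  }

  private void closeInternal() {
    doFlush();       // write out buffered documents
    maybeMerge();    // check whether any segments should be merged
    finishMerges();  // either abort merges or wait for them
    commit();        // commit pending adds and deletes
  }

  private void doFlush() { steps.add("doFlush"); }

  private void maybeMerge() { steps.add("maybeMerge"); }

  private void finishMerges() { steps.add("finishMerges"); }

  private void commit() {
    steps.add("startCommit");
    steps.add("finishCommit");
  }
}
```

Calling close() twice on the same instance records the full sequence once; the second call stops at shouldClose.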


Update:

Got a request for the AspectJ code that was used to make the traces, so here is a sample aspect. Keep in mind that this particular sample is not thread safe (it shares one StringBuilder), so if you are tracing multiple threads, it needs a bit of work. This will get you started though. To play around with it: install the Eclipse AspectJ module, add the AspectJ nature to the Lucene project in Eclipse (right click on the project and look for it in the menu), put the aspect somewhere with the Lucene source files (call it Trace.aj; you can also put the aspect directly in your Java file), and then build and run. There are also command line programs and Ant tasks if you prefer to go that route; it’s just a bit more setup.

// the following aspect will print entry/exit stamps for all public methods
// that begin with org.apache.lucene and all private methods of IndexWriter (just to give an example)
public aspect Trace {
 
  private StringBuilder sb = new StringBuilder();
 
  pointcut traceableCalls() : traceMethods();
 
  pointcut traceMethods() :
    execution(public * org.apache.lucene..*(..) )
        || execution(private * org.apache.lucene.index.IndexWriter.*(..) );
 
  /**
   * log method entry.
   */
  before() : traceableCalls() {
    sb.append(" ");
    int indent = sb.length();
    String enterLine = sb.toString() + "*Enter:"
        + thisJoinPoint.getSignature().toString() + " indent:" + indent;
    System.out.println(enterLine + "\n");
  }
 
  /**
   * log method exit.
   */
  after() : traceableCalls(){
    int indent = sb.length();
    String exitLine = sb.toString() + "*Exit:"
        + thisJoinPoint.getSignature().toString() + " indent:" + indent
        + " thread:" + Thread.currentThread().getId();
    sb.setLength(sb.length() - 1);
    System.out.println(exitLine + "\n");
  }
}
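
If you do need to trace multiple threads, one possible approach is to keep the indent depth per thread instead of in a shared StringBuilder. The following is a minimal sketch in plain Java, not part of the aspect above: ThreadIndent and its enter/exit methods are hypothetical names, and the output format simply imitates the traces in this post.

```java
// Hypothetical helper, not part of AspectJ or Lucene:
// keeps one indent depth per thread using a ThreadLocal.
final class ThreadIndent {
  // Each thread gets its own one-element counter, starting at 0.
  private static final ThreadLocal<int[]> DEPTH = new ThreadLocal<int[]>() {
    @Override
    protected int[] initialValue() {
      return new int[1];
    }
  };

  private ThreadIndent() {}

  // Call on method entry: increases this thread's depth, then builds the line.
  static String enter(String signature) {
    int depth = ++DEPTH.get()[0];
    return pad(depth) + "*Enter:" + signature + " :" + depth;
  }

  // Call on method exit: builds the line at the current depth, then decreases it.
  static String exit(String signature) {
    int depth = DEPTH.get()[0]--;
    return pad(depth) + "*Exit:" + signature + " :" + depth;
  }

  // One space of indentation per nesting level.
  private static String pad(int depth) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < depth; i++) {
      sb.append(' ');
    }
    return sb.toString();
  }
}
```

The before() and after() advice could then print ThreadIndent.enter(thisJoinPoint.getSignature().toString()) and ThreadIndent.exit(thisJoinPoint.getSignature().toString()) respectively. Lines from different threads would still interleave on System.out, so adding the thread id to each line (as the exit advice above already does) remains useful.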

Category: Events, Uncategorized

7 Responses to “Exploring Lucene’s Indexing Code: Part 1”

  1. Interesting stuff, Mark.

    How did you generate those traces? That would be a very handy tool to have around…

    March 5, 2009 07:04 — Stephen Green

  2. Hey Stephen,

    I use AspectJ to get the traces. That allows me to very simply weave code into the Lucene code base. It’s a very simple aspect that injects a trace before and after selected method calls. On every ‘before’ I increase the indent and on every ‘after’ I decrease it. Fairly simple, but pretty effective. AspectJ has some command line tools, but I just used the Eclipse integration. It only took about 5 minutes to set up: you simply drop the aspect file into the Lucene project and run it.

    March 5, 2009 08:57 — Mark Miller

  3. Can you export that from Eclipse and share?

    March 5, 2009 14:33 — Mike Smith

  4. No problem Mike – I’ve added a sample aspect to the post.

    March 5, 2009 15:18 — Mark Miller

  5. [...] Previous: Exploring Lucenes Indexing Code: Part 1 [...]

    March 18, 2009 13:37 — Lucid Imagination » Exploring Lucenes Indexing Code: Part 2

  6. StringBuilder (as of Java 1.5) has replaced StringBuffer as the class used by the compiler to implement “string + string + string”. StringBuffer is thread-safe (and thus much much slower) and can be used in this example.

    (StringBuffer is like Hashtable and Vector: thread-safe doodads from original Java because of course everyone will write multi-threaded code with this great new language that makes threaded code easy and fun.)

    March 22, 2009 17:51 — Lance Norskog

  7. The issue with the shared StringBuilder is that the indenting (which the StringBuilder is used for), won’t work correctly if the builder/buffer is shared across threads. That is why I didn’t just use a StringBuffer – I didn’t want to lull anyone into thinking the code was written to work with multiple threads – the indenting would have to be tracked per thread.

    March 23, 2009 02:24 — Mark Miller
