*edit* Sorry, I jumped the gun with my original test code here: the IndexWriter needs to be closed after the optimize! The gains only show up with multi-segment indexes. Corrected entry follows:
Let's do a little test. We will load up a FieldCache with 5,000,000 unique strings and see how long it takes in Lucene 2.4 compared with Lucene 2.9.
Let's use my quad-core laptop and the following test code:
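import java.io.File;

import junit.framework.TestCase;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ContrivedFCTest extends TestCase {

  public void testLoadTime() throws Exception {
    Directory dir = FSDirectory.getDirectory(
        System.getProperty("java.io.tmpdir") + File.separator + "test");
    IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true,
        IndexWriter.MaxFieldLength.LIMITED);
    // Make sure we end up with a bunch of segments.
    writer.setMergeFactor(37);
    writer.setUseCompoundFile(false);

    // Index 5,000,000 docs, each with a unique un-analyzed string.
    for (int i = 0; i < 5000000; i++) {
      Document doc = new Document();
      doc.add(new Field("field", "String" + i, Field.Store.NO,
          Field.Index.NOT_ANALYZED));
      writer.addDocument(doc);
    }
    writer.close();

    // Time only the FieldCache load over the top-level reader.
    IndexReader reader = IndexReader.open(dir);
    long start = System.currentTimeMillis();
    FieldCache.DEFAULT.getStrings(reader, "field");
    long end = System.currentTimeMillis();
    System.out.println("load time:" + (end - start) / 1000.0f + "s");
  }
}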
The results?
Lucene 2.4: 150.726s
Lucene 2.9: 9.695s
We discovered early this year that Lucene has historically been terribly inefficient at loading FieldCaches over multiple segments. Lucene 2.9 addresses this at the MultiReader level (thank you Yonik!). Also, internal FieldCache usage is now per segment, which sidesteps loading FieldCaches over multiple segments altogether: each segment has its own FieldCache.
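To make "each segment has its own FieldCache" a bit more concrete, here is a toy model in plain Java. It is not Lucene code and does not reflect Lucene's actual implementation: `Segment`, `PerSegmentCacheSketch`, `getStrings`, `lookup`, and the `loads` counter are all made-up names for illustration. The idea it sketches is simply that values are cached per segment and a top-level doc id is resolved by subtracting the doc base of the segment it falls in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of per-segment caching. None of these names are Lucene APIs. */
class PerSegmentCacheSketch {

    /** Hypothetical stand-in for one index segment: one string value per local doc. */
    static final class Segment {
        final String name;
        final String[] values;

        Segment(String name, String[] values) {
            this.name = name;
            this.values = values;
        }
    }

    // One cache entry per segment, keyed by segment identity.
    private final Map<Segment, String[]> cache = new HashMap<>();

    // Counts how many segments were actually loaded into the cache.
    int loads = 0;

    /** Returns the cached values for a segment, loading them on first use only. */
    String[] getStrings(Segment seg) {
        String[] cached = cache.get(seg);
        if (cached == null) {
            cached = seg.values.clone(); // stands in for the expensive load
            cache.put(seg, cached);
            loads++;
        }
        return cached;
    }

    /** Resolves a top-level doc id by walking segments and subtracting each doc base. */
    String lookup(List<Segment> segments, int globalDoc) {
        int docBase = 0;
        for (Segment seg : segments) {
            if (globalDoc >= docBase && globalDoc < docBase + seg.values.length) {
                return getStrings(seg)[globalDoc - docBase];
            }
            docBase += seg.values.length;
        }
        throw new IndexOutOfBoundsException("no such doc: " + globalDoc);
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment("a", new String[] {"String0", "String1"}));
        segments.add(new Segment("b", new String[] {"String2", "String3"}));

        PerSegmentCacheSketch fc = new PerSegmentCacheSketch();
        System.out.println(fc.lookup(segments, 3) + " after " + fc.loads + " load(s)");

        // A new segment shows up; the segments already cached are not reloaded.
        segments.add(new Segment("c", new String[] {"String4"}));
        System.out.println(fc.lookup(segments, 4) + " after " + fc.loads + " load(s)");
    }
}
```

In this sketch, a segment's values are loaded once and reused for as long as that segment is part of the index, so nothing ever has to be loaded across segment boundaries.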


This was our biggest issue by far. It's taken significant load off our servers when installing a new snapshot.
Thanks indeed Yonik.
September 22, 2009 16:16 — Jim Murphy - PostRank
Why did you use merge factor of 37?
September 23, 2009 19:46 — Anonymous
mergeFactor=37, presumably to keep segments from being merged during indexing, which makes it possible to show off the new and faster segment reloading.
September 24, 2009 10:15 — Otis Gospodnetic
Partially, since I am just timing the loading of the FieldCache (and not doing it per segment). It's just to make sure I have a bunch of segments. It's only faster over multiple segments; it's the same speed on an optimized index. The more segments, the bigger the gain.
The reason that it's also faster when you do it per segment (how Lucene works internally now) is that it avoids the speed trap that was in MultiTermEnum and uses SegmentTermEnum. Yonik fixed that as well though, and this test shows the fruits of that. So essentially, it was both fixed and sidestepped at the same time.
September 24, 2009 11:10 — Mark Miller