
March 8, 2010

A billion here, a billion there

Posted by David M. Fishman

“…pretty soon you’re talking about real money”, goes the famous phrase attributed to the late Senator Everett Dirksen (apocryphally, it turns out).

The phrase came to mind in reading a recent post from Tom Burton-West of the Hathi Trust, on running into an index-size limitation of 2.47 billion words on a base of 555,000 documents.

When we read that the Lucene index format used by Solr has a limit of 2.1 billion unique words per index segment, we didn’t think we had to worry. However, a couple of weeks ago, after we optimized our indexes on each shard to one segment, we started seeing Java “ArrayIndexOutOfBounds” exceptions in our logs. After a bit of investigation we determined that indeed, most of our index shards contained over 2.1 billion unique words and some queries were triggering these exceptions. Currently each shard indexes a little over half a million documents.

Culprits? Dirty OCR, CommonGrams, and 200 languages:

After a bit of digging in the log files, we found a query containing a Korean term which consistently triggered the exception and used that for testing. We re-read the index documentation and realized that the index entries are sorted lexicographically by UTF-16 character code. (Korean uses Hangul, which is near the end of the Unicode BMP and therefore in a very high UTF-16 character code range.)
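To make the ordering point concrete, here is a minimal Java sketch. It is not Lucene's code: the class and method names are hypothetical, and the int cast is only an assumed illustration of how a term position past 2^31 - 1 could surface as an array-bounds error. It relies on the fact that Java's natural `String` order compares UTF-16 code units, so a Hangul term sorts after Latin, Cyrillic, and Chinese terms and lands at the far end of the term list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TermOrderSketch {

    // Sort terms by UTF-16 code unit, which is what String.compareTo does.
    static List<String> sortByUtf16(List<String> terms) {
        List<String> sorted = new ArrayList<>(terms);
        Collections.sort(sorted);
        return sorted;
    }

    // Hypothetical illustration: squeeze a term's ordinal position into an int.
    // Positions above Integer.MAX_VALUE wrap around to a negative number,
    // which is not a valid array index.
    static int toIntPosition(long termOrdinal) {
        return (int) termOrdinal;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
            "\uD55C\uAD6D",                   // Korean (Hangul), U+D55C U+AD6D
            "zebra",
            "\u043A\u043D\u0438\u0433\u0430", // Russian (Cyrillic)
            "\u4E2D\u6587",                   // Chinese (CJK ideographs)
            "apple");

        // The Hangul term comes last: its code units are the highest here.
        System.out.println(sortByUtf16(terms));

        // A position beyond 2^31 - 1 becomes negative when cast to int.
        System.out.println(toIntPosition(2_200_000_000L)); // prints -2094967296
    }
}
```

In a single segment with more than 2.1 billion unique terms, only lookups that reach the tail of the sorted term list would cross the boundary, which fits the observation that a Korean query triggered the exception reliably while most other queries did not.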

The fix, recently committed to Lucene 2.9.2, 3.0.1, and 3.1 by Mike McCandless, raises the limit per index segment to about 274 billion unique terms. Now, you may not have 274 billion unique terms in your collection, or not yet, but that is some pretty substantial headroom for lexical and linguistic search trickery.
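For a sense of where the two figures come from, here is a back-of-the-envelope sketch in Java. It rests on an assumption the post does not spell out: that the old ceiling is the largest signed 32-bit integer, and that the new one applies the same ceiling to the terms index, which by default holds one entry per 128 terms. The class name and the constant are hypothetical.

```java
public class TermLimitSketch {

    // Assumption: Lucene's default term index interval (one indexed term per 128).
    static final int ASSUMED_TERM_INDEX_INTERVAL = 128;

    // Old ceiling: unique terms per segment counted with a signed 32-bit int.
    static long oldLimit() {
        return Integer.MAX_VALUE;
    }

    // New ceiling under the assumption above: the int limit now bounds the
    // number of terms index entries, so the term count scales by the interval.
    static long newLimit(int termIndexInterval) {
        return (long) Integer.MAX_VALUE * termIndexInterval;
    }

    public static void main(String[] args) {
        System.out.println(oldLimit());                            // prints 2147483647
        System.out.println(newLimit(ASSUMED_TERM_INDEX_INTERVAL)); // prints 274877906816
    }
}
```

The first number is the "2.1 billion" in Tom's post; the second is roughly 274.9 billion, in line with the "about 274 billion" quoted for the fix.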

If you haven’t put the Hathi Project’s blog on your list, it’s a must-read: see also Tom’s excellent posts on distributed Solr search and how common words and phrases can affect Lucene/Solr performance. The library space is at the forefront of pushing the limits of open source: monstrous indexes, tough metadata to whip into shape, handling OCR, deep field faceting, database integration, data types ranging a mile wide, phrase queries that will make your head spin, and unforgiving relevancy for the masses (our own Erik Hatcher hails from that space and is still quite active in it). Chances are, if you’re running into a tough search problem, the folks in the library space are already working through the thick of it.




Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2011 Lucid Imagination. All rights reserved.
