• Products
    • Overview
    • LucidWorks Search Platform
      • Features and Benefits
      • Technical Overview
      • Only with LucidWorks
      • LucidWorks and Solr
      • White Papers
      • LucidWorks Enterprise
      • LucidWorks Cloud
    • Certified Distributions
      • Certified Solr
      • Certified Lucene
    • Apache Releases
      • Apache Solr
      • Apache Lucene
  • Support & Services
    • Overview
    • Support
    • Training
    • Solr/Lucene Certification
    • ExpertLink Advisory
    • Consulting
    • Partners
    • Subscriptions
  • Why Lucid?
    • Why Lucid?
    • Technology
    • Who uses Lucene/Solr?
      • What customers are saying
    • Case Studies
    • Whitepapers
    • Demos
    • Webinars
  • Blog
  • DevZone
    • DevZone Overview
    • Forums (LWE)
    • Videos & Podcasts
      • How To's
      • Screencasts
      • Podcasts
      • Conference Videos
    • Technical Articles
      • Whitepapers
    • Reference Materials
      • Documentation
      • Solr Reference Guide
      • Solr & LucidWorks Matrix
      • Tutorials
    • Events
      • Conferences
      • Meet Ups
    • Code & Test
  • Downloads
  • About Us
    • Management
    • Board of Directors
    • Apache Lucene/Solr Committers
    • Careers
    • News
      • Media Coverage
      • Press Releases
    • Contact Us
Sign Up or Log In
Home . Blog

May 13, 2009

Exploring Lucene and Solr’s TrieRange Capabilities

Posted by Grant Ingersoll

Recently, Uwe Schindler and others have added a new capability to Lucene and Solr to make working with numeric ranges a lot faster.  I haven’t tried out this new functionality yet, so I thought I would walk through it here and explore it’s capabilities.

Since Lucene treats most everything as Strings, encoding  numbers and dates and then utilizing them in ranges has always required a little extra work to make it perform well.  Previously, one would have to have either use less precision or slower running queries in order to work with ranges that had a lot of distinct values.  This is due to the need for Lucene to enumerate through a large number of terms.

Now, however, thanks to the new org.apache.lucene.search.trie package (currently located in the contrib/queries area of Lucene, but it may move to the Lucene core, see [1] below) and it’s addition to Solr via a FieldType, Lucene and Solr users can take advantage of much faster range searches.

To try it out, I gave it a spin Solr by doing the following:

First, I grabbed the FieldType definitions from the Solr example schema:

    <fieldType name="tint" class="solr.TrieField" type="integer" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />
    <fieldType name="tint4" class="solr.TrieField" precisionStep="4" type="integer" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />
    <fieldType name="tfloat" class="solr.TrieField" type="float" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />
    <fieldType name="tlong" class="solr.TrieField" type="long" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />
    <fieldType name="tdouble" class="solr.TrieField" type="double" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />
    <fieldType name="tdouble4" class="solr.TrieField" type="double" precisionStep="4" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />

Next, I declared a few Fields.  In my case, I just used a bunch of dynamic fields, because I’m not too worried about the actual field names:

   <field name="id"          type="string"    indexed="true" stored="true" />
   <dynamicField name="*_i"  type="integer"    indexed="true"  stored="true"/>
   <dynamicField name="*_f"  type="float"    indexed="true"  stored="true"/>
   <dynamicField name="*_d"  type="double"    indexed="true"  stored="true"/>
   <dynamicField name="*_l"  type="long"    indexed="true"  stored="true"/>
   <dynamicField name="*_si"  type="sint"    indexed="true"  stored="true"/>
   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
   <dynamicField name="*_sl"  type="slong"   indexed="true"  stored="true"/>
   <dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
   <dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
   <dynamicField name="*_sf"  type="sfloat"  indexed="true"  stored="true"/>
   <dynamicField name="*_sd"  type="sdouble" indexed="true"  stored="true"/>
   <dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>
   <!-- Trie Range fields -->
   <dynamicField name="*_tri" type="tint"    indexed="true"  stored="true"/>
   <dynamicField name="*_tri4" type="tint4"    indexed="true"  stored="true"/>
   <dynamicField name="*_trf" type="tfloat"    indexed="true"  stored="true"/>
   <dynamicField name="*_trl" type="tlong"    indexed="true"  stored="true"/>
   <dynamicField name="*_trd" type="tdouble"    indexed="true"  stored="true"/>
   <dynamicField name="*_trd4" type="tdouble4"    indexed="true"  stored="true"/>

Now it’s time for the rubber to meet the road, as they say.  In other words, we need to put in some documents and search.  In my case, I’m going to create  three fields for every numeric value, one as a “regular” primitive, one as a “sortable” primitive, and one as a Trie* field, just to underscore some of the differences.  My SolrInputDocument looks like:

SolrInputDocument result = new SolrInputDocument();
 
result.addField(idField, id);
 
addField(result, "integer_i", "integer_si", "integer_tri", random.nextInt(10000));//get only positive values
addField(result, "float_f", "float_sf", "float_trf", random.nextFloat()*100);
addField(result, "long_l", "long_sl", "long_trl", Math.abs(random.nextLong()));
double dbl = random.nextDouble() * 100;
addField(result, "double_d", "double_sd", "double_trd", dbl);
result.addField("double_trd4", dbl);

(all the addField does is add the value to the document three times, one for each field name.  I could have used copyField in the schema.xml instead, but I wanted to explicitly declare them during document construction as opposed to editing the schema later when I change my mind).  I then created an index with 10K documents.

Let’s start off with some basics.  First up, I want to find all docs that have an integer value following between 0 and 5,000.

Naturally, the wrong way to do this is:

http://localhost:8983/solr/select/?&version=2.2&start=0&rows=10&indent=on&fl=id,integer_i&q=integer_i:[0%20TO%205000]&debugQuery=true

In this case, you get the lexicographic sorting and not the numeric sorting, which yields results, albeit bad ones.  Bad Grant.

So, then I move over to the sortable version, using the same query, but changing the field to be integer_si.  This now produces correct results.  Not much to say here other than it works.

Finally, I tried  out the Trie stuff by changing the query to use integer_tri.  The results are the same as the sortable stuff, but, in my totally informal testing, the Trie stuff is a lot faster (others have done more formal testing, so I feel comfortable with my results.)  Good news!

Next, I tried it with a lot more documents (1M) and saw pretty much the same speedups.  In this case, I actually tried getting a much bigger range (50K), which is slower than the smaller range, but still faster than the sortable approach.

Of course, this is only scratching the surface.  The take away, though, is the new Trie stuff in L/S holds a lot of promise for speeding up range based numeric queries and further blurs the line between search engines and databases (I’d argue it makes search all that more compelling, but…)  More importantly, it is not dependent on the index size, but instead the precision chosen.  Essentially, it formalizes what many people have done in practice through the years with various field values.  See [2] for more details.

Look for the Trie Range stuff (possibly renamed) to be officially released in Lucene 2.9 and Solr 1.4

References:

  1. http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//contrib-queries/org/apache/lucene/search/trie/package-summary.html
  2. http://www.thetaphi.de/share/Schindler-TrieRange.ppt
  • Share this:
  • Email
  • Facebook
  • Digg
  • Share
  • Print
  • Reddit
  • StumbleUpon

Category: Enterprise Search, Events, Lucene, Solr

Tags: Lucene, numeric data, range queries, Solr

11 Responses to “Exploring Lucene and Solr’s TrieRange Capabilities”

  1. [...] Lucid Imagination » Exploring Lucene and Solr’s TrieRange Capabilities. [...]

    May 13, 2009 04:47 — Lucid Imagination » Exploring Lucene and Solr’s TrieRange Capabilities

  2. Great overview, Grant! What would you say are the drawbacks of using the trie range field type for foreign key like fields that you sometimes want to sort on or put range limitations on? And do you have any thoughts on efficiency of representing dates as primitives instead of the Date field, if you use date as a sort field very often?
    - Aleksander

    May 15, 2009 01:23 — Aleksander Stensby

  3. [...] Lucid Imagination » Exploring Lucene and Solr’s TrieRange Capabilities Recently, Uwe Schindler and others have added a new capability to Lucene and Solr to make working with numeric ranges a lot faster. I haven’t tried out this new functionality yet, so I thought I would walk through it here and explore it’s capabilities. [...]

    May 17, 2009 14:01 — Webhamer Weblog: Search & ICT-related blogging » links for 2009-05-17

  4. [...] Lucid Imagination » Exploring Lucene and Solr’s TrieRange Capabilities Recently, Uwe Schindler and others have added a new capability to Lucene and Solr to make working with numeric ranges a lot faster. I haven’t tried out this new functionality yet, so I thought I would walk through it here and explore it’s capabilities. [...]

    May 17, 2009 15:02 — StauthamerNet :: Staut’s Family Blog» Blog Archive » links for 2009-05-17

  5. [...] you want to read more, you can read the article posted by Grant on Lucid’s [...]

    May 19, 2009 17:05 — Solr/Lucene Feature Alert: TrieRange Capabilities | Productification

  6. Hi Aleksander,

    I’m not sure on your questions just yet, as I have only begun to explore the TrieRange capability. One thought, though, on dates, is I think Dates are currently stored as Strings (I presume you are talking Solr) and thus, storing as primitive Dates should be more compact, thus using less disk space, memory, etc. That being said, we likely should have a new Date field in Solr that stores as primitives even w/o the primitives.

    May 20, 2009 03:11 — Grant Ingersoll

  7. Hi Aleksander,

    the only drawbacks of TrieRange are a little bit larger index sizes, because of the additional terms indexed, and you cannot read the values with tools like Luke at the moment.

    TrieRange fields can be natively sorted (so the FieldCache uses no String comparison, they are parsed to longs/ints/…) and so function queries are also possible. Additionally TrieRange also supports dates (there is a “tdate”, too), which are stored natively (as unix ts in long format, see Date.getTime()). So sorting by date is very fast and you can also do ranges down to the millisecond. So I would recommend tdate for every date-time datatype! Date math is also possible in queries like with standard dt field types.

    June 3, 2009 06:13 — Uwe Schindler

  8. [...] course, Solr 1.4 also contains the new TrieRange functionality that will generally have the best time/space profile for range queries over numeric [...]

    July 6, 2009 17:37 — Ranges over Functions in Solr 1.4 « Solr ‘n Stuff

  9. [...] course, Solr 1.4 also contains the new TrieRange functionality that will generally have the best time/space profile for range queries over numeric [...]

    July 6, 2009 17:43 — Lucid Imagination » Ranges over Functions in Solr 1.4

  10. [...] there. One new thing you can do with Lucene 2.9 and Solr 1.4 more readily, is to use the new numeric search to encode proprietary business rules around ranges within data. In general, since the data is [...]

    October 23, 2009 07:24 — Lucid Imagination » Lucene and Solr: the Java of data

  11. Hey guys,
    Just an FYI, numeric ranges aren’t properly indexed or returned in Lucene unless you use a precision step of Integer.MAX. This is currently a bug within the LuceneTermEnum not enumerating over results properly. I’ve filed the bug and I’m working on the fix, but you should be leery of depending on accurate results with the current source version ( 56007e4630932be57699)

    http://github.com/tjake/Lucandra/issues#issue/40

    October 6, 2010 10:52 — Todd Nine

Leave a Reply

Go to Blog Front Page

  • Recent Posts

    • Indexing with SolrJ
    • Advanced Filter Caching in Solr
    • Lucene Revolution 2012 – Call for Participation now open!
    • SolrCloud is Coming (and looking to mix in even more ‘NoSQL’)
    • Our Solr Reference Guide updated for v3.5
    • Enhancing Discovery with Solr and Mahout – session slides now available!
    • Solr and LucidWorks feature matrix available
    • LucidWorks Enterprise latest version 2.0.1 released!
    • Why Not AND, OR, And NOT?
    • Options to tune document’s relevance in Solr
  • Archives

    • February 2012
    • January 2012
    • December 2011
    • November 2011
    • October 2011
    • September 2011
  • Tags

    acts_as_solr apache Apache Mahout best practices chump code4lib dismax drupal enterprise search Erik Hatcher field collapsing frange function query Grant Ingersoll hoss image isfdb Lucene lucene revolution LucidGaze lucid imagination Mahout Marc Krellenstein Mark Miller nutch Open Source Open Source Search qparser query parser queryparser Rails release result grouping Richmond Ruby schema design sint Solr solr 3.1 solr 4.0 solr cloud sortable spatial search Tika VA
  • Contact Us
  • About Lucid Imagination
  • Help & Support
  • Training
  • Privacy Policy
  • Legal Terms of Use
  • Copyrights and Disclaimers
  • Log in

Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2011 Lucid Imagination. All Right reserved.

loading Cancel
Post was not sent - check your email addresses!
Email check failed, please try again
Sorry, your blog cannot share posts by email.