Recently, Uwe Schindler and others have added a new capability to Lucene and Solr to make working with numeric ranges a lot faster. I haven’t tried out this new functionality yet, so I thought I would walk through it here and explore it’s capabilities.
Since Lucene treats most everything as Strings, encoding numbers and dates and then utilizing them in ranges has always required a little extra work to make it perform well. Previously, one would have to have either use less precision or slower running queries in order to work with ranges that had a lot of distinct values. This is due to the need for Lucene to enumerate through a large number of terms.
Now, however, thanks to the new org.apache.lucene.search.trie package (currently located in the contrib/queries area of Lucene, but it may move to the Lucene core, see [1] below) and it’s addition to Solr via a FieldType, Lucene and Solr users can take advantage of much faster range searches.
To try it out, I gave it a spin Solr by doing the following:
First, I grabbed the FieldType definitions from the Solr example schema:
<fieldType name="tint" class="solr.TrieField" type="integer" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" /> <fieldType name="tint4" class="solr.TrieField" precisionStep="4" type="integer" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" /> <fieldType name="tfloat" class="solr.TrieField" type="float" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" /> <fieldType name="tlong" class="solr.TrieField" type="long" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" /> <fieldType name="tdouble" class="solr.TrieField" type="double" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" /> <fieldType name="tdouble4" class="solr.TrieField" type="double" precisionStep="4" omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" />
Next, I declared a few Fields. In my case, I just used a bunch of dynamic fields, because I’m not too worried about the actual field names:
<field name="id" type="string" indexed="true" stored="true" /> <dynamicField name="*_i" type="integer" indexed="true" stored="true"/> <dynamicField name="*_f" type="float" indexed="true" stored="true"/> <dynamicField name="*_d" type="double" indexed="true" stored="true"/> <dynamicField name="*_l" type="long" indexed="true" stored="true"/> <dynamicField name="*_si" type="sint" indexed="true" stored="true"/> <dynamicField name="*_s" type="string" indexed="true" stored="true"/> <dynamicField name="*_sl" type="slong" indexed="true" stored="true"/> <dynamicField name="*_t" type="text" indexed="true" stored="true"/> <dynamicField name="*_b" type="boolean" indexed="true" stored="true"/> <dynamicField name="*_sf" type="sfloat" indexed="true" stored="true"/> <dynamicField name="*_sd" type="sdouble" indexed="true" stored="true"/> <dynamicField name="*_dt" type="date" indexed="true" stored="true"/> <!-- Trie Range fields --> <dynamicField name="*_tri" type="tint" indexed="true" stored="true"/> <dynamicField name="*_tri4" type="tint4" indexed="true" stored="true"/> <dynamicField name="*_trf" type="tfloat" indexed="true" stored="true"/> <dynamicField name="*_trl" type="tlong" indexed="true" stored="true"/> <dynamicField name="*_trd" type="tdouble" indexed="true" stored="true"/> <dynamicField name="*_trd4" type="tdouble4" indexed="true" stored="true"/>
Now it’s time for the rubber to meet the road, as they say. In other words, we need to put in some documents and search. In my case, I’m going to create three fields for every numeric value, one as a “regular” primitive, one as a “sortable” primitive, and one as a Trie* field, just to underscore some of the differences. My SolrInputDocument looks like:
SolrInputDocument result = new SolrInputDocument(); result.addField(idField, id); addField(result, "integer_i", "integer_si", "integer_tri", random.nextInt(10000));//get only positive values addField(result, "float_f", "float_sf", "float_trf", random.nextFloat()*100); addField(result, "long_l", "long_sl", "long_trl", Math.abs(random.nextLong())); double dbl = random.nextDouble() * 100; addField(result, "double_d", "double_sd", "double_trd", dbl); result.addField("double_trd4", dbl);
(all the addField does is add the value to the document three times, one for each field name. I could have used copyField in the schema.xml instead, but I wanted to explicitly declare them during document construction as opposed to editing the schema later when I change my mind). I then created an index with 10K documents.
Let’s start off with some basics. First up, I want to find all docs that have an integer value following between 0 and 5,000.
Naturally, the wrong way to do this is:
In this case, you get the lexicographic sorting and not the numeric sorting, which yields results, albeit bad ones. Bad Grant.
So, then I move over to the sortable version, using the same query, but changing the field to be integer_si. This now produces correct results. Not much to say here other than it works.
Finally, I tried out the Trie stuff by changing the query to use integer_tri. The results are the same as the sortable stuff, but, in my totally informal testing, the Trie stuff is a lot faster (others have done more formal testing, so I feel comfortable with my results.) Good news!
Next, I tried it with a lot more documents (1M) and saw pretty much the same speedups. In this case, I actually tried getting a much bigger range (50K), which is slower than the smaller range, but still faster than the sortable approach.
Of course, this is only scratching the surface. The take away, though, is the new Trie stuff in L/S holds a lot of promise for speeding up range based numeric queries and further blurs the line between search engines and databases (I’d argue it makes search all that more compelling, but…) More importantly, it is not dependent on the index size, but instead the precision chosen. Essentially, it formalizes what many people have done in practice through the years with various field values. See [2] for more details.
Look for the Trie Range stuff (possibly renamed) to be officially released in Lucene 2.9 and Solr 1.4
References:
- http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//contrib-queries/org/apache/lucene/search/trie/package-summary.html
- http://www.thetaphi.de/share/Schindler-TrieRange.ppt
[...] Lucid Imagination » Exploring Lucene and Solr’s TrieRange Capabilities. [...]
May 13, 2009 04:47 — Lucid Imagination » Exploring Lucene and Solr’s TrieRange Capabilities
Great overview, Grant! What would you say are the drawbacks of using the trie range field type for foreign key like fields that you sometimes want to sort on or put range limitations on? And do you have any thoughts on efficiency of representing dates as primitives instead of the Date field, if you use date as a sort field very often?
- Aleksander
May 15, 2009 01:23 — Aleksander Stensby
[...] Lucid Imagination » Exploring Lucene and Solr’s TrieRange Capabilities Recently, Uwe Schindler and others have added a new capability to Lucene and Solr to make working with numeric ranges a lot faster. I haven’t tried out this new functionality yet, so I thought I would walk through it here and explore it’s capabilities. [...]
May 17, 2009 14:01 — Webhamer Weblog: Search & ICT-related blogging » links for 2009-05-17
[...] Lucid Imagination » Exploring Lucene and Solr’s TrieRange Capabilities Recently, Uwe Schindler and others have added a new capability to Lucene and Solr to make working with numeric ranges a lot faster. I haven’t tried out this new functionality yet, so I thought I would walk through it here and explore it’s capabilities. [...]
May 17, 2009 15:02 — StauthamerNet :: Staut’s Family Blog» Blog Archive » links for 2009-05-17
[...] you want to read more, you can read the article posted by Grant on Lucid’s [...]
May 19, 2009 17:05 — Solr/Lucene Feature Alert: TrieRange Capabilities | Productification
Hi Aleksander,
I’m not sure on your questions just yet, as I have only begun to explore the TrieRange capability. One thought, though, on dates, is I think Dates are currently stored as Strings (I presume you are talking Solr) and thus, storing as primitive Dates should be more compact, thus using less disk space, memory, etc. That being said, we likely should have a new Date field in Solr that stores as primitives even w/o the primitives.
May 20, 2009 03:11 — Grant Ingersoll
Hi Aleksander,
the only drawbacks of TrieRange are a little bit larger index sizes, because of the additional terms indexed, and you cannot read the values with tools like Luke at the moment.
TrieRange fields can be natively sorted (so the FieldCache uses no String comparison, they are parsed to longs/ints/…) and so function queries are also possible. Additionally TrieRange also supports dates (there is a “tdate”, too), which are stored natively (as unix ts in long format, see Date.getTime()). So sorting by date is very fast and you can also do ranges down to the millisecond. So I would recommend tdate for every date-time datatype! Date math is also possible in queries like with standard dt field types.
June 3, 2009 06:13 — Uwe Schindler
[...] course, Solr 1.4 also contains the new TrieRange functionality that will generally have the best time/space profile for range queries over numeric [...]
July 6, 2009 17:37 — Ranges over Functions in Solr 1.4 « Solr ‘n Stuff
[...] course, Solr 1.4 also contains the new TrieRange functionality that will generally have the best time/space profile for range queries over numeric [...]
July 6, 2009 17:43 — Lucid Imagination » Ranges over Functions in Solr 1.4
[...] there. One new thing you can do with Lucene 2.9 and Solr 1.4 more readily, is to use the new numeric search to encode proprietary business rules around ranges within data. In general, since the data is [...]
October 23, 2009 07:24 — Lucid Imagination » Lucene and Solr: the Java of data
Hey guys,
Just an FYI, numeric ranges aren’t properly indexed or returned in Lucene unless you use a precision step of Integer.MAX. This is currently a bug within the LuceneTermEnum not enumerating over results properly. I’ve filed the bug and I’m working on the fix, but you should be leery of depending on accurate results with the current source version ( 56007e4630932be57699)
http://github.com/tjake/Lucandra/issues#issue/40
October 6, 2010 10:52 — Todd Nine