Interview with Uwe Schindler
Grant Ingersoll speaks with Apache Lucene committer Uwe Schindler. Uwe talks about his work innovating the TRIE-range numeric query, its application to geographic and scientific search applications, and frequent re-indexing.
- Grant Ingersoll:
- Today I'm speaking with Uwe Schindler, who is an Apache Lucene committer. Welcome Uwe.
- Uwe Schindler:
- Hello Grant.
- Grant Ingersoll:
- Why don't we start off by having you introduce yourself, and tell us a little bit about your background before you got involved with Lucene.
- Uwe Schindler:
- Yeah, so I come from Germany. Currently I am 31 years old, and before coming to Lucene, I started very early with Java programming, even with JDK 1.0.2. I wrote my first applet in JDK 1.0.2 for the Netscape browser, at that time. And then, a little bit later, came to PANGAEA, which is a geoscientific information system, and there we started to use full text search engines and the database plug-in for our SYBASE database was not working really well, so we switched to Lucene, and because of that, I came to Lucene and started to improve Lucene, for example, for this geospatial search.
- Grant Ingersoll:
- So speaking of spatial search, can you tell us a little bit more about what are the kind of the problems you work on?
- Uwe Schindler:
- Yeah, the different problems with a spatial search… Most users, for example, use Solr or Lucene, just to search for text phrases inside the index. The problem with geospatial searches is that you want to find, for example, all datasets – we call it datasets – inside a specific area, and so you have a latitude and longitude range query, and want to execute it.
And when I started with Lucene we had some problems with the range query that was very, very slow and often hit "too many clauses" exceptions if you have a lot of distinct values in your index, and a large number of documents. And so I started to find an easy solution to do those numerical searches.
So most users will enter some keywords, and then, for example, by selecting an area in the map, specific bounding box, and start a search. But it would be the same, for example, for a date search where you can also express it as a numerical search. And this is where I started to extend Lucene in the past.
- Grant Ingersoll:
- Can you kind of describe what your documents look like to give people a little bit better picture of, you know, why you would have so many terms in your index for a given field, and how, you know, so you've got lat/long and things like that, but then what else does a document look like?
- Uwe Schindler:
- I work for PANGAEA, which is a network for geoscientific data. What we are doing is, we do the same as libraries do for scientific books or papers or something like that. We publish the data. So it's just like a journal where you have these articles, these scientific articles. We publish the data.
So the datasets consist of the standard bibliographic information like name of authors, name of principal investigators (which are the persons that measured the values), the measurement values, for example, then you have latitude and longitude for the position of the measurement, some date/time information, when it was measured, and so, so it's just the standard Dublin Core information with a lot of other information.
And all this metadata is searchable by a Lucene index. And, for example, for the geospatial or date/time search, we need very large range queries. So for example, some people just do a range around the whole Africa, and they search for some keyword like "nitrate"; the problem is now, that we have a lot of datasets and if you combine this query with a very large range, you may hit, for example - if you have the bounding box around Africa - a lot of datasets.
Our index currently has about 700,000 datasets. I call the datasets citable units. And, for example, around Africa you may hit about 100,000 documents.
- Grant Ingersoll:
- Each document has one latitude and longitude value?
- Uwe Schindler:
- Yeah. Yeah.
- Grant Ingersoll:
- Or does it have multiple lat/longs?
- Uwe Schindler:
- It depends. So at the moment it's mostly one latitude/longitude, but according to some standards, there should be also a bounding box provided. Normally this bounding box is only one point, so each data set has a bounding box. Those are north, south, east, and west point, and you have a hit if the bounding box user entered into his search box intersects with the bounding box of the dataset.
But mostly it's just point data. So a ship is going somewhere and starts to drill a hole, do some scientific measurements, and then this data is recorded at just this point of the drill hole, and so it's very simple. But there may be more complicated datasets.
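The intersection test described above can be sketched in a few lines. This is an illustration in plain Python, not PANGAEA's or Lucene's code; the `BBox` type, its field names, and the coordinates are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical bounding box in decimal degrees (not a Lucene type)."""
    south: float
    west: float
    north: float
    east: float

    @classmethod
    def point(cls, lat: float, lon: float) -> "BBox":
        # Point data is simply a box whose corners coincide.
        return cls(lat, lon, lat, lon)

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap when both their latitude ranges and their
        # longitude ranges overlap. Boxes that cross the antimeridian
        # (west > east) are not handled in this sketch.
        return (self.south <= other.north and other.south <= self.north
                and self.west <= other.east and other.west <= self.east)


# Rough, illustrative bounds only.
africa = BBox(south=-35.0, west=-18.0, north=38.0, east=52.0)
drill_site = BBox.point(-23.5, 13.2)
print(africa.intersects(drill_site))
```

Each of the four comparisons is an ordinary numeric range condition, which is why a fast numeric range query is enough for this kind of search.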
- Grant Ingersoll:
- So I know your primary contribution is that, your first contribution to Lucene was the, the TrieRangeQuery or TrieRangeQuery as some people might call it, T R I E. Can you explain what that is and how it works?
- Uwe Schindler:
- The idea behind it was…, when you read the wiki about Lucene, there are some examples that look like this TrieRangeQuery. You may know if you, for example, do some date searches, for example, you want to have all datasets between a specific range of dates, it is often interesting to not just go until the second, because if you only have a search query going from January to February, it's enough to just search all datasets that contain January or February, or something like that.
And the idea behind TrieRangeQuery is that you index a numeric value, not as just one token, as you would do it with a classic range query. You add some additional tokens in a lower precision by shifting away the lower bits, and then you have some predefined areas.
Back to the example of the date: If you index, for example, for each record also the year number alone, you can get all datasets in this year by just searching for the year number. If you have, for example, a query from December 2006 to February 2008, you would just use December of 2006, and then the whole Year 2007, and then January and February of 2008. And TrieRangeQuery does exactly that.
At the end of the range (or the bounds of the range), it is searching very exact, but in the middle it uses the lower precision terms to hit as many documents as possible with fewer terms. So the overhead of the term docs search for document IDs and the number of terms is reduced dramatically.
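The idea can be sketched in plain Python. This is a simplified illustration, not Lucene's implementation: it handles only non-negative integers, and the names `trie_terms`, `split_range`, and `matches`, as well as the 16-bit width and 4-bit step, are assumptions chosen for the example.

```python
def trie_terms(value, bits=16, step=4):
    """Terms indexed for one value: the value itself plus lower-precision
    versions obtained by shifting away the lower bits."""
    return [(shift, value >> shift) for shift in range(0, bits, step)]


def split_range(lo, hi, bits=16, step=4):
    """Cover [lo, hi] with (shift, first_prefix, last_prefix) term ranges:
    exact terms at the bounds, lower-precision terms in the middle."""
    block = 1 << step
    ranges = []
    shift = 0
    while lo <= hi:
        if shift + step >= bits:
            # Coarsest level: take whatever is left.
            ranges.append((shift, lo, hi))
            break
        if lo % block != 0:
            # Lower bound is not aligned: cover it at this precision.
            end = min(hi, lo | (block - 1))
            ranges.append((shift, lo, end))
            lo = end + 1
        if lo <= hi and hi % block != block - 1:
            # Upper bound is not aligned: cover it at this precision.
            start = max(lo, hi - hi % block)
            ranges.append((shift, start, hi))
            hi = start - 1
        if lo > hi:
            break
        # The remaining middle part is aligned; move to coarser terms.
        lo >>= step
        hi >>= step
        shift += step
    return ranges


def matches(value, ranges):
    """A value matches if any of its indexed terms falls in a term range."""
    return any(first <= (value >> shift) <= last
               for shift, first, last in ranges)


ranges = split_range(1000, 60000)
term_count = sum(last - first + 1 for _, first, last in ranges)
print(ranges)
print(term_count, "terms instead of up to", 60000 - 1000 + 1)
```

The design trade-off is the one described in the interview: each value costs a few extra index terms, and in exchange a wide range touches only a small number of terms at query time.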
- Grant Ingersoll:
- You don't have to iterate through as many terms, essentially, is how it happens when you're in that middle section, right?
- Uwe Schindler:
- Yes, in the middle section. So for example, for the case of Africa, as I told you before, we have 100,000 datasets around Africa, and as latitudes and longitudes are measured values, they are all going to five digits after the dot, and so you have for every dataset a distinct value. And so it may be possible that you have to iterate over 100,000 terms in the index and find only one dataset for each term. And this is very expensive in cost of I/O and seeking.
- Grant Ingersoll:
- And so now the TrieRange stuff, that's going to be available in Lucene 2.9, right?
- Uwe Schindler:
- Yes.
- Grant Ingersoll:
- It was just committed as part of 2.9, so it's not in any of the previous releases, the 2.4 of Lucene or anything like that. And I thought there was some discussion about actually changing the name to be something like NumericRangeQuery or something like that. Is that, is that still going on?
- Uwe Schindler:
- This is still going on, but at the moment the discussion is stuck, because of a lot of other discussions on the mailing list, and a lot of open issues for 2.9. But at the moment it's in the contrib queries area, and the discussion was to move it to the core, because there's currently no support for real numeric values. At the moment you have to index them as simple text strings by zero padding and then you have the problem with negative numbers, because they must be in alphabetical order, the terms. So you have problems with negative values.
And so on, there's a whole chapter in, in the Lucene in Action book about that. And with TrieRangeQuery you can natively index numeric values, which can be integers or floating point numbers. So maybe it moves before 2.9 into the core. I'm working on that.
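The negative-number problem with zero padding is easy to reproduce. The sketch below is plain Python for illustration only; `zero_padded` and `sortable` are hypothetical helpers, and the bias-to-unsigned encoding shown is one common way to get order-preserving terms, not a description of Lucene's exact term format.

```python
def zero_padded(n, width=6):
    # Naive encoding: the minus sign breaks lexicographic order.
    return f"{n:0{width}d}"


def sortable(n, bits=32):
    # Shift the signed range [-2^(bits-1), 2^(bits-1)) onto [0, 2^bits)
    # and write it as fixed-width hex, so string order equals numeric order.
    biased = n + (1 << (bits - 1))
    return f"{biased:0{bits // 4}x}"


values = [3, -10, 250, -5, 0]
print(sorted(values, key=zero_padded))  # string order of the naive terms
print(sorted(values, key=sortable))     # string order of the biased terms
```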
- Grant Ingersoll:
- Oh great. So now I know you're a Lucene committer. Can you tell us about some of the other things in Lucene that you've worked on?
- Uwe Schindler:
- Yeah, there were some other things I started at the beginning together with the TrieRange. I had the problem that the TrieRange values are in a special format, and so if you want to, for example, order the search results by, for example, the numeric value, you have the problem that it was not possible, because the standard sorting algorithm was only able to handle standard text like numbers, and I wasn't able to use trie for that.
And the only possibility was to use it just by using a string cache, but this is, especially for large indexes, very inefficient in terms of building the field cache and so on, and so the problem was to add the parser support to sort fields was my fourth contribution, and then any other contributions, not only for Lucene, it was also a contribution to Solr, to put a TrieRangeQuery there, and then I forgot what I'm doing.
Sometimes I think it's, yes it was, the new field cache code where we discussed a lot, and how to do it, but it's not currently implemented. Maybe I forget something. I'm not really sure what –
- Grant Ingersoll:
- No, that's fine. That's fine. I mean it all starts to blur together after a while. When you've been working on Lucene for so long you'll find that you're never sure what patch went into what version and things like that without having to go look it up. So. So what other kinds of issues do you have in making your content searchable? I mean so I imagine on the, the query side you have some issues as well in dealing with geospatial information.
- Uwe Schindler:
- Most problems are not really related at the moment to our geospatial search. Our other problems are that users sometimes do not know exactly how to write their terms into the search field, because of missing stemming and so on in our index. Our problem is we have a lot of languages, so datasets are not only in English language; we also have German language and so on. So it's very complicated to detect correctly the languages. So we have some auto-complete at the moment in our index that helps a user for entering a search term by just doing a list terms on the index and showing the user an AJAX box, all terms starting with the currently entered letters. And then you can select one of the terms and start a search, for example.
But there are also problems. At, at this time I'm thinking about improving that, for example, by using statistics and so on.
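The auto-complete described above amounts to a prefix scan over the sorted term list. A minimal sketch in plain Python, assuming the index terms are available as a sorted list; `complete` and the sample terms are made up for the example.

```python
from bisect import bisect_left


def complete(sorted_terms, prefix, limit=10):
    """Return up to `limit` terms that start with `prefix`."""
    # Binary search finds the first term >= prefix; all matching
    # terms follow it contiguously in a sorted list.
    start = bisect_left(sorted_terms, prefix)
    suggestions = []
    for term in sorted_terms[start:start + limit]:
        if not term.startswith(prefix):
            break
        suggestions.append(term)
    return suggestions


terms = sorted(["nitrate", "nitrite", "nitrogen", "oxygen", "nickel"])
print(complete(terms, "nitr"))
```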
- Grant Ingersoll:
- So do you also have, you know I've talked with Ryan McKinley and some other people about geo searching in these podcasts, and do you have the issue of, you know, do people type in things like addresses and that at all, or is it purely they're, they're more looking for people and more scientific kind of things.
So they wouldn't specifically look, "oh find me information about a sensor that's in", you know, Morocco, for instance.
- Uwe Schindler:
- Yeah, exactly that. So normally you would not enter an address and want to find something around a specific area. It's just more the user enters some search terms together with some constraints on the geographical position, and also not only geographical position, also time. So it's, it's also possible that someone asks, for example, for all datasets coming from 10 million years ago. So it's not when it was measured. It was, for example, about core data from this time area. So these are all the numerical queries users enter together with names and parameters, and information about, for example, the ship that was used or any other device. So just all metainformation that may belong to this data.
- Grant Ingersoll:
- So can you walk us through a little bit what the PANGAEA search architecture looks like, you know, how many, how does your indexing work a little bit. How does the search side work? What kind of machines do you have? Those kind of things.
- Uwe Schindler:
- Yeah, sure. So we, we started in the past using a full text search engine built in our SYBASE database. So behind all this is a relational database, but at the moment we only use this relational database to manage our data. So we do not do any searches from the clients directly through the database. This is currently only done through the Lucene index.
And so all data is almost completely relational. So it's a classical star schema and you have, for example, for all staff members that are authors or scientists and so on you have a table with all scientists in it with their address information and so on. And the central information the user is looking for are the so-called datasets, which are the citable entities. It's like a paper in a scientific journal. You, you can search it with the citations or you have authors and title and so on.
And all our information are just attributes to those datasets. And so what we are doing is: We first create XML files of all our datasets. We have a specific schema for our database that conforms to global standards on geoscientific information.
A lot of people may know ISO 19115, which is for geospatial information where there is all this stuff inside. And to create these XML files we just use XSLT and do some SELECT statements on the database, and write it into our XML, and then we use an indexer and index the content of those XML files using a special indexer.
So, and we also store these XML files in a separate data store for later access, because we also have other services beyond the search on the home page with Lucene. So, for example, to exchange data with other libraries, we have services to contribute those XML files to other libraries.
For example, in Germany, we contribute all our metadata to the Technical Information Library. And so we need to create all these datasets. The problem with a relational database is if you, for example, change one of the related fields, like for example, you change a staff member and you change the email address of one person, it may be related to a lot of datasets. And so at the moment we have some triggers in our database that will queue up updates, and then we have a process in the background that harvests this queue, and then determines which datasets need to be updated, and these datasets are then recreated from the current content of the database.
- Grant Ingersoll:
- That makes sense.
- Uwe Schindler:
- So if somebody changes something, that XML data is updated and later it goes into the Lucene index.
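The queue-and-harvest step can be sketched as follows. This is a toy illustration in plain Python, not PANGAEA's code: the change tuples, the `DATASETS_BY_STAFF` lookup, and the identifiers are all assumptions standing in for database triggers and relational joins.

```python
from collections import deque

# Hypothetical relation: which datasets reference which staff member.
DATASETS_BY_STAFF = {
    "staff:7": {"ds:1", "ds:2"},
    "staff:9": {"ds:2", "ds:3"},
}


def affected_datasets(change):
    """Map one queued change to the datasets whose XML must be rebuilt."""
    kind, key = change
    if kind == "dataset":
        return {key}
    if kind == "staff":
        return set(DATASETS_BY_STAFF.get(key, ()))
    return set()


def drain(queue):
    """Consume the queue and return each affected dataset exactly once."""
    dirty = set()
    while queue:
        dirty |= affected_datasets(queue.popleft())
    return sorted(dirty)


# Changes as a trigger might have queued them (duplicates are possible).
queue = deque([("staff", "staff:7"), ("dataset", "ds:3"), ("staff", "staff:7")])
to_rebuild = drain(queue)
print(to_rebuild)
```

Collecting the affected datasets into a set before regenerating anything means a burst of edits to one staff record rebuilds each related dataset only once.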
- Grant Ingersoll:
- So do you then index every night or?
- Uwe Schindler:
- No, it is, it's very often. At the beginning when we used 2.4 of Lucene, it was about, I think once per hour, to not completely index it, but only index those changed XML files. But since I've used the development version because of TrieRange of 2.9, and then you re-implemented per segment search, we can also re-index every 20 minutes at the moment. And because of the reopen functionality, together with the optimized field cache and per-segment search, the reloading of, for example, field caches is now very, very fast. So we can index new documents very, very often.
But we sometimes do a complete re-indexing when, when for example, we change our XML schemas and we need to recreate the whole index. So maybe this is clear, and, so things like optimizing is done once per week.
- Grant Ingersoll:
- Gotcha.
- Uwe Schindler:
- And we also do some consistency checks. So there is one service running at the weekends that checks all dataset information with date/time stamps, if the index version is up to date, and if not it queues a new harvesting for the XML files of that dataset and so on.
- Grant Ingersoll:
- Makes sense.
- Uwe Schindler:
- So, and this system is currently running on two large machines. The database is running on a Sun Solaris server with I think eight processors, and the web server, application server, and so on, is running on 64 bit Solaris, too, using a Sun ONE web server, which also supports Java servlets and web applications. And this machine has 16 processors.
But we also have an additional data warehouse running on it. But this is not related to Lucene. It's more if you want to get the real measurement values and not just information about metadata.
- Grant Ingersoll:
- So do you have any useful tips for someone who is just getting started doing spatial search and how they might leverage that in Lucene and Solr?
- Uwe Schindler:
- I would first look into the two contributions at the moment. There is the TrieRangeQuery, which is just for every numeric query, so you can use it not only for latitudes and longitudes, but you can also use it for prices or for date/time.
So you can use it for everything that you can express as a number. So, for example, if you use a date and just get the milliseconds since the Unix epoch, it can also be used with TrieRangeQuery, and this is how it works at the moment. So it's completely open for all numerical types. This works very well with bounding boxes and so on.
But, for example, if you want to do, for example, such queries like "find me all ATMs around a specific address somewhere", you have the problem that you have a little bit more complex queries, because of some distance calculations, and for that there is an additional contribution. But this contribution has the problem that it is not really good for very, very big, big areas.
So the example I mentioned before, bounding box around the whole Africa may be not so effective. So for that, I think TrieRangeQuery is better. And TrieRangeQuery, in my opinion, is for a simple case, is much simpler to use, because it's just a way of how you would do it with a standard relational database. So you do not need to know much about, about all this geo stuff. You only have to know I have these coordinates, and you do some greater, smaller, or between queries on it.
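Turning a date into a number, as mentioned above, is a one-liner. The sketch below is plain Python for illustration; `epoch_millis` and `in_range` are hypothetical helpers, and the dates are example values only.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_millis(dt):
    """Milliseconds since the Unix epoch, using exact integer arithmetic."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def in_range(value, lower, upper):
    # The same "between" condition you would write against a database.
    return lower <= value <= upper


start = epoch_millis(datetime(2006, 12, 1, tzinfo=timezone.utc))
end = epoch_millis(datetime(2008, 2, 29, 23, 59, 59, tzinfo=timezone.utc))
measured = epoch_millis(datetime(2007, 6, 15, 12, 0, tzinfo=timezone.utc))
print(in_range(measured, start, end))
```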
- Grant Ingersoll:
- That's great. So any other tips on Lucene in general that maybe you want to pass on, things you've learned along the way and maybe can save some people some time who are just getting started with Lucene?
- Uwe Schindler:
- We are running on a very, very large machine with a lot of RAM, and also 64 bit. I usually recommend all my users to just use a newer server with 64 bits and not only because you can put in more than three gigabytes of RAM into it, in my opinion the best way is to use this memory map directory implementation as this only works well with 64 bits, because with 32 bits you may get a conflict because of the used address space.
But in my tests, this was the fastest implementation. So there is currently some discussion about changing the defaults in Lucene so we have a factory for this directory implementation that chooses if you're using Windows or 64 or 32 bit, and then it automatically uses the classical file system, the new Java NIO, or the memory mapping.
So I always recommend memory mapping, though it's a good start.
And also think about how to create your fields. For example, if you want to index those XML files, I do it using a "catch all" field. So what we are not doing is, for example, if the user enters a query string without any fields, and I want to search everywhere, some developers start, for example, to index author into one field, the title into another field, and then on the query side they start to make a big OR query using all field names using this MultiFieldQueryParser. I don't think this is a good approach, disk space is so low cost at the moment. So I would just index all those fields redundantly, and so have a field for title, for author, and so on, and just one field where you put everything in.
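The "catch all" approach can be sketched without any search library. This is plain Python for illustration; `build_document`, the field name `all`, and the sample record are assumptions, and the dictionary merely stands in for an indexed document.

```python
def build_document(record, catch_all="all"):
    """Keep every field and add one redundant field holding all the text."""
    doc = dict(record)
    doc[catch_all] = " ".join(str(value) for value in record.values())
    return doc


record = {"title": "Nitrate profile", "author": "Schindler"}
doc = build_document(record)

# A query without a field name only has to look at one field,
# instead of OR-ing the same term across every field.
print("schindler" in doc["all"].lower())
```

The cost is a larger index, which is the trade-off named above: disk space is spent to keep unfielded queries simple.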
- Grant Ingersoll:
- Ah, that makes sense. Yep.
- Uwe Schindler:
- Yeah. So for this XML metadata, we have a very simple XML schema. We have two possibilities. For the numeric fields, we use XPath during harvesting. And so we harvest XML files, and then we do XPath queries on the XML files, and then index the result of the XPath query into, for example, a trie field, and I think with the data import handler, this is also possible with Solr now. And for the other fields I have some kind of SAX parser that just uses the element names and creates field names.
So it creates a hierarchical structure of field names. So if you have a top level element called metadata, and then a second level element called author, and then a third level element with first name and last name, then it creates a field "first name" and a field "last name", and it also creates a field for the whole name and so on, and you can index a whole XML file and then you can enter your query if you only want, for example, to look in the authors or in the abstract, which may consist of more than one XML element. You can search directly. So, so this is our field structure, and how we do it in Lucene at the moment.
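A SAX handler along these lines can be sketched with Python's standard `xml.sax` module. This is an illustration under assumptions, not the indexer described above: the `/`-separated field names, the `FieldHandler` class, and the rule that text is added to every ancestor's field are guesses at one reasonable reading of the description.

```python
import xml.sax
from collections import defaultdict


class FieldHandler(xml.sax.ContentHandler):
    """Collects element text under hierarchical field names."""

    def __init__(self):
        super().__init__()
        self.path = []                   # currently open elements
        self.buffer = []                 # text chunks of the current node
        self.fields = defaultdict(list)  # field name -> list of values

    def startElement(self, name, attrs):
        self._flush()
        self.path.append(name)

    def endElement(self, name):
        self._flush()
        self.path.pop()

    def characters(self, content):
        # SAX may deliver one text node in several chunks.
        self.buffer.append(content)

    def _flush(self):
        text = "".join(self.buffer).strip()
        self.buffer = []
        if text:
            # Add the text to the element's own field and to the field
            # of every ancestor, so "author" holds the whole name.
            for depth in range(1, len(self.path) + 1):
                self.fields["/".join(self.path[:depth])].append(text)


def extract_fields(xml_text):
    handler = FieldHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return dict(handler.fields)


sample = ("<metadata><author><firstName>Uwe</firstName>"
          "<lastName>Schindler</lastName></author></metadata>")
fields = extract_fields(sample)
for name, values in fields.items():
    print(name, values)
```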
- Grant Ingersoll:
- Yeah, no that's, that's all good stuff. I think that's, that's about all the questions I had Uwe. I want to thank you for your time, and I appreciate you speaking with me today.
- Uwe Schindler:
- Okay, yeah. It was, was nice to speak with you Grant, too. Yeah, thank you.
- Grant Ingersoll:
- Great. Thanks again.
