Interview with Ryan McKinley
Ryan McKinley speaks with Lucid's Grant Ingersoll about Solr and Lucene, his work using Lucene and Solr to do geo-spatial search, as well as his use of Solr at places like http://www.instructables.com.
- Grant Ingersoll:
- Today I’m speaking with Ryan McKinley, Apache-Solr Committer and Lucene PMC Member. Welcome, Ryan, and thank you for speaking with me.
- Ryan McKinley:
- Hello, Grant.
- Grant Ingersoll:
- So why don’t we start off by having you introduce yourself and tell us a little bit about your background before you came into Solr.
- Ryan McKinley:
- Let's see, I've been working with computers since, I guess, about '95. Went to University of California, San Diego. And there worked for a professor doing virtual reality stuff, so early C programming on Silicon Graphics machines doing 3D stuff. Sort of from there, stumbled into project after project changing technologies, changing languages the whole way along. So, worked on a bunch of data retrieval stuff back in '97. From there, worked in Spain for a while doing science museum exhibits, little programs to do that. Then I went to the MIT Media Lab where I worked on a few projects there, including robotics, moving robots around, and then later a big information system that was monitoring government activity, and that sort of led me into Solr because I built something similar to what Lucene and Solr do but in some ways not as performant and not as scalable. Currently, I'm working with mapping and trying to do map searching for geographic data sets.
- Grant Ingersoll:
- So are you using Solr to do that?
- Ryan McKinley:
- Yes, I’m using a version of Solr that then has some of my own extensions written on that to enable a geographic search. So it uses Solr and Lucene with all the standard text-based stuff, a custom R-tree implementation that gets wrapped into a Lucene filter, so I can limit searches by geographic region but still have all the nice faceting and sorting that Solr has to offer.
- Grant Ingersoll:
- Can we take a step back and kind of explain – give a brief overview of what local searches and in this GIS mapping, and what you’d need to do in order to make – what are the kinds of things that you would do to solve problems in that space?
- Ryan McKinley:
- So GIS and geographic search have been around for a long time, and there’s many data formats that are well tried and true index formats that work well for geographic data. The biggest ones are probably an R-tree which is a graph structure where each node represents a minimum bounding box. So a region within the world that your object falls within and the structure essentially is just a tree that puts the minimum bounding box within themselves, so you say, “Here is a bounding box, three items fit in that, if you put more than three items in it you break that box up into more bounding boxes with greater and greater fidelity.” So that’s one structure. There’s maybe five or six versions of that.
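The bounding-box idea described here can be sketched in a few lines of Python. This is only an illustration, not a full R-tree: it bulk-loads a static tree by sorting and packing, rather than splitting nodes on insert. The names `Box`, `Node`, `build`, and `search` are hypothetical, and the capacity of three simply mirrors the example in the answer above.

```python
from dataclasses import dataclass

MAX_ITEMS = 3  # node capacity, borrowed from the interview's example


@dataclass(frozen=True)
class Box:
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersects(self, other: "Box") -> bool:
        return (self.min_x <= other.max_x and other.min_x <= self.max_x
                and self.min_y <= other.max_y and other.min_y <= self.max_y)


def bounding_box(boxes):
    """Smallest box that encloses every box in the list."""
    return Box(min(b.min_x for b in boxes), min(b.min_y for b in boxes),
               max(b.max_x for b in boxes), max(b.max_y for b in boxes))


class Node:
    def __init__(self, box, children=(), items=()):
        self.box = box
        self.children = list(children)  # child nodes (internal node)
        self.items = list(items)        # (name, Box) pairs (leaf node)


def chunks(seq, size):
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def build(items):
    """Pack (name, Box) pairs into leaves of MAX_ITEMS, then pack the leaves."""
    if not items:
        raise ValueError("need at least one item")
    ordered = sorted(items, key=lambda it: it[1].min_x)
    level = [Node(bounding_box([b for _, b in group]), items=group)
             for group in chunks(ordered, MAX_ITEMS)]
    while len(level) > 1:
        level = [Node(bounding_box([n.box for n in group]), children=group)
                 for group in chunks(level, MAX_ITEMS)]
    return level[0]


def search(node, query):
    """Return names whose box intersects the query, skipping whole subtrees."""
    if not node.box.intersects(query):
        return []
    hits = [name for name, box in node.items if box.intersects(query)]
    for child in node.children:
        hits.extend(search(child, query))
    return hits
```

The point of the structure is in `search`: a subtree whose bounding box misses the query is never visited, so most items are never compared individually.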
- Grant Ingersoll:
- If I’m at my house and I want to find all of the Starbucks say within 30 miles of where I’m at right now, what would I do?
- Ryan McKinley:
- So with standard GIS indexing schemes you would first, as you are indexing those, you would sort of break their world up into regions. So if you were saying, “I need to find something in Raleigh North Carolina”
- Grant Ingersoll:
- Raleigh.
- Ryan McKinley:
- Raleigh, North Carolina. You don’t want to start checking to see the distance from Raleigh to every Starbucks. You want to first break that up into chunks, so first make sure, you know, let’s say our dataset contains a million Starbucks. We don’t want to check the distance to all those million. First you limit that to a smaller region. So if there is only a hundred of them in North Carolina – not true but -- (Laughing.)
- Grant Ingersoll:
- (Laughing.)
- Ryan McKinley:
- -- you would limit it to first North Carolina, if you could still limit further to say “Raleigh,” you would get it down to that list, and then only start doing the math to calculate the distance once you are able to limit your search results to a much smaller region.
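A minimal Python sketch of that two-step approach: a cheap bounding-box test first, and the exact distance math only for the survivors. The function names, the tuple layout of `places`, and the spherical-Earth radius are assumptions made for illustration; the sketch also ignores the poles and the date line.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean radius, spherical approximation
MILES_PER_DEGREE = EARTH_RADIUS_MILES * math.pi / 180.0


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(origin, places, radius_miles):
    """places is a list of (name, lat, lon); returns (name, miles) sorted by distance."""
    lat0, lon0 = origin
    dlat = radius_miles / MILES_PER_DEGREE
    # Use the latitude farthest from the equator so the box is not too narrow.
    widest_lat = min(abs(lat0) + dlat, 89.9)
    dlon = radius_miles / (MILES_PER_DEGREE * math.cos(math.radians(widest_lat)))

    # Step 1: cheap comparisons only, no trigonometry per place.
    candidates = [p for p in places
                  if abs(p[1] - lat0) <= dlat and abs(p[2] - lon0) <= dlon]

    # Step 2: exact distance for the few candidates that are left.
    hits = []
    for name, lat, lon in candidates:
        miles = haversine_miles(lat0, lon0, lat, lon)
        if miles <= radius_miles:
            hits.append((name, miles))
    return sorted(hits, key=lambda hit: hit[1])
```

In a real index the first step would be answered by the index structure itself rather than by scanning a list; the list comprehension here only stands in for that.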
- Grant Ingersoll:
- So in Solr terms I would have – is – are those implemented as a filter?
- Ryan McKinley:
- In Solr Lucene terms we’re still in – yeah, forging new territory. The various techniques that exist have all been well-established for indexes that don’t do text search well. So with Solr there’s a couple of people now – and I think lots of people have tried various techniques to get this to happen – the biggest project right now is something called Local Lucene and that’s Patrick O’Leary at AOL has been using that to do their Yellow Pages search. And that takes – he originally started that when he started with – from Lucene in Action. There is a chapter that says, “Hey, you could try a geographic search with a simple range query on latitude-longitude, search within a bounds as a standard filter.” That works, it’s great until you get a few documents then you quickly hit the maximum Boolean Clause error, so after that he started looking at some other methods for how to change that to be a little bit faster, and what that’s currently doing is it takes a bunch of points. It works purely with point data and with point data you can ask for this point, again I’m gonna break the world into a series of tiers, so Cartesian tiers that, given a point you break that up and for each level of detail, so one might be, “Are we in the Western part of the United States or the Eastern part of the United States?” So that gets represented as a token in Lucene and then we go break that down and do another tier, so are you in Oregon or Washington, or California? It gets broken down into another token at a different level, and then it goes all the way down to a much more granular level, I think, with the Java precision and with floating points so it actually doubles we end up with precision that gets you down to about three or four feet in real world units.
So essentially you break that down and you get a bunch of tokens, and what’s great about that is that then these very complex geographic queries can be broken down into simple term queries which, as you know, are pretty fast in Lucene and get all the power of distributed search and all that, which is the beauty of that approach.
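To make the tier idea concrete, here is a small Python sketch that turns one point into one token per tier. The grid layout (2^tier by 2^tier equal-angle cells) and the token format are assumptions chosen for illustration; they are not Local Lucene's actual encoding.

```python
def tier_cell(lat, lon, tier):
    """Grid cell (x, y) containing the point when the world is cut into 2**tier columns and rows."""
    n = 2 ** tier
    x = min(int((lon + 180.0) / 360.0 * n), n - 1)  # clamp lon == 180
    y = min(int((lat + 90.0) / 180.0 * n), n - 1)   # clamp lat == 90
    return x, y


def tier_tokens(lat, lon, tiers=range(1, 11)):
    """One plain-text token per tier, coarse to fine, suitable for term matching."""
    tokens = []
    for tier in tiers:
        x, y = tier_cell(lat, lon, tier)
        tokens.append(f"t{tier}_{x}_{y}")
    return tokens
```

Two points that are close together share their coarse tokens and only diverge at finer tiers, which is what lets a geographic question be answered by exact term matches.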
- Grant Ingersoll:
- So are each of the tokens on a single document or how do you then create documents?
- Ryan McKinley:
- So in Local Solr and Local Lucene what happens is you have a single field that – or actually you have two fields. You have a latitude and longitude field that get read together and then we generate another – it’s configurable but let’s say 10 different fields but each field represents a tier in this Cartesian space. So the highest level of detail will have the fewest number of possible grids, and then that gets more and more detailed the further down you go.
- Grant Ingersoll:
- If I say I had my house would that – that would then have – that would be one document and it would have all 10 fields for each layer that it belongs to?
- Ryan McKinley:
- Correct.
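As a rough Python sketch of what such a document could look like, and how a box query turns into plain term lookups: the field names (`tier_1` … `tier_10`), the equal-angle grid, and the in-memory postings map are all assumptions for illustration, standing in for a real Lucene index.

```python
from collections import defaultdict

TIERS = range(1, 11)  # ten tier fields, as in the example above


def cell(lat, lon, tier):
    n = 2 ** tier
    x = min(int((lon + 180.0) / 360.0 * n), n - 1)
    y = min(int((lat + 90.0) / 180.0 * n), n - 1)
    return x, y


def make_document(doc_id, lat, lon):
    """One document: the raw point plus one cell value per tier field."""
    doc = {"id": doc_id, "lat": lat, "lon": lon}
    for tier in TIERS:
        x, y = cell(lat, lon, tier)
        doc[f"tier_{tier}"] = f"{x}_{y}"
    return doc


def build_postings(docs):
    """Map (field, term) to the set of document ids, like a tiny inverted index."""
    postings = defaultdict(set)
    for doc in docs:
        for tier in TIERS:
            field = f"tier_{tier}"
            postings[(field, doc[field])].add(doc["id"])
    return postings


def box_query(postings, south, west, north, east, tier):
    """Union of exact term lookups for every cell the box touches at one tier."""
    x0, y0 = cell(south, west, tier)
    x1, y1 = cell(north, east, tier)
    field = f"tier_{tier}"
    hits = set()
    for x in range(x0, x1 + 1):
        for y in range(y0, y1 + 1):
            hits |= postings.get((field, f"{x}_{y}"), set())
    return hits
```

The result is cell-granular, so it can include points just outside the box; an exact check on the stored `lat` and `lon` would normally follow for the candidates.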
- Grant Ingersoll:
- Okay. Whereas just if I’m North Carolina, that document would just be whatever field is needed to represent North Carolina.
- Ryan McKinley:
- Correct. I mean one of the limitations of Local Lucene as written is that it only supports point data. So you have to put a point in where you could say, “my house,” but you can’t say “North Carolina” which is at a house but has some width and height, a maximum height. So it’s what Local Solr has – Local Lucene has focused on because that’s its core application. There might be some tricks that we can do to it to get it to represent regions and represent bounding boxes but at its core it’s built for searching for point data.
- Grant Ingersoll:
- And that makes sense given the context of AOL’s yellow pages.
- Ryan McKinley:
- Exactly, that’s the problem they’re solving. It’s a very common problem and they’ve done it in a way that leverages all of the power, like it would break these geographic queries down to very simple Lucene queries that does the minimum of math.
- Grant Ingersoll:
- So how does – how well does this perform in real applications? As I understand it AOL has this in production, right?
- Ryan McKinley:
- AOL has it in production. I’m a little reluctant to run off the numbers that might not be correct –
- Grant Ingersoll:
- Sure, sure.
- Ryan McKinley:
- -- but it was on the order of 15 million if I remember correctly. It’s on the order of 15 million locations that they do a search for and get results in under a tenth of a second.
- Ryan McKinley:
- And that’s at a couple of hundred, I mean even a couple of thousand inquiries per second. I mean their queries are off the charts and are – it’s totally impossible to think of that kind of query response with standard GIS database formats.
- Grant Ingersoll:
- One of the things I have experienced with doing this local kind of search is the notion of query parsing becomes quite difficult, right, because people when they enter addresses, they’re incomplete, they don’t remember the town right, or they don’t spell it right or words that could be – you know – you could be at 1600 Pennsylvania Avenue and but Pennsylvania is also a state and so knowing – you know if somebody just put in ‘1600 Pennsylvania’ knowing when they – that they really mean Pennsylvania Avenue versus 1600 in the state of Pennsylvania. So have you dealt with those kind of problems? This notion of query parsing on the local side?
- Ryan McKinley:
- Yeah, there’s actually – there’s an interesting article that I’d just point people to a website, it’s Patrick O’Leary’s blog. It’s called gissearch.com. He actually has an article about parsing. You know how they have the simple one box at the top where you might say, “Starbucks in Raleigh.” Figuring out that when you’re saying that you actually want Raleigh to be a location and to extract that, to have that be limiting things by location whereas you know 1600 Pennsylvania Avenue, you might be wanting to treat that separately. So he actually has a pretty good post there and the thing I’ll just say in general on that is once you know your index statistics writing a special parser that is a whole separate request can help you with that. His solution to that was phrase slop.
- Grant Ingersoll:
- So that’s the notion of finding words that are close to each other –
- Ryan McKinley:
- Right.
- Grant Ingersoll:
- -- but not necessarily exactly next to each other?
- Ryan McKinley:
- Exactly, and building an index based on the results that you actually want. So if you know your list of cities that are meaningful to you you would put those into an index.
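A small Python sketch of the "index of the places that matter to you" idea. It uses a plain dictionary lookup on the trailing words of the query, which is a simpler stand-in and not the phrase-slop approach mentioned above; the function name and the place list are hypothetical.

```python
CONNECTORS = {"in", "near", "around"}  # assumed filler words between "what" and "where"


def split_what_where(query, known_places):
    """Split a one-box query into (what, where) using a set of known place names.

    The longest trailing run of words that matches a known place wins.
    Returns (query, None) when no trailing place is recognised.
    """
    tokens = query.lower().replace(",", " ").split()
    for start in range(len(tokens)):
        candidate = " ".join(tokens[start:])
        if candidate in known_places:
            what = tokens[:start]
            if what and what[-1] in CONNECTORS:
                what = what[:-1]
            return " ".join(what), candidate
    return " ".join(tokens), None
```

For example, with `{"raleigh", "pennsylvania", "new york"}` as the known places, "Starbucks in Raleigh" splits into a subject and a location, while "1600 Pennsylvania Avenue" is left whole. "1600 Pennsylvania" still resolves to the state, which is exactly the ambiguity raised in the question, and where query logs or trying several interpretations would have to help.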
- Grant Ingersoll:
- And I would imagine some query log analysis would also be helpful so that you could actually feedback and/or tune your system based on how your users are querying?
- Ryan McKinley:
- Right, yeah. I mean I’ve done tricks in the past that weren’t specifically around parsing out locations, which locations are slightly different because they can be expressed in a few ways as you mentioned. The general approach is to try a few things, see what matches best and offer those back as solutions.
- Grant Ingersoll:
- Right. So what other major concerns are there? So if I were adding local search right now what kind of obstacles and hurdles would I have to get over? Is it my data? Is it understanding that? Is it the query parsing?
- Ryan McKinley:
- I mean the biggest one to start with now is just understanding what exactly you would need, so the general purpose what does adding local search mean? Is – are you adding points to your documents and want those to show up as maps? Are you searching for regions? What kind of – how complex is the query that you’re gonna need? Is it always going to be, “Show me the points within a radius to something?” The range of what you start thinking about you need with the geographic search is extreme, are you actually looking for things given two polygons, give me the stuff that intersects. Show me everything that is excluded from these polygons. So first, it’s just figuring out what your problem is, sort of off the shelf solutions for Lucene now are still pretty limited. I hope to see those grow and I hope to see some better integration in the future but right now we’re pretty limited to points within a radius.
- Ryan McKinley:
- So actually for a project I’m working on I have some different needs where this limitation of doing a – not being able to represent regions has come up. So I’m working on a project that is essentially a crawler for geographic data. It focuses on – ESRI is a large software vendor for geographic tools, software tools, and we wrote a crawler that crawls through all of your existing data and indexes as it goes along. So it opens up all your data sources, finds all the statistics about them, builds thumbnails, indexes those, but every dataset, every map, every layer within a map, it all represents a region, and so we index that region, and for that rather than using Local Lucene I’m building a special or custom R-tree implementation which renders just a pretty standard GIS data format, it’s actually beyond GIS but – and then that gets read and currently what I’m doing is that gets read into memory every time the Lucene index starts up and I have a little callback on the – I’m blanking on where it is but - there’s a callback in Solr every time you – the new Solr searcher warms, and I build this in memory R-tree, and then that structure, you can ask given a region what falls, what other items fall within that region. So that’s how I’m implementing that now. Its drawbacks are just that it doesn’t scale yet as it’s a purely in memory solution but it seems to do the trick.
- Grant Ingersoll:
- So is that implemented then as a – for searching that, is that then a query component?
- Ryan McKinley:
- That is a query component on a standard Solr query. So it’s a query component that adds a filter, a special filter to the query.
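The shape of that arrangement, a region lookup built in memory and applied as a filter on ordinary text results, might look roughly like this in Python. This is not Solr code: `RegionFilter` and `filtered_search` are hypothetical names, and a linear scan stands in for the R-tree.

```python
from typing import Dict, Iterable, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


def intersects(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class RegionFilter:
    """In-memory region lookup; assumed to be rebuilt whenever the index is reopened."""

    def __init__(self, regions: Dict[str, Box]) -> None:
        self._regions = dict(regions)

    def matching_ids(self, query_box: Box) -> Set[str]:
        # A real implementation would walk an R-tree instead of scanning everything.
        return {doc_id for doc_id, box in self._regions.items()
                if intersects(box, query_box)}


def filtered_search(text_hits: Iterable[str], region_filter: RegionFilter,
                    query_box: Box) -> List[str]:
    """Keep the text ranking, drop anything outside the requested region."""
    allowed = region_filter.matching_ids(query_box)
    return [doc_id for doc_id in text_hits if doc_id in allowed]
```

Because the filter only removes documents, the text side (ranking, faceting, sorting) is unaffected by how the region test is implemented.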
- Grant Ingersoll:
- So this is all part of your Voyager GIS solution?
- Ryan McKinley:
- Yeah, this is – this product is Voyager GIS and it will be most interesting to GIS professionals who work with lots of GIS data.
- Grant Ingersoll:
- So these are people who have specific file formats in the GIS domain that are sitting on your desktop or on your server and this tool goes through and harnesses them all.
- Ryan McKinley:
- Correct. It’s – think of it if – it’s like Google Desktop if they were sort of focusing on GIS professionals where the types of formats that they knew how to read well were datasets representing geographic regions. So we kind of have tuned this search engine on our crawler to support all those types of formats.
- Grant Ingersoll:
- So it’s not something that – you know so me being a non-GIS person, it’s not something I would just point at like my hard drive and have it find things like addresses or locations in any type of content that I have. It’s specifically for GIS data?
- Ryan McKinley:
- Correct. It’s – it is for – the output of it are types of data that you could add into the ESRI software. So for example if on your hard drive – I don’t know what you have on it but if you do happen to have census tract data, it will find that and index that, and show you – sort of give you nice pointers for how to add that into the ESRI software. What’s actually nice about our product too is the other thing it does is they have this concept of maps, so a map is a collection of layers. So each layer represents various things, so it could be road networks, it could be lakes, it could be political boundaries. Any of those things get represented as layers. All of those at some level are just data sets, so something akin to an Excel spreadsheet that’s just points, but to make a map useful you have to put a lot of work into it to sort of add the symbology to say, okay, if it’s a capital city I want to draw it with this type of icon, and it needs to be labeled in this way. There are no existing good tools to sort of use that from project to project, so people tend to reinvent that labor every time they do it. So what our product also does is it cracks open all of your map documents, it cracks open everything and looks at how you’ve symbolized things, and so for a given data set I can see how have people used it across my whole computer? If you’re on a network, you know, across my whole network how have other people used this? So we’re hoping that kind of lowers the barrier for getting a lot of this data used because we find so often that people have lots of good data but don’t know they have it, and don’t know how to use it.
- Grant Ingersoll:
- Right.
- Ryan McKinley:
- So this just makes that process a lot easier.
- Grant Ingersoll:
- So let’s take a step back and have you put on kind of your – Apache hat. So you’ve talked a little bit that you’ve done other things with Solr. Kind of can you explain some of these other applications you’ve done with Solr as well?
- Ryan McKinley:
- So the applications I’ve done with Solr, I guess where to start is the ones that are probably the biggest are there’s a site called instructables.com that was the first project that I worked on that we were doing a similar faceting type of strategy. So that site is a Wikipedia of how to make things, so it’s people show off what they make, how they make it and post instructions, and share their skills, and as part of that as people put things in, things are classified and categorized, and to navigate through to help you find what you want we include something akin to faceting. That I had originally written as SQL queries, which if anyone has had that experience before you know things kind of quickly fall over once you hit scale. Our machine resources were going through the roof, the response times were kind of terrible, everything ended up having to be cached for much longer than you would rather really want it to be cached, and then that was where I started working on Solr and with Solr we quickly replaced what was a beast of a system, always running at I think it was close to like 15 on the load and replaced that with Solr and it was just sitting along quite happy. So Solr saved our ass there and then since then I’ve just worked on it with a variety of other projects including some through Boston Public Library.
I’ve worked on – they have a mapping collection there that was – those are all historic maps, so they’re just images, no geographic regions, but use that again to index all the standard bibliographical data about these, their map collection, and other projects with – there was actually the whole Massachusetts library system that – crawlers that check each of their local repositories and index those in standard ways. But with all of those things, the flexibility of Solr and the off the shelf faceting and display qualities that it gives you just make those very easy projects.
- Grant Ingersoll:
- With your committer hat on what do you see in the future for Solr? I mean what’s your kind of vision for where Solr is headed?
- Ryan McKinley:
- Well in general the biggest thing, I guess the ways I’ve pushed Solr which might be slightly different than some of the other committers, I guess the types of things I’m working on, in particular if I look at this Voyager project, it is both Google Desktop and at some level, I won’t say Google because it’s not but you know it’s looking at something that – trying to use the same software that I run locally on my computer and the same – I mean not the exact same set-up but something similar to run a larger let’s say enterprise level search, so something that could scale have more distribution. So the way I’ve been pushing the things is I guess a lot for having clients and configuration that can be shared across those two domains. One of the biggest things I see is a need to clean up and re-factor and just get our configuration under control. Right now it works. It works great, so I hate to poo-poo it but it needs some work so that those – that we have more flexibility in how things are deployed and how things are packaged.
- Grant Ingersoll:
- Like the XML-based configuration isn’t necessarily for everyone.
- Ryan McKinley:
- Isn’t necessarily for everyone. It’s got limits if you are trying to do some run-time configurations, and I think there’s definitely standard ways that we see that we can use existing projects to leverage XML configuration. It’s not something that we should be focusing on. In general, it’s a solved problem. So –
- Grant Ingersoll:
- I think more and more people are looking to use Solr in an embedded fashion –
- Ryan McKinley:
- Right.
- Grant Ingersoll:
- -- because Solr in a lot of ways is the Lucene best practices, so my experience with Lucene was I ended up building something that was very Solr-like, so when Solr came around I said, “Oh well, I can just use Solr and then that way I get all the benefits of the community and everything like that.”
- Ryan McKinley:
- Right.
- Grant Ingersoll:
- And then now with the ability to use it in embedded mode that then furthers the adoption I think and saves people a lot of time when it comes to needing something like Lucene in their system.
- Ryan McKinley:
- Yeah, I agree. I mean to me its biggest contribution is just that it has all of the details of Lucene worked out so that when I work with Lucene I’m just – when I work with Solr I’m just dealing with the search and kind of highly-functional part, like more towards the user-interface side whereas with Lucene I still have to worry about when did I open which Searcher, or which threads could be deadlocked somewhere. And with Solr I’m allowed to like – I let other people kind of – and myself included, I re-focus on those as separate problems and hopefully that gets solved once in a while. And then the other thing I’ll come back to is there’s – I’m hoping to get some of the geographic search ideas back into a core Apache contrib. There’s some movement to move what’s now run at Sourceforge into Apache for the Local Lucene just because it would have an easier distribution and wider audience, and with that I think that could make a good space for hopefully some experiments in trying various indexing schemes. The approach that Patrick is taking is a great one, which is this Cartesian Tier System. I think something along an R-tree lines or perhaps not even limited to that but trying to integrate directly with an open GIS framework so that you could have any GIS backing, and have Lucene filters or query components hit those and meld those two results. So there’s plenty of work there. The other approach that I think we should also look at is something that people using, trying to do simple geographic search with BigTable are using, and it’s a Geohash solution. It’s a method for encoding latitude and longitude into a single string, and the prefix of that string, so the first character of it essentially works as a bounding box, so the further down the string you walk, the more precision you get.
So there’s some tricks with that that make it pretty nice because you – you have your latitude and longitude encoded as a single field and range queries become bounding box queries but it’s a pretty neat trick. It was pushed a bit because BigTable only lets you sort on one field and so you don’t get multiple field searches. You couldn’t do a latitude greater than 80 and longitude less than 20, you know.
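For readers who have not seen a geohash, here is a short Python sketch of the standard encoding (interleaved longitude and latitude bits, five bits per base-32 character), plus a helper showing how a prefix becomes a single-field range. The function names are illustrative, and the range helper assumes hashes are compared as plain ASCII strings.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=9):
    """Encode a point as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def prefix_range(prefix):
    """Half-open string range [lo, hi) covering every geohash that starts with prefix."""
    return prefix, prefix + "~"  # "~" sorts after every character in BASE32
```

Every hash inside the cell named by a prefix falls in that one string range, which is how a bounding-box lookup can be expressed as a range query on a single field. Points that are close but sit on opposite sides of a cell edge can have very different prefixes, so neighboring cells usually have to be queried as well.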
- Grant Ingersoll:
- And just to fill in a little bit of background, BigTable is Google’s map reduce distributed large-scale database-like thing? (Laughing.)
- Ryan McKinley:
- (Laughing.) Yeah, it’s – I mean it’s essentially their massive – it’s a massive table, so you could in some ways think of it as a SQL table other than all of the constraints that you have on SQL which is like, “Okay, I shouldn’t have a million columns in my SQL table.” That’s actually the best practice in BigTable. In some ways similar to how Lucene gives you a document where you – you can have as many fields as you want in that and that adding fields and removing fields is not a -- there are marginal performance problems with adding too many fields. It’s – you know – it’s fine to represent two items in different domains within the same Lucene index across distinct columns or distinct field names similar to BigTable, and BigTable lets you do that across -- you know they had put all their scaling behind it. One of their interesting constraints is that they actually timeout automatically if your query does not return, I think it’s within two hundredths of a second or something like that. What’s actually nice about them is they force you to – everything you do has to be quick and you can’t even develop it if it’s not quick, so you don’t get to do the – “Oh, I’ll just make it work and then I’ll optimize it later.”
- Grant Ingersoll:
- Right. And just to put our Apache spin on it, BigTable actually has an implementation in Hadoop as – what do they call it, HBase?
- Ryan McKinley:
- HBase, yeah.
- Grant Ingersoll:
- HBase, so there is a – for those looking for a free version there is Apache Hadoop has an implementation called “HBase.” ... So this is for people who really want to do really large-scale GIS-type applications?
- Ryan McKinley:
- Yeah, probably not yet GIS-type because there are severe limitations to it as well, as I mentioned of like – you know you can sort on one field at a time. I mean the thing that it’s best for is if you look at the Google series of things it would be -- you know – representing all documents in like if you were writing in Google Docs and like I need to store X-million people’s documents, where do I – how can I do that? So everything still comes down to a simple search across a couple of fields. – I haven’t looked into it too much so I don’t want to speak too authoritatively but it’s not where if you were trying to do complex calculations you shouldn’t be looking at that.
- Grant Ingersoll:
- So I think that’s it for my questions. Did you have anything you feel that interests you left that you would like to talk about?
- Ryan McKinley:
- Oh, there’s always more but -- (Laughing.)
- Grant Ingersoll:
- (Laughing.)
- Ryan McKinley:
- And we’ll still be – we’ll still be around, so that’s what the mailing lists are for and – yeah.
- Grant Ingersoll:
- Great. I’d like to thank again, Ryan McKinley, for taking time to speak with us about Solr and its use of – his use of it in various GIS applications. Thank you again, Ryan.
- Ryan McKinley:
- Great. Thanks, Grant.
[INTRO SEGUE TO SHORT FOLLOW UP INTERVIEW]
- Grant Ingersoll:
- Today I'm following up with Apache Solr Committer Ryan McKinley, and getting an update on some recent events and news that Ryan can easily relate to. Welcome back, Ryan.
- Ryan McKinley:
- Hi Grant.
- Grant Ingersoll:
- Why don't we start off by you telling us your recent news in Lucene?
- Ryan McKinley:
- Lucene recently -- the spatial contrib is a reality now. When we talked earlier, we discussed how this may happen, but there is now officially a spatial contrib in Lucene. I'm a committer for that and Patrick is also a new committer for that. So, there is a lot of momentum gaining; the things we discussed as possibilities, including the geohash implementation, are now built into this spatial contrib, so there is a lot of progress and momentum.
- Grant Ingersoll:
- And, Patrick, just to clarify, is Patrick O'Leary, right?
- Ryan McKinley:
- Yes, Patrick O'Leary, who, sort of wrote the original code for Local Lucene and who has spent a lot of time working on geographic search and Lucene.
- Grant Ingersoll:
- Great, so that's all committed now, right?
- Ryan McKinley:
- Yep.
- Grant Ingersoll:
- If I go and get the latest trunk version of Lucene, I should be able to try out all of the Local Lucene stuff, right?
- Ryan McKinley:
- Correct. The Solr version will still require some work, because there are some integration issues, but the Local Lucene side is checkout trunk and the test cases show you what it does.
- Grant Ingersoll:
- Thanks again for the update.
Here are some helpful resources related to Ryan's interview:
- Voyager GIS -- Ryan's company specializing in GIS search
- instructables.com -- powered by Solr
- Wikipedia article on Bounding Box
- Wikipedia article on Geohash
