
Interview with Patrick O'Leary


Search local! Listen to Grant Ingersoll as he speaks with Patrick O'Leary, creator of Local Lucene and Local Solr, and learn more about implementing geospatial search.

Transcript

Interviewer:

Today I'm speaking with Lucene committer Patrick O'Leary. Patrick is the creator of Local Lucene and Local Solr, which are both plug-ins that make geospatial search easier in Lucene and Solr. Welcome, Patrick.

Patrick O'Leary:

Thank you, Grant.

Interviewer:

So, Patrick, could you start off by introducing yourself and telling us a little bit about your background before you got involved in Lucene and Solr?

Patrick O'Leary:

Sure, so I started out going to college in Ireland, at University College Dublin. From there, I moved to AOL and began working in a publishing group. I worked for AOL in Dublin, moved to London and finally moved to the U.S. with them. I ran projects as small as AOL UK publishing. I became a Netscape engineer for a while and moved on to bigger things such as AOL.com publishing.

And I resided there for about, maybe, two years. I've had a background in engineering, obviously, and also in operations. We've in the past worked on quite a few open source projects in AOL, and one of the components I worked on was something called AOLserver.

That gave me a good background in the open source community and also in areas such as Apache and SourceForge. From there I made the transition to AOL Search.

Interviewer:

Great. That sounds really interesting. So, once you were in the search group, I presume you then became involved with Lucene and Solr. Can you talk a little bit about that?

Patrick O'Leary:

Sure, so I had worked in search for maybe a year or so before we made the decision to move towards Lucene and Solr. Up till then we had been using some third party software and some internal software. And we just weren't particularly happy with what we had, so we began to look outside of our usual solutions. And Lucene was the prime candidate.

We played with Lucene and also Solr, as well. And eventually found that Solr was definitely the component that met our needs. Lucene was very, very powerful and provided a lot of the tools that we'd had to spend a lot of time in the past building for ourselves. But it also provided a framework that allowed us to extend it, to implement internal methods that we had used that were yet to be made public.

So it really met a lot of our key needs.

Interviewer:

So you mentioned you were on a previous platform. What were the pain points that you were seeing with that one that were then solved by Lucene and Solr?

Patrick O'Leary:

So there were two main platforms that we had to use. One was an internal proprietary platform, and the biggest pain point there was actually retaining internal knowledge. As in any company, people have a tendency to move on, and maintaining the domain-specific knowledge about that platform was quite difficult. The second platform that we were using was a third party platform.

And we ran into several issues there. The first issue was essentially the growth of that platform. Because it was third party, we could not control its road map, and we were vying with other customers, so it was difficult for us to come along and say, here are our primary needs that we need to have met.

And get those done in a reasonable amount of time. The second problem that we faced was simply knowledge transfer. As it was third party, it was obviously proprietary, and we just could never see the internal components and look at ways to extend them.

No software is ever going to fully meet your needs, and you often want to be able to take it, rework it and make it meet your needs, but when you're using third party software, that's not always possible. So we were kind of hitting brick walls all the time. And that really drove our decision to look at open source again.

Interviewer:

So basically the transparency of Lucene and almost replacing your internal and your third party were pretty key points then.

Patrick O'Leary:

Yes, definitely. We definitely realized that everything had a life cycle and the life cycle depended on how many people were actively working on this and whether or not those activities matched your own internal road maps. And when you're using stuff that you built in house, you can maintain it as long as you maintain the full complement of staff to do that. But that's quite difficult.

If you're using third party software, trying to get your priorities into the top of the list along with a list of other customers is also quite difficult.

Interviewer:

Right, even if you are AOL.

Patrick O'Leary:

Even if you are AOL, yes.

Interviewer:

All right, so then, what kind of sites were you using Lucene and Solr on at this point?

Patrick O'Leary:

Um, so we started off using Lucene and Solr on a lot of AOL internal sites. A lot of individual channels had Lucene, or were powered by Solr really. And the search box on most of the AOL channels goes to something that's Lucene driven. For the group that I worked in, we mainly powered local-based technologies, the AOL local properties.

These were such things as AOL Yellow Pages, AOL City Guide and City's Best, so they were quite highly trafficked properties.

Interviewer:

And so now those are all powered by Solr at this point? Right?

Patrick O'Leary:

Yes, they are.

Interviewer:

And then, since you're the creator of Local Lucene and Local Solr, which are these geospatial-based components for Lucene and Solr, I imagine that came out of your work at AOL as well. Can you introduce us to that and tell us about how those work?

Patrick O'Leary:

Sure, so Local Lucene was the original component and I extended it to create a handler that fit in with Solr.

Solr provided a very nice little framework that allowed me to do that extension. The Local Lucene component has two main elements in it. It has an indexing component that can take, basically, the latitudes and longitudes of multiple documents and just transpose them onto a searchable grid. And it has a search component, or a filter, that will take a geographically bounded search, so point X and everything within X number of miles of point X.

I should say Y number of miles of point X.

Interviewer:

Right.

Patrick O'Leary:

And it will find the documents that are contained within that search radius by simply looking them up with a multistep component. The multistep component will first of all dig into the grid and find the best grids, or the best sized grid, to look at. It will pull out the potential candidates or results from that grid and then do a final radius filter of all those documents to make sure that they actually fit into the radius that you searched, and then it will return those document IDs.
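
The grid-then-radius lookup described above can be sketched roughly as follows. This is an illustration only, not the Local Lucene code: the names `GridIndex`, `cell_of` and `CELL_DEGREES` are hypothetical, a single fixed cell size stands in for the "best sized grid" selection, and the sketch ignores the poles and the antimeridian.

```python
import math
from collections import defaultdict

EARTH_RADIUS_MILES = 3958.8
CELL_DEGREES = 0.5  # hypothetical fixed cell size; a real grid would pick a level per query


def cell_of(lat, lon):
    """Map a coordinate to the (row, col) of the grid cell that contains it."""
    return (math.floor(lat / CELL_DEGREES), math.floor(lon / CELL_DEGREES))


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


class GridIndex:
    def __init__(self):
        self.cells = defaultdict(list)

    def add(self, doc_id, lat, lon):
        # Indexing step: transpose the document's coordinates onto the grid.
        self.cells[cell_of(lat, lon)].append((doc_id, lat, lon))

    def search(self, lat, lon, radius_miles):
        # Step 1: find the block of grid cells that covers the search circle.
        angular = radius_miles / EARTH_RADIUS_MILES
        dlat = math.degrees(angular)
        ratio = math.sin(angular) / max(math.cos(math.radians(lat)), 1e-9)
        dlon = math.degrees(math.asin(min(1.0, ratio)))
        row_lo, col_lo = cell_of(lat - dlat, lon - dlon)
        row_hi, col_hi = cell_of(lat + dlat, lon + dlon)

        hits = []
        for row in range(row_lo, row_hi + 1):
            for col in range(col_lo, col_hi + 1):
                # Step 2: pull the candidate documents out of each cell.
                for doc_id, doc_lat, doc_lon in self.cells.get((row, col), ()):
                    # Step 3: final radius filter on the exact distance.
                    distance = haversine_miles(lat, lon, doc_lat, doc_lon)
                    if distance <= radius_miles:
                        hits.append((doc_id, distance))
        hits.sort(key=lambda hit: hit[1])
        return hits
```

The coarse cell lookup keeps the exact distance calculation limited to a small candidate set, which is the point of the multistep approach.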

Interviewer:

And those documents can also then match on other things; this is just a filter, in other words. You can match on keywords and all of those things too.

Patrick O'Leary:

Yes, exactly. It simply sits alongside the regular search itself, so you can do an additional search, which can be text based, and you can do additional filters as well, so it simply complements the actual search itself.

Interviewer:

Ah, gotcha. So, is there also a component that factors in the distance as part of the relevance score too, or is that a separate kind of thing that you would have to handle on your own?

Patrick O'Leary:

It's a separate kind of thing, but you can actually do it within Lucene itself using a custom scorer. I've actually included the code to do that in one of the test cases within Local Lucene, and it's just simply a custom scorer that takes all the distances and boosts the score any way you particularly want.
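
As a rough illustration of boosting a score by distance, the sketch below blends a text relevance score with a proximity factor. It is not the custom scorer from the Local Lucene test case; the function names and the linear boost formula are assumptions chosen for clarity.

```python
def distance_boost(text_score, distance_miles, radius_miles, weight=1.0):
    """Scale a text score by proximity (hypothetical linear formula).

    A document at the search point gets the full boost; a document on the
    edge of the radius keeps its original text score.
    """
    closeness = max(0.0, 1.0 - distance_miles / radius_miles)
    return text_score * (1.0 + weight * closeness)


def rerank(hits, radius_miles, weight=1.0):
    """Re-sort (doc_id, text_score, distance_miles) tuples by boosted score."""
    return sorted(
        hits,
        key=lambda hit: distance_boost(hit[1], hit[2], radius_miles, weight),
        reverse=True,
    )
```

Because the boost is applied after the search and filter steps, it only needs the distances that the radius filter has already computed.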

Interviewer:

Ah, I see. So, could you use a function query too? I mean, both Lucene and Solr have this notion of a function query. Would that work too, or is that what you mean by custom scorer?

Patrick O'Leary:

Um, I think anything that can be applied to the document post search or post filter can probably be used as a boost factor.

Interviewer:

Oh, okay, then that should work. So now can you tell us a little bit about these properties at AOL, like what was your kind of general architecture and performance and things like that.

Patrick O'Leary:

Sure, so there were a couple of key components in terms of the architecture. We needed to build something that had a smaller footprint than what we were currently using. We had quite a large system, but we also needed something that was extendable and could last beyond the initial estimate of document sets that we had.

So the key component was to build something that was horizontally scalable, rather than just vertically scalable. For that, we did a good bit of work on the horizontal scaling component, the sharding component within Solr. So the existing system works as what we term clusters. We have five boxes per cluster, and every time a request comes in, it gets distributed amongst those five boxes, and the results get condensed into one set and returned to the user.
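
The request flow described above (fan a query out to every box in a cluster, then condense the partial results into one list) can be sketched as below. This is a simplified stand-in, not Solr's distributed search: each "box" is an in-memory dictionary, and `search_shard` and `distributed_search` are hypothetical names.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query, limit):
    """Stand-in for a network call to one box.

    `shard` maps doc_id -> (text, score); returns the best (score, doc_id) pairs.
    """
    matches = [(score, doc_id) for doc_id, (text, score) in shard.items() if query in text]
    return heapq.nlargest(limit, matches)


def distributed_search(shards, query, limit=10):
    """Send the query to every shard in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, query, limit), shards))
    # Condense the per-shard lists into a single top-N result set.
    merged = heapq.nlargest(limit, (hit for part in partials for hit in part))
    return [doc_id for _score, doc_id in merged]
```

Each shard only has to return its own top `limit` hits, since no document outside a shard's top `limit` can appear in the merged top `limit`.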

Interviewer:

So is this using your own internal distributed search or were you actually using Solr? I imagine you guys started with Solr pre...

Patrick O'Leary:

We started with Solr, and one of the key goals of what we were doing was to keep as much of this as open source as possible. We don't really like the concept of secret sauce. We try to make it as open and as reusable as possible, the belief being that what we achieve by open sourcing it is providing a life span to the actual technology itself. If we keep anything internal, it usually doesn't last too long. It usually gets tied to only a handful of people who either know the technologies or who can support the technologies, and we found in the past that that's not really fully reliable.

But by open sourcing we actually do extend the lifecycle of our technologies.

Interviewer:

Well, and you probably see improvements that you wouldn't otherwise see as well, right?

Patrick O'Leary:

Exactly, we reap a lot of benefits by open sourcing it. I've even spoken to folks recently, in the past couple of weeks, who have pointed out bugs in the software that we've done, which is kind of unusual. It's been out there for quite a while. There was one person who pointed out a one line bug.

Interviewer:

Oh wow.

Patrick O'Leary:

Yeah, and it's kind of unusual, but it's of great benefit to have that come back in. It was more of a utilization bug, and changing it actually improved performance a little bit, but everything counts. It definitely helps a lot.

Interviewer:

Right, great. So can you talk a little bit about your performance as well?

Patrick O'Leary:

Sure, the data set that we use is up in the region of about 16 million documents. For performance, we have a metric that we absolutely must achieve, which is 100 milliseconds. And we are able to do geographically based searching and return results within 100 milliseconds. It averages around about 40 milliseconds. And the number of results that we get back averages, I think, around about 3,000 to 4,000 results. So, we're able to do a search across five boxes within, usually, about a 25 mile radius and get back 3,000 to 4,000 results. And it can go a lot higher.

The radius itself can expand this. We generally block it at around about 100 miles. But we do end up searching quite heavy areas, somewhere as dense as Manhattan Island, where there could be anywhere up to maybe, oh, 300,000 to 400,000 results within a 25 mile radius.

Interviewer:

Gotcha. So for your document set, do you distribute the entries across the shards or do you like partition your shards based on geography?

Patrick O'Leary:

No, we distribute it randomly. This allows us to get a much more even distribution and allows us to get better CPU utilization. If you partition it according to geography, you will always have one area that is a lot hotter than other areas.

And if you look at somewhere like the United States, where you've got three time zones, that area that's hot can move across your partitions. So somewhere that's receiving a lot of traffic at maybe 6:00 Eastern Time, such as the East Coast, will suddenly end up getting absolutely no utilization anywhere from 9:00 to 12:00 at night as the busy period moves across the country.
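
One simple way to get an even, geography-independent spread of documents over shards is to hash each document ID. The interview only says the documents are distributed randomly, so the hashing scheme, the `shard_for` name and the five-shard default below are assumptions for illustration.

```python
import hashlib

NUM_SHARDS = 5  # mirrors the five boxes per cluster mentioned in the interview


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard from a hash of the document ID, ignoring its location."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because neighbouring listings land on different shards, a traffic spike in one city is shared by every box instead of overloading the one that owns that region.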

Interviewer:

Hmm, so does that give you any flexibility in terms of being able to add capacity to certain areas or is it just not even worth the hassle?

Patrick O'Leary:

It's really not worth the hassle. The only time it would really make a huge difference is if you are doing something that's continental based, where you can possibly overlay different continents. So, if you are using data from Canada versus Australia, you can basically put all those documents onto one partition and it should roll over nicely; there should be an even overlap.

But if your data set is all based within one continent, then you get a lot better performance by simply randomizing the distribution of data.

Interviewer:

Hmm, okay, yeah, that makes sense. So, I know you have worked on some other projects with Lucene. Can you talk about those?

Patrick O'Leary:

Sure, so one of the projects that I've recently been doing is a keyword extraction component that's embedded into a real time taxonomy system. Essentially, I've got a system that goes out and crawls a whole bunch of different websites. Do you mind if I take care of my dog for a second?

Interviewer:

Yeah, I can ask again.

Patrick O'Leary:

Sure. Sorry about that.

Interviewer:

No problem. All right, so I'll have them cut this part, but I'll ask again, so starting now, so I know you've worked on some other Lucene related problems, can you tell us about those?

Patrick O'Leary:

Sure, so one of the components I've been working on is a real time taxonomy system. It's essentially going out and extracting keywords from multiple sites, mainly news sites at the moment, and attempting to find trends and basically tag and create a taxonomy based upon the categorization of the websites it's looking at.

From there, I'm able to extract, essentially, breaking news or hot topics as they occur. And one of the key elements about it is that it's completely independent of domain knowledge of the website itself.

It can go and look at a whole range of different websites and just examine the content, and there's no entity extraction. There's no part-of-speech tagging required. It just really looks at, essentially, a frequency distribution of the data that's on the website versus a much larger corpus of data.

The major benefit of this is that it's really good at picking out information that hasn't been seen before or is just seldom seen. And so it's very good at picking out breaking components or breaking news components.
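
The idea of comparing a page's term frequencies against a much larger background corpus can be sketched as below. The system described in the interview is not public, so the scoring formula (page probability times the log ratio against a smoothed background probability) and the names `tokenize` and `unusual_terms` are assumptions, not the speaker's method.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def unusual_terms(page_text, background_counts, top_n=5):
    """Rank page terms by how much more frequent they are than in the background."""
    counts = Counter(tokenize(page_text))
    total = sum(counts.values())
    background_total = sum(background_counts.values())
    vocab = len(background_counts) + 1  # add-one smoothing, with room for unseen terms

    scored = []
    for term, count in counts.items():
        p_page = count / total
        p_background = (background_counts.get(term, 0) + 1) / (background_total + vocab)
        # Terms that are common on the page but rare in the background score highest.
        scored.append((p_page * math.log(p_page / p_background), term))
    scored.sort(reverse=True)
    return [term for _score, term in scored[:top_n]]
```

No training labels, entity lists or part-of-speech tags are involved; the only inputs are the page text and the background counts, which matches the unsupervised, domain-independent approach described above.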

Interviewer:

So do you have to train this ahead of time or is this an unsupervised approach?

Patrick O'Leary:

It's totally unsupervised.

Interviewer:

Oh, neat. And is this in production somewhere? Or is it internal?

Patrick O'Leary:

It's still internal to me. I have it out there and viewable, so it's kind of discoverable, but I'm keeping it quiet for the moment.

Interviewer:

Okay. All right, I'll leave it at that. Great. No, and so no, that sounds really interesting, I know like my background with Mahout and machine learning is we're always looking for new and novel ideas for machine learning approaches so that sounds quite interesting.

So, when it is viewable, maybe we can discuss some more, at that point.

Patrick O'Leary:

Sure, definitely.

Interviewer:

So going back to the geospatial search, you know, if I'm just getting started on this, can you point me at some good places where I can go to learn some more? I know you've got a website and some other things like that, so maybe you can give a few pointers to our listeners.

Patrick O'Leary:

Sure, so I put together a little website, mainly behind Local Lucene, called GISsearch.com. And that contains a few articles on how to begin GIS searching, or geographical information searching, and it will show you how to do things like geocoding, which takes an address and transforms it into a latitude and longitude that you can put on a map. There are other documentation sites out there that you can find by simply doing a search for Local Lucene.

It brings up quite a few sites. That was one of the very interesting things about Local Lucene and Local Solr: quite a few people use it and have actually contributed a lot to its documentation.

Interviewer:

Great. So what's the future of Local Lucene and Local Solr? Where do you see it heading?

Patrick O'Leary:

So, there's quite a community of people asking for new features. One of the biggest features everybody wants is polygon searching, being able to extend it to handle arbitrary shapes. And also, to start looking at more standardized GIS components. So there's a group out there, called the Open GIS Consortium, who are working towards a single standard for GIS and shape files and descriptions of poly features.

And they are publishing a lot of standards, so I'm going to start looking at them and seeing if I can incorporate those. The grid model that Local Lucene has is capable of supporting it; it just requires the additional work to make it happen.

Interviewer:

All right. As always, is the fun of open source, right?

Patrick O'Leary:

Yeah.

Interviewer:

Finding enough time to do all the things that you want to do.

Patrick O'Leary:

Exactly.

Interviewer:

I can relate to that. So, no that sounds great. Do you, I guess that's about all I have. I guess there is one remaining question that I should ask and what's the name of your dog so we know.

Patrick O'Leary:

So my dog's name is Scooter. He's a little three-year-old Dachshund.

Interviewer:

Great, so I figure we wouldn't be able to edit out some of those specific dog barks, so we might as well introduce Scooter to the world. So.

Patrick O'Leary:

Yeah, he likes a little bit of attention every now and then.

Interviewer:

Yeah, dogs are that way. Well, thanks a lot, Patrick. I really enjoyed our conversation and I look forward to more improvements in Local Lucene and Local Solr.

Patrick O'Leary:

Yeah, it's great talking to you, Grant.

Interviewer:

Thank you.

Patrick O'Leary:

Take care.

Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2012 Lucid Imagination. All rights reserved.