
Grant Ingersoll Talks with Peter Keegan
Grant talks to Peter Keegan, principal software engineer at Monster.com, about how Monster uses Lucene to match jobs and jobseekers with search that spans nearly a million jobs alongside tens of millions of resumes, updating index and content every 15 minutes, plus using payloads with fields, and more.
Transcript
- Grant Ingersoll
-
Today I'm talking with Peter Keegan, who's a principal software engineer at Monster.com. Welcome, Peter.
- Peter Keegan
-
Good morning, Grant.
- Grant Ingersoll
-
Good to have you. Can you start off by introducing yourself and give us a little bit about your background?
- Peter Keegan
-
As you said, I'm a principal software engineer at Monster, and prior to that, I spent a number of years working at the AltaVista Company. You probably remember them.
- Grant Ingersoll
-
Yep.
- Peter Keegan
-
One of the first search Web sites to go online, I think, back in 1996 or so. Anyways, our group was in the software business division of AltaVista, which was responsible for building Enterprise search products based on the technologies that were used at the AltaVista.com Web site. And in particular, I was responsible for two major components. One was a C API that was used in an SDK library for Enterprise search, for companies that were writing their own search applications. Monster was one of those; I believe CNET was another one and quite a few others were writing their own search applications.
Others who didn't want to write their own, would use this other component. It was basically an out-of-the-box search application suite. I was responsible for the indexing and search servers in that part, but it included crawlers for database, Web, email, file systems, all of that good stuff, so you could basically configure it and get it up and running fairly quickly. It wasn't terribly customizable, but for many customers it worked just fine.
Then around 2003 I think it was, AltaVista basically was split up and sold and I think Yahoo! ended up with most of the intellectual property, and I think one of the other well-known Enterprise search companies also benefited from the sale.
And then after that for about a year or so, I did some consulting for some of the existing AltaVista customers who needed support or who maybe were thinking of migrating to something else, and that's where I learned about Lucene. I think it was in maybe its 1.2 release at that time, so I got to become familiar with Lucene there, and then I started at Monster about a year after that.
- Grant Ingersoll
-
So that sounds like a good segue. Can you tell us a little bit about Monster, what they do and how they approach search?
- Peter Keegan
-
So Monster is, if you've ever been to the site, is a job seeker's site. That's how you would see it if you went to Monster.com. It also is a site where recruiters can post their jobs and also search resumes to find matches for jobs that they have. There are lots of other things at the site. You can develop a career path based on job titles that you are interested in pursuing, things like that.
Monster had been using AltaVista when I started there and was in the process of evaluating alternatives, including building their own search engine, buying commercial search engines from other well-known places, or using Lucene, and that's kind of where I came in.
- Grant Ingersoll
-
So were you involved in the decision-making on Lucene, too?
- Peter Keegan
-
No, not at all.
- Grant Ingersoll
-
Oh, okay.
- Peter Keegan
-
No, basically what I was doing there was doing a lot of prototyping with Lucene, showing what it could do, trying to get some performance numbers to see if it was gonna be able to scale as well as AltaVista was, which was very good. And I think the breakthrough was when we tried it on a 64-bit platform, and we realized, wow, this thing could really scale to tens of millions of documents.
- Grant Ingersoll
-
Wow, yeah, that's pretty common to hear from people. Can you describe a little bit more about Monster's domain? I mean you have pretty interesting data that you're searching. You've gotta be able to find keywords, I imagine, about jobs, but you also probably have resumes and you have job descriptions. Can you describe how you model that in Lucene?
- Peter Keegan
-
Okay, so it's pretty straightforward. The two largest databases of information are the jobs that are being posted across various companies, some private, some public and in many, many countries, growing every day. And the other large repository is of resumes that people have submitted over the years. So essentially, a job maps to a document in Lucene, as does a resume.
- Grant Ingersoll
-
Okay, so but I mean resumes are usually quite long. Whereas job descriptions are pretty short. Does that play a factor in how you're doing search or do you just not worry about it?
- Peter Keegan
-
I guess where it matters is in how much storage you need. Obviously, when you reach a certain point with index size, when your response time gets too slow, then it's time to start distributing it and creating shards and things like that. We shard the index for resumes. We don't shard it yet for jobs, so that kinda gives you the idea. As you said, jobs are much smaller, and we have far more resumes than jobs so...
- Grant Ingersoll
-
Can you give us some ballpark idea of what kind of sizes you're talking about for jobs and resumes and kind of a picture of your architecture that way?
- Peter Keegan
-
Well, just the ballpark numbers, jobs, they vary from hundreds of thousands up to a million or so. Resumes is far more. We're talking tens of millions because people post them and leave them there for a long time. Jobs tend to expire after a period of time. They're fairly large but jobs is kind of moderate size; resumes is very large and growing all the time. Whereas, jobs tends to be more… jobs come and go, so it kind of moves around with the economy more than anything else.
- Grant Ingersoll
-
Yeah, so you have to deal with keeping that index up to date much more so than the resume side.
- Peter Keegan
-
Absolutely.
- Grant Ingersoll
-
On the resume side you probably just have a lot of adds, whereas on the job side I'm sure you have adds and updates, but you have deletes as well.
- Peter Keegan
-
Oh yes, the update model for jobs is much more dynamic. In fact, we basically perform updates every 15 minutes. We have a model that works quite well for that. It keeps the response time low.
- Grant Ingersoll
-
Great, so can you tell us a little bit more about kind of your high-level architecture of how you have this set up and maybe a little bit on what kind of hardware you're running on?
- Peter Keegan
-
Sure, yeah. We're basically running on commodity hardware, Dell hardware. They're fairly old right now, but they've got eight CPUs on them. We're running on a Windows 64-bit architecture with Java 6, and with the jobs index, we'd see during maybe low times of the day, 300 to 400 queries per second with an average response time of maybe 40 milliseconds.
- Grant Ingersoll
-
Nice.
- Peter Keegan
-
Which is pretty good, and we use replication across geographically separate data centers for high availability, and I think we're running eight servers now and typical CPU time is maybe ten percent, so we have plenty of room for growth there.
- Grant Ingersoll
-
Switching gears a little bit, I think you've mentioned to me in the past that, and just from dealing with you on the list, it seems you do a lot with payloads and maybe you have some customizations along those lines. Can you tell us a little bit about some of the things you've customized in Lucene?
- Peter Keegan
-
Sure, the reason for going with payloads is basically for the boosting term queries. We boost certain terms in the jobs index. For example, keywords that match terms in the job title get a big boost or are in the physical location get a boost. So the boosting term query was a natural for that.
- Grant Ingersoll
-
So if I see the word "Lucene" in the job title for instance, you might boost that by five or something like that, and then you have the boosting term query is actually – for those who aren't familiar. The boosting term query is a query that can look at position by position within the index and then take a byte array from it and use that as a scoring factor. So you would take the word "Lucene" in the index or in your title and you'd say, "Oh, Lucene is much more important than these other ones," and then you could boost matches on that word?
- Peter Keegan
-
That's right so maybe I should talk a little about how we structured the index. So we have a lot of fields – I don't know – 20 fields or so, job title being just one of them. And what we do is we index all these fields separately, but we also create a contents field, which I think is fairly common, which basically has all the other fields mixed in all into one big field. So that if you're not doing a field-specific search, you just search the contents field, and that's basically what most of our searches are against.
So when we put terms into the contents field, if we know that it's one of the words from a job title say, then we put a payload on that term.
- Grant Ingersoll
-
Gotcha.
- Peter Keegan
-
So we have an analyzer that is responsible for looking for semaphores that said ah, this is a term that needs to be boosted, so set the payload on it.
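To illustrate the idea Peter describes here, the following is a minimal sketch in plain Python, not Lucene code. It flattens all fields into one contents stream and attaches a payload to terms that came from the job title. The boost value `TITLE_BOOST`, the field names, and the scoring rule (summing payloads of matching positions) are assumptions made for illustration; Lucene's actual scoring is more involved.

```python
# Hypothetical boost factor for terms that came from the job title.
TITLE_BOOST = 5.0


def build_contents(fields):
    """Flatten all fields into one list of (term, payload) pairs.

    Terms from the 'title' field carry a boost payload; all other
    terms carry a neutral payload of 1.0.
    """
    contents = []
    for name, text in fields.items():
        payload = TITLE_BOOST if name == "title" else 1.0
        for term in text.lower().split():
            contents.append((term, payload))
    return contents


def score(contents, query_term):
    """Sum the payloads at every position where the query term occurs."""
    return sum(payload for term, payload in contents if term == query_term)


job_a = build_contents({"title": "Lucene Engineer",
                        "description": "build search systems"})
job_b = build_contents({"title": "Java Developer",
                        "description": "some lucene experience"})
```

With this toy data, a search for "lucene" scores `job_a` higher than `job_b`, because the match in `job_a` comes from the title and carries the larger payload. Only one field has to be searched, which is the simplification Peter goes on to describe.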
- Grant Ingersoll
-
And then at query time you use the boosting term query. I mean, so the boosting term query is a span query, which means it can take advantage of the positions. You found that that performs well enough in your environment, to have to go and look at specific things, specific payloads, position by position?
- Peter Keegan
-
Right, so it's definitely, we took a little performance hit there but the win was that it was a net no change because the way we had been doing boosting before that was by creating a separate boost field using standard Lucene field boosting.
- Grant Ingersoll
-
Right.
- Peter Keegan
-
But it complicated our queries because every keyword query had to search both the contents field and also this boost field, and so you end up having a lot more terms in the query. And then it got a little sticky when you had excluded terms.
- Grant Ingersoll
-
Yeah, right.
- Peter Keegan
-
Because the clause for contents says okay, well the job says, well that word's not in the job title but it was in the contents field so...
- Grant Ingersoll
-
Yeah.
- Peter Keegan
-
You can get around it, but it made the query parsing logic kind of get a little bit unwieldy, so the boosting term queries, which I guess you put in, Grant, really simplified that tremendously.
- Grant Ingersoll
-
Oh, great.
- Peter Keegan
-
The downside, of course, was that there's no query parser support in Lucene for the span queries and the boosting queries, and so we basically had to move away from Lucene's built-in query parser and write our own.
- Grant Ingersoll
-
Yeah, I think a lot of people tend to at least modify the query parser if not write their own.
- Peter Keegan
-
Yeah.
- Grant Ingersoll
-
Just because you may not like that syntax or you want your own syntax or something along those lines. So can you tell me a little bit about what you did for the query parser then?
- Peter Keegan
-
Yeah, so we had this need to do the boosting and also we had these custom queries that handled numeric range searches basically outside of Lucene, but we needed to kind of weave the custom queries into the regular queries to get the proper business logic, so we chose ANTLR. It's a freely available source parser, sort of along the lines of Lex and Yacc but much more advanced, and basically wrote a Boolean-style query parser using the ANTLR parser, mainly because we have to kind of support backwards compatibility with the previous query language which was basically a Boolean language. And Lucene, as you know, is not strictly a Boolean parser, but you can kind of make it look like one with the proper logic around it so...
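As a rough illustration of what a Boolean-style query parser does, here is a small hand-written recursive-descent parser in Python. It is not ANTLR and not Monster's grammar; the operator names (`AND`, `OR`, `NOT`), their precedence, and the tuple-based tree are assumptions chosen for the sketch. Error handling for malformed input is omitted.

```python
import re


def tokenize(query):
    """Split a query into parentheses and whitespace-separated words."""
    return re.findall(r"\(|\)|[^\s()]+", query)


class Parser:
    """Recursive-descent parser; precedence is NOT > AND > OR."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_or(self):
        node = self.parse_and()
        while self.peek() == "OR":
            self.take()
            node = ("OR", node, self.parse_and())
        return node

    def parse_and(self):
        node = self.parse_not()
        while self.peek() == "AND":
            self.take()
            node = ("AND", node, self.parse_not())
        return node

    def parse_not(self):
        if self.peek() == "NOT":
            self.take()
            return ("NOT", self.parse_not())
        return self.parse_atom()

    def parse_atom(self):
        tok = self.take()
        if tok == "(":
            node = self.parse_or()
            self.take()  # consume the closing ")"
            return node
        return ("TERM", tok.lower())


def parse(query):
    """Parse a Boolean query string into a nested tuple tree."""
    return Parser(tokenize(query)).parse_or()
```

A tree like this can then be walked to build the engine's own query objects, which is where custom clauses such as numeric ranges could be woven in alongside the keyword clauses.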
- Grant Ingersoll
-
Right. I also imagine searching for jobs tends to be a pretty local kind of approach, so you must have some geography capabilities in your search as well.
- Peter Keegan
-
Oh, for sure, yeah. I mean quite a few of the queries, if you actually go to the site and go to search for a job, you'll see a location field, and in there you'll see the first implementation of a Lucene index. This is not the jobs index, but this is a special locations index. If you start typing a city, it will do the auto-fill for you. Type B-O-S, and it'll pop up a bunch of cities that start with B-O-S, and that's actually going to a small location index using n-grams to find cities that begin with those letters. And it's actually going over a network hop, and the response is just incredible. It works very well.
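The prefix lookup described here can be sketched with leading ("edge") n-grams. This is a plain Python illustration rather than Lucene code; the city list, the `max_len` cutoff, and the function names are hypothetical.

```python
from collections import defaultdict


def edge_ngrams(word, min_len=1, max_len=10):
    """Return the leading n-grams of a word, e.g. 'b', 'bo', 'bos', ..."""
    word = word.lower()
    return [word[:n] for n in range(min_len, min(len(word), max_len) + 1)]


def build_index(cities):
    """Map every leading n-gram to the cities that start with it."""
    index = defaultdict(list)
    for city in cities:
        for gram in edge_ngrams(city):
            index[gram].append(city)
    return index


def suggest(index, typed, limit=5):
    """Look up what the user has typed so far with one dictionary access."""
    return index.get(typed.lower(), [])[:limit]


# Hypothetical sample data.
CITIES = ["Boston", "Boise", "Bossier City", "Austin"]
INDEX = build_index(CITIES)
```

Because every prefix is precomputed at index time, each keystroke costs a single exact-match lookup, which is one reason this kind of auto-fill can respond quickly even over a network hop.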
- Grant Ingersoll
-
So it's like an AJAX drop down.
- Peter Keegan
-
Yeah, exactly and it's going to a static index on a back end.
- Grant Ingersoll
-
So then once you have your city, are you doing like a bounding box approach, where you take the centroid of whatever that city is and then look within a certain radius or does it just do keyword matching?
- Peter Keegan
-
No, so this is actually a simple bounding box. I think others have used the same techniques, where you approximate a circle with a rectangle, okay, so basically you end up having to do two range queries between these two longitudes and between these two latitudes. Yeah, okay, we got a match and so...
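A minimal sketch of the bounding-box approximation, in plain Python. The Earth radius constant, the use of miles, and the function names are assumptions; the sketch also ignores the poles and the 180-degree meridian, where a real implementation would need extra care.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # approximate mean radius, an assumption


def bounding_box(lat, lon, radius_miles):
    """Approximate a circle around (lat, lon) with a lat/lon rectangle."""
    dlat = math.degrees(radius_miles / EARTH_RADIUS_MILES)
    # Lines of longitude converge away from the equator, so widen the
    # longitude span by 1 / cos(latitude).
    dlon = math.degrees(
        radius_miles / (EARTH_RADIUS_MILES * math.cos(math.radians(lat))))
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)


def in_box(box, lat, lon):
    """Two range checks: one on latitude, one on longitude."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
```

The rectangle slightly over-matches at the corners compared with a true circle, which is the trade-off for reducing the test to two range comparisons.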
- Grant Ingersoll
-
And you use that as a filter to find only those documents within the bounding box basically.
- Peter Keegan
-
Right, right, so, but the thing is, unlike a Lucene filter, where you can reuse them, radius search is like this: every query can be different. So you can't really cache these things.
- Grant Ingersoll
-
Hmm.
- Peter Keegan
-
So you gotta kinda do the computation each time, and that's where we realized we might have some performance issues way back when, in looking at the way the old early version of Lucene did numeric range queries. You gotta worry about having too many Boolean clauses in the query.
- Grant Ingersoll
-
Right.
- Peter Keegan
-
So basically what we did was we said, well, we're gonna keep all our numeric data that we need to do range queries on basically in this little extension outside of Lucene.
- Grant Ingersoll
-
Humph.
- Peter Keegan
-
And it's, essentially what we distilled it down to is all the numeric data we can represent as integers, and that gives us enough precision for dates and for geographic latitudes and longitudes and anything we need to represent as integers. And with that, we can essentially create a Lucene record that corresponds to the document that basically has a bunch of integers in it, and you can very quickly get to any of those values through basically just an array access. And so it's very fast.
We have a range clause in the query. We have a custom query object that basically does the computations using this extension data, and Lucene then merges it in with all the other clauses in the query and we get the proper match.
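The structure Peter describes can be pictured as one integer array per numeric field, indexed by document ID. The sketch below is plain Python with made-up data; the field names, the microdegree encoding of latitude, and the day-number encoding of dates are assumptions for illustration, not Monster's actual layout.

```python
from array import array

# Hypothetical columns: position i holds the value for document ID i.
# Dates are stored as day numbers, latitudes as integer microdegrees.
posted_day = array("l", [14000, 14010, 14025, 14030])
lat_micro = array("l", [42360000, 40710000, 42370000, 34050000])


def in_range(column, doc_id, low, high):
    """A range check is a single array access plus two comparisons."""
    return low <= column[doc_id] <= high


def filter_matches(keyword_matches, clauses):
    """Keep the doc IDs that satisfy every (column, low, high) clause."""
    return [doc_id for doc_id in keyword_matches
            if all(in_range(col, doc_id, low, high)
                   for col, low, high in clauses)]
```

For example, `filter_matches([0, 1, 2, 3], [(lat_micro, 42000000, 43000000), (posted_day, 14020, 14040)])` keeps only documents whose latitude and posting day both fall in range. No extra query terms are generated, so the clause count does not grow with the width of the range.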
- Grant Ingersoll
-
Ah, that makes sense. That's neat. So have you looked at the new numeric capabilities that are coming out in 2.9? I mean are they?
- Peter Keegan
-
Yeah, the Trie stuff?
- Grant Ingersoll
-
Yeah, the Trie, Tree.
- Peter Keegan
-
Yeah, pardon me? Oh, Trie.
- Grant Ingersoll
-
Yeah, Tree, Trie. I mean, just for our listeners, sometimes people say "tri," sometimes "tree." I think it's actually just now called Numeric Range.
- Peter Keegan
-
Okay.
- Grant Ingersoll
-
So better numeric support basically is what it comes down to.
- Peter Keegan
-
Right, yeah. Yeah, certainly if that had been available a few years ago, it definitely would have been something we'd have looked into. But what we have now works extremely well. It does make a lot of assumptions about how Lucene indexing works, and we have to be very careful and we have to know when Lucene is going to renumber the doc IDs, because we have to keep this data that's outside Lucene in sync with Lucene's document IDs. But that's not too hard once you understand how it works.
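The synchronization problem can be sketched as follows, under the simplifying assumption that a merge drops deleted documents and renumbers the survivors densely in their original order. This is plain Python, not Lucene code, and it treats the index as a single segment; the function names are hypothetical.

```python
def compact(column, deleted):
    """Drop entries for deleted doc IDs; survivors shift down to stay dense."""
    return [value for doc_id, value in enumerate(column)
            if doc_id not in deleted]


def remap(num_docs, deleted):
    """Map each surviving old doc ID to its new doc ID after compaction."""
    mapping = {}
    new_id = 0
    for old_id in range(num_docs):
        if old_id not in deleted:
            mapping[old_id] = new_id
            new_id += 1
    return mapping
```

If document 1 of four is deleted and the index is merged, documents 2 and 3 become 1 and 2, so any external array has to be compacted at the same moment or its positions will point at the wrong documents.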
- Grant Ingersoll
-
Right, so you just have to kinda control merging and your deletes and things like that.
- Peter Keegan
-
Yeah.
- Grant Ingersoll
-
Right.
- Peter Keegan
-
So we have total control over that. We do this on these 15-minute intervals. We actually have a separate indexing process and separate search process, which has worked out very well. It makes things fairly robust. Typically, the problems you might have creating an index may be from corrupted data or something like that coming into the pipeline, and you don't want one bad job causing searches to stop for the rest of the world.
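The robustness argument can be illustrated with a small snapshot-swap sketch. It runs in a single Python process for brevity, whereas the setup described here uses separate indexing and search processes; the class and function names, and the rule that non-string text counts as corrupted input, are assumptions for the example.

```python
class Searcher:
    """Serves queries from a read-only snapshot of the index."""

    def __init__(self, snapshot):
        self.snapshot = snapshot

    def search(self, term):
        return sorted(doc_id for doc_id, text in self.snapshot.items()
                      if term in text.split())

    def swap(self, new_snapshot):
        self.snapshot = new_snapshot


def build_snapshot(current, updates):
    """Indexer side: apply a batch to a copy and fail on bad input."""
    snapshot = dict(current)
    for doc_id, text in updates:
        if text is None:
            snapshot.pop(doc_id, None)  # None marks a delete
        elif not isinstance(text, str):
            raise ValueError("corrupted document %r" % (doc_id,))
        else:
            snapshot[doc_id] = text
    return snapshot


def run_cycle(searcher, updates):
    """One update interval: publish the new snapshot only if the build works."""
    try:
        new_snapshot = build_snapshot(searcher.snapshot, updates)
    except ValueError:
        return False  # searcher keeps serving the previous snapshot
    searcher.swap(new_snapshot)
    return True
```

A failed batch leaves the searcher on its last good snapshot, so a bad document delays an update but does not interrupt searching.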
- Grant Ingersoll
-
Right.
- Peter Keegan
-
So the searching is always up and running. If we have a problem with a job in the index, we can get that cleaned up and corrected without the job seeker being aware that something was not working for a while.
- Grant Ingersoll
-
That makes sense. So I think that kinda wraps up my technical questions, and one question I'd like to ask for kind of at the business level is, so what has Lucene meant to Monster in terms of are you able to quantify like say cost savings or anything like that? Obviously, you guys had AltaVista before, and even your background with AltaVista, so you were with a proprietary vendor.
What was involved in the switch to Lucene businesswise? Can you tell us a little bit about what you guys went through in making that switch?
- Peter Keegan
-
So I wasn't directly involved in it myself, but what I did observe was that we realized that Lucene was very extensible. It definitely was very clear that it was gonna scale well. While we thought we might have to make some changes to the Lucene core and that we would be able to do that quite easily, it turns out we haven't had to make any changes to Lucene core other than just a few patches that are for some bug fixes.
And meanwhile, the evaluation of the Enterprise engine had some issues with the update model. You couldn't get updates searchable in the timeframe we needed, this 15-minute timeframe. So it became apparent that Lucene was probably gonna be able to provide the performance that we needed and fairly easily, because it was not that different than the AltaVista library in terms of, you know, the kinds of capabilities it provided and the level of API and how you can get into the expert APIs if you needed to and that kind of thing. So I think the cost thing, I don't think that was so much of an issue. That was probably more of a bonus, but it was more that we were able to provide a scalable and reliable solution.
- Grant Ingersoll
-
Right and it sounds like four years down the road, I mean, you're pretty happy still with your choice.
- Peter Keegan
-
Oh, absolutely. I mean, we've been in production now for probably two and a half years and not one Lucene bug in that entire time.
- Grant Ingersoll
-
Great. Great.
- Peter Keegan
-
We did a fairly extensive beta period before that, a lot of testing. I think there was one thing that came up years ago, and this was in the very early days of developing what they were. There was a performance issue when there was a heavy search load, and we were updating the index at the same time. This is before we had a separate indexer and searcher. And I remember submitting a question about the problem to the users group and got an acknowledgement within 20 minutes and a patch a couple of hours later. That blew me away.
- Grant Ingersoll
-
Yeah. Yeah it's.
- Peter Keegan
-
Amazing.
- Grant Ingersoll
-
Usually when it comes to performance like that, the community is right on top of it 'cause that's one of the things I think we really focus on.
- Peter Keegan
-
Well, I think that's one of the best things about it. I mean, I think I can appreciate Lucene more than a lot of the others because I deal with the groups. I'm probably the point of contact for the Lucene community here, and between the groups and the Lucene in Action books, which are excellent, there's a lot of good things going on.
- Grant Ingersoll
-
Yeah, definitely. Well, Peter, I think that's all I have. If there's anything else you'd like to add, feel free.
- Peter Keegan
-
My kudos to your whole team and I'm glad to see that you've got your own commercial company now, see a lot of good things coming out of there, too.
- Grant Ingersoll
-
Yeah, it's been enjoyable on my side as well, and Lucene has definitely meant a lot to me so... I really appreciate you taking the time from your busy schedule to speak with me and I look forward to seeing more great search capabilities out of Monster. Thank you.
- Peter Keegan
-
Thank you, Grant.
