
Interview with Mike Klaas

Grant Ingersoll talks to Mike Klaas, Lucene/Solr committer and CTO of internet search innovator Worio.com. Worio's Discovery Engine helps users find results by observing their interests, without relying exclusively on keywords. Worio has scaled to a 500 million document corpus searched through distributed instances of Solr, with customized scoring, ranking and highlighting.

“There’s actually a lot of information that is useful to people on the web that they aren’t finding through search. And that’s because it just doesn’t fit in the search mold … As a person uses this system for a while, Worio learns about their interests and starts being able to provide them recommendations for web pages without even having to do search.”

Transcript

Interviewer:
Today I’m speaking with Mike Klaas, Apache Solr committer and Lucene PMC member. Welcome, Mike, and thanks for speaking with me.
Mike Klaas:
Hi, Grant.
Interviewer:
Why don’t we start off by having you introduce yourself and tell us a little bit about your background before you came into Solr.
Mike Klaas:
Sure thing. When I did my undergrad in computer science and math, I always had a bit of interest in natural language processing. I did my honors thesis on automatic term extraction. I took a little detour after that and worked in financial modeling software for a little while. And then I came to UBC in Vancouver to do my master’s degree in machine learning. That’s when I was approached by Ali Davar, the CEO of Worio. He had some new ideas for approaching web search. I thought it was a really interesting project. And so we started working on that part time for a couple of years, and then that grew into several of us working full time at Worio, building a new search engine for which Solr is a central component.
Interviewer:
Can you walk me through what led you to Solr? How did you guys choose that? And what was, kind of, the run up to Solr? What else did you evaluate?
Mike Klaas:
When we started the project, Solr was in its infancy, before the first release, I believe. And actually, I wasn’t aware of Solr when we first started thinking about the search aspect of things. We started building some of these components ourselves, thinking about term indices and trying to get that right. We kind of had a little prototype version running of our own code, and we just found that it’s very tricky and difficult to get high-performance, quality code for this kind of task. So we started looking around a little bit more seriously on the web to see what was out there. It seemed like Lucene was the solution that people really started pointing to, so we thought that we would give it a try. Our main implementation language at the time was Python, so we tried the PyLucene implementation. And we found that the coupling of the interface between PyLucene and Python just wasn’t as smooth as we’d like, and that’s what led us to Solr, which really was just a breath of fresh air. A simple interface that uses an existing protocol, HTTP, and really, it was just a joy to use. We wrote a few scripts to start inputting data and getting queries, and really, we started getting results within hours, and that’s when we decided that we’d give it a serious shot.
Interviewer:
So, this sounds like a good time for you to tell us more about Worio. Actually, I forgot to also say in your intro that you are the CTO of Worio.
Mike Klaas:
That’s right.
Interviewer:
So what makes Worio different from other internet search engines? What’s your approach? How do you guys look at search and, I guess, discovery as well?
Mike Klaas:
The way we look at it, there are two ways that people find information on the web. One way is, okay, you know that there’s a certain question you have or certain information that you want, and so you’ll use keyword search. You either know the queries to retrieve it or you explore a little bit, but essentially, you have a known information need. Now, we find that there’s actually a lot of information that is useful to people on the web that they aren’t finding through search. And that’s because it just doesn’t fit in the search mold. It’s information that’s useful to them, but they might find it through being recommended by a friend or stumbling across a blog post or some other means that isn’t well structured for finding this information. And what ends up happening is they might not stumble across it until quite a while later, maybe months or years. What we’re trying to do with Worio is bring structure to the web in terms of this process of discovery. We call ourselves the Discovery Engine. The way we do that is we couple that with search, so as someone is searching on the topics that interest them, we’re trying to provide them these discoveries, these relevant documents that we think are interesting to them but are not necessarily directly about the keywords they’re entering. And as a person uses this system for a while, Worio learns about their interests and starts being able to provide them recommendations for web pages without even having to do search.
Interviewer:
You know, one of the things that I think a lot of people run across with search is that you know there are documents out there that answer your question, you are just not sure what the keywords are. So in some sense, the discovery could help you find those kinds of documents, right? Is that reasonable?
Mike Klaas:
Yep. Sometimes that is exactly what happens. By not being constrained to the keywords, you kind of find a slightly different approach in the discovery that answers your question. Other times you don’t even have a specific question. You might be searching for something like API documentation for a function call, and then an interesting blog post about that language shows up in your discoveries. You know, you go, oh, I didn’t realize that. So it might not be about information you are looking for at that moment, but it’s nevertheless interesting to you.
Interviewer:
And it actually learns over time what’s interesting to me, based on my click-throughs and things like that?
Mike Klaas:
Exactly, so we look at your click-throughs. We look at your searches. Of course, you can turn this off for privacy. But most importantly, you can also save pages in your library and collect this set of information. And Worio builds a better and better model of it as you use it.
Interviewer:
You said you use Solr, so can you walk us through what your Solr setup is in a little more detail? Keep in mind that the listeners are people who use Solr and are interested in hearing about cool things that you are doing with Solr.
Mike Klaas:
Absolutely. We have two main uses for Solr, one of which is a large-scale web search engine. We have built up to a 500 million document corpus through distributed instances of Solr. We have intensely customized our distribution with different ways of scoring and ranking documents, as well as separating hit highlighting from relevancy search, and that is an interesting use of Solr that I would be happy to talk about in more detail. We also allow users to save and tag pages into their own library, and this library is then searchable. So as you collect this corpus of documents, you can go back later and search it at any time. That’s a much more real-time use of Solr: documents are coming in, and within 30 seconds they are searchable by the user.
Interviewer:
So, on the library thing, how are you implementing that? Does each user have their own core, or do you just put all of them into one big index and filter based on user ID?
Mike Klaas:
Exactly, we are just using a single index and using filter queries to do the filtering, which we find works perfectly. We have a social component to Worio, so people can have a friend network, and another thing you can do is search all your friends’ documents at the same time. That really necessitates things being in the same index.
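
To make the single-index approach concrete, here is a minimal sketch in Python, the language Mike mentions as Worio’s main implementation language. It only builds the request parameters; `q`, `fq` and `wt` are standard Solr parameters, while the `owner_id` field name, the user IDs and the helper name are hypothetical and do not come from Worio’s code.

```python
from urllib.parse import urlencode


def library_search_params(keywords, owner_ids):
    """Build Solr parameters that restrict a keyword search to the pages
    saved by the given users (one user, or a user plus their friends)."""
    if not owner_ids:
        raise ValueError("at least one owner id is required")
    # 'owner_id' is a hypothetical field holding the id of the saving user.
    owners = " OR ".join(str(owner) for owner in owner_ids)
    return {
        "q": keywords,                 # scored keyword query
        "fq": f"owner_id:({owners})",  # unscored filter on the shared index
        "wt": "json",
    }


# Search my own library plus two friends' libraries in one request.
params = library_search_params("lucene scoring", [42, 7, 19])
query_string = urlencode(params)
```

Because every user’s pages live in the same index, widening the search from one library to a whole friend network only changes the filter, not the index layout.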
Interviewer:
Now, you said you also customized scoring, is that to factor in your machine learning components? Or, was that for other reasons?
Mike Klaas:
The scoring is mostly customized on the web search side, and the reason we did that is because getting relevancy right at such a large scale on the web is very challenging. First of all, you need to index anchor text, which is one of the most important factors for relevancy. To do that properly with Lucene, you really have to think about the length normalization, because the set of words in the anchor text per document has very different properties from, say, the body of a document. In fact, repeated instances of the same term are much more highly indicative of relevancy in anchor text than they are in the body. So you really want to soften the length normalization and not penalize documents that are longer just because they have more terms. In addition, it’s necessary to think about the link structure of the web. We found the best way to do that was to incorporate a multiplicative component into the score. So we added some custom query classes, which allow you to specify an arbitrary query string as a multiplied component, a multiplied boost on the relevancy.
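
The two ideas in this answer, a softer length normalization and a multiplicative boost, can be illustrated with a small Python sketch. It assumes the classic Lucene default of 1/sqrt(numTerms) as the baseline; the smaller exponent and both function names are illustrative assumptions, not the values or classes Worio used.

```python
def length_norm(num_terms, exponent=0.5):
    """Length normalization factor for a field with num_terms terms.

    exponent=0.5 reproduces the 1/sqrt(numTerms) of Lucene's classic
    default similarity. A smaller exponent (an assumed value, chosen only
    for illustration) penalizes long fields such as anchor text less.
    """
    return 1.0 / (max(num_terms, 1) ** exponent)


def final_score(text_score, authority_boost):
    """Combine text relevancy with an offline boost by multiplication."""
    return text_score * authority_boost


default_norm = length_norm(100)         # classic penalty for 100 terms
softened_norm = length_norm(100, 0.25)  # softened penalty for the same field
boosted = final_score(2.0, 1.5)
```

With multiplication, a document with a high authority boost but no textual match still scores zero, which an additive boost would not guarantee.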
Interviewer:
So essentially you are calculating authority as well when you do the link structure analysis?
Mike Klaas:
Essentially, right. That’s a completely offline process. We don’t do that within Solr, but we compute basically a boost value for each document, and it’s important to us that we have a relatively granular, specific number here. That’s why we implemented this as a custom query class, as opposed to just using the document boost within Lucene.
Interviewer:
I see, so you don’t have to deal with updating the documents every time you recalculate this boost factor, or do you?
Mike Klaas:
The way we have it currently implemented, you would have to reindex the document.
Interviewer:
Okay. So you can just have this kind of background task that’s run at some interval and just goes through and is constantly updating.
Mike Klaas:
Right.
Interviewer:
Oh, that makes sense.
Mike Klaas:
Basically, we build these indices all at once, so we have a large crawl that we process, analyze a whole bunch, and then index all at once in Solr.
Interviewer:
So, speaking of crawling, what’s your approach to crawling? Do you guys roll your own, or are you using something like Nutch or one of the other available crawlers?
Mike Klaas:
We’re actually rolling our own on this one. We really require a crawler that can be distributed well over dozens of computers. I’m not actually familiar with the state of the art in the Nutch project, but from what I understood, it didn’t have that much flexibility back when we first started evaluating these things. The advantage of this is we can do things in an offline process, so we can crawl a couple million documents and do an analysis of them, look for the high-authority documents, the high-authority links that we don’t have, and then prioritize those in the next batch of crawling.
Interviewer:
What’s your approach to dealing with spam? For people who search on the internet, spam is a constant issue. Do you guys have anything special you are doing there, or are you just relying on the authority calculations to figure it out?
Mike Klaas:
Yeah, unfortunately, it isn’t that simple, because most spam on the web is exploiting this authority calculation.
Interviewer:
Right.
Mike Klaas:
And trying to do things that’ll artificially inflate the importance of those documents. So we have to do things where, for instance, we seed an authority calculation with known good documents as well as known bad documents, and propagate goodness in one direction through links, and badness and spamminess in the other direction. On top of that, we also have machine learning based classifiers. I mean, a lot of spam pages look essentially similar. They are these generic-looking pages on specific topics like credit card insurance or something like that, mostly filled with Google AdWords. So you can pick up on some of these features and learn what makes a page spammy. That will help a little bit. All these things have to be carefully put together, but it’s still a big problem.
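
The seeded propagation described here can be sketched as follows. This is a generic, PageRank-style propagation with a restart at the seed pages, swapped in purely as an illustration: the damping value, the iteration count, the toy link graph and all names are assumptions, and nothing here is Worio’s actual algorithm. Goodness flows along links from trusted seeds; badness flows along reversed links, so pages that link to known spam become suspect.

```python
def propagate(seeds, edges, damping=0.85, iterations=20):
    """Spread score from seed nodes along directed edges.

    seeds: set of node names; edges: dict mapping node -> list of targets.
    """
    nodes = set(edges) | {t for ts in edges.values() for t in ts} | set(seeds)
    seed_mass = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(seed_mass)
    for _ in range(iterations):
        # Each round restarts a fraction of the mass at the seeds.
        nxt = {n: (1 - damping) * seed_mass[n] for n in nodes}
        for src, targets in edges.items():
            if targets:
                share = damping * score[src] / len(targets)
                for target in targets:
                    nxt[target] += share
        score = nxt
    return score


def reverse(edges):
    """Flip every edge so that scores flow from a page to its in-linkers."""
    flipped = {}
    for src, targets in edges.items():
        for target in targets:
            flipped.setdefault(target, []).append(src)
    return flipped


# Toy link graph: a trusted chain and a link farm pointing at a spam page.
links = {"news": ["blog"], "blog": ["docs"], "farm": ["spam"], "spam": []}
goodness = propagate({"news"}, links)            # forward from good seeds
badness = propagate({"spam"}, reverse(links))    # backward from bad seeds
```

In this toy graph the link farm picks up badness because it links to the spam seed, while the trusted chain stays clean.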
Interviewer:
So that all factors into your offline crawling task, I imagine.
Mike Klaas:
Right. So often, we filter these documents before they even get to Solr.
Interviewer:
So you’ve mentioned machine learning a couple of times here. Being involved with Mahout, which I know you guys aren’t using, but I have an interest in machine learning as well, so can you fill me in a little bit about your approach to machine learning?
Mike Klaas:
Machine learning is really quite central to Worio. There are lots of discovery-themed sites on the web, and what they end up doing is providing social sharing of links, which we think is quite important, and we do allow that behavior as well. But our goal is really to expand this data that people are providing by tagging and sharing links, and to generalize that to as much of the web as possible. So, for instance, one technology that is key for us is automatic tagging. We look at the corpus of documents that users have applied tags to and apply pattern recognition methods to see what makes a certain tag apply to a document. Then we can go out and apply that to the documents on the web, so that is crucial to us. On the other side of things, we have to consider how to generate interesting, relevant recommendations. Part of that involves learning a model of the user’s likes and dislikes. Not only what topics they are interested in, but also whether they prefer articles from a certain source or articles from certain authors. We can sometimes even consider the tone or length of articles. There are all these things that actually influence how people might perceive a document. That all goes into a learned model of the user and is applied when generating recommendations. Furthermore, in terms of general interests, some collaborative filtering techniques are employed. I can’t go into too many details there. I hope that gives you a sense of what we’re doing.
Interviewer:
This is kind of the future of search in a lot of ways, right? It’s not enough to just do keywords, you actually have to take into account what people are doing as well. Well, it’s not actually even the future of search, it is the current state of internet-scale search, which books like Programming Collective Intelligence and things like that have made really popular. But how do you deal with the problem of, say, overfitting for a user? So I’m on there and I keep choosing the same things over and over again. Do I get stuck in a local minimum, or do you have ways of dealing with that as well?
Mike Klaas:
Well, we try to smooth out these models, especially when you have low amounts of data. From a Bayesian perspective, it would be like having a relatively strong prior that you update with the user’s interests, and the less data they have provided to you, the less you trust their preferences and the more you trust the prior model. I mean, if they keep clicking on the same thing over and over again and tagging it, then eventually you’ll start believing that, okay, that’s really what they want, and that’s what they’ll get. But nowhere do we ever limit results explicitly to the user’s model. When you put in a query, the results will always be relevant in some way to the query. Also, we really try to push things a bit towards the zeitgeist and current interesting topics, so there will always be a little slice of that as well.
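
The Bayesian intuition above can be shown with the simplest possible case: estimating how much a user likes one topic from a handful of clicks. The sketch below uses the posterior mean of a Beta prior as a stand-in; the prior rate, the prior strength and the function name are made-up illustration values, not Worio’s model.

```python
def smoothed_interest(prior_rate, prior_strength, positives, observations):
    """Posterior mean of a click rate under a Beta prior.

    prior_rate: how often an average user likes this topic (assumed value).
    prior_strength: how many pseudo-observations the prior is worth.
    positives / observations: what this particular user actually did.
    """
    return (prior_rate * prior_strength + positives) / (
        prior_strength + observations
    )


# Two clicks out of two views barely move a strong prior of 10%...
new_user = smoothed_interest(0.10, 20, positives=2, observations=2)
# ...but 200 out of 200 eventually dominates it.
heavy_user = smoothed_interest(0.10, 20, positives=200, observations=200)
```

With little data the estimate stays close to the prior, which is the smoothing effect Mike describes; only sustained behavior pulls it toward the user’s own preference.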
Interviewer:
So does that happen automatically in your system or is that like you guys have some editorial input to it?
Mike Klaas:
No, it happens completely automatically. We’re constantly grabbing data about what’s going on in the Web and that leads into automatic algorithms.
Interviewer:
Going back to your Solr setup a little bit. You said you had a distributed model. How many machines, and things like that? Can you tell us a little bit about infrastructure and operations? Are you replicating your shards as well?
Mike Klaas:
When we started doing a distributed Solr architecture, we actually had our own distributed system already in place, and we essentially used Solr as an endpoint indexing node. This was before Solr had its own distributed querying, before Solr even had multiple cores. Some of this architecture I wouldn’t necessarily recommend to someone who was jumping into Solr now and could use some of these tools. But I will describe to you what we did. For our corpus of about 500 million documents, we had about 60 indexing machines. We had two Solr instances running on each machine, so about 120, and they served about 10 million documents per server. We could have probably done more documents than that if we were happy with slightly slower performance or less rigorous relevancy analysis.
Interviewer:
Right.
Mike Klaas:
And on top of that, we had a multi-layer query distribution system, so a query would come down from the top and be distributed to about ten mid-level aggregators, which would then collect the query results from about six shards, merge them together, and then post them to the top, where they would be merged again. So at every level you were only working with a couple hundred results, and you don’t overwhelm one CPU.
Interviewer:
Gotcha.
Mike Klaas:
And at the very top, these would be merged together in a priority queue, and then the top results would be sent back down for highlighting and document data fetching. One thing that we had to do to make this scale is separate out the contents of documents, in terms of storing them as a stored field, from the index, because if you are querying 120 instances, you are going to be retrieving and sorting through hundreds of documents before merging them together, and this is quite slow if you have 10 or 20 KB of data per document. So we had two parallel Solrs running in the same web app, one of which had the reverse index and one of which had the document contents. The reverse index would be the one that served the query, and once the results were merged from all of the servers, we would request highlighting for the top ten documents from a separate request handler, which would fetch the document data, do the highlighting, and then send results back up to the top.
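
The layered merge and the late fetch of document contents can be sketched in a few lines of Python. Everything here is a toy stand-in under stated assumptions: shard results are in-memory lists of `(score, doc_id)` pairs already sorted by descending score, `doc_store` is a dictionary playing the role of the separate content index, and the function names are hypothetical.

```python
import heapq
from itertools import islice


def merge_top(result_lists, limit):
    """Merge lists sorted by descending score and keep the best `limit`."""
    merged = heapq.merge(*result_lists, key=lambda hit: -hit[0])
    return list(islice(merged, limit))


def tiered_search(shard_results, shards_per_aggregator, limit):
    """Merge shards at mid-level aggregators, then merge those at the top."""
    mid_level = [
        merge_top(shard_results[i:i + shards_per_aggregator], limit)
        for i in range(0, len(shard_results), shards_per_aggregator)
    ]
    return merge_top(mid_level, limit)


# Four shards return lightweight (score, doc_id) hits without contents.
shards = [
    [(0.90, "a"), (0.20, "b")],
    [(0.80, "c")],
    [(0.95, "d"), (0.10, "e")],
    [(0.50, "f")],
]
top = tiered_search(shards, shards_per_aggregator=2, limit=3)

# Only the final winners are looked up in the (hypothetical) content store.
doc_store = {doc_id: f"contents of {doc_id}" for doc_id in "abcdef"}
page = [(doc_id, doc_store[doc_id]) for _, doc_id in top]
```

No level ever handles more than `limit` hits per input list, and the bulky document contents are touched only for the handful of results that survive the final merge.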
Interviewer:
That’s all roll your own right there, right?
Mike Klaas:
That’s all roll our own, especially that latter thing we just talked about, separating the data into two different Solrs and writing a different request handler. It really wasn’t much work at all.
Interviewer:
Right, Solr gives you a lot of those capabilities.
Mike Klaas:
Right.
Interviewer:
Exactly. Do you have any other custom components you can talk about as far as things that you have done in Solr that you feel are pretty interesting?
Mike Klaas:
Sure. In addition to what I already talked about in terms of the scoring, one component which I think works quite well is that we think about the proximity of words in terms of relevancy. Now, Solr has this built in already with the dismax handler and the phrase query boosting. But one disadvantage of that is that it will only work if all the query terms actually exist as a phrase, so we built a query class very similar to the phrase query, which doesn’t require that all the terms match. That was, I thought, a very interesting way to get around that problem.
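
One way to picture a proximity score that tolerates missing terms is the sketch below. It is plain Python over a token list, not a Lucene query class, and the scoring formula (number of matched terms divided by the smallest window containing them) is an assumption made for illustration; the interview does not say how Worio’s class computes its score.

```python
def proximity_score(query_terms, doc_tokens):
    """Score how tightly the query terms that occur in the document cluster.

    Unlike a strict phrase match, terms missing from the document are
    ignored instead of causing the whole match to fail.
    """
    wanted = set(query_terms)
    hits = [(pos, tok) for pos, tok in enumerate(doc_tokens) if tok in wanted]
    present = {tok for _, tok in hits}
    if len(present) < 2:
        return 0.0  # proximity needs at least two distinct matched terms

    # Sliding window over the hits: smallest span holding every present term.
    best = None
    counts = {}
    left = 0
    for right in range(len(hits)):
        tok = hits[right][1]
        counts[tok] = counts.get(tok, 0) + 1
        while len(counts) == len(present):
            span = hits[right][0] - hits[left][0] + 1
            if best is None or span < best:
                best = span
            left_tok = hits[left][1]
            counts[left_tok] -= 1
            if counts[left_tok] == 0:
                del counts[left_tok]
            left += 1
    return len(present) / best


query = ["distributed", "search", "sharding"]
adjacent = proximity_score(query, "solr makes distributed search simple".split())
spread = proximity_score(query, "distributed systems make search hard".split())
```

Neither example document contains "sharding", so a strict three-term phrase match would reject both; here the document with the two matched terms side by side simply scores higher than the one where they are spread apart.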
Interviewer:
Right.
Mike Klaas:
Now, the main query request handler has some custom components on it as well. For instance, you can send it a filter query that, instead of being applied post facto as in the current architecture, is actually embedded within the query, which can make for more efficient querying because you are not, for instance, matching a large number of documents on a query and then refining to a small number. We took advantage of Lucene’s short-circuiting semantics to speed things up. And again, that same request handler has a bunch of additional parameters to support different kinds of multiplicative scoring and different kinds of boosting as well. But when we implemented it, we essentially wrote our own request handler by taking the existing one in Solr and adding things to it. These days, you can write a search component and just plug that into the existing request handlers, which is a much better and more maintainable way of doing that.
Interviewer:
So what’s next for Worio? Where are you guys headed? What’s your plan for taking over search?
Mike Klaas:
We’re kind of just in the phase of starting to try to get some feedback on what we’ve released already. We’ve achieved a scale now where we feel that the site is really interesting to use. It’s not exactly where we want it yet. We still think there’s a long way to go. The field of discovery with search is still new and there are lots of ideas we want to try, but we’re ready to get some feedback from our user base. As a company, we might be looking into raising more financing in the next year, but essentially we are working toward removing the beta tag. You know?
Interviewer:
Right.
Mike Klaas:
Getting feedback, stabilizing things, just making something polished enough that we can proudly remove the beta. Not that there’s anything embarrassing about having beta on a product.
Interviewer:
So, any tips you have for people getting started with Solr or thinking about maybe what’s next in Solr for you guys?
Mike Klaas:
Well, we don’t have any major plans for our use of Solr in the short term. Essentially, Solr has been so rock solid for us that we haven’t had to put much thought into how we’re going to change our use of it in the future. One thing I’d like to do is refactor some of the customization we have done to use some of the customization points that the current version of Solr supports, like the search components that I mentioned.
Interviewer:
Right.
Mike Klaas:
For someone who is just starting to get into Solr, this is perhaps obvious advice, but I think it is the most important. Solr has such a great example. I heard you talking to Chris Hostetter about how you tend to work from the example and move on from there. And that’s exactly what we did as well. You start with this schema that has all these different things. You can play around with them and really get something usable very quickly. It’s a really fast learning process.
Interviewer:
The trick there, of course, is that once you’ve gotten up the curve a little bit, it’s good to then go back and revisit, as well.
Mike Klaas:
I agree. At some point, you really want to dive in a bit more deeply and see what each parameter means and really customize things the way you want to, because the customization provided by Solr is practically boundless.
Interviewer:
Well, great. Thank you, Mike. I’m glad you could take the time to talk with me and good luck with Worio.
Mike Klaas:
Thanks so much, Grant.

 

 


Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2012 Lucid Imagination. All rights reserved.