Interview with Chris Hostetter

Chris Hostetter speaks with Lucid's Grant Ingersoll about Solr and Lucene, writing Solr plugins, Solr at CNET, the dismax query parser, and faceted search, among other Solr features. "If I’m a PHP developer, I can have Solr up and running and just hit it and talk to it via the way I know how to do things in PHP." ... "As long as you can run a servlet container, which is fairly trivial for most people, you don’t have to understand why or how it’s working under the covers. You just have to know that when you hit these URLs, you’re going to get this type of response."

Transcript

Grant Ingersoll:
Today I’m speaking with Chris Hostetter, one of the original Apache Solr Committers and a Lucene PMC Member. Thank you for joining me, Chris.
Chris Hostetter:
Thanks.
Grant Ingersoll:
Why don’t we start by having you tell us a little bit about your background before Lucene and Solr?
Chris Hostetter:
I’m a software engineer at CBS Interactive. I’ve been there for about ten years now, starting back when it was CNET Networks. I started there straight out of college when I graduated from UC Berkeley and pretty much from day one was working on search applications for CNET.com.
Grant Ingersoll:
Great. So how did you get involved with Solr and Lucene?
Chris Hostetter:
We had been trying to scratch an itch with regard to product searching a couple of years back, dealing with faceted searching of products based on product attributes. A co-worker of mine had been working with Lucene and had done a prototype for an unrelated project, and based on what he had told me about how it worked, it seemed like it would be a really easy way to implement some of the things that we were trying to do. So I started playing around with it and within a couple of weeks had a really good sense of the fact that I could take it pretty far, considering that it was open source. I’d played with open source before that. I’d used some open source tools, but that was the first time that I had ever really used an open source library where I’d been able to get down into the guts of it very quickly. Compared to some of the proprietary search tools that we’d been using before, I realized that this was going to be pretty much a perfect fit for what I needed to do.
Grant Ingersoll:
Great, so you mentioned faceted searching. Can you explain a little bit more about what faceted search is?
Chris Hostetter:
I hope so. (Laughter) Faceted searching mainly comes down to giving the user options related to refining their search, but always ensuring that those options will result in a meaningful set of results, that you are never giving them a dead link going to a no results found page or something of that equivalent. When you compare sort of traditional search page from ten years ago or from a less than stellar implementation, you might have lots of pull-down options or check boxes for refining your searches. Faceted searching tends to involve more immediate gratification. Usually there is what we would call a facet constraint count telling you, “If you pick this option, if you restrict your results in this way, this is how many results you will get.” So the user always has a very good expectation of what they are about to change when they narrow by category or narrow by price range or narrow by some property of the results that they are dealing with.
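Illustrative sketch (not part of the interview): the facet constraint counts Chris describes can be mimicked in a few lines of Python. The recipe records, field names, and the naive keyword match below are all hypothetical; Solr computes these counts inside the search engine, not like this.

```python
from collections import Counter

# Hypothetical recipe records; the field names are illustrative only.
RECIPES = [
    {"title": "Chicken enchiladas", "cuisine": "Mexican"},
    {"title": "Fried chicken", "cuisine": "Southern"},
    {"title": "Black bean tacos", "cuisine": "Mexican"},
    {"title": "Shrimp and grits", "cuisine": "Southern"},
    {"title": "Chicken tortilla soup", "cuisine": "Mexican"},
]


def search(docs, keyword):
    """Naive stand-in for a search: keep docs whose title contains the keyword."""
    keyword = keyword.lower()
    return [d for d in docs if keyword in d["title"].lower()]


def facet_counts(docs, field):
    """Count how many of the matching docs carry each value of `field`."""
    return Counter(d[field] for d in docs)


results = search(RECIPES, "chicken")
counts = facet_counts(results, "cuisine")

# Only values that occur in the current results are offered,
# so no refinement link can lead to an empty result page.
for value, count in counts.most_common():
    print(f"{value} ({count})")
```

Because the counts are derived from the current result set, every option shown is guaranteed to return at least one result.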
Grant Ingersoll:
So is there maybe a website listeners would be familiar with, that they would be like, “Oh yeah, that’s faceting.” Can you think of one?
Chris Hostetter:
Oh, well, CNET.com for starters, but other good examples? One of the first ones that I remember doing a really good job was Recipes.com in terms of browsing recipes and they provided a really good faceted UI for saying I want to narrow by what type of meal I’m cooking for or I want to narrow by the main ingredient or preparation time, things like that.
Grant Ingersoll:
So it might have like Mexican and it would tell you how many recipes are in Mexican or in Southern and how many recipes are in Southern and then you could narrow it down that way, right?
Chris Hostetter:
Exactly, yeah. It’s all about giving the user the options in advance to know how they can filter their results in more than one way without them having to click a link and see, “No I only have two. Back up, click a different link, okay now I got seven.” You want to give them all that information up front before they ever move forward.
Grant Ingersoll:
Right and I suppose it helps. I think a lot of times as a searcher you end up putting, adding key words, and subtracting key words, and trying phrases. I think this kind of helps in that case too.
Chris Hostetter:
I think the biggest feature of Solr is sort of its ease of use as an application. There are a lot of things that can be done sort of out of the box, as the expression that tends to come up a lot, without having to write a lot of custom code. You can use things like the data import handler or the CSV loader to load directly from CSV files on disk or directly from a database using some basic SQL expressions, or even index XML files, remote or local. Things like that really let you get a hold of your data very quickly without needing to write custom wrapper code that you might need to write for other applications. Once your data is in Solr, I think some of the nice features are things like the highlighting support, the built-in faceting support, some of the various options that exist that other applications wouldn’t have, like the multitudes of query parsers and query parsing options. Once you get sort of beyond the surface, then other really nice features that you don’t always see are things like the built-in caching, which makes all of those other features a lot faster.
Grant Ingersoll:
Right.
Chris Hostetter:
I’m sure I’m missing features that I am just not able to think of at this time of the morning.
Grant Ingersoll:
Yeah, well operations-wise, things like replication and –
Chris Hostetter:
Oh, right, of course, yeah. Those are sort of the internal features, things that you don’t necessarily see or notice immediately when you’re dealing with the results, but sort of – yeah, the operational concerns of built-in replication and the distributed searching aspects. For most of the really common operations, being able to distribute that across a large cluster of boxes is definitely a big win for a lot of people.
Grant Ingersoll:
Right and I think for non-Java users, they can harness all of the power of Lucene and not have to worry about knowing that much about Java.
Chris Hostetter:
Right.
Grant Ingersoll:
So if I’m a PHP developer, I can have Solr up and running and just hit it and talk to it via the way I know how to do things in PHP.
Chris Hostetter:
Right, that’s that sort of out-of-the-box experience that you are dealing with – you can, if you are willing to and you want to, take advantage of almost all of Solr’s functionality without knowing Java or writing really any code, but when you do interact with it, you can interact with it in whichever way you want, dealing with that HTTP interface. You are not tied like you are with some other products. “Oh, does it have a language binding for my language or is it implemented in my language?” As long as you can run a servlet container, which is fairly trivial for most people, you don’t have to understand why or how it’s working under the covers. You just have to know that when you hit these URLs, you’re going to get this type of response.
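Illustrative sketch (not part of the interview): talking to Solr over HTTP amounts to building a URL and reading the response, in any language. The Python below only constructs the URL. The host, port, path, and field names are assumptions for a local example install; `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr request parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed location of a local example Solr install; adjust for your setup.
BASE_URL = "http://localhost:8983/solr/select"


def build_select_url(base_url, query, facet_fields=(), rows=10):
    """Build a Solr search URL; no Solr-specific client library is involved."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", name) for name in facet_fields)
    return base_url + "?" + urlencode(params)


# Hypothetical field names.
url = build_select_url(BASE_URL, "title:monitor", ["manufacturer", "price_range"])
print(url)
```

Fetching that URL with any HTTP client (PHP, Python's `urllib.request`, curl) is all a client needs to do; the response format is chosen with `wt`.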
Grant Ingersoll:
You mentioned out of the box thing and I really like your Solr out of the box talk that you gave at ApacheCon. Could you maybe just walk us through kind of what you do during this talk and I’ll put up a link to your slides for the interview?
Chris Hostetter:
Sure, the out of the box talk is something that I started a year ago, mainly because I realized a lot of people didn’t fully appreciate everything Solr could do for them, or they thought that in order to take advantage of some of the features Solr could do, they would have to write plug-ins. Admittedly, I’m guilty of most of that because prior to giving this talk, I really stressed how easy it was to write Solr plug-ins; but what I tried to do in the out of the box talk is really show, okay, from the moment you download Solr, what are you capable of and how will you use it to take advantage of those features? So the talk really tries to cram into a short amount of time all of the features of Solr with an explanation of what each feature means. We don’t get into a lot of depth. It’s mostly breadth, the theory being that once people know the vocabulary and vernacular, once they know what the word faceting means, or what highlighting means in terms of generating snippets showing where the search terms are, once they get that vocabulary they can then go look at more documentation on the Solr wiki.
Once we go through most of that, then I launch into a demo where I start up Solr from a completely empty install and I show sort of an iterative process of taking some data. The very first time I gave the talk, it wasn’t my data and I hadn’t seen the data. I’d only spent ten minutes looking at it to make sure that I understood that it was in fact a CSV file. If you do the demo from start to finish, you are taking data that you have never seen before and don’t understand, loading it into Solr, and using Solr’s built-in tools to inspect the data and to understand what’s in there, to say, “Oh, this field would really make sense as a searchable field. This other field would really make sense as something that I want to sort on or facet on.” Certain fields, you realize, have kind of inherent delimiters that make sense to split on, that kind of thing. For all of those, I sort of walk through what you can do and what kind of configuration options you might use to really exploit certain aspects of that data, so that when you are all done, you have a really nice, really effective result set from very good basic queries that give you really good faceting and highlighting results as well.
Grant Ingersoll:
Yeah, I really like that demo part especially, because I think a lot of times what happens with people new to search is you don’t quite know yet how to map your content into your search engine, and so this out of the box demo kind of shows you this iterative process up close. Okay, I don’t know anything about this, so I start out with just a really dead simple approach and then I iterate through and I look at some samples of the data and then I just dig in deeper and deeper. It shows that whole process of getting to know your data, and I think that’s really important.
Chris Hostetter:
Yeah, and that goes back to your question about features of Solr. The things that really come into play there are the dynamic field support, where you don’t have to explicitly tell Solr every field name you want to use. It can sort of discover new fields and give them a default treatment. Without really even knowing what fields you have, you can say, “Hey Solr, for now let’s assume everything is text.” Then once you’ve done that, you can take advantage of things like the Luke RequestHandler to say, “Okay, because I’ve indexed it as text, what kinds of terms did I find? What are the common terms? How many documents have terms at all?” Then you can take advantage of the analysis GUI, which lets you say, “Well, what if I use this kind of a tokenizer and this kind of a token filter arrangement? Then what would happen to some of the same text that I’m looking at?” Then you iterate back to the schema and say, “Based on the results of looking at the Luke output and based on what I think I want to try using the analysis GUI, let’s go refine the schema a little bit to make the title field be tokenized in a slightly different way.” Then we’ll reload some of that data and look at it again in Luke and refine again in the analysis UI and circle back over and over again until we really get the experience that we want.
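Illustrative sketch (not part of the interview): the "treat everything as text, then look at the terms" step can be imitated offline in Python. This is not the Luke RequestHandler, only an analogy for the questions it answers; the rows and field names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical rows, as if loaded from an unfamiliar CSV file.
ROWS = [
    {"title": "Acme LCD Monitor", "tags": "lcd;monitor"},
    {"title": "Globex LCD Monitor", "tags": "lcd;monitor;widescreen"},
    {"title": "Acme Power Cable", "tags": ""},
]


def term_stats(rows, field):
    """Treat a field as plain text and report term counts and non-empty docs."""
    term_counts = Counter()
    docs_with_terms = 0
    for row in rows:
        terms = re.findall(r"\w+", row.get(field, "").lower())
        if terms:
            docs_with_terms += 1
        term_counts.update(terms)
    return term_counts, docs_with_terms


for field in ("title", "tags"):
    counts, docs = term_stats(ROWS, field)
    print(field, "docs with terms:", docs, "top terms:", counts.most_common(3))
```

Output like this is what suggests the next schema change: here, `tags` has an obvious `;` delimiter and a small vocabulary, which makes it a candidate for faceting rather than free-text search.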
Grant Ingersoll:
Being a community member, I’m always amazed at kind of the things people do with Lucene and Solr. Can you think of some interesting applications for Solr or Lucene where you are like, “Wow, I would have never thought to do something like that?”
Chris Hostetter:
I think the first really big wow wasn’t so much a novel use of Solr; it just kind of surprised me. It came out of left field. It was in the early kind of incubation of Solr. Wow, his name is escaping me right now, but he was from Zappos.com and he just sort of chimed in, and with very little participation on the mailing list, he just one day said, “By the way, the beta Zappos is using Solr.” It just sort of hit me that it’s like, “Oh wow, here is somebody out there who didn’t really need a lot of help to get this up and running and this is not a toy. This isn’t a trivial usage.” This is something where, even though this guy wasn’t involved from the start like some of the rest of us, he got it and he got how to use it and he got how to make it work.
I think one of the most novel uses I’ve seen lately that kind of surprised me isn’t really that novel; it’s more how they went about it. The LocalSolr components and LocalLucene were something that kind of surprised me – I’m sure other people knew about it before I did – realizing how much they were able to accomplish with geographic and spatial searching without really needing to hack the internals very much. It’s always something that I kind of thought could be done with Lucene, but I always assumed that there would be inefficiencies involved, just because Lucene and Solr hadn’t been built specifically for that purpose, and I know that there are a lot of specialized APIs out there for dealing with spatial searching. The fact that they were able to take the component model that exists for Solr plug-ins and really say, “No, no, we can make these spatial algorithms work using this kind of component model and we can make it work well and we can make it fast.” That really impressed me – especially as a committer and as someone who has sort of tried to help flesh out what those APIs should look like. The fact that they were really able to work with that API and didn’t need to jump through a lot of hoops, that was a big surprise to me, to realize that we had gotten that as right as we had in terms of making it useful and functional.
Grant Ingersoll:
Yeah, I find the flexibility of the plug-in architecture especially gives you this opportunity to view Solr as a whole text application engine. You can move beyond search even. Search is kind of that foundation for a lot of other things that go beyond it and I think that really speaks well to how flexible Solr is.
Chris Hostetter:
Yeah, that’s – Ryan was one of the first people to really kind of open my eyes to the fact that there were people out there who weren’t just using Solr as an application. They were using it as an application platform, sort of using it instead of other model-view-controller frameworks for building their web applications. He was building his applications directly in Solr – not embedding Solr in something else. Not stealing – or stealing is the wrong word. Not extracting Solr code to put in other applications. He was using Solr and the Solr plug-in framework as his MVC architecture.
Grant Ingersoll:
Yeah.
Chris Hostetter:
Instead of writing controllers, he was writing request handlers. It was very – that was another one of those surprising things to sort of come out of the community that made me realize, you can’t make assumptions about how people can use this sort of thing.
Grant Ingersoll:
(Laughter) Right. Just to backfill a little bit, there is also an interview we have with Ryan McKinley, another Solr committer, who talks quite a bit about the spatial search and the context of Lucene and Solr. So listeners of this will want to check that one out as well if they are interested in spatial. What do you think are – you hinted at this a little bit with your out of the box motivations, but what do you think are some of Solr’s most underrated features? Features that people should think about using, but they may not already be using because they weren’t aware of them or they couldn’t quite figure out how to get them to work?
Chris Hostetter:
My niche and my main thing that I think not enough people do is actually write plug-ins. It’s not for everyone. Obviously you have to understand some basic Java. I do think, though, that there are a lot of people out there who are writing Java clients and using Solr that aren’t taking advantage of the fact that if they implemented some of the logic they are putting in their client as a plug-in, they could take advantage of some more optimizations. They could get some better efficiencies, and also just from sort of an architectural purist standpoint, they could have a cleaner flow of data by making sure that functionality they want to happen on every search or on every update, for example, was happening directly in the Solr application as opposed to making assumptions about, “Every client will do this.” That’s one of the things I think more people could take advantage of, and admittedly, there are some problems there with the documentation of the code base itself and how to write those plug-ins. We tried to make that documentation usable enough, but at the same time there are only so many hours in a day to really document well for people how to write those plug-ins. At a certain point, you just have to say, “Well, if they are going to be writing these, they are competent in reading and understanding Java, so they can read and understand the source code to know what they are compiling against.”
Grant Ingersoll:
Right.
Chris Hostetter:
That to me is sort of one of the big things where, when I see people talk about how they are using Solr and see some of their examples, I just think, “Oh, oh, but if you just change that to be a search component or just change that to be a request handler, I think it would be just as useful and maybe more useful and maybe even generically useful for other people if you contribute it back, if you are so inclined.” That, I think, is one of the bigger – that catches me every now and then like, “Oh, but if only you did it a slightly different way!”
From end user features, I think not enough people really take the time to look at their schema files and look at their Solr configs, which control Solr’s entire behavior. A lot of people, I think, start with the examples and then they fall into these mindsets and assumptions of what different field types exist in Solr. We have, for example, pattern tokenizers that can use regular expressions to decide how to tokenize things, and we have lots of configuration options on many of these filters that let you really customize how certain aspects of text are going to be treated when they are indexed, things like that. I don’t think enough people really look at that. Mainly, I think, because they are used to other search applications where you are not allowed to make those customizations, where there is a really inherent built-in notion of, “This is how text is tokenized,” they come away with this assumption that the Solr text field behaves like this, where they think they know. When you are dealing with text in Solr, it always splits on hyphens or something like that, and they don’t realize that they can turn that off. They can make that work any way they want. In many cases, without writing a single line of code, just by reading and understanding that configuration file they can make text mean whatever they want it to mean. So I certainly see some people with questions like that, or blog posts where people say they don’t like how tokenization works in Solr, and they are not talking about the implementation. They are talking about the results they see, without realizing that they can customize that and change that and make it observe whatever behavior they want.
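Illustrative sketch (not part of the interview): the point that splitting on hyphens is a configuration choice, not a fixed behavior, can be shown with a regex-driven tokenizer in Python. This is an analogy for the idea behind a pattern tokenizer, not Solr's schema syntax; the sample text and patterns are made up.

```python
import re


def tokenize(text, delimiter_pattern):
    """Split text on a configurable delimiter regex and lowercase the tokens."""
    return [token.lower() for token in re.split(delimiter_pattern, text) if token]


text = "Wi-Fi ready 24-inch monitor"

# One "field type": whitespace and hyphens both separate tokens.
split_on_hyphens = tokenize(text, r"[\s\-]+")

# Another "field type": only whitespace separates tokens, hyphens are kept.
keep_hyphens = tokenize(text, r"\s+")

print(split_on_hyphens)
print(keep_hyphens)
```

Swapping the pattern changes what counts as a term, and therefore what a query for `wi-fi` or `24` will match, without touching any other code.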
Grant Ingersoll:
Yeah, I think even like my biggest example along those lines is the porter stemmer is built right into that text thing.
Chris Hostetter:
Yeah.
Grant Ingersoll:
Not everybody needs or wants porter stemming, but it’s there and it’s the default and it does a pretty good job in most cases, but it may not be for everybody.
Chris Hostetter:
Right, and actually something you just said there – I think even the way you phrase that is kind of guilty of this assumption. You called it the default. There is no actual default.
Grant Ingersoll:
Right.
Chris Hostetter:
If you disable everything in schema.xml, there is no default behavior that says text will be tokenized in a certain way or that stemming will be applied by default. It’s just that it’s in the example.
Grant Ingersoll:
Yeah.
Chris Hostetter:
It’s what people see the very first time and the reason it’s in the example is just to make people aware that it’s there and it can be used.
Grant Ingersoll:
Yeah, but I know I’m guilty: whenever I start a new Solr project, I copy the example Solr schema, right, because it already has most everything. Now, I do then go and edit and think about my text fields and all that, and that’s where that admin analysis page really helps, but yeah, I’m guilty of that.
Chris Hostetter:
Right, and that’s sort of that fine line we walk, where the example configs almost turn into a kitchen sink where we throw everything in there imaginable because we figure, worst case scenario, people do copy it and don’t look at it, but maybe six months from now they will. They will go in to tweak something, and as they are scrolling through and reading it, maybe they will realize that there are whole sets of field types and tokenizers and analyzers and token filters that they didn’t even know about because they are associated with field types they are not using. But at least if they are skimming past it, they might see those comments and see those explanations and then actually make a decision, “Oh, do I want to delete this or do I want to look into this and learn more about it and actually start using it?”
Grant Ingersoll:
Definitely. I think maybe one of the things we could do is actually provide in the example, links to – URL’s to the wiki where it explains parts of those things if we don’t have them already.
Chris Hostetter:
Right and that, to me, is one of Solr’s kind of biggest faults at this point. The documentation is largely wiki-based and while that’s great for sort of community involvement of editing that and tweaking that, we don’t really do enough to generate user-level documentation that really aims for completeness. The question that I’ve seen come up a couple of times sporadically over the year or so is, “Where is the list of all of the token filters or where is the list of all the languages that are supported?” and things like that.
Grant Ingersoll:
Yeah.
Chris Hostetter:
I really think that’s – it’s something that I sort of started to try and look into a while back and then kind of fell by the wayside is really try and find a way to generate sort of programmatic documentation aimed at the users and then supplement that with the wiki as opposed to making the wiki be the end-all, be-all documentation for users.
Grant Ingersoll:
Right. So you mentioned adding in plug-ins. I think this year in New Orleans at ApacheCon you actually did a second talk there on Solr, “Beyond the Box”. That covers how to do plug-ins and all that, right? Can you talk about that?
Chris Hostetter:
Sure, yeah. It was kind of a dual-purpose talk. On one hand it was to raise awareness about what plug-ins – sorry, how to write plug-ins, but on the other hand, as I was preparing it and talking to people about it, I realized that knowing how to write a plug-in wasn’t really going to help people unless they understood the value of writing plug-ins. So the majority of the talk was really going through some case studies of existing plug-ins that were out there; either plug-ins that I had written as part of work or plug-ins that I had found online that other people had written. So it’s really talking about some of the internal APIs that are designed to be overridden by end users, the hook points for people to provide plug-ins, what those hook points can be used for, which ones are worth people looking into because in the common case they are useful, and which ones are really kind of esoteric and not exactly all that helpful even though they have their niche use cases for people to override them. Then the last half of the talk – probably more than the last half actually – we were looking at some examples and looking at the broad spectrum from simple plug-ins and what they do to more complicated examples, and even a couple of examples where I got some feedback from Solr users. “Oh, I wrote a plug-in that does X, Y, Z,” and then the more I talked to them about how the plug-in worked, I found examples where they had implemented the plug-in one way, but I would have implemented it completely differently, through a different type of plug-in, not just a different algorithm, to achieve the same goal in a different way with trade-offs in terms of performance and amount of effort, things like that. The talk was really to try to raise awareness: this is all the stuff you can do, these are examples of what other people are doing, and here is why it might be beneficial to do it as a plug-in as opposed to doing it on the client.
Grant Ingersoll:
Well yeah that’s – I mean there is so much power in the plug-ins. Maybe you can give us a couple of examples of some plug-ins that you’ve written.
Chris Hostetter:
Sure. The big one from the very start was really faceting plug-ins specifically for CNET’s product catalog. As for why we did that as opposed to taking advantage of Solr’s internal faceting: originally it was because Solr didn’t have internal faceting at the time, but the reason we did it as a custom plug-in as opposed to just sort of building generic faceting support is that we needed to take advantage of metadata. It’s all about category faceting. It’s all about knowing that within a certain category like monitors there are specific facets that make sense for that category, things like diagonal screen size, manufacturer, price. Some of those might be generic. They might apply to all products: price, manufacturer. But the list of constraints within those facets, the list of manufacturers or the list of price ranges that you want to offer, those are specific to that category. Monitors deal with different price ranges than laptops, which deal with different price ranges than power cables. We have all this rich metadata that says for a given category this is the list of facets to offer, in order, and for each facet this is the list of constraints to offer, in order or in a default order. We wrote a plug-in that, given a category, goes and actually determines that metadata from special documents that we index, just an XML representation of the metadata, and we then take advantage of Solr caches to store that, so we don’t have to reparse it on every request. Once we have that metadata, we’re generating all of those individual queries just as regular filter queries, etc. There are other examples of plug-ins that I could talk about, if you want?
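Illustrative sketch (not part of the interview): the idea of per-category facet metadata driving a list of filter queries can be outlined in Python. The categories, field names, and price ranges are invented, and this is not CNET's plug-in (which was written against Solr's Java APIs and read its metadata from indexed XML documents); only the `field:[low TO high]` range syntax is standard Lucene/Solr query syntax.

```python
# Hypothetical per-category metadata: an ordered list of facets, each with an
# ordered list of inclusive (low, high) constraints to offer.
CATEGORY_FACETS = {
    "monitors": [
        ("price", [(0, 199), (200, 499), (500, 1999)]),
        ("diagonal_size", [(0, 19), (20, 26), (27, 60)]),
    ],
    "power_cables": [
        ("price", [(0, 9), (10, 25)]),
    ],
}


def facet_filter_queries(category):
    """Turn a category's facet metadata into one range query per constraint."""
    queries = []
    for field, ranges in CATEGORY_FACETS[category]:
        for low, high in ranges:
            queries.append(f"{field}:[{low} TO {high}]")
    return queries


for fq in facet_filter_queries("monitors"):
    print(fq)
```

Each generated string could then be run as a filter query to obtain the count for that constraint; keeping the metadata in one lookup table plays the role that the cached, pre-parsed metadata plays in the description above.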
Grant Ingersoll:
Sure.
Chris Hostetter:
The simplest one is dealing with analysis plug-ins. There are lots of people out there who have written custom tokenizers or custom token filters to take advantage of special properties of their data, knowing that they are dealing with specific languages or knowing that they are dealing with special code. They might take advantage of those by writing, in many cases, just short two- or three-line plug-ins for dealing with that sort of thing. That’s probably the most common type of Solr plug-in out there. The newest, biggest thing is search components, things that hook into the regular processing pipeline to augment it, without completely replacing it like a normal request handler would. That’s how faceting is built and provided right now. That’s how highlighting is provided, and that’s also how more and more people are starting to write their own custom query parsers, by overriding the query component and saying, “Okay, here we go. Here is how I want my query string parsed. Here is what kind of special rules I want applied, etc.”
Grant Ingersoll:
Yeah, like I know personally I worked on the spell checking component and it was so nice to just be able to plug in “Did you mean?”-type spell checking right into the query flow without having to do a separate request, as you had to do with the old spell checking, which of course was also a plug-in, as a request handler. But that’s just an example of the search component’s capability.
Chris Hostetter:
Right. Search components now are really all about creating a pipeline, and it’s a pipeline that is iteratively looped over in order to support distributed searching and components that need to do more than one thing. For example, a component might need to see the request before the search happens, but it might also need to modify the response when the search is done. That component aspect really lets us write a lot more functionality that can be reused and mixed and matched, whereas the request handler API is really more analogous to a servlet API, where you are saying, “I am going to be the end all, be all for handling this request and if I need to take advantage of other functionality, I will go explicitly do it.” With a search component API, it’s more about saying, “This is my precondition. This is my post condition. I don’t really have any expectations of who comes before me or what comes after.”
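Illustrative sketch (not part of the interview): the component pipeline can be modeled in Python as a list of objects that each get a look at the request before the search and at the response after it. The class names, the `prepare`/`process` split as written here, and the sample documents are a simplified analogy; Solr's real search components are Java classes and also handle distributed requests, which this sketch ignores.

```python
from collections import Counter


class Component:
    """Base class: each component may act before and after the search."""

    def prepare(self, request, response):
        pass

    def process(self, request, response):
        pass


class QueryComponent(Component):
    def __init__(self, docs):
        self.docs = docs

    def process(self, request, response):
        # Postcondition: response["docs"] holds the matching documents.
        term = request["q"].lower()
        response["docs"] = [d for d in self.docs if term in d["title"].lower()]


class FacetComponent(Component):
    def prepare(self, request, response):
        # Sees the request before the search happens and fills in a default.
        request.setdefault("facet.field", "brand")

    def process(self, request, response):
        # Precondition: some earlier component has put "docs" in the response.
        field = request["facet.field"]
        response["facets"] = Counter(d[field] for d in response.get("docs", []))


def handle(components, request):
    """Run every component's prepare step, then every process step, in order."""
    response = {}
    for component in components:
        component.prepare(request, response)
    for component in components:
        component.process(request, response)
    return response


# Hypothetical documents.
DOCS = [
    {"title": "24-inch monitor", "brand": "Acme"},
    {"title": "27-inch monitor", "brand": "Globex"},
    {"title": "Monitor stand", "brand": "Acme"},
    {"title": "Power cable", "brand": "Acme"},
]

result = handle([QueryComponent(DOCS), FacetComponent()], {"q": "monitor"})
print(len(result["docs"]), dict(result["facets"]))
```

Because the facet component depends only on the response containing `docs`, the query component could be swapped for a different one (for example, a spatial search) and faceting would keep working unchanged.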
Grant Ingersoll:
Yeah, and I think for listeners out there, there are a lot of good examples in the Solr source already. You mentioned the faceting. There’s spell checking. There are some of the newer ones; I added the term vector component, for example, which allows you to get term vectors. MoreLikeThis, the highlighter, those are all implemented as search components. So anybody who needs an example of how to do those things can go look at that source.
Chris Hostetter:
Yeah.
Grant Ingersoll:
There’s also pretty good documentation on – I think if you go to the wiki, there is a – is it the Solr plug-ins link? I think that kind of leads to a lot of documentation on the plug-ins.
Chris Hostetter:
Right and like I said before, LocalSolr. That was what blew me away is that those guys hadn’t really been involved with the component design at all, but they had been able to take a lot of work they had done, where they originally wrote request handlers to provide the spatial searching, the geographic based searching, and implement that as a component, which just substituted in for the existing search component and was able to provide all that functionality, but still worked automatically with the highlighting component, the faceting component, etc.
Grant Ingersoll:
Is there any piece of advice that you would give to someone just starting out with maybe it’s just Solr or Lucene, either one, that you feel, “Gee, I wish I would have known this when I was starting out,” or just to make their life easier?
Chris Hostetter:
I can’t think of anything where I would say, “Oh, I wish I had known this when I was starting.” The biggest thing I can say is absolutely get on the mailing list. There’s a lot of projects I’ve worked with over the years unrelated to Lucene and Solr where I thought, “Oh, I kind of have a question about this, but when I look at the mailing list traffic it’s sort of like do I want to bother even subscribing to this list? Am I going to get inundated with stuff that I don’t really care about or I don’t really understand just because I need to ask this one question?” Sometimes I do and sometimes I don’t and when I don’t, I don’t know how well it works or doesn’t work out. When I do, sometimes either there’s no mailing list traffic, so it’s a waste of my time, but there’s no downside; or there are a lot of people asking a lot of questions that wind up just being noise to me. I think with most of the Lucene lists the biggest risk is that you’re going to get a lot of information that’s going to be really useful, but you’re not going to have time to use it.
Grant Ingersoll:
Mmhmm.
Chris Hostetter:
It’s not – there are communities that I’ve seen where you might ask a question and you get nothing, ever. No response. No indication whatsoever that anyone even remotely knows what you’re talking about. The Lucene lists are one of the few places that I’ve seen where I think it’s pretty hard to find a question that doesn’t get some response – people may not know the answer to your question, but you’re at least going to get some feedback on your approach, particularly if you ask your question in a pretty well-formulated way and you actually give some good information about what it is you are trying to do and you’re not too vague. So participating in the mailing list particularly because – this is one of our faults – we don’t have documentation as tight as some projects out there where they have a much larger community. We have a pretty big community. We have a pretty active community, but it tends to be focused on features as opposed to really polished documentation. So participating in the mailing list and checking the mailing list archive is really one of the crucial things you need to do if you are going to spend any time working on the stuff. If it’s a short-term project and you just need to get in and get out, well then you unsubscribe when you’re done, but being involved with the list while you are working on it, just ask the questions and you’ll be amazed at the feedback you’ll get.
Grant Ingersoll:
Moving up to a higher level here conceptually, where do you see Solr headed? What’s your kind of vision for the future of Solr?
Chris Hostetter:
Yeah, if you’d asked me that question two years ago I would have had an answer, but it’s nowhere near, I think, where I expected it to be at the time. So I’ve kind of given up on even supposing I understand where this is going. Enough people are using Solr in ways that even six months ago I would have thought were completely novel and are now becoming more mainstream. I’m kind of surprised. For example, a while back, I think someone asked about integrating some named entity detection into Solr so that as part of the indexing process it would do entity tagging and my first thought was, “Oh, you would never want to do that in Solr. You’d want to do that outside of Solr because you’d want to store that information in your primary data store, whether it is the database or XML files on disk, whatever. You want to keep track of that information somewhere more consistent and permanent than Solr.” At the same time, there’s a lot of people who are moving towards using Solr as their primary data store and there’s a lot of people working on making updateable documents be a realistic and pragmatic thing to do in Solr and getting really reliable data storage in Solr, so it’s not just a de-normalized index of data, but can actually be treated as your main document repository. That’s not a feature of Solr that I would have ever anticipated a year or two ago as being really practical and viable, but it’s making progress now. If that becomes the case, then there’s really no reason not to hook a whole host of features I never would have considered into the update process so that it happens every time you add a document. In fact, a lot of that functionality really would make sense at that point because then you are guaranteed every document gets it when you load it.
Grant Ingersoll:
Right, right.
Chris Hostetter:
Those are the kinds of examples of features that I would have never anticipated, but that may be where things are going. It’s hard to say.
Grant Ingersoll:
Yeah, that’s the thing with the open source community is you are not always – ideas come from left field all the time and they really help grow and contribute to the community.
Chris Hostetter:
Exactly yeah.
Grant Ingersoll:
I’m always amazed by that.
Chris Hostetter:
There is no product manager saying, “This is what the product will be.” There is no one doing market research saying, “This is what our customers want, so we need to move towards it.” There’s just people saying, “Here is some code.”
Grant Ingersoll:
“Here is what I did. Here is a problem I solved and I think it’s useful for others.”
Chris Hostetter:
Yeah.
Grant Ingersoll:
So what about – can you talk about what features you’re working on or issues or anything like that?
Chris Hostetter:
Regrettably, I’m not really working on any features at the moment.
Grant Ingersoll:
(Laughter)
Chris Hostetter:
This has been somewhat of a lean year for me in terms of actually committing code to Solr. I think I’m spending the vast majority of my time participating on the mailing lists and just trying to keep up with some of the design discussions, giving feedback on approaches to how people are doing some things. It’s actually kind of been bugging me lately how little I’ve been giving back, but I figure if some people are writing code and some people are growing the community, I’ll do the one that I can do from my mail client because that’s where I spend the most amount of my time.
Grant Ingersoll:
Right and well it ebbs and flows too.
Chris Hostetter:
Yeah.
Grant Ingersoll:
I mean I know I have peaks and valleys in terms of committing too. So depending on what you have at work and what itch you are scratching at some particular time.
Chris Hostetter:
Right. I think if I can find the time and if I really – if I can get back into actually doing some more, I’m not sure what the right word is for this. Most of the commits that I’ve been doing lately have really been committing other people’s patches where it has been small things where I can really kind of quickly and easily validate that this is a decent bug fix or a decent small improvement without needing to invest a huge amount of time in really polishing it or thinking through big API changes or architecture changes. If I can actually get back into the swing of committing new and original stuff, I think the main thing I’d really like to focus on to start with is more documentation generation, more making it easier for people to understand what’s going on in there and also making some improvements to some of our development tools, getting code coverage analysis working better; getting bug detection or something like PMD hooked into the build system; little things that have been kind of on my wish list since the early days of incubation, but were always kind of secondary to actually adding functionality. I think we’ve got a really healthy contributor base at this point and we’ve got committers now who are just flying without needing a lot of shepherding, especially since we added a couple of new ones this past year. It kind of becomes – I feel like there are smarter people than me adding features at this point. Where I can really add value is helping get more committers into the project by making it easier for committers to really start using the code.
Grant Ingersoll:
Right.
Chris Hostetter:
That may be the next big commit you see from me if I can get around to it.
Grant Ingersoll:
Can you tell us how CNET and CBS Interactive are leveraging Solr today?
Chris Hostetter:
We use Solr for a variety of different things. I don’t know the specific numbers off of the top of my head, but I would say most search boxes that you’ll encounter when you’re dealing with a CNET property – sorry. Most by far, probably close to 90 percent of the search boxes you are encountering are powered by Solr. We tend to do a lot of what we call meta-searching where we are searching multiple indexes at the same time, but not attempting to blend those results in any way. We are just presenting them as discrete search result sections on the page, but again those are almost all individually powered by Solr. We use the dismax request handler and the dismax parser for dealing with most of our text searching. So when you search articles and blogs and things like that, that’s pretty much what powers that across the board. We do have some custom plug-ins, custom request handlers, for dealing with things like product search, to make that faceted browsing and faceted searching work using our catalog metadata. As far as how we deploy it, we use the basic master/slave replication strategies, implemented using the snappuller scripts. We tend to have one master for every index with a hot standby that’s really just kind of sitting there in case the first machine goes down because it’s a little bit of extra redundancy. Then we just go for as many slave machines as we need for capacity planning. We tend to over-plan for capacity and then just replicate out with a load balancer in front of all those slaves, but we tend to keep each index, each sort of tier, on its own set of boxes largely because it lets us scale up and out kind of independently. Does that make sense?
Grant Ingersoll:
Right.
Chris Hostetter:
Then we also – it’s pretty easy for us to reallocate boxes, to take a box out of one tier and put it in another tier and completely rebuild it from the OS up. So even with this kind of approach we have a lot of flexibility to say, “Oh, for some reason news searching is not performing as well as we want. Take a box out of download searching because it’s doing fine.” For the most part, from an end user perspective all the way to the back end, the request goes to a front-end web server, which proxies to an application server. The application server takes care of the front-end presentation issues, but it makes a request through a load balancer to a Solr server, which is receiving a copy of the index replicated at regular intervals. What those intervals are depends on the content. News is much shorter, on the order of a minute or two, compared to videos, which might be ten minutes. That all comes from a master, which was updated as a result of the publishing process through the CMS.
Grant Ingersoll:
Going back a little bit, early on you mentioned the dismax handler, dismax request handler. Tell us a little bit about the use case of why dismax vs. the standard query parser that comes with Solr?
Chris Hostetter:
There are some vernacular problems. I call it the dismax handler because that’s how it got started. Really what we are talking about at this point is the dismax query parser, which really is exposed through I think – if I remember the name right in the API, it’s a QParserPlugin. What dismax is really all about is the original Lucene query parser, which was the only syntax Solr supported when it first started. That was really designed to kind of be a starting query parser in the Lucene project, a sort of bare bones kind of basic support for doing simple query parsing, but it kind of covered a good spectrum of what functionality Lucene could handle in terms of phrase queries and range queries and prefix queries and term queries. The problem being that it expects a very specific syntax and it generates parse errors if you break that syntax; things like forgetting to close a quote or forgetting to escape a character inside of a range. Those also tend to be things that 99 percent of the people out there typing into a search box don’t expect or in many cases don’t want. What we tried to do with dismax, that was about scratching an itch to say, “Let’s write a query parser that understands what people really mean when they type into a search box,” and the goal, although it doesn’t really reach that goal as well as I’d like, is to never generate an error; to recognize things like a plus sign in front of something means that it’s mandatory and a minus sign in front of something means that it shouldn’t be there at all. But if somebody only has a single double quote, don’t generate a parse error about incomplete phrase descriptions. Instead, just treat it as a literal. If there is an even number of quotes, then treat it as a phrase, that kind of thing. So that was kind of one of the original motivations of writing a new query parser in the first place.
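The "never generate an error" goal described above can be sketched in a few lines of Python. This is an illustration of the rules as stated in the interview (a leading plus means mandatory, a leading minus means prohibited, balanced quotes form a phrase, an unbalanced quote is kept as literal text), not the actual dismax implementation; the function name and the tuple layout are made up for the example.

```python
import re

# A clause is an optional +/- operator followed by a quoted phrase or a bare word.
_CLAUSE = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')
# Fallback used when quotes are unbalanced: operator plus bare word only.
_WORD = re.compile(r'([+-]?)(\S+)')


def lenient_clauses(user_input):
    """Split raw search-box text into (operator, text, is_phrase) tuples.

    Never raises on odd input: if the number of double quotes is odd,
    quotes are left in place as ordinary characters instead of being
    treated as phrase delimiters.
    """
    if user_input.count('"') % 2 == 1:
        return [(op, word, False) for op, word in _WORD.findall(user_input)]

    clauses = []
    for match in _CLAUSE.finditer(user_input):
        op, phrase, word = match.groups()
        if phrase is not None:
            clauses.append((op, phrase, True))
        else:
            clauses.append((op, word, False))
    return clauses
```

With balanced quotes, `lenient_clauses('+solr -"old version" lucene')` yields a mandatory term, a prohibited phrase, and an optional term. With a stray quote, as in `'"lucene in action'`, the input simply falls back to three plain words, the first of which still carries the literal quote character.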
The second motivation was at the time one of the really common things for people to do when they wanted to search across many fields – for example to say, “All of my articles have a title and description and a body and an author and I need to search for the word ‘Chris’ in any of those fields, how do I do that?” Two or three years ago, the way people did that is they made a field called catchall or text and they would copy all that text into that one field. If they wanted to make a title worth more, they would search text and they would also, in a Boolean query, search the title field. You would get weird score behaviors there in terms of what happens with the Boolean query when some things aren’t found in one field or another. You’d also get weird behavior where certain documents would score higher than they really should because of the way a Boolean query would deal with the same word in multiple fields. So what the dismax parser does is it really – the best way I describe it is saying that it’s doing matrix operations on the set of all fields, which is something that you would define as the person maintaining the Solr index, and the set of terms provided by the user. So if I search for “Erik Hatcher Lucene in Action” for example, as a user I just type in those words in a text box. As a Solr administrator, I say when someone is using the dismax handler here are the fields that I want them to search across. Here is how important they are. So I might say the title field is twice as important as the body field and the author field is worth 1.5, somewhere in the middle of body and title. What the dismax parser does is it says, “Okay, let’s take all the words the user typed, ‘Erik Hatcher Lucene in Action’ and let’s sort of take the cross product of those terms with those fields. So we’ll look for Erik in all of those fields. We’ll look for Hatcher in all of those fields. 
We’ll look for Lucene in all of those fields.” But the query structure it builds takes advantage of something called a disjunction max query, which lets us say, “Only make the word ‘Erik’ worth as much as it is in the highest scoring field.” So even if Erik appears in the body and the author of this document, if we’ve said that the author field is worth more and this particular term, because of its TF-IDF, is really worth a lot in that field compared to the body field, then that’s where the dominant part of the score for the term Erik is going to come from. We’re not going to let an overpopulation of that term across all fields really over-influence the score. So it gets kind of complicated. It gets kind of hairy when you get down into the specifics of the scoring. Looking at the debug output for a dismax query can definitely make your eyes bleed if you are not used to it.
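The terms-by-fields idea above can be made concrete with a small Python sketch. The field boosts (title 2.0, author 1.5, body 1.0) come from the example in the interview; everything else is assumed for illustration: the per-field term scores are made-up numbers standing in for whatever TF-IDF weight Lucene would compute, and the function names are hypothetical. Real scoring also involves normalization factors and an optional tie-breaker, both left out here.

```python
FIELD_BOOSTS = {"title": 2.0, "author": 1.5, "body": 1.0}

# Made-up raw scores for one document: field -> term -> score.
DOC_SCORES = {
    "title": {"lucene": 0.5, "action": 0.5},
    "author": {"erik": 1.0, "hatcher": 1.0},
    "body": {"erik": 0.5, "lucene": 0.5},
}


def boosted_scores(doc_scores, field_boosts, term):
    """Score of one term in every configured field, with the field boost applied."""
    return [boost * doc_scores.get(field, {}).get(term, 0.0)
            for field, boost in field_boosts.items()]


def dismax_score(doc_scores, field_boosts, terms):
    """Per term, keep only the best-scoring field; then add the terms up."""
    return sum(max(boosted_scores(doc_scores, field_boosts, t), default=0.0)
               for t in terms)


def summed_score(doc_scores, field_boosts, terms):
    """Naive alternative: every field a term appears in adds to the score."""
    return sum(sum(boosted_scores(doc_scores, field_boosts, t))
               for t in terms)


TERMS = ["erik", "hatcher", "lucene", "action"]
print(dismax_score(DOC_SCORES, FIELD_BOOSTS, TERMS))   # 5.0
print(summed_score(DOC_SCORES, FIELD_BOOSTS, TERMS))   # 6.0
```

In this toy data "erik" appears in both author and body. The summed variant counts both occurrences, while the max variant credits only the author match, which is the "don't let an overpopulation of that term across all fields over-influence the score" behavior.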
Grant Ingersoll:
(Laughter)
Chris Hostetter:
It does get very big and very complicated very fast, but the end result is attempting to really make the query more significant for that common case of just people typing in words, the user saying, “These are the words I care about,” and the Solr administrator saying, “These are the fields that I care about.” It’s trying to sort of meld those two concepts and have them meet in the middle with something meaningful. Where it doesn’t work very well is when I really want my users to be able to type in range queries. It doesn’t support that syntax.
Grant Ingersoll:
Ahh.
Chris Hostetter:
Or I really want my users to be able to type in prefix queries. It doesn’t support that syntax either because those syntaxes don’t work well with that kind of matrix multiplication of terms vs. fields.
Grant Ingersoll:
Gotcha, gotcha. Yeah, that’s great. I mean you explained some things to me that I didn’t even understand about dismax and I’ve used it pretty frequently. I think that’s about all I have unless you have any other things that particularly strike you that you want to talk about Solr and Lucene?
Chris Hostetter:
No, I mean the biggest thing is, like I say, one of the questions you asked me earlier was about things that people don’t take advantage of enough. I think one thing that people don’t take advantage of enough is really just trying it. I can definitely remember quite a few people who have posted a question in the mailing list saying, “How does Solr handle this or can Solr handle this?” and in some cases they are very meaningful and insightful questions about some very fundamental or very nuanced aspect of using Solr; but, in a lot of cases, they are things where if somebody had just downloaded it and gone through the tutorial, they would have really seen not only can Solr do what they want, but it can do a lot more. The tutorial – again, this is another one of those documentation problems. The tutorial doesn’t even get that deep, but it really is trivial to download Solr and with one line have Solr up and running. One command in your shell, Solr is up and running and you can view it in a browser. A second command in your shell and you’ve got indexed data that you can now browse and look at. That’s a pretty easy way to really see just the bare bones basic features of Solr up and running that I think not enough people try. They hear about Solr and they say, “Oh, maybe I’ll check into that,” but I think people assume it takes more effort than it really does just to get it started.
Grant Ingersoll:
Yeah, I couldn’t agree more.
Chris Hostetter:
Yeah, that’s the main thing that I can throw out there as something people should look into.
Grant Ingersoll:
That whole java -jar start.jar right on the command line is so nice and easy, and it’s up and running.
Chris Hostetter:
Yeah.
Grant Ingersoll:
You can hit it with your web browser.
Chris Hostetter:
That’s true, yeah, and the second command, java -jar post.jar *.xml, loads all the data in.
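Once those two commands have run, "hitting it" is just an HTTP GET. The short Python sketch below only builds the query URL. The host, port 8983, and the `/solr/select` path are assumptions about the bundled example setup of that era and are not stated in the interview, and the helper name is made up; adjust them to match your own install.

```python
from urllib.parse import urlencode


def solr_select_url(query, base="http://localhost:8983/solr/select", **params):
    """Build a search URL: the user's text goes in `q`, extras are passed through."""
    return base + "?" + urlencode({"q": query, **params})


# Paste the result into a browser, or fetch it with any HTTP client.
print(solr_select_url("lucene in action", wt="json"))
```

Because the interface is a plain URL returning a structured response, the same request can be made from PHP, a shell script, or a browser, which is the point made at the top of the interview.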
Grant Ingersoll:
I’ll provide links to the tutorial and kind of starting up examples of that. Well, that’s great Chris. I’d like to thank you once again for joining us and talking about Solr and Lucene. I think our listeners will find this really useful. Thank you.
Chris Hostetter:
Yeah, no problem. I look forward to seeing people on the mailing list.

Resources

Here are some helpful resources related to Chris' interview:

  • Solr

  • Solr Wiki

  • Lucene

  • Interview with Ryan McKinley

Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2012 Lucid Imagination. All rights reserved.