Grant Ingersoll Talks with Sammy Yu
Grant talks with Sammy Yu of Digg.com, the social news website. Sammy discusses how Digg.com uses Solr for everything from deduplication of recommended stories to faceting, how they manage a high rate of index updates from new content, and what they are looking ahead to in trie-range support.
"With Solr, we really treat it as kind of a platform where we can build other kind of things on top of it… We have a very valuable set of data, and we really want to explore new ways of building new features from that data set."
Transcript
- Grant Ingersoll
-
Today I'm speaking with Digg.com senior software engineer Sammy Yu. Sammy is responsible for Digg's use of Solr. Welcome, Sammy, and thank you for taking the time to speak with me.
- Sammy Yu
-
Thank you for having me.
- Grant Ingersoll
-
Why don't we start off by having you introduce Digg.com, tell our listeners a little bit about what Digg is, and what it offers, and then also some of your background and your role at Digg.
- Sammy Yu
-
Digg.com is a social news website that allows people to discover and share diverse content from traditional media sites like the New York Times as well as individual blogs. Our community "diggs" (basically that's like a vote) various content, based on their preference, and we're able to surface stories that are popular or of interest to them.
A little bit about myself: I'm basically responsible for the search infrastructure at Digg.com. Previously, I worked on enterprise management software at Hewlett-Packard. I earned my master's degree in computer science from the University of Illinois at Urbana-Champaign.
- Grant Ingersoll
-
I imagine on a site like yours where you've got lots of content, search plays a key role. Can you tell us how you used Solr at Digg to enable search?
- Sammy Yu
-
Yeah. Certainly. As you said, we have a lot of content, and a lot of content is actually missed by our users, or if they wanna share stories with other users, it really becomes critical for our users to be able to find those stories. So, that's kinda one aspect of search, but another aspect is that if you look at individual story pages, which we call permalink pages, we wanna be able to surface other content that is perhaps of interest. So, we'll actually make inline searches within those permalink pages, perhaps for stories that have similar keywords to the story that you're currently looking at, as well as stories from the same source. So, we have this feature called related by source: if you look at a story from the New York Times, perhaps you might also be interested in other stories from the New York Times. And lastly, we also kind of expose this through our API interfaces for other developers that may wanna build on top of our content.
- Grant Ingersoll
-
Great. So, for the related content, are you using something like Solr's MoreLikeThis, or is this something that you guys have done on your own?
- Sammy Yu
-
So, we are actually doing something externally in terms of figuring out the keywords, but we're actually sending those keywords back to Solr to get those stories. We are actually in the process of using MoreLikeThis for an enhanced feature that we're adding to our dupe detection. Basically, as you can imagine, if there's breaking news out there, all of a sudden there could be a lot of users submitting the exact same story. So, we're gonna launch a new feature that basically has two paths. We'll have a kind of internal algorithm that says, "Hey, is this the same story?" but then we'll kinda use the MoreLikeThis handler to say, "Hey, these stories are very similar, are you sure you're not submitting a dupe?"
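Illustrative sketch (not part of the conversation): one way the second path Sammy describes could be expressed as a request to Solr's MoreLikeThis handler. The handler path, host, and field names (`title`, `description`) are assumptions made for the example, not Digg's actual configuration; the `mlt.*` parameters are standard MoreLikeThis options.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, story_text, fields=("title", "description"), rows=5):
    """Build a MoreLikeThis request URL for a newly submitted story.

    base_url is a hypothetical handler location such as
    http://localhost:8983/solr/mlt; the field names are placeholders.
    """
    params = {
        "stream.body": story_text,   # text of the story being submitted
        "mlt.fl": ",".join(fields),  # fields to mine for interesting terms
        "mlt.mintf": 1,              # minimum term frequency in the source text
        "mlt.mindf": 1,              # minimum document frequency in the index
        "rows": rows,                # how many similar stories to return
        "fl": "id,title,score",
    }
    return base_url + "?" + urlencode(params)
```

The returned stories would then be shown to the submitter as possible duplicates rather than rejected outright, which matches the "are you sure you're not submitting a dupe?" prompt described above.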
- Grant Ingersoll
-
Yeah. That makes sense. So, on the dupe detection you – I imagine you have some notion of fuzzy duplicate detection whereby a story has to meet some threshold in order for it to be considered a duplicate?
- Sammy Yu
-
Yeah. It's not an exact science. As you know, with TF/IDF scores you can't really compare one result to the next result, but we try to see whether there is a minimum number of words that a story matches, and we suggest those as possible duplicate stories.
- Grant Ingersoll
-
This is actually one of the new features that'll be coming out in Solr 1.4: there is a plug-in mechanism for doing duplicate detection, and it can either do exact duplicate detection or even introduce a fuzziness component. So, it'll be interesting to see if your stuff would fit into that model as well.
- Sammy Yu
-
Yeah. Definitely. We're actually very excited about some of the 1.4 features, particularly in terms of scalability and some of the trie support, 'cause as a social news site we have to deal with a lot of date queries, being that news is really tied to the freshness of a story. We really want to be able to filter certain things within a certain date range, and it becomes critical for us to return those results very quickly. Today we're actually playing some tricks: creating specific fields that don't have as much granularity, so we can get those search results back more quickly.
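Illustrative sketch (not part of the conversation): the reduced-granularity trick can be pictured as indexing the same submission time into extra fields rounded to the hour or the day. A range filter on the rounded field has far fewer distinct terms to walk through than one on the full timestamp. The field names here are hypothetical.

```python
from datetime import datetime, timezone


def coarse_date_fields(submitted):
    """Return the full timestamp plus hour- and day-rounded copies.

    submitted must be a timezone-aware datetime. Field names are
    placeholders for illustration only.
    """
    utc = submitted.astimezone(timezone.utc)
    return {
        "submit_date": utc.strftime("%Y-%m-%dT%H:%M:%SZ"),  # full precision
        "submit_hour": utc.strftime("%Y-%m-%dT%H:00:00Z"),  # rounded down to the hour
        "submit_day": utc.strftime("%Y-%m-%dT00:00:00Z"),   # rounded down to the day
    }
```

A "last 30 days" filter would then target `submit_day`, where a year of stories produces at most 366 distinct values.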
- Grant Ingersoll
-
So, we've kinda talked a little bit about some of your setup. Obviously, you have a lot of documents. Can you give us some round numbers, about number of documents in your index, and what kind of maybe query rate you see? How many different searches per second you're hitting?
- Sammy Yu
-
So, we have approximately 13 million documents in our index, which is around 8 gigabytes on the physical disk. Our documents themselves have quite a few fields, 'cause a lot of these fields are really related to how we deal with relevancy. Everything's basically running in the typical master-slave setup. We have approximately ten slaves. We don't need that many slaves; we just overprovision these things for failover scenarios and to handle certain surges in traffic. Basically, all these slaves are behind a hardware load balancer, where we have some HTTP caching and which also does load balancing among the slaves. We are seeing approximately 4.8 million queries a day. Not all of this is related to site search specifically, but as I mentioned before, we have related by source and related by keyword, and we also have another thing I didn't mention before, related by search.
So, if you are on a search engine like Google, and you search for a story that's on Digg, we'll actually take those search query terms, generate another search query internally, and show those results inline. And then finally, as I mentioned before, there's our API. So, approximately 4.8 million queries a day.
- Grant Ingersoll
-
Oh, very nice. So, popularity is obviously a very important thing, and you mentioned relevancy a little bit. So can you tell us how you use Solr to solve your relevance problems?
- Sammy Yu
-
So, being a social news website, one thing that I learned very quickly is that recency is very important in most scenarios: you wouldn't really care about stories that are maybe four or five years old. They simply aren't relevant anymore. Obviously freshness, when the story was submitted, plays a very important factor, but there are other factors at play here. We take advantage of crowd-sourcing. Our community is basically voting on stories, or burying the stories if they don't like them, and once a story has enough diggs (I'm over-simplifying), it becomes what we call popular. So, in turn, we're using the community's signal: the more popular a story is, the higher it will be ranked in the search results, and the digg count itself counts as well.
So, we're basically taking advantage of the weight factor, and we apply these weights mostly at query time, using the basic weight. But for things like digg count and date created, you can't just use a regular weight. We're using the function query feature to actually weight those things accordingly.
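Illustrative sketch (not part of the conversation): function queries let a numeric field contribute to the score through a formula instead of a fixed weight. The Python below mimics the shape of Solr's `recip(x, m, a, b) = a / (m*x + b)` function for recency and a logarithm for digg count. The constants, the additive combination, and the function names are assumptions chosen for the example, not Digg's actual formula.

```python
import math


def recip(x, m, a, b):
    """Same shape as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)


def boosted_score(text_score, age_days, diggs,
                  recency_weight=2.0, digg_weight=0.5):
    """Combine a text relevance score with recency and popularity boosts."""
    # 1.0 for a brand-new story, 0.5 at 30 days, decaying smoothly after that.
    recency = recip(age_days, 1.0, 30.0, 30.0)
    # Log damping: going from 10 to 100 diggs matters as much as 100 to 1000.
    popularity = math.log10(1 + diggs)
    return text_score + recency_weight * recency + digg_weight * popularity
```

The damping is the reason a plain per-field weight does not work well here: a raw digg count or timestamp would swamp the text score.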
- Grant Ingersoll
-
This obviously leads to the fact that you probably have a very high-update environment. How are you guys managing your updates? As people digg an item, does that document then need to be updated and re-indexed, or is it factored in some other way?
- Sammy Yu
-
Yeah, certainly. So, basically, most of our store today is in MySQL, and we basically have a field in our stories table that says, hey, has this item been updated? Periodically we'll scan for rows within that table that say, hey, I need to be updated. If so, we'll basically batch those updates into a single transaction and send it over to the Solr master. Of course this isn't quite real time; we may see typical delays of five to seven minutes. We have a basic cron job that hits the database and asks, hey, is there anything that is dirty? There is also the master-slave replication, and other processes.
So, we try to minimize that delay as much as possible. A lot of the trick is measuring how long each step takes on average and trying to skew the clocks so that the interval is minimized.
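Illustrative sketch (not part of the conversation): the dirty-flag polling loop described above, reduced to its core. SQLite stands in for MySQL so the example is self-contained, and the table layout, column names, and the `send_batch` callback (which would post one update to the Solr master) are all hypothetical.

```python
import sqlite3


def collect_dirty_batch(conn, limit=1000):
    """Read stories whose dirty flag is set, oldest id first."""
    rows = conn.execute(
        "SELECT id, title, diggs FROM stories WHERE dirty = 1 "
        "ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    return [{"id": r[0], "title": r[1], "diggs": r[2]} for r in rows]


def sync_once(conn, send_batch, limit=1000):
    """One cron tick: batch dirty rows, send them, then clear the flags."""
    docs = collect_dirty_batch(conn, limit)
    if not docs:
        return 0
    send_batch(docs)  # a single update request instead of one per story
    conn.executemany(
        "UPDATE stories SET dirty = 0 WHERE id = ?",
        [(d["id"],) for d in docs],
    )
    conn.commit()
    return len(docs)
```

A production version would also need to handle a story being dugg again between the read and the flag reset, for example with a version column; that is left out here.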
- Grant Ingersoll
-
I've noticed that on the site you also have some faceting and, obviously, sorting by dates. Can you tell us some more about what you guys are doing in that regard?
- Sammy Yu
-
I think it's very nice to be able to search and always present the right result, the result that users are looking for, as the first result. But in case that doesn't happen, we like to allow the users to drill down and refine the results, and that's been a really nice feature in Solr that we've been taking advantage of. We have a lot of stories and it's very hard to figure out the right magic formula that works for everything, 'cause as you know I can tweak the weights one way and it may work for certain queries, and completely break other queries. So, it's nice to have this fallback mechanism of having the facets to allow them to refine the results. Furthermore, I think it brings another aspect that is of interest to users in terms of discovery. It really allows them to discover things. Okay, maybe stories about the iPhone happened at this time, the iPhone was released on this date, and you can see those things. We actually have this facet in our search page where we draw a graph of when the stories show up, and basically we're just using facet queries on those dates and months to display that graph.
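Illustrative sketch (not part of the conversation): a stories-per-month graph can be driven by one `facet.query` parameter per month, each a date range on the submission field. The field name and the helper functions are hypothetical; `facet`, `facet.query`, and the `[start TO end]` range syntax are standard Solr.

```python
def month_starts(year, month, count):
    """Return count + 1 month-boundary timestamps, starting at year-month."""
    bounds = []
    for _ in range(count + 1):
        bounds.append("%04d-%02d-01T00:00:00Z" % (year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return bounds


def monthly_facet_params(query, date_field, year, month, count):
    """Build request parameters with one facet.query per calendar month."""
    bounds = month_starts(year, month, count)
    params = [("q", query), ("rows", "0"), ("facet", "true")]
    for start, end in zip(bounds, bounds[1:]):
        # Inclusive on both ends, so a story stamped exactly at midnight on
        # the first of a month is counted in two adjacent buckets.
        params.append(("facet.query", "%s:[%s TO %s]" % (date_field, start, end)))
    return params
```

Each `facet.query` comes back with its own count, which gives one bar of the graph per month.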
- Grant Ingersoll
-
Yeah. That's a really handy feature to have.
- Sammy Yu
-
Definitely. And I think since we've re-launched with Solr 1.3 and introduced the weighting scheme that I talked about and the faceting features, we've gotten a lot of positive feedback. People are saying, "Hey, search actually works. I can find the stories that I'm looking for." For a while before that we were using Solr 1.2, but really, we weren't taking advantage of the boosting features and some of the faceting features.
Another thing I would like to mention is that we actually have some customization. Solr out of the box is really effective; it gets you up and running really fast. But we had to do a little bit of tuning in terms of figuring out how to deal with specific queries. One of the things that we deal with is searching for stories from a specific host, and it really becomes important for us to tokenize these things in a very special way. Imagine you have a story from, say, blogs.zdnet.com. When it comes in it's actually a full-length URL, and we break it up into pieces. What we do is reverse the individual parts of the domain, so it becomes like topleveldomain.subdomain, and the first part, the www, is dropped. So going back to the example blogs.zdnet.com, we generate two tokens, com.zdnet and com.zdnet.blogs. This allows us to avoid the more expensive wildcard queries for postfix queries. A lot of times www is not relevant. This is very important 'cause a lot of our API users, as well as SEO consultants, are hitting our website with a lot of these queries. They want to look for stories from a specific domain, and that's one of the most common search queries that we have coming in from the API. Being able to build it like that really helps us get good query times.
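Illustrative sketch (not part of the conversation): the host tokenization Sammy describes, written as a plain function. The function names are hypothetical, and the real implementation is presumably a Solr analysis component rather than Python; the outputs for blogs.zdnet.com match the two tokens given above.

```python
from urllib.parse import urlsplit


def host_tokens(url):
    """Turn a full URL into reversed-domain tokens.

    A leading www is dropped, the labels are reversed, and one token is
    emitted per prefix of at least two labels.
    """
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    labels.reverse()
    return [".".join(labels[:n]) for n in range(2, len(labels) + 1)]


def host_query_token(domain):
    """Reverse a user-supplied domain so it can be matched as an exact term."""
    labels = domain.lower().split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    return ".".join(reversed(labels))
```

A search for stories from zdnet.com becomes an exact term lookup on `com.zdnet`, which also matches stories from blogs.zdnet.com, with no leading-wildcard query such as `*.zdnet.com`.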
And I should mention, actually, we just exposed our search through our API last week. Having this Solr infrastructure really allowed us to do that, 'cause previously we were very concerned about, hey, would this be able to scale? And we haven't had any issues with Solr. In fact, we've been very impressed with how stable it is and how it really has been able to scale to our size.
- Grant Ingersoll
-
That's great. So in a lot of ways – I mean Solr is – it sounds like Solr has really enabled Digg to kind of take your game to the next level, right?
- Sammy Yu
-
Yeah. Yeah. Most definitely. I think search – we treat search not just as the typical way you look in terms of thinking, okay, I want to search for information from our site. But we really treat it as kind of a platform where we can build other kind of things on top of it -- really extracting value out of our data. We have a very valuable set of data, and we really want to explore new ways of building new features from that data set.
- Grant Ingersoll
-
That's great. Yeah. I often see with people that – by building on top of something open source like Lucene or Solr that it – it really frees you up to innovate in a lot of ways because you don't have to worry about those additional license fee costs if you need some new module or some new capability. You can just go and try it out and see how it works. So like adding more like this or adding some other feature, it can just work and you can then try it out and see how it fits into your business.
- Sammy Yu
-
Yeah. That's the exact advantage. I mean, at Digg we are a big open source shop and we try to avoid commercial things as much as possible. And really, another part of that is, I think, the Solr/Lucene community is quite active, you know, just looking at the mailing list. Even though it's open source, it's really a big community. If you post your questions to the mailing list, you get a response relatively quickly. And a lot of people have built solutions on top of Solr, and really, you can build on top of their experience and learn a lot of those lessons. Before this I did not have a lot of experience in the search field, and I have really been able to get up and running in a relatively short amount of time.
- Grant Ingersoll
-
Yeah. I think we hear that from a lot of people who don't necessarily know a lot about search coming in. They're smart people, they understand how to program, and they get Solr, add it to their site, and then can iterate. They have a good out-of-the-box experience, and then they start to see all the things that it can open up to them. So, I think your experience reflects a lot of people's. In fact it reflects my own experience; that's how I started too, not knowing much about search way back when, and trying out Lucene and some other tools, and the rest is history.
Well, Sammy, that's about all the questions I have for you. If there's anything else you feel like adding about how Digg takes advantage of Solr, feel free to jump in.
- Sammy Yu
-
Yeah. Sure. As I mentioned, we're really looking forward to Solr 1.4. I think the trie support will really help us scale. We've done some tests internally using Solr 1.4, and we have been really impressed by it, and by some of the stuff that I think you have been working on, Grant, in terms of using Carrot2 for clustering the results. I think that holds a lot of promise for us, in that we're really looking at building other sorts of features on top of our data.
So, yeah, overall we've been a very happy user of Solr, and we're looking forward to the new features in Solr 1.4, and to really taking it forward from there.
- Grant Ingersoll
-
For the listeners out there, the Carrot2 stuff that Sammy mentioned is in 1.4. It's some new clustering capabilities, or what some people call dynamic faceting, whereby results are grouped not according to some predefined field values as in faceting; instead it actually takes a look at the documents that are in the search results and tries to group them according to the information contained in the documents, and none of it is predefined. So, that's gonna be in Solr 1.4. Also, the trie, or as people sometimes call it, trie range, capabilities allow you to do numeric range queries faster, with more capabilities around numerics and dates and things like that. For you guys, I imagine the date capability is quite important.
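Illustrative sketch (not part of the conversation): the idea behind trie range queries is to index each number at several precisions, so a wide range can be covered by a handful of coarse prefix terms plus a few fine-grained terms at the edges. The function below splits a range that way for small non-negative integers. It loosely follows the approach used in Lucene's trie code, but the 16-bit width, the precision step of 4, and the function names are assumptions for the example.

```python
def split_range(lo, hi, precision_step=4, bits=16):
    """Cover [lo, hi] with (shift, first, last) prefix ranges.

    Assumes 0 <= lo <= hi < 2**bits. Each tuple stands for the indexed
    terms first..last at the precision where the low `shift` bits are
    dropped.
    """
    ranges = []
    shift = 0
    while True:
        diff = 1 << (shift + precision_step)
        mask = ((1 << precision_step) - 1) << shift
        has_lower = (lo & mask) != 0     # ragged edge at the low end
        has_upper = (hi & mask) != mask  # ragged edge at the high end
        next_lo = ((lo + diff) if has_lower else lo) & ~mask
        next_hi = ((hi - diff) if has_upper else hi) & ~mask
        if shift + precision_step >= bits or next_lo > next_hi:
            # Nothing left to cover at a coarser level: emit what remains.
            ranges.append((shift, lo >> shift, hi >> shift))
            break
        if has_lower:
            ranges.append((shift, lo >> shift, (lo | mask) >> shift))
        if has_upper:
            ranges.append((shift, (hi & ~mask) >> shift, hi >> shift))
        lo, hi = next_lo, next_hi
        shift += precision_step
    return ranges


def covered_values(ranges):
    """Expand prefix ranges back into the set of values they match."""
    values = set()
    for shift, first, last in ranges:
        values.update(range(first << shift, (last + 1) << shift))
    return values
```

For the range 0x12 to 0x345 this visits a few dozen terms across three precision levels, where a plain range query over full-precision terms would have to visit up to 820.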
- Sammy Yu
-
Yeah. Yeah. That's one of the – that's what's getting us very excited about Solr 1.4.
- Grant Ingersoll
-
Great. Well maybe I'll be able to follow up with you after you – after 1.4 is out, and you've deployed it and we can talk about some of the improvements that you've made.
- Sammy Yu
-
Yeah. Yeah. Most certainly.
- Grant Ingersoll
-
Alright, Sammy, well I appreciate you taking the time to talk with me.
- Sammy Yu
-
Thank you for having me.
- Grant Ingersoll
-
Once again, that's Sammy Yu, senior software engineer at Digg.com. Thanks, Sammy.
- Sammy Yu
-
Thanks.


