Interview with Doug Cutting
Lucene, Nutch and Hadoop creator Doug Cutting speaks with Lucid's Grant Ingersoll about Lucene's history, features and use cases. Doug also brings Grant up to speed on what is going on with Apache Hadoop (his latest project) and explains how he came up with the names for Lucene, Nutch and Hadoop.
- Grant Ingersoll:
- Today I’m speaking with Doug Cutting, creator of the Apache Lucene project. Welcome Doug and thank you for speaking with me.
- Doug Cutting:
- Well thanks for having me, Grant.
- Grant Ingersoll:
- Why don’t we start off by having you introduce yourself and your background?
- Doug Cutting:
- Sure. My name is Doug Cutting, as Grant already covered. I originally wrote Lucene a little over ten years ago; maybe 11 years ago; something like that. Before that I’d written a number of search engines and been involved with search for a while. I started out working at Xerox PARC, learning information retrieval. Worked at Apple for a while; wrote some search technology there, which I believe is still in use. Then went to Excite when web search started to happen and worked on web search at Excite for a few years. In my spare time during some lulls in the internet boom, I wrote Lucene, in part to teach myself Java and in part because I thought there was an opportunity there to have a search engine that was written in Java, a search library. It sat on the shelf for a few years, and then I turned it into an open source project around 2000 and it went to Apache in 2001. Maybe I’m gettin’ ahead of myself. Do you want Lucene history?
- Grant Ingersoll:
- No; that’s fine. We can talk a little bit more about Lucene. Do you remember what the really difficult part in writing it was? You said you’ve done it a number of times, but did anything strike you with Lucene as really cutting edge, or really hard at the time?
- Doug Cutting:
- No; I wouldn’t say it was difficult. It actually went pretty quickly. I wrote the first version of Lucene in, I think, three months working two days a week on it, but that was because I’d thought for a long time in advance about how I would do it if I were to start again and write a search engine from scratch. I’d written search engines from scratch a few times at Xerox as I was learning about search technology, and again at Apple. At Excite I’d rewritten most of the search technology there. So it was something I’d done a few times. Then I thought through this a fair amount. Based on those other ones I came up with a design which was different. So I think there’s something about Lucene’s design which is different from most search engines that came before, or at least the few that I’d been involved with, but I don’t know that it was necessarily harder for me. The hard part was, I suppose, all the thought work that went in before I started writing it.
- Grant Ingersoll:
- So is it different in terms of the APIs or compression or any of those things?
- Doug Cutting:
- To me the central idea, or at least one of them, was the way the index is structured; that it works by copying and merging segments. So at least theoretically what happens is you build indexes for single documents and then you merge them into indexes for larger sets of documents. You basically have two fundamental operations: one is to build an index for a single document, and the other is to merge a couple of indexes, or some number of indexes. Using the two operations you can maintain an index. The way that you apply those operations over time was also sort of a novel approach that I hadn’t seen done before, so that you can keep a minimum number of indexes around at a time to search, to make searches go pretty fast, while not spending a lot of time doing merging. So you want it to have the same order as sorting algorithms. You want it to effectively be n log n, like a sorting algorithm, but at the same time, at all these intermediate states before you’ve added the last document, keep it searchable. Finding that compromise was something I hadn’t seen done before, at least not in the particular way that Lucene does it.
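The two operations Doug describes (index one document, merge several indexes) can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not Lucene's actual code or file format: the names `build_segment`, `merge_segments`, `TinyIndex` and the `merge_factor` value of 3 are hypothetical, and the merge rule shown (merge whenever the newest few segments have the same size) is one simple policy that keeps the number of segments logarithmic in the number of documents while leaving the index searchable after every add.

```python
from collections import defaultdict


def build_segment(doc_id, text):
    """Operation 1: build an index (term -> sorted doc ids) for a single document."""
    terms = set(text.lower().split())
    return {"size": 1, "postings": {term: [doc_id] for term in terms}}


def merge_segments(segments):
    """Operation 2: merge several segments into one larger segment."""
    merged = defaultdict(list)
    for seg in segments:
        for term, doc_ids in seg["postings"].items():
            merged[term].extend(doc_ids)
    return {
        "size": sum(seg["size"] for seg in segments),
        "postings": {term: sorted(ids) for term, ids in merged.items()},
    }


class TinyIndex:
    """Hypothetical toy index that is maintained using only the two operations above."""

    def __init__(self, merge_factor=3):  # merge_factor is an assumed, illustrative value
        self.merge_factor = merge_factor
        self.segments = []

    def add(self, doc_id, text):
        self.segments.append(build_segment(doc_id, text))
        # While the newest merge_factor segments all have the same size, merge them.
        # Each document is therefore re-merged only about log(n) times in total.
        while len(self.segments) >= self.merge_factor:
            tail = self.segments[-self.merge_factor:]
            if len({seg["size"] for seg in tail}) != 1:
                break
            self.segments[-self.merge_factor:] = [merge_segments(tail)]

    def search(self, term):
        # The index is searchable at every intermediate state: query each segment.
        hits = []
        for seg in self.segments:
            hits.extend(seg["postings"].get(term.lower(), []))
        return sorted(hits)


if __name__ == "__main__":
    index = TinyIndex()
    for i in range(10):
        index.add(i, "lucene segment" if i % 2 == 0 else "plain segment")
        print(i, [seg["size"] for seg in index.segments], index.search("lucene"))
```

Running the script prints the segment sizes after each add, which shows how few segments a search has to consult at any moment even though no single large rebuild ever happens.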
- Grant Ingersoll:
- So you must have done a lot of performance testing, too, as you went through and built this out or after the fact.
- Doug Cutting:
- Oh, I definitely did some as I went. Started out at the beginning doing some micro-benchmarks of the basic data structures, trying to figure out how to do the indexing, and then also for searching. There were definitely benchmarks as I went, trying to keep up the performance. Since then people have done a lot more; it’s gotten faster and faster through the years.
- Grant Ingersoll:
- To your credit it’s amazing how much of what you built ten plus years ago is still central and crucial to Lucene underneath the hood. A lot of the algorithms and all that are still intact, and even though it’s evolved and gotten faster the central ideas are still there.
- Doug Cutting:
- There were a couple of simple ideas. I think it also helped that I came to building a smaller scale search technology in Lucene from doing large scale web search. I knew what scaled well as you went up and what didn’t. I’d been up a ways past where I expected Lucene to scale, and so didn’t want to hamstring Lucene with anything that didn’t scale well in the long run, given my experience there. So yeah, there are a few simple themes. One was the way that the merging is managed. Another one is the way searching is done, generally by streaming through and merging lots of documents. A lot of search engines do that differently. They evaluate one clause in the query and then the next clause and then the next clause. They do them serially, whereas Lucene does it by merging them all in parallel, and that’s again a scalability issue more than anything else: never having all of the documents that match a query in memory at once, because they may not fit. You shouldn’t assume that they’ll ever fit; from the beginning, construct things so that you don’t have to try to fit them.
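The contrast Doug draws, evaluating clauses one after another versus merging them all in parallel, can be sketched with sorted lists of document ids. This is an illustrative sketch, not Lucene's scorer: `stream_matches` and `top_k` are hypothetical names, the "score" is simply the number of clauses a document matches, and each clause is assumed to supply its matching document ids in increasing order with no duplicates. Because the merge is lazy, only the current document and the best `k` hits are ever held in memory.

```python
import heapq
from itertools import groupby


def stream_matches(posting_lists, min_should_match=1):
    """Yield (doc_id, matched_clause_count) in doc-id order.

    posting_lists: one sorted iterable of doc ids per query clause (assumed input shape).
    """
    merged = heapq.merge(*posting_lists)  # lazy k-way merge over all clauses at once
    for doc_id, group in groupby(merged):
        count = sum(1 for _ in group)  # how many clauses contained this document
        if count >= min_should_match:
            yield doc_id, count


def top_k(posting_lists, k, min_should_match=1):
    """Stream through all matches while keeping only the k best hits in memory."""
    return heapq.nlargest(
        k,
        stream_matches(posting_lists, min_should_match),
        key=lambda hit: (hit[1], -hit[0]),  # more clauses first, then lower doc id
    )


if __name__ == "__main__":
    clauses = [iter([1, 3, 5, 7]), iter([3, 4, 5]), iter([5, 7, 9])]
    print(top_k(clauses, k=2))
```

A clause-at-a-time evaluator would instead build the full result set for the first clause before looking at the second, which is exactly the "assume it fits in memory" design the answer above warns against.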
- Grant Ingersoll:
- So looking back, you’ve been in the Apache Software Foundation for over eight years, and then you said, I think, a few years before that on SourceForge. What’s your impression of Lucene’s evolution, and I guess your own evolution too, in that time frame? This is something you started as, well gee, I think there’s an opportunity here, and it’s really evolved. Can you tell us about that?
- Doug Cutting:
- Sure. I’ve evolved a lot and Lucene’s evolved a lot. It took a while for Lucene to get going. People used it right off; it was useful out of the box. But for people to become adept at modifying it took a while, until there were people who were confident and understood the internals enough to really dig in and start extending it. It took a couple of years before I felt like I could really entrust it to others without closely watching everything that happened, which was fine. It was sort of a hand-off period. These days the folks who are working on it know it much better than I do anymore. I’m not nearly as active as some people. It’s really taken on a life of its own, which is neat to see. It’s added lots of new features, lots of good performance. It’s a much better product now than it was when I left it some years ago. I haven’t left it, but stopped working on it as a primary project. I hadn’t done any open source work before making Lucene open source, and to me it was sort of a revelation getting involved in open source. I’d been involved in closed source software for, I don’t know, 15 years or so before that, and had many times worked on software and then moved on to another company, and my software was no longer available to me to look at. Oftentimes I’d want to do something similar and I wouldn’t be able to go back and say, well, how did I solve this problem before? Oftentimes I’d be happy to reuse the software that I’d written before, but I couldn’t ’cause it was owned by another company. In the case of Excite, the software had disappeared entirely into some sort of intellectual property black hole. Who knows where all the software written at Excite is today, but it’s not available for anybody to use, to my knowledge.
I don’t think I chose open source because of that, but after starting to work on open source I realized that it solved that problem: you could continue to use your software as you changed jobs if you wrote open source software, and there’s little chance of it disappearing, since it’s in a sense published forever. Even though open source software projects can die, you can still go and find the software and legally pick it apart if you want. So that was a revelation, to see that you didn’t have to disassociate yourself from your software when you moved around in your career. Also the style of work, of using e-mail effectively, working with people in a way where there’s not really a control structure. It’s much more of a collaborative system. It’s really great. It’s where everyone’s a peer rather than having workers and bosses. The whole system to me has been a vast improvement.
- Grant Ingersoll:
- On that line, we didn’t meet for probably a good two years. I had been involved with Lucene since like 2004. I don’t think we met till maybe 2007. So just that whole distributed structure is really – it’s really neat. It’s challenging on one hand, but really powerful on another that you could build something this sophisticated without ever having seen the other person.
- Doug Cutting:
- Right. I still haven’t met many of the people who were very involved in Lucene and whom I’ve known for years, and that’s kind of neat. It works just fine to not meet folks; people in different time zones, people who work from home and people who work from offices can all collaborate effectively.
- Grant Ingersoll:
- You mentioned that you haven’t been involved with Lucene as much, but you still seem to be really current on that. I know you’re also doing a lot with Hadoop these days. How do you manage to do so much across all those projects? Why don’t you also tell us a little bit about what you are doing with Hadoop?
- Doug Cutting:
- Well, how do I manage? I manage to look like I’m doing a lot. I’m not really doing anything.
- Grant Ingersoll:
- I won’t tell your manager.
- Doug Cutting:
- I do like to still keep watch on Lucene projects. It’s something I like, so I do try to follow it and see what’s going on. I wouldn’t call myself very involved at this point, though; I give Lucene a little time. I started working on other open source projects around, I don’t know what year it was, 2004, I guess 2003. I started working on a project called Nutch, which is built on Lucene, to build a crawler and tools for indexing web content. Lucene is just a search library for text and doesn’t do anything about a particular application; Nutch is a particular application, searching text that’s on the web, connected by links and structured with HTML, and there are lots of tricks that you use when you know the application area, like the web. Many of these tricks I’m familiar with from working at Excite, and so it seemed a good thing to have. There wasn’t a good open source crawler-based web search technology out there, so I set out to build one in Nutch with a couple of folks. That’s done pretty well. We got to the point where we were able to handle maybe up to 100 million web pages, and it was a lot of work to do that. It took four machines, and copying files back and forth a lot, and a lot of manual monitoring of processes. Then we saw some papers published by Google about their file system, the Google File System (GFS), and about their MapReduce distributed computing metaphor, and this seemed like the thing that we needed in Nutch to make it really scale. We wanted to handle billions of pages instead of just 50 or 100 million. We thought about it, tried to reproduce those, and got that going. At the time I was working as an independent contractor doing work on Nutch, as well as Lucene, for people, sort of a hacker for hire, and after we started doing the distributed file system and MapReduce work, Yahoo contacted me and said that they were interested in building on this further.
So I went to work for Yahoo, and we moved part of Nutch out into a new project called Hadoop, which is a distributed file system, the Hadoop Distributed File System (HDFS), and MapReduce. So that’s gone pretty well. In the three years I’ve been at Yahoo, Yahoo’s had a number of engineers involved in this. It’s, I think, pretty easily Yahoo’s biggest involvement in open source, in that Yahoo’s writing lots and lots of software that’s going directly into Apache open source projects. It’s the only way, I think. There were a couple of us working on it part-time; I don’t think it ever would have gotten to the scale that it needed, to be able to run on thousands of computers, without Yahoo. Yahoo’s web indexes are now built by Hadoop, along with scores and scores of other things within Yahoo. It’s really taken off across the company as a computing platform. It’s used for lots of products, and more and more all the time. It’s pretty amazing for me to see how big it’s gotten, even within Yahoo. It’s been used heavily, probably, by Facebook, and lots of other companies have started adopting Hadoop, so that’s been a really great success.
- Grant Ingersoll:
- Yeah; it’s been really interesting to watch the whole evolution of Lucene to Nutch and then Hadoop out of that. Then you see other related projects like Mahout, and Hive and Pig and all these other ones on top of Hadoop, and just this whole ecosystem of large scale distributed computing that has evolved in open source that so many people can leverage. That’s really been fun to watch and useful for a lot of people.
- Doug Cutting:
- It took getting a good foundation first: getting Hadoop to really work and really scale, with a certain level of performance and reliability, before all these things could take off.
- Grant Ingersoll:
- Yeah; it really will enable a lot of people going forward to then build even better applications. I think that really opens the door for a lot of people whereas having that kind of knowledge in-house of distributed technologies can be daunting. So we mentioned Lucene, we mentioned Nutch, we mentioned Hadoop. Where’d those names come from? I know you came up with them or there’s some history to them. Can you fill us in on the names?
- Doug Cutting:
- Sure. I don’t claim any great system or significance to the names. Mostly I look for names that are meaningless to most people. I’m not a fan of trying to use acronyms for names, or names that sound really similar to the subject of the work or something like that, ’cause I think you should allow that projects may change what they do, like a company, and you’re better off with a fairly abstract name. That’s, I suppose, my naming philosophy: names that are kind of meaningless, but also that aren’t widely used. They’d have to not be widely used for other things, or else they’d mean something to people, but at the same time you want something that’s short and easy to spell. So that’s the challenge that I see in coming up with names. It’s easy to come up with something long that’s meaningless, or something hard to spell that’s meaningless, or to generate something random, but to come up with something that’s short and not used that much, you have to start typing things in and doing searches and doing domain name searches. Lucene is my wife’s middle name. It’s also reasonably easy to spell and gets me in good with my wife. It comes from her maternal grandmother; it was her first name. It’s a first name that I think was popular 100 years ago. You don’t see it much these days. That’s where Lucene came from. Nutch was my oldest son’s word when he was two. I think it came from lunch, but it was the word that he used for any meal. It was one of these words that kids have that everybody goes to, ’cause they only have about five words when they’re two and they use them extensively. You get to know those words, and at that time, when he was two, I was typing in words to try to find unique domain names, and sure enough, nobody had used Nutch, and it seemed like it was short and reasonably easy to spell. It was just one of the things I plugged in. Hadoop similarly was that same boy’s stuffed elephant.
He had this little stuffed elephant; he was probably three or four by this time. One day we asked him, I don’t know why we thought to ask him this, but, what’s the elephant’s name? He turned as if everyone knew and said, it’s Hadoop. It just came out of his mouth fully formed. You know, like, why, where did that come from? He said, ‘I don’t know. Hadoop.’
- Grant Ingersoll:
- He had thought about it.
- Doug Cutting:
- Yeah; oh yeah. He knew. It was very clear that he knew that that was the name of the elephant. Pretty striking. You expect kids are going to come up with names like boo-boo. It’s a little more complex than that. Anyway so when he said that I started thinking gees, that would be a good product name. So the next time I had a project come along, I said well, there we go. There's one for it. It even had a mascot. So that was handed to me on a platter.
- Grant Ingersoll:
- There ya’ go. You could pay him all your royalties, right?
- Doug Cutting:
- Right; that’s right. I think this means now I’m on the lookout for such words. I’m always thinking about it, and about an icon. It’s always good to have a personal connection; it can be kind of fun to have a story, and you’ve gotta keep telling the stories over and over.
- Grant Ingersoll:
- Well, and given your search background, they’re all nicely findable, as you said, right, because they’re unique and meaningless. They instantly have top rank in any search engine whenever anybody searches for them. So that’s quite nice.
- Doug Cutting:
- Lucene, if you search for it, you’ll find things in genealogy databases; that’s the only other thing I find for that. Nutch was a candy somewhere in Asia. I’ve also found there were like one or two people that might have had a nickname of Nutch. It’s a typo you find now and then. It’s used a little bit, but not that much. Hadoop, I don’t think there’s anything out there. I think there were zero hits on that when I first did it. So that one was great.
- Grant Ingersoll:
- That is great.
- Doug Cutting:
- Once he told me the elephant’s name I had to keep it secret from the rest of the public ‘cause I didn’t want to, I had to save that name.
- Grant Ingersoll:
- Yeah; you wouldn’t want it to show up somewhere else just as you were about to launch Hadoop, right?
- Doug Cutting:
- That’s right.
- Grant Ingersoll:
- So going back to Lucene, what are you most proud of Lucene-wise? There’s got to be a good sense of accomplishment: hey, I took this from scratch and it’s grown into a pretty big ecosystem.
- Doug Cutting:
- Oh yeah; I’m very pleased with what has happened with Lucene. It’s neat how much it’s been used. One of the things I realized: when I first started, I thought, well, maybe I could try to put up a business around this thing, license it and treat it as commercial software, before I made it open source. Then I pretty quickly realized that didn’t sound like fun to me. I didn’t want to get into negotiating contracts and hiring sales people and supporting it and all that. It’s like, ugh. That’s the expensive but boring part of software that I hate. So, I’ll just open source it and see what happens. Consequently I think it’s gotten to be much bigger, far, far bigger than it ever would have been if anybody had tried to make a business around it. In particular, if I can quantify that a little, I think it’s produced more revenue, more income, more, what’s the right word, it’s had more impact on the economy as open source than it ever would have as commercial software, is my belief, if that makes sense. More people have made more money off of it than one company ever could have made trying to sell it and support it that way and then other companies _____ _____ using it because far fewer people would have used it, but the barrier to entry being zero really many people _______ ________ ______ drilled. I think not only was it free, but the fact that it was useful didn’t hurt either.
- Grant Ingersoll:
- Right; and at the same time it’s not something a lot of companies – they know they need it, but they don’t necessarily need to build their own. So it fits that kind of perfect niche of hey, we need this, but we’re not IR experts or search experts so let’s just go get this from somebody who already knows what they’re doing.
- Doug Cutting:
- It’s interesting to talk to companies that were moving from proprietary search libraries to Lucene, to find out what was going through their heads. When I talk to these folks, sometimes there were features that Lucene had that these other things didn’t, but more often it was just that these other ones were expensive to start with, and they were also expensive each time they wanted to use a new feature or deploy it slightly differently, because the documentation wasn’t great, debugging was hard, and you couldn’t go look at the source and figure out what it was doing. You couldn’t just ask on a mailing list. There wasn’t good free user support. So what you had to do was hire consultants to do anything, and the consultants were from the company and were usually expensive. They found that by using Lucene they could piece together and prototype what they wanted, and with the transparency of the code they could actually look and see, oh, this is why this isn’t working; we need to use this this way; even though from the documentation you might think it works this way, really it works this way. That kind of stuff they couldn’t do with the proprietary software. That was at least my best guess; there are certainly a lot of folks who have switched to using Lucene. I don’t think price alone was the biggest factor. It was a factor, but I think open source made it easier to use because of the source.
- Grant Ingersoll:
- On the flip side, do you have any regrets or anything you kind of wish you did differently with Lucene just early on or do you feel that all of those have been addressed so far?
- Doug Cutting:
- Oh, I don’t know. There’s some tiny little regrets probably here and there, but nothing major. I think it’s gone much better than I ever could have expected.
- Grant Ingersoll:
- In your wildest dreams did you, oh yeah, this is really gonna take off and be big or was it just oh, if it’s useful to some people, great; otherwise I’m happy just to have done it.
- Doug Cutting:
- That would have been good enough for me. My hope was that people would use the software. To me that’s the goal of any software is that people run it. Open source seemed a natural way to reduce the barrier to getting people to run it and use it so that it would get used more and that would make it more successful. That worked wonderfully. More people are running it than I ever would have hoped. Regrets – there’s a few things. I started it in relatively early days of Java. I don’t remember when Java first came out. There were some programming styles that I invented which turned out to not catch on, which maybe I shouldn’t have tried, but it was fair at the time. I think it wasn’t real clear what the established Java conventions would be.
- Grant Ingersoll:
- Was there a point in time where you were like, aha, people are using this in ways that I never imagined? A lot of times I’m on the list and you see people saying, oh, I’m trying to do this in Lucene. You sit there and you read their e-mail and you’re like, wow, I never would have thought Lucene could do that, but yeah, it can. Do you have those kinds of moments, and did you have that as it evolved, where you said, oh yeah, this is it, people are doing this beyond whatever I imagined?
- Doug Cutting:
- I don’t have a database background. I’ve never really studied or used databases much. I’ve done a little SQL, done some database work, whereas I’ve done text search engines a lot. So I’m used to thinking about, well, how would I solve this problem using a text search engine? So I’m less surprised by the fact that people try to do things with a text search engine besides just searching text. But I am surprised that they find that it does it better than databases in many cases. It seems like again and again people will shove more and more of their search functionality into it, not just text search but other kinds of stuff: searching fields, numeric search, geographic search, things which I’d expect a general purpose database, which has been worked on for years by commercial companies and is a fairly established product, would handle better. It seems like they don’t in these cases. For example, the bug tracker we use at Apache, JIRA, stores all of its data in an SQL database, but it also has a Lucene index. Over time I think they found that they pour more and more data into the Lucene index as it changes, and just use the SQL database as a sort of data repository to back everything up, and that more and more of the searches, and things that don’t even really seem like searches, like the most recently updated bugs, and a lot of the screens that are displayed, are generated with Lucene searches, because that ends up being much lighter weight and much more able to generate these pages quickly and efficiently than the SQL database, which surprises me.
Again, it’s because I don’t know a whole lot about how SQL works, but I’m pleased and certainly surprised in a good way that Lucene’s able to do these things. There are classes of queries which I think it isn’t well suited to, like joins and such, but those seem to be less common. It seems like the most common kinds of searches people do can be done with Lucene, which is great.
- Grant Ingersoll:
- Yeah; and denormalizing your data. Something that when you come from database land the thought of normalizing your data is always drilled into your head. Then when you come into Lucene or from the search side, all of a sudden it’s undo those things. I think people coming from database land, that takes a little getting used to, but once you get over that hump you really have a lot of power available to you. I think how people use Solr with faceting and full text search that really opens up that kind of realm a lot to people.
- Doug Cutting:
- Yeah; Solr is a great addition to Lucene. People don’t have to get into the depths of the Java APIs and how to combine various things together to build their application; it pretty much out of the box lets you throw documents in and get all the search features you’d like.
- Grant Ingersoll:
- Pre-Solr, I think every one of us committers had written something Solr-like if you will. It may not have the exact features, but it had very similar things in terms of the stuff you have to build around Lucene to really make it a part of your application versus just being a core search library. What do you see kind of ahead for the whole Lucene, the sub-projects? We now have Tika, we have Mahout, we have Nutch, we have Solr. How do you see kind of those all playing together and what they have to offer people?
- Doug Cutting:
- I try not to be a visionary and predict the future. Future predictions always seem to be more of the same.
- Grant Ingersoll:
- Lucene, the whole project has grown with all these sub-projects to really try to address more of people’s needs with search. So where the Lucene core was just about pure search at a very raw level almost. Now you have like Tika which can do content extraction and Solr provides kind of this framework plus faceting and all these other nice features around it. So, just more of the same.
- Doug Cutting:
- Yeah; I think it’s got a nice ecology, in that sometimes we’ll find part of one project which is useful to other projects and it gets pulled out into another one. Solr is sort of like that in some ways. It replaces some functionality that was in Nutch. It also pulls in a lot of code from outside Nutch to do that, but that’s a great thing to have, and it’s useful to applications besides Nutch. So that’s a good one. Then there have been bits of Solr which have moved back over to Lucene, and I think it’s great to see all that kind of evolution: adding new projects when there’s functionality that’s useful independently, and moving things from one place to another to get the layering correct so that you don’t have duplicated code. I think that always works well. Nice to see these ecosystems built out.
- Grant Ingersoll:
- It’s definitely grown to be a nice interplay between all the projects, I think, with a lot of cross-pollination that makes all of them better. Is there any functionality in Lucene that you think the average user should know about or be using, but probably isn’t? I often think people could benefit more from span queries, for instance: knowing exactly where a match occurs. Do you have a sense for any of that?
- Doug Cutting:
- Yeah; span queries are kind of neat. I think the thing is that people are used to search results that only show document-level matching, and so people aren’t interested in matching fragments within documents and displaying those, so it tends to get downplayed. Some of the vector stuff, too. This is a classic information retrieval thing: people don’t use more-like-this kinds of searches, doing relevance feedback, and the tools are there more or less in Lucene, but it’s never taken off. That’s, as I said, the lament of IR, that relevance feedback hasn’t taken off despite everyone knowing that it’s the best way to improve searches.
- Grant Ingersoll:
- Yeah; the thought of doing an extra query, especially an extra large query seems to scare people away. They don’t necessarily worry as much about that extra little level of precision in their results versus the speed of their results.
- Doug Cutting:
- Generally I’m tickled to see all the ways that people are using Lucene. It’s been deployed in so many different ways: web search, product searches within web sites, search for shrink-wrap software kinds of things; all of those.
- Grant Ingersoll:
- So, how about Lucene and Hadoop? I know there’s some people that work on Lucene on Hadoop. Of course Nutch was the genesis of Hadoop, but I know there’s some independent things with Lucene and Hadoop as well. What do you see happening in the realms of distributed search these days?
- Doug Cutting:
- The core Hadoop technologies aren’t really front-end, real-time stuff. They’re back-end batch processing, which isn’t terribly amenable to the search problem. Searching is really a front-end thing and not a back-end batch thing. Indexing can be a back-end batch process. So using Hadoop to do indexing, there are some tools to do that and various people are doing that. That’s in fact what Yahoo’s doing for one application, building a big web index. But as far as Lucene and Hadoop, I think there is a hole in the portfolio of technology, in that there isn’t a good distributed search solution. Ideally you’d have something where you could turn on more boxes and be able to support either more queries per second, a greater search volume, or more documents, or both, and just have one piece of software that you install on the box and then turn it on and have it join the grid, and have it be fault tolerant, so that if you lose boxes you don’t lose much storage capacity or collection size. The major search engines have things like this: Yahoo and Google, Microsoft, and so on. The same goes if you want to be able to do real-time updates, as LinkedIn does. It seems like a natural area. It’s a difficult one. I think in order to make it work well you need to have somebody who’s really committed to using it and deploying it on a large scale and devoting significant engineering resources, as Yahoo has done with Hadoop. Yahoo has its internal technologies in these areas, so it isn’t interested in either rebuilding them in open source or taking existing technologies and turning them into open source, but hopefully somebody will come along. I think there’s a real opportunity there. I think it will happen that we’ll get a truly distributed indexing and search engine out there.
- Grant Ingersoll:
- Right; parts of it are emerging in Solr, for instance, which recently added distributed search, but the whole picture isn’t complete yet, like you said.
- Doug Cutting:
- Right; I mean you want to automatically partition things and replicate things and there’s a lot of things that are tricky to do well to really make it reliable and bullet proof.
- Grant Ingersoll:
- You can manage a certain number of machines just somewhat through an optimized manual process, having your sysadmin know what, where and when, but once you reach a certain level of machines it just needs to be more automated. I think that’s the bit that’s lacking right now like you said.
- Doug Cutting:
- There have been enough related systems built that I think it’s not rocket science, and there are ideas out there about how to do it. It’s just following through on some of them, and there’s a lot of engineering work to build it and to make it work well.
- Grant Ingersoll:
- Definitely; and you need access to those kinds of resources, just compute resources, as well. I know you’re fortunate enough to be at Yahoo where you’ve got 2,000 clusters you can run on.
- Doug Cutting:
- These days you can get those à la carte at Amazon for testing and debugging. You can fire up a 100-node cluster for an hour and only pay ten bucks, if I have my numbers right, which would be good enough to try things out. I don’t think that’s quite the barrier that it used to be. Amazon EC2 really makes that kind of stuff usable.
- Grant Ingersoll:
- Well Doug, that’s about all I have for questions. Is there anything else you wanted to talk about?
- Doug Cutting:
- Not that I can think of. Hope folks keep using Lucene, keep enjoying it, keep enhancing it. I’ll keep sending e-mails. That’s my life.
- Grant Ingersoll:
- Well great, Doug, I really appreciate you taking the time to speak with us today. Hope you have a happy holiday.
- Doug Cutting:
- Alright; you’re welcome. It's good to talk to you Grant.
