
Authors' Podcast: Lucene in Action, Second Edition

Since its first publication in 2004, Lucene in Action has been the Bible for those wanting to learn more about Apache Lucene. Recently Manning Publications, the book's publisher, released a long-awaited second edition. With over 100 pages of new content, Lucene in Action, Second Edition remains the definitive guide to effectively integrating the Lucene Java libraries into your applications. The book details this superfast search technology: Lucene's easy-to-use API, features like numeric fields, payloads, and near-real-time search capabilities, and much more.

Now, listen in as Grant Ingersoll speaks with Lucene in Action co-authors Mike McCandless, Erik Hatcher, and Otis Gospodnetic, covering the book's goals, new material and new features, how the book can help search application developers working in Solr, and more.

 

Transcript

Grant Ingersoll

Welcome to Lucid Imagination's podcast series. My name is Grant Ingersoll and today I'm speaking with Lucene in Action authors Erik Hatcher, Mike McCandless and Otis Gospodnetic.

Since 2004, Lucene in Action has been the Bible for those wanting to learn more about Apache Lucene. Recently the book's publisher, Manning Publications, has released a long-awaited second edition of the book.

Today I'm fortunate to have the opportunity to interview all three of the co-authors. So why don't we get started. Why don't we start off then, have each of you introduce yourselves? Why don't we start with Mike and then Otis and then Erik?

Mike McCandless

Okay, my name is Michael McCandless. I have been working with search engines for a very long time. I think over a decade. I got my education at the Massachusetts Institute of Technology, got a PhD there, and after that founded a company called iPhrase Technologies which built search software.

We did not use Lucene at the time, but we were watching Lucene very closely. IBM acquired iPhrase almost – I guess six years ago now, and right after that I got involved in Lucene and I haven't looked back. I kept working very hard, and after a couple of years Manning got in touch with me to revise Lucene in Action, the original book, which was very popular. They wanted to do the revision, and I said yes, and that brings us to today.

Grant Ingersoll

Otis?

Otis Gospodnetic

My name is Otis Gospodnetic, and I've been involved with Lucene for about 10 years now, I think since before it joined Apache. Since I started working with Lucene in 2000-and-some, I've written Lucene in Action with Erik, which was, I think, a big hit and valuable for lots of readers.

After writing the book I decided to start a business around Lucene, offering services around Lucene and search in general. So I created something called, very imaginatively, Lucene Consulting, which I ran for a little while, and then that grew into something that's called Sematext, and today that's what I do.

Grant Ingersoll

Erik?

Erik Hatcher

Okay, my name is Erik Hatcher. I have been involved in Lucene since around 2002. I began tinkering with it when I was working on the Ant book and playing with some personal search projects. So I certainly don't come with the search background that Mike has here, but I came from just being interested in the search technologies and discovered Lucene, and I have been a fan of open source for a long, long time. So those two fit really well.

And fell in love with Lucene, and then Otis and I did the first edition of the book and like he said, I think it was a big success. In fact, I think Otis and I and Mike can certainly owe a large part of our careers to Lucene and the greatness that it is.

And in fact, our careers got so busy, I got so busy myself, that we did need to bring on a co-author. And so that brings us back to Mike, who basically single-handedly did the second edition here. So big thanks go out to him for really knocking it out of the park. He bit off way more than he wanted to chew, I'm sure. But I'm very thankful that he did do that.

Grant Ingersoll

Great, thank you. Well, Otis, as you said you're probably the veteran of Lucene here, having been around it since before it even joined Apache. So why don't I start off with you and could you just give a brief introduction to our listeners. What is Lucene? Why should I care about it? How's it useful to me as an application developer?

Otis Gospodnetic

Okay, Lucene is a low-level text indexing and searching toolkit that allows one to add text indexing and searching functionality into an application. It's very mature, it's been around for 10-plus years, it's been developed by a number of very smart people. I shouldn't say that about us, but a number of people. It's pretty high performance. It's used by thousands and thousands of organizations around the world. And as far as I know there's nothing that's free and of this quality out there today.

Grant Ingersoll

Great, thank you. Well, so, Erik, Lucene's been out there, it's been a while, it's used in a lot of organizations, people are adding search capabilities to things ranging from their website all the way up to large scale document repositories. So, of course with that great code you need somebody who explains it. So enter Lucene in Action.

Why don't you take us through at a high level? Tell me about the book. What is it? What's spurred you to write the first edition, update the second one? What does it cover and most of all, who's the intended audience?

Erik Hatcher

Okay, so first I'll just start with the intended audience because that kind of lays the foundation for how the book was structured. But this is designed for developers. This is for search engineers that are going to be integrating a search technology into their existing application, into their website, into their desktop product, into whatever type of service they want to wrap around Lucene into their application.

So it was designed specifically for those use cases there. So we designed the book with a few sections here to really get you started with Lucene to show how easy it is. That's one of the things that I really enjoy about Lucene is its ease of use. So the first chapter really just sets the stage and says, "Here's some data, we're gonna index it, and here's how you can use Lucene's very simple API to search the data that you indexed."

And then from then on we go on into more complicated topics such as, you know, query parsing, textual analysis, how to customize things like the query parser itself, and how to use some advanced features like payloads. That's something that I actually learned a lot about from Mike as he wrote this book.

So that's the – kind of the high level overview. We can drill in more if you'd like or –
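
The "index some data, then search it" workflow described for the first chapter can be illustrated with a toy inverted index. This is only a sketch in plain Python, not Lucene's API: `ToyIndex`, `tokenize`, and the sample documents are hypothetical names made up for the illustration, and the tokenizer is a crude stand-in for a real analyzer.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a stand-in for an analyzer)."""
    word = []
    for ch in text.lower():
        if ch.isalnum():
            word.append(ch)
        elif word:
            yield "".join(word)
            word = []
    if word:
        yield "".join(word)


class ToyIndex:
    """Hypothetical in-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = []

    def add(self, text):
        """Store the document and record every term it contains."""
        doc_id = len(self.docs)
        self.docs.append(text)
        for term in tokenize(text):
            self.postings[term].add(doc_id)
        return doc_id

    def search(self, query):
        """Return ids of documents containing every query term, ascending."""
        terms = list(tokenize(query))
        if not terms:
            return []
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


index = ToyIndex()
index.add("Lucene in Action")
index.add("Ant in Action")
index.add("Apache Lucene is a search library")
hits = index.search("lucene action")  # only document 0 contains both terms
```

A real engine adds analysis, scoring, and on-disk storage on top of this idea; the sketch only shows why looking up a term is cheap once the index has been built.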

Grant Ingersoll

Great, well, I think that actually leads to my question for Mike which is, you know, I mentioned earlier this is actually the second edition, and by my calculations of comparing on Manning's website I think it looks like there's about 80 pages of new content, and I imagine there's quite a bit more of updates and revisions because the old APIs have changed a lot.

So, you know, what does our audience really have to look forward to in this new version, Mike? What are some of the new things we can learn about Lucene that weren't covered in Lucene in Action 1?

Mike McCandless

Oh, 80 pages sounds like an undercount, Grant. There's a tremendous amount of new stuff. Lucene has changed a lot in the five years since the first edition was published. Lucene in Action 2 has entirely new chapters. There's a chapter for Tika. If you want to index text out of binary documents, Tika is a wonderful subproject, or now it's a top-level project at Apache, that has been created expressly for that purpose, and it does a very good job with that.

There's a new chapter for the administrative aspects of Lucene. How do you perform a hot backup of your index, taking a backup without pausing indexing? How do you manage threads? How do you test for performance? There's a whole side of Lucene that's sort of administration details, and that's covered in that chapter.

We have entirely new case studies, very interesting applications of Lucene. There are four case studies that replaced the original case studies. Those are the new chapters. For new APIs we have tons of stuff, both APIs that have been revised and fully new APIs. Take the custom collector API: if you want to do something other than grab the top results by score or by sorting by field, you need to use a custom collector, and that has changed.

The new reusable analysis API has been a substantial change in Lucene as of 2.9, and it heavily affected Chapter 4. Everything had to change there. Near-real-time search is a new feature that's available. Payloads are another example: if you want to encode some custom information for every term position in your index, payloads give you a means, completely opaque to Lucene, of storing something inside there which you can then access at search time.
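
The payload idea, a small opaque blob of bytes attached to each term position, can be sketched as follows. This is a toy model in plain Python, not Lucene's payload API: `postings`, `index_tokens`, `payloads_for`, and the one-byte "weight" are all assumptions made up for the illustration.

```python
from collections import defaultdict

# Hypothetical postings structure: term -> list of (doc_id, position, payload) entries.
postings = defaultdict(list)


def index_tokens(doc_id, tokens):
    """Index a document given as (term, payload_bytes) pairs in document order."""
    for position, (term, payload) in enumerate(tokens):
        # The index stores the bytes untouched; it never interprets them.
        postings[term].append((doc_id, position, payload))


def payloads_for(term, doc_id):
    """Return the raw payload bytes stored at each position of `term` in one document."""
    return [payload for (d, _pos, payload) in postings[term] if d == doc_id]


# Assumed application convention: a one-byte weight per occurrence.
index_tokens(0, [("lucene", b"\x05"), ("in", b"\x01"),
                 ("action", b"\x03"), ("lucene", b"\x02")])

# Only the application knows how to decode the bytes at search time.
weights = [payload[0] for payload in payloads_for("lucene", 0)]
```

The point of the sketch is the division of labor: the index carries the bytes per position, and the meaning of those bytes lives entirely in application code.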

Indexing has gained a lot of richness for advanced use cases, especially transactional use cases. If you want to make a commit point and make some changes, roll back to a prior commit point, the indexing API has become fully transactional and we cover that in Chapter 2.
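
The commit point and rollback behavior described here can be sketched with a toy writer that keeps a committed snapshot next to a working copy. This is an illustration only: `ToyWriter` is a hypothetical class, not Lucene's `IndexWriter`, and a real index would not copy whole document lists like this.

```python
class ToyWriter:
    """Hypothetical writer with a committed snapshot and a pending working copy."""

    def __init__(self):
        self._committed = []  # documents as of the last commit point
        self._pending = []    # working copy, including uncommitted changes

    def add(self, doc):
        self._pending.append(doc)

    def delete(self, doc):
        self._pending.remove(doc)

    def commit(self):
        """Make all pending changes the new commit point."""
        self._committed = list(self._pending)

    def rollback(self):
        """Discard every change made since the last commit point."""
        self._pending = list(self._committed)

    def committed_docs(self):
        return list(self._committed)

    def pending_docs(self):
        return list(self._pending)


writer = ToyWriter()
writer.add("doc-a")
writer.add("doc-b")
writer.commit()          # commit point: doc-a, doc-b

writer.add("doc-c")
writer.delete("doc-a")
writer.rollback()        # back to the commit point; doc-c is gone, doc-a is restored
```

Readers of the committed snapshot never see half-applied changes, which is the property that makes this style of indexing usable in transactional settings.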

I think there's many other things that have changed. It's a long tail. Lucene has really been a very active open-source project and has grown a lot over the past five years.

Grant Ingersoll

Yeah, I can just relate in my reading the second edition there was just – even as a committer there's so many new things to be covered. So that's great. Well, so going back to Erik here, you know, what – in picking up the book, you know, if I'm a newbie to Lucene, what part of Lucene do you think is the hardest for new readers to grasp, and is there a part of the book in particular that helps someone better understand that?

Erik Hatcher

That's a great question. What is difficult? What are the difficult topics here? You know, most of the search stuff is fairly straightforward conceptually. The devil is really in a lot of the details, and thankfully that's the kind of thing that books like this answer. It really gets you down to the details of, you know, such things as the payloads we talked about and the scoring API, things that really make or break these search applications in terms of how good they are.

People always want to tune and fine-tune the relevancy for their particular cases. So it really is helpful to understand how the indexing works, how payloads work, how the scoring functions work. So I think that's probably the trickiest part for new readers.

Grant Ingersoll

Great. Mike or Otis, anything – any of your thoughts on what's useful to new readers? No? Okay.

Otis Gospodnetic

No.

Grant Ingersoll

Okay, so Otis, kind of switching gears from the new reader. You know, I'm an old hand at Lucene, I've been doing this since 2004, I guess. Just like for you guys, Lucene in a lot of ways has paved a career for me. What do you think Lucene in Action 2 offers the more experienced Lucene developer? In other words, you know, what new tricks can you teach this old dog?

Otis Gospodnetic

I would say it's great for somebody who has not kept a close eye on Lucene for, say, the last year or even two. I think enough things have been added to Lucene or have been changed that, even if they've read the first edition of Lucene in Action, they should read this new edition, and they'll learn a ton of stuff.

Things like payloads or per-segment things or checkpointing, all those things are new. Performance troubleshooting around Lucene, I find in my work, is something that lots of people don't understand, even if they know how to use Lucene. So those are valuable things in the book that people should, you know, catch up on.

Grant Ingersoll

Yeah, definitely in my mind the section on performance was just so valuable, even as somebody who, you know, like you said, you've got all of this experience with Lucene but it just really helps solidify the concepts and even gives you ways to teach them to other people.

So switching gears now to Mike, is there anything that you didn't cover in the book, or felt you didn't cover enough of, that you wish you could have given more time and space? Maybe you could, you know, use this time here to expand on that a little bit.

Mike McCandless

Good question. I tried very hard to be thorough, going through Lucene's changes, going through the emails and the user questions that people had, and tried to make sure I covered everything that came up through the 3.0 release of Lucene.

Now, of course, Lucene is not frozen; it's continuing to change after that. There are many subsequent changes, but because of Lucene's backwards compatibility the source code in the book should work throughout the 3.x series of releases. So I would say that the book does a good job covering all the new stuff that has appeared in 3.0.

That said, it's entirely possible things have slipped between the cracks and I've missed something. It's sort of a long tail.

Grant Ingersoll

Yeah, you're right, there's just a lot there. I mean, I even was amazed at how much coverage you give to all the – a lot of the contrib modules and some of the added on features on top of Lucene. So that's great.

Mike McCandless

Yes, although I would thank the contributors in that case. In a number of cases I went to the original authors of those contrib modules and asked them to write it up, which was very helpful, so thank you to them.

Grant Ingersoll

Delegation is the hallmark of any good leader, right?

Mike McCandless

Yes, yes, very useful tool.

Grant Ingersoll

Well, Erik, you know, as you know, many people these days who are using Lucene are actually using Lucene via Solr or some other tool like Solr. You know, so if I'm a Solr user, you know, which like I said, uses Lucene, how will Lucene in Action help me?

Erik Hatcher

First of all, let me just describe what Solr is. Solr is a layer on top of Lucene that adds a lot of value in terms of adding features such as faceting. It integrates many of the contrib pieces of Lucene as well: the highlighter, the spellchecker, those types of pieces that you can use with Lucene at an API level. However, Solr makes these capabilities very performant through an HTTP interface. So applications can integrate with this through any language or environment out there.

So Solr really encapsulates kind of the best practices of Lucene in a lot of ways. Many people that build applications with Lucene themselves end up effectively creating something like Solr in terms of caching and other things like managing the index writer, those types of things.

So these are the types of things that you'll learn from this book: you'll understand a lot more about what makes Solr tick in a lot of ways, but you'll also have things such as the query parser know-how to understand how to customize some of these things.

Pretty much every piece of Solr is pluggable in some way, and knowing the Lucene API gives you the ability to tweak these plug-ins and write little custom search components or query parsers or token filters and tokenizers, those types of things. So all of this Lucene know-how really comes into play when you're dealing with Solr.

Grant Ingersoll

Great, yeah, I mean, you know, I always tell people that, you know, a lot of the concepts are the same, it's just how you then – they actually get delivered. You don't have to necessarily write code. So definitely – this book is definitely a boon to Solr users as well.

I should also say I know, Otis, you are actually under contract to be writing Solr in Action, too, correct?

Otis Gospodnetic

That's correct.

Grant Ingersoll

Yeah, so we have that to look forward to as a complementary book for Lucene in Action, coming out from Manning at some point in the future.

Otis Gospodnetic

Um-hum.

Grant Ingersoll

Going back to you, Otis, you know, I mean, one of the things I'm constantly amazed by with Lucene is the variety of ways people are using it. I believe you were part of helping write up the use cases for the book, at least in the first edition, and possibly in the second one here.

So walk me through some of the use cases you've seen in your – either in your day job or through the mailing lists or things like that, that might help people who are thinking, "Oh, well, I'm not sure if I can use Lucene, so I'm not sure if this book is right for me." But, you know, maybe it helps them say, "Oh, I've got a problem similar to that. I can pick up this book and then go and try it out."

Otis Gospodnetic

Well, I find that some Lucene users do use Lucene for what it was, I guess originally intended for which is to search chunks of text.

Grant Ingersoll

Yeah, imagine that.

Otis Gospodnetic

But sometimes I find that people come to me and when they describe what they need to search for, whether it's Lucene or Solr or something – some other library, it turns out that they're not really searching text as in large chunks of text, they're really just doing matching.

Then I've heard of people, you know, searching for documents that have a specific number value in some field, then some Boolean value in some other field, and so on. So there's really no scoring or ranking involved, just: does something match or not?
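
The "pure matching" use case, where a document either satisfies every field condition or it does not, can be sketched with exact-value lookups and set intersection. This is a plain-Python illustration, not Lucene or Solr code; the documents, field names, and the helpers `build_field_index` and `match_all` are hypothetical.

```python
# Hypothetical documents with a numeric field and a Boolean field.
docs = [
    {"id": 1, "price": 10, "in_stock": True},
    {"id": 2, "price": 10, "in_stock": False},
    {"id": 3, "price": 25, "in_stock": True},
]


def build_field_index(documents):
    """Map each (field, value) pair to the set of ids of documents holding it."""
    index = {}
    for doc in documents:
        for field, value in doc.items():
            index.setdefault((field, value), set()).add(doc["id"])
    return index


def match_all(index, **criteria):
    """Return ids of documents matching every field=value criterion.

    There is no score and no ranking: a document is either in the result or not.
    """
    sets = [index.get(item, set()) for item in criteria.items()]
    return sorted(set.intersection(*sets)) if sets else []


field_index = build_field_index(docs)
cheap_and_available = match_all(field_index, price=10, in_stock=True)
```

Because the result is a plain set of ids, the same structure also serves the key-value style of lookup: querying on a single unique field behaves like fetching a record by key.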

Then I've heard of people using Lucene and Solr, again, for reporting purposes or as key value stores simply because it's so fast to retrieve something from the index if you know the value or field which ends up acting like a key. They use it as a database, as a key value store, for aggregation of data instead of I guess relational or non-relational databases. They use it for classification, for clustering, and simple machine learning things, all kinds of things.

Grant Ingersoll

Yeah, yeah, I've seen, too, like even recommendations systems as well.

Otis Gospodnetic

Yeah.

Grant Ingersoll

You know, it's interesting, and I think, Mike, you can relate this to a lot of the work that's been done on the transactional stuff: I think you're starting to see more and more people, and it goes along with the NoSQL movement, if you will, using Lucene and Solr as the authoritative store, or as their, you know, primary store for their data, and then that's the only way they access it.

Whereas it used to be the case that, you know, they were doing everything through a database and search was the side thing that they did.

Mike McCandless

Yes.

Grant Ingersoll

So, Mike, you know, just to kind of finish up here. I think we're at around 20 minutes or so, which is where I like to keep these at. You mentioned, of course, this is an open-source project. It hasn't been frozen in time here. Obviously, Lucene is moving forward, and in fact, moving forward at a pretty rapid pace.

Give us a taste of what we'll be writing about in – or reading about, actually, in Lucene in Action 3, and presumably what you guys are going to be writing about in Lucene in Action 3 in a year or two.

Mike McCandless

Oh, my. There have been a tremendous number of changes. One of the biggest changes is that flexible indexing has arrived on Lucene's trunk, which means it'll be available in the 4.0 release of Lucene. That is a very large change. It allows applications to implement a codec, which controls how the very low-level data, the postings data (documents and positions and payloads and that sort of thing), are written into the index and read from the index.

It also switches terms to be purely binary instead of textual in nature, which enables more efficient encoding of numeric fields and collated fields, for example. And it changes the enumeration API. So there's a lot of low-level stuff associated with those changes.
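
One reason binary terms help with numeric fields is that numbers can be encoded as bytes whose byte-wise order matches numeric order, which text does not give you. The sketch below shows a generic order-preserving encoding for signed 64-bit integers; it is an assumption-laden illustration of the idea, not the encoding Lucene actually uses, and `encode_sortable` and `decode_sortable` are hypothetical names.

```python
import struct


def encode_sortable(n):
    """Encode a signed 64-bit integer so that byte order matches numeric order."""
    if not -(2**63) <= n < 2**63:
        raise ValueError("value out of signed 64-bit range")
    # Shifting by 2**63 maps the signed range onto 0..2**64-1 (the same effect as
    # flipping the sign bit); big-endian packing then sorts like the numbers do.
    return struct.pack(">Q", n + 2**63)


def decode_sortable(data):
    """Invert encode_sortable."""
    return struct.unpack(">Q", data)[0] - 2**63


values = [3, -7, 100, 0, -1]
by_bytes = sorted(values, key=encode_sortable)  # numeric order
by_text = sorted(values, key=str)               # text order, which is not numeric
```

With terms stored this way, a range query over a numeric field becomes a scan over a contiguous run of terms, and every value occupies a fixed eight bytes regardless of how many digits it has.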

Another big change is all of the work that Robert's been doing lately, coalescing the analyzer modules. Lucene and Solr have traditionally had rather scattered analyzers. Some were in Lucene's core, some were in Solr's core, some were in Lucene's contrib, some were in Solr's contrib, and Robert has been doing a lot of work lately to pull all those together into a separate module, which is the one place you can go to find these analyzers.

And in addition to that he's been folding in additional languages, a lot of other languages which were not – had no coverage whatsoever in Lucene before are now gaining coverage.

I know spatial has been undergoing a lot of changes. Grant, you've been involved in that heavily.

Grant Ingersoll

Yeah.

Mike McCandless

So that'll be completely different from how it looks right now in Lucene in Action. I can't think of the other changes, I'm sure there's tons of changes and a few years from now when we're looking back over the changes we'll be writing them up for Lucene in Action 3.

Grant Ingersoll

Yeah, that's great. We won't put you on the time schedule yet for when you have to deliver that. We'll let you rest on the fact that you guys did a great job in putting out Lucene in Action 2.

With that, that's about all the questions I have. I wanted to first off take the time to thank Mike and Otis and Erik, not only for taking the time to talk with me today, but, like I said, I mean, Lucene has been a good career boon for me, and it all started on the very first day I walked into my job and they handed me Lucene in Action and they said, "Go read this, and go build us a search engine on top of Lucene when you get done."

So obviously I can't thank you guys enough for that because it's been great for me, but also, you know, over the years we've grown together as a team in open source and worked together on a lot of good things. So I just want to, again, extend my thanks to all of you and appreciate you taking the time with me today.

Mike McCandless

Thank you, Grant.

Erik Hatcher

Thank you, Grant, very nice.

[End of Audio]

 


Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2012 Lucid Imagination. All rights reserved.