Extending Apache Tika Capabilities

Apache Tika is a toolkit for extracting metadata and textual content from various document formats. Tika itself implements parsers for some document formats, while it relies on external libraries (such as Apache PDFBox and Apache POI) to parse many more.

Tika provides a uniform Java API for all of the supported document formats to make life easier for the user.  Additionally, Tika provides functionality for detecting document type and content language.

In my earlier…


Berlin Buzzwords Recap

Back from Berlin Buzzwords and finally over the jet lag, so I thought I would put up some feedback.  First off, it was a well-organized conference with a nice focus on searching, storage and scaling.  Kudos to Isabel, Simon and Jan for all their hard work.  It also had great Wi-Fi coverage, which has been a struggle at every conference I’ve ever been to.

As for the talks, I gave the Keynote on…


Apache Lucene EuroCon Agenda – The Revolution is On!

After reviewing a lot of great talk proposals, we’ve announced the agenda for Apache Lucene EuroCon: Apache Lucene EuroCon – Europe’s Premier Lucene and Solr Search User Conference.

One of the things I really like about this agenda is that it is a great mix of basics, use cases from all over the search map (CMS, news, social media, advertising), business decisions (see last list and next list) and advanced topics (NLP, collaborative filtering, machine…


News Flash: Apache Lucene gives birth to triplets!

Apache Lucene (the Lucene top-level project, not Lucene the Java search API; I know, it’s confusing sometimes) has once again proved to be a fertile area for innovation. Having already given birth to Apache Hadoop a few years back, it has now given birth to three new Apache Top Level Projects (just approved by the Board at Apache): Apache Mahout, Apache Nutch and Apache Tika.


Apache Lucene Connector Framework now in Incubation at the ASF

Short Version

The Apache Lucene Connector Framework project has officially entered incubation.  LCF, for short, is going to be a framework for connecting to content repositories like SharePoint, Documentum, etc. It will make it easy to hook into Lucene, Solr, Nutch, Mahout and Tika while, of course, remaining agnostic of the final destination of the data.  See the Connectors website and the original proposal for more info.  Help wanted!

Long Version

Background

A while…


The Apache Lucene Ecosystem: My view of 2009

It’s that time of year, so I thought I would take a look back at the year that was for the Lucene Ecosystem and maybe look ahead just a little bit too.

First and foremost, it should be obvious to even the most casual observer that the Apache Lucene communities are thriving.  Not only is it a great time to be involved in open source, it’s a great time to be involved in Lucene. …


SF Bay Area Meetup Slides Available

Slides from the first Lucene/Solr SF Bay Area meetup are now available here.

Thanks to everyone who participated.


ApacheCon Europe Follow Up

Another year, another successful ApacheCon Europe, at least as far as Lucene, Solr and I are concerned.  This year, like last, Erik Hatcher and I ran training sessions on Lucene and Solr.  Both were well attended, despite the economy, showing once again the power of open source and the fact that people are still invested in search.  (If you missed the training, see here for alternatives.)

During the conference, there were several talks on…
