June 18, 2010
Apache Tika is a toolkit for extracting metadata and textual content from various document formats. Tika itself provides implementations for parsing some document formats, while it relies on external libraries (such as Apache PDFBox and Apache POI) for parsing many more.
Tika provides a uniform Java API for all of the supported document formats to make life easier for the user. Additionally, Tika provides functionality for detecting document type and content language.
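To make the "uniform API" idea concrete, here is a minimal, self-contained sketch of the pattern: detect the document type first, then hand the bytes to whichever format-specific parser is registered for that type. Every name in it (`FormatParser`, `UniformParsingSketch`, `detect`, `parseToString`) is made up for illustration, and the detection rules are deliberately naive. It is not Tika's actual API or implementation, just the general shape of the approach.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface for illustration only; NOT part of Tika.
interface FormatParser {
    String extractText(byte[] content);
}

// Hypothetical facade: one entry point, many format-specific parsers behind it.
final class UniformParsingSketch {
    private static final Map<String, FormatParser> PARSERS = new LinkedHashMap<>();

    static {
        // Plain text: decode the bytes as UTF-8 and return them unchanged.
        PARSERS.put("text/plain",
                content -> new String(content, StandardCharsets.UTF_8));
        // HTML: crude tag stripping, an assumption made to keep the sketch short.
        PARSERS.put("text/html",
                content -> new String(content, StandardCharsets.UTF_8)
                        .replaceAll("<[^>]*>", "")
                        .trim());
    }

    // Naive type detection based on the first few bytes of the document.
    static String detect(byte[] content) {
        int length = Math.min(content.length, 64);
        String head = new String(content, 0, length, StandardCharsets.UTF_8)
                .trim()
                .toLowerCase(Locale.ROOT);
        if (head.startsWith("%pdf-")) {
            return "application/pdf";
        }
        if (head.startsWith("<!doctype html") || head.startsWith("<html")) {
            return "text/html";
        }
        return "text/plain";
    }

    // The caller never picks a parser; the detected type does.
    static String parseToString(byte[] content) {
        String type = detect(content);
        FormatParser parser = PARSERS.get(type);
        if (parser == null) {
            throw new UnsupportedOperationException("No parser registered for " + type);
        }
        return parser.extractText(content);
    }

    public static void main(String[] args) {
        byte[] html = "<html><body><p>Hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(html) + " -> " + parseToString(html));
    }
}
```

In this sketch, supporting a new format means registering one more `FormatParser` and teaching `detect` to recognize it, and callers of `parseToString` do not change. A real toolkit would delegate the heavy lifting to dedicated libraries rather than the toy parsers used here.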
In my earlier…
June 11, 2010
Back from Berlin Buzzwords and finally over the jet lag, so I thought I would put up some feedback. First off, it was a well-organized conference with a nice focus on searching, storage and scaling. Kudos to Isabel, Simon and Jan for all their hard work. It also had great wi-fi coverage, which has been a struggle at every other conference I’ve ever been to.
As for the talks, I gave the Keynote on…
April 22, 2010
After reviewing a lot of great talk proposals, we’ve announced the agenda for Apache Lucene EuroCon: Apache Lucene EuroCon – Europe’s Premier Lucene and Solr Search User Conference.
One of the things I really like about this agenda is that it is a great mix of basics, use cases from all over the search map (CMS, news, social media, advertising), business decisions (see last list and next list) and advanced topics (NLP, collab filtering, machine…
April 21, 2010
Apache Lucene (the Lucene top-level project, not Lucene the Java search API; I know, it’s confusing sometimes) has once again proved to be a fertile area for innovation. Having already given birth to Apache Hadoop a few years back, it has now given birth to three new Apache Top Level Projects, just approved by the Apache Board: Apache Mahout, Apache Nutch and Apache Tika…
January 20, 2010
Short Version
The Apache Lucene Connector Framework project has officially entered incubation. LCF, for short, is going to be a framework for connecting to content repositories like SharePoint, Documentum, etc. It will make it easy to hook into Lucene, Solr, Nutch, Mahout and Tika while, of course, remaining agnostic of the final destination of the data. See the Connectors website and the original proposal for more info. Help wanted!
Long Version
Background
A while…
December 24, 2009
It’s that time of year, so I thought I would take a look back at the year that was for the Lucene Ecosystem and maybe look ahead just a little bit too.
First and foremost, it should be obvious to even the most casual observer that the Apache Lucene communities are thriving. Not only is it a great time to be involved in open source, it’s a great time to be involved in Lucene. …
June 5, 2009
Slides from the first Lucene/Solr SF Bay Area meetup are now available here.
Thanks to everyone who participated.
April 1, 2009
Another year, another successful ApacheCon Europe, at least as far as Lucene, Solr and I are concerned. This year, like last, Erik Hatcher and I gave trainings on Lucene and Solr. Both were well attended, despite the economy, showing once again the power of open source and the fact that people are still invested in search. (If you missed the training, see here for alternatives.)
During the conference, there were several talks on…