Integrating Apache Mahout with Apache Lucene and Solr – Part I (of 3)

Introduction

With Apache Mahout about to release its next version (0.3), I want to share some thoughts on how it might be integrated with Apache Lucene and Apache Solr.  For those who aren’t familiar with Mahout, it is an ASF project building a library of machine learning algorithms that are designed to be scalable (often via Apache Hadoop) and are licensed under the Apache License (i.e., commercially friendly). …


Accessing words around a positional match in Lucene

From time to time, users on the Lucene mailing list ask a variant of the following question:

Given a term match in a document, what’s the best way to get a window of words around that match?

A window of words around a match is useful for many tasks, including:

  1. Highlighting (although I’d recommend using Lucene’s Highlighter package for that)
  2. Co-occurrence analysis
  3. Sentiment analysis
  4. Question answering
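
To make the question concrete, here is a minimal sketch of the windowing step itself. It deliberately does not use any Lucene API: it assumes the document has already been tokenized into a list of strings, and the class name MatchWindow and method windowsAround are hypothetical names chosen for illustration. In a real Lucene application the match positions would come from the index (for example from span queries or term vectors) rather than from a linear scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class MatchWindow {

    /**
     * Returns one window per occurrence of term in tokens. Each window holds
     * the matching token plus up to radius tokens on either side, clipped at
     * the start and end of the token list.
     */
    static List<List<String>> windowsAround(List<String> tokens, String term, int radius) {
        List<List<String>> windows = new ArrayList<>();
        for (int pos = 0; pos < tokens.size(); pos++) {
            if (!tokens.get(pos).equals(term)) {
                continue; // not a match at this position
            }
            int start = Math.max(0, pos - radius);                // clip at document start
            int end = Math.min(tokens.size(), pos + radius + 1);  // exclusive, clip at end
            windows.add(new ArrayList<>(tokens.subList(start, end)));
        }
        return windows;
    }
}
```

For the tokens "the quick brown fox jumps over the lazy dog", calling MatchWindow.windowsAround(tokens, "fox", 2) yields a single window: [quick, brown, fox, jumps, over]. The clipping matters for matches near either end of the document, where fewer than radius tokens are available on one side.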
