March 16, 2010

Integrating Apache Mahout with Apache Lucene and Solr – Part I (of 3)

Posted by Grant Ingersoll

Introduction

As Apache Mahout is about to release its next version (0.3), I thought I would share some thoughts on how it might be integrated with Apache Lucene and Apache Solr. For those who aren’t aware of Mahout, it is an ASF project building out a library of machine learning algorithms that are designed to be scalable (often via Apache Hadoop) and licensed under the Apache License (i.e., commercially friendly). Mahout has a variety of algorithms already implemented, ranging from clustering to classification and collaborative filtering. For more on Mahout, see my TriJUG talk or my developerWorks article. Instead of going over the litany of things implemented in Mahout, I’ll give a quick recap of the primary features of 0.3:

  1. New math, collections modules based on the time tested Colt project
  2. LLR (log-likelihood ratio; see Lucid Imagination advisor Ted Dunning’s blog entry for more info) collocation implementation
  3. Hadoop-based Lanczos SVD (Singular Value Decomposition) solver, good for feature reduction, which is a common requirement at scale
  4. Shell scripts for easier running of algorithms, examples
  5. Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
  6. Parallel Dirichlet process clustering (model-based clustering algorithm)
  7. Parallel co-occurrence based recommender
  8. Code cleanup, many bug fixes and performance improvements
  9. A new logo

Enough of the background; let’s get to what we can do right now.  I’ll break it down into three groups:

  1. Lucene/Solr as a Data Source for Mahout batch processing
  2. Document/Results Augmentation (clustering, classification, recommendations)
  3. Learning about your data and your users (log analysis with Apache Mahout)

In Part I (this post), I’m going to focus on #1 as a way for people to get started without having to do any coding.  In Part II, I’ll focus on #2 and finally, as you might guess, Part III will focus on #3.

Lucene/Solr as a Data Source for Mahout

Most Apache Mahout algorithms run off of feature vectors. For those in the Lucene world, a feature vector should feel very familiar. It is, more or less, a document, or some subset of a document. Specifically, a feature vector is a tuple of features that are useful for the algorithm. It is up to you to determine which features work best. In many cases for Mahout, a vector is simply a tuple of weights, one for each of the words in a document. In other cases, the features might be the values from the output of some manufacturing process. Do note that the features needed for good search are often different from those needed for good machine learning. For instance, in my experiments with Mahout’s clustering capabilities, I need far more aggressive stopword removal to get good results than I do for search. (In fact, for search these days, I often don’t even remove stopwords, but instead deal with them at query time, but that is a whole other post.)
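To make the "tuple of weights" idea concrete, here is a minimal sketch in plain Java. It does not use Mahout or Lucene; the class name, the whitespace-and-punctuation tokenizer, and the use of raw term counts as weights are all assumptions made for illustration. Real pipelines typically use an analyzer and a weighting scheme such as TF-IDF.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: not part of Mahout or Lucene.
public class FeatureVectorSketch {

    /** Builds a sparse feature vector: term -> raw count, skipping stopwords. */
    public static Map<String, Double> termFrequencies(String text, Set<String> stopwords) {
        Map<String, Double> weights = new LinkedHashMap<>();
        // Naive tokenizer (assumption): lowercase, split on non-word characters.
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || stopwords.contains(token)) {
                continue;
            }
            weights.merge(token, 1.0, Double::sum);
        }
        return weights;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "and", "of"));
        System.out.println(termFrequencies("The cat and the hat of the cat", stopwords));
    }
}
```

Growing the stopword set is the simplest way to try the more aggressive stopword removal mentioned above.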

There are two different ways for Mahout to use Lucene/Solr as a data source:

  1. Utilize Lucene’s term vector capability to create Mahout feature vectors.
  2. Programmatically access low level Lucene features like TermEnum, TermDocs, TermPositions, etc. to construct features.

For this post, I’m going to focus on #1, as I have yet to even have a need for #2, even though in theory it could be done.

Mahout Vectors from Lucene Term Vectors

For Mahout to create vectors from a Lucene index, the index must first contain term vectors. A term vector is a document-centric view of the terms and their frequencies (as opposed to the inverted index, which is a term-centric view), and term vectors are not stored by default.
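The difference between the two views can be sketched with plain Java collections. This is only an illustration of the concept, not how Lucene stores either structure; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only: Lucene's on-disk structures are very different.
public class IndexViewsSketch {

    private static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }

    /** Term-centric view: term -> ids of the documents that contain it. */
    public static Map<String, Set<Integer>> invertedIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : tokenize(docs.get(docId))) {
                if (term.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    /** Document-centric view: document id -> (term -> frequency in that document). */
    public static Map<Integer, Map<String, Integer>> termVectors(List<String> docs) {
        Map<Integer, Map<String, Integer>> vectors = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            Map<String, Integer> vector = new TreeMap<>();
            for (String term : tokenize(docs.get(docId))) {
                if (term.isEmpty()) {
                    continue;
                }
                vector.merge(term, 1, Integer::sum);
            }
            vectors.put(docId, vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("solr search", "mahout clustering search search");
        System.out.println("inverted index: " + invertedIndex(docs));
        System.out.println("term vectors:   " + termVectors(docs));
    }
}
```

The document-centric map is the shape Mahout needs, which is why the index has to store term vectors before vectors can be extracted from it.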

For this example, I’m going to use Solr’s example, located in <Solr Home>/example.

In Solr, storing term vectors is as simple as setting termVectors="true" on the field in the schema, as in:

<field name="text" type="text" indexed="true" stored="true" termVectors="true"/>

For pure Lucene, you will need to set the TermVector option on during Field creation, as in:

Field fld = new Field("text", "foo", Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES);

From here, it’s as simple as pointing Mahout’s new shell script (try running <MAHOUT HOME>/bin/mahout for a full listing of its capabilities) at the index and letting it rip:

<MAHOUT HOME>/bin/mahout lucene.vector --dir <PATH TO INDEX>/example/solr/data/index/ --output /tmp/foo/part-out.vec --field title-clustering --idField id --dictOut /tmp/foo/dict.out --norm 2

A few things to note about this command:

  1. This outputs a single vector file, named part-out.vec, to the /tmp/foo directory.
  2. It uses the title-clustering field.  If you want a combination of fields, then you will have to create a single “merged” field containing those fields.  Solr’s <copyField> syntax can make this easy.
  3. The idField is used to provide a label to the Mahout vector such that the output from Mahout’s algorithms can be traced back to the actual documents.
  4. The --dictOut option outputs the list of terms that are represented in the Mahout vectors.  Mahout uses an internal, sparse vector representation for text documents (dense vector representations are also available), so this file contains the “key” for making sense of the vectors later.  As an aside, if you ever have problems with Mahout, you can often share your vectors with the list and simply keep the dictionary to yourself, since it would be pretty difficult (I’m not sure whether it is impossible) to reverse engineer the content from just the vectors.
  5. The --norm option tells Mahout how to normalize the vector.  For many Mahout applications, normalization is a necessary step for obtaining good results.  In this case, I am using the Euclidean norm (aka the 2-norm) to normalize the vector because I intend to cluster the documents using the Euclidean distance measure.  Other approaches may require other norms.
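For readers who have not met vector norms before, the following plain-Java sketch shows what normalizing by a p-norm means; p = 2 gives the Euclidean case used above. It illustrates the arithmetic only and is not Mahout’s implementation; the class and method names are hypothetical.

```java
// Hypothetical illustration only: shows the arithmetic, not Mahout's code.
public class NormSketch {

    /** Returns a copy of v divided by its p-norm: (sum of |x|^p)^(1/p). */
    public static double[] normalize(double[] v, double p) {
        double sum = 0.0;
        for (double x : v) {
            sum += Math.pow(Math.abs(x), p);
        }
        double norm = Math.pow(sum, 1.0 / p);
        if (norm == 0.0) {
            return v.clone(); // an all-zero vector cannot be scaled to unit length
        }
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] unit = normalize(new double[] {3.0, 4.0}, 2.0);
        System.out.println(unit[0] + ", " + unit[1]);
    }
}
```

After 2-norm normalization every document vector has length 1, so long and short documents become comparable under Euclidean distance.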

Obviously, the script above can be run at any time, but I think it is even more interesting to hook it into Solr’s event system, with caveats. For those who aren’t familiar, Solr provides an event callback system for events like commit and optimize (see also the LucidWorks Reference Guide). Hooking into the event system is as simple as setting up the appropriate event listener. For this example, I’m going to hook into the commit listener by having it call out to the Mahout script above:

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">/Volumes/User/grantingersoll/projects/lucene/mahout/clean/bin/mahout</str>
  <str name="dir">.</str>
  <bool name="wait">false</bool>
  <arr name="args">
    <str>lucene.vector</str>
    <str>--dir</str>
    <str>./solr/data/index/</str>
    <str>--output</str>
    <str>/tmp/mahout/vectors/part-out.vec</str>
    <str>--field</str>
    <str>text</str>
    <str>--idField</str>
    <str>id</str>
    <str>--dictOut</str>
    <str>/tmp/mahout/vectors/dict.dat</str>
    <str>--norm</str>
    <str>2</str>
    <str>--maxDFPercent</str>
    <str>90</str>
  </arr>
</listener>
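The listener passes one option that the earlier command line did not: a maximum document frequency percentage of 90. My understanding is that it drops terms that occur in more than that percentage of documents, which is a cheap way to get aggressive stopword removal. The sketch below illustrates that reading in plain Java. Whether Mahout treats the boundary as inclusive or exclusive is an assumption here, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only: boundary handling is an assumption, not Mahout's code.
public class MaxDfFilterSketch {

    /** Keeps terms whose document frequency is at most maxDfPercent of all documents. */
    public static Set<String> keptTerms(Map<String, Integer> docFreq, int numDocs, int maxDfPercent) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            double percent = 100.0 * entry.getValue() / numDocs;
            if (percent <= maxDfPercent) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new TreeMap<>();
        docFreq.put("the", 10);   // appears in every document
        docFreq.put("solr", 3);
        docFreq.put("mahout", 1);
        System.out.println(keptTerms(docFreq, 10, 90));
    }
}
```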

From here, one can easily extrapolate how a script could be written to then call Mahout’s other methods, namely things like clustering and Latent Dirichlet Allocation (LDA) for topic modeling.  Alternatively, one could set up a process to watch for changes to the vector and then spawn a process to go and run the appropriate Mahout tasks.

So, what are the caveats with the above approach?

  1. If you are running in a commit-heavy environment, you may not want to run Mahout on every commit.  Mahout is designed for batch processing (well, most of it is, anyway) and most of these jobs are designed to run on Hadoop clusters.  To do that, you would have to modify the above paths, etc., so that the output goes to Hadoop’s HDFS, which I’ll leave as an exercise for the reader (the mathematician in me always enjoys saying that!).
  2. If you are running Solr in a distributed environment, you’re going to have to set things up appropriately on each node.  Hopefully, as the Solr Cloud stuff matures, this will become even simpler and we should be able to do some really smart things to make Mahout and Solr work together in a distributed environment.  For now, you’re on your own.
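One way to soften the first caveat is to put a small throttle between the commit event and the Mahout run, so that the batch job starts at most once per interval. The sketch below shows only the throttling decision, in plain Java. The class name and the idea of calling it from a wrapper around the Mahout script are assumptions for illustration; nothing like it ships with Solr or Mahout.

```java
// Hypothetical illustration only: not part of Solr or Mahout.
public class CommitThrottle {

    private final long minIntervalMillis;
    private boolean hasRun = false;
    private long lastRunMillis = 0L;

    public CommitThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Returns true if the batch job should run for a commit seen at nowMillis,
     * and records that run; returns false if the previous run was too recent.
     */
    public synchronized boolean shouldRun(long nowMillis) {
        if (!hasRun || nowMillis - lastRunMillis >= minIntervalMillis) {
            hasRun = true;
            lastRunMillis = nowMillis;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        CommitThrottle throttle = new CommitThrottle(60_000L); // at most one run per minute
        System.out.println(throttle.shouldRun(System.currentTimeMillis()));
        System.out.println(throttle.shouldRun(System.currentTimeMillis()));
    }
}
```

Passing the timestamp in, rather than reading the clock inside the method, keeps the decision easy to test.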

In the next posting, I’ll look at how we can hook Mahout more closely into the indexing and search process.  As a teaser, think about how you could use Mahout to classify and cluster large volumes of text and then have that information available for things like faceting, discovery and filtering on the search side.

As always, let me know if you have any questions or comments.

References

  1. Mahout In Action by Owen and Anil.  Manning Publications.
  2. Various Solr and Lucene books, all linked via Lucid Imagination.
  3. http://lucene.apache.org/mahout
  4. http://cwiki.apache.org/MAHOUT
  5. Grant’s Blog has a number of articles on Mahout
Category: apache, Hadoop, Lucene, Mahout, Solr, term vectors

20 Responses to “Integrating Apache Mahout with Apache Lucene and Solr – Part I (of 3)”

  1. [...] Lucid Imagination » Integrating Apache Mahout with Apache Lucene and Solr – Part I (of 3) As Apache Mahout is about to release its next version (0.3), I thought I would share some thoughts on how it might be integrated with Apache Lucene and Apache Solr.  For those who aren’t aware of Mahout, it is an ASF project building out a library of machine learning algorithms that are designed to be scalable (often via Apache Hadoop) and licensed under the Apache Software License (i.e., commercially friendly).  Mahout has a variety of algorithms already implemented, ranging from clustering to classification and collaborative filtering. (tags: mahout lucene apache solr tutorial todo oekeleboekie) [...]

    April 22, 2010 14:02 — Webhamer Weblog: Search & ICT-related blogging » links for 2010-04-22

  2. How do I apply the clustering algorithms to the vectors after they are formed, and how do I convert vectors to sparse vectors? Programming-wise, of course.

    May 4, 2010 00:02 — mohamad

  3. Hi Grant,

    Thanks for great topic…

    We are waiting for next part of this topic……

    Could you please tell us when we can see that…

    June 30, 2010 04:48 — Amit

  4. I’ve been working on it, slowly but surely, but unfortunately other more pressing issues have gotten in the way. Hope to get it out soon.

    June 30, 2010 05:05 — Grant Ingersoll

  5. Thanks Grant. Some really interesting possibilities with this combination. I’m really interested in how you can integrate Solr with Mahout’s clustering. Can’t wait to read the next part in the series!

    Matt

    July 27, 2010 02:29 — Matt Mitchell

  6. Hi Grant,

    Thanks for an interesting topic!

    I got the warning “No lucene.vector.props found on classpath…” when running the command below. Can you please advise? Thanks!!!

    /bin/mahout lucene.vector --dir /example/solr/data/index/ --output /tmp/foo/part-out.vec --field title-clustering --idField id --dictOut /tmp/foo/dict.out --norm 2

    WARNING: No lucene.vector.props found on classpath, will use command-line arguments only
    Aug 5, 2010 11:17:40 AM org.slf4j.impl.JCLLoggerAdapter error
    SEVERE: Exception
    org.apache.commons.cli2.OptionException: Unexpected 2 while processing Options
    at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
    at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:125)

    August 5, 2010 10:20 — Khoa

  7. @Khoa: That is a bug. See https://issues.apache.org/jira/browse/MAHOUT-501. If you rename conf/lucenevector.props to conf/lucene.vector.props it will work.

    September 21, 2010 05:18 — Frank Scholten

  8. [...] and will also be releasing Part II of my series on integrating Lucene/Solr with Mahout (part I is here) shortly after I get [...]

    October 31, 2010 13:30 — Lucid Imagination » Apache Mahout 0.4 Released

  9. Hey,

    Is there going to be a part 2 and 3 of this series? It is very interesting.

    Regards,

    Dave

    November 16, 2010 02:57 — David

  10. Thanks for the interesting post. Will surely keep checking for the second part, even though the chances seem slim.

    February 6, 2011 22:30 — Joyce Babu

  11. Hi Grant,
    If you eventually wanted to dump results into Solritas (VelocityResponseWriter), what would the flow of data need to look like? Raw Data->Lucene->Mahout->Solr?

    Thanks,
    Matthew

    February 9, 2011 20:26 — Matthew Sacks

  12. Hi, great post.
    I have a question though: I’m running Mahout through Eclipse and I created the vector from my Lucene index. Two files were created:
    1. The vector file.
    2. The Dict file.

    When running FuzzyKMeans on the vector file, I got a NotANumber exception while the job was parsing it, because the vec file is a ‘compiled’ file. Any ideas how to make it work?

    March 29, 2011 09:41 — Moshe Lichman

  13. I’m having some trouble getting this to work with my own data. I issue the following command:

    mahout lucene.vector --dir /home/markr/shgs/apache-solr-3.4.0/example/solr/data/index/ --output /tmp/part-out.vec --field content_encoded --idField id --dictOut /tmp/dict.out --norm 2

    My intent is to generate term vectors for the content_encoded field whose schema.xml entry has the termVectors="true" attribute setting. There is also a field named ‘id’. My data was imported into a sqlite3 db, and id is ‘not null’, but content_encoded may be null. When I run, I get the SLF4J multiple binding warning (just a warning?), and then the following exception:

    Exception in thread "main" org.apache.lucene.index.CorruptIndexException: unrecognized format -3 in file "_b.fnm"
    at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:351)
    at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:72)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:114)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:92)
    at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:113)
    at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:29)
    at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:81)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:750)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:428)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:288)
    at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:84)
    at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:250)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)

    Advice on how to debug this problem would be greatly appreciated.

    Mark

    October 20, 2011 10:14 — Mark Rosenberg

  14. Hi Mark,

    the issue here is likely a version mismatch between the Lucene version in Mahout and the Lucene version you created your index with. If you sync those up, you should be fine.

    October 20, 2011 12:24 — Grant Ingersoll

  15. Hi Grant,

    Thanks for the quick response! We seem to be in an awkward situation WRT Mahout and Solr Lucene version dependencies. I’m using Mahout 0.6 snapshot, which has a Lucene 3.3.0 dependency. Due to Oracle Java 7 sabotage, Lucene users are being advised to upgrade to 3.4.0. Do I have an alternative to using the Mahout 0.5 release?

    October 20, 2011 13:59 — Mark Rosenberg

  16. You should be fine upgrading Mahout’s version. In fact, we should do it in Mahout. Feel free to open an issue there. Although, the Java 7 issue and the 3.4.0 issue are separate. The 3.4.0 issue was due to a fsync issue in Lucene 3.3.0

    October 20, 2011 22:36 — Grant Ingersoll

  17. I have the same problem; it seems that the Lucene version is out of sync between Solr and Mahout. The question is how exactly do I make them in sync? Mahout has lucene-core-3.1.0.jar in its mahout/lib directory, and I have Solr 3.4. I downloaded the Lucene 3.4 jar files and replaced the Lucene jars inside mahout/lib, but that did not work (it doesn’t seem that Mahout loads those Lucene jars at all). So how do I make sure they use the same Lucene version? I am somewhat new to the Java/Linux world.

    October 27, 2011 07:02 — Bob Stewart

  18. Mahout trunk should now be on Lucene 3.4. In general, if you are replacing the jars, I think you need to make sure they are packaged in to Mahout’s Job jars correctly.

    October 29, 2011 04:26 — Grant Ingersoll

  19. Trying the cookbook example provided by the article with Mahout trunk and Solr 3.4.0. Looks like --field title-clustering doesn’t have enough term vectors so I may be running afoul of https://issues.apache.org/jira/browse/MAHOUT-675.

    11/11/02 14:16:41 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for title-clustering
    Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for title-clustering.

    If I use --field text then mahout completes normally and writes 17 vectors. The recommendation to use copyField to accumulate field contents in a new title-clustering field appears to be mandatory if the article’s mahout command line is to be used without modification.

    November 2, 2011 13:24 — Mark Rosenberg

  20. Hi,

    I am new to Mahout and Lucene. I want to do clustering of users. I have 7 dimensions (features) in my data. I have tried kMeans clustering taking data from CSV. Now I want to get the data from Lucene. I have one question: while converting Lucene documents to vectors, how will it consider dimensions? How should I generate Lucene documents if I want to generate vectors with n dimensions (features)?

    January 12, 2012 01:34 — Janki

Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2011 Lucid Imagination. All Rights Reserved.