September 28, 2009

Solr’s New Clustering Capabilities

Posted by Grant Ingersoll

Introduction

One of the new things in Solr 1.4 that I am particularly excited about is the new document and search results clustering capability.  It is an optional module that lives in Solr’s contrib/clustering directory and was added via SOLR-769.  The module is designed to let people either use the existing clustering capabilities (currently only search result clustering is offered, via Carrot2) or plug in their own.  While some of the public APIs still need to be hashed out through multiple implementations and as they relate to whole document collections, I thought I would share a quick getting-started guide to search result clustering with Carrot2, since it is included in Solr and easy to get up and running.

Background

Clustering is an unsupervised learning task that attempts to aggregate related content together without any a priori knowledge about the content.  It is very similar to faceting (some people call it dynamic faceting), but works in a less structured manner.  Clustering algorithms often look at the features of the text (important words, etc.) and use them to determine similarity.  Most implementations define a notion of distance between any two documents and then use that to determine which documents are similar to one another.  Popular algorithms include hierarchical and k-Means clustering.  For more information and to see several implementations of clustering in action, I’d encourage you to check out Apache Mahout.
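
To make the notion of distance between two documents concrete, here is a minimal sketch that turns text into term counts and computes a cosine distance between them.  It is only an illustration: the class and method names are made up for this post, and neither Carrot2 nor Mahout is claimed to work this way internally.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy illustration of document distance; not the code Carrot2 or Mahout use. */
class CosineDistanceSketch {

  /** Counts lower-cased, whitespace-separated terms in a piece of text. */
  static Map<String, Integer> termCounts(String text) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
      if (!term.isEmpty()) {
        Integer seen = counts.get(term);
        counts.put(term, seen == null ? 1 : seen + 1);
      }
    }
    return counts;
  }

  /** Returns 1 - cosine similarity; 1.0 means the texts share no terms. */
  static double cosineDistance(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0.0;
    for (Map.Entry<String, Integer> entry : a.entrySet()) {
      Integer other = b.get(entry.getKey());
      if (other != null) {
        dot += entry.getValue() * other;
      }
    }
    double norms = Math.sqrt(sumOfSquares(a)) * Math.sqrt(sumOfSquares(b));
    // An empty document has no direction, so treat it as maximally distant
    return norms == 0.0 ? 1.0 : 1.0 - dot / norms;
  }

  private static double sumOfSquares(Map<String, Integer> counts) {
    double sum = 0.0;
    for (int count : counts.values()) {
      sum += (double) count * count;
    }
    return sum;
  }

  public static void main(String[] args) {
    Map<String, Integer> first = termCounts("naive bayes text classification");
    Map<String, Integer> second = termCounts("naive bayes spam filtering");
    Map<String, Integer> third = termCounts("spelling correction survey");
    System.out.println("first vs second: " + cosineDistance(first, second));
    System.out.println("first vs third:  " + cosineDistance(first, third));
  }
}
```

The first pair shares two terms and so comes out closer than the second pair, which shares none.  A clustering algorithm such as k-Means would use a measure like this to decide which documents belong together.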

Getting Started

The first thing to do to get started is get the code.  There are a number of ways to do this, but I just like SVN on the command line, as in:

svn co  https://svn.apache.org/repos/asf/lucene/solr/trunk

Next, you can switch into the trunk directory and build everything:

ant build-contrib

This is an important step because some of Carrot2’s libraries are LGPL-licensed and so cannot be included in the Apache SVN repository.  The build-contrib Ant target automatically downloads the necessary libraries.

Once built, I need to copy the clustering libraries into the lib directory of my Solr home (called solr-clustering), as in:

cp <SOLR_HOME>/contrib/clustering/lib/*.jar <PATH TO HOME>/solr-clustering/lib/
cp <SOLR_HOME>/contrib/clustering/build/apache-solr-clustering-1.4-dev.jar <PATH TO HOME>/solr-clustering/lib/
cp <SOLR_HOME>/contrib/clustering/lib/downloads/*.jar <PATH TO HOME>/solr-clustering/lib/

I also added Solr Cell (the Apache Tika integration) so that I can easily load some content to cluster:

cp <SOLR_HOME>/contrib/extraction/build/apache-solr-cell-1.4-dev.jar <PATH TO HOME>/solr-clustering/lib/
cp <SOLR_HOME>/contrib/extraction/lib/* <PATH TO HOME>/solr-clustering/lib/

For this example, the pertinent parts of my schema are:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true" stored="true"/>
<field name="comments" type="text" indexed="true" stored="true"/>
<field name="author" type="textgen" indexed="true" stored="true"/>
<field name="keywords" type="textgen" indexed="true" stored="true"/>
<field name="category" type="textgen" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="text" type="text" indexed="true" stored="true" multiValued="true"/>

And my Solr config has:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <!--
    <int name="rows">10</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
    -->
    <!-- Left commented out: clustering is requested per query with clustering=true -->
    <!--<bool name="clustering">true</bool>-->
    <str name="clustering.engine">default</str>
    <bool name="clustering.results">true</bool>
    <!-- The title field -->
    <str name="carrot.title">title</str>
    <str name="carrot.url">id</str>
    <!-- The field to cluster on -->
    <str name="carrot.snippet">text</str>
    <!-- produce summaries -->
    <bool name="carrot.produceSummary">true</bool>
    <!-- the maximum number of labels per cluster -->
    <!--<int name="carrot.numDescriptions">5</int>-->
    <!-- produce sub clusters -->
    <bool name="carrot.outputSubClusters">false</bool>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>

<searchComponent name="clustering" class="org.apache.solr.handler.clustering.ClusteringComponent">
  <!-- Declare an engine -->
  <lst name="engine">
    <!-- The name, only one can be named "default" -->
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
  </lst>
  <lst name="engine">
    <str name="name">stc</str>
    <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
  </lst>
</searchComponent>

Finally, I need to fire up Solr:

cd <SOLR_HOME>/example
java -Dsolr.solr.home=<PATH TO HOME>/solr-clustering -Dsolr.data.dir=<PATH TO HOME>/solr-clustering/data -jar start.jar

Now I need some documents.  In this case, I have a bunch of PDF files that I keep organized in Mekentosj’s excellent PDF library organizer, Papers (Mac only), and I want to index them.  The code for this (it’s just a quick little hack) is in Appendix A at the bottom of the post.  I point it at my directory and off it goes.  When it’s done, I have 91 documents in my index.  I then ran some basic searches to make sure I get decent results for a few queries.  From here, all I need to do is tell Solr to cluster the results:

http://localhost:8983/solr/select/?q=*:*&fl=title,score,id&version=2.2&start=0&rows=100&indent=on&clustering=true

Notice that I added the &clustering=true parameter on the end and set &rows to 100.  The clustering=true parameter turns on the clustering component, which then hands the work off to Carrot2 using the parameters defined in my request handler.  Carrot2 is an in-memory clustering engine, and the implementation is designed to cluster only the top results returned (here, up to 100), not necessarily all the results that matched.
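
If you would rather build that request in code than type it by hand, a minimal sketch using only the JDK might look like the following.  The class name ClusteringQueryUrl and its build method are hypothetical helpers for this post, not part of Solr or SolrJ; the parameter names (q, fl, rows, clustering, clustering.engine) are the ones used in the request and config above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that assembles a clustering request URL for the select handler. */
class ClusteringQueryUrl {

  /**
   * Builds the URL. Pass null for engine to fall back to the engine
   * configured as the default in the request handler.
   */
  static String build(String solrUrl, String query, int rows, String engine) {
    // LinkedHashMap keeps the parameters in insertion order
    Map<String, String> params = new LinkedHashMap<String, String>();
    params.put("q", query);
    params.put("fl", "title,score,id");
    params.put("rows", String.valueOf(rows));
    params.put("clustering", "true");
    if (engine != null) {
      params.put("clustering.engine", engine);
    }

    StringBuilder url = new StringBuilder(solrUrl).append("/select?");
    boolean first = true;
    for (Map.Entry<String, String> param : params.entrySet()) {
      if (!first) {
        url.append('&');
      }
      url.append(encode(param.getKey())).append('=').append(encode(param.getValue()));
      first = false;
    }
    return url.toString();
  }

  private static String encode(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always available", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(build("http://localhost:8983/solr", "*:*", 100, null));
  }
}
```

Because only the top rows are clustered, the rows argument is effectively the knob that controls how many documents Carrot2 sees.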

In the case of the request above, some of my results look like:

<arr name="clusters">
 <lst>
  <arr name="labels">
	<str>Naive Bayesian</str>
  </arr>
  <arr name="docs">
	<str>/Users/grantingersoll/Documents/Papers/1996/Friedman/Proceedings of the Thirteenth National Conference on … 1996 Friedman.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/1998/McCallum/AAAI-98 Workshop on Learning for Text Categorization 1998 McCallum.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2000/Androutsopoulos/Arxiv preprint cs.CL 2000 Androutsopoulos.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2002/Sebastiani/ACM Computing Surveys (CSUR) 2002 Sebastiani.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2003/Unknown/2003.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2006/Unknown/2006-17.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2006/Unknown/2006-9.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2007/Unknown/2007.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2008/Graham-Cumming/2008 Graham-Cumming.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2008/McCullagh/Bayesian Analysis 2008 McCullagh.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/Unknown/Marmanis/Marmanis.pdf</str>
  </arr>
 </lst>
 <lst>
  <arr name="labels">
	<str>Semantic Distance of the Component Nodes</str>
  </arr>
  <arr name="docs">
	<str>/Users/grantingersoll/Documents/Papers/1992/Kukich/ACM computing surveys 1992 Kukich.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2002/Sebastiani/ACM Computing Surveys (CSUR) 2002 Sebastiani.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2006/Unknown/2006-10.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2006/Unknown/2006-12.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2006/Unknown/2006-2.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2006/Unknown/2006-5.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2006/Unknown/2006-8.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2006/Unknown/2006-9.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/2008/E. Bernard/2008 E. Bernard.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/Unknown/Marmanis/Marmanis.pdf</str>
	<str>/Users/grantingersoll/Documents/Papers/Unknown/Rennie/Rennie.pdf</str>
  </arr>
 </lst>

You can see Carrot2 provides a label and then a list of the ids that fit under that label.
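
To consume that output from a client, something along these lines should work.  It is a sketch using the JDK’s built-in DOM parser rather than SolrJ; the class name ClusterResponseReader is made up for illustration, and it assumes sub-clusters are switched off (carrot.outputSubClusters=false, as in my config), so that every lst under the clusters array is one flat cluster.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical reader that maps each cluster label to the document ids under it. */
class ClusterResponseReader {

  static Map<String, List<String>> read(String xml) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse response", e);
    }
    Map<String, List<String>> clusters = new LinkedHashMap<String, List<String>>();
    NodeList arrays = doc.getElementsByTagName("arr");
    for (int i = 0; i < arrays.getLength(); i++) {
      Element arr = (Element) arrays.item(i);
      if (!"clusters".equals(arr.getAttribute("name"))) {
        continue;
      }
      // Each <lst> is one cluster (assumes no nested sub-clusters)
      NodeList lists = arr.getElementsByTagName("lst");
      for (int j = 0; j < lists.getLength(); j++) {
        Element cluster = (Element) lists.item(j);
        String label = String.join(", ", strings(cluster, "labels"));
        clusters.put(label, strings(cluster, "docs"));
      }
    }
    return clusters;
  }

  /** Collects the <str> values inside the named <arr> of one cluster. */
  private static List<String> strings(Element cluster, String arrName) {
    List<String> values = new ArrayList<String>();
    NodeList arrays = cluster.getElementsByTagName("arr");
    for (int i = 0; i < arrays.getLength(); i++) {
      Element arr = (Element) arrays.item(i);
      if (arrName.equals(arr.getAttribute("name"))) {
        NodeList strs = arr.getElementsByTagName("str");
        for (int j = 0; j < strs.getLength(); j++) {
          values.add(strs.item(j).getTextContent().trim());
        }
      }
    }
    return values;
  }

  public static void main(String[] args) {
    // A trimmed-down stand-in for the response shown above
    String xml = "<response><arr name=\"clusters\"><lst>"
        + "<arr name=\"labels\"><str>Naive Bayesian</str></arr>"
        + "<arr name=\"docs\">"
        + "<str>/Users/grantingersoll/Documents/Papers/2003/Unknown/2003.pdf</str>"
        + "<str>/Users/grantingersoll/Documents/Papers/2007/Unknown/2007.pdf</str>"
        + "</arr></lst></arr></response>";
    for (Map.Entry<String, List<String>> cluster : read(xml).entrySet()) {
      System.out.println(cluster.getKey() + " -> " + cluster.getValue().size() + " docs");
    }
  }
}
```

Note that the same document can show up under more than one label, as 2006-9.pdf and Marmanis.pdf do in my results, so the map values are not disjoint.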

From here, I can play around with other options, such as trying out the STC algorithm:

http://localhost:8983/solr/select/?q=*:*&fl=title,score,id&version=2.2&start=0&rows=100&indent=on&clustering=true&clustering.engine=stc

What’s Next?

While I don’t have a specific roadmap for clustering support, I can see a few things that would be interesting:

  1. Whole collection clustering – Using a background process, cluster all the documents in the entire index using something like Apache Mahout.
  2. Clusters -> Filters – Take the docs in each cluster, create filters out of them, and store them in the filter cache under a name.  Future queries could then be restricted to searching one or more clusters.
  3. Implement other algorithms.
  4. Take a deeper look at performance – Carrot2 is pretty fast, but maybe more profiling, etc. can be done to speed things up even more.
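
None of this exists yet, but the Clusters -> Filters idea can be approximated on the client side today.  The sketch below is only that: a hypothetical helper (ClusterFilterSketch is not a Solr class) that turns the ids in one cluster into a query string you could pass as an fq parameter.  It does not cover naming the filter or storing it in the filter cache, which is the part that would need real support inside Solr.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical client-side stand-in for the "Clusters -> Filters" idea. */
class ClusterFilterSketch {

  /** Wraps a value in double quotes, escaping backslashes and quotes inside it. */
  static String quote(String value) {
    return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  /** Builds a query such as id:("a" OR "b") that matches only the given documents. */
  static String toFilterQuery(String idField, List<String> docIds) {
    if (docIds.isEmpty()) {
      throw new IllegalArgumentException("A cluster filter needs at least one document id");
    }
    StringBuilder fq = new StringBuilder(idField).append(":(");
    for (int i = 0; i < docIds.size(); i++) {
      if (i > 0) {
        fq.append(" OR ");
      }
      fq.append(quote(docIds.get(i)));
    }
    return fq.append(')').toString();
  }

  public static void main(String[] args) {
    List<String> cluster = Arrays.asList(
        "/Users/grantingersoll/Documents/Papers/2003/Unknown/2003.pdf",
        "/Users/grantingersoll/Documents/Papers/2007/Unknown/2007.pdf");
    System.out.println(toFilterQuery("id", cluster));
  }
}
```

One caveat with this approach: a very large cluster produces a very large boolean query, which can run into the maxBooleanClauses limit set in solrconfig.xml.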

Appendix A

My indexing code:

package com.grantingersoll.noodles.solr;
 
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
 
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.net.MalformedURLException;
 
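/**
 * Recursively walks a directory and sends every .pdf and .doc file it finds
 * to Solr Cell (the /update/extract handler), using the file's absolute path as its id.
 **/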
public class SimpleFileIndexer {
  protected SolrServer server;
 
  public SimpleFileIndexer() throws MalformedURLException {
    server = new CommonsHttpSolrServer("http://localhost:8983/solr");
  }
 
  public long crawl(File input) throws IOException, SolrServerException {
    long result = 0;
    if (input.isDirectory()) {
 
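      // accept() receives the parent directory and the entry name, not the entry itself
      File[] files = input.listFiles(new FilenameFilter() {
        public boolean accept(File dir, String name) {
          return name.endsWith(".pdf") || name.endsWith(".doc") || new File(dir, name).isDirectory();
        }
      });
      if (files == null) {
        return result; // the directory could not be read
      }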
      for (int i = 0; i < files.length; i++) {
        File file = files[i];
        result += crawl(file);
      }
    } else {
      String name = input.getName();
      if (name.endsWith(".pdf") || name.endsWith(".doc")){
        System.out.println("Adding: " + input);
        ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest("/update/extract");
        csur.addFile(input);
        csur.setParam("literal.id", input.getAbsolutePath());
        try {
          server.request(csur);
          result++;
        } catch (Exception e){
          System.err.println("Couldn't add: " + input);
        }
 
      }
    }
 
    //autocommit is on
    return result;
  }
 
  public SolrServer getServer() {
    return server;
  }
 
  public static void main(String[] args) throws IOException, SolrServerException {
    File dir = new File(args[0]);
    if (dir.exists()) {
      SimpleFileIndexer idxr = new SimpleFileIndexer();
      long count = idxr.crawl(dir);
      idxr.getServer().commit();
      System.out.println("Crawled: " + count + " documents.");
    } else {
      System.err.println("Input file or dir does not exist: " + args[0]);
    }
  }
 
}
Category: Lucene, Solr

One Response to “Solr’s New Clustering Capabilities”

  1. Hi,

    Thanks for the tutorial!
    For me it only works after adding the “class” parameter to the solrconfig file:

    December 10, 2009 01:37 — Roxana
