<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; Mahout</title>
	<atom:link href="http://www.lucidimagination.com/blog/tag/mahout/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>SF Bay Area Apache Mahout User Meeting on Nov. 29</title>
		<link>http://www.lucidimagination.com/blog/2011/11/05/sf-bay-area-apache-mahout-user-meeting-on-nov-29/</link>
		<comments>http://www.lucidimagination.com/blog/2011/11/05/sf-bay-area-apache-mahout-user-meeting-on-nov-29/#comments</comments>
		<pubDate>Sat, 05 Nov 2011 14:42:16 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[lucid imagination]]></category>
		<category><![CDATA[MapR]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4416</guid>
		<description><![CDATA[[ Tuesday, 29 November 2011; 18:30 to 21:30. ] <p>For all of those interested in Apache Mahout and scalable machine learning, Lucid Imagination is hosting a Mahout Users Meeting at its new office in Redwood City on Nov. 29th. Doors open at 6:30 pm. The night will feature two speakers, Ted Dunning of <a href="http://www.mapr.com">MapR Technologies</a> and Grant Ingersoll of <a href="http://www.lucidimagination.com">Lucid Imagination</a>, along with a social gathering with food and drinks.</p>
<p>For more details and &#8230;</p>]]></description>
			<content:encoded><![CDATA[[ Tuesday, 29 November 2011; 18:30 to 21:30. ] <p>For all of those interested in Apache Mahout and scalable machine learning, Lucid Imagination is hosting a Mahout Users Meeting at its new office in Redwood City on Nov. 29th. Doors open at 6:30 pm. The night will feature two speakers, Ted Dunning of <a href="http://www.mapr.com">MapR Technologies</a> and Grant Ingersoll of <a href="http://www.lucidimagination.com">Lucid Imagination</a>, along with a social gathering with food and drinks.</p>
<p>For more details and to RSVP, please see <a href="http://sf-mahout-11-11.eventbrite.com/">http://sf-mahout-11-11.eventbrite.com/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/11/05/sf-bay-area-apache-mahout-user-meeting-on-nov-29/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Mahout in Action Review</title>
		<link>http://www.lucidimagination.com/blog/2011/10/15/mahout-in-action-review/</link>
		<comments>http://www.lucidimagination.com/blog/2011/10/15/mahout-in-action-review/#comments</comments>
		<pubDate>Sat, 15 Oct 2011 13:13:18 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Grant Ingersoll]]></category>
		<category><![CDATA[Mahout in Action]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4317</guid>
		<description><![CDATA[<p>You know your (technical) <a href="https://cwiki.apache.org/confluence/display/MAHOUT/MahoutName">baby</a> is (almost) grown up when the book on the project finally comes out.  Such is the case for Apache Mahout, thanks to <a href="http://www.manning.com">Manning Publications</a> shipping <a href="http://affiliate.manning.com/idevaffiliate.php?id=1141_219">Mahout in Action</a> this week.</p>
<p><img src="http://manning.com/owen/owen_cover150.jpg" alt="" width="150" height="187" class="alignright" float="right" />So, before I start into my review, let me first say congratulations to Sean, Robin, Ted, Ellen and Manning for producing such an excellent product.   The simplest praise I can give it is to put it on the same &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>You know your (technical) <a href="https://cwiki.apache.org/confluence/display/MAHOUT/MahoutName">baby</a> is (almost) grown up when the book on the project finally comes out.  Such is the case for Apache Mahout, thanks to <a href="http://www.manning.com">Manning Publications</a> shipping <a href="http://affiliate.manning.com/idevaffiliate.php?id=1141_219">Mahout in Action</a> this week.</p>
<p><img src="http://manning.com/owen/owen_cover150.jpg" alt="" width="150" height="187" class="alignright" float="right" />So, before I start into my review, let me first say congratulations to Sean, Robin, Ted, Ellen and Manning for producing such an excellent product.   The simplest praise I can give it is to put it on the same level as one of the best intro to technology books I know:  <a href="http://www.manning.com/affiliate/idevaffiliate.php?id=1071_147">Lucene In Action</a>.  In other words, it sets the standard by which all other Mahout books will be judged.<br />
<br />
As for the actual book, it is broken down into 3 sections, which I like to call the &#8220;three C&#8217;s&#8221;:</p>
<ol>
<li>Collaborative Filtering</li>
<li>Clustering</li>
<li>Classification</li>
</ol>
<p>So, without further ado, let&#8217;s take a deeper look at the book in this context of the three C&#8217;s.</p>
<h2>Collaborative Filtering</h2>
<p>Collaborative Filtering is one of the most popular parts of Mahout, being used in places like <a href="https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout">Amazon and Foursquare</a>.  This section of the book, via 5 chapters, walks you nicely through both the concepts and the practical aspects of collaborative filtering.  Chapter 2 starts by getting you up and running using the <a href="http://www.grouplens.org/">GroupLens</a> dataset for movie recommendations.  For those unfamiliar with collaborative filtering, this makes for a nice entrance into the subject with data everyone can relate to easily.  Chapter 3 then discusses how to best model your data, while chapter 4 looks at the mechanics of actually generating recommendations from the data.</p>
<p>Chapters 5 and 6 then discuss the ins and outs of taking a recommendation engine into production, including details on how to scale it out using Apache Hadoop.  I found the explanation of the Hadoop-based co-occurrence process (via RecommenderJob) especially useful, as I recently committed <a href="https://issues.apache.org/jira/browse/MAHOUT-798">MAHOUT-798</a>, which uses it to build an example recommendation system based on user interaction with email.  In fact, I relied heavily on all of the concepts in this part of the book, as I first had to extract and clean the data, then properly model it before finally running the recommendation task on EC2.</p>
<p>When I first got access to the MEAP for this book (quite some time ago), I did not have a lot of background in collaborative filtering, and these chapters really helped fill in the practical details for me and provided a good foundation for the theoretical aspects behind collaborative filtering.  I think they will serve others well who are looking to get started with the subject.</p>
<h2>Clustering</h2>
<p>Similar to collaborative filtering, the clustering section starts off by introducing the basic concepts and then quickly gets you up and running with an example clustering run.  Chapter 8 then gets into how best to do feature selection for clustering.  Feature selection is often one of the keys to successful clustering, so make sure you have a good grasp on the contents of the chapter before moving ahead into chapter 9, which gets into some of Mahout&#8217;s clustering algorithms.  That chapter primarily focuses on K-Means and Dirichlet, but also covers a few others.  Note, Mahout actually has a few other algorithms for clustering than the ones described, like spectral, canopy, meanshift and minhash.  Of course, some of these were added later in the book cycle, so it is hard to complain that they weren&#8217;t incorporated. Chapter 10 then covers, in my experience, one of the harder aspects of clustering, namely how to evaluate the results.  This chapter is a little bit thin, but it seems the overall field is the same, so this is not a put-down on the chapter!  There simply aren&#8217;t a lot of great tools available for evaluating clustering.</p>
<p>Chapter 11 then adds some meat onto the bones of taking clustering to production, including information on leveraging clustering in a Hadoop cluster.  Chapter 12 adds some nice concreteness to the sections by looking at clustering of real data sets from <a href="http://www.twitter.com">Twitter</a>, <a href="http://last.fm">Last.fm</a> and <a href="http://www.stackoverflow.com">Stack Overflow</a>.  For those looking to kick the tires with some real data, be sure to check out that chapter.</p>
<h2>Classification</h2>
<p>Classification is very popular these days both in search and beyond, so it is great to see this set of chapters covering the topic so well in practical, accessible terms.  As you would expect, the first chapter (13) gets you up and running as well as introduces the concepts of classification.  This chapter has a great explanation of how classification works and a typical workflow for building a classifier.</p>
<p>Chapter 14 then delves into the details of actually training a classifier using Mahout&#8217;s Stochastic Gradient Descent algorithm as well as its Bayesian classifier.</p>
<p><img src="http://1.bp.blogspot.com/_t0NJvKaO1dI/SjXBGm0DCpI/AAAAAAAAD1M/ISwdVEi7dt4/s400/potatosalad.jpg" alt="" width="243" height="320" class="alignright" float="right" />The next chapter then takes a look at how best to evaluate a classifier as well as some insight into what happens when a classifier goes bad.  Be sure to check this out, as you will no doubt run into many of the issues covered.  As an aside, I couldn&#8217;t help thinking of the classic &#8220;Far Side&#8221; cartoon to the right upon reading that section heading.  The penultimate classification chapter digs into the practical aspects of deploying a classifier in production, including details on working through your scale and speed requirements.  It finishes off with an example Apache Thrift based server which some may find a useful starting point for their applications.  Finally, Mahout in Action finishes off with a Case Study of how <a href="http://www.shopittome.com">Shop It To Me</a> uses a Mahout classifier to provide recommendations of offers to customers.  As with any technical book, it is great to have some concrete discussion of how this stuff functions in the wild.</p>
<h2>What&#8217;s Missing (i.e. When&#8217;s the 2nd edition coming out?)</h2>
<p>Mahout has a number of other interesting things that are in various stages of development, like frequent pattern mining, Singular Value Decomposition (feature reduction), evolutionary programming, integration libraries for input/output as well as tools for storing data in Cassandra and Mongo.  Since Mahout is developing pretty quickly, the absence of these from the book is no fault of the authors; I&#8217;m just putting it up here so that people are aware that Mahout has more to offer, even if the three &#8220;C&#8217;s&#8221; are the most popular.</p>
<p>All in all, Mahout in Action is an excellent introduction to the project.  Naturally I&#8217;m biased, but, pun intended, I highly recommend the book!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/10/15/mahout-in-action-review/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Apache Mahout 0.2 Released</title>
		<link>http://www.lucidimagination.com/blog/2009/11/18/apache-mahout-0-2-released/</link>
		<comments>http://www.lucidimagination.com/blog/2009/11/18/apache-mahout-0-2-released/#comments</comments>
		<pubDate>Wed, 18 Nov 2009 13:37:39 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Apache Mahout]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1344</guid>
		<description><![CDATA[<p>I just sent out the Apache Mahout 0.2 release announcement.  Here&#8217;s a copy:</p>
<blockquote><p>Apache Mahout 0.2 has been released and is now available for public<br />
download at http://www.apache.org/dyn/closer.cgi/lucene/mahout</p>
<p>Apache Mahout is a subproject of Apache Lucene with the goal<br />
of delivering scalable machine learning algorithm implementations<br />
under the Apache license. http://www.apache.org/licenses/LICENSE-2.0<br />
Scale in terms of computation to the<br />
size of data you manage today.  Scale in terms of community to support anyone<br />
interested in using </p>&#8230;</blockquote>]]></description>
			<content:encoded><![CDATA[<p>I just sent out the Apache Mahout 0.2 release announcement.  Here&#8217;s a copy:</p>
<blockquote><p>Apache Mahout 0.2 has been released and is now available for public<br />
download at http://www.apache.org/dyn/closer.cgi/lucene/mahout</p>
<p>Apache Mahout is a subproject of Apache Lucene with the goal<br />
of delivering scalable machine learning algorithm implementations<br />
under the Apache license. http://www.apache.org/licenses/LICENSE-2.0<br />
Scale in terms of computation to the<br />
size of data you manage today.  Scale in terms of community to support anyone<br />
interested in using machine learning. Scale<br />
in terms of business by providing the library under a commercially<br />
friendly, free software license.</p>
<p>Built on top of the powerful map/reduce paradigm of the Apache Hadoop<br />
project, Mahout&#8217;s goal is to solve popular machine learning problems<br />
like clustering, collaborative filtering and classification<br />
over extremely large data sets over thousands of computers.</p>
<p>Up to date maven artifacts can be found in the Apache repository at</p>
<p>https://repository.apache.org/content/repositories/releases/org/apache/mahout/</p>
<p>The complete changelist can be found here:</p>
<p>http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278</p>
<p>New Mahout 0.2 features include</p>
<p>- Major performance enhancements in Collaborative Filtering,<br />
Classification and Clustering<br />
- New: Latent Dirichlet Allocation(LDA) implementation for topic<br />
modelling<br />
- New: Frequent Itemset Mining for mining top-k patterns from a list<br />
of transactions<br />
- New: Decision Forests implementation for Decision Tree classification<br />
(In Memory &amp; Partial Data)<br />
- New: HBase storage support for Naive Bayes model building and<br />
classification<br />
- New: Generation of vectors from Text documents for use with Mahout<br />
Algorithms<br />
- Performance improvements in various Vector implementations<br />
- Tons of bug fixes and code cleanup</p>
<p>Getting started: New to Mahout?</p>
<p>1) Download Mahout at http://www.apache.org/dyn/closer.cgi/lucene/mahout<br />
2) Check out the Quick start:</p>
<p>http://cwiki.apache.org/MAHOUT/quickstart.html</p>
<p>3) Read the Mahout Wiki: http://cwiki.apache.org/MAHOUT<br />
4) Join the community by subscribing to mahout-user@lucene.apache.org<br />
5) Give back: http://www.apache.org/foundation/getinvolved.html (optional, but much appreciated!)<br />
6) Consider adding yourself to the Powered By wiki page:</p>
<p>http://cwiki.apache.org/MAHOUT/poweredby.html</p>
<p>For more information on Apache Mahout, see</p>
<p>http://lucene.apache.org/mahout</p></blockquote>
<p>For those wanting to read more, check out my IBM developerWorks article on <a href="https://www.ibm.com/developerworks/java/library/j-mahout/index.html">Apache Mahout</a>, as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/11/18/apache-mahout-0-2-released/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Thoughts on Efficiency of Enterprise Search on eWeek.com</title>
		<link>http://www.lucidimagination.com/blog/2009/07/16/thoughts-on-efficiency-of-enterprise-search-on-eweekcom/</link>
		<comments>http://www.lucidimagination.com/blog/2009/07/16/thoughts-on-efficiency-of-enterprise-search-on-eweekcom/#comments</comments>
		<pubDate>Thu, 16 Jul 2009 15:11:39 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Enterprise Search]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[OpenNLP]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=796</guid>
		<description><![CDATA[<p>eWeek.com recently posted a <a href="http://www.eweek.com/c/a/Search-Engines/How-to-Improve-the-Efficiency-of-Enterprise-Search/">nice article</a> by Dr. Yves Schabes, founder of <a href="http://www.teragram.com/">Teragram</a>, on how to make enterprise search better through some higher order processing techniques like metadata generation, applying taxonomies, etc. and doing <a href="http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-Search">relevance testing</a> on a regular basis.  Naturally, this got me thinking about all the different ways this relates to the Apache Lucene ecosystem (<a href="http://lucene.apache.org/java">Lucene</a>, <a href="http://lucene.apache.org/solr">Solr</a>, <a href="http://lucene.apache.org/mahout">Mahout</a>, <a href="http://lucene.apache.org/tika">Tika</a>, etc.) and Lucid Imagination.</p>
<p>First, by choosing an &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>eWeek.com recently posted a <a href="http://www.eweek.com/c/a/Search-Engines/How-to-Improve-the-Efficiency-of-Enterprise-Search/">nice article</a> by Dr. Yves Schabes, founder of <a href="http://www.teragram.com/">Teragram</a>, on how to make enterprise search better through some higher order processing techniques like metadata generation, applying taxonomies, etc. and doing <a href="http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-Search">relevance testing</a> on a regular basis.  Naturally, this got me thinking about all the different ways this relates to the Apache Lucene ecosystem (<a href="http://lucene.apache.org/java">Lucene</a>, <a href="http://lucene.apache.org/solr">Solr</a>, <a href="http://lucene.apache.org/mahout">Mahout</a>, <a href="http://lucene.apache.org/tika">Tika</a>, etc.) and Lucid Imagination.</p>
<p>First, by choosing an open backbone like Lucene and Solr, you are free to plug in the best tool for the job; proprietary solutions often limit you to their own tools and their implementation.  Let&#8217;s face it, we can&#8217;t be good at everything, so it makes sense to be able to plug in the best of breed for something that isn&#8217;t a core competency.  For example, one could choose <a href="http://opennlp.sourceforge.net/">OpenNLP</a> or Teragram or any other commercial vendor for these capabilities.  Solr, especially, makes it simple to plug in these capabilities through its well-defined plugin architecture.  (By the way, for almost every capability out there in this realm, there is an open source alternative that warrants investigation.)</p>
<p>Second, intelligent search&#8211;in other words, search that goes beyond simple keyword capabilities&#8211;is the leading edge of the field and is being adopted in more and more products, just as Dr. Schabes recommends.  Whether it is intelligent query parsing, better faceting and discovery capabilities or integration with natural language processing (NLP) tools for NER (Named Entity Recognition), sentiment analysis and relationship discovery, the companies making a difference in search are those that intelligently bring together a variety of approaches to solve the problem at hand.  I believe Lucene, Solr and open source are uniquely positioned to fuel intelligent search because they drive down the <a href="http://www.lucidimagination.com/blog/2009/04/20/lucene-open-source-and-the-cost-of-experimentation/">cost of experimentation</a>.  That matters because it takes effort to get this stuff right, much of it due to the need to understand your domain and how to translate it into a good user experience.  Furthermore, open source lets you cost-effectively fill in your infrastructure and conserve your precious resources for your core competencies.  Why would you pay millions of dollars for a search engine that implements a <a href="http://en.wikipedia.org/wiki/Vector_space_model">vector space retrieval model</a> (which most of the commercial vendors do) when you can get the same thing from Lucene for free?  If you suspect that Lucene isn&#8217;t as good, think again; there&#8217;s a reason it is used at the likes of Apple, AOL, Comcast, CNET, Viacom and thousands of others.
If you like bells and whistles and knowing there is a company behind your chosen solution, I&#8217;ll do you three better:  with Lucene and Solr you not only get 1) a <a href="http://www.lucidimagination.com">company that offers support, training, professional services, and bells and whistles</a>, you also get 2) the very large Apache community of users who constantly use/test/fix/improve the software and 3) all of the source code, completely unencumbered, so you are free to change it as you see fit.</p>
<p>Finally, you get to choose whether you even need a particular capability.  On more than one occasion, I have been involved in replacement of a proprietary search package so bloated with unused add-ons that the Solr installation, containing only the required functions, needed an index that was a mere fraction of the size of the proprietary solution, resulting in:</p>
<ul>
<li>Less hardware to achieve the same throughput</li>
<li>Lower operations costs (more hardware = more hardware failures)</li>
<li>Faster indexing, faster queries, etc.</li>
</ul>
<p>In short, Lucene and Solr offer a cost-effective and fully capable mechanism for improving the efficiency of search along the lines of the approach Dr. Schabes recommends, giving you the freedom to choose based on your idea of what works, not someone else&#8217;s.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/07/16/thoughts-on-efficiency-of-enterprise-search-on-eweekcom/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

