<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; Mahout</title>
	<atom:link href="http://www.lucidimagination.com/blog/category/mahout/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Apache Mahout: Scalable machine learning for everyone</title>
		<link>http://www.lucidimagination.com/blog/2011/11/08/apache-mahout-scalable-machine-learning-for-everyone/</link>
		<comments>http://www.lucidimagination.com/blog/2011/11/08/apache-mahout-scalable-machine-learning-for-everyone/#comments</comments>
		<pubDate>Tue, 08 Nov 2011 17:46:26 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Mahout]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4420</guid>
		<description><![CDATA[<p>My most recent article on Mahout is up at IBM developerWorks.  It is titled <a href="http://www.ibm.com/developerworks/java/library/j-mahout-scaling/index.html">Apache Mahout: Scalable machine learning for everyone</a> and is designed to walk you through using Mahout on a real email data set with Hadoop and EC2.  It also gets you up to speed on some of the new things in Mahout since <a href="http://www.ibm.com/developerworks/library/j-mahout/">I last wrote on the subject for developerWorks</a>.</p>
<p>Note, I will also be giving a talk &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>My most recent article on Mahout is up at IBM developerWorks.  It is titled <a href="http://www.ibm.com/developerworks/java/library/j-mahout-scaling/index.html">Apache Mahout: Scalable machine learning for everyone</a> and is designed to walk you through using Mahout on a real email data set with Hadoop and EC2.  It also gets you up to speed on some of the new things in Mahout since <a href="http://www.ibm.com/developerworks/library/j-mahout/">I last wrote on the subject for developerWorks</a>.</p>
<p>Note, I will also be giving a talk on the subject, at the <a href="http://www.lucidimagination.com/blog/2011/11/05/sf-bay-area-apache-mahout-user-meeting-on-nov-29/">San Francisco Bay Area Mahout User Meeting on Nov. 29th</a>.  Hope to see you there!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/11/08/apache-mahout-scalable-machine-learning-for-everyone/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>SF Bay Area Apache Mahout User Meeting on Nov. 29</title>
		<link>http://www.lucidimagination.com/blog/2011/11/05/sf-bay-area-apache-mahout-user-meeting-on-nov-29/</link>
		<comments>http://www.lucidimagination.com/blog/2011/11/05/sf-bay-area-apache-mahout-user-meeting-on-nov-29/#comments</comments>
		<pubDate>Sat, 05 Nov 2011 14:42:16 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[lucid imagination]]></category>
		<category><![CDATA[MapR]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4416</guid>
		<description><![CDATA[[ Tuesday, 29 November 2011; 18:30 to 21:30. ] <p>For all of those interested in Apache Mahout and scalable machine learning, Lucid Imagination is hosting a Mahout Users Meeting at its new office in Redwood City on Nov. 29th. Doors open at 6:30 pm. The night will feature two speakers, Ted Dunning of <a href="http://www.mapr.com">MapR Technologies</a> and Grant Ingersoll of <a href="http://www.lucidimagination.com">Lucid Imagination</a>, along with a social gathering with food and drinks.</p>
<p>For more details and &#8230;</p>]]></description>
			<content:encoded><![CDATA[[ Tuesday, 29 November 2011; 18:30 to 21:30. ] <p>For all of those interested in Apache Mahout and scalable machine learning, Lucid Imagination is hosting a Mahout Users Meeting at its new office in Redwood City on Nov. 29th. Doors open at 6:30 pm. The night will feature two speakers, Ted Dunning of <a href="http://www.mapr.com">MapR Technologies</a> and Grant Ingersoll of <a href="http://www.lucidimagination.com">Lucid Imagination</a>, along with a social gathering with food and drinks.</p>
<p>For more details and to RSVP, please see <a href="http://sf-mahout-11-11.eventbrite.com/">http://sf-mahout-11-11.eventbrite.com/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/11/05/sf-bay-area-apache-mahout-user-meeting-on-nov-29/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>From Barcelona to Vancouver with Lucene and Solr</title>
		<link>http://www.lucidimagination.com/blog/2011/10/22/barcelona-vancouver/</link>
		<comments>http://www.lucidimagination.com/blog/2011/10/22/barcelona-vancouver/#comments</comments>
		<pubDate>Sat, 22 Oct 2011 10:14:36 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4364</guid>
		<description><![CDATA[<p>With another <a href="http://lucene-eurocon.com/">Lucene Eurocon</a> successfully behind us (thanks Barcelona, you&#8217;ve been awesome!), it&#8217;s time to say hello to Vancouver for <a href="http://na11.apachecon.com/">ApacheCon</a>.  I&#8217;ll leave it to others to fill in the blanks on the Barcelona conference other than to say that I am continually amazed by the vibrancy of the Lucene/Solr community and especially grateful to all the committers and contributors who take the time to show up and give talks about how they leverage &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>With another <a href="http://lucene-eurocon.com/">Lucene Eurocon</a> successfully behind us (thanks Barcelona, you&#8217;ve been awesome!), it&#8217;s time to say hello to Vancouver for <a href="http://na11.apachecon.com/">ApacheCon</a>.  I&#8217;ll leave it to others to fill in the blanks on the Barcelona conference other than to say that I am continually amazed by the vibrancy of the Lucene/Solr community and especially grateful to all the committers and contributors who take the time to show up and give talks about how they leverage the world&#8217;s premier open source search engine.</p>
<p>For me personally, I&#8217;m on to Vancouver and ApacheCon for two primary things, besides of course the community bits that go with every ApacheCon:</p>
<ol>
<li>Providing ApacheCon&#8217;s first-ever <a href="http://na11.apachecon.com/talks/18395">Apache Mahout training on Monday, November 7th</a>.  It&#8217;s still not too late to sign up!</li>
<li>Giving a talk on alternative uses of Lucene/Solr other than traditional free text search (things like recommendation engines, classification, etc.)</li>
</ol>
<p>For the 2nd item, I&#8217;m also interested in hearing from you, the user, about interesting things you&#8217;ve done with Lucene/Solr that fall outside the norm of free text search.  If you care to share, please leave a comment on this post.</p>
<p>I&#8217;d be remiss if I didn&#8217;t also plug several other Lucid Imagination employees who are speaking at ApacheCon as well:</p>
<ol>
<li><a href="http://na11.apachecon.com/talks/19453">Solr Flair</a> by Erik Hatcher.  Erik will also be doing a <a href="http://na11.apachecon.com/talks/19454">2 day Solr training class</a>.  Registration is still open for this class as well.</li>
<li><a href="http://na11.apachecon.com/talks/19346">Apache Solr: Out of the Box</a> by Chris Hostetter</li>
</ol>
<p>Lucid Imagination is also sponsoring the Lucene/Solr <a href="https://wiki.apache.org/lucene-java/ApacheCon2011NaMeetup">meetup</a> on Wed. November 9th, so if you are in town, please feel free to drop by for a drink and a chat.</p>
<p>With that, I&#8217;ll simply say, I hope to see you in Vancouver in a few weeks!</p>
<p>-Grant</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/10/22/barcelona-vancouver/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mahout in Action Review</title>
		<link>http://www.lucidimagination.com/blog/2011/10/15/mahout-in-action-review/</link>
		<comments>http://www.lucidimagination.com/blog/2011/10/15/mahout-in-action-review/#comments</comments>
		<pubDate>Sat, 15 Oct 2011 13:13:18 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Grant Ingersoll]]></category>
		<category><![CDATA[Mahout in Action]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4317</guid>
		<description><![CDATA[<p>You know your (technical) <a href="https://cwiki.apache.org/confluence/display/MAHOUT/MahoutName">baby</a> is (almost) grown up when the book on the project finally comes out.  Such is the case for Apache Mahout, thanks to <a href="http://www.manning.com">Manning Publications</a> shipping <a href="http://affiliate.manning.com/idevaffiliate.php?id=1141_219">Mahout in Action</a> this week.</p>
<p><img src="http://manning.com/owen/owen_cover150.jpg" alt="" width="150" height="187" class="alignright" float="right" />So, before I start into my review, let me first say congratulations to Sean, Robin, Ted, Ellen and Manning for producing such an excellent product.   The simplest praise I can give it is to put it on the same &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>You know your (technical) <a href="https://cwiki.apache.org/confluence/display/MAHOUT/MahoutName">baby</a> is (almost) grown up when the book on the project finally comes out.  Such is the case for Apache Mahout, thanks to <a href="http://www.manning.com">Manning Publications</a> shipping <a href="http://affiliate.manning.com/idevaffiliate.php?id=1141_219">Mahout in Action</a> this week.</p>
<p><img src="http://manning.com/owen/owen_cover150.jpg" alt="" width="150" height="187" class="alignright" float="right" />So, before I start into my review, let me first say congratulations to Sean, Robin, Ted, Ellen and Manning for producing such an excellent product.   The simplest praise I can give it is to put it on the same level as one of the best intro to technology books I know:  <a href="http://www.manning.com/affiliate/idevaffiliate.php?id=1071_147">Lucene In Action</a>.  In other words, it sets the standard by which all other Mahout books will be judged.<br />
<br />
As for the actual book, it is broken down into 3 sections, which I like to call the &#8220;three C&#8217;s&#8221;:</p>
<ol>
<li>Collaborative Filtering</li>
<li>Clustering</li>
<li>Classification</li>
</ol>
<p>So, without further ado, let&#8217;s take a deeper look at the book in this context of the three C&#8217;s.</p>
<h2>Collaborative Filtering</h2>
<p>Collaborative Filtering is by far one of the most popular parts of Mahout, being used in places like <a href="https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout">Amazon and Foursquare</a>.  This section of the book spends 5 chapters walking you through both the concepts and the practical aspects of collaborative filtering.  Chapter 2 starts by getting you up and running using the <a href="http://www.grouplens.org/">GroupLens</a> dataset for movie recommendations.  For those unfamiliar with collaborative filtering, this makes for a nice entrance into the subject with data everyone can relate to easily.  Chapter 3 then discusses how best to model your data, while chapter 4 looks at the mechanics of actually generating recommendations from the data.</p>
<p>Chapters 5 and 6 then discuss the ins and outs of taking a recommendation engine into production, including details on how to scale it out using Apache Hadoop.  I found the explanation of the Hadoop-based co-occurrence process (via RecommenderJob) especially useful, as I recently committed <a href="https://issues.apache.org/jira/browse/MAHOUT-798">MAHOUT-798</a>, which uses it to build an example recommendation system based on user interaction with email.  In fact, I relied heavily on all of the concepts in this part of the book, as I first had to extract and clean the data, then properly model it before finally running the recommendation task on EC2.</p>
<p>When I first got access to the MEAP for this book (quite some time ago), I did not have a lot of background in collaborative filtering, and these chapters really helped fill in the practical details for me as well as provide a good foundation for the theoretical aspects behind it.  I think they will serve others well who are looking to get started with collaborative filtering, too.</p>
<h2>Clustering</h2>
<p>Similar to collaborative filtering, the clustering section starts off by introducing the basic concepts and then quickly gets you up and running with an example clustering run.  Chapter 8 then gets into how best to do feature selection for clustering.  Feature selection is often one of the keys to successful clustering, so make sure you have a good grasp of the contents of the chapter before moving ahead into chapter 9, which gets into some of Mahout&#8217;s clustering algorithms.  That chapter primarily focuses on K-Means and Dirichlet, but also covers a few others.  Note, Mahout actually has a few more clustering algorithms than the ones described, like spectral, canopy, meanshift and minhash.  Of course, some of these were added later in the book cycle, so it is hard to complain that they weren&#8217;t incorporated.  Chapter 10 then covers what is, in my experience, one of the harder aspects of clustering, namely how to evaluate the results.  This chapter is a little bit thin, but it seems the overall field is the same, so this is not a put-down of the chapter!  There simply aren&#8217;t a lot of great tools available for evaluating clustering.</p>
<p>Chapter 11 then adds some meat onto the bones of taking clustering to production, including information on leveraging clustering in a Hadoop cluster.  Chapter 12 adds some nice concreteness to the sections by looking at clustering of real data sets from <a href="http://www.twitter.com">Twitter</a>, <a href="http://last.fm">Last.fm</a> and <a href="http://www.stackoverflow.com">Stack Overflow</a>.  For those looking to kick the tires with some real data, be sure to check out that chapter.</p>
<h2>Classification</h2>
<p>Classification is very popular these days both in search and beyond, so it is great to see this set of chapters covering the topic so well in practical, accessible terms.  As you would expect, the first chapter (13) gets you up and running as well as introduces the concepts of classification.  This chapter has a great explanation of how classification works and a typical workflow for building a classifier.</p>
<p>Chapter 14 then delves into the details of actually training a classifier using Mahout&#8217;s Stochastic Gradient Descent algorithm as well as its Bayesian classifier.</p>
<p><img src="http://1.bp.blogspot.com/_t0NJvKaO1dI/SjXBGm0DCpI/AAAAAAAAD1M/ISwdVEi7dt4/s400/potatosalad.jpg" alt="" width="243" height="320" class="alignright" float="right" />The next chapter then takes a look at how best to evaluate a classifier as well as some insight into what happens when a classifier goes bad.  Be sure to check this out, as you will no doubt run into many of the issues covered.  As an aside, I couldn&#8217;t help thinking of the classic &#8220;Far Side&#8221; cartoon to the right upon reading that section heading.  The penultimate classification chapter digs into the practical aspects of deploying a classifier in production, including details on working through your scale and speed requirements.  It finishes off with an example Apache Thrift based server which some may find a useful starting point for their applications.  Finally, Mahout in Action finishes off with a Case Study of how <a href="http://www.shopittome.com">Shop It To Me</a> uses a Mahout classifier to provide recommendations of offers to customers.  As with any technical book, it is great to have some concrete discussion of how this stuff functions in the wild.</p>
<h2>What&#8217;s Missing (i.e. When&#8217;s the 2nd edition coming out?)</h2>
<p>Mahout has a number of other interesting things that are in various stages of development, like frequent patternset mining, Singular Value Decomposition (feature reduction), evolutionary programming, integration libraries for input/output as well as tools for storing data in Cassandra and Mongo.  Since Mahout is developing pretty quickly, their absence from the book is no fault of the authors; I&#8217;m just putting it up here so that people are aware that Mahout has more to offer, even if the three &#8220;C&#8217;s&#8221; are the most popular.</p>
<p>All in all, Mahout in Action is an excellent introduction to the project.  Naturally I&#8217;m biased, but, pun intended, I highly recommend the book!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/10/15/mahout-in-action-review/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Apache Lucene Ecosystem: My View of 2010</title>
		<link>http://www.lucidimagination.com/blog/2010/12/27/the-apache-lucene-ecosystem-my-view-of-2010/</link>
		<comments>http://www.lucidimagination.com/blog/2010/12/27/the-apache-lucene-ecosystem-my-view-of-2010/#comments</comments>
		<pubDate>Mon, 27 Dec 2010 15:54:11 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Droids]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Connector Framework]]></category>
		<category><![CDATA[LucidWorks]]></category>
		<category><![CDATA[Lucy]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[ManifoldCF]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[PyLucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>
		<category><![CDATA[ZooKeeper]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=2809</guid>
		<description><![CDATA[<p>After a week off to enjoy time with my family, I thought I would kick off the last week of 2010 with a look back at the year as it relates to the Apache Lucene ecosystem.  For anyone who follows the amalgamation of projects that I like to call the Lucene Ecosystem (the Apache projects: Lucene, Solr, Nutch, Mahout, Tika, PyLucene, Lucy, Lucene.NET, Droids, ManifoldCF &#8212; Lucene Connector Framework, OpenNLP and UIMA) you know it &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>After a week off to enjoy time with my family, I thought I would kick off the last week of 2010 with a look back at the year as it relates to the Apache Lucene ecosystem.  For anyone who follows the amalgamation of projects that I like to call the Lucene Ecosystem (the Apache projects: Lucene, Solr, Nutch, Mahout, Tika, PyLucene, Lucy, Lucene.NET, Droids, ManifoldCF &#8212; Lucene Connector Framework, OpenNLP and UIMA) you know it has been an amazingly busy and fruitful year.  Instead of going through each project like <a href="http://www.lucidimagination.com/blog/2009/12/24/the-apache-lucene-ecosystem-my-view-of-2009/">last year&#8217;s review</a>, I&#8217;m just going to be a bit less formal and hit on the highlights as I see them.</p>
<p>Before I dig in too much, though, a special thanks to all our customers at Lucid Imagination as well as to my coworkers.  I&#8217;m coming up on 15 years out in the &#8220;real world&#8221; and I can honestly say I&#8217;ve never enjoyed what I do as much as I do here and that even accounts for the normal rough patches one goes through in any job.  As an engineer, there are few things as cool as getting to work with customers who are not only using, but pushing your work/project/product on a daily basis to do new and interesting things (I think this is a direct result of the project being Open Source, which I believe has an inherently <a href="http://www.lucidimagination.com/blog/2009/04/20/lucene-open-source-and-the-cost-of-experimentation/">lower cost of experimentation</a>).  I&#8217;ve been fortunate enough to meet and talk with many people doing all kinds of things with Lucene and Solr ranging from the &#8220;mundane&#8221; of basic keyword search to those building next generation search capabilities at incredible scale.  Through it all, I&#8217;m constantly amazed at the flexibility and efficiency of Lucene and Solr.  For instance, I&#8217;ve been working with one customer now whose Solr-based solution (for the exact same content) will use ~50% less hardware and will have an index that is 1/6 the size of their FAST index all while saving them major dinero.</p>
<p>Speaking of Lucid, one of the highlights of the year for us that relates directly to Lucene and Solr is the launch of our enterprise version: <a href="http://www.lucidimagination.com/lwe/download">LucidWorks Enterprise</a>.   I like to think of it as Apache Solr with a whole lot of Lucid expertise on how to use Solr baked in and topped off with other features and functionality to make building search applications easier.</p>
<p>OK, time to move on to the open source projects&#8230;</p>
<ol>
<li>Without a doubt, the biggest news of the year is the merging of the Lucene and Solr code base as well as the &#8220;graduation&#8221; of several subprojects to Apache Software Foundation Top Level Projects (TLP).  The graduating projects are <a href="http://tika.apache.org">Tika</a>, <a href="http://nutch.apache.org">Nutch</a>, and <a href="http://mahout.apache.org">Mahout</a>.  We also spun Lucy (a C port) to the Incubator, where it is working on its own community.  These moves were primarily done to focus the project management on a single code base, but they also demonstrate the project has reached a level of maturity at the ASF.  The move also has the side benefit of bringing each project higher visibility.</li>
<li>I&#8217;m particularly excited about the addition of <a href="http://www.lucidimagination.com/blog/2010/12/02/opennlp-moving-to-apache/">OpenNLP to the Apache</a> umbrella.  OpenNLP is a nice open source Java project for natural language processing that has lived at SourceForge for quite some time.  I would expect development to grow quite a bit under the ASF community based model.  Also, integrating OpenNLP with Solr and Lucene is pretty easy to do.  I would be remiss if I didn&#8217;t also give a nod to the addition of the <a href="http://incubator.apache.org/connectors">ManifoldCF</a> project to the ASF.  ManifoldCF will help unlock content in SharePoint, Documentum and other repositories for users of Lucene and Solr.</li>
<li>Lucene&#8217;s trunk code base now implements our &#8220;Flex APIs&#8221;, which should allow users to have near total control over what goes in the index as well as alternate compression techniques, different scoring models, etc.  See Michael McCandless&#8217; excellent <a href="http://www.lucidimagination.com/files/file/LuceneRev_McCandless_FunWithFlex.pdf">talk at Lucene Revolution</a> for more details.</li>
<li>With all the location aware devices and capabilities on the market, geo-spatial search is a hot topic and Lucene and Solr have been adding quite a few capabilities in this regard with the ability to filter, boost and sort results based on location information in documents.  See Solr&#8217;s <a href="http://wiki.apache.org/solr/SpatialSearch">Spatial Search Wiki page</a> for more info as well as several of my <a href="http://www.lucidimagination.com/search/?q=spatial#/s:lucid/li:blogs">past blog posts</a>.</li>
<li>Of course, everyone was abuzz about the cloud this year.  For Solr, this translates into greater efforts to make Solr easier to scale to very large installations (100s to 1000s of nodes and billions and billions of documents) via the <a href="http://wiki.apache.org/solr/SolrCloud">Solr Cloud project that Yonik Seeley and Mark Miller have been spearheading</a>.</li>
<li>On the user side, one of the biggest pieces of buzz this year related to Lucene was the migration of Twitter search to Lucene.  At 1 billion queries per day and 50 million posts per day (all indexed and searchable in near real time), Twitter&#8217;s search system certainly has its work cut out for it.  However, as Michael Busch <a href="http://www.lucidimagination.com/events/revolution2010/videos/mbusch">outlined at Lucene Revolution</a>, Apache Lucene was up to the task!  Naturally, there were lots of other companies that migrated to Solr and Lucene as well.  Have you <a href="http://www.lucidimagination.com/enterprise-search-solutions/case-studies">shared your use case</a>?</li>
</ol>
<p>Well, I&#8217;ve no doubt missed a bunch of other things, but those items, to me, are some of the bigger highlights.  Looking forward, there are some other exciting things coming to Lucene and Solr.  In particular, I&#8217;m working on adding language identification, related searches and point in polygon filtering to Solr.  I would also expect we will release Lucene/Solr 3.1 fairly soon, too, but you can&#8217;t pin me down on a date just yet.</p>
<p>Here&#8217;s hoping you all have a Happy Holidays and a Happy New Year!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/12/27/the-apache-lucene-ecosystem-my-view-of-2010/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>RTP Semantic Web Slides are available</title>
		<link>http://www.lucidimagination.com/blog/2010/07/09/rtp-semantic-web-slides-are-available/</link>
		<comments>http://www.lucidimagination.com/blog/2010/07/09/rtp-semantic-web-slides-are-available/#comments</comments>
		<pubDate>Fri, 09 Jul 2010 12:43:34 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=2200</guid>
		<description><![CDATA[<p>Here are my slides from the talk I gave last night at the RTP Semantic Web Group:</p>
<div id="__ss_4719463" style="width: 425px;"><strong style="display: block; margin: 12px 0 4px;"><a title="Intro to Apache Mahout" href="http://www.slideshare.net/gsingers/intro-to-apache-mahout">Intro to Apache Mahout</a></strong>
<div style="padding: 5px 0 12px;">View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/gsingers">gsingers</a>.</div>
&#8230;</div>]]></description>
			<content:encoded><![CDATA[<p>Here are my slides from the talk I gave last night at the RTP Semantic Web Group:</p>
<div id="__ss_4719463" style="width: 425px;"><strong style="display: block; margin: 12px 0 4px;"><a title="Intro to Apache Mahout" href="http://www.slideshare.net/gsingers/intro-to-apache-mahout">Intro to Apache Mahout</a></strong><object id="__sse4719463" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="425" height="355" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="allowScriptAccess" value="always" /><param name="src" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=intro-mahout-100709062244-phpapp01&amp;stripped_title=intro-to-apache-mahout" /><param name="name" value="__sse4719463" /><param name="allowfullscreen" value="true" /><embed id="__sse4719463" type="application/x-shockwave-flash" width="425" height="355" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=intro-mahout-100709062244-phpapp01&amp;stripped_title=intro-to-apache-mahout" name="__sse4719463" allowscriptaccess="always" allowfullscreen="true"></embed></object>
<div style="padding: 5px 0 12px;">View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/gsingers">gsingers</a>.</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/07/09/rtp-semantic-web-slides-are-available/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Berlin Buzzwords Recap</title>
		<link>http://www.lucidimagination.com/blog/2010/06/11/berlin-buzzwords-recap/</link>
		<comments>http://www.lucidimagination.com/blog/2010/06/11/berlin-buzzwords-recap/#comments</comments>
		<pubDate>Fri, 11 Jun 2010 13:53:58 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=2157</guid>
		<description><![CDATA[<p>Back from <a href="http://www.berlinbuzzwords.de">Berlin Buzzwords</a> and finally over the jet lag, so I thought I would put up some feedback.  First off, it was a well-organized conference with a nice focus on searching, storage and scaling.  Kudos to Isabel, Simon and Jan for all their hard work.  It also had great wi-fi coverage, which is always a struggle at every conference I&#8217;ve ever been to.</p>
<p>As for the talks, I gave the Keynote on using &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Back from <a href="http://www.berlinbuzzwords.de">Berlin Buzzwords</a> and finally over the jet lag, so I thought I would put up some feedback.  First off, it was a well organized conference with a nice focus on searching, storage and scaling.  Kudos to Isabel, Simon and Jan for all their hard work.  It also had great wi-fi coverage, which is always a struggle at every conference I&#8217;ve ever been too.</p>
<p>As for the talks, I gave the Keynote on using open source tools like Apache Solr and Mahout to deliver intelligent applications (<a href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/ingersoll_bbuzz2010.pdf">slides</a> &#8212; really should be a PPT so you can see the animations) on Monday first thing in the morning and I felt it went pretty well, but I&#8217;ll let others be the judge (videos should be online soon).  The rest of the day, I spent going in and out of the various tracks.  The Lucene track was very well done, with good talks by: Uwe Schindler and Simon Willnauer on the State of Lucene, Robert Muir on Finite State queries in Lucene; Michael Busch on Real Time Search at Twitter, Jukka Zitting on Tika and Andrzej Bialecki on Nutch. See <a href="http://berlinbuzzwords.wikidot.com/links-to-slides">Berlinbuzzwords: Links To Slides</a> for all the slides (not all are available just yet).</p>
<p>I also went to a variety of the Hadoop and NoSQL talks.  Lots of people in the NoSQL talks made pitches on why their approach is best, which is very helpful in determining which tool to use when.  I still, however, can&#8217;t shake the feeling that one could take the new <a href="http://wiki.apache.org/solr/SolrCloud">Solr Cloud stuff</a>, a dead simple schema (id and one or two simple fields), and have a large scale distributed key-value store that overcomes almost all of the limitations of many of the NoSQL technologies (lack of ad-hoc queries, range queries, search within the values, extensibility) with minimal indexing overhead (which can be greatly reduced by using either literals or very simple analysis).  Not only that, Lucene/Solr already is &#8220;document-centric&#8221; and I&#8217;ve seen it scale to billions of documents with high availability and high QPS.  That was using &#8220;real&#8221; documents (i.e. articles, etc.), not simple key-value pairs, so I can&#8217;t help but feel that simple key-value pairs would be even faster and more scalable.  In other words, Lucene isn&#8217;t just for text search.  Naturally, this is just a thought at this point; I haven&#8217;t tried testing it yet.  Also, once the new real time stuff is in Lucene, I think it will be even faster.</p>
<p>At any rate, the best thing about the conference was that it showed the eagerness for new solutions to large scale problems that cost less money than the sturdy old database.</p>
<p>Again, congrats to Isabel and team for a well executed conference in a great city and at a great venue.  If you are interested in more on the Lucene portion of the conference, make sure you come visit us in Boston for <a href="http://www.lucenerevolution.com">Lucene Revolution</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/06/11/berlin-buzzwords-recap/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Apache Lucene EuroCon Agenda &#8211; The Revolution is On!</title>
		<link>http://www.lucidimagination.com/blog/2010/04/22/apache-lucene-eurocon-agenda-the-revolution-is-on/</link>
		<comments>http://www.lucidimagination.com/blog/2010/04/22/apache-lucene-eurocon-agenda-the-revolution-is-on/#comments</comments>
		<pubDate>Thu, 22 Apr 2010 11:09:33 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Connector Framework]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[Open Relevance Project]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>
		<category><![CDATA[ZooKeeper]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1965</guid>
		<description><![CDATA[<p>After reviewing a lot of great talk proposals, we&#8217;ve announced the agenda for Apache Lucene EuroCon: <a href="http://lucene-eurocon.org/agenda.html">Apache Lucene EuroCon &#8211;  Europe&#8217;s Premier Lucene and Solr Search User Conference</a>.</p>
<p>One of the things I really like about this agenda is that it is a great mix of basics, use cases from all over the search map (CMS, news, social media, advertising), business decisions (see last list and next list) and advanced topics (NLP, collab filtering, machine &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>After reviewing a lot of great talk proposals, we&#8217;ve announced the agenda for Apache Lucene EuroCon: <a href="http://lucene-eurocon.org/agenda.html">Apache Lucene EuroCon &#8211;  Europe&#8217;s Premier Lucene and Solr Search User Conference</a>.</p>
<p>One of the things I really like about this agenda is that it is a great mix of basics, use cases from all over the search map (CMS, news, social media, advertising), business decisions (see last list and next list) and advanced topics (NLP, collab filtering, machine learning, advanced visualization, multilingual).  Moreover, the content, even though it is centered on Lucene, goes well beyond just being about Lucene and is really about search, in all of its power and glory.  It&#8217;s about real users with real needs getting real problems solved using the Lucene ecosystem.  Oh, and by the way, those users are doing it at scale!  Big scale.</p>
<p>That&#8217;s powerful stuff, because, in case you hadn&#8217;t noticed (shh, it&#8217;s our little secret), there is a revolution going on in search.  (Funny how that line coincides with Lucid&#8217;s frontman, Eric Gries, giving a talk titled &#8220;The Search Revolution&#8221;.)</p>
<p>Are you a part of the revolution?  See you in <a href="http://lucene-eurocon.org/index.html">Prague</a> in mid-May.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/04/22/apache-lucene-eurocon-agenda-the-revolution-is-on/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>News Flash: Apache Lucene gives birth to triplets!</title>
		<link>http://www.lucidimagination.com/blog/2010/04/21/news-flash-apache-lucene-gives-birth-to-triplets/</link>
		<comments>http://www.lucidimagination.com/blog/2010/04/21/news-flash-apache-lucene-gives-birth-to-triplets/#comments</comments>
		<pubDate>Wed, 21 Apr 2010 20:25:10 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1961</guid>
		<description><![CDATA[<p><a href="http://lucene.apache.org">Apache Lucene</a> (the Lucene top level project, not Lucene the Java search API.  I know, it&#8217;s confusing sometimes) has once again proved to be a fertile area for innovation.  Having already given birth to <a href="http://hadoop.apache.org">Apache Hadoop</a> a few years back, it has now given birth to three new <a href="http://www.lucidimagination.com/search/document/d833ce805528045b/tlp_status">Apache Top Level Projects</a> (just approved by the Board at Apache): <a href="http://lucene.apache.org/mahout">Apache Mahout</a>, <a href="http://lucene.apache.org/nutch">Apache Nutch</a> and <a href="http://lucene.apache.org/tika">Apache Tika</a> (never mind the URLs, &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p><a href="http://lucene.apache.org">Apache Lucene</a> (the Lucene top level project, not Lucene the Java search API.  I know, it&#8217;s confusing sometimes) has once again proved to be a fertile area for innovation.  Having already given birth to <a href="http://hadoop.apache.org">Apache Hadoop</a> a few years back, it has now given birth to three new <a href="http://www.lucidimagination.com/search/document/d833ce805528045b/tlp_status">Apache Top Level Projects</a> (just approved by the Board at Apache): <a href="http://lucene.apache.org/mahout">Apache Mahout</a>, <a href="http://lucene.apache.org/nutch">Apache Nutch</a> and <a href="http://lucene.apache.org/tika">Apache Tika</a> (never mind the URLs, they will be changing soon).  While none of these projects look alike, they all have a strong foundation built in the Lucene community.  Combine this with the <a href="http://www.lucidimagination.com/blog/2010/03/26/lucene-and-solr-development-have-merged/">recent merge</a> of Lucene and Solr development lists (more on this later) and Lucene has been busy.  And that doesn&#8217;t even count all the really cool stuff baking in the source tree right now (spatial, flexible indexing/scoring, some new analyzers and a variety of other cool things; see Lucene&#8217;s <a href="https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt">CHANGES</a> and Solr&#8217;s <a href="https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/CHANGES.txt">CHANGES</a>).</p>
<p>In the end, though, what does all this mean for the users of the Lucene ecosystem?  On one hand, some of the moves are just a shuffling around of domain names, mailing lists and SVN source trees.  On the other hand, the moves are symbolic and represent a project reaching a level of maturity and self determination, not to mention critical mass and brand awareness.  Thus, in my mind, all of these moves are good things for Lucene as well as the associated projects that are spinning out.  As far as the actual code goes, I think users will still see the same high quality contributions and products coming out of Apache, as well as much more focus within each Project Management Committee (PMC) on its specific project.  (Aside: it will still be business as usual at <a href="http://www.lucidimagination.com">Lucid Imagination</a> with regard to these moves.)</p>
<p>Which brings me to a bit more on my view of the merge of Lucene and Solr.  I think we are already seeing the fruits of the merge for both Lucene and Solr (I know my open source life is easier already).  For instance, much of the analyzer code from Solr and Lucene is going to be combined to provide a single coherent analyzer library.  This is great news for people who have been using Lucene and pulling in Solr analyzers.  It is also good for Solr users, because many more people are now keeping an eye on Solr&#8217;s analyzers and new Lucene analyzers will show up sooner (things like the WordDelimiterFilter, etc.).  Another example is the spatial work that we&#8217;ve been putting a lot of effort into (see <a href="https://issues.apache.org/jira/browse/SOLR-773">SOLR-773</a>, <a href="https://issues.apache.org/jira/browse/SOLR-1568">SOLR-1568</a> and <a href="https://issues.apache.org/jira/browse/LUCENE-2350">LUCENE-2350</a>).  With the combination of the two development projects, it is now much easier for us to make sure there is a single, integrated way of delivering spatial search across both the Java API and the Solr REST-like API.</p>
<p>Moreover, in the short run, existing Lucene and Solr users should notice no difference in terms of the products, user communities and the like.  In the long run, it should make for less repeated code, faster integration, more test coverage and a larger, cohesive development team.  It should also make more of Solr&#8217;s capabilities available in pure library form and bring many of Lucene&#8217;s cutting edge capabilities (flexible indexing and scoring, etc.) to Solr sooner.</p>
<p>Wrapping up, congrats to Lucene and all of the new top level projects!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/04/21/news-flash-apache-lucene-gives-birth-to-triplets/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Apache Mahout 0.3 Released</title>
		<link>http://www.lucidimagination.com/blog/2010/03/18/apache-mahout-0-3-released/</link>
		<comments>http://www.lucidimagination.com/blog/2010/03/18/apache-mahout-0-3-released/#comments</comments>
		<pubDate>Thu, 18 Mar 2010 12:10:12 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1866</guid>
		<description><![CDATA[<p>Here&#8217;s the announcement:</p>
<blockquote><p>Apache Mahout &#60;http://lucene.apache.org/mahout&#62; 0.3 has been released and is<br />
now available for public<br />
download at http://www.apache.org/dyn/closer.cgi/lucene/mahout</p>
<p>Up-to-date maven artifacts can be found in the Apache repository at</p>
<p>https://repository.apache.org/content/repositories/releases/org/apache/mahout/</p>
<p>Apache Mahout is a subproject of Apache Lucene with the goal of<br />
delivering scalable machine learning algorithm implementations under<br />
the Apache license. http://www.apache.org/licenses/LICENSE-2.0</p>
<p>Mahout is a machine learning library meant to scale: Scale in terms of<br />
community to support anyone interested in using machine </p>&#8230;</blockquote>]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s the announcement:</p>
<blockquote><p>Apache Mahout &lt;http://lucene.apache.org/mahout&gt; 0.3 has been released and is<br />
now available for public<br />
download at http://www.apache.org/dyn/closer.cgi/lucene/mahout</p>
<p>Up-to-date maven artifacts can be found in the Apache repository at</p>
<p>https://repository.apache.org/content/repositories/releases/org/apache/mahout/</p>
<p>Apache Mahout is a subproject of Apache Lucene with the goal of<br />
delivering scalable machine learning algorithm implementations under<br />
the Apache license. http://www.apache.org/licenses/LICENSE-2.0</p>
<p>Mahout is a machine learning library meant to scale: Scale in terms of<br />
community to support anyone interested in using machine learning.<br />
Scale in terms of business by providing the library under a<br />
commercially friendly, free software license. Scale in terms of<br />
computation to the size of data we manage today.</p>
<p>Built on top of the powerful map/reduce paradigm of the Apache Hadoop<br />
project, Mahout lets you solve popular machine learning problem<br />
settings like clustering, collaborative filtering and classification<br />
over Terabytes of data over thousands of computers.</p>
<p>Implemented with scalability in mind the latest release brings many<br />
performance optimizations so that even in a single node setup the<br />
library performs well.</p>
<p>The complete changelist can be found here:</p>
<p>http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12314281</p>
<p>New Mahout 0.3 features include:</p>
<p>* New math and collections modules based on the high performance Colt<br />
library.<br />
* Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning<br />
* Parallel Dirichlet process clustering (a model-based clustering<br />
algorithm)<br />
* Parallel co-occurrence based recommender<br />
* Parallel text document to vector conversion using LLR based ngram<br />
generation<br />
* Parallel Lanczos SVD (Singular Value Decomposition) solver<br />
* Shell scripts for easier running of algorithms, utilities and examples<br />
* &#8230;and much much more: code cleanup, many bug fixes and<br />
performance improvements</p>
<p>Getting started: New to Mahout?</p>
<p>* Download Mahout at http://www.apache.org/dyn/closer.cgi/lucene/mahout<br />
* Check out the Quick start: http://cwiki.apache.org/MAHOUT<br />
* Read the Mahout Wiki: http://cwiki.apache.org/MAHOUT<br />
* Join the community by subscribing to mahout-user@lucene.apache.org<br />
* Give back: http://www.apache.org/foundation/getinvolved.html<br />
* Consider adding yourself to the powered by Wiki<br />
page: http://cwiki.apache.org/MAHOUT/poweredby.html</p>
<p>For more information on Apache Mahout, see http://lucene.apache.org/mahout</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/03/18/apache-mahout-0-3-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

