<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; nutch</title>
	<atom:link href="http://www.lucidimagination.com/blog/category/nutch/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Happy Anniversary, Lucene!  10 years at the ASF</title>
		<link>http://www.lucidimagination.com/blog/2011/09/18/happy-anniversary-lucene-10-years-at-the-asf-3/</link>
		<comments>http://www.lucidimagination.com/blog/2011/09/18/happy-anniversary-lucene-10-years-at-the-asf-3/#comments</comments>
		<pubDate>Sun, 18 Sep 2011 18:05:38 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4050</guid>
		<description><![CDATA[<p>From a quiet start as a pet project to a giant in the industry, <a href="http://lucene.apache.org">Apache Lucene</a> is definitely the little (search) engine that could.  On September 18th, 2001 (at 16:29:48 UTC) Jason Van Zyl made the first <a href="http://svn.apache.org/viewvc?view=revision&#38;revision=149570">official import</a> of Doug Cutting&#8217;s Lucene project (which started in 1997 and was hosted on SourceForge) into <a href="http://www.apache.org">Apache&#8217;s</a> Jakarta project (check out the <a href="http://web.archive.org/web/20011202174653/http://jakarta.apache.org/">Wayback machine</a>).</p>
<p>And while I wasn&#8217;t around in the beginning, I thought I would &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>From a quiet start as a pet project to a giant in the industry, <a href="http://lucene.apache.org">Apache Lucene</a> is definitely the little (search) engine that could.  On September 18th, 2001 (at 16:29:48 UTC) Jason Van Zyl made the first <a href="http://svn.apache.org/viewvc?view=revision&amp;revision=149570">official import</a> of Doug Cutting&#8217;s Lucene project (which started in 1997 and was hosted on SourceForge) into <a href="http://www.apache.org">Apache&#8217;s</a> Jakarta project (check out the <a href="http://web.archive.org/web/20011202174653/http://jakarta.apache.org/">Wayback machine</a>).</p>
<p>And while I wasn&#8217;t around in the beginning, I thought I would offer up some (little) known tidbits, links, etc. about Lucene as an ode to the search library that has significantly changed the search world, as well as my own career:</p>
<ol>
<li>Lucene was <a href="http://www.lucidimagination.com/devzone/videos-podcasts/podcasts/interview-doug-cutting">Doug&#8217;s way of learning Java</a>!  How&#8217;s that for a start?  It took him 3 months, working 2 days a week to crank out the first version.</li>
<li>At the time, some commercial search engines could not do incremental updates of the index, meaning you had to re-index all your documents anytime you had an update.  Lucene has always had an incremental model, all the way through to today&#8217;s Near Real Time features that power the likes of <a href="http://www.twitter.com">Twitter</a> at 1 billion+ searches and 100M+ new documents per day.</li>
<li><a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/document/Field.java?view=markup&amp;pathrev=149570">Field myField = Field.Text(&#8220;foo&#8221;, &#8220;bar&#8221;)</a>; anyone?  Or how about Field myField = Field.UnIndexed(&#8220;foo&#8221;, &#8220;bar&#8221;);</li>
<li>Back then, Lucene had its own <a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/PorterStemmer.java?view=markup&amp;pathrev=149570">PorterStemmer</a>; now we just use <a href="http://snowball.tartarus.org">Snowball</a>.</li>
<li>Only 1 of the <a href="http://web.archive.org/web/20020213045032/http://jakarta.apache.org/lucene/docs/whoweare.html">original committers</a> still remains somewhat active.</li>
<li>Read the old <a href="http://web.archive.org/web/20020203084504/http://www.lucene.com/cgi-bin/faq/faqmanager.cgi">FAQ</a>!  True as it ever was.  (Mostly)</li>
<li>Lucene 2.3 drastically improved indexing performance thanks to a thorough overhaul of the innards while barely affecting the API.  4.0 will blow the doors off of previous versions in terms of speed and efficiency.</li>
<li>Lucene is Doug&#8217;s wife&#8217;s <a href="http://www.lucidimagination.com/devzone/videos-podcasts/podcasts/interview-doug-cutting">middle name</a>.</li>
<li>Lucene has evolved from offering a single vector space scoring model to one that now <a href="http://www.lucidimagination.com/blog/2011/09/12/flexible-ranking-in-lucene-4/">offers plug-n-play</a> ranking (BM25 anyone?)</li>
<li>Lucene is ubiquitous.  It powers search on everything from mobile devices to web scale engines.  I&#8217;ve seen indexes as small as 15% of the original content.  I&#8217;ve also seen indexes grow to several billion documents in size.  Lucene has been used as a caching store, an ORM, a cross language search engine, the guts of the popular <a href="http://lucene.apache.org/solr">Solr</a> search server, the retrieval engine for IBM&#8217;s <a href="http://www-03.ibm.com/innovation/us/watson/index.html">Watson</a> as well as several commercial search engines and pretty much everything in between.</li>
<li>Did you know <a href="http://hadoop.apache.org">Apache Hadoop</a> started as a subproject of Lucene?  Doug Cutting and Mike Cafarella first built out Hadoop in order to scale out indexing for the <a href="http://nutch.apache.org">Apache Nutch</a> project.  From there it was spun out to be a top level ASF project and has gone on to be the de facto choice for large scale distributed processing, much like Lucene is the de facto choice for search!  Lucene has also spun out <a href="http://mahout.apache.org">Mahout</a>, <a href="http://tika.apache.org">Tika</a>, Lucene.NET and Lucy!</li>
</ol>
<p>As for how Lucene has impacted me:  In 2004, I took a job at the <a href="http://www.cnlp.org">Center for Natural Language Processing</a> at Syracuse University working for Dr. Liz Liddy.  My job was to build an Arabic-English cross language search engine.  Within a day or two of starting, <a href="http://www.linkedin.com/profile/view?id=10139209&amp;trk=tyah">Ozgur Yilmazel</a> (my boss at the time) said something to the effect of &#8220;we&#8217;ll be using Lucene for the implementation.  Go learn it.&#8221;  Digging in, I quickly needed a couple of features, the biggest one being Term Vectors, so I updated a patch from an earlier version of Lucene and managed to convince the committers at the time to commit it.  From there, I kept supplying patches.  Eventually, I was asked to be a committer.  Some time after that, Yonik Seeley and Marc Krellenstein approached a bunch of the committers about starting a company and here I am today at the company we (Erik, Yonik, Marc and I) founded back in 2007, <a href="http://www.lucidimagination.com">Lucid Imagination</a>.  I feel fortunate to have the opportunity to work on hard problems in an interesting field and for that, Lucene, in no small part, I thank you.</p>
<p>But enough of my self-indulgence, how has Lucene impacted you?  When did you first start using it?  What&#8217;s your biggest index or fastest QPS?   What ways have you used Lucene beyond that of a search engine?  Leave a comment and let us know.</p>
<p>Happy 10th Anniversary, Lucene!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/09/18/happy-anniversary-lucene-10-years-at-the-asf-3/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>The Apache Lucene Ecosystem: My View of 2010</title>
		<link>http://www.lucidimagination.com/blog/2010/12/27/the-apache-lucene-ecosystem-my-view-of-2010/</link>
		<comments>http://www.lucidimagination.com/blog/2010/12/27/the-apache-lucene-ecosystem-my-view-of-2010/#comments</comments>
		<pubDate>Mon, 27 Dec 2010 15:54:11 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Droids]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Connector Framework]]></category>
		<category><![CDATA[LucidWorks]]></category>
		<category><![CDATA[Lucy]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[ManifoldCF]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[PyLucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>
		<category><![CDATA[ZooKeeper]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=2809</guid>
		<description><![CDATA[<p>After a week off to enjoy time with my family, I thought I would kick off the last week of 2010 with a look back at the year as it relates to the Apache Lucene ecosystem.  For anyone who follows the amalgamation of projects that I like to call the Lucene Ecosystem (the Apache projects: Lucene, Solr, Nutch, Mahout, Tika, PyLucene, Lucy, Lucene.NET, Droids, ManifoldCF &#8212; Lucene Connector Framework, OpenNLP and UIMA) you know it &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>After a week off to enjoy time with my family, I thought I would kick off the last week of 2010 with a look back at the year as it relates to the Apache Lucene ecosystem.  For anyone who follows the amalgamation of projects that I like to call the Lucene Ecosystem (the Apache projects: Lucene, Solr, Nutch, Mahout, Tika, PyLucene, Lucy, Lucene.NET, Droids, ManifoldCF &#8212; Lucene Connector Framework, OpenNLP and UIMA) you know it has been an amazingly busy and fruitful year.  Instead of going through each project like <a href="http://www.lucidimagination.com/blog/2009/12/24/the-apache-lucene-ecosystem-my-view-of-2009/">last year&#8217;s review</a>, I&#8217;m just going to be a bit less formal and hit on the highlights as I see them.</p>
<p>Before I dig in too much, though, a special thanks to all our customers at Lucid Imagination as well as to my coworkers.  I&#8217;m coming up on 15 years out in the &#8220;real world&#8221; and I can honestly say I&#8217;ve never enjoyed what I do as much as I do here and that even accounts for the normal rough patches one goes through in any job.  As an engineer, there are few things as cool as getting to work with customers who are not only using, but pushing your work/project/product on a daily basis to do new and interesting things (I think this is a direct result of the project being Open Source, which I believe has an inherently <a href="http://www.lucidimagination.com/blog/2009/04/20/lucene-open-source-and-the-cost-of-experimentation/">lower cost of experimentation</a>).  I&#8217;ve been fortunate enough to meet and talk with many people doing all kinds of things with Lucene and Solr ranging from the &#8220;mundane&#8221; of basic keyword search to those building next generation search capabilities at incredible scale.  Through it all, I&#8217;m constantly amazed at the flexibility and efficiency of Lucene and Solr.  For instance, I&#8217;ve been working with one customer now whose Solr-based solution (for the exact same content) will use ~50% less hardware and will have an index that is 1/6 the size of their FAST index all while saving them major dinero.</p>
<p>Speaking of Lucid, one of the highlights of the year for us that relates directly to Lucene and Solr is the launch of our enterprise version: <a href="http://www.lucidimagination.com/lwe/download">LucidWorks Enterprise</a>.   I like to think of it as Apache Solr with a whole lot of Lucid expertise on how to use Solr baked in and topped off with other features and functionality to make building search applications easier.</p>
<p>OK, time to move on to the open source projects&#8230;</p>
<ol>
<li>Without a doubt, the biggest news of the year is the merging of the Lucene and Solr code bases as well as the &#8220;graduation&#8221; of several subprojects to Apache Software Foundation Top Level Projects (TLP).  The graduating projects are <a href="http://tika.apache.org">Tika</a>, <a href="http://nutch.apache.org">Nutch</a>, and <a href="http://mahout.apache.org">Mahout</a>.  We also spun Lucy (a C port) to the Incubator, where it is working on its own community.  These moves were primarily done to focus the project management on a single code base, but they also demonstrate the project has reached a level of maturity at the ASF.  The move also has the side benefit of bringing each project higher visibility.</li>
<li>I&#8217;m particularly excited about the addition of <a href="http://www.lucidimagination.com/blog/2010/12/02/opennlp-moving-to-apache/">OpenNLP to the Apache</a> umbrella.  OpenNLP is a nice open source Java project for natural language processing that has lived at SourceForge for quite some time.  I would expect development to grow quite a bit under the ASF community-based model.  Also, integrating OpenNLP with Solr and Lucene is pretty easy to do.  I would be remiss if I didn&#8217;t also give a nod to the addition of the <a href="http://incubator.apache.org/connectors">ManifoldCF</a> project to the ASF.  ManifoldCF will help unlock content in Sharepoint, Documentum and other repositories for users of Lucene and Solr.</li>
<li>Lucene&#8217;s trunk code base now implements our &#8220;Flex APIs&#8221;, which should allow users to have near total control over what goes in the index as well as alternate compression techniques, different scoring models, etc.  See Michael McCandless&#8217; excellent <a href="http://www.lucidimagination.com/files/file/LuceneRev_McCandless_FunWithFlex.pdf">talk at Lucene Revolution</a> for more details.</li>
<li>With all the location-aware devices and capabilities on the market, geo-spatial search is a hot topic, and Lucene and Solr have been adding quite a few capabilities in this regard, including the ability to filter, boost and sort results based on location information in documents.  See Solr&#8217;s <a href="http://wiki.apache.org/solr/SpatialSearch">Spatial Search Wiki page</a> for more info as well as several of my <a href="http://www.lucidimagination.com/search/?q=spatial#/s:lucid/li:blogs">past blog posts</a>.</li>
<li>Of course, everyone was abuzz about the cloud this year.  For Solr, this translates into greater efforts to make Solr easier to scale to very large installations (100s to 1000s of nodes and billions and billions of documents) via the <a href="http://wiki.apache.org/solr/SolrCloud">Solr Cloud project that Yonik Seeley and Mark Miller have been spearheading</a>.</li>
<li>On the user side, one of the biggest pieces of buzz this year related to Lucene was the migration of Twitter search to Lucene.  At 1 billion queries per day and 50 million posts per day (all indexed and searchable in near real time), Twitter&#8217;s search system certainly has its work cut out for it.  However, as Michael Busch <a href="http://www.lucidimagination.com/events/revolution2010/videos/mbusch">outlined at Lucene Revolution</a>, Apache Lucene was up to the task!  Naturally, there were lots of other companies that migrated to Solr and Lucene as well.  Have you <a href="http://www.lucidimagination.com/enterprise-search-solutions/case-studies">shared your use case</a>?</li>
</ol>
<p>Well, I&#8217;ve no doubt missed a bunch of other things, but those items, to me, are some of the bigger highlights.  Looking forward, there are some other exciting things coming to Lucene and Solr.  In particular, I&#8217;m working on adding language identification, related searches and point in polygon filtering to Solr.  I would also expect we will release Lucene/Solr 3.1 fairly soon, too, but you can&#8217;t pin me down on a date just yet.</p>
<p>Here&#8217;s hoping you all have a Happy Holidays and a Happy New Year!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/12/27/the-apache-lucene-ecosystem-my-view-of-2010/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Open Source Escrow to the Rescue</title>
		<link>http://www.lucidimagination.com/blog/2010/08/19/open-source-escrow-to-the-rescue/</link>
		<comments>http://www.lucidimagination.com/blog/2010/08/19/open-source-escrow-to-the-rescue/#comments</comments>
		<pubDate>Thu, 19 Aug 2010 21:12:20 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Enterprise Search]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[LucidGaze]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[PyLucene]]></category>
		<category><![CDATA[Span Queries]]></category>
		<category><![CDATA[ZooKeeper]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=2341</guid>
		<description><![CDATA[<p>Do you remember this scenario from days of  yore?</p>
<ul>
<li>Company A buys a software  license from Company B, a startup.</li>
<li>Company A crosses its fingers  that Company B doesn’t go bankrupt and disappear, along with the source code for  Company A’s mission-critical software.</li>
<li>Company B goes  kaput.</li>
<li>Company A is left with some  machine-readable binary code that it is powerless to develop or use.</li>
</ul>
<p>Source code escrow has changed the outcome  of this sticky situation &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Do you remember this scenario from days of  yore?</p>
<ul>
<li>Company A buys a software  license from Company B, a startup.</li>
<li>Company A crosses its fingers  that Company B doesn’t go bankrupt and disappear, along with the source code for  Company A’s mission-critical software.</li>
<li>Company B goes  kaput.</li>
<li>Company A is left with some  machine-readable binary code that it is powerless to develop or use.</li>
</ul>
<p>Source code escrow has changed the outcome  of this sticky situation for the better, and here’s how: Countless  software companies go out of business every year, and either their code  disappears entirely or goes to another company that doesn’t do any development  or maintenance on it. The concept of escrow is one way in which open source  gives companies a chance to continue their contribution and innovation, because  the code they wrote can outlive them and continue to be evolved by the  community.  I covered this topic in my most recent <a title="http://www.networkworld.com/community/blog/13681" href="http://www.networkworld.com/community/blog/13681">post</a> on the Network  World open source subnet. I invite your feedback: what’s your experience with  source code or open source escrow? Any best practices or cautionary tales to  share? Looking forward to hearing from you.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/08/19/open-source-escrow-to-the-rescue/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Apache Lucene EuroCon Agenda &#8211; The Revolution is On!</title>
		<link>http://www.lucidimagination.com/blog/2010/04/22/apache-lucene-eurocon-agenda-the-revolution-is-on/</link>
		<comments>http://www.lucidimagination.com/blog/2010/04/22/apache-lucene-eurocon-agenda-the-revolution-is-on/#comments</comments>
		<pubDate>Thu, 22 Apr 2010 11:09:33 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Connector Framework]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[Open Relevance Project]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>
		<category><![CDATA[ZooKeeper]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1965</guid>
		<description><![CDATA[<p>After reviewing a lot of great talk proposals, we&#8217;ve announced the  agenda for Apache Lucene Eurocon: <a href="http://lucene-eurocon.org/agenda.html">Apache Lucene EuroCon &#8211;  Europe&#8217;s Premier Lucene and Solr Search User Conference</a>.</p>
<p>One  of the things I really like about this agenda is it is a great mix of  basics, use cases from all over the search map (CMS, news, social media,  advertising), business decisions (see last list and next list) and advanced topics  (NLP, collab filtering, machine &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>After reviewing a lot of great talk proposals, we&#8217;ve announced the  agenda for Apache Lucene Eurocon: <a href="http://lucene-eurocon.org/agenda.html">Apache Lucene EuroCon &#8211;  Europe&#8217;s Premier Lucene and Solr Search User Conference</a>.</p>
<p>One of the things I really like about this agenda is that it is a great mix of basics, use cases from all over the search map (CMS, news, social media, advertising), business decisions (see last list and next list) and advanced topics (NLP, collab filtering, machine learning, advanced visualization, multilingual).  Moreover, the content, even though it is centered on Lucene, goes well beyond just being about Lucene and is really about search, in all of its power and glory.  It&#8217;s about real users with real needs getting real problems solved using the Lucene ecosystem.  Oh, and by the way, those users are doing it at scale!  Big scale.</p>
<p>That&#8217;s powerful stuff,  because, in case you hadn&#8217;t noticed (shh, it&#8217;s our little secret) there is a revolution going on in search.  (Funny how that line coincides with Lucid&#8217;s frontman, Eric Gries, giving  a  talk titled &#8220;The Search Revolution&#8221;)</p>
<p>Are you a part of the revolution?  See you in <a href="http://lucene-eurocon.org/index.html">Prague</a> in mid-May.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/04/22/apache-lucene-eurocon-agenda-the-revolution-is-on/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Apache Lucene Connector Framework now in Incubation at the ASF</title>
		<link>http://www.lucidimagination.com/blog/2010/01/20/apache-lucene-connector-framework-now-in-incubation-at-the-asf/</link>
		<comments>http://www.lucidimagination.com/blog/2010/01/20/apache-lucene-connector-framework-now-in-incubation-at-the-asf/#comments</comments>
		<pubDate>Wed, 20 Jan 2010 20:16:13 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Connector Framework]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[PyLucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1509</guid>
		<description><![CDATA[<h1>Short Version</h1>
<p>The Apache Lucene Connector Framework project has officially entered incubation.  LCF, for short, is going to be a framework for connecting to content repositories like Sharepoint, Documentum, etc. and will make it easy to hook into Lucene, Solr, Nutch, Mahout, Tika, while, of course, remaining agnostic of the final destination of the data.  See the <a href="http://incubator.apache.org/connectors/">Connectors website</a> and the <a href="http://wiki.apache.org/incubator/LuceneConnectorFrameworkProposal">original proposal</a> for more info.  Help wanted!</p>
<h1>Long Version</h1>
<h2>Background</h2>
<p>A while back, <a href="http://www.metacarta.com">MetaCarta</a>&#8230;</p>]]></description>
			<content:encoded><![CDATA[<h1>Short Version</h1>
<p>The Apache Lucene Connector Framework project has officially entered incubation.  LCF, for short, is going to be a framework for connecting to content repositories like Sharepoint, Documentum, etc. and will make it easy to hook into Lucene, Solr, Nutch, Mahout, Tika, while, of course, remaining agnostic of the final destination of the data.  See the <a href="http://incubator.apache.org/connectors/">Connectors website</a> and the <a href="http://wiki.apache.org/incubator/LuceneConnectorFrameworkProposal">original proposal</a> for more info.  Help wanted!</p>
<h1>Long Version</h1>
<h2>Background</h2>
<p>A while back, <a href="http://www.metacarta.com">MetaCarta</a>, a spatial search company, approached us about open sourcing their internally developed Connector Framework at the <a href="http://www.apache.org">Apache Software Foundation</a>.  After several discussions and a whole bunch of legwork getting a proposal together, the LCF is now officially launched in the <a href="http://incubator.apache.org/">Apache Incubator</a>!  We&#8217;ve already got a great roster of committers lined up and are working to incorporate the software grant from MetaCarta, from which we can build out a first release, so stay tuned!  Lucid Imagination, of course, is a big supporter of this project and we look forward to its success!</p>
<h2>What is a Connector Framework?</h2>
<p>To quote the proposal:</p>
<blockquote><p>[The Lucene] Connector Framework is an extendible [sic] incremental crawler, which uses a database to manage configuration and crawl history, and provides reasonably high performance in accessing content in multiple repositories for the main purpose of search engine indexing. Connector Framework also establishes a repository-specific security model which can be used to limit search user access to repository content based on a user&#8217;s identity. Connector Framework also includes existing connectors and authorities for:</p>
<ul>
<li>File system</li>
<li>Windows shares</li>
<li>JDBC-supported databases</li>
<li>RSS feeds</li>
<li>General websites</li>
<li>LiveLink [from OpenText]</li>
<li>Documentum [from EMC]</li>
<li>SharePoint [from Microsoft]</li>
<li>Meridio [from Meridio]</li>
<li>Memex [from Memex]</li>
<li>FileNet [from IBM]</li>
</ul>
</blockquote>
<p>There are two pieces in particular to highlight in the quote.  First of all, it&#8217;s an extensible framework, meaning new connectors can be added without the need for application developers writing &#8220;one-off&#8221; code just for that connector.  For anyone who&#8217;s lived that pain, you know first hand what I mean.  In fact, I&#8217;ve already heard from others who are thinking of contributing their connectors for other data stores as well!  Second, the framework accounts for repository specific security.  In corporate environments, this is vital to making sure that the right people, and only the right people, have access to the right information at the right time.</p>
<h2>Why is this important?</h2>
<p>Many, many search engines, not to mention many other applications, have either rolled their own connectors or bought a company that provides them.  Connectors, in some situations, are the cost of entry into certain markets, but are rarely the feature that seals the deal.  By making these open source, we can all share the cost of maintaining them while increasing the quality of the software well beyond what any one company can achieve.  Beyond that, we hope the repository companies will also step up and contribute (some are already quite open), as making it easier to access these repositories will no doubt lead to more applications, which of course should mean more sales for said companies.</p>
<h2>How can you contribute?</h2>
<p>For starters, subscribe to the <a href="http://incubator.apache.org/connectors/mail.html">mailing lists</a>.  Then check out the <a href="http://cwiki.apache.org/confluence/display/CONNECTORS/HowToContribute">How To Contribute page</a> on the Wiki.  Beyond that, chip in with your connector use cases on the mailing lists and be a part of the community.</p>
<h2>What&#8217;s next?</h2>
<p>First off, the community will have to process the software grant from MetaCarta and then commit the code to LCF&#8217;s Subversion <a href="https://svn.apache.org/repos/asf/incubator/lcf">repository</a>.  From there, we&#8217;ll do just like any Apache project does and look to build out not only the code, but also the community, all on the path to graduating from the Incubator and taking our place as a full-fledged Lucene subproject.  Keep your eyes here and on the mailing lists and websites for more information in the future!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/01/20/apache-lucene-connector-framework-now-in-incubation-at-the-asf/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>The Apache Lucene Ecosystem: My view of 2009</title>
		<link>http://www.lucidimagination.com/blog/2009/12/24/the-apache-lucene-ecosystem-my-view-of-2009/</link>
		<comments>http://www.lucidimagination.com/blog/2009/12/24/the-apache-lucene-ecosystem-my-view-of-2009/#comments</comments>
		<pubDate>Thu, 24 Dec 2009 15:53:02 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Droids]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucy]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[Open Relevance Project]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[PyLucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>
		<category><![CDATA[ZooKeeper]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1429</guid>
		<description><![CDATA[<p>It&#8217;s that time of year, so I thought I would take a look back at the year that was for the <a href="http://lucene.apache.org">Lucene Ecosystem</a> and maybe look ahead just a little bit too.</p>
<p>First and foremost, it should be obvious to even the most casual observer that the Apache Lucene communities are thriving.  Not only is it a great time to be involved in open source, it&#8217;s a great time to be involved in Lucene.  Both &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s that time of year, so I thought I would take a look back at the year that was for the <a href="http://lucene.apache.org">Lucene Ecosystem</a> and maybe look ahead just a little bit too.</p>
<p>First and foremost, it should be obvious to even the most casual observer that the Apache Lucene communities are thriving.  Not only is it a great time to be involved in open source, it&#8217;s a great time to be involved in Lucene.  Both as a committer and as an employee of Lucid Imagination, I&#8217;m continuously amazed at the vibe produced by the people using the Lucene suite of libraries, tools and applications.  People are routinely solving both large scale and really hard problems using the Lucene ecosystem and they are doing it on time and on budget.  For instance, this year alone, I&#8217;ve seen companies and individuals using Lucene and Solr to provide search in production environments with document counts ranging from a few tens of thousands all the way up to 5-10 billion plus, and query rates ranging from barely a blip to 1000+ QPS.  I&#8217;ve also seen many people using Lucene to power recommendation engines, content management systems, machine learning/NLP applications and log analysis tools.</p>
<p>Much to my initial surprise, the number one reason I hear for why people chose Lucene is flexibility.  (I thought it would be the fact that it is free to use, but that is just icing on the proverbial cake, I guess.)  Namely, Lucene gives them the flexibility to build what they want or simply to use it out of the box.  It gives them the flexibility to bring in other tools from other open source projects or other commercial vendors, all without compromising speed or scale.</p>
<p>With that in mind, I thought I would give some highlights of both the Lucene top level project (TLP, at http://lucene.apache.org, the ASF project that &#8220;houses&#8221; all of the Lucene related subprojects) and the individual projects.  (I&#8217;m not involved in all of them, so please correct me if I&#8217;m wrong!)</p>
<h1>Lucene TLP</h1>
<p>Whew!  It&#8217;s been a busy year for the Lucene TLP.  We started the <a href="http://lucene.apache.org/openrelevance">Open Relevance Project</a> (ORP), added <a href="http://lucene.apache.org/pylucene">PyLucene</a> (a Python port of Lucene) and successfully graduated a .NET version of Lucene from ASF incubation, not to mention the fact that the Lucene PMC is responsible for overseeing the release of all the various bits and bytes for each and every subproject (which is a lot of releases!).  We also, for the first time ever, organized two days of Lucene related talks at <a href="http://www.us.apachecon.org">ApacheCon US</a> plus two days of training and meetups.  (In the past, organization was always handled by the ASF Conference Committee.)</p>
<p>In looking ahead for the TLP, I see a continued focus on providing quality software across all the projects.  Additionally, keep your ears open, as there is a new sub project brewing that I think will really make it even easier for people to deploy Lucene based solutions.  Finally, just as Lucene gave birth to Apache Hadoop and is happy to see it doing so well, there is <a href="http://www.lucidimagination.com/search/document/5a41be454d503779/possible_contribution_at_somewhat_of_a_tangent_to_mahout">growing talk</a> that Lucene will look to see Apache Mahout off as its own TLP.  Of course, none of that is set in stone yet!</p>
<p>For those looking for more on the big picture that is Lucene, see my <a href="http://www.us.apachecon.com/c/acus2009/sessions/428">talk</a> at ApacheCon US for more details on the ecosystem.  Not sure why the slides aren&#8217;t there, so I put them <a href="http://people.apache.org/~gsingers/apacheconUS09/luceneEcosystem.pptx">here</a>.</p>
<h1>Lucene Java</h1>
<p><a href="http://lucene.apache.org/java">Lucene Java</a> (i.e. what everyone knows as &#8220;Lucene&#8221;) not only continues to provide a rock solid indexing and search API, it also continues to push forward with new capabilities.  In 2009, Lucene had four releases (2.4.1, 2.9.0, 2.9.1 and 3.0.0).  2.9.0 was probably the most interesting, as it significantly improved performance in a number of areas, while 3.0.0 removed all of the deprecated APIs and finally, officially, dropped support for JDK 1.4.  I&#8217;ll leave it to the reader to go look up all the features and changes as they are numerous.</p>
<p>Looking ahead, the phrase of the year appears to be &#8220;flexible indexing&#8221;.  Flex Indexing looks to make it even easier for people to custom craft what is in their index, whether that is rich token attributes (aka &#8220;typed&#8221; payloads), alternative scoring models (like Okapi BM25) or a bare bones index designed for blazing fast speed.</p>
<h1>Solr</h1>
<p>With Lucene as the engine, <a href="http://lucene.apache.org/solr">Solr</a> has evolved into quite the car.   Building on all of the goodness that is Lucene, Solr, in 2009, released version 1.4 with a whole slew of new features, faster implementations and bug fixes.  Highlights for 1.4 include: improved filtering and faceting performance, support for clustering, rich document indexing via Apache Tika, multi-select faceting (see Lucid&#8217;s very own <a href="http://search.lucidimagination.com">search.lucidimagination.com</a> for a demo), many new Query capabilities and a whole bevy of new Components (Terms, Term Vectors, Auto-suggest, deduplication and Statistics on result sets) that truly make Solr an incredible search platform.</p>
<p>Looking ahead, Solr 1.5 (2.0?) is already in the works and looks to have even more <a href="http://www.lucidimagination.com/blog/2009/12/12/apache-solr-1-5-on-the-move-with-more-functionality/">functionality</a>.  For instance, a lot of work is underway to integrate Apache ZooKeeper and other distributed capabilities, which will help make deploying Solr at scale even easier.  Meanwhile, many are hard at work adding &#8220;field collapsing&#8221; (search result grouping/deduplication) and spatial (local/geo) search.</p>
<h1>Mahout</h1>
<p>It&#8217;s been a very exciting year (in my completely biased opinion!) for <a href="http://lucene.apache.org/mahout">Mahout</a>, the scalable machine learning project under Lucene.  In 2009, Mahout shepherded through its very first release (0.1), built on the strength of a few dedicated volunteers working to add capabilities for clustering, categorization and collaborative filtering.  Next came 0.2 with many new features (frequent patternset mining, Latent Dirichlet Allocation, Random Decision Forests, new recommendation capabilities), API and performance improvements, and a growing list of people who stopped lurking and stepped up to help out.  Towards the end of the year, Mahout is already reaching a list volume that I find difficult to keep up with if I miss a day or two.  For starters, we have taken on the task of integrating/transforming the <a href="http://acs.lbl.gov/~hoschek/colt/">Colt</a> matrix library for our needs.  We are also working on adding truly large scale recommendation capabilities plus adding in a Support Vector Machine implementation and Logistic Regression.  Not only that, but the mahout-user@lucene.apache.org mailing list continues to be a valuable resource for people seeking practical advice on deploying machine learning in production environments, whether or not they choose Mahout.</p>
<p>In 2010, I suspect Mahout will become its own TLP, with several sub projects roughly divided as: core/utilities, recommendations (Taste) and NLP.  Of course, until it happens, this is just speculation.  I also think Mahout will look to finalize its APIs for a 1.0 release.</p>
<h1>Nutch</h1>
<p>In 2009, <a href="http://lucene.apache.org/nutch">Apache Nutch</a> released the long awaited version 1.0.  This release contained many new indexing and scoring capabilities, as well as integration with Solr.  The community continues to be focused on providing large scale crawling and search capabilities by leveraging Apache Hadoop and Lucene/Solr.  Currently, the community is actively looking at modularizing Nutch to allow it to more easily plug in other ecosystem components like Tika and Solr while focusing on the primary task of obtaining and managing content via crawling.</p>
<h1>Tika</h1>
<p><a href="http://lucene.apache.org/tika/">Apache Tika</a> is a content extraction framework for &#8220;rich&#8221; documents like Adobe PDF and Microsoft Office.  In 2009, Tika released versions 0.3, 0.4 and 0.5, all with incremental improvements designed to make it more stable and easier to use.  Each release also seemed to carry with it a new list of supported file formats as more and more people join the project to lend a hand.</p>
<p>Coming up, I suspect Tika will look to finalize a 1.0 release at some point in 2010 as well as focus on standardizing, if such a thing is possible, the metadata artifacts produced by Tika.</p>
<h1>Open Relevance Project</h1>
<p>The <a href="http://lucene.apache.org/openrelevance">ORP</a> is a project that has been in my brain for several years now and finally got off the ground in 2009.  The goal of ORP is to provide corpora, queries, judgments and other tools to help search and machine learning projects discuss relevance in a completely open way.  While the project is really young, it is slowly but surely building up steam by adding some basic tools and collections thanks to the hard work of several individuals.  In 2010, look for ORP to build out a more complete toolset while attracting more users and contributors.  It will also be vital for the ORP to create its own versioned corpora for download (free!) so that all experiments can be reliably reproduced.</p>
<h1>Droids</h1>
<p><a href="http://incubator.apache.org/droids">Droids</a> is a standalone crawler framework currently in incubation at the ASF.  Development was active in 2009, but the project has not yet had a release.  For now, it is a Spring based framework that allows one to quickly build out agents that can go and crawl and process content.</p>
<h1>Lucene.NET</h1>
<p>In 2009, <a href="http://incubator.apache.org/lucene.net/">Lucene.NET</a> graduated (some infrastructure changes still need to happen) from ASF incubation and became a full-fledged member of the Lucene ecosystem.  While I&#8217;m not closely involved with Lucene.NET, the community continues to provide value to those looking for a solid search library in .NET.  Since the project is mostly autogenerated from the Java sources, the .NET version has tracked the Lucene Java releases fairly closely.</p>
<p>Looking forward, I expect the .NET version will strive to maintain a lockstep march with Lucene releases.</p>
<h1>PyLucene</h1>
<p>Similar to .NET, <a href="http://lucene.apache.org/pylucene">PyLucene</a> produces a Python port of Lucene Java.  In 2009, PyLucene was formally welcomed into the Lucene fold via a software donation by Andi Vajda.  It continues to produce releases of PyLucene in lockstep with Lucene Java.</p>
<h1>Lucy</h1>
<p><a href="http://lucene.apache.org/lucy">Lucy</a> is a &#8220;loose&#8221; &#8216;C&#8217; port of Lucene.  Lucy finally got off the ground in 2009 and is steadily working on building out a core search library that provides fast search capabilities for languages like Perl, C and Ruby.</p>
<p>For 2010, look for Lucy to continue to grow its community while adding capabilities.</p>
<h1>Moving Forward</h1>
<p>While the past is, of course, no prediction of the future, I think it&#8217;s safe to say Lucene is looking to continue to provide significant capabilities and value to both well established and new communities alike.  With open source, you never know where the next good idea is coming from, so make sure to stay tuned both here and on the mailing lists for more insight and more cool new capabilities.</p>
<p>Happy Holidays and here&#8217;s to an Open Source 2010!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/12/24/the-apache-lucene-ecosystem-my-view-of-2009/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

