<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; Tika</title>
	<atom:link href="http://www.lucidimagination.com/blog/category/tika/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Indexing rich files into Solr, quickly and easily</title>
		<link>http://www.lucidimagination.com/blog/2011/08/31/indexing-rich-files-into-solr-quickly-and-easily/</link>
		<comments>http://www.lucidimagination.com/blog/2011/08/31/indexing-rich-files-into-solr-quickly-and-easily/#comments</comments>
		<pubDate>Wed, 31 Aug 2011 14:07:53 +0000</pubDate>
		<dc:creator>Erik Hatcher</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Ruby]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>
		<category><![CDATA[Erik Hatcher]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3885</guid>
		<description><![CDATA[<p>This past weekend I gave yet another &#8220;Rapid Prototyping with Solr&#8221; presentation, this time back in the saddle with the <a title="No Fluff, Just Stuff - Raleigh, August 2011" href="http://www.nofluffjuststuff.com/conference/raleigh/2011/08/home" target="_blank">No Fluff, Just Stuff symposium in Raleigh, NC</a>. I intentionally waited until the last minute to hack together a quick script to index some data I hadn&#8217;t indexed before, to demonstrate the ease with which one can grab Solr and immediately make use of it. This time around I cobbled together a &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>This past weekend I gave yet another &#8220;Rapid Prototyping with Solr&#8221; presentation, this time back in the saddle with the <a title="No Fluff, Just Stuff - Raleigh, August 2011" href="http://www.nofluffjuststuff.com/conference/raleigh/2011/08/home" target="_blank">No Fluff, Just Stuff symposium in Raleigh, NC</a>. I intentionally waited until the last minute to hack together a quick script to index some data I hadn&#8217;t indexed before, to demonstrate the ease with which one can grab Solr and immediately make use of it. This time around I cobbled together a simple Ruby script to index a directory full of rich (PDF, HTML, Word, etc.) documents into a fresh Solr 3.3.0 install. Only a few seconds later I had my documents indexed, and even searchable through a user interface.</p>
<p>Here are the steps I took:</p>
<ol>
<li>Download and &#8220;install&#8221; (aka unzip) Apache Solr 3.3.0</li>
<li>Launch Solr (cd example; java -jar start.jar)</li>
<li>Index files</li>
</ol>
<p>That&#8217;s it.  Here&#8217;s the indexing script I used:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'net/http'</span>
&nbsp;
<span style="color:#0066ff; font-weight:bold;">@dir</span> = <span style="color:#CC00FF; font-weight:bold;">Dir</span>.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;/Users/erikhatcher/apache-solr-3.3.0/docs&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
&nbsp;
<span style="color:#0066ff; font-weight:bold;">@url</span> = <span style="color:#CC00FF; font-weight:bold;">URI</span>.<span style="color:#9900CC;">parse</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;http://localhost:8983/solr&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
<span style="color:#0066ff; font-weight:bold;">@connection</span> = <span style="color:#6666ff; font-weight:bold;">Net::HTTP</span>.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span>@url.<span style="color:#9900CC;">host</span>, <span style="color:#0066ff; font-weight:bold;">@url</span>.<span style="color:#9900CC;">port</span><span style="color:#006600; font-weight:bold;">&#41;</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> index<span style="color:#006600; font-weight:bold;">&#40;</span>filename<span style="color:#006600; font-weight:bold;">&#41;</span>
  <span style="color:#0066ff; font-weight:bold;">@connection</span>.<span style="color:#9900CC;">get</span><span style="color:#006600; font-weight:bold;">&#40;</span>@url.<span style="color:#9900CC;">path</span> <span style="color:#006600; font-weight:bold;">+</span> <span style="color:#996600;">&quot;/update/extract?stream.file=#{filename}&amp;literal.id=#{filename}&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> commit
  <span style="color:#0066ff; font-weight:bold;">@connection</span>.<span style="color:#9900CC;">get</span><span style="color:#006600; font-weight:bold;">&#40;</span>@url.<span style="color:#9900CC;">path</span> <span style="color:#006600; font-weight:bold;">+</span> <span style="color:#996600;">&quot;/update?commit=true&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#0066ff; font-weight:bold;">@dir</span>.<span style="color:#9900CC;">each</span> <span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">|</span>name<span style="color:#006600; font-weight:bold;">|</span>
  f = <span style="color:#996600;">&quot;#{@dir.path}/#{name}&quot;</span>
  <span style="color:#9966CC; font-weight:bold;">if</span> <span style="color:#CC00FF; font-weight:bold;">File</span>.<span style="color:#9900CC;">file</span>?<span style="color:#006600; font-weight:bold;">&#40;</span>f<span style="color:#006600; font-weight:bold;">&#41;</span>
    <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Indexing #{f}...&quot;</span>
    index<span style="color:#006600; font-weight:bold;">&#40;</span>f<span style="color:#006600; font-weight:bold;">&#41;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#006600; font-weight:bold;">&#125;</span>
&nbsp;
<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Committing...&quot;</span>
commit
&nbsp;
<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Done!&quot;</span></pre></div></div>

<p>To make it look prettier, only a little dabbling with the templates is needed &#8211; add your company logo, customize the colors. And a change to the example (/browse handler) configuration to facet on content_type will allow you to easily search just within documents of specific types through the included UI.  The example code above indexed the docs that ship with Apache Solr 3.3.0; just change the path to a directory of yours to index your own content.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/08/31/indexing-rich-files-into-solr-quickly-and-easily/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>The Apache Lucene Ecosystem: My View of 2010</title>
		<link>http://www.lucidimagination.com/blog/2010/12/27/the-apache-lucene-ecosystem-my-view-of-2010/</link>
		<comments>http://www.lucidimagination.com/blog/2010/12/27/the-apache-lucene-ecosystem-my-view-of-2010/#comments</comments>
		<pubDate>Mon, 27 Dec 2010 15:54:11 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Droids]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Connector Framework]]></category>
		<category><![CDATA[LucidWorks]]></category>
		<category><![CDATA[Lucy]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[ManifoldCF]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[PyLucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>
		<category><![CDATA[ZooKeeper]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=2809</guid>
		<description><![CDATA[<p>After a week off to enjoy time with my family, I thought I would kick off the last week of 2010 with a look back at the year as it relates to the Apache Lucene ecosystem.  For anyone who follows the amalgamation of projects that I like to call the Lucene Ecosystem (the Apache projects: Lucene, Solr, Nutch, Mahout, Tika, PyLucene, Lucy, Lucene.NET, Droids, ManifoldCF &#8212; Lucene Connector Framework, OpenNLP and UIMA) you know it &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>After a week off to enjoy time with my family, I thought I would kick off the last week of 2010 with a look back at the year as it relates to the Apache Lucene ecosystem.  For anyone who follows the amalgamation of projects that I like to call the Lucene Ecosystem (the Apache projects: Lucene, Solr, Nutch, Mahout, Tika, PyLucene, Lucy, Lucene.NET, Droids, ManifoldCF &#8212; Lucene Connector Framework, OpenNLP and UIMA) you know it has been an amazingly busy and fruitful year.  Instead of going through each project like <a href="http://www.lucidimagination.com/blog/2009/12/24/the-apache-lucene-ecosystem-my-view-of-2009/">last year&#8217;s review</a>, I&#8217;m just going to be a bit less formal and hit on the highlights as I see them.</p>
<p>Before I dig in too much, though, a special thanks to all our customers at Lucid Imagination as well as to my coworkers.  I&#8217;m coming up on 15 years out in the &#8220;real world&#8221; and I can honestly say I&#8217;ve never enjoyed what I do as much as I do here and that even accounts for the normal rough patches one goes through in any job.  As an engineer, there are few things as cool as getting to work with customers who are not only using, but pushing your work/project/product on a daily basis to do new and interesting things (I think this is a direct result of the project being Open Source, which I believe has an inherently <a href="http://www.lucidimagination.com/blog/2009/04/20/lucene-open-source-and-the-cost-of-experimentation/">lower cost of experimentation</a>).  I&#8217;ve been fortunate enough to meet and talk with many people doing all kinds of things with Lucene and Solr ranging from the &#8220;mundane&#8221; of basic keyword search to those building next generation search capabilities at incredible scale.  Through it all, I&#8217;m constantly amazed at the flexibility and efficiency of Lucene and Solr.  For instance, I&#8217;ve been working with one customer now whose Solr-based solution (for the exact same content) will use ~50% less hardware and will have an index that is 1/6 the size of their FAST index all while saving them major dinero.</p>
<p>Speaking of Lucid, one of the highlights of the year for us that relates directly to Lucene and Solr is the launch of our enterprise version: <a href="http://www.lucidimagination.com/lwe/download">LucidWorks Enterprise</a>.   I like to think of it as Apache Solr with a whole lot of Lucid expertise on how to use Solr baked in and topped off with other features and functionality to make building search applications easier.</p>
<p>OK, time to move on to the open source projects&#8230;</p>
<ol>
<li>Without a doubt, the biggest news of the year is the merging of the Lucene and Solr code bases as well as the &#8220;graduation&#8221; of several subprojects to Apache Software Foundation Top Level Projects (TLP).  The graduating projects are <a href="http://tika.apache.org">Tika</a>, <a href="http://nutch.apache.org">Nutch</a>, and <a href="http://mahout.apache.org">Mahout</a>.  We also spun Lucy (a C port) off to the Incubator, where it is working on building its own community.  These moves were primarily done to focus the project management on a single code base, but they also demonstrate that these projects have reached a level of maturity at the ASF.  The move also has the side benefit of bringing each project higher visibility.</li>
<li>I&#8217;m particularly excited about the addition of <a href="http://www.lucidimagination.com/blog/2010/12/02/opennlp-moving-to-apache/">OpenNLP to the Apache</a> umbrella.  OpenNLP is a nice open source Java project for natural language processing that has lived at SourceForge for quite some time.  I would expect development to grow quite a bit under the ASF&#8217;s community-based model.  Also, integrating OpenNLP with Solr and Lucene is pretty easy to do.  I would be remiss if I didn&#8217;t also give a nod to the addition of the <a href="http://incubator.apache.org/connectors">ManifoldCF</a> project to the ASF.  ManifoldCF will help unlock content in SharePoint, Documentum and other repositories for users of Lucene and Solr.</li>
<li>Lucene&#8217;s trunk code base now implements our &#8220;Flex APIs&#8221;, which should allow users to have near total control over what goes in the index as well as alternate compression techniques, different scoring models, etc.  See Michael McCandless&#8217; excellent <a href="http://www.lucidimagination.com/files/file/LuceneRev_McCandless_FunWithFlex.pdf">talk at Lucene Revolution</a> for more details.</li>
<li>With all the location-aware devices and capabilities on the market, geo-spatial search is a hot topic, and Lucene and Solr have been adding quite a few capabilities in this regard, with the ability to filter, boost and sort results based on location information in documents.  See Solr&#8217;s <a href="http://wiki.apache.org/solr/SpatialSearch">Spatial Search Wiki page</a> for more info as well as several of my <a href="http://www.lucidimagination.com/search/?q=spatial#/s:lucid/li:blogs">past blog posts</a>.</li>
<li>Of course, everyone was abuzz about the cloud this year.  For Solr, this translates into greater efforts to make Solr easier to scale to very large installations (100s to 1000s of nodes and billions and billions of documents) via the <a href="http://wiki.apache.org/solr/SolrCloud">Solr Cloud project that Yonik Seeley and Mark Miller have been spearheading</a>.</li>
<li>On the user side, one of the biggest pieces of buzz this year related to Lucene was the migration of Twitter search to Lucene.  At 1 billion queries per day and 50 million posts per day (all indexed and searchable in near real time), Twitter&#8217;s search system certainly has its work cut out for it.  However, as Michael Busch <a href="http://www.lucidimagination.com/events/revolution2010/videos/mbusch">outlined at Lucene Revolution</a>, Apache Lucene was up to the task!  Naturally, there were lots of other companies that migrated to Solr and Lucene as well.  Have you <a href="http://www.lucidimagination.com/enterprise-search-solutions/case-studies">shared your use case</a>?</li>
</ol>
<p>Well, I&#8217;ve no doubt missed a bunch of other things, but those items, to me, are some of the bigger highlights.  Looking forward, there are some other exciting things coming to Lucene and Solr.  In particular, I&#8217;m working on adding language identification, related searches and point in polygon filtering to Solr.  I would also expect we will release Lucene/Solr 3.1 fairly soon, too, but you can&#8217;t pin me down on a date just yet.</p>
<p>Here&#8217;s hoping you all have a Happy Holidays and a Happy New Year!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/12/27/the-apache-lucene-ecosystem-my-view-of-2010/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Extending Apache Tika Capabilities</title>
		<link>http://www.lucidimagination.com/blog/2010/06/18/extending-apache-tika-capabilities/</link>
		<comments>http://www.lucidimagination.com/blog/2010/06/18/extending-apache-tika-capabilities/#comments</comments>
		<pubDate>Fri, 18 Jun 2010 08:51:38 +0000</pubDate>
		<dc:creator>Sami Siren</dc:creator>
				<category><![CDATA[Tika]]></category>
		<category><![CDATA[Sami Siren]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=2161</guid>
		<description><![CDATA[<p>Apache Tika is a toolkit for extracting metadata and textual content from various document formats. Tika itself provides implementation for parsing some document formats while it relies on external libraries (such as Apache PDFBox and Apache POI) for parsing many more.</p>
<p>Tika provides a uniform Java API for all of the supported document formats to make life easier for the user.  Additionally, Tika provides functionality for detecting document type and content language.</p>
<p>In my earlier &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Apache Tika is a toolkit for extracting metadata and textual content from various document formats. Tika itself provides implementation for parsing some document formats while it relies on external libraries (such as Apache PDFBox and Apache POI) for parsing many more.</p>
<p>Tika provides a uniform Java API for all of the supported document formats to make life easier for the user.  Additionally, Tika provides functionality for detecting document type and content language.</p>
<p>In my earlier article, <a href="http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika">Content Extraction with Tika</a>, I looked into using Tika either with Solr or in standalone mode. In this post I will go through some of the aspects involved when implementing support for new document formats. I will also provide a couple of example parsers and a full Maven project to get you up to speed quickly.</p>
<p><strong>Extension mechanisms</strong></p>
<p>The basic principle of adding support for more document formats in Tika is very simple. All you need to do is write a Java class that implements the Tika Parser interface and let Tika know about your extension. If you implemented a parser for one of the &gt;900 file formats Tika knows by filename extension, or one of the ~300 formats that Tika can recognize from the leading (magic) bytes of the file content, this is all you need to do.</p>
<p>There are at least three different ways to let Tika know about the new parser. The first way (and the most flexible one) is to wire it up with Java code. If you wanted to use Tika&#8217;s AutoDetectParser, you&#8217;d call the setParsers method on AutoDetectParser with the parsers of your choice, or add entries to the map returned by getParsers. By wiring things up with Java code you could also easily customize the detection logic, just by calling the setDetector method.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> registerParser<span style="color: #009900;">&#40;</span>AutoDetectParser autodetectParser,
      <span style="color: #003399;">Parser</span> parser<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span>MediaType type <span style="color: #339933;">:</span> parser.<span style="color: #006633;">getSupportedTypes</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> ParseContext<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      autodetectParser.<span style="color: #006633;">getParsers</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">put</span><span style="color: #009900;">&#40;</span>type.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, parser<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span></pre></div></div>

<p>The next way to customize the set of available parsers is to use an external XML file and construct a TikaConfig object with that configuration file.</p>
<p>The last of the three methods explained here is to use the new mechanism that was added in version 0.7 of Tika: the standard Java <a href="http://java.sun.com/j2se/1.5.0/docs/guide/jar/jar.html#Service+Provider">ServiceProvider</a> API. In practice this means that as a provider of new parsing functionality you&#8217;d need to list your parser implementation classes in the file META-INF/services/org.apache.tika.parser.Parser (the file is named after the fully qualified name of the Parser interface) inside the .jar that you ship.</p>
<p><code><br />
com.lucid.tika.MyTXTParser<br />
com.lucid.tika.VCardParser<br />
</code></p>
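<p>How such a listing is discovered at runtime can be sketched with the JDK&#8217;s java.util.ServiceLoader. This is a generic illustration of the provider-configuration mechanism, not Tika&#8217;s own loading code; the ProviderLookup class and its DocumentParser interface are hypothetical names made up for the example.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class ProviderLookup {
    /** Hypothetical stand-in for a parser interface; not part of Tika. */
    interface DocumentParser {
        String supportedType();
    }

    /**
     * Returns the class names of every DocumentParser implementation listed in a
     * META-INF/services/ProviderLookup$DocumentParser file on the classpath.
     */
    static List<String> discover() {
        List<String> names = new ArrayList<>();
        for (DocumentParser parser : ServiceLoader.load(DocumentParser.class)) {
            names.add(parser.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // With no provider file on the classpath this prints an empty list.
        System.out.println("Discovered parsers: " + discover());
    }
}
```

<p>The point of the sketch is that the lookup is driven purely by what is on the classpath: dropping in another .jar with its own provider file adds implementations without touching the calling code.</p>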
<p>Extending the capabilities of Tika by using the ServiceProvider API is very straightforward and simple. There are however a couple of details you should pay attention to when using this mechanism.</p>
<p>If you need to replace a Tika-provided parser implementation with your custom implementation, you need to make sure that your .jar file is loaded <em>after</em> the tika-parsers.jar file. This is because in the current implementation the last parser registered for a given MIME type is used to parse content of that MIME type.</p>
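<p>The &#8220;last registration wins&#8221; behaviour can be modelled with a plain map keyed by MIME type. The sketch below only illustrates why load order matters; ParserRegistry and the parser names are hypothetical, and it does not use any Tika classes.</p>

```java
import java.util.HashMap;
import java.util.Map;

class ParserRegistry {
    // Maps a MIME type to the name of the parser currently responsible for it.
    private final Map<String, String> parsersByType = new HashMap<>();

    /** Registers a parser; a later call for the same type replaces the earlier one. */
    void register(String mimeType, String parserName) {
        parsersByType.put(mimeType, parserName);
    }

    /** Returns the parser registered last for the type, or null if none was registered. */
    String parserFor(String mimeType) {
        return parsersByType.get(mimeType);
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        registry.register("text/plain", "built-in text parser"); // loaded first
        registry.register("text/plain", "custom text parser");   // loaded later, replaces it
        System.out.println(registry.parserFor("text/plain"));
    }
}
```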
<p>To support completely new document types (that Tika knows nothing about) you need to customize the detection process of AutoDetectParser manually. This is because there is no mechanism for extending the detection step similar to the one for adding new parsers. One way to do this is to use CompositeDetector to add your &#8220;overlay&#8221; detector and rely on the default Detector for detecting the other types.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> setOverlayDetector<span style="color: #009900;">&#40;</span>Detector overlay, AutoDetectParser parser<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #003399;">ArrayList</span>&lt;Detector&gt; detectors <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">ArrayList</span>&lt;Detector&gt;<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    detectors.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>parser.<span style="color: #006633;">getDetector</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    detectors.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>overlay<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    parser.<span style="color: #006633;">setDetector</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> CompositeDetector<span style="color: #009900;">&#40;</span>detectors<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span></pre></div></div>

<p>In this blog post I have demonstrated some ways to extend Tika&#8217;s parsing and detection capabilities if needed in your custom environment. Process-wise, the best possible way to add new capabilities to Tika is to contribute your new parser integrations and enhancements back to the Tika project. This way the community as a whole will benefit from the results.</p>
<p><strong>Running the provided example</strong></p>
<ul>
<li><a href="http://www.lucidimagination.com/blog/wp-content/uploads/2010/06/tika-post.tar.gz">Download project</a></li>
<li>Compile project<br />
<code>mvn clean install</code></li>
<li>Copy dependencies to directory target/dependency<br />
<code>mvn dependency:copy-dependencies</code></li>
<li>Execute the default TikaGUI with our additions (&#8220;enhanced&#8221; .txt parser, vCard parser)<br />
<code><br />
java -cp target/dependency/tika-app-0.7.jar:target/extending-tika-post-1.0-SNAPSHOT.jar:target/dependency/ical4j-vcard-0.9.2.jar:target/dependency/ical4j-1.0-rc2.jar:target/dependency/commons-codec-1.3.jar:target/dependency/commons-lang-2.4.jar org.apache.tika.gui.TikaGUI<br />
</code></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/06/18/extending-apache-tika-capabilities/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Berlin Buzzwords Recap</title>
		<link>http://www.lucidimagination.com/blog/2010/06/11/berlin-buzzwords-recap/</link>
		<comments>http://www.lucidimagination.com/blog/2010/06/11/berlin-buzzwords-recap/#comments</comments>
		<pubDate>Fri, 11 Jun 2010 13:53:58 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=2157</guid>
		<description><![CDATA[<p>Back from <a href="http://www.berlinbuzzwords.de">Berlin Buzzwords</a> and finally over the jet lag, so I thought I would put up some feedback.  First off, it was a well organized conference with a nice focus on searching, storage and scaling.  Kudos to Isabel, Simon and Jan for all their hard work.  It also had great wi-fi coverage, which is always a struggle at every conference I&#8217;ve ever been to.</p>
<p>As for the talks, I gave the Keynote on using &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Back from <a href="http://www.berlinbuzzwords.de">Berlin Buzzwords</a> and finally over the jet lag, so I thought I would put up some feedback.  First off, it was a well organized conference with a nice focus on searching, storage and scaling.  Kudos to Isabel, Simon and Jan for all their hard work.  It also had great wi-fi coverage, which is always a struggle at every conference I&#8217;ve ever been to.</p>
<p>As for the talks, I gave the Keynote on using open source tools like Apache Solr and Mahout to deliver intelligent applications (<a href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/ingersoll_bbuzz2010.pdf">slides</a>, which really should be a PPT so you can see the animations) on Monday first thing in the morning and I felt it went pretty well, but I&#8217;ll let others be the judge (videos should be online soon).  The rest of the day, I spent going in and out of the various tracks.  The Lucene track was very well done, with good talks by: Uwe Schindler and Simon Willnauer on the State of Lucene, Robert Muir on Finite State queries in Lucene, Michael Busch on Real Time Search at Twitter, Jukka Zitting on Tika and Andrzej Bialecki on Nutch. See <a href="http://berlinbuzzwords.wikidot.com/links-to-slides">Berlinbuzzwords: Links To Slides</a> for all the slides (not all are available just yet).</p>
<p>I also went to a variety of the Hadoop and NoSQL talks.  Lots of people in the NoSQL talks making pitches on why their approach is best, which is very helpful in determining what tool to use at the appropriate time.  I still, however, can&#8217;t shake the feeling that one could take the new <a href="http://wiki.apache.org/solr/SolrCloud">Solr Cloud stuff</a>, a dead simple schema (id and one or two simple fields), and have a large scale distributed key-value storage that overcomes almost all of the limitations of many of the NoSQL technologies (ad-hoc queries, range queries, search within the values, extendability) with minimal overhead of indexing (which can be greatly reduced by using either literals or very simple analysis).  Not only that, Lucene/Solr already is &#8220;document-centric&#8221; and I&#8217;ve seen it scale to billions of documents with high availability and high QPS and that was using &#8220;real&#8221; documents (i.e. articles, etc.), not simple key-value pairs, so I can&#8217;t help but feel like simple key-value pairs would be even faster and more scalable.  In other words, Lucene isn&#8217;t just for text search.  Naturally, this is just a thought at this point, I haven&#8217;t tried testing it just yet. Also, once the new real time stuff is in Lucene, I think it will be even faster.</p>
<p>At any rate, the best thing about the conference was that it showed the eagerness for new solutions to large scale problems that cost less money than the sturdy old database.</p>
<p>Again, congrats to Isabel and team for a well executed conference in a great city and at a great venue.  If you are interested in more on the Lucene portion of the conference, make sure you come visit us in Boston for <a href="http://www.lucenerevolution.com">Lucene Revolution</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/06/11/berlin-buzzwords-recap/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Apache Lucene EuroCon Agenda &#8211; The Revolution is On!</title>
		<link>http://www.lucidimagination.com/blog/2010/04/22/apache-lucene-eurocon-agenda-the-revolution-is-on/</link>
		<comments>http://www.lucidimagination.com/blog/2010/04/22/apache-lucene-eurocon-agenda-the-revolution-is-on/#comments</comments>
		<pubDate>Thu, 22 Apr 2010 11:09:33 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Connector Framework]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[Open Relevance Project]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>
		<category><![CDATA[ZooKeeper]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1965</guid>
		<description><![CDATA[<p>After reviewing a lot of great talk proposals, we&#8217;ve announced the  agenda for Apache Lucene Eurocon: <a href="http://lucene-eurocon.org/agenda.html">Apache Lucene EuroCon &#8211;  Europe&#8217;s Premier Lucene and Solr Search User Conference</a>.</p>
<p>One  of the things I really like about this agenda is it is a great mix of  basics, use cases from all over the search map (CMS, news, social media,  advertising), business decisions (see last list and next list) and advanced topics  (NLP, collab filtering, machine &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>After reviewing a lot of great talk proposals, we&#8217;ve announced the  agenda for Apache Lucene Eurocon: <a href="http://lucene-eurocon.org/agenda.html">Apache Lucene EuroCon &#8211;  Europe&#8217;s Premier Lucene and Solr Search User Conference</a>.</p>
<p>One  of the things I really like about this agenda is that it is a great mix of  basics, use cases from all over the search map (CMS, news, social media,  advertising), business decisions (see last list and next list) and advanced topics  (NLP, collab filtering, machine learning, advanced visualization, multilingual).   Moreover, the content, even though it is centered on Lucene, goes well  beyond just being about Lucene and is really about search, in all of its power and  glory.  It&#8217;s about real users with real needs getting real problems  solved using the Lucene ecosystem.  Oh, and by the way, those users are doing it at scale!  Big scale.</p>
<p>That&#8217;s powerful stuff,  because, in case you hadn&#8217;t noticed (shh, it&#8217;s our little secret) there is a revolution going on in search.  (Funny how that line coincides with Lucid&#8217;s frontman, Eric Gries, giving  a  talk titled &#8220;The Search Revolution&#8221;)</p>
<p>Are you a part of the revolution?  See you in <a href="http://lucene-eurocon.org/index.html">Prague</a> in mid-May.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/04/22/apache-lucene-eurocon-agenda-the-revolution-is-on/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>News Flash: Apache Lucene gives birth to triplets!</title>
		<link>http://www.lucidimagination.com/blog/2010/04/21/news-flash-apache-lucene-gives-birth-to-triplets/</link>
		<comments>http://www.lucidimagination.com/blog/2010/04/21/news-flash-apache-lucene-gives-birth-to-triplets/#comments</comments>
		<pubDate>Wed, 21 Apr 2010 20:25:10 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1961</guid>
		<description><![CDATA[<p><a href="http://lucene.apache.org">Apache Lucene</a> (the Lucene top level project, not Lucene the Java search API.  I know,  it&#8217;s confusing sometimes) has once again proved to be a fertile area for innovation (having already given birth to <a href="http://hadoop.apache.org">Apache Hadoop</a> a few years back), as it once again has given birth, this time to three new <a href="http://www.lucidimagination.com/search/document/d833ce805528045b/tlp_status">Apache Top Level Projects</a> (just approved by the Board at Apache): <a href="http://lucene.apache.org/mahout">Apache Mahout</a>, <a href="http://lucene.apache.org/nutch">Apache Nutch</a> and <a href="http://lucene.apache.org/tika">Apache Tika</a> (never mind the URLs, &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p><a href="http://lucene.apache.org">Apache Lucene</a> (the Lucene top level project, not Lucene the Java search API.  I know,  it&#8217;s confusing sometimes) has once again proved to be a fertile area for innovation (having already given birth to <a href="http://hadoop.apache.org">Apache Hadoop</a> a few years back), as it once again has given birth, this time to three new <a href="http://www.lucidimagination.com/search/document/d833ce805528045b/tlp_status">Apache Top Level Projects</a> (just approved by the Board at Apache): <a href="http://lucene.apache.org/mahout">Apache Mahout</a>, <a href="http://lucene.apache.org/nutch">Apache Nutch</a> and <a href="http://lucene.apache.org/tika">Apache Tika</a> (never mind the URLs, they will be changing soon).  While none of these projects look alike, they all have a strong foundation built in the Lucene community.  Combine this with the <a href="http://www.lucidimagination.com/blog/2010/03/26/lucene-and-solr-development-have-merged/">recent merge</a> of Lucene and Solr development lists (more on this later) and Lucene has been busy; and that doesn&#8217;t even mention all the really cool stuff baking in the source tree right now (spatial, flexible indexing/scoring, some new analyzers and a variety of other cool things &#8212; see Lucene&#8217;s <a href="https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt">CHANGES</a> and Solr&#8217;s <a href="https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/CHANGES.txt">CHANGES</a>).</p>
<p>In the end, though, what does all this mean for the users of the Lucene ecosystem?  On one hand, some of the move is just shuffling around of domain names, mailing lists and SVN source trees, but on the other hand, the moves are symbolic and represent a project reaching a level of maturity and self determination, not to mention critical mass and brand awareness.  Thus, in my mind, all of these moves are good things for Lucene as well as the associated projects that are spinning out.  As far as the actual code, I think users will still see the same high quality contributions and products coming out of Apache (aside: <a href="http://www.lucidimagination.com">Lucid Imagination</a> will still be business as usual in regards to these moves) as well as much more focus within the Project Management Committee (PMC) on the specific project.</p>
<p>Which brings me to a bit more on my view of the merge of Lucene and Solr.  I think we are already seeing the fruits of the merge for both Lucene and Solr (I know my open source life is easier already).  For instance, much of the analyzer code from Solr and Lucene is going to be combined to provide a single coherent analyzer library.  This is great news for people who have been using Lucene and pulling in Solr analyzers and is good for Solr users because it now has many more people keeping an eye on Solr&#8217;s analyzers as well as new Lucene analyzers showing up sooner (things like the WordDelimiterFilter, etc.)  Another example is the spatial work that we&#8217;ve been working pretty heavily on (see <a href="https://issues.apache.org/jira/browse/SOLR-773">SOLR-773</a>, <a href="https://issues.apache.org/jira/browse/SOLR-1568">SOLR-1568</a> and <a href="https://issues.apache.org/jira/browse/LUCENE-2350">LUCENE-2350</a>).  With the combination of the two development projects, it is now much easier for us to make sure there is a single, integrated way of delivering spatial search across both the Java API and the Solr REST-like API.</p>
<p>Moreover, in the short run, existing Lucene and Solr users should notice no difference in terms of the products, user communities and the like.  In the long run, it should make for less repeated code, faster integration, more test coverage and a larger, cohesive development team as well as more of Solr&#8217;s capabilities available in pure library form as well as many of Lucene&#8217;s cutting edge capabilities appearing sooner in Solr (flexible indexing and scoring, etc.)</p>
<p>Wrapping up, congrats to Lucene and all of the new top level projects!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/04/21/news-flash-apache-lucene-gives-birth-to-triplets/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Apache Lucene Connector Framework now in Incubation at the ASF</title>
		<link>http://www.lucidimagination.com/blog/2010/01/20/apache-lucene-connector-framework-now-in-incubation-at-the-asf/</link>
		<comments>http://www.lucidimagination.com/blog/2010/01/20/apache-lucene-connector-framework-now-in-incubation-at-the-asf/#comments</comments>
		<pubDate>Wed, 20 Jan 2010 20:16:13 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Connector Framework]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[PyLucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1509</guid>
		<description><![CDATA[<h1>Short Version</h1>
<p>The Apache Lucene Connector Framework project has officially entered incubation.  LCF, for short, is going to be a framework for connecting to content repositories like Sharepoint, Documentum, etc. and will make it easy to hook into Lucene, Solr, Nutch, Mahout, Tika, while, of course, remaining agnostic of the final destination of the data.  See the <a href="http://incubator.apache.org/connectors/">Connectors website</a> and the <a href="http://wiki.apache.org/incubator/LuceneConnectorFrameworkProposal">original proposal</a> for more info.  Help wanted!</p>
<h1>Long Version</h1>
<h2>Background</h2>
<p>A while back, <a href="http://www.metacarta.com">MetaCarta</a>&#8230;</p>]]></description>
			<content:encoded><![CDATA[<h1>Short Version</h1>
<p>The Apache Lucene Connector Framework project has officially entered incubation.  LCF, for short, is going to be a framework for connecting to content repositories like Sharepoint, Documentum, etc. and will make it easy to hook into Lucene, Solr, Nutch, Mahout, Tika, while, of course, remaining agnostic of the final destination of the data.  See the <a href="http://incubator.apache.org/connectors/">Connectors website</a> and the <a href="http://wiki.apache.org/incubator/LuceneConnectorFrameworkProposal">original proposal</a> for more info.  Help wanted!</p>
<h1>Long Version</h1>
<h2>Background</h2>
<p>A while back, <a href="http://www.metacarta.com">MetaCarta</a>, a spatial search company, approached us about open sourcing their internally developed Connector Framework at the <a href="http://www.apache.org">Apache Software Foundation</a>.  After several discussions and a whole bunch of legwork getting a proposal together, the LCF is now officially launched in the <a href="http://incubator.apache.org/">Apache Incubator</a>!  We&#8217;ve already got a great roster of committers lined up and are working to incorporate the software grant from MetaCarta, from which we can build out a first release, so stay tuned!  Lucid Imagination, of course, is a big supporter of this project and we look forward to its success!</p>
<h2>What is a Connector Framework?</h2>
<p>To quote the proposal:</p>
<blockquote><p>[The Lucene] Connector Framework is an extendible [sic] incremental crawler, which uses a database to manage configuration and crawl history, and provides reasonably high performance in accessing content in multiple repositories for the main purpose of search engine indexing. Connector Framework also establishes a repository-specific security model which can be used to limit search user access to repository content based on a user&#8217;s identity. Connector Framework also includes existing connectors and authorities for:</p>
<ul>
<li>File system</li>
<li>Windows shares</li>
<li>JDBC-supported databases</li>
<li>RSS feeds</li>
<li>General websites</li>
<li>LiveLink [from OpenText]</li>
<li>Documentum [from EMC]</li>
<li>SharePoint [from Microsoft]</li>
<li>Meridio [from Meridio]</li>
<li>Memex [from Memex]</li>
<li>FileNet [from IBM]</li>
</ul>
</blockquote>
<p>There are two pieces in particular to highlight in the quote.  First of all, it&#8217;s an extensible framework, meaning new connectors can be added without the need for application developers writing &#8220;one-off&#8221; code just for that connector.  For anyone who&#8217;s lived that pain, you know first hand what I mean.  In fact, I&#8217;ve already heard from others who are thinking of contributing their connectors for other data stores as well!  Second, the framework accounts for repository specific security.  In corporate environments, this is vital to making sure that the right people, and only the right people, have access to the right information at the right time.</p>
<h2>Why is this important?</h2>
<p>Many, many search engines, not to mention many other applications, have either rolled their own connectors or bought a company that provides them.  Connectors, in some situations, are the cost of entry into  certain markets, but are rarely the feature that seals the deal.  By making these open source, we can all share the cost of maintaining them while increasing the quality of a piece of software well beyond what any one company can achieve.  Beyond that, we hope the repository companies will also step up and contribute (some are already quite open), as making it easier to access these repositories will no doubt lead to more applications, which of course should mean more sales for said companies.</p>
<h2>How can you contribute?</h2>
<p>For starters, subscribe to the <a href="http://incubator.apache.org/connectors/mail.html">mailing lists</a>.  Then check out the <a href="http://cwiki.apache.org/confluence/display/CONNECTORS/HowToContribute">How To Contribute page</a> on the Wiki.  Beyond that, chip in with your connector use cases on the mailing lists and be a part of the community.</p>
<h2>What&#8217;s next?</h2>
<p>First off, the community will have to process the software grant from MetaCarta and then commit the code to LCF&#8217;s Subversion <a href="https://svn.apache.org/repos/asf/incubator/lcf">repository</a>.  From there, we&#8217;ll do just like any Apache project does and look to build out not only the code, but also the community, all on the path to graduating from the Incubator and taking our place as a full-fledged Lucene subproject.  Keep your eyes here and on the mailing lists and websites for more information in the future!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/01/20/apache-lucene-connector-framework-now-in-incubation-at-the-asf/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>The Apache Lucene Ecosystem: My view of 2009</title>
		<link>http://www.lucidimagination.com/blog/2009/12/24/the-apache-lucene-ecosystem-my-view-of-2009/</link>
		<comments>http://www.lucidimagination.com/blog/2009/12/24/the-apache-lucene-ecosystem-my-view-of-2009/#comments</comments>
		<pubDate>Thu, 24 Dec 2009 15:53:02 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Droids]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucy]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[Open Relevance Project]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[PyLucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>
		<category><![CDATA[ZooKeeper]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1429</guid>
		<description><![CDATA[<p>It&#8217;s that time of year, so I thought I would take a look back at the year that was for the <a href="http://lucene.apache.org">Lucene Ecosystem</a> and maybe look ahead just a little bit too.</p>
<p>First and foremost, it should be obvious to even the most casual observer that the Apache Lucene communities are thriving.  Not only is it a great time to be involved in open source, it&#8217;s a great time to be involved in Lucene.  Both &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s that time of year, so I thought I would take a look back at the year that was for the <a href="http://lucene.apache.org">Lucene Ecosystem</a> and maybe look ahead just a little bit too.</p>
<p>First and foremost, it should be obvious to even the most casual observer that the Apache Lucene communities are thriving.  Not only is it a great time to be involved in open source, it&#8217;s a great time to be involved in Lucene.  Both as a committer and as an employee of Lucid Imagination, I&#8217;m continuously amazed at the vibe produced by the people using the Lucene suite of libraries, tools and applications.  People are routinely solving both large scale and really hard problems using the Lucene ecosystem and they are doing it on time and on budget.  For instance, this year alone, I&#8217;ve seen companies and individuals using Lucene and Solr to provide search in production environments with document counts ranging from the few tens of thousands all the way up to 5-10 billion plus and query rates that barely register a blip to 1000+ QPS.  I&#8217;ve also seen many people using Lucene to power recommendation engines, content management systems, machine learning/NLP applications and log analysis tools.</p>
<p>Much to my initial surprise, the number one reason I hear for why they chose Lucene: flexibility. (I thought it would be the fact that they are free to use, but that is just icing on the proverbial cake, I guess)  Namely, Lucene gives them the flexibility to build what they want or simply to use it out of the box.  It gives them the flexibility to bring in other tools from other open source projects or other commercial vendors, all without compromising speed or scale.</p>
<p>With that in mind, I thought I would give some highlights of both the top level project (TLP &#8212; http://lucene.apache.org &#8212; the ASF project that &#8220;houses&#8221; all of the Lucene related subprojects) that is Lucene as well as the individual projects.  (I&#8217;m not involved in all of them, so please correct me if I&#8217;m wrong!)</p>
<h1>Lucene TLP</h1>
<p>Whew!  It&#8217;s been a busy year for the Lucene TLP.  We started the <a href="http://lucene.apache.org/openrelevance">Open Relevance Project</a> (ORP), added <a href="http://lucene.apache.org/pylucene">PyLucene</a> (a Python port of Lucene) and successfully graduated a .NET version of Lucene from ASF incubation, not to mention the fact that the Lucene PMC is responsible for overseeing the release of all the various bits and bytes for each and every subproject (which is a lot of releases!)  We also, for the first time ever, organized two days of Lucene related talks at <a href="http://www.us.apachecon.org">ApacheCon US</a> plus two days of training and meetups. (In the past, organization was always handled by the ASF Conference Committee).</p>
<p>In looking ahead for the TLP, I see a continued focus on providing quality software across all the projects.  Additionally, keep your ears open, as there is a new sub project brewing that I think will really make it even easier for people to deploy Lucene based solutions.  Finally, just as Lucene gave birth to Apache Hadoop and is happy to see it doing so well, there is <a href="http://www.lucidimagination.com/search/document/5a41be454d503779/possible_contribution_at_somewhat_of_a_tangent_to_mahout">growing talk</a> that Lucene will look to see Apache Mahout off as its own TLP.  Of course, none of that is set in stone yet!</p>
<p>For those looking for more on the big picture that is Lucene, see my <a href="http://www.us.apachecon.com/c/acus2009/sessions/428">talk</a> at ApacheCon US for more details on the ecosystem.  Not sure why the slides aren&#8217;t there, so I put them <a href="http://people.apache.org/~gsingers/apacheconUS09/luceneEcosystem.pptx">here</a>.</p>
<h1>Lucene Java</h1>
<p><a href="http://lucene.apache.org/java">Lucene Java</a> (i.e. what everyone knows as &#8220;Lucene&#8221;) continues to not only provide a rock solid indexing and search API, it continues to push forward with new capabilities.  In 2009, Lucene did 4 releases (2.4.1, 2.9.0, 2.9.1 and 3.0.0).  2.9.0 was probably the most interesting, as it significantly improved performance in a number of areas, while 3.0.0 removed all of the deprecated APIs and finally, officially, dropped support for Java JDK 1.4.  I&#8217;ll leave it to the reader to go look up all the features and changes as they are numerous.</p>
<p>Looking ahead, the phrase of the year appears to be &#8220;flexible indexing&#8221;.  Flex Indexing looks to make it even easier for people to custom craft what is in their index, whether that is rich token attributes (aka &#8220;typed&#8221; payloads), alternative scoring models (like Okapi BM25) or a bare bones index designed for blazing fast speed.</p>
<h1>Solr</h1>
<p>With Lucene as the engine, <a href="http://lucene.apache.org/solr">Solr</a> has evolved into quite the car.   Building on all of the goodness that is Lucene, Solr, in 2009, released version 1.4 with a whole slew of new features, faster implementations and bug fixes.  Highlights for 1.4 include: improved filtering and faceting performance, support for clustering, rich document indexing via Apache Tika, multi-select faceting (see Lucid&#8217;s very own <a href="http://search.lucidimagination.com">search.lucidimagination.com</a> for a demo), many new Query capabilities and a whole bevy of new Components (Terms, Term Vectors, Auto-suggest, deduplication and Statistics on result sets) that truly make Solr an incredible search platform.</p>
<p>Looking ahead, Solr 1.5 (2.0?) is already in the works and looks to have even more <a href="http://www.lucidimagination.com/blog/2009/12/12/apache-solr-1-5-on-the-move-with-more-functionality/">functionality</a>.  For instance, a lot of work is underway to integrate Apache ZooKeeper and other distributed capabilities, which will help make deploying Solr at scale even easier.  Meanwhile, many are hard at work adding &#8220;field collapsing&#8221; (search result grouping/deduplication) and spatial (local/geo) search.</p>
<h1>Mahout</h1>
<p>It&#8217;s been a very exciting year (in my completely biased opinion!) for <a href="http://lucene.apache.org/mahout">Mahout</a>, the scalable machine learning project under Lucene.  In 2009, Mahout shepherded through its very first release (0.1) built on the strength of a few dedicated volunteers working to add capabilities for clustering, categorization and collaborative filtering.  Next came 0.2 with many new features (frequent patternset mining, Latent Dirichlet Allocation, Random Decision Forests, new recommendation capabilities), API and performance improvements and a growing list of people who stopped lurking and stepped up to help out.  Towards the end of the year, Mahout is already reaching a list volume that I find difficult to keep up with if I miss a day or two.  For starters, we have taken on the task of integrating/transforming the <a href="http://acs.lbl.gov/~hoschek/colt/">Colt</a> matrix library for our needs.  We are also working on adding truly large scale recommendation capabilities plus adding in a Support Vector Machine implementation and Logistic Regression.  Not only that, but the mahout-user@lucene.apache.org mailing list continues to be a valuable resource for people seeking practical advice on deploying machine learning in production environments regardless of the choice of Mahout or not.</p>
<p>In 2010, I suspect Mahout will become its own TLP, with several sub projects roughly divided as: core/utilities, recommendations (Taste) and NLP.  Of course, until it happens, this is just speculation.  I also think Mahout will look to finalize its APIs for a 1.0 release.</p>
<h1>Nutch</h1>
<p>In 2009, <a href="http://lucene.apache.org/nutch">Apache Nutch</a> released the long awaited version 1.0.  This release contained many new indexing and scoring capabilities, as well as integration with Solr.  The community continues to be focused on providing large scale crawling and search capabilities by leveraging Apache Hadoop and Lucene/Solr.  Currently, the community is actively looking at modularizing Nutch to allow it to more easily plug in other ecosystem components like Tika and Solr while focusing on the primary task of obtaining and managing content via crawling.</p>
<h1>Tika</h1>
<p><a href="http://lucene.apache.org/tika/">Apache Tika</a> is a content extraction framework for &#8220;rich&#8221; documents like Adobe PDF and Microsoft Office.  In 2009, Tika released versions 0.3, 0.4 and 0.5, all with incremental improvements designed to make it more stable and easier to use.  Each release also seemed to carry with it a new list of supported file formats as more and more people join the project to lend a hand.</p>
<p>Coming up, I suspect Tika will look to finalize a 1.0 release at some point in 2010 as well as focus in on standardizing, if such a thing is possible, the metadata artifacts produced by Tika.</p>
<h1>Open Relevance Project</h1>
<p>The <a href="http://lucene.apache.org/openrelevance">ORP</a> is a project that has been in my brain for several years now and finally got off the ground in 2009.  The goal of ORP is to provide corpora, queries, judgments and other tools to help search and machine learning projects discuss relevance in a completely open way.  While the project is really young, it is slowly but surely building up steam by adding some basic tools and collections thanks to the hard work of several individuals.  In 2010, look for ORP to build out a more complete toolset while attracting more users and contributors.  It will also be vital for the ORP to create its own versioned corpora for download (free!) so that all experiments can be reliably reproduced.</p>
<h1>Droids</h1>
<p><a href="http://incubator.apache.org/droids">Droids</a> is a standalone crawler framework currently in incubation at the ASF.  Development was active in 2009, but has not yet had a release.  For now, it is a Spring based framework that allows one to quickly build out agents that can go and crawl and process content.</p>
<h1>Lucene.NET</h1>
<p>In 2009, <a href="http://incubator.apache.org/lucene.net/">Lucene.NET</a> graduated (some infrastructure changes still need to happen) from ASF incubation and became a full-fledged member of the Lucene ecosystem.  While I&#8217;m not closely involved with Lucene.NET, the community continues to provide value to those looking for a solid search library in .NET.  Since the project is mostly autogenerated from the Java sources, the .NET version has tracked the Lucene Java releases fairly closely.</p>
<p>Looking forward, I expect the .NET version will strive to maintain a lockstep march with Lucene releases.</p>
<h1>PyLucene</h1>
<p>Similar to .NET, <a href="http://lucene.apache.org/pylucene">PyLucene</a> produces a Python port of Lucene Java.  In 2009, PyLucene was formally welcomed into the Lucene fold via a software donation by Andi Vajda.  It continues to produce releases of PyLucene in lockstep with Lucene Java.</p>
<h1>Lucy</h1>
<p><a href="http://lucene.apache.org/lucy">Lucy</a> is a &#8220;loose&#8221; &#8216;C&#8217; port of Lucene.  Lucy finally got off the ground in 2009 and is steadily working on building out a core search library that provides fast search capabilities for languages like Perl, C and Ruby.</p>
<p>For 2010, look for Lucy to continue to grow its community while adding capabilities.</p>
<h1>Moving Forward</h1>
<p>While the past is, of course, no prediction of the future, I think it&#8217;s safe to say Lucene is looking to continue to provide significant capabilities and value to both well established and new communities alike.  With open source, you never know where the next good idea is coming from, so make sure to stay tuned both here and on the mailing lists for more insight and more cool new capabilities.</p>
<p>Happy Holidays and here&#8217;s to an Open Source 2010!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/12/24/the-apache-lucene-ecosystem-my-view-of-2009/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>SF Bay Area Meetup Slides Available</title>
		<link>http://www.lucidimagination.com/blog/2009/06/05/sf-bay-area-meetup-slides-available/</link>
		<comments>http://www.lucidimagination.com/blog/2009/06/05/sf-bay-area-meetup-slides-available/#comments</comments>
		<pubDate>Fri, 05 Jun 2009 15:01:44 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Droids]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=733</guid>
		<description><![CDATA[<p>Slides from the first Lucene/Solr SF Bay Area meetup are now available <a href="http://www.meetup.com/SFBay-Lucene-Solr-Meetup/files/">here</a>.</p>
<p>Thanks to everyone who participated.&#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Slides from the first Lucene/Solr SF Bay Area meetup are now available <a href="http://www.meetup.com/SFBay-Lucene-Solr-Meetup/files/">here</a>.</p>
<p>Thanks to everyone who participated.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/06/05/sf-bay-area-meetup-slides-available/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>ApacheCon Europe Follow Up</title>
		<link>http://www.lucidimagination.com/blog/2009/04/01/apachecon-europe-follow-up/</link>
		<comments>http://www.lucidimagination.com/blog/2009/04/01/apachecon-europe-follow-up/#comments</comments>
		<pubDate>Wed, 01 Apr 2009 11:16:08 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Droids]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=601</guid>
		<description><![CDATA[<p>Another year, another successful <a href="http://www.eu.apachecon.com/">ApacheCon Europe</a>, at least as far as Lucene, Solr and I are concerned.  This year, like last, Erik Hatcher and I had trainings on Lucene and Solr.  Both were well attended, despite the economy, showing once again the power of open source and the fact that people are still invested in search.  (If you missed the training, see <a href="http://www.lucidimagination.com/How-We-Can-Help/Training">here</a> for alternatives.)</p>
<p>During the conference, there were several talks on Lucene, &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Another year, another successful <a href="http://www.eu.apachecon.com/">ApacheCon Europe</a>, at least as far as Lucene, Solr and I are concerned.  This year, like last, Erik Hatcher and I had trainings on Lucene and Solr.  Both were well attended, despite the economy, showing once again the power of open source and the fact that people are still invested in search.  (If you missed the training, see <a href="http://www.lucidimagination.com/How-We-Can-Help/Training">here</a> for alternatives.)</p>
<p>During the conference, there were several talks on Lucene, Solr,  Mahout and Droids.  Slides are available at:</p>
<ul>
<li><a href="http://www.eu.apachecon.com/c/aceu2009/sessions/136">Introducing Mahout</a></li>
<li><a href="http://www.eu.apachecon.com/c/aceu2009/sessions/137">Lucene/Solr Case Studies</a></li>
<li><a href="http://www.eu.apachecon.com/c/aceu2009/sessions/138">Advanced Indexing</a> (slides are missing, but should be up sometime soon)</li>
<li><a href="http://www.eu.apachecon.com/c/aceu2009/sessions/165">Apache Droids</a></li>
<li><a href="http://www.eu.apachecon.com/c/aceu2009/sessions/250">Best of Breed: HTTP Server, Forrest, Solr and Droids</a></li>
<li><a href="http://www.eu.apachecon.com/c/aceu2009/sessions/251">Apache Solr: A Case Study</a></li>
</ul>
<p>Additionally, for the first time,  we had a <a href="http://wiki.apache.org/lucene-java/LuceneMeetupMarch2009">Lucene Meetup</a> (sponsored by Lucid).  I&#8217;d estimate there were around 60 people there and we had some good discussions on Tika, Lucene, Solr and Mahout.    Also, Uwe Schindler presented his new TrieRange Query capabilities.  Slides are available <a href="http://www.thetaphi.de/share/Schindler-TrieRange.ppt">here</a>.</p>
<p>Finally, my favorite part of the conference is always the individual conversations with people using the Lucene ecosystem to solve their problems.  Each year, it seems, people have more and more new ideas about how to use Lucene and Solr, many of which go beyond &#8220;traditional&#8221; search.  Over the coming months, I think you will see more and more of Lucid highlighting all Lucene ecosystem users through our <a href="http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos">Podcasts</a>, <a href="http://www.lucidimagination.com/Community/Marketplace/Application-Showcase-Wiki">Wiki Showcase</a> and other features coming soon.  So, if you think you&#8217;ve got something cool using the Lucene ecosystem, add a comment below or drop us a line at feedback@lucidimagination.com</p>
<p>UPDATE: 4/9/09:  Uri has sent me his slides and they can be downloaded at: <a href="http://www.lucidimagination.com/blog/wp-content/uploads/2009/04/apache-conference-2009.pdf">Apache Solr Case Study</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/04/01/apachecon-europe-follow-up/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

