<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; Lucene</title>
	<atom:link href="http://www.lucidimagination.com/blog/tag/lucene/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Estimating Memory and Storage for Lucene/Solr</title>
		<link>http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/</link>
		<comments>http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/#comments</comments>
		<pubDate>Wed, 14 Sep 2011 17:27:00 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[disk usage]]></category>
		<category><![CDATA[Grant Ingersoll]]></category>
		<category><![CDATA[memory]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4000</guid>
		<description><![CDATA[<p>Many times, clients ask us to help them estimate memory usage or disk space usage or to share benchmarks as they build out their search system. Doing so is always an interesting process, as I&#8217;ve always been wary of claims about benchmarks (for instance, one of the old tricks of performance benchmark hacking is to &#8220;cat XXX &#62; /dev/null&#8221; to load everything into memory first, which isn&#8217;t what most people do when running their system) &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Many times, clients ask us to help them estimate memory usage or disk space usage or to share benchmarks as they build out their search system. Doing so is always an interesting process, as I&#8217;ve always been wary of claims about benchmarks (for instance, one of the old tricks of performance benchmark hacking is to &#8220;cat XXX &gt; /dev/null&#8221; to load everything into memory first, which isn&#8217;t what most people do when running their system) and of estimates, because I know there are so many variables involved that it is possible to vary the results quite significantly depending on marketing goals. Thus, I tend to be pragmatic (as I think the Lucene/Solr community is as well) and focus on what my tests show for my specific data and my specific use cases.</p>
<p>For instance, for testing memory, it&#8217;s pretty easy to set up a series of tests that start with a small heap size and successively grow it until no Out Of Memory Errors (OOME) occur. Then, to be on the safe side, add 1 GB of memory to the heap.  It works for the large majority of people. Ironically, for Solr at least, this usually ends up with a heap size somewhere between 6-12 GB for a system doing &#8220;consumer search&#8221; with faceting, etc. and reasonably sized caches on an index in the 10-50 million docs range. Sure, there are systems that go beyond this or are significantly less (I just saw one the other day that has around 200M docs in less than 3 GB of RAM while handling decent search load), but the 6-12 GB seems to be a nice sweet spot for the application and the JVM, especially when it comes to garbage collection, while still giving the operating system enough room to do its job.  Too much heap and garbage may pile up and give you <em>ohmygodstoptheworld</em> full garbage collections at some point in the future.  Too little heap and you get the dreaded OOME.  Also, too much heap relative to total RAM and you choke off the OS.  Besides, that range also has a nice business/operations side effect in that 16 GB of RAM has a nice performance/cost benefit for many people.</p>
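<p>As a rough illustration of the grow-the-heap procedure described above, here is a minimal Python sketch. The names pick_heap_mb, survives and fake_load_test are hypothetical and not part of Lucene or Solr; in a real test, survives would have to start the JVM with the given heap size and drive a representative load, which is not shown here.</p>

```python
def pick_heap_mb(candidates_mb, survives, safety_margin_mb=1024):
    """Return the smallest candidate heap that survives the test, plus a margin.

    candidates_mb: heap sizes in MB to try.
    survives: hypothetical callable that runs your own load test at the given
              heap size and returns True if no OutOfMemoryError occurred.
    safety_margin_mb: extra heap added "to be on the safe side" (1 GB here).
    """
    for heap_mb in sorted(candidates_mb):
        if survives(heap_mb):
            return heap_mb + safety_margin_mb
    raise ValueError("no candidate heap size survived the test")


# Stand-in for a real load test: pretend the workload needs about 5 GB.
def fake_load_test(heap_mb):
    return heap_mb >= 5 * 1024


print(pick_heap_mb([2048, 4096, 6144, 8192], fake_load_test))  # prints 7168
```

<p>The stand-in test only exists to make the sketch runnable; the interesting work is in whatever load test replaces it.</p>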
<p>Recently, however, I thought it would be good to get beyond the inherent hand waving above and attempt to come up with a theoretical (with a little bit of empiricism thrown in) model for estimating memory usage and disk space.   After a few discussions on <a href="http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2011-09-13 for assumptions">IRC</a> with McCandless and others, I put together a <span style="text-decoration: underline;"><strong>DRAFT</strong></span> <a href="http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls">Excel spreadsheet</a> that allows people to model both memory and disk space (based on the formula in <a href="http://www.lucidimagination.com/devzone/references/books-and-publications">Lucene in Action 2nd ed.</a> &#8211; LIA2), after filling in some assumptions about their applications (I put in defaults.)   First a few caveats:</p>
<ol>
<li>This is just an estimate; don&#8217;t mistake it for what you are actually seeing in your system.</li>
<li>It is a DRAFT.  It is likely missing a few things, but I am putting it up here and in Subversion as a means to gather feedback.  I reserve the right to have messed up the calculations.</li>
<li>I feel the values might be a little bit low for the memory estimator, especially the Lucene section.</li>
<li>It&#8217;s only good for <a href="http://svn.apache.org/repos/asf/lucene/dev/trunk">trunk</a>.  I don&#8217;t think it will be right for 3.3 or 3.4.</li>
<li>The goal is to try to establish a model for the &#8220;high water mark&#8221; of memory and disk, not necessarily the typical case.</li>
<li>It inherently assumes you are searching and indexing on the same machine, which is often not the case.</li>
<li>There are still a couple of TODOs in the model.  More to come later.</li>
</ol>
<p>As for using the memory estimator, the primary things to fill in are the number of documents, number of unique terms and information on sorting and indexed fields, but you can also mess with all of the other assumptions.  For Solr, there are entries for estimating cache memory usage.  Keep in mind that the assumption is that the caches are full, which often is not the case and not even feasible.  For instance, your system may only ever employ 5 or 6 different filters.</p>
<p>The disk space estimator is much more straightforward and based on LIA2&#8217;s fairly simple formula of:</p>
<blockquote>
<div>disk space used(original) = 1/3 original for each indexed field + 1 * original for stored + 2 * original per field with term vectors</div>
</blockquote>
<p>&nbsp;</p>
<p>It will be interesting to see how some of the new flexible indexing capabilities in trunk affect the results of this equation.  Also note, I&#8217;ve seen some applications where the size of the indexed fields is as low as 20% of the original.</p>
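<p>To make the disk space formula concrete, here is a small Python sketch of one way to read it: each field contributes 1/3 of its original size if it is indexed, 1x if it is stored, and 2x if it has term vectors. The function name, the tuple layout and the field sizes are made up for illustration and are not taken from the spreadsheet.</p>

```python
def estimate_index_size(fields):
    """Estimate index size, in the same unit as the original field sizes.

    fields: list of (original_size, indexed, stored, term_vectors) tuples.
    """
    total = 0.0
    for original_size, indexed, stored, term_vectors in fields:
        if indexed:
            total += original_size / 3   # 1/3 of original per indexed field
        if stored:
            total += original_size       # 1 x original for stored content
        if term_vectors:
            total += 2 * original_size   # 2 x original per field with term vectors
    return total


# Made-up example, sizes in MB: a 900 MB body field that is indexed and stored,
# and a 90 MB title field that is indexed, stored and has term vectors.
fields = [
    (900, True, True, False),
    (90, True, True, True),
]
print(estimate_index_size(fields))  # prints 1500.0
```

<p>As with the spreadsheet, treat the result as a high water mark estimate, not a measurement.</p>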
<p>Hopefully, people will find this useful as well as enhance it and <a href="https://issues.apache.org/jira/browse/LUCENE-3435">fix any bugs</a> in it.  In other words, feedback is welcome.  As with any model like this, YMMV!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Charlottesville, VA meetup</title>
		<link>http://www.lucidimagination.com/blog/2011/08/09/charlottesville-va-meetup/</link>
		<comments>http://www.lucidimagination.com/blog/2011/08/09/charlottesville-va-meetup/#comments</comments>
		<pubDate>Tue, 09 Aug 2011 14:25:42 +0000</pubDate>
		<dc:creator>Erik Hatcher</dc:creator>
				<category><![CDATA[Events]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[charlottesville]]></category>
		<category><![CDATA[Erik Hatcher]]></category>
		<category><![CDATA[virginia]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3819</guid>
		<description><![CDATA[[ Monday, 15 August 2011; 18:00 to 21:00. ] <p>If you&#8217;re in central VA, or even in the northern VA / DC area, come join us for the inaugural <a href="http://www.meetup.com/Charlottesville-Apache-Lucene-Solr-Meetup/events/25877811/">&#8220;Charlottesville Solr and Lucene Meetup&#8221;</a>.  Charlottesville is home to the co-authors of Manning&#8217;s &#8220;Lucene in Action&#8221; and Packt&#8217;s &#8220;Solr 1.4 Enterprise Search Server&#8221; books.  This area is a hotbed of search activity thanks to NGIC and DIA calling Charlottesville home, and the many &#8230;</p>]]></description>
			<content:encoded><![CDATA[[ Monday, 15 August 2011; 18:00 to 21:00. ] <p>If you&#8217;re in central VA, or even in the northern VA / DC area, come join us for the inaugural <a href="http://www.meetup.com/Charlottesville-Apache-Lucene-Solr-Meetup/events/25877811/">&#8220;Charlottesville Solr and Lucene Meetup&#8221;</a>.  Charlottesville is home to the co-authors of Manning&#8217;s &#8220;Lucene in Action&#8221; and Packt&#8217;s &#8220;Solr 1.4 Enterprise Search Server&#8221; books.  This area is a hotbed of search activity thanks to NGIC and DIA calling Charlottesville home, and the many gov&#8217;t subcontractors supporting them here.  We are also home to hotelicopter, OpenQ, the University of Virginia, and many other organizations that use Lucene and Solr.</p>
<p>Looking forward to seeing you there.   Everyone in attendance will not only get to hear about the latest and greatest enhancements to these technologies, but may also walk away with a cool Lucid t-shirt!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/08/09/charlottesville-va-meetup/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Überconf &#8211; No Fluff, Just Solr</title>
		<link>http://www.lucidimagination.com/blog/2011/07/19/uberconf-no-fluff-just-solr/</link>
		<comments>http://www.lucidimagination.com/blog/2011/07/19/uberconf-no-fluff-just-solr/#comments</comments>
		<pubDate>Tue, 19 Jul 2011 19:33:09 +0000</pubDate>
		<dc:creator>Erik Hatcher</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Erik Hatcher]]></category>
		<category><![CDATA[uberconf]]></category>
		<category><![CDATA[uberconf11]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3784</guid>
		<description><![CDATA[[ Tuesday, 12 July 2011 to Friday, 15 July 2011. ] <p>I had the honor and pleasure of being invited to speak at <a href="http://uberconf.com/conference/denver/2011/07/home">Überconf</a> last week in the Denver, CO area.  <img class="alignright" src="http://www.nofluffjuststuff.com/images/2011/uber/UberConf_125x125_v2-01.png" alt="Überconf" /> The annual conference is organized by Jay Zimmerman of No Fluff, Just Stuff fame.  Überconf  has the same top-notch quality, at a grander scale &#8211; 10 concurrent tracks (woah!), full day pre-conference trainings (<a href="http://uberconf.com/conference/denver/2011/07/mobile_workshops">mobile, anyone?</a>), food (full breakfast!  that&#8217;s a REAL hearty &#8230;</p>]]></description>
			<content:encoded><![CDATA[[ Tuesday, 12 July 2011 to Friday, 15 July 2011. ] <p>I had the honor and pleasure of being invited to speak at <a href="http://uberconf.com/conference/denver/2011/07/home">Überconf</a> last week in the Denver, CO area.  <img class="alignright" src="http://www.nofluffjuststuff.com/images/2011/uber/UberConf_125x125_v2-01.png" alt="Überconf" /> The annual conference is organized by Jay Zimmerman of No Fluff, Just Stuff fame.  Überconf  has the same top-notch quality, at a grander scale &#8211; 10 concurrent tracks (woah!), full day pre-conference trainings (<a href="http://uberconf.com/conference/denver/2011/07/mobile_workshops">mobile, anyone?</a>), food (full breakfast!  that&#8217;s a REAL hearty bonus!), and many of the best technical presenters in the industry.  Lucene/Solr earned a full day &#8220;track&#8221; at Überconf.<br />
<span id="more-3784"></span><br />
One full day was dedicated to my three Solr presentations, one being a double-session &#8220;workshop&#8221;.  I began the day presenting &#8220;Rapid Prototyping with Solr&#8221;, demonstrating several quick examples that go from dataset ingestion to a usable search user interface.  Is Solr right for your data and your environment?  Try it and see &#8211; it&#8217;ll be 15 minutes well spent!  <img src='http://www.lucidimagination.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Here are the slides for &#8220;Rapid Prototyping with Solr&#8221;:
<div style="width:340px" id="__ss_8600305"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/erikhatcher/rapid-prototyping-with-solr-8600305" title="Rapid Prototyping with Solr" target="_blank">Rapid Prototyping with Solr</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/8600305?rel=0" width="340" height="284" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/erikhatcher" target="_blank">Erik Hatcher</a> </div>
</p></div>
<p>My next talk, titled &#8220;Solr Recipes&#8221;, was a 3 hour workshop covering the common ways to get data into Solr, configure it, and leverage it within applications.  </p>
<p>Here are the &#8220;Solr Recipes&#8221; slides:
<div style="width:340px" id="__ss_8600306"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/erikhatcher/solr-recipes-workshop" title="Solr Recipes Workshop" target="_blank">Solr Recipes Workshop</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/8600306?rel=0" width="340" height="284" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/erikhatcher" target="_blank">Erik Hatcher</a> </div>
</p></div>
<p>And finally, to a few hardcore folks, I discussed &#8220;Lucene for Solr Developers&#8221;, which more broadly covered the various ways to extend Solr.  One  cool example (or so I think, at least) I built for this talk that I&#8217;ve put out there as food for thought is &#8220;auto-faceting&#8221;, having a way for Solr to determine the best facets to select for a given query.  See <a href="https://issues.apache.org/jira/browse/SOLR-2641">SOLR-2641</a> for my initial proof-of-concept implementation. </p>
<p>Here are the &#8220;Lucene for Solr Developers&#8221; slides:
<div style="width:340px" id="__ss_8600304"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/erikhatcher/lucene-for-solr-developers" title="Lucene for Solr Developers" target="_blank">Lucene for Solr Developers</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/8600304?rel=0" width="340" height="284" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/erikhatcher" target="_blank">Erik Hatcher</a> </div>
</p></div>
<p>There was so much going on that I barely got to tap into the conference experience myself, with the best part being the conversations had during meals, and between and during the scheduled talks.  I was able to reconnect with many long-time friends that I&#8217;ve made through No Fluff, Just Stuff, and made many new acquaintances &#8211; I won&#8217;t name names, as <a href="http://uberconf.com/conference/denver/2011/07/speakers">this list</a> covers most of them.</p>
<p>Thank you, Überconf, Jay, and fellow speakers and attendees for a stellar technical event.   If you missed it, don&#8217;t despair, <a href="http://www.nofluffjuststuff.com">No Fluff, Just Stuff</a> brings many of the same speakers and topics to an area near you.  I&#8217;ll be speaking at a few NFJS events during the second half of this year, including <a href="http://www.nofluffjuststuff.com/conference/raleigh/2011/08/home">Raleigh, NC</a>, <a href="http://www.nofluffjuststuff.com/conference/boston/2011/09/home">Boston, MA</a>, and, surely to rival the quality and size of Überconf,  <a href="http://www.therichwebexperience.com/conference/fort_lauderdale/2011/11/home">The Rich Web Experience</a> in Ft. Lauderdale, FL.</p>
<p>Let&#8217;s talk.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/07/19/uberconf-no-fluff-just-solr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The scientific approach to search at Sensis</title>
		<link>http://www.lucidimagination.com/blog/2011/06/01/the-scientific-approach-to-search-at-sensis/</link>
		<comments>http://www.lucidimagination.com/blog/2011/06/01/the-scientific-approach-to-search-at-sensis/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 20:40:56 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[lucene revolution]]></category>
		<category><![CDATA[lucid imagination]]></category>
		<category><![CDATA[Open Source Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3630</guid>
		<description><![CDATA[<p><a href="http://www.lucenerevolution.org/blog/wp-content/uploads/2011/05/evolution.jpeg"><img class="alignright size-thumbnail wp-image-146" title="evolution" src="http://www.lucenerevolution.org/blog/wp-content/uploads/2011/05/evolution-150x150.jpg" alt="" width="150" height="150" /></a>Back in the 1990s, Carnegie Mellon University developed the <a href="http://www.sei.cmu.edu/cmmi/">Capability Maturity Model</a>, a scale for determining how prepared a contractor&#8217;s processes were for a particular task.  If you&#8217;ve ever written software for anyone but yourself, you&#8217;ll recognize some of these definitions, which call to mind the famous characterization of the evolution of software.</p>
<p><a href="http://www.sensis.com.au/" target="_blank">Sensis</a>, &#8220;the search engine for Australians&#8221;, uses a modified version of this model to assess their own search processes.  It &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.lucenerevolution.org/blog/wp-content/uploads/2011/05/evolution.jpeg"><img class="alignright size-thumbnail wp-image-146" title="evolution" src="http://www.lucenerevolution.org/blog/wp-content/uploads/2011/05/evolution-150x150.jpg" alt="" width="150" height="150" /></a>Back in the 1990s, Carnegie Mellon University developed the <a href="http://www.sei.cmu.edu/cmmi/">Capability Maturity Model</a>, a scale for determining how prepared a contractor&#8217;s processes were for a particular task.  If you&#8217;ve ever written software for anyone but yourself, you&#8217;ll recognize some of these definitions, which call to mind the famous characterization of the evolution of software.</p>
<p><a href="http://www.sensis.com.au/" target="_blank">Sensis</a>, &#8220;the search engine for Australians&#8221;, uses a modified version of this model to assess their own search processes.  It has five levels:</p>
<ol>
<li>Unmanaged: Set it and forget it, basically.</li>
<li>Ad Hoc: People work on the process part time, and innovation is led by individuals with an itch to scratch.</li>
<li>Monitored: There&#8217;s a defined team responsible for improvements, and it&#8217;s monitored for problems.</li>
<li>Managed: Improvements are methodically sought, and there are defined targets and metrics.</li>
<li>Optimized: This is obviously the target level, where machine learning leads to the best possible results.</li>
</ol>
<p>On day 1 of Lucene Revolution, <a href="http://lucenerevolution.org/2011/sessions-day-1#search-rees" target="_blank">Craig Rees talked about Sensis</a>&#8217; goal of moving up that maturity ladder as they made their data &#8212; millions of white pages and yellow pages listings &#8212; available via a developer API.  When it comes to search, Sensis is currently at the &#8220;Monitored&#8221; stage, and making the move up to &#8220;Managed&#8221;.</p>
<p>Slides for this session:</p>
<div style="width:595px" id="__ss_8167532"><object id="__sse8167532" width="595" height="497"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=reescraig-searchapissolrandthesensisjourney-110531192527-phpapp02&#038;stripped_title=rees-craig-search-ap-is-solr-and-the-sensis-journey&#038;userName=lucenerevolution" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed name="__sse8167532" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=reescraig-searchapissolrandthesensisjourney-110531192527-phpapp02&#038;stripped_title=rees-craig-search-ap-is-solr-and-the-sensis-journey&#038;userName=lucenerevolution" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="595" height="497"></embed></object> </div>
<p>It turns out that as much as you might like to, you can&#8217;t &#8220;jump levels&#8221; on the maturity index; you&#8217;ve got to earn it the hard way, and that&#8217;s what Sensis is doing.</p>
<p><a href="http://www.lucidimagination.com/blog/author/grant-ingersoll/" target="_blank">Grant Ingersoll</a> is always saying that if you&#8217;re not methodically testing your search results, you&#8217;re not testing at all, and Sensis definitely is a good example of methodically testing your results.  To formulate a &#8220;scientific approach&#8221;, they created &#8220;Gold Sets&#8221; of queries and results, which enables them to tweak their Lucene settings, then compare the results with their gold set of &#8220;perfect&#8221; results.</p>
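<p>The talk summary does not say how results are scored against a gold set, so the following Python sketch is only an assumption about what such a comparison could look like. It uses mean precision-at-k as a stand-in metric; the function names, queries and document ids are all made up.</p>

```python
def precision_at_k(returned_ids, gold_ids, k=10):
    """Fraction of the top-k returned results that appear in the gold set."""
    top_k = returned_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in gold_ids)
    return hits / len(top_k)


def gold_set_score(run, gold_sets, k=10):
    """Mean precision-at-k over every query in the gold sets.

    run: dict mapping query to the ranked list of result ids from the engine.
    gold_sets: dict mapping query to the set of ids judged "perfect".
    """
    scores = []
    for query, gold_ids in gold_sets.items():
        scores.append(precision_at_k(run.get(query, []), gold_ids, k))
    return sum(scores) / len(scores)


# Made-up gold sets and two made-up runs, before and after a settings tweak.
gold_sets = {"pizza": {"pizza-1", "pizza-2"}, "plumber": {"plumb-7"}}
before = {"pizza": ["cafe-3", "pizza-1"], "plumber": ["plumb-7"]}
after = {"pizza": ["pizza-1", "pizza-2"], "plumber": ["plumb-7"]}

print(gold_set_score(before, gold_sets, k=2))  # prints 0.75
print(gold_set_score(after, gold_sets, k=2))   # prints 1.0
```

<p>Comparing the two scores is the whole point: a settings tweak is only worth keeping if the score against the same gold set goes up.</p>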
<p>Of course, Sensis has a lot to test for.  Context is key, Craig points out; a 12 year old on his mobile phone in the schoolyard at noon probably shouldn&#8217;t get the same results as a 60 year old at home on his computer at 10pm.  And context has a lot of variables, such as time of day (or even time of year), location, device, or, and this is perhaps hardest to quantify, intent.</p>
<p>Sensis also has other hurdles to leap. For example, while conventional wisdom says that &#8220;rare&#8221; terms should be worth more, it can sometimes backfire in their data set, which is broad but not deep.  The term &#8220;flowers&#8221; is rare in &#8220;crematorium&#8221; listings, but if you search for &#8220;flowers&#8221;, a crematorium is probably NOT what you want.  They also have to deal with contextual synonyms.  For example, &#8220;bow&#8221; and &#8220;ribbon&#8221; are synonyms &#8212; unless you&#8217;re also looking for arrows.</p>
<p>So Sensis created a tool, the Search Quality Analysis and Testing system, or SQUAT.  (And yes, the audience snickered when he said, &#8220;All we need is SQUAT,&#8221; but at the same time, I think most of us were just a little bit jealous). The very first question was, &#8220;Are you releasing that as open source?&#8221;  The answer was, &#8220;We&#8217;re thinking about it, but not at this time.&#8221;  SQUAT enables Sensis to test for a variety of terms and know exactly how good the results were.  If a change doesn&#8217;t result in a positive effect on quality score, it doesn&#8217;t go into production.</p>
<p>And isn&#8217;t that what we all want?</p>
<p><em>Cross-posted with <a href="http://www.lucenerevolution.org/blog/?p=126">Lucene Revolution Blog</a>. Nicholas Chase is a guest blogger. This is one of a series of presentation summaries from the conference.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/06/01/the-scientific-approach-to-search-at-sensis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lucene Revolution Keynote Highlights: the once and future history of open source and enterprise search</title>
		<link>http://www.lucidimagination.com/blog/2011/06/01/lucene-revolution-keynote-highlights-the-once-and-future-history-of-open-source-and-enterprise-search/</link>
		<comments>http://www.lucidimagination.com/blog/2011/06/01/lucene-revolution-keynote-highlights-the-once-and-future-history-of-open-source-and-enterprise-search/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 20:20:48 +0000</pubDate>
		<dc:creator>tony.barreca</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[lucene revolution]]></category>
		<category><![CDATA[lucid imagination]]></category>
		<category><![CDATA[lucid works]]></category>
		<category><![CDATA[Open Source Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3618</guid>
		<description><![CDATA[<p>Lucid Imagination founder Marc Krellenstein kicked off the Lucene Revolution yesterday with a keynote address covering the history of search. Here are the slides, followed by some highlights:</p>
<div style="width:595px" id="__ss_8167472"></div>
<p>Much as we might think of search technology as a 21st century internet thing, its roots go back to when IBM was sued by the US government. By the early days of the Internet, search (Lycos, Infoseek, Excite, and Alta Vista) began to accelerate the virtuous cycle of requirements and innovation.  &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Lucid Imagination founder Marc Krellenstein kicked off the Lucene Revolution yesterday with a keynote address covering the history of search. Here are the slides, followed by some highlights:</p>
<div style="width:595px" id="__ss_8167472"><object id="__sse8167472" width="595" height="497"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=krellensteinmarc-keynotetheonceandfuturehistoryofenterprisesearchandopensource-110531191312-phpapp01&#038;stripped_title=krellenstein-marc-keynote-the-once-and-future-history-of-enterprise-search-and-open-source-8167472&#038;userName=lucenerevolution" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed name="__sse8167472" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=krellensteinmarc-keynotetheonceandfuturehistoryofenterprisesearchandopensource-110531191312-phpapp01&#038;stripped_title=krellenstein-marc-keynote-the-once-and-future-history-of-enterprise-search-and-open-source-8167472&#038;userName=lucenerevolution" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="595" height="497"></embed></object></div>
<p>Much as we might think of search technology as a 21st century internet thing, its roots go back to when IBM was sued by the US government. By the early days of the Internet, search (Lycos, Infoseek, Excite, and Alta Vista) began to accelerate the virtuous cycle of requirements and innovation.  Marc describes the evolution of the technology from centralized to distributed indexes; among the early players, only Lucene, Google, and Fast had distributed architectures in their initial releases.</p>
<p>There&#8217;s no denying the influence of Google on search, and their PageRank algorithm, a popularity-based authority metric, represented a breakthrough in search precision. Along with speed, it helped Google to its now-familiar market position. But in contrast to the public perception of internet search experience as the be-all and end-all, it is essential to understand the differences between &#8220;generic&#8221; public internet search and enterprise search. One important, little understood virtue of Google is that it is actually a combination of multiple search applications and techniques, tailored to a particular set of user behaviors; the magic is not in one technique or another, but in their deliberate combination.</p>
<p>Marc continues with a review of search fundamentals, precision and recall, and discusses what makes for high scores on each with an emphasis on enterprise requirements, resources, and methodologies.  With the stage thus set, the focus turns to the emergence of Lucene and Solr.</p>
<p>He makes the case that Lucene and Solr are industrial strength search technology and are as good as, if not better than, Google for search in terms of precision, performance, and relevance. This derives from the superior attributes of open source as a development model.</p>
<p>That said, as good as Lucene and Solr are, they are not perfect. Issues include a focus on core functionality that leaves gaps in certain areas of enterprise requirements. The good news is that these can be addressed in a number of ways: community, consultants, commercialization, and an enterprise’s internal resources. Marc concluded his remarks with an overview of the competition and the competitive landscape, along with some thoughts on the future.</p>
<p>You can download the slides <a href="http://www.slideshare.net/signup?from=download&amp;from_source=http://www.slideshare.net/lucenerevolution/krellenstein-lucene-revolution2011keynoteoncefuturehistoryenterprise-searchopensource/download">here</a>.</p>
<p><em>Cross-posted with <a href="http://www.lucenerevolution.org/blog/?p=126">Lucene Revolution Blog</a>. Tony Barreca is a guest blogger. This is one of a series of presentation summaries from the conference.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/06/01/lucene-revolution-keynote-highlights-the-once-and-future-history-of-open-source-and-enterprise-search/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Solr and law enforcement: highly relevant results can be a crime</title>
		<link>http://www.lucidimagination.com/blog/2011/06/01/solr-and-law-enforcement-highly-relevant-results-can-be-a-crime/</link>
		<comments>http://www.lucidimagination.com/blog/2011/06/01/solr-and-law-enforcement-highly-relevant-results-can-be-a-crime/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 20:00:14 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[lucene revolution]]></category>
		<category><![CDATA[lucid imagination]]></category>
		<category><![CDATA[Open Source Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3614</guid>
		<description><![CDATA[<p>Imagine that you have to integrate and search data from 200 different sources, each of which uses a different structure (if they use a structure at all).  Your data may be incomplete, the same information is represented in different ways by different sources, and it&#8217;s often vague.  Oh, and if a user can&#8217;t find the correct result using a simple Google-like search, someone may literally get away with murder.</p>
<p>Welcome to Ronald Mayer&#8217;s world.  In &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Imagine that you have to integrate and search data from 200 different sources, each of which uses a different structure (if they use a structure at all).  Your data may be incomplete, the same information is represented in different ways by different sources, and it&#8217;s often vague.  Oh, and if a user can&#8217;t find the correct result using a simple Google-like search, someone may literally get away with murder.</p>
<p>Welcome to Ronald Mayer&#8217;s world.  In his <a href="http://lucenerevolution.org/2011/sessions-day-2#highly-mayer" target="_blank">talk</a> on Day 2 of <a href="http://lucenerevolution.org">Lucene Revolution</a>, he described how <a href="http://www.forensiclogic.com/" target="_blank">Forensic Logic</a> aggregates data from local police departments, courts, and even federal agencies, so that law enforcement officers can get information on crimes and suspects not only in their own jurisdictions, but also in surrounding areas.</p>
<p>Slides for this session:</p>
<div style="width:480px" id="__ss_8167459"> <object id="__sse8167459" width="480" height="397"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=mayerronald-searchrelevanceforlawenforcement-110531191017-phpapp02&#038;stripped_title=mayer-ronald-search-relevance-for-law-enforcement-8167459&#038;userName=lucenerevolution" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed name="__sse8167459" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=mayerronald-searchrelevanceforlawenforcement-110531191017-phpapp02&#038;stripped_title=mayer-ronald-search-relevance-for-law-enforcement-8167459&#038;userName=lucenerevolution" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="397"></embed></object></div>
<p>He&#8217;s got lots of challenges, not the least of which are the differences in data formats.  You&#8217;d think there&#8217;d be a standard for this, and as he points out, there is.  In fact, there are lots of them.  In reality, the standards are so large that most agencies wind up using some subset, plus their own extensions.  And that doesn&#8217;t count the agencies that have all of their data in Word files in a folder on someone&#8217;s computer.  (It also doesn&#8217;t count data from external sources.  Apparently gangs are big into <a href="http://www.myspace.com" target="_blank">MySpace</a>.  Who knew?)</p>
<p>Once they&#8217;ve created a transformation for adding a new agency&#8217;s data, they have a whole host of practical issues to deal with.  For example, &#8220;at dusk&#8221; and &#8220;near any elementary school in the school district&#8221; are perfectly valid requests, and their system has to accommodate them.  They&#8217;ve also got to be able to take a statement such as &#8220;suspect is a caucasian male, approximately 6&#8217;4&#8243;, wearing a red ball cap and black jacket, leather&#8221; and have it come up for &#8220;tall caucasian baseball cap black leather jacket&#8221;.  Entity extraction and help from <a href="http://www.basistech.com/" target="_blank">Basis Technology</a> products help associate an adjective with a noun, which makes things a bit easier.</p>
<p>Interestingly, Forensic also has to work in the other direction; if all of that information was encoded by field, it wouldn&#8217;t be searchable in a simple text search.  So another task they&#8217;ve had to perform is to de-normalize the data back into a searchable narrative.</p>
<p>Forensic gets good use from Lucene, and in fact has been contributing back. They have been making heavy use of the phrase field (<code>pf</code>) and phrase slop (<code>ps</code>) parameters in the new Extended Dismax parser, but what they really needed was to be able to combine several sets of them into a single query.  So SOLR-2058 proposes (and implements) a new query syntax, <code>field~slop^boost</code>, that allows for independent pf and ps settings, such as:</p>
<p><code>pf2=important_text~10^10&amp;pf=important_text^100&amp;pf=important_text~10^100</code></p>
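<p>As a rough sketch of how such a request might be assembled on the client side, the Python below formats each entry in the <code>field~slop^boost</code> form and URL-encodes the parameters. The helper name <code>phrase_field</code> and the query text are illustrative assumptions, not part of Solr or of Forensic Logic&#8217;s code; <code>defType=edismax</code> selects the Extended Dismax parser.</p>

```python
from urllib.parse import urlencode


def phrase_field(field, boost, slop=None):
    """Format one pf/pf2 entry as field~slop^boost (the slop part is optional)."""
    slop_part = f"~{slop}" if slop is not None else ""
    return f"{field}{slop_part}^{boost}"


# A list of pairs (rather than a dict) lets the same parameter name repeat.
params = [
    ("q", "tall caucasian baseball cap black leather jacket"),
    ("defType", "edismax"),
    ("pf2", phrase_field("important_text", 10, slop=10)),
    ("pf", phrase_field("important_text", 100)),
    ("pf", phrase_field("important_text", 100, slop=10)),
]

query_string = urlencode(params)
print(query_string)
```

<p>The resulting string would be appended to the URL of whatever select handler your Solr instance exposes; the host and handler path are deployment-specific and are left out here.</p>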
<p>Mayer says that although searches might take a second or three to return data, relevance is much more important in this case.</p>
<p>There are still problems to overcome.  For example, relative boosts are tricky.  How far away does an event have to happen before it&#8217;s as irrelevant as something that happened two years ago?  Forensic continually works on refining the process, tags, synonyms, and other parameters, so at any given moment they&#8217;re reindexing the oldest (not yet reindexed) documents.</p>
<p>Remember that the next time you get pulled over.</p>
<p><em>Cross-posted with <a href="http://www.lucenerevolution.org/blog/?p=126">Lucene Revolution Blog</a>. Nicholas Chase is a guest blogger. This is one of a series of presentation summaries from the conference.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/06/01/solr-and-law-enforcement-highly-relevant-results-can-be-a-crime/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>More like this: from semantics to new business model for Canoo and Axel Springer</title>
		<link>http://www.lucidimagination.com/blog/2011/06/01/more-like-this-from-semantics-to-new-business-model-for-canoo-and-axel-springer/</link>
		<comments>http://www.lucidimagination.com/blog/2011/06/01/more-like-this-from-semantics-to-new-business-model-for-canoo-and-axel-springer/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 19:40:08 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[lucene revolution]]></category>
		<category><![CDATA[lucid imagination]]></category>
		<category><![CDATA[Open Source Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3609</guid>
		<description><![CDATA[<p>It wasn&#8217;t the biggest lesson learned from <a href="http://lucenerevolution.org/2011/sessions-day-2#building-mijares" target="_blank">Alberto Mijares&#8217; talk</a> on Day 2 of Lucene Revolution, but the notion that funding issues can lead to a new and successful business model was uplifting, at the very least.</p>
<p>Slides for this session:</p>
<div style="width:595px" id="__ss_8167447">  </div>
<p>When Mijares&#8217;s company, <a href="http://www.canoo.com/" target="_blank">Canoo Engineering AG</a>, met with Swiss newspaper publisher and media group <a href="http://www.axelspringer.de/en/index.html" target="_blank">Axel Springer</a>, they all agreed that what Axel Springer needed was to keep readers on the sites of &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>It wasn&#8217;t the biggest lesson learned from <a href="http://lucenerevolution.org/2011/sessions-day-2#building-mijares" target="_blank">Alberto Mijares&#8217; talk</a> on Day 2 of Lucene Revolution, but the notion that funding issues can lead to a new and successful business model was uplifting, at the very least.</p>
<p>Slides for this session:</p>
<div style="width:595px" id="__ss_8167447"> <object id="__sse8167447" width="595" height="497"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=mijarealberto-buildingsaaswsolr-110531190749-phpapp02&#038;stripped_title=mijare-alberto-building-saa-s-w-solr-8167447&#038;userName=lucenerevolution" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed name="__sse8167447" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=mijarealberto-buildingsaaswsolr-110531190749-phpapp02&#038;stripped_title=mijare-alberto-building-saa-s-w-solr-8167447&#038;userName=lucenerevolution" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="595" height="497"></embed></object> </div>
<p>When Mijares&#8217;s company, <a href="http://www.canoo.com/" target="_blank">Canoo Engineering AG</a>, met with Swiss newspaper publisher and media group <a href="http://www.axelspringer.de/en/index.html" target="_blank">Axel Springer</a>, they all agreed that what Axel Springer needed was to keep readers on the sites of their most popular newspapers longer, and to drive traffic from those papers to their less-well-known brands.   They also agreed that providing &#8220;related articles&#8221; was the way to do that.</p>
<p>What they didn&#8217;t agree on was how to pay for the development of such a service.</p>
<p>Still, Canoo soldiered on.  Deciding on Lucene/Solr as the tool of choice was an easy decision; Lucene&#8217;s &#8220;More Like This&#8221; function would be perfect for finding articles similar to the current page.</p>
<p>Well, almost perfect.  What Canoo discovered is that while More Like This does work out of the box, without semantics the results weren&#8217;t good enough for their purposes.  Furthermore, at the moment, it&#8217;s designed to work well in English, and the &#8220;financial language&#8221; of Switzerland is German.</p>
<p>They solved the problem with a combination of strategies.  The first was to use WMTrans, Canoo&#8217;s language tool and the foundation behind the linguistic sites <a href="http://Leo.de" target="_blank">Leo.de</a> and <a href="http://Canoo.net" target="_blank">Canoo.net</a>, to perform linguistic analysis. They then used Lucene&#8217;s analysis pipeline to add semantics to the data using external sources such as Wikipedia in order to get better results from More Like This.</p>
<p>But the &#8220;funding problem&#8221; remained. Finally, they hit on a solution: they&#8217;d provide &#8220;related articles&#8221;, but using the Software as a Service model.  This way, they could make the service available to other companies, and what was a one-off project became an ongoing business. Literally, the search turned from an implementation into a business opportunity, seeking new customers who are &#8216;more like this&#8217;.</p>
<p>Because of the semantics, not all companies will be able to take advantage of Canoo&#8217;s service; their system requires documents to have a significant size in order for the analysis to be accurate, though they&#8217;re working on shrinking that down.  But by providing their application as a service, Canoo has expanded their market considerably. So because Canoo was able to both (a) have confidence in an open source solution that would get them close to what they needed and (b) get commercial-grade support for that product (disclosure: Canoo is a client of Lucid Imagination, a sponsor of Lucene Revolution), they were willing to take a chance on a project even though the potential client wasn&#8217;t, leading to not one project, but an ongoing business.</p>
<p>And that lesson I was talking about?  Have faith in open source, and have faith in yourself; it can definitely pay off in the long run.</p>
<p><em>Cross-posted with <a href="http://www.lucenerevolution.org/blog/?p=126">Lucene Revolution Blog</a>. Nicholas Chase is a guest blogger. This is one of a series of presentation summaries from the conference.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/06/01/more-like-this-from-semantics-to-new-business-model-for-canoo-and-axel-springer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>BeyondTrees and Today&#8217;s Newspaper:  Using Lucene to build a time machine</title>
		<link>http://www.lucidimagination.com/blog/2011/06/01/beyondtrees-and-the-new-york-times-using-lucene-to-build-a-time-machine/</link>
		<comments>http://www.lucidimagination.com/blog/2011/06/01/beyondtrees-and-the-new-york-times-using-lucene-to-build-a-time-machine/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 19:20:18 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[lucene revolution]]></category>
		<category><![CDATA[lucid imagination]]></category>
		<category><![CDATA[Open Source Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3605</guid>
		<description><![CDATA[<p>You&#8217;ve been hearing me do a lot of talking about finding meaning in data, so it may not come as a surprise that of all the track sessions at Lucene Revolution, perhaps the one I was looking forward to the most was the one I attended last, &#8220;<a href="http://lucenerevolution.org/2011/sessions-day-2#lots-veling" target="_blank">Lots of Facets, Fast</a>&#8221;, from Anne Veling.</p>
<p>Here are the slides for this session.</p>
<div id="__ss_8167428" style="width: 595px;">OK, so the title may not seem all that revolutionary, but it&#8217;s &#8230;</div>]]></description>
			<content:encoded><![CDATA[<p>You&#8217;ve been hearing me do a lot of talking about finding meaning in data, so it may not come as a surprise that of all the track sessions at Lucene Revolution, perhaps the one I was looking forward to the most was the one I attended last, &#8220;<a href="http://lucenerevolution.org/2011/sessions-day-2#lots-veling" target="_blank">Lots of Facets, Fast</a>&#8221;, from Anne Veling.</p>
<p>Here are the slides for this session.</p>
<div id="__ss_8167428" style="width: 595px;">OK, so the title may not seem all that revolutionary, but it&#8217;s not the facets themselves that are interesting:  it&#8217;s what he did with them.</div>
<p>Imagine you can see 160 years of history, all on one screen.  You can zoom and pan, you can look at a particular day, you can even do a search.  And when you do, the results come up not as a list, but as a heat map that shows where in history that topic appears, and how often.</p>
<p>That&#8217;s the application Anne Veling, <a href="http://beyondtrees.com/" target="_blank">BeyondTrees</a>, and <a href="http://www.proquest.com/" target="_blank">ProQuest</a> built for a prominent US newspaper.</p>
<p>The system covers 160 years of newspapers, with every single issue &#8212; more than 58,000! &#8212; appearing on a single canvas.  When you do a search, the app sends an AJAX request that gets back a heat map specifying the color to overlay over each of the tiles, with brighter colors representing greater numbers of mentions in that particular issue.</p>
<p>And that&#8217;s where facets come in.  After all, what is a facet anyway?  It&#8217;s a count of how many results appear in a particular &#8220;bucket&#8221;.  In a traditional system, you might do a search for &#8220;Indians&#8221; and get something along the lines of</p>
<pre>...
1870 [37]
1871 [259]
1872 [135]
1873 [3]
1901 [25]
...</pre>
<p>and so on.  That&#8217;s exactly the information Veling needed for his heat map, only instead of years, each facet would represent one day, or one issue  of the newspaper.</p>
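<p>To make the idea concrete, here is a minimal Python sketch of what a facet count amounts to: grouping matching documents into buckets and counting each bucket. The sample dates and the function name are made up for illustration; Solr computes these counts inside the index rather than in application code.</p>

```python
from collections import Counter
from datetime import date


def facet_counts(matching_dates, bucket):
    """Count matching documents per bucket; `bucket` maps a date to a bucket key."""
    return Counter(bucket(d) for d in matching_dates)


# Hypothetical publication dates of the documents that matched a search.
hits = [date(1871, 3, 4), date(1871, 3, 4), date(1871, 7, 19), date(1872, 1, 2)]

by_year = facet_counts(hits, lambda d: d.year)           # one bucket per year
by_issue = facet_counts(hits, lambda d: d.isoformat())   # one bucket per day (issue)

print(by_year)
print(by_issue)
```

<p>Switching from yearly to daily buckets only changes the key function; the counting stays the same, which is why the facet output maps so directly onto one heat map tile per issue.</p>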
<p>But getting those facets posed some interesting challenges, and presented some interesting opportunities.</p>
<p>For one thing, this may be one of the only times in your professional life you&#8217;ll ever see a Lucene index in which absolutely nothing is stored.  Because nothing is ever returned except the facet counts, nothing needs to be.  (Once the user finds the issue he or she wants, the paper is returned to them through a different process.)</p>
<p>From there, it was partly a matter of tuning &#8212; the <code>filterCache</code>, set by default at 512, needed to be at least 60,000 so that the entire set of facets could be resident in memory, plus he used DocSets to reduce memory requirements &#8212; and partly a matter of knowing the data.</p>
<p>Rather than having each document check against all 58K+ facets to see where it belongs, he instead did a hierarchical check; first by decade, then by year, then by month, then by day.  Also, because he knew that a document could belong to only one facet, he created a custom runtime collector that stops looking when it finds one.  By making those changes, he was able to reduce the number of checks per document from 58560 to just 34.5.  Considering that he was dealing with more than 28 million documents, it&#8217;s no surprise that he was able to increase performance by a factor of 30.</p>
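<p>One way to arrive at those two numbers (a back-of-the-envelope reading, not something confirmed by the talk) is to assume 160 years of up to 366 daily buckets for the flat case, and a scan that on average stops about halfway through the candidates at each level of the hierarchy. The level sizes below are assumptions chosen to match a 160-year archive.</p>

```python
# Flat approach: every document is tested against every daily bucket.
flat_checks = 160 * 366  # 160 years, up to 366 issues per year

# Hierarchical approach: number of candidates at each level. With early exit,
# assume a scan examines roughly half of the candidates on average.
levels = {"decade": 16, "year": 10, "month": 12, "day": 31}
hierarchical_checks = sum(n / 2 for n in levels.values())

print(flat_checks, hierarchical_checks)
```

<p>Under these assumptions the flat count is 58,560 and the hierarchical average is 34.5, which lines up with the figures quoted above.</p>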
<p>The result of all this is a beautiful system in which you can clearly see patterns such as the 1978 newspaper strike, leap years, or the fact that this renowned Sunday paper didn&#8217;t initially exist at all. You&#8217;ve probably seen those old science fiction movies where the time machine has a dial that selects years. Here, the very sleek iPad interface lets you zoom in, in an experience reminiscent of the flying function of Google Earth &#8212; only instead of traveling through space, you travel through time.</p>
<p>Next up, Veling will work on a similar Belgian system, representing over 1.2 million books, CDs, and DVDs.</p>
<p>The real beauty of this project isn&#8217;t so much the actual interface (though it is beautiful) but the fact that it exists at all.  This kind of innovative thinking is the direction we need to go in order to give meaning to all this information we are beginning to see.</p>
<p>It&#8217;s a new day, and it is great to see Lucene and Solr at the forefront of the revolution.</p>
<p><em>Cross-posted with <a href="http://www.lucenerevolution.org/blog/?p=126">Lucene Revolution Blog</a>. Nicholas Chase is a guest blogger. This is one of a series of presentation summaries from the conference.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/06/01/beyondtrees-and-the-new-york-times-using-lucene-to-build-a-time-machine/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Integrating Advanced Text Analytics into Solr/Lucene</title>
		<link>http://www.lucidimagination.com/blog/2011/06/01/integrating-advanced-text-analytics-into-solrlucene/</link>
		<comments>http://www.lucidimagination.com/blog/2011/06/01/integrating-advanced-text-analytics-into-solrlucene/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 19:00:23 +0000</pubDate>
		<dc:creator>tony.barreca</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[lucene revolution]]></category>
		<category><![CDATA[lucid imagination]]></category>
		<category><![CDATA[Open Source Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3601</guid>
		<description><![CDATA[<p>“Metadata is king!” Thus proclaimed Steve Kearns of Basis Technology, <a href="http://www.lucenerevolution.org/sponsors#basistech">Platinum Sponsor of Lucene Revolution,</a> at the start of this <a href="http://lucenerevolution.org/2011/sessions-day-1#integrating-kearns">standing-room-only session on Day 1</a> of the conference. Why? Because it provides a way to enhance otherwise unstructured data with a considerable amount of structure.</p>
<p>Here are the slides for this session.</p>
<div style="width:595px" id="__ss_8167385"></div>
<p>With this premise in place, Steve discussed the use and integration of advanced analytics in the document-processing pipeline, focusing on the three levels &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>“Metadata is king!” Thus proclaimed Steve Kearns of Basis Technology, <a href="http://www.lucenerevolution.org/sponsors#basistech">Platinum Sponsor of Lucene Revolution,</a> at the start of this <a href="http://lucenerevolution.org/2011/sessions-day-1#integrating-kearns">standing-room-only session on Day 1</a> of the conference. Why? Because it provides a way to enhance otherwise unstructured data with a considerable amount of structure.</p>
<p>Here are the slides for this session.</p>
<div style="width:595px" id="__ss_8167385"><object id="__sse8167385" width="595" height="497"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=kearnssteve-integratingadvancedtextanalyticsintosolr-110531185928-phpapp01&#038;stripped_title=kearns-steve-integrating-advanced-text-analytics-into-solr-8167385&#038;userName=lucenerevolution" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed name="__sse8167385" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=kearnssteve-integratingadvancedtextanalyticsintosolr-110531185928-phpapp01&#038;stripped_title=kearns-steve-integrating-advanced-text-analytics-into-solr-8167385&#038;userName=lucenerevolution" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="595" height="497"></embed></object></div>
<p>With this premise in place, Steve discussed the use and integration of advanced analytics in the document-processing pipeline, focusing on the three levels to which they apply: namely, the document, sub-document, and cross-document levels.</p>
<p>Metadata derivable at the document level includes identification of the language in which the document is written and the category in which it is properly placed. Steve touched on some especially interesting challenges posed by Asian languages, and mentioned that this level of analysis is useful for creating document search dashboards, which help those responsible for assessing and maintaining document search quality.</p>
<p>The amount of information that can be gleaned from sub-document analysis is immense. Some of the processes involved at this level include basic stemming and its cousin, lemmatization. Among the more advanced techniques are entity extraction, relationship and event extraction, sentiment analysis, and the mapping of extracted items to real-world concepts in a process called “co-reference resolution.”</p>
<p>Key uses of cross-document analysis include, for example, document clustering, i.e., finding a set of documents that are “more like” one another than another set would be.</p>
<p>An aspect of the presentation of great interest to Solr users focused on how to integrate analytics, like those provided by Basis, into the Solr pipeline. Not surprisingly, there are a lot of ways to do this. The biggest question you need to answer is: Should I run the analytics within Solr, or should I treat them as external calls?</p>
<p>Steve wrapped up this useful talk with some approaches to both techniques, including <code>UpdateRequest</code> processing and a list of some tools (e.g., UIMA, GATE, and OpenPipeline) to consider when the time to implement arrives.</p>
<p><em>Cross-posted with <a href="http://www.lucenerevolution.org/blog/?p=126">Lucene Revolution Blog</a>. Tony Barreca is a guest blogger. This is one of a series of presentation summaries from the conference.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/06/01/integrating-advanced-text-analytics-into-solrlucene/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Stephen Dunn and the Guardian: How being open makes them better</title>
		<link>http://www.lucidimagination.com/blog/2011/06/01/stephen-dunn-and-the-guardian-how-being-open-makes-them-better/</link>
		<comments>http://www.lucidimagination.com/blog/2011/06/01/stephen-dunn-and-the-guardian-how-being-open-makes-them-better/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 18:57:37 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[lucene revolution]]></category>
		<category><![CDATA[lucid imagination]]></category>
		<category><![CDATA[Open Source Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3624</guid>
		<description><![CDATA[<p>What a way to start out a conference on using data!  <a href="http://lucenerevolution.com/2011/sessions-day-1#news-guardian" target="_blank">Stephen Dunn&#8217;s keynote</a> for Day 1 of Lucene Revolution &#8212; the <a href="http://www.guardian.co.uk/" target="_blank">Guardian</a>&#8217;s opening up of its content using an API, and how Lucene/Solr was involved in that &#8212; was interesting all by itself, but he himself is also a good speaker, engaging the audience.  A great way to start the day. Here&#8217;s a video clip of his interview:</p>
<p><a href="http://www.lucidimagination.com/files/video/lr2011/lucene_revolution_2011_stephen_dunn.mov">Stephen Dunn and the Guardian: </a>&#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>What a way to start out a conference on using data!  <a href="http://lucenerevolution.com/2011/sessions-day-1#news-guardian" target="_blank">Stephen Dunn&#8217;s keynote</a> for Day 1 of Lucene Revolution &#8212; the <a href="http://www.guardian.co.uk/" target="_blank">Guardian</a>&#8217;s opening up of its content using an API, and how Lucene/Solr was involved in that &#8212; was interesting all by itself, but he himself is also a good speaker, engaging the audience.  A great way to start the day. Here&#8217;s a video clip of his interview:</p>
<p><a href="http://www.lucidimagination.com/files/video/lr2011/lucene_revolution_2011_stephen_dunn.mov">Stephen Dunn and the Guardian: How being open makes them better</a></p>
<p>Stephen is from <a href="http://theguardian.co.uk">The Guardian</a>, the second oldest newspaper in the United Kingdom.  If you&#8217;ve been following along, you know that newspapers have been having a hard time of it lately, but the Guardian is in a unique (or at least very rare) position; the paper is actually owned by a trust, so short-term profits can take a back seat to long-term planning and experimentation.  That gave the Guardian the freedom to experiment with their online presence.</p>
<p>This results in two interesting aspects of his talk.  The first is technological; in 1999, when they moved online, they found that they had steeply increasing traffic.  That&#8217;s good, of course, but that also meant increasing load on their database, which is not so good.</p>
<p>So they started doing searching through Solr, and the load on their database leveled off, even though traffic increased.  So the Guardian thought, &#8220;great, what else can we move to search?&#8221;  The answer turned out to be &#8220;everything&#8221;, it seems.</p>
<p>Basically, being able to find data using search (as opposed to using an RDBMS) enables the Guardian to go from being a publisher to being a platform.  After requesting an <a href="http://guardian.mashery.com/" target="_blank">API key</a>, developers can basically do whatever they want with not just current Guardian data, but also past articles, <a href="http://www.guardian.co.uk/data" target="_blank">curated data</a>, and also information from a separate <a href="http://www.guardian.co.uk/open-platform/blog/announcing-the-guardian-politics-api" target="_blank">Politics API</a> that records voting records and other political data.</p>
<p>Of course from a  business standpoint, the question is, &#8220;how can this possibly be a good idea?&#8221;  The answer turns out to be twofold.</p>
<p>First off, articles come with advertising, so even if they&#8217;re being displayed in someone else&#8217;s application, the ads are still being displayed and to a new audience, so that&#8217;s additional revenue for the Guardian.  That one&#8217;s obvious.</p>
<p>The second reason isn&#8217;t so obvious.  Basically it comes down to an acknowledgement that there are people who know more about the topics the Guardian covers than its own staff do, and by keeping its data open, rather than locking it behind a pay wall (as some of its contemporaries have done or are doing), the Guardian makes it possible to get participation from those people.  The example that he gave is the <a href="http://www.guardian.co.uk/world/interactive/2011/mar/22/middle-east-protest-interactive-timeline" target="_blank">Arab Spring</a>, where people in the region basically followed Al Jazeera and The Guardian.  These people also responded to Guardian content in social media, so where their competitors might get a dozen comments, all from people in the UK, the Guardian had hundreds of tweets (or more) from people with a variety of perspectives.</p>
<p><a href="http://www.guardian.co.uk/open-platform" target="_blank">The Open Platform</a>, as they call it, also provides another way to get additional perspectives; Guardian data can be built into applications that use their <a href="http://www.guardian.co.uk/open-platform/what-is-the-microapp-framework" target="_target">Micro Apps API</a>, which then enables those apps to be integrated back into Guardian sites and platforms (or other sites).  Similarly, their API enabled them to create a WordPress plugin.  That plugin then in turn enabled them to get outside subject matter experts, such as scientists, to blog for them and have other content added into the Guardian CMS.</p>
<p>So overall, the capabilities provided by using Lucene/Solr to search their data provides an opportunity for the Guardian to open themselves up as a platform, which in turn makes them a better publisher.</p>
<p>Definitely a win, and a great way to start the day.</p>
<p><em>Cross-posted with <a href="http://www.lucenerevolution.org/blog/?p=126">Lucene Revolution Blog</a>. Nicholas Chase is a guest blogger. This is one of a series of presentation summaries from the conference.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/06/01/stephen-dunn-and-the-guardian-how-being-open-makes-them-better/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
<enclosure url="http://www.lucidimagination.com/files/video/lr2011/lucene_revolution_2011_stephen_dunn.mov" length="0" type="video/quicktime" />
		</item>
	</channel>
</rss>

