<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; Lucene</title>
	<atom:link href="http://www.lucidimagination.com/blog/category/lucene/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Why Not AND, OR, And NOT?</title>
		<link>http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/</link>
		<comments>http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/#comments</comments>
		<pubDate>Wed, 28 Dec 2011 23:23:24 +0000</pubDate>
		<dc:creator>hossman</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[query parser]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4572</guid>
		<description><![CDATA[I really dislike the so called "Boolean Operators" ("AND", "OR", and "NOT") and generally discourage people from using them.  It's understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it's a good idea to try to "set aside childish things" and start thinking (and encouraging your users to think) in terms of the superior "Prefix Operators" ("+", "-").]]></description>
			<content:encoded><![CDATA[<p><i>The following is written with Solr users in mind, but the principles apply to Lucene users as well.</i></p>
<p>
I really dislike the so called &#8220;Boolean Operators&#8221; (&#8220;AND&#8221;, &#8220;OR&#8221;, and &#8220;NOT&#8221;) and generally discourage people from using them.  It&#8217;s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it&#8217;s a good idea to try to &#8220;set aside childish things&#8221; and start thinking (and encouraging your users to think) in terms of the superior &#8220;Prefix Operators&#8221; (&#8220;+&#8221;, &#8220;-&#8221;).
</p>
<h2>Background: Boolean Logic Makes For Terrible Scores</h2>
<p>
<a href="https://en.wikipedia.org/wiki/Boolean_algebra">Boolean Algebra</a> is (as my father would put it) &#8220;pretty neat stuff&#8221; and the world as we know it most certainly wouldn&#8217;t exist without it.  But when it comes to building a search engine, boolean logic tends to not be very helpful.  Depending on how you look at it, boolean logic is all about truth values and/or set intersections.  In either case, there is no concept of &#8220;relevancy&#8221; &#8212; either something is true or it&#8217;s false; either it is in a set, or it is not in the set.
</p>
<p>
When a user is looking for &#8220;all documents that contain the word &#8216;Alligator&#8217;&#8221; they aren&#8217;t going to be very happy if a search system applies simple boolean logic to just identify the <i>unordered set</i> of all matching documents.  Instead algorithms like TF/IDF are used to try and identify the <i>ordered list</i> of matching documents, such that the &#8220;best&#8221; matches come first.  Likewise, if a user is looking for &#8220;all documents that contain the words &#8216;Alligator&#8217; or &#8216;Crocodile&#8217;&#8221;, a simple boolean logic union of the sets of documents from the individual queries would not generate results as good as a query that took into account the TF/IDF scores of the documents for the individual queries, as well as considering which documents match <i>both</i> queries.  (The user is probably more interested in a document that discusses the similarities and differences between Alligators and Crocodiles than in documents that only mention one or the other a great many times.)
</p>
<p>
This brings us to the crux of why I think it&#8217;s a bad idea to use the &#8220;Boolean Operators&#8221; in query strings: because it&#8217;s not how the underlying query structures actually work, and it&#8217;s not as expressive as the alternative for describing what you want.
</p>
<h2>BooleanQuery: Great Class, Bad Name</h2>
<p>
To really understand why the boolean operators are inferior to the prefix operators, you have to start by considering the underlying implementation.  The <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a> class is probably one of the most misleading class names in the entire Lucene code base because it doesn&#8217;t model simple boolean logic query operations at all.  The basic function of a BooleanQuery is:
</p>
<ol>
<li>
A BooleanQuery consists of one or more <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/BooleanClause.html">BooleanClauses</a>, each of which contains two pieces of information:
<ul>
<li>A nested Query</li>
<li>An <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/BooleanClause.Occur.html">Occur</a> flag, which has one of three values
<ul>
<li><code>MUST</code> &#8211; indicating that documents must match this nested Query in order for the document to match the BooleanQuery, and the score from this subquery should contribute to the score for the BooleanQuery</li>
<li><code>MUST_NOT</code> &#8211; indicating that documents which match this nested Query are prohibited from matching the BooleanQuery</li>
<li><code>SHOULD</code> &#8211; indicating that documents which match this nested Query should have their score from the nested query contribute to the score from the BooleanQuery, but documents can be a match for the BooleanQuery even if they do not match the nested query</li>
</ul>
</li>
</ul>
</li>
<li>
If a BooleanQuery contains no <code>MUST</code> BooleanClauses, then a document is only considered a match against the BooleanQuery if one or more of the <code>SHOULD</code> BooleanClauses is a match.
</li>
<li>
The final score of a document which matches a BooleanQuery is based on the sum of the scores from all the matching <code>MUST</code> and <code>SHOULD</code> BooleanClauses, multiplied by a &#8220;coord factor&#8221; based on the ratio of the number of matching BooleanClauses to the total number of BooleanClauses in the BooleanQuery.
</li>
</ol>
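<p>
To make the matching rules above concrete, here is a minimal sketch in plain Java.  It is not Lucene code: the <code>Clause</code> class, the helper methods, and the idea of treating a document as a bare set of terms are all assumptions made for illustration, and nested queries and scoring are left out entirely.
</p>

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model of the BooleanQuery matching rules; NOT the Lucene implementation.
class BooleanMatchSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // Hypothetical stand-in for a BooleanClause wrapping a single-term query.
    static final class Clause {
        final String term;
        final Occur occur;
        Clause(String term, Occur occur) { this.term = term; this.occur = occur; }
    }

    static Clause must(String t)    { return new Clause(t, Occur.MUST); }
    static Clause should(String t)  { return new Clause(t, Occur.SHOULD); }
    static Clause mustNot(String t) { return new Clause(t, Occur.MUST_NOT); }

    // Assumption: a "document" is just the set of terms it contains.
    static Set<String> docOf(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    static boolean matches(Set<String> doc, Clause... clauses) {
        boolean hasMust = false;
        boolean shouldMatched = false;
        for (Clause c : clauses) {
            boolean hit = doc.contains(c.term);
            if (c.occur == Occur.MUST) {
                hasMust = true;
                if (!hit) return false;      // a MUST clause failed
            } else if (c.occur == Occur.MUST_NOT) {
                if (hit) return false;       // a prohibited clause matched
            } else if (hit) {
                shouldMatched = true;        // remember that some SHOULD clause matched
            }
        }
        // With no MUST clauses, at least one SHOULD clause has to match.
        return hasMust || shouldMatched;
    }

    public static void main(String[] args) {
        System.out.println(matches(docOf("x"), should("x"), should("y")));       // true
        System.out.println(matches(docOf("y"), must("x"), should("y")));         // false
        System.out.println(matches(docOf("x", "z"), should("x"), mustNot("z"))); // false
    }
}
```

<p>
Note that in this sketch a query made up only of <code>MUST_NOT</code> clauses matches nothing, since there is no <code>MUST</code> or <code>SHOULD</code> clause to select documents in the first place.
</p>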
<p>
These rules are not exactly simple to understand.  They are certainly more complex than boolean logic truth tables, but that&#8217;s because they are more powerful.  The examples below show how easy it is to implement &#8220;pure&#8221; boolean logic with BooleanQuery objects, but they only scratch the surface of what is possible with the BooleanQuery class:
</p>
<ul>
<li>
<p><b>Conjunction:</b> <code>(X &and; Y)</code></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.MUST);
q.add(Y, Occur.MUST);
</pre>
</li>
<li>
<p><b>Disjunction:</b> <code>(X &or; Y)</code></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Y, Occur.SHOULD);
</pre>
</li>
<li>
<p><b>Negation:</b> <code>(X &not; Z)</code></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
</ul>
<h2>Query Parser: Prefix Operators</h2>
<p>
In the Lucene <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/queryParser/QueryParser.html">QueryParser</a> (and all of the other parsers that are based on it, like DisMax and EDisMax) the &#8220;prefix&#8221; operators &#8220;+&#8221; and &#8220;-&#8221; map directly to the Occur.MUST and Occur.MUST_NOT flags, while the <i>absence</i> of a prefix maps to the Occur.SHOULD flag by default. <i>(If you have any suggestions for a one character prefix syntax that could be used to explicitly indicate Occur.SHOULD, please comment with your suggestions, I&#8217;ve been trying to think of a good one for years.)</i>  So using the prefix syntax, you can express <i>all</i> of the permutations that the BooleanQuery class supports &#8212; not just simple boolean logic:
</p>
<ul>
<li>
<p><b><code>(+X +Y)</code></b> <i>&#8230; Conjunction, ie: (X &and; Y)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.MUST);
q.add(Y, Occur.MUST);
</pre>
</li>
<li>
<p><b><code>(X Y)</code></b> <i>&#8230; Disjunction, ie: (X &or; Y)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Y, Occur.SHOULD);
</pre>
</li>
<li>
<p><b><code>(+X -Z)</code></b> <i>&#8230; Negation, ie: (X &not; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.MUST);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
<li>
<p><b><code>((X Y) -Z)</code></b> <i>&#8230; ((X &or; Y) &not; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
inner.add(X, Occur.SHOULD);
inner.add(Y, Occur.SHOULD);
q.add(inner, Occur.SHOULD);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
<li>
<p><b><code>(X Y -Z)</code></b> <i>&#8230; Not expressible in simple boolean logic</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Y, Occur.SHOULD);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
</ul>
<p>
Note in particular the differences between the last two examples.  <code>(X Y -Z)</code> is a single BooleanQuery object containing three clauses, while <code>((X Y) -Z)</code> is a BooleanQuery containing two clauses, one of which is a nested BooleanQuery containing two clauses.  In both cases a document must match either &#8220;X&#8221; or &#8220;Y&#8221; and it can not match against &#8220;Z&#8221; (so the set of documents matched by each query will be the same) and in both cases the score of a document will be higher if it matches both the &#8220;X&#8221; and &#8220;Y&#8221; clauses; but because of the difference in their structures, the <i>scores</i> for these queries will be different for every document.  In particular, the coord factor will cause documents matching only one of &#8220;X&#8221; or &#8220;Y&#8221; (but not both) to have extremely different scores from each of these queries. <i>(This assumes that the <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> is being used; it would be possible to write a custom Similarity to force the scores to be equivalent)</i>
</p>
<h2>Query Parser: &#8220;Boolean Operators&#8221;</h2>
<p>
The query parser also supports the so called &#8220;boolean operators&#8221; which can also be used to express boolean logic, as demonstrated in these examples:
</p>
<ul>
<li>
<p><b><code>(X AND Y)</code></b> <i>&#8230; Conjunction, ie: (X &and; Y)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.MUST);
q.add(Y, Occur.MUST);
</pre>
</li>
<li>
<p><b><code>(X OR Y)</code></b> <i>&#8230; Disjunction, ie: (X &or; Y)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Y, Occur.SHOULD);
</pre>
</li>
<li>
<p><b><code>(X NOT Z)</code></b> <i>&#8230; Negation, ie: (X &not; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
<li>
<p><b><code>((X AND Y) OR Z)</code></b> <i>&#8230; ((X &and; Y) &or; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
inner.add(X, Occur.MUST);
inner.add(Y, Occur.MUST);
q.add(inner, Occur.SHOULD);
q.add(Z, Occur.SHOULD);
</pre>
</li>
<li>
<p><b><code>((X OR Y) AND Z)</code></b> <i>&#8230; ((X &or; Y) &and; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
inner.add(X, Occur.SHOULD);
inner.add(Y, Occur.SHOULD);
q.add(inner, Occur.MUST);
q.add(Z, Occur.MUST);
</pre>
</li>
<li>
<p><b><code>(X AND (Y NOT Z))</code></b> <i>&#8230; (X &and; (Y &not; Z))</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
inner.add(Y, Occur.MUST);
inner.add(Z, Occur.MUST_NOT);
q.add(X, Occur.MUST);
q.add(inner, Occur.MUST);
</pre>
</li>
</ul>
<p>
Please note how important it is to use parentheses to combine multiple operators in order to generate queries that correctly model boolean logic. As mentioned before, the BooleanQuery class supports an arbitrary number of clauses, so <code>(X OR Y OR Z)</code> is a single BooleanQuery with three clauses &#8212; it is not equivalent to either <code>((X OR Y) OR Z)</code> <i>or</i> <code>(X OR (Y OR Z))</code> because those result in a BooleanQuery with two clauses, one of which is a nested BooleanQuery.  As mentioned above when discussing the prefix operators, the scores from each of those queries will all be different depending on which clauses are matched by each document.
</p>
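<p>
A rough worked example of why the flat and nested forms score differently.  This is a simplified model, not Lucene itself: every term is assumed to have a made-up score of 1.0, query norms and boosts are ignored, and the coord factor is taken to be (matching clauses / total clauses) as described earlier.  The <code>Node</code> interface and helper names are hypothetical.
</p>

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Simplified coord-factor arithmetic for all-SHOULD BooleanQueries; NOT Lucene code.
class CoordSketch {
    // A score of 0.0 means "this (sub)query did not match the document".
    interface Node { double score(Set<String> doc); }

    // Hypothetical term query with a fixed score when the term is present.
    static Node term(String t, double score) {
        return doc -> doc.contains(t) ? score : 0.0;
    }

    // A BooleanQuery whose clauses are all SHOULD:
    // sum of matching clause scores, times (matched / total).
    static Node anyOf(Node... clauses) {
        return doc -> {
            double sum = 0.0;
            int matched = 0;
            for (Node c : clauses) {
                double s = c.score(doc);
                if (s > 0.0) { sum += s; matched++; }
            }
            return sum * matched / clauses.length;
        };
    }

    static Set<String> docOf(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    public static void main(String[] args) {
        Node x = term("x", 1.0), y = term("y", 1.0), z = term("z", 1.0);
        Node flat = anyOf(x, y, z);            // (X OR Y OR Z)
        Node nested = anyOf(anyOf(x, y), z);   // ((X OR Y) OR Z)

        // Only X matches: flat is 1/3, nested is (1 * 1/2) * 1/2 = 1/4.
        System.out.println(flat.score(docOf("x")) + " vs " + nested.score(docOf("x")));
        // Only Z matches: flat is 1/3, nested is 1 * 1/2 = 1/2.
        System.out.println(flat.score(docOf("z")) + " vs " + nested.score(docOf("z")));
    }
}
```

<p>
Under these assumptions a document matching all three terms scores 3.0 either way; it is the partial matches that pull the two structures apart.
</p>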
<p>
Things definitely get very confusing when these &#8220;boolean operators&#8221; are used in ways other than those described above.  In some cases this is because the query parser is trying to be forgiving about &#8220;natural language&#8221; style usage of operators that many boolean logic systems would consider a parse error.  In other cases, the behavior is bizarrely esoteric:
</p>
<ul>
<li>Queries are parsed left to right</li>
<li><code>NOT</code> sets the Occur flag of the clause to its right to <code>MUST_NOT</code></li>
<li><code>AND</code> will change the Occur flag of the clause to its left to <code>MUST</code> unless it has already been set to <code>MUST_NOT</code></li>
<li><code>AND</code> sets the Occur flag of the clause to its right to <code>MUST</code></li>
<li><i>If</i> the <a href="https://wiki.apache.org/solr/SolrQuerySyntax">default operator</a> of the query parser has been set to &#8220;And&#8221;: <code>OR</code> will change the Occur flag of the clause to its left to <code>SHOULD</code> unless it has already been set to <code>MUST_NOT</code></li>
<li><code>OR</code> sets the Occur flag of the clause to its right to <code>SHOULD</code></li>
</ul>
<p>
Practically speaking this means that <code>NOT</code> takes precedence over <code>AND</code> which takes precedence over <code>OR</code> &#8212; but only if the default operator for the query parser has not been changed from the default (&#8220;Or&#8221;).  If the default operator is set to &#8220;And&#8221; then the behavior is just plain weird.
</p>
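<p>
The left-to-right rules listed above can be played out in a small simulation.  This is only a sketch of those rules as stated, not the actual QueryParser code: it assumes the query is a whitespace-separated string of bare terms and the three operators (no parentheses, phrases, or prefix operators), and the class and method names are made up for illustration.  The output is written in prefix-operator syntax.
</p>

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of the AND/OR/NOT rules described above; NOT the Lucene QueryParser.
class OperatorRulesSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // Returns the resulting clauses in prefix syntax, e.g. "+X +Y Z".
    static String toPrefixSyntax(String query, boolean defaultIsAnd) {
        List<String> terms = new ArrayList<>();
        List<Occur> occurs = new ArrayList<>();
        String conj = "";         // operator seen since the previous term: "", "AND" or "OR"
        boolean negated = false;  // true if NOT precedes the next term

        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            if (token.equals("AND") || token.equals("OR")) { conj = token; continue; }
            if (token.equals("NOT")) { negated = true; continue; }

            // AND (or OR, when the default operator is "And") rewrites the clause
            // to its left, unless that clause is already MUST_NOT.
            int last = occurs.size() - 1;
            if (last >= 0 && occurs.get(last) != Occur.MUST_NOT) {
                if (conj.equals("AND")) occurs.set(last, Occur.MUST);
                else if (conj.equals("OR") && defaultIsAnd) occurs.set(last, Occur.SHOULD);
            }

            // Decide the flag for the clause to the right of the operator.
            Occur occur;
            if (negated) occur = Occur.MUST_NOT;
            else if (conj.equals("AND")) occur = Occur.MUST;
            else if (conj.equals("OR")) occur = Occur.SHOULD;
            else occur = defaultIsAnd ? Occur.MUST : Occur.SHOULD;

            terms.add(token);
            occurs.add(occur);
            conj = "";
            negated = false;
        }

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) sb.append(' ');
            if (occurs.get(i) == Occur.MUST) sb.append('+');
            if (occurs.get(i) == Occur.MUST_NOT) sb.append('-');
            sb.append(terms.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPrefixSyntax("X AND Y OR Z", false)); // +X +Y Z
        System.out.println(toPrefixSyntax("X OR Y AND Z", false)); // X +Y +Z
        System.out.println(toPrefixSyntax("X AND Y OR Z", true));  // +X Y Z
    }
}
```

<p>
The last line is the kind of result that makes the &#8220;And&#8221; default so surprising: in this model <code>X AND Y OR Z</code> ends up requiring only X.
</p>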
<h2>In Conclusion</h2>
<p>
I won&#8217;t try to defend or justify the way the query parser behaves when it encounters these &#8220;boolean operators&#8221;, because in many cases I don&#8217;t understand or agree with the behavior myself &#8212; but that&#8217;s not really the point of this article.  My goal isn&#8217;t to convince you that the behavior of these operators makes sense; quite the contrary, my goal today is to point out that regardless of how these operators are parsed, they aren&#8217;t a good representation of the underlying functionality available in the BooleanQuery class.
</p>
<p>
Do yourself a favor and start <i>thinking</i> about BooleanQuery as a container of arbitrary nested queries annotated with <code>MUST</code>, <code>MUST_NOT</code>, or <code>SHOULD</code> and discover the power that is available to you beyond simple boolean logic.
</p>
<p><i>Many thanks to Bill Dueber whose <a href="http://robotlibrarian.billdueber.com/solr-and-boolean-operators/">recent related blog post</a> reminded me that I had some draft notes on this subject floating around my laptop waiting to be finished up and posted online.</i></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>The Rich Web Experience</title>
		<link>http://www.lucidimagination.com/blog/2011/11/15/the-rich-web-experience/</link>
		<comments>http://www.lucidimagination.com/blog/2011/11/15/the-rich-web-experience/#comments</comments>
		<pubDate>Tue, 15 Nov 2011 19:16:08 +0000</pubDate>
		<dc:creator>Erik Hatcher</dc:creator>
				<category><![CDATA[Events]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4468</guid>
		<description><![CDATA[[ Tuesday, 29 November 2011 to Friday, 2 December 2011. ] <p><a href="http://therichwebexperience.com/conference/fort_lauderdale/2011/11/home"><img class="alignnone" title="The Rich Web Experience 2011" src="http://therichwebexperience.com/images/2011/header/richwebexperience.png" alt="" width="600" height="91" /></a></p>
<p>I&#8217;ll be speaking at the upcoming <a href="http://therichwebexperience.com/conference/fort_lauderdale/2011/11/home">Rich Web Experience conference</a> in Ft. Lauderdale, presenting an &#8220;Introduction to Solr&#8221;, &#8220;Solr Recipes&#8221;, and &#8220;Lucene for Solr Developers&#8221;.  I&#8217;ll be tying all of these presentations together into a cohesive search/Solr track going from the introduction, to recipes for common tasks, through advanced customization of Solr.&#8230;</p>]]></description>
			<content:encoded><![CDATA[[ Tuesday, 29 November 2011 to Friday, 2 December 2011. ] <p><a href="http://therichwebexperience.com/conference/fort_lauderdale/2011/11/home"><img class="alignnone" title="The Rich Web Experience 2011" src="http://therichwebexperience.com/images/2011/header/richwebexperience.png" alt="" width="600" height="91" /></a></p>
<p>I&#8217;ll be speaking at the upcoming <a href="http://therichwebexperience.com/conference/fort_lauderdale/2011/11/home">Rich Web Experience conference</a> in Ft. Lauderdale, presenting an &#8220;Introduction to Solr&#8221;, &#8220;Solr Recipes&#8221;, and &#8220;Lucene for Solr Developers&#8221;.  I&#8217;ll be tying all of these presentations together into a cohesive search/Solr track going from the introduction, to recipes for common tasks, through advanced customization of Solr.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/11/15/the-rich-web-experience/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bet You Didn&#8217;t Know Lucene Can&#8230;</title>
		<link>http://www.lucidimagination.com/blog/2011/11/14/bet-you-didnt-know-lucene-can/</link>
		<comments>http://www.lucidimagination.com/blog/2011/11/14/bet-you-didnt-know-lucene-can/#comments</comments>
		<pubDate>Mon, 14 Nov 2011 15:43:36 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4418</guid>
		<description><![CDATA[<p>Here are my ApacheCon 2011 slides for my talk &#8220;Bet You Didn&#8217;t Know Lucene Can&#8230;&#8221; :</p>
<div id="__ss_10155480" style="width: 425px;"><strong style="display: block; margin: 12px 0 4px;"><a title="Bet you didn't know Lucene can..." href="http://www.slideshare.net/gsingers/bet-you-didnt-know-lucene-can">Bet you didn&#8217;t know Lucene can&#8230;</a></strong>
<div style="padding: 5px 0 12px;">View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/gsingers">gsingers</a>.</div>
&#8230;</div>]]></description>
			<content:encoded><![CDATA[<p>Here are my ApacheCon 2011 slides for my talk &#8220;Bet You Didn&#8217;t Know Lucene Can&#8230;&#8221; :</p>
<div id="__ss_10155480" style="width: 425px;"><strong style="display: block; margin: 12px 0 4px;"><a title="Bet you didn't know Lucene can..." href="http://www.slideshare.net/gsingers/bet-you-didnt-know-lucene-can">Bet you didn&#8217;t know Lucene can&#8230;</a></strong><object id="__sse10155480" width="425" height="355" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="allowScriptAccess" value="always" /><param name="src" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=lucenecan-111114094003-phpapp01&amp;stripped_title=bet-you-didnt-know-lucene-can&amp;userName=gsingers" /><param name="allowscriptaccess" value="always" /><param name="allowfullscreen" value="true" /><embed id="__sse10155480" width="425" height="355" type="application/x-shockwave-flash" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=lucenecan-111114094003-phpapp01&amp;stripped_title=bet-you-didnt-know-lucene-can&amp;userName=gsingers" allowFullScreen="true" allowScriptAccess="always" allowscriptaccess="always" allowfullscreen="true" /></object></p>
<div style="padding: 5px 0 12px;">View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/gsingers">gsingers</a>.</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/11/14/bet-you-didnt-know-lucene-can/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Happy Anniversary, Lucene!  10 years at the ASF</title>
		<link>http://www.lucidimagination.com/blog/2011/09/18/happy-anniversary-lucene-10-years-at-the-asf-3/</link>
		<comments>http://www.lucidimagination.com/blog/2011/09/18/happy-anniversary-lucene-10-years-at-the-asf-3/#comments</comments>
		<pubDate>Sun, 18 Sep 2011 18:05:38 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4050</guid>
		<description><![CDATA[<p>From a quiet start as a pet project to a giant in the industry, <a href="http://lucene.apache.org">Apache Lucene</a> is definitely the little (search) engine that could.  On September 18th, 2001 (at 16:29:48 UTC) Jason Van Zyl made the first <a href="http://svn.apache.org/viewvc?view=revision&#38;revision=149570">official import</a> of Doug Cutting&#8217;s Lucene project (which started in 1997 and was hosted on SourceForge) into <a href="http://www.apache.org">Apache&#8217;s</a> Jakarta project (check out the <a href="http://web.archive.org/web/20011202174653/http://jakarta.apache.org/">Wayback machine</a>).</p>
<p>And while I wasn&#8217;t around in the beginning, I thought I would &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>From a quiet start as a pet project to a giant in the industry, <a href="http://lucene.apache.org">Apache Lucene</a> is definitely the little (search) engine that could.  On September 18th, 2001 (at 16:29:48 UTC) Jason Van Zyl made the first <a href="http://svn.apache.org/viewvc?view=revision&amp;revision=149570">official import</a> of Doug Cutting&#8217;s Lucene project (which started in 1997 and was hosted on SourceForge) into <a href="http://www.apache.org">Apache&#8217;s</a> Jakarta project (check out the <a href="http://web.archive.org/web/20011202174653/http://jakarta.apache.org/">Wayback machine</a>).</p>
<p>And while I wasn&#8217;t around in the beginning, I thought I would offer up some (little) known tidbits, links, etc. about Lucene as an ode to the search library that has significantly changed the search world, as well as my own career:</p>
<ol>
<li>Lucene was <a href="http://www.lucidimagination.com/devzone/videos-podcasts/podcasts/interview-doug-cutting">Doug&#8217;s way of learning Java</a>!  How&#8217;s that for a start?  It took him 3 months, working 2 days a week to crank out the first version.</li>
<li>At the time, some commercial search engines could not do incremental updates of the index, meaning you had to re-index all your documents anytime you had an update.  Lucene has always had an incremental model, all the way through to today&#8217;s Near Real Time features that power the likes of <a href="http://www.twitter.com">Twitter</a> at 1 billion+ searches and 100M+ new documents per day.</li>
<li><a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/document/Field.java?view=markup&amp;pathrev=149570">Field myField = Field.Text(&#8220;foo&#8221;, &#8220;bar&#8221;)</a>; anyone?  Or how about Field myField = Field.UnIndexed(&#8220;foo&#8221;, &#8220;bar&#8221;);</li>
<li>Back then, Lucene had its own <a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/PorterStemmer.java?view=markup&amp;pathrev=149570">PorterStemmer</a>; now we just use <a href="http://snowball.tartarus.org">Snowball</a>.</li>
<li>Only 1 of the <a href="http://web.archive.org/web/20020213045032/http://jakarta.apache.org/lucene/docs/whoweare.html">original committers</a> still remains somewhat active.</li>
<li>Read the old <a href="http://web.archive.org/web/20020203084504/http://www.lucene.com/cgi-bin/faq/faqmanager.cgi">FAQ</a>!  True as it ever was.  (Mostly)</li>
<li>Lucene 2.3 drastically improved indexing performance thanks to a thorough overhaul of the innards while barely affecting the API.  4.0 will blow the doors off of previous versions in terms of speed and efficiency.</li>
<li>Lucene is Doug&#8217;s wife&#8217;s <a href="http://www.lucidimagination.com/devzone/videos-podcasts/podcasts/interview-doug-cutting">middle name</a>.</li>
<li>Lucene has evolved from offering a single vector space scoring model to one that now <a href="http://www.lucidimagination.com/blog/2011/09/12/flexible-ranking-in-lucene-4/">offers plug-n-play</a> ranking (BM25 anyone?)</li>
<li>Lucene is ubiquitous.  It powers search on everything from mobile devices to web scale engines.  I&#8217;ve seen indexes as small as 15% of the original content.  I&#8217;ve also seen indexes grow to several billion documents in size.  Lucene has been used as a caching store, an ORM, a cross language search engine, the guts of the popular <a href="http://lucene.apache.org/solr">Solr</a> search server, the retrieval engine for IBM&#8217;s <a href="http://www-03.ibm.com/innovation/us/watson/index.html">Watson</a> as well as several commercial search engines and pretty much everything in between.</li>
<li>Did you know <a href="http://hadoop.apache.org">Apache Hadoop</a> started as a subproject of Lucene?  Doug Cutting and Mike Cafarella first built out Hadoop in order to scale out indexing for the <a href="http://nutch.apache.org">Apache Nutch</a> project.  From there it was spun out to be a top level ASF project and has gone on to be the de facto choice for large scale distributed processing, much like Lucene is the de facto choice for search!  Lucene has also spun out <a href="http://mahout.apache.org">Mahout</a>, <a href="http://tika.apache.org">Tika</a>, Lucene.NET and Lucy!</li>
</ol>
<p>As for how Lucene&#8217;s impacted me?  In 2004, I took a job at the <a href="http://www.cnlp.org">Center for Natural Language Processing</a> at Syracuse University working for Dr. Liz Liddy.  My job was to build an Arabic-English cross language search engine.  Within a day or two of starting, <a href="http://www.linkedin.com/profile/view?id=10139209&amp;trk=tyah">Ozgur Yilmazel</a> (my boss at the time) said something to the effect of &#8220;we&#8217;ll be using Lucene for the implementation.  Go learn it.&#8221;  Digging in, I quickly needed a couple of features, the biggest one being Term Vectors, so I updated a patch from an earlier version of Lucene and managed to convince the committers at the time to commit it.  From there, I kept supplying patches.  Eventually, I was asked to be a committer.  Some time after that, Yonik Seeley and Marc Krellenstein approached a bunch of the committers about starting a company and here I am today at the company we (Erik, Yonik, Marc and I) founded back in 2007, <a href="http://www.lucidimagination.com">Lucid Imagination</a>.  I feel fortunate to have the opportunity to work on hard problems in an interesting field and for that, Lucene, in no small part, I thank  you.</p>
<p>But enough of my self-indulgence, how has Lucene impacted you?  When did you first start using it?  What&#8217;s your biggest index or fastest QPS?   What ways have you used Lucene beyond that of a search engine?  Leave a comment and let us know.</p>
<p>Happy 10th Anniversary, Lucene!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/09/18/happy-anniversary-lucene-10-years-at-the-asf-3/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Estimating Memory and Storage for Lucene/Solr</title>
		<link>http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/</link>
		<comments>http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/#comments</comments>
		<pubDate>Wed, 14 Sep 2011 17:27:00 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[disk usage]]></category>
		<category><![CDATA[Grant Ingersoll]]></category>
		<category><![CDATA[memory]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4000</guid>
		<description><![CDATA[<p>Many times, clients ask us to help them estimate memory usage or disk space usage or to share benchmarks as they build out their search system. Doing so is always an interesting process, as I&#8217;ve always been wary of claims about benchmarks (for instance, one of the old tricks of performance benchmark hacking is to &#8220;cat XXX &#62; /dev/null&#8221; to load everything into memory first, which isn&#8217;t what most people do when running their system) &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Many times, clients ask us to help them estimate memory usage or disk space usage or to share benchmarks as they build out their search system. Doing so is always an interesting process, as I&#8217;ve always been wary of claims about benchmarks (for instance, one of the old tricks of performance benchmark hacking is to &#8220;cat XXX &gt; /dev/null&#8221; to load everything into memory first, which isn&#8217;t what most people do when running their system) and/or estimates because I know there are so many variables involved that it is possible to vary the results quite significantly depending on marketing goals. Thus, I tend to be pragmatic (as I think the Lucene/Solr community is as well) and focus on what my tests show for my specific data and my specific use cases.</p>
<p>For instance, for testing memory, it&#8217;s pretty easy to set up a series of tests that start with a small heap size and successively grow it until no Out Of Memory Errors (OOME) occur. Then, to be on the safe side, add 1 GB of memory to the heap.  It works for the large majority of people. Ironically, for Solr at least, this usually ends up with a heap size somewhere between 6-12 GB for a system doing &#8220;consumer search&#8221; with faceting, etc. and reasonably sized caches on an index in the 10-50 million docs range. Sure, there are systems that go beyond this or are significantly less (I just saw one the other day that has around 200M docs in less than 3 GB of RAM while handling decent search load), but the 6-12 GB seems to be a nice sweet spot for the application and the JVM, especially when it comes to garbage collection, while still giving the operating system enough room to do its job.  Too much heap and garbage may pile up and give you <em>ohmygodstoptheworld</em> full garbage collections at some point in the future.  Too little heap and you get the dreaded OOME.  Also too much heap relative to total RAM and you choke off the OS.  Besides, that range also has a nice business/operations side effect in that 16 GB of RAM has a nice performance/cost benefit for many people.</p>
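<p>The &#8220;grow the heap until the OOMEs stop, then add 1 GB&#8221; procedure can be sketched as a simple loop. This is only an illustration: the <code>runsWithoutOome</code> predicate is a hypothetical stand-in for actually running your own test suite against a JVM started with the given heap size, and the start, step, and ceiling values are arbitrary assumptions.</p>

```java
import java.util.function.IntPredicate;

// Sketch of the heap-sizing procedure described above; all numbers are illustrative.
class HeapSizingSketch {
    static final int SAFETY_MARGIN_MB = 1024; // the "add 1 GB to be on the safe side" step

    // Returns a suggested heap size in MB, or -1 if no tested size up to maxMb was enough.
    static int recommendHeapMb(int startMb, int stepMb, int maxMb, IntPredicate runsWithoutOome) {
        for (int heapMb = startMb; heapMb <= maxMb; heapMb += stepMb) {
            // Hypothetical: run the full test workload with this heap size.
            if (runsWithoutOome.test(heapMb)) {
                return heapMb + SAFETY_MARGIN_MB;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Pretend the workload happens to need roughly 5000 MB to survive.
        int suggestion = recommendHeapMb(1024, 512, 16384, mb -> mb >= 5000);
        System.out.println(suggestion + " MB"); // 6144 MB (5120 MB was the first passing size)
    }
}
```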
<p>Recently, however, I thought it would be good to get beyond the inherent hand waving above and attempt to come up with a theoretical (with a little bit of empiricism thrown in) model for estimating memory usage and disk space.   After a few discussions on <a href="http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2011-09-13 for assumptions">IRC</a> with McCandless and others, I put together a <span style="text-decoration: underline;"><strong>DRAFT</strong></span> <a href="http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls">Excel spreadsheet</a> that allows people to model both memory and disk space (based on the formula in <a href="http://www.lucidimagination.com/devzone/references/books-and-publications">Lucene in Action 2nd ed.</a> &#8211; LIA2), after filling in some assumptions about their applications (I put in defaults.)   First a few caveats:</p>
<ol>
<li>This is just an estimate; don&#8217;t mistake it for what you are actually seeing in your system.</li>
<li>It is a DRAFT.  It is likely missing a few things, but I am putting it up here and in Subversion as a means to gather feedback.  I reserve the right to have messed up the calculations.</li>
<li>I feel the values might be a little bit low for the memory estimator, especially the Lucene section.</li>
<li>It&#8217;s only good for <a href="http://svn.apache.org/repos/asf/lucene/dev/trunk">trunk</a>.  I don&#8217;t think it will be right for 3.3 or 3.4.</li>
<li>The goal is to try to establish a model for the &#8220;high water mark&#8221; of memory and disk, not necessarily the typical case.</li>
<li>It inherently assumes you are searching and indexing on the same machine, which is often not the case.</li>
<li>There are still a couple of TODOs in the model.  More to come later.</li>
</ol>
<p>As for using the memory estimator, the primary things to fill in are the number of documents, the number of unique terms, and information on sorting and indexed fields, but you can also mess with all of the other assumptions.  For Solr, there are entries for estimating cache memory usage.  Keep in mind that the model assumes the caches are full, which often is not the case and may not even be feasible.  For instance, your system may only ever employ 5 or 6 different filters.</p>
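<p>To see why the &#8220;caches are full&#8221; assumption matters, here is a rough Python sketch of a filter cache high water mark. It is not the spreadsheet&#8217;s formula. It assumes every cached filter is held as a bitset with one bit per document in the index, which is a worst case (small result sets may be stored more compactly), and the function and parameter names are made up for illustration.</p>

```python
def filter_cache_high_water_bytes(max_doc, cache_entries):
    """Worst-case bytes for a filter cache, under a stated assumption.

    Assumption (not taken from the spreadsheet): each cached filter is a
    bitset with one bit per document, so it costs about max_doc / 8 bytes.
    """
    bytes_per_entry = (max_doc + 7) // 8  # round up to whole bytes
    return bytes_per_entry * cache_entries
```

<p>With 50 million documents, a full cache of 512 entries works out to 3,200,000,000 bytes under this assumption, while the 6 filters a system might really use come to 37,500,000 bytes.</p>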
<p>The disk space estimator is much more straightforward and based on LIA2&#8217;s fairly simple formula of:</p>
<blockquote>
<div>disk space used(original) = 1/3 * original for each indexed field + 1 * original for each stored field + 2 * original for each field with term vectors</div>
</blockquote>
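<p>As a worked example, the formula above translates directly into a short Python function. This is a sketch of the rule of thumb only, not of the spreadsheet; the function and parameter names are made up for illustration, and each list holds the original (pre-indexing) size in bytes of the text going into each field of that kind.</p>

```python
def estimate_index_disk_bytes(indexed_field_bytes, stored_field_bytes,
                              term_vector_field_bytes):
    """Apply the LIA2 rule of thumb to per-field original text sizes.

    Each argument is a list of original sizes in bytes, one entry per
    field that is indexed, stored, or has term vectors, respectively.
    A field that is both indexed and stored appears in both lists.
    """
    indexed = sum(indexed_field_bytes) / 3       # 1/3 * original per indexed field
    stored = sum(stored_field_bytes)             # 1 * original per stored field
    vectors = 2 * sum(term_vector_field_bytes)   # 2 * original per term vector field
    return indexed + stored + vectors
```

<p>For instance, a 3,000 byte body that is indexed and stored, plus a 600 byte title that is indexed with term vectors, gives 1,200 + 3,000 + 1,200 = 5,400 bytes.</p>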
<p>It will be interesting to see how some of the new flexible indexing capabilities in trunk affect the results of this equation.  Also note, I&#8217;ve seen some applications where the size of the indexed fields is as low as 20%.</p>
<p>Hopefully, people will find this useful as well as enhance it and <a href="https://issues.apache.org/jira/browse/LUCENE-3435">fix any bugs</a> in it.  In other words, feedback is welcome.  As with any model like this, YMMV!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Flexible ranking in Lucene 4</title>
		<link>http://www.lucidimagination.com/blog/2011/09/12/flexible-ranking-in-lucene-4/</link>
		<comments>http://www.lucidimagination.com/blog/2011/09/12/flexible-ranking-in-lucene-4/#comments</comments>
		<pubDate>Mon, 12 Sep 2011 18:51:40 +0000</pubDate>
		<dc:creator>robert.muir</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Relevancy]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3951</guid>
		<description><![CDATA[<p>Over the summer I served as a <a title="Google Summer of Code" href="http://code.google.com/soc/">Google Summer of Code</a> mentor for David Nemeskey, a PhD student at <a title="Eötvös Loránd University" href="http://www.elte.hu/en">Eötvös Loránd University</a>. David proposed to improve Lucene&#8217;s scoring architecture and implement some state-of-the-art ranking models with the new framework.</p>
<p>These improvements are now committed to Lucene&#8217;s trunk: you can use these models in tandem with all of Lucene&#8217;s features (boosts, slops, explanations, etc) and queries (term, phrase, spans, etc). A <a title="SOLR-2754: create Solr similarity factories for new ranking algorithms" href="https://issues.apache.org/jira/browse/SOLR-2754">JIRA issue</a> has been created &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Over the summer I served as a <a title="Google Summer of Code" href="http://code.google.com/soc/">Google Summer of Code</a> mentor for David Nemeskey, a PhD student at <a title="Eötvös Loránd University" href="http://www.elte.hu/en">Eötvös Loránd University</a>. David proposed to improve Lucene&#8217;s scoring architecture and implement some state-of-the-art ranking models with the new framework.</p>
<p>These improvements are now committed to Lucene&#8217;s trunk: you can use these models in tandem with all of Lucene&#8217;s features (boosts, slops, explanations, etc) and queries (term, phrase, spans, etc). A <a title="SOLR-2754: create Solr similarity factories for new ranking algorithms" href="https://issues.apache.org/jira/browse/SOLR-2754">JIRA issue</a> has been created to make it easy to use these models from Solr&#8217;s schema.xml.</p>
<p>Relevance ranking is the heart of the search engine, and I hope the additional models and flexibility will improve the user experience for Lucene: whether you&#8217;ve been frustrated with tuning TF/IDF weights and find an alternative model works better for your case, found it difficult to integrate custom logic that your application needs, or just want to experiment.</p>
<p>I&#8217;ll be giving a talk about how you can practically apply some of the upcoming Lucene 4 search features at <a title="Improved Search with Lucene 4" href="http://2011.lucene-eurocon.org/talks/20851">Lucene Eurocon</a> in October, and at the <a title="What's New in Lucene/Solr Meetup" href="http://www.meetup.com/SFBay-Lucene-Solr-Meetup/events/32514312/">SFBay Apache Lucene/Solr Meetup</a> later this month.</p>
<p>Some bullet points of the new scoring features:</p>
<ul>
<li>New ranking algorithms, in addition to Lucene&#8217;s <a title="Vector Space Model" href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model</a>:
<ul>
<li><a title="Okapi BM25" href="http://en.wikipedia.org/wiki/Okapi_BM25">Okapi BM25 Model</a></li>
<li><a title="Language Model" href="http://en.wikipedia.org/wiki/Language_model">Language Models</a></li>
<li><a title="Probability models for information retrieval based on divergence from randomness" href="http://theses.gla.ac.uk/1570/">Divergence from Randomness Models</a></li>
<li><a title="Information-based models for ad hoc IR" href="http://dl.acm.org/citation.cfm?id=1835490">Information-based Models</a></li>
</ul>
</li>
<li>Added key statistics to the index format to support additional scoring models.
<ul>
<li>Term- and field-level statistics for collection frequencies and deriving averages.</li>
<li>Additional document-level statistics for computing normalization factors.</li>
</ul>
</li>
<li>Decoupled <em>matching</em> from <em>ranking</em> in Lucene&#8217;s core search classes:
<ul>
<li>Customize scoring without digging into the &#8220;guts&#8221;.</li>
<li>Customize <em>explanations</em>: essential for <a title="Debugging Search Application Relevance Issues" href="http://www.lucidimagination.com/devzone/technical-articles/debugging-search-application-relevance-issues">debugging relevance issues</a>.</li>
</ul>
</li>
<li>Powerful low-level <em>Similarity</em> API, supporting advanced use cases:
<ul>
<li>Incorporate per-document values from <a title="Heavy Committing: DocValues aka. Column Stride Fields in Lucene 4.0" href="http://www.lucidimagination.com/files/Willnauer%20Simon%20-%20DocValues%20Column%20Stride%20Fields%20in%20Lucene.pdf">Column Stride Fields</a> into the score.</li>
<li>Use different scoring parameters or algorithms for different fields.</li>
<li>Fuse multiple scoring algorithms into a combined score.</li>
</ul>
</li>
<li>Convenient high-level <em>SimilarityBase</em> for everything else:
<ul>
<li>Write your own scoring function in one Java method.</li>
<li>Easy access to available index statistics.</li>
</ul>
</li>
</ul>
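<p>To give a feel for how the index statistics listed above feed a ranking model, here is a small standalone Python sketch of one common textbook form of Okapi BM25 for a single term. It is not Lucene&#8217;s implementation and does not use the Similarity API. The function name, parameter names, and the defaults for <code>k1</code> and <code>b</code> are illustrative assumptions.</p>

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Score one term in one document with a textbook BM25 variant.

    tf          -- occurrences of the term in this document
    doc_len     -- length of this document's field (document-level statistic)
    avg_doc_len -- average field length (derived from field-level statistics)
    doc_count   -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    """
    # Rarer terms get a higher weight; the "1 +" keeps the weight positive.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b controls how strongly long documents are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: extra occurrences help less and less.
    return idf * tf * (k1 + 1) / (tf + norm)
```

<p>Note how the score depends on an average document length and a per-document length, which is exactly the kind of statistic a pure TF/IDF index format did not need to keep.</p>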
<p>For more information about this GSoC project, take a look at its <a title="Google Summer of Code 2011 - Implementing State of the Art Ranking for Lucene" href="http://wiki.apache.org/lucene-java/SummerOfCode2011ProjectRanking">wiki page</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/09/12/flexible-ranking-in-lucene-4/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Learn Lucene&#8230; deeper</title>
		<link>http://www.lucidimagination.com/blog/2011/09/12/learn-lucene/</link>
		<comments>http://www.lucidimagination.com/blog/2011/09/12/learn-lucene/#comments</comments>
		<pubDate>Mon, 12 Sep 2011 18:24:31 +0000</pubDate>
		<dc:creator>Erik Hatcher</dc:creator>
				<category><![CDATA[Events]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3988</guid>
		<description><![CDATA[<p>You&#8217;re using Solr, or some other Lucene-based search solution, &#8230; or you should and will be!  You are (or will be) building your solutions on top of a top-notch search library, Apache Lucene.</p>
<p>Solr makes using Lucene easier &#8211; you can index a variety of data sources easily, pretty much out of the box, and you can easily integrate features such as faceting, highlighting, and spellchecking &#8211; all without writing Java code. And if that&#8217;s &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>You&#8217;re using Solr, or some other Lucene-based search solution, &#8230; or you should and will be!  You are (or will be) building your solutions on top of a top-notch search library, Apache Lucene.</p>
<p>Solr makes using Lucene easier &#8211; you can index a variety of data sources easily, pretty much out of the box, and you can easily integrate features such as faceting, highlighting, and spellchecking &#8211; all without writing Java code. And if that&#8217;s all you need and it works solidly for you, awesome! You can stop reading now and attend one of our other <a href="http://www.lucidimagination.com/services/training">excellent training courses</a> that fit your needs. But if you are a tinkerer and want to know what makes Solr shine, or if you need some new or improved feature, read on&#8230;<br />
<span id="more-3988"></span><br />
Deeper down, Lucene is cranking &#8211; analyzing, buffering, and indexing your documents, merging segments, parsing queries, caching data structures, rapidly hopping around an inverted index, computing scores, navigating finite state machines, and much more.</p>
<p>So how do you go about learning Lucene deeper? You can start with our <a href="http://refcardz.dzone.com/refcardz/lucene?oid=hom37143">&#8220;Understanding Lucene&#8221; DZone refcard</a>.  And let&#8217;s not forget <a href="http://www.manning.com/hatcher3/">Lucene in Action</a>, as it&#8217;s the most polished, detailed, and well crafted documentation available on the Lucene library. And of course there&#8217;s the incredibly vibrant and helpful <a href="http://lucene.apache.org/">Lucene open source community</a>. Those resources will serve you well, but there&#8217;s no substitute for live, interactive, personal training to get you up to speed fast with best practices.</p>
<p>I&#8217;m in the process of overhauling our <a href="http://2011.lucene-eurocon.org/pages/training#lucene-workshop">Lucene training course</a>, which I&#8217;ll personally be delivering at <a href="http://2011.lucene-eurocon.org">Lucene EuroCon 2011 in Barcelona</a> next month. This new and improved course takes an activity-based approach to learning and using Lucene&#8217;s API, beginning with the common tasks in building solutions using Lucene, whether you&#8217;re building directly to Lucene&#8217;s API or you&#8217;re writing custom components for Solr.</p>
<p>One area that I&#8217;m particularly jazzed about teaching is &#8220;query parsing&#8221;, the process of taking a user&#8217;s (or machine&#8217;s) search request and turning it into the appropriate underlying Lucene Query object instance.  Many folks developing with Lucene are familiar with Lucene&#8217;s QueryParser.  But did you know there are a couple of other query parsers with special powers?  There&#8217;s the surround query parser, enabling sophisticated proximity SpanQuery clauses.  And there&#8217;s the mysterious &#8220;XML query parser&#8221; (don&#8217;t let the ugly sounding name dissuade you) that slots dynamic query parameters, such as those coming from an &#8220;advanced search&#8221; request, into a tree structured query template.   There&#8217;s some more insight into the world of Lucene query parsers in the <a href="http://www.lucidimagination.com/blog/2009/02/22/exploring-query-parsers/">&#8220;Exploring Query Parsers&#8221;</a> blog post.</p>
<p>What about all the Lucene contrib modules activity in the Lucene 3.x releases?   Here&#8217;s a bit of the goodness: better Unicode handling with the ICU tokenizers and filters, improved stemming, and many other analysis improvements, field grouping/collapsing, and block join/query for handling particular parent/child relationships.</p>
<p>Come learn the latest about the amazing Lucene library at <a href="http://2011.lucene-eurocon.org/">Lucene EuroCon</a>!  You, your boss, and your projects will all be glad you did.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/09/12/learn-lucene/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Charlottesville, VA meetup</title>
		<link>http://www.lucidimagination.com/blog/2011/08/09/charlottesville-va-meetup/</link>
		<comments>http://www.lucidimagination.com/blog/2011/08/09/charlottesville-va-meetup/#comments</comments>
		<pubDate>Tue, 09 Aug 2011 14:25:42 +0000</pubDate>
		<dc:creator>Erik Hatcher</dc:creator>
				<category><![CDATA[Events]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[charlottesville]]></category>
		<category><![CDATA[Erik Hatcher]]></category>
		<category><![CDATA[virginia]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3819</guid>
		<description><![CDATA[[ Monday, 15 August 2011; 18:00 to 21:00. ] <p>If you&#8217;re in central VA, or even in the northern VA / DC area, come join us for the inaugural <a href="http://www.meetup.com/Charlottesville-Apache-Lucene-Solr-Meetup/events/25877811/">&#8220;Charlottesville Solr and Lucene Meetup&#8221;</a>.  Charlottesville is home to the co-authors of Manning&#8217;s &#8220;Lucene in Action&#8221; and Packt&#8217;s &#8220;Solr 1.4 Enterprise Search Server&#8221; books.  This area is a hotbed of search activity thanks to NGIC and DIA calling Charlottesville home, and the many &#8230;</p>]]></description>
			<content:encoded><![CDATA[[ Monday, 15 August 2011; 18:00 to 21:00. ] <p>If you&#8217;re in central VA, or even in the northern VA / DC area, come join us for the inaugural <a href="http://www.meetup.com/Charlottesville-Apache-Lucene-Solr-Meetup/events/25877811/">&#8220;Charlottesville Solr and Lucene Meetup&#8221;</a>.  Charlottesville is home to the co-authors of Manning&#8217;s &#8220;Lucene in Action&#8221; and Packt&#8217;s &#8220;Solr 1.4 Enterprise Search Server&#8221; books.  This area is a hotbed of search activity thanks to NGIC and DIA calling Charlottesville home, and the many gov&#8217;t subcontractors supporting them here.  We are also home to hotelicopter, OpenQ, the University of Virginia, and many other organizations that use Lucene and Solr.</p>
<p>Looking forward to seeing you there.   Everyone in attendance will not only get to hear about the latest and greatest enhancements to these technologies, but may also walk away with a cool Lucid t-shirt!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/08/09/charlottesville-va-meetup/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Überconf &#8211; No Fluff, Just Solr</title>
		<link>http://www.lucidimagination.com/blog/2011/07/19/uberconf-no-fluff-just-solr/</link>
		<comments>http://www.lucidimagination.com/blog/2011/07/19/uberconf-no-fluff-just-solr/#comments</comments>
		<pubDate>Tue, 19 Jul 2011 19:33:09 +0000</pubDate>
		<dc:creator>Erik Hatcher</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Erik Hatcher]]></category>
		<category><![CDATA[uberconf]]></category>
		<category><![CDATA[uberconf11]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3784</guid>
		<description><![CDATA[[ Tuesday, 12 July 2011 to Friday, 15 July 2011. ] <p>I had the honor and pleasure of being invited to speak at <a href="http://uberconf.com/conference/denver/2011/07/home">Überconf</a> last week in the Denver, CO area.  <img class="alignright" src="http://www.nofluffjuststuff.com/images/2011/uber/UberConf_125x125_v2-01.png" alt="Überconf" /> The annual conference is organized by Jay Zimmerman of No Fluff, Just Stuff fame.  Überconf  has the same top-notch quality, at a grander scale &#8211; 10 concurrent tracks (woah!), full day pre-conference trainings (<a href="http://uberconf.com/conference/denver/2011/07/mobile_workshops">mobile, anyone?</a>), food (full breakfast!  that&#8217;s a REAL hearty &#8230;</p>]]></description>
			<content:encoded><![CDATA[[ Tuesday, 12 July 2011 to Friday, 15 July 2011. ] <p>I had the honor and pleasure of being invited to speak at <a href="http://uberconf.com/conference/denver/2011/07/home">Überconf</a> last week in the Denver, CO area.  <img class="alignright" src="http://www.nofluffjuststuff.com/images/2011/uber/UberConf_125x125_v2-01.png" alt="Überconf" /> The annual conference is organized by Jay Zimmerman of No Fluff, Just Stuff fame.  Überconf  has the same top-notch quality, at a grander scale &#8211; 10 concurrent tracks (woah!), full day pre-conference trainings (<a href="http://uberconf.com/conference/denver/2011/07/mobile_workshops">mobile, anyone?</a>), food (full breakfast!  that&#8217;s a REAL hearty bonus!), and many of the best technical presenters in the industry.  Lucene/Solr earned a full day &#8220;track&#8221; at Überconf.<br />
<span id="more-3784"></span><br />
One full day was dedicated to my three Solr presentations, one being a double-session &#8220;workshop&#8221;.  I began the day presenting &#8220;Rapid Prototyping with Solr&#8221;, demonstrating several quick examples that go from dataset ingestion to a usable search user interface.  Is Solr right for your data and your environment?  Try it and see &#8211; it&#8217;ll be 15 minutes well spent!  <img src='http://www.lucidimagination.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Here are the slides for &#8220;Rapid Prototyping with Solr&#8221;:</p>
<div style="width:340px" id="__ss_8600305"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/erikhatcher/rapid-prototyping-with-solr-8600305" title="Rapid Prototyping with Solr" target="_blank">Rapid Prototyping with Solr</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/8600305?rel=0" width="340" height="284" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/erikhatcher" target="_blank">Erik Hatcher</a> </div>
</div>
<p>My next talk, titled &#8220;Solr Recipes&#8221;, was a 3 hour workshop covering the common ways to get data into Solr, configure it, and leverage it within applications.  </p>
<p>Here are the &#8220;Solr Recipes&#8221; slides:</p>
<div style="width:340px" id="__ss_8600306"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/erikhatcher/solr-recipes-workshop" title="Solr Recipes Workshop" target="_blank">Solr Recipes Workshop</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/8600306?rel=0" width="340" height="284" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/erikhatcher" target="_blank">Erik Hatcher</a> </div>
</div>
<p>And finally, to a few hardcore folks, I discussed &#8220;Lucene for Solr Developers&#8221;, which more broadly covered the various ways to extend Solr.  One  cool example (or so I think, at least) I built for this talk that I&#8217;ve put out there as food for thought is &#8220;auto-faceting&#8221;, having a way for Solr to determine the best facets to select for a given query.  See <a href="https://issues.apache.org/jira/browse/SOLR-2641">SOLR-2641</a> for my initial proof-of-concept implementation. </p>
<p>Here are the &#8220;Lucene for Solr Developers&#8221; slides:</p>
<div style="width:340px" id="__ss_8600304"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/erikhatcher/lucene-for-solr-developers" title="Lucene for Solr Developers" target="_blank">Lucene for Solr Developers</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/8600304?rel=0" width="340" height="284" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/erikhatcher" target="_blank">Erik Hatcher</a> </div>
</div>
<p>There was so much going on that I barely got to tap into the conference experience myself, with the best part being the conversations had during meals, and between and during the scheduled talks.  I was able to reconnect with many long-time friends that I&#8217;ve made through No Fluff, Just Stuff, and made many new acquaintances &#8211; I won&#8217;t name names, as <a href="http://uberconf.com/conference/denver/2011/07/speakers">this list</a> covers most of them.</p>
<p>Thank you, Überconf, Jay, and fellow speakers and attendees for a stellar technical event.   If you missed it, don&#8217;t despair, <a href="http://www.nofluffjuststuff.com">No Fluff, Just Stuff</a> brings many of the same speakers and topics to an area near you.  I&#8217;ll be speaking at a few NFJS events during the second half of this year, including <a href="http://www.nofluffjuststuff.com/conference/raleigh/2011/08/home">Raleigh, NC</a>, <a href="http://www.nofluffjuststuff.com/conference/boston/2011/09/home">Boston, MA</a>, and, surely to rival the quality and size of Überconf,  <a href="http://www.therichwebexperience.com/conference/fort_lauderdale/2011/11/home">The Rich Web Experience</a> in Ft. Lauderdale, FL.</p>
<p>Let&#8217;s talk.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/07/19/uberconf-no-fluff-just-solr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Putting your search skills to the test: Lucid Certified Apache Solr/Lucene Developer Program</title>
		<link>http://www.lucidimagination.com/blog/2011/05/12/putting-your-search-skills-to-the-test-lucid-certified-apache-solrlucene-developer-program/</link>
		<comments>http://www.lucidimagination.com/blog/2011/05/12/putting-your-search-skills-to-the-test-lucid-certified-apache-solrlucene-developer-program/#comments</comments>
		<pubDate>Thu, 12 May 2011 13:01:58 +0000</pubDate>
		<dc:creator>David M. Fishman</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Enterprise Search]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open Source]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3438</guid>
		<description><![CDATA[<p>One of the singular qualities of search technology is its breadth: if it&#8217;s been written down (albeit digitally), you can search it, and if you can search it, you can build a search app for it. That&#8217;s part of what makes Solr/Lucene so alluring for application development &#8212; you can build it to search just about anything, for anyone, in any way. Inspiring breadth, however, can be pretty daunting to master.</p>
<p>How, then, can you &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>One of the singular qualities of search technology is its breadth: if it&#8217;s been written down (albeit digitally), you can search it, and if you can search it, you can build a search app for it. That&#8217;s part of what makes Solr/Lucene so alluring for application development &#8212; you can build it to search just about anything, for anyone, in any way. Inspiring breadth, however, can be pretty daunting to master.</p>
<p>How, then, can you know how much you know about search with Solr and Lucene? In the world of Apache open source, there&#8217;s <a href="http://apache.org/foundation/how-it-works.html#meritocracy">a clear meritocracy</a> of peer review: contributors, committers, and active membership in the PMC. In theory, it&#8217;s a distinction anyone of sufficient talent and single-minded focus can achieve &#8212; just like anyone of sufficient talent and single-minded focus can make it to the NBA, or win the Nobel prize, or join the New York Philharmonic.</p>
<p>So you probably know your stuff if you&#8217;ve won the Nobel prize, made the NBA, or played the solo for <a href="http://en.wikipedia.org/wiki/Clarinet_Concerto_%28Mozart%29">Mozart&#8217;s Clarinet Concerto</a> at <a href="http://www.barrypopik.com/index.php/new_york_city/entry/how_do_you_get_to_carnegie_hall/">Carnegie Hall</a>, or you&#8217;re a Lucene/Solr contributor-or-committer. But what if you have not done any of those things? How do you know you know? Equally important, how do your peers or potential employers know how well you know your open source search stuff?</p>
<p>While there are more professional basketball players than Lucene/Solr committers, there are many, many more capable, talented, experienced Solr/Lucene application developers who are not going to &#8216;go pro&#8217; in the Apache meritocracy. And the demand for Solr application development skills is exploding as interest in and uptake of the leading open source application development technology spread like wildfire through organizations large and small. (<a href="http://lucenerevolution.com/">Lucene Revolution, May 25-26 in San Francisco</a>, will be packed with these people &#8212; <a href="http://us.ootoweb.com/luceneregistration">sign up today</a> if you haven&#8217;t already. And read on for another special  opportunity at Lucene Revolution).</p>
<p><a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/05/CertificationLogo.png"><img class="alignright size-medium wp-image-3468" title="CertificationLogo" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/05/CertificationLogo-300x287.png" alt="" width="210" height="201" /></a>It&#8217;s exactly for the broad base of interested, committed search application developers that <a href="http://www.lucidimagination.com/About/Company-News/Lucid-Imagination-Launches-Certification-Program-Apache-SolrLucene-Developers">today we&#8217;re introducing</a> the <a href="http://www.lucidimagination.com/certification">Lucid Certified Apache Solr/Lucene Developer Program</a>; a certification exam designed to benchmark development skills and experience in building applications with Apache Solr.</p>
<p>Designed with Prometric and a team of subject matter experts comprised of Apache Lucene/Solr committers, developers, and trainers, the test is <a href="http://www.lucidimagination.com/certification/FAQ#a6">designed to rigorously assess </a>a broad base of search skills and experience, and provide the closest reasonable approximation possible to a standard measure of  search skills and experience.  It&#8217;s delivered via <a href="http://www.prometric.com/lucid/default.htm">Prometric.com</a>, consists of multiple choice questions, and costs $250. The test reflects a carefully selected, broad range of topics intended to reflect the real-world challenges and landscape of search application development problems, which <a href="http://www.lucidimagination.com/topics">you can see here</a>.</p>
<p><a href="http://www.opensourceconnections.com/">Eric Pugh</a>, who <a href="http://www.packtpub.com/solr-1-4-enterprise-search-server/book">wrote the book on Solr</a>, says this:</p>
<blockquote><p>“I  expect that the Lucid Imagination certification will quickly  become  the gold standard benchmark for whether someone who claims Solr  and  Lucene expertise truly possesses it. Oftentimes, a buyer of services   has to take the leap of faith from sales pitch to execution that the   knowledge is truly there. This certification can show, without a doubt,   that the holder truly has the knowledge required to deliver a  successful  Solr/Lucene implementation. In the open source world, there  are very  few marks of authenticity: committer status, published author,  and now  the Lucid Solr certification. Just as the CPA certification  shows a high  level of knowledge and ability in the accounting industry,  the Lucid  Imagination Solr certification demonstrates unquestionable  knowledge and  experience in successful Solr/Lucene search engine  implementation.”</p></blockquote>
<p>It&#8217;s important to be clear about what the certification is <strong>not:</strong></p>
<ul>
<li>It&#8217;s not easy: don&#8217;t expect to take your first Solr course one day and pass the exam the next.</li>
<li>It&#8217;s not a substitute for experience: if you&#8217;ve only built one Solr application, earlier this morning, using the wiki demo that runs locally in your browser, you won&#8217;t pass.</li>
<li>It&#8217;s not a substitute for training: taking a class from an expert may not be sufficient, but it will really help (and <a href="http://training.lucidimagination.com">we offer the most professional-grade courses</a> available; did I mention <a href="http://www.lucenerevolution.org/training">Lucene Revolution has classes available</a>, too?)</li>
<li>It&#8217;s not a casual conceptual overview: expect to answer detailed questions on everything from Lucene fundamentals to Solr debug output.</li>
<li>It&#8217;s not a simple checklist of facts: you&#8217;ll have to demonstrate judgement calls in identifying correct answers to topic areas tied to searching, indexing, deployment, data source types, etc.</li>
</ul>
<p>Testing as a pedagogical method &#8212; a mechanism for driving learning &#8212; is not the be-all-end-all of education (you probably didn&#8217;t think highly  of classmates who asked the teacher, &#8220;Will this be on the test?&#8221;). But it turns out that tests can have <a href="http://www.nytimes.com/2011/01/21/science/21memory.html">a salutary impact on acquiring and retaining knowledge</a>, according to <a href="http://www.sciencemag.org/content/early/2011/01/19/science.1199327.abstract">a recent article in Science</a>.</p>
<p>We expect that this test will help level the playing field for a broad range of application developers to acquire and prove their Solr/Lucene application development skills &#8212; and help employers who want to take full advantage of the best open-source search technology on the planet find the men and women who have the stuff to do it.</p>
<p>If you&#8217;re coming to Lucene Revolution, the exam will be offered there for free &#8212; a savings of $250 over the regular price. Details are <a href="http://lucenerevolution.com/2011/certification">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/05/12/putting-your-search-skills-to-the-test-lucid-certified-apache-solrlucene-developer-program/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

