<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; Solr</title>
	<atom:link href="http://www.lucidimagination.com/blog/category/solr/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Solr and LucidWorks feature matrix available</title>
		<link>http://www.lucidimagination.com/blog/2012/01/03/solr-and-lucidworks-feature-matrix-available/</link>
		<comments>http://www.lucidimagination.com/blog/2012/01/03/solr-and-lucidworks-feature-matrix-available/#comments</comments>
		<pubDate>Tue, 03 Jan 2012 21:51:08 +0000</pubDate>
		<dc:creator>Cassandra Targett</dc:creator>
				<category><![CDATA[LucidWorks]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4589</guid>
		<description><![CDATA[<p>We get asked a lot by customers what&#8217;s in a new Solr/Lucene release that applies to them, and with our own LucidWorks Platform available, customers naturally want to know what they&#8217;ll get that they don&#8217;t already have. If you&#8217;re happily running along on Solr 1.4, why or when should you update to a newer version? Should you migrate to LucidWorks?</p>
<p>So we decided to try to put together a matrix of major features and show &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>We get asked a lot by customers what&#8217;s in a new Solr/Lucene release that applies to them, and with our own LucidWorks Platform available, customers naturally want to know what they&#8217;ll get that they don&#8217;t already have. If you&#8217;re happily running along on Solr 1.4, why or when should you update to a newer version? Should you migrate to LucidWorks?</p>
<p>So we decided to try to put together a matrix of major features and show in which versions they are available. Solr 1.4 is pretty old by now, so it naturally appears not to hold up well against Solr 3.5, Solr Trunk, or LucidWorks. Think of it as the base from which the later features in the list grow.</p>
<p>This was an interesting exercise to work through. It&#8217;s easy to read through the changes.txt for each release and try to include everything in a list such as this (and our Support guys are probably disappointed that I didn&#8217;t do that), but I tried to keep it to the major innovations or bug fixes so it stays somewhat readable. But there&#8217;s always the question of whether it&#8217;s too much or too little detail.</p>
<p>I hope it&#8217;s useful and we&#8217;d like to know what you think. Is it worthwhile? Should we go to deeper detail? Could the features use more explanation? Look it over at <a href="http://www.lucidimagination.com/devzone/references/feature-matrix-solr-and-lucidworks">Feature Matrix for Solr and LucidWorks</a> and share your suggestions in the comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2012/01/03/solr-and-lucidworks-feature-matrix-available/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why Not AND, OR, And NOT?</title>
		<link>http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/</link>
		<comments>http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/#comments</comments>
		<pubDate>Wed, 28 Dec 2011 23:23:24 +0000</pubDate>
		<dc:creator>hossman</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[query parser]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4572</guid>
		<description><![CDATA[I really dislike the so called "Boolean Operators" ("AND", "OR", and "NOT") and generally discourage people from using them.  It's understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it's a good idea to try to "set aside childish things" and start thinking (and encouraging your users to think) in terms of the superior "Prefix Operators" ("+", "-").]]></description>
			<content:encoded><![CDATA[<p><i>The following is written with Solr users in mind, but the principles apply to Lucene users as well.</i></p>
<p>
I really dislike the so called &#8220;Boolean Operators&#8221; (&#8220;AND&#8221;, &#8220;OR&#8221;, and &#8220;NOT&#8221;) and generally discourage people from using them.  It&#8217;s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it&#8217;s a good idea to try to &#8220;set aside childish things&#8221; and start thinking (and encouraging your users to think) in terms of the superior &#8220;Prefix Operators&#8221; (&#8220;+&#8221;, &#8220;-&#8221;).
</p>
<h2>Background: Boolean Logic Makes For Terrible Scores</h2>
<p>
<a href="https://en.wikipedia.org/wiki/Boolean_algebra">Boolean Algebra</a> is (as my father would put it) &#8220;pretty neat stuff&#8221; and the world as we know it most certainly wouldn&#8217;t exist without it.  But when it comes to building a search engine, boolean logic tends to not be very helpful.  Depending on how you look at it, boolean logic is all about truth values and/or set intersections.  In either case, there is no concept of &#8220;relevancy&#8221;: either something is true or it&#8217;s false; either it is in a set, or it is not in the set.
</p>
<p>
When a user is looking for &#8220;all documents that contain the word &#8216;Alligator&#8217;&#8221; they aren&#8217;t going to be very happy if a search system applied simple boolean logic to just identify the <i>unordered set</i> of all matching documents.  Instead algorithms like TF/IDF are used to try and identify the <i>ordered list</i> of matching documents, such that the &#8220;best&#8221; matches come first.  Likewise, if a user is looking for &#8220;all documents that contain the words &#8216;Alligator&#8217; or &#8216;Crocodile&#8217;&#8221;, a simple boolean logic union of the sets of documents from the individual queries would not generate results as good as a query that took into account the TF/IDF scores of the documents for the individual queries, as well as considering which documents match <i>both</i> queries.  (The user is probably more interested in a document that discusses the similarities and differences between Alligators and Crocodiles than in documents that only mention one or the other a great many times.)
</p>
<p>
This brings us to the crux of why I think it&#8217;s a bad idea to use the &#8220;Boolean Operators&#8221; in query strings: because it&#8217;s not how the underlying query structures actually work, and it&#8217;s not as expressive as the alternative for describing what you want.
</p>
<h2>BooleanQuery: Great Class, Bad Name</h2>
<p>
To really understand why the boolean operators are inferior to the prefix operators, you have to start by considering the underlying implementation.  The <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a> class is probably one of the most misleading class names in the entire Lucene code base because it doesn&#8217;t model simple boolean logic query operations at all.  The basic function of a BooleanQuery is:
</p>
<ol>
<li>
A BooleanQuery consists of one or more <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/BooleanClause.html">BooleanClauses</a>, each of which contains two pieces of information:
<ul>
<li>A nested Query</li>
<li>An <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/BooleanClause.Occur.html">Occur</a> flag, which has one of three values</li>
<ul>
<li><code>MUST</code> &#8211; indicating that documents must match this nested Query in order for the document to match the BooleanQuery, and the score from this subquery should contribute to the score for the BooleanQuery</li>
<li><code>MUST_NOT</code> &#8211; indicating that documents which match this nested Query are prohibited from matching the BooleanQuery</li>
<li><code>SHOULD</code> &#8211; indicating that documents which match this nested Query should have their score from the nested query contribute to the score from the BooleanQuery, but documents can be a match for the BooleanQuery even if they do not match the nested query</li>
</ul>
</ul>
</li>
<li>
If a BooleanQuery contains no <code>MUST</code> BooleanClauses, then a document is only considered a match against the BooleanQuery if one or more of the <code>SHOULD</code> BooleanClauses is a match.
</li>
<li>
The final score of a document which matches a BooleanQuery is based on the sum of the scores from all the matching <code>MUST</code> and <code>SHOULD</code> BooleanClauses, multiplied by a &#8220;coord factor&#8221; based on the ratio of the number of matching BooleanClauses to the total number of BooleanClauses in the BooleanQuery.
</li>
</ol>
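<p>As a rough illustration of the matching rules above (which documents match, with no scoring at all), here is a minimal Java sketch. It does not use Lucene; the class <code>BooleanMatchSketch</code>, its <code>matches</code> method and the array of per-clause hits are hypothetical names made up for this example.</p>

```java
// Toy model of the matching rules only; this is not Lucene code.
final class BooleanMatchSketch {
    enum Occur { MUST, MUST_NOT, SHOULD }

    // occurs[i] is the flag of clause i; hits[i] says whether the document
    // matched the nested query of clause i.
    static boolean matches(Occur[] occurs, boolean[] hits) {
        boolean hasMust = false;
        boolean anyShouldHit = false;
        int i = 0;
        for (Occur occur : occurs) {
            boolean hit = hits[i++];
            if (occur == Occur.MUST) {
                hasMust = true;
                if (!hit) return false;   // a required clause did not match
            } else if (occur == Occur.MUST_NOT) {
                if (hit) return false;    // a prohibited clause matched
            } else if (hit) {
                anyShouldHit = true;
            }
        }
        // With no MUST clauses, at least one SHOULD clause has to match.
        return hasMust || anyShouldHit;
    }
}
```

<p>For the three clauses SHOULD, SHOULD, MUST_NOT, a document that matches only the first clause gives <code>true</code>, while a document that also matches the prohibited clause gives <code>false</code>.</p>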
<p>
These rules are not exactly simple to understand.  They are certainly more complex than boolean logic truth tables, but that&#8217;s because they are more powerful.  The examples below show how easy it is to implement &#8220;pure&#8221; boolean logic with BooleanQuery objects, but they only scratch the surface of what is possible with the BooleanQuery class:
</p>
<ul>
<li>
<p><b>Conjunction:</b> <code>(X &and; Y)</code></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.MUST);
q.add(Y, Occur.MUST);
</pre>
</li>
<li>
<p><b>Disjunction:</b> <code>(X &or; Y)</code></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Y, Occur.SHOULD);
</pre>
</li>
<li>
<p><b>Negation:</b> <code>(X &not; Z)</code></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
</ul>
<h2>Query Parser: Prefix Operators</h2>
<p>
In the Lucene <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/queryParser/QueryParser.html">QueryParser</a> (and all of the other parsers that are based on it, like DisMax and EDisMax) the &#8220;prefix&#8221; operators &#8220;+&#8221; and &#8220;-&#8221; map directly to the Occur.MUST and Occur.MUST_NOT flags, while the <i>absence</i> of a prefix maps to the Occur.SHOULD flag by default. <i>(If you have any suggestions for a one character prefix syntax that could be used to explicitly indicate Occur.SHOULD, please comment with your suggestions, I&#8217;ve been trying to think of a good one for years.)</i>  So using the prefix syntax, you can express <i>all</i> of the permutations that the BooleanQuery class supports &#8212; not just simple boolean logic:
</p>
<ul>
<li>
<p><b><code>(+X +Y)</code></b> <i>&#8230; Conjunction, ie: (X &and; Y)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.MUST);
q.add(Y, Occur.MUST);
</pre>
</li>
<li>
<p><b><code>(X Y)</code></b> <i>&#8230; Disjunction, ie: (X &or; Y)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Y, Occur.SHOULD);
</pre>
</li>
<li>
<p><b><code>(+X -Z)</code></b> <i>&#8230; Negation, ie: (X &not; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.MUST);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
<li>
<p><b><code>((X Y) -Z)</code></b> <i>&#8230; ((X &or; Y) &not; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
inner.add(X, Occur.SHOULD);
inner.add(Y, Occur.SHOULD);
q.add(inner, Occur.SHOULD);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
<li>
<p><b><code>(X Y -Z)</code></b> <i>&#8230; Not expressible in simple boolean logic</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Y, Occur.SHOULD);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
</ul>
<p>
Note in particular the differences between the last two examples.  <code>(X Y -Z)</code> is a single BooleanQuery object containing three clauses, while <code>((X Y) -Z)</code> is a BooleanQuery containing two clauses, one of which is a nested BooleanQuery containing two clauses.  In both cases a document must match either &#8220;X&#8221; or &#8220;Y&#8221; and it can not match against &#8220;Z&#8221; (so the set of documents matched by each query will be the same) and in both cases the score of a document will be higher if it matches both the &#8220;X&#8221; and &#8220;Y&#8221; clauses; but because of the difference in their structures, the <i>scores</i> for these queries will be different for every document.  In particular, the coord factor will cause documents matching only one of &#8220;X&#8221; or &#8220;Y&#8221; (but not both) to have extremely different scores from each of these queries. <i>(This assumes that the <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> is being used; it would be possible to write a custom Similarity to force the scores to be equivalent)</i>
</p>
<h2>Query Parser: &#8220;Boolean Operators&#8221;</h2>
<p>
The query parser also supports the so called &#8220;boolean operators&#8221; which can also be used to express boolean logic, as demonstrated in these examples:
</p>
<ul>
<li>
<p><b><code>(X AND Y)</code></b> <i>&#8230; Conjunction, ie: (X &and; Y)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.MUST);
q.add(Y, Occur.MUST);
</pre>
</li>
<li>
<p><b><code>(X OR Y)</code></b> <i>&#8230; Disjunction, ie: (X &or; Y)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Y, Occur.SHOULD);
</pre>
</li>
<li>
<p><b><code>(X NOT Z)</code></b> <i>&#8230; Negation, ie: (X &not; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
<li>
<p><b><code>((X AND Y) OR Z)</code></b> <i>&#8230; ((X &and; Y) &or; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
inner.add(X, Occur.MUST);
inner.add(Y, Occur.MUST);
q.add(inner, Occur.SHOULD);
q.add(Z, Occur.SHOULD);
</pre>
</li>
<li>
<p><b><code>((X OR Y) AND Z)</code></b> <i>&#8230; ((X &or; Y) &and; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
inner.add(X, Occur.SHOULD);
inner.add(Y, Occur.SHOULD);
q.add(inner, Occur.MUST);
q.add(Z, Occur.MUST);
</pre>
</li>
<li>
<p><b><code>(X AND (Y NOT Z))</code></b> <i>&#8230; (X &and; (Y &not; Z))</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
inner.add(Y, Occur.MUST);
inner.add(Z, Occur.MUST_NOT);
q.add(X, Occur.MUST);
q.add(inner, Occur.MUST);
</pre>
</li>
</ul>
<p>
Please note how important it is to use parentheses to combine multiple operators in order to generate queries that correctly model boolean logic. As mentioned before, the BooleanQuery class supports an arbitrary number of clauses, so <code>(X OR Y OR Z)</code> is a single BooleanQuery with three clauses; it is not equivalent to either <code>((X OR Y) OR Z)</code> <i>or</i> <code>(X OR (Y OR Z))</code> because those result in a BooleanQuery with two clauses, one of which is a nested BooleanQuery.  As mentioned above when discussing the prefix operators, the scores from each of those queries will all be different depending on which clauses are matched by each document.
</p>
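<p>To make the scoring difference concrete, here is a toy Java calculation that applies only the &#8220;sum times coord factor&#8221; rule described earlier to clauses that are all <code>SHOULD</code>. It is a sketch under simplifying assumptions, not Lucene&#8217;s real scoring (which involves other factors as well); <code>CoordSketch</code> and <code>score</code> are hypothetical names, and a clause score of 0.0 is used to mean &#8220;this clause did not match&#8221;.</p>

```java
// Toy arithmetic for a BooleanQuery whose clauses are all SHOULD.
final class CoordSketch {
    // clauseScores[i] is the score of clause i, or 0.0 when the document
    // does not match that clause (matching scores are assumed positive).
    static double score(double... clauseScores) {
        double sum = 0.0;
        int matched = 0;
        for (double s : clauseScores) {
            if (s > 0.0) {
                sum += s;
                matched++;
            }
        }
        if (matched == 0) return 0.0;   // no clause matched, so no match at all
        // Sum of matching scores, scaled by matched / total clauses.
        return sum * matched / clauseScores.length;
    }
}
```

<p>For a document that matches only X with a score of 1.0, the flat <code>(X OR Y OR Z)</code> is <code>score(1.0, 0.0, 0.0)</code>, roughly 0.33, while the nested <code>((X OR Y) OR Z)</code> is <code>score(score(1.0, 0.0), 0.0)</code>, which is 0.25.</p>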
<p>
Things definitely get very confusing when these &#8220;boolean operators&#8221; are used in ways other than those described above.  In some cases this is because the query parser is trying to be forgiving about &#8220;natural language&#8221; style usage of operators that many boolean logic systems would consider a parse error.  In other cases, the behavior is bizarrely esoteric:
</p>
<ul>
<li>Queries are parsed left to right</li>
<li><code>NOT</code> sets the Occur flag of the clause to its right to <code>MUST_NOT</code></li>
<li><code>AND</code> will change the Occur flag of the clause to its left to <code>MUST</code> unless it has already been set to <code>MUST_NOT</code></li>
<li><code>AND</code> sets the Occur flag of the clause to its right to <code>MUST</code></li>
<li><i>If</i> the <a href="https://wiki.apache.org/solr/SolrQuerySyntax">default operator</a> of the query parser has been set to &#8220;And&#8221;: <code>OR</code> will change the Occur flag of the clause to its left to <code>SHOULD</code> unless it has already been set to <code>MUST_NOT</code></li>
<li><code>OR</code> sets the Occur flag of the clause to its right to <code>SHOULD</code></li>
</ul>
<p>
Practically speaking this means that <code>NOT</code> takes precedence over <code>AND</code> which takes precedence over <code>OR</code> &#8212; but only if the default operator for the query parser has not been changed from the default (&#8220;Or&#8221;).  If the default operator is set to &#8220;And&#8221; then the behavior is just plain weird.
</p>
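<p>The left-to-right rules listed above can be sketched in a few lines of Java. This is a simplified model and not the real QueryParser: it only handles whitespace-separated terms and the three operators (no parentheses, fields or prefix operators), and it assumes that <code>NOT</code> wins when it is combined with <code>AND</code> or <code>OR</code>. The names <code>OperatorRulesSketch</code> and <code>toPrefixForm</code> are hypothetical.</p>

```java
// Simplified model of how AND / OR / NOT set the Occur flags, left to right.
final class OperatorRulesSketch {
    // Returns the terms with a prefix showing the flag each one ends up with:
    // '+' for MUST, '-' for MUST_NOT, no prefix for SHOULD.
    static String toPrefixForm(String query, boolean defaultIsAnd) {
        String[] tokens = query.trim().split("\\s+");
        String[] terms = new String[tokens.length];
        char[] flags = new char[tokens.length];
        int count = 0;
        String conj = "";         // pending AND / OR, if any
        boolean negate = false;   // pending NOT, if any
        for (String token : tokens) {
            if (token.equals("AND") || token.equals("OR")) {
                conj = token;
                continue;
            }
            if (token.equals("NOT")) {
                negate = true;
                continue;
            }
            // AND / OR may change the clause to their left, unless it is MUST_NOT.
            if (count != 0 && flags[count - 1] != '-') {
                if (conj.equals("AND")) flags[count - 1] = '+';
                if (conj.equals("OR") && defaultIsAnd) flags[count - 1] = ' ';
            }
            // Flag for the clause to the right of the pending operators.
            char flag = defaultIsAnd ? '+' : ' ';
            if (conj.equals("AND")) flag = '+';
            if (conj.equals("OR")) flag = ' ';
            if (negate) flag = '-';   // assumption: NOT wins over AND / OR
            terms[count] = token;
            flags[count] = flag;
            count++;
            conj = "";
            negate = false;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i != count; i++) {
            if (i != 0) out.append(' ');
            if (flags[i] != ' ') out.append(flags[i]);
            out.append(terms[i]);
        }
        return out.toString();
    }
}
```

<p>With the default operator left as &#8220;Or&#8221;, <code>toPrefixForm("X AND Y OR Z", false)</code> comes out as <code>+X +Y Z</code> and <code>toPrefixForm("X OR Y AND Z", false)</code> as <code>X +Y +Z</code>.</p>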
<h2>In Conclusion</h2>
<p>
I won&#8217;t try to defend or justify the way the query parser behaves when it encounters these &#8220;boolean operators&#8221;, because in many cases I don&#8217;t understand or agree with the behavior myself, but that&#8217;s not really the point of this article.  My goal isn&#8217;t to convince you that the behavior of these operators makes sense. Quite the contrary: my goal today is to point out that regardless of how these operators are parsed, they aren&#8217;t a good representation of the underlying functionality available in the BooleanQuery class.
</p>
<p>
Do yourself a favor and start <i>thinking</i> about BooleanQuery as a container of arbitrary nested queries annotated with <code>MUST</code>, <code>MUST_NOT</code>, or <code>SHOULD</code> and discover the power that is available to you beyond simple boolean logic.
</p>
<p><i>Many thanks to Bill Dueber whose <a href="http://robotlibrarian.billdueber.com/solr-and-boolean-operators/">recent related blog post</a> reminded me that I had some draft notes on this subject floating around my laptop waiting to be finished up and posted online.</i></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Options to tune document’s relevance in Solr</title>
		<link>http://www.lucidimagination.com/blog/2011/12/14/options-to-tune-document%e2%80%99s-relevance-in-solr/</link>
		<comments>http://www.lucidimagination.com/blog/2011/12/14/options-to-tune-document%e2%80%99s-relevance-in-solr/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 16:14:38 +0000</pubDate>
		<dc:creator>Tomás Fernández Löbbe</dc:creator>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[dismax]]></category>
		<category><![CDATA[relevance]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4542</guid>
		<description><![CDATA[<p>While I was working at Lucid Imagination, a customer once asked me how they could modify the score of the documents in Solr in order to get the most relevant results higher in the results list. While I was trying to respond to the question I realized that there are too many different options, and that not all of them are very easy to understand, so I decided to write some notes summarizing the most common/most used ways to &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>While I was working at Lucid Imagination, a customer once asked me how they could modify the score of the documents in Solr in order to get the most relevant results higher in the results list. While I was trying to respond to the question I realized that there are too many different options, and that not all of them are very easy to understand, so I decided to write some notes summarizing the most common/most used ways to do it. After that, I was asked the same question many times, so I decided to turn those notes into a blog post.</p>
<p>There are two stages where documents can be boosted: At index time and at query time.</p>
<h2>At Index Time</h2>
<p>This is probably the simplest way, because there are not too many options. It is also the most static way of adding boosts, as changing the boost for a document would require re-indexing it.<br />
When updating documents using the XMLUpdateRequestHandler, the way to boost a document is to add the optional attribute “boost” to the doc element. When using SolrJ, the way to do it is by using the method<br />
<code><br />
document.setDocumentBoost(x)<br />
</code><br />
The default boost is 1, so setting a value between 0 and 1 would down-boost the document.</p>
<p>It is also possible to add different boosts to different fields of a document. The only requirement here is that the boosted fields must store the norms (“omitNorms” attribute in the schema must be set to “false”). The way of applying the boosts when using the XMLUpdateRequestHandler is similar to boosting the whole document, but instead of adding the “boost” attribute to the doc element, add it to the field element. When using SolrJ:<br />
<code><br />
document.addField("title", "Foo Bar", x);<br />
</code></p>
<p>It’s important to know that the boost (either for a document or for a field) will be considered when calculating the final score for a document given a search. It is not the final score of the document. Boosting documents is not the same as sorting documents.</p>
<h2>At Query Time</h2>
<p>Boosting at query time is a little bit different than index time. It is much more dynamic as it doesn’t require re-indexing and can be specified with every new request to Solr. Also, what gets boosted is not a document or a field, but a subquery on the search. The simplest way to achieve query time boosting is by using the ^ character plus the boost number on the query, for example:</p>
<p><em>foo^5 bar</em></p>
<p>Much more complex expressions can also be used for query time boosting, like:</p>
<p><em>title:(foo bar)^5 OR content:(foo bar)^2 OR foo OR bar</em></p>
<p><em>title:(foo bar)^5 OR title:"foo bar"^20 OR …</em></p>
<p>The syntax can be very simple for simple cases, but it will get more and more complex with more complex use cases.</p>
<p>The above syntax is Lucene’s query syntax; it is supported by the Lucene Query Parser and the Extended Dismax Query Parser but not by the Dismax Query Parser.<br />
However, this syntax requires having an expert user who knows how to use it, or some application logic to inject it in the background after the user enters the query and before sending it to Solr. Dismax provides other alternatives for query time boosting, as dynamic as the previous one, but with a much easier syntax (all of them also supported by Extended Dismax).</p>
<h2>Query Time Boosting with the Dismax Query Parser</h2>
<h3>Boosting Fields</h3>
<p>The Dismax Query Parser (QP) will create a query that will be executed on many different fields, even if the user hasn’t specified any. This is one of the most important improvements of the Dismax QP over the Lucene QP. But sometimes, not all the fields have the same importance. Sometimes, a hit on the <em>title </em>field is more important than a hit on the <em>content </em>field, or a hit on the <em>content </em>can be more important than a hit on the <em>comments</em> field. The Dismax Query Parser provides the ability to consider some fields more important than others with the “qf” (named after “query fields”) parameter, the same that is used for specifying the different fields on which to execute the user query. A common value for this parameter could be:</p>
<p><em>qf=title^5 content^2 comments^0.5</em></p>
<p>This will translate a user query like “foo bar” into something similar to:</p>
<p><em>title:(foo bar)^5 OR content:(foo bar)^2 OR comments:(foo bar)^0.5</em></p>
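<p>As a small illustration of that translation, the Java sketch below builds the same simplified string from a “qf” value and a user query. It only mirrors the approximation shown above and is not how the Dismax Query Parser builds its query internally; <code>QfExpansionSketch</code> and <code>expand</code> are hypothetical names.</p>

```java
// Builds the simplified "field:(query)^boost OR ..." string for a qf value.
final class QfExpansionSketch {
    // qf is a whitespace-separated list of "field" or "field^boost" entries.
    static String expand(String qf, String userQuery) {
        StringBuilder out = new StringBuilder();
        for (String entry : qf.trim().split("\\s+")) {
            int caret = entry.indexOf('^');
            String field = caret == -1 ? entry : entry.substring(0, caret);
            String boost = caret == -1 ? "" : entry.substring(caret);   // keeps the '^'
            if (out.length() != 0) out.append(" OR ");
            out.append(field).append(":(").append(userQuery).append(')').append(boost);
        }
        return out.toString();
    }
}
```

<p>Calling <code>expand("title^5 content^2 comments^0.5", "foo bar")</code> returns the expression shown above.</p>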
<h3>Boosting Phrases</h3>
<p>The same as with query fields, Dismax Query Parser will execute the user query as a phrase query on the specified “phrase” fields. In this parameter, and in a similar way as in the qf parameter, a different boost for each of the phrase fields can be specified:</p>
<p><em>pf=title^20 content^10</em></p>
<p>This will translate a user query like <em>foo bar</em> into:</p>
<p><em>title:"foo bar"^20 OR content:"foo bar"^10</em></p>
<p>The last query will only be used for boosting the documents resulting from the original query.</p>
<h3>Boost Queries</h3>
<p>Sometimes it is necessary to boost some documents regardless of the user query. A typical example of boost queries is boosting sponsored documents. The user searches for “car rental”, but the application has some sponsored document that should be boosted. A good way of doing this is by using boost queries. A boost query is a query that will be executed in the background after a user query, and that will boost the documents that matched it.</p>
<p>For this example, the boost query (specified by the “bq” parameter) would be something like:</p>
<p><em>bq=sponsored:true</em></p>
<p>The boost query won’t determine which documents are considered a hit and which are not; it will just influence the score of the result.</p>
<h3>Boost Functions</h3>
<p>Boost Functions are very similar to boost queries; in fact, they can achieve the same goals. The difference between boost functions and boost queries is that the boost function is an arbitrary function instead of a query (see http://lucidworks.lucidimagination.com/display/solr/Function+Queries). A typical example of boost functions is boosting those documents that are more recent than others. Imagine a forum search application, where the user is searching for forum entries with the text “foo bar”. The application should display all the forum entries that talk about “foo bar”, but usually the most recent entries are more important (most users will want to see updated entries, and not historical ones). The boost function will be executed in the background after each user query, and will boost some documents in some way.</p>
<p>For this example, a boost function (specified by the “bf” parameter) could be something like:</p>
<p><em>bf=recip(ms(NOW,publicationDate),3.16e-11,1,1)</em></p>
<p>The same as with the boost queries, this function will not determine which documents are a hit and which are not, it will just add additional score to them.</p>
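<p>To get a feeling for the numbers this function produces, here is a small Java sketch of the arithmetic. It assumes the usual definition of Solr’s <code>recip(x,m,a,b)</code> as <code>a / (m*x + b)</code>; the class <code>RecipSketch</code> and the method <code>recencyBoost</code> are hypothetical names, and the document age is passed in directly instead of being computed from NOW and publicationDate.</p>

```java
// Arithmetic behind bf=recip(ms(NOW,publicationDate),3.16e-11,1,1).
final class RecipSketch {
    // recip(x, m, a, b) = a / (m * x + b)
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    // Additive boost for a document that is ageMillis milliseconds old.
    static double recencyBoost(double ageMillis) {
        return recip(ageMillis, 3.16e-11, 1.0, 1.0);
    }
}
```

<p>Since 3.16e-11 is roughly one divided by the number of milliseconds in a year, a brand new document gets a value of 1, a one year old document gets about 0.5, and a ten year old document gets about 0.09.</p>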
<p>A note on boost functions: boost functions can also be used with the Lucene QP by using the “_val_” special key inside the query.</p>
<h3>Tie Breaker</h3>
<p>The “tie” (tie breaker) parameter is very important, but not easy to understand. First it is important to understand what a dismax is (<a href="../2010/05/23/whats-a-dismax/">http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/</a>). With DisMax queries, the different terms of the user input are executed against different fields. If many of them hit (the term appears in different fields in the same document), the hit that scores highest is used. But what happens with the other sub-queries that hit in that document for the term? Well, that&#8217;s what the “tie” parameter defines. DisMax will calculate the score for a term query as:</p>
<p><code>score= [score of the top scoring subquery] + tie * (sum of other hitting subqueries)</code></p>
<p>In consequence, the “tie” parameter is a value between 0 and 1 that will define if the Dismax will only consider the max hit score for a term (setting tie=0), all the hits for a term (setting tie=1) or something between those two extremes.</p>
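<p>The formula above is simple enough to try out with a few numbers. The Java sketch below is only the arithmetic, not Solr code; <code>TieSketch</code> and <code>score</code> are hypothetical names, and the per-field scores are assumed to be non-negative.</p>

```java
// score = [top scoring subquery] + tie * (sum of the other hitting subqueries)
final class TieSketch {
    // fieldScores holds the scores of the per-field subqueries that hit
    // the document for one term.
    static double score(double tie, double... fieldScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : fieldScores) {
            max = Math.max(max, s);
            sum += s;
        }
        // Everything except the best hit is scaled by the tie breaker.
        return max + tie * (sum - max);
    }
}
```

<p>For hits scoring 0.8, 0.3 and 0.1, this gives about 0.8 with tie=0, about 1.2 with tie=1 and about 1.0 with tie=0.5.</p>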
<h3>The boost Parameter</h3>
<p>The “boost” parameter is very similar to the “bf” parameter, but instead of adding its result to the final score, it multiplies the final score by it. This is only available in the “Extended Dismax Query Parser” or the “Lucid Query Parser”.</p>
<h3>A note on the parameters</h3>
<p>All the above parameters can be specified when configuring Solr (in the solrconfig.xml file) but they can also be changed on each request just by sending the parameter on the request with the new value.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/12/14/options-to-tune-document%e2%80%99s-relevance-in-solr/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>What&#8217;s with lowercasing wildcard (multiterm) queries in Solr?</title>
		<link>http://www.lucidimagination.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/</link>
		<comments>http://www.lucidimagination.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/#comments</comments>
		<pubDate>Tue, 29 Nov 2011 21:37:25 +0000</pubDate>
		<dc:creator>Erick Erickson</dc:creator>
				<category><![CDATA[schema]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[wildcards multiterm queryparser]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4476</guid>
		<description><![CDATA[<h1><span class="Apple-style-span" style="font-size: 20px;">Wildcard query terms aren&#8217;t analyzed, why is that?</span></h1>
<p>Prior to the current 3x branch (which will be released as 3.6) and the trunk (4.0) Solr code, users have frequently been perplexed by wildcard searching being un-analyzed, often manifesting in case sensitivity. Say you have an analysis chain in your schema.xml file defined as follows and a field named <code>field_lc</code> of this type:</p>
<pre>&#60;fieldType name="lowercase" class="solr.TextField" &#62;
  &#60;analyzer&#62;
    &#60;tokenizer class="solr.WhitespaceTokenizerFactory"/&#62;
    &#60;filter class="solr.LowerCaseFilterFactory" /&#62;
  &#60;/analyzer&#62;
&#60;/fieldType&#62;
</pre>
<p>Now, you index &#8230;</p>]]></description>
			<content:encoded><![CDATA[<h1><span class="Apple-style-span" style="font-size: 20px;">Wildcard query terms aren&#8217;t analyzed, why is that?</span></h1>
<p>Prior to the current 3x branch (which will be released as 3.6) and the trunk (4.0) Solr code, users have frequently been perplexed by wildcard searching being un-analyzed, often manifesting in case sensitivity. Say you have an analysis chain in your schema.xml file defined as follows and a field named <code>field_lc</code> of this type:</p>
<pre>&lt;fieldType name="lowercase" class="solr.TextField" &gt;
  &lt;analyzer&gt;
    &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
    &lt;filter class="solr.LowerCaseFilterFactory" /&gt;
  &lt;/analyzer&gt;
&lt;/fieldType&gt;
</pre>
<p>Now, you index the text &#8220;My Dog Has Fleas&#8221;. So far, so good. Searching on this field as<br />
<code>field_lc:fleas</code> returns the document, as does <code>field_lc:flea*</code>.</p>
<p>But now you search on <code>field_lc:Flea*</code> and you don&#8217;t get any results. What?!?!?! Nearly everyone scratches their heads about this, and it&#8217;s a question that often comes up on the Solr user&#8217;s list. Users wonder why the analysis chain above isn&#8217;t applied to the wildcard queries. It turns out that it&#8217;s trickier than you might think at first. What happens when a single input term gets split up into multiple parts? For instance, those of you familiar with WordDelimiterFilterFactory (WDDF) know that it can split on case change. What does it mean to parse &#8216;fleA*&#8217;? Applying WDDF might well give the two tokens &#8216;fle&#8217; and &#8216;A&#8217; and possibly &#8216;fleA&#8217;. If a wildcard is present, what tokens should be emitted?</p>
<ol>
<li>&#8216;fleA*&#8217;</li>
<li>&#8216;fle*&#8217;, &#8216;A*&#8217;, &#8216;fleA*&#8217;</li>
<li>&#8216;fle*&#8217;, &#8216;A*&#8217;</li>
<li>&lt;insert your solution here&gt;</li>
</ol>
<p>You can, I daresay, create any rule that suits your fancy. And it&#8217;ll be wrong in some situations. Of particular horror is anything that produces &#8216;A*&#8217; as above: conceptually, you&#8217;d then have an enormous OR clause consisting of all the terms in your index that start with &#8216;A&#8217;. Unless you had a rule like &#8220;only do this if the preceding fragment was 2 characters or more&#8221;. But then someone would say &#8220;I need three characters&#8221;, so can WDDF provide a &#8220;wildCardMin=#&#8221; parameter? I already have trouble keeping all the WDDF parameters and their interactions in my mind; going down this path would be a nightmare. And I haven&#8217;t even considered some of the <strong>really</strong> interesting issues, like how proximity would be incorporated in all this.</p>
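<p>To make the cost of a very short prefix concrete, here is a minimal Java sketch. It is not Solr code: the class name, the tiny term list, and the use of a sorted set as a stand-in for the index&#8217;s term dictionary are all assumptions made for illustration. It simply counts how many terms each prefix would expand to.</p>

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: a sorted set stands in for the index's term dictionary.
class PrefixExpansionSketch {

    // Returns every term that starts with the given prefix.
    // Assumes no term contains the character U+FFFF.
    static SortedSet<String> expand(SortedSet<String> terms, String prefix) {
        // All strings with this prefix sort between prefix and prefix + U+FFFF.
        return terms.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(Arrays.asList(
                "a", "able", "about", "above", "after", "again",
                "dog", "fleas", "fled", "has", "my"));
        // A longer prefix matches a handful of terms...
        System.out.println("fle* expands to " + expand(terms, "fle"));
        // ...while a one-character prefix matches everything under that letter.
        System.out.println("a* expands to " + expand(terms, "a"));
    }
}
```

<p>Even in this toy dictionary the one-character prefix matches three times as many terms as the three-character one; in a real index the gap would be far larger.</p>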
<h3>Wildcards aren&#8217;t the only issue</h3>
<p>The same issue occurs with accent folding, normalizations, and, really, any other component of an analysis chain that somehow changes the query terms. This behavior has mostly been ignored in past releases; it&#8217;s been up to the application programmer to manually &#8220;do the right thing&#8221; before sending the query to Solr. This often involves operations such as lower-casing and accent folding on the application side when a wildcard is encountered.</p>
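<p>As a rough illustration of that application-side workaround, the sketch below lower-cases a term and strips combining accent marks before building the query string, but only when the term contains a wildcard. The class and method names are hypothetical, and the accent stripping is only an approximation of what a filter such as ASCIIFoldingFilterFactory does (characters that have no Unicode decomposition are left alone).</p>

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical client-side helper; not part of Solr or SolrJ.
class WildcardTermNormalizer {

    // Lower-cases the term and removes combining accent marks.
    // The wildcard characters * and ? pass through untouched.
    static String normalize(String term) {
        // Decompose accented characters into base character + combining mark.
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        // Drop the combining marks (Unicode category M).
        String folded = decomposed.replaceAll("\\p{M}", "");
        return folded.toLowerCase(Locale.ROOT);
    }

    // Builds "field:term", normalizing only terms that contain a wildcard,
    // since those are the ones that skip analysis on the server.
    static String buildQuery(String field, String userTerm) {
        boolean hasWildcard = userTerm.indexOf('*') >= 0 || userTerm.indexOf('?') >= 0;
        String term = hasWildcard ? normalize(userTerm) : userTerm;
        return field + ":" + term;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("field_lc", "Flea*"));
        System.out.println(buildQuery("field_lc", "Caf\u00e9*"));
        System.out.println(buildQuery("field_lc", "Fleas"));
    }
}
```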
<h1>The new way of handling these cases</h1>
<p>As of <a title="SOLR-2438" href="https://issues.apache.org/jira/browse/SOLR-2438">SOLR-2438</a> this is no longer the case for a number of the most common components. If your query analysis chain contains any of the following components, they are automatically applied to &#8220;multi-term&#8221; queries; you don&#8217;t have to do anything at all, it will &#8220;just work&#8221;. At query time, the indicated transformations are applied to the query terms and everyone is happy. Or should be. Do note that it&#8217;s an all-or-nothing operation: <strong>all</strong> of the elements below that are found in the query analysis chain are applied to the multi-term terms.</p>
<ul>
<li>ASCIIFoldingFilterFactory</li>
<li>LowerCaseFilterFactory</li>
<li>LowerCaseTokenizerFactory</li>
<li>MappingCharFilterFactory</li>
<li>PersianCharFilterFactory</li>
</ul>
<p>Again, this effectively means you don&#8217;t need to care about these transformations any more. One note of explanation, though. I&#8217;ve talked about the &#8220;query analysis chain&#8221;. But what if you don&#8217;t have one? Remember that your <code>&lt;analyzer&gt;</code> tag can have one of several &#8216;type&#8217; attributes: &#8220;index&#8221;, &#8220;query&#8221;, or none. Well, if a &#8216; type=&#8221;query&#8221; &#8216; is found, that analysis chain is inspected and any of the above components are recorded to be used on multi-term queries. If no &#8216; type=&#8221;query&#8221; &#8216; is found, then the &#8216; type=&#8221;index&#8221; &#8216; is used. And if no &#8216; type=&#8221;index&#8221; &#8216; is found, then the one with no &#8216;type&#8217; attribute is used.</p>
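<p>The fallback order just described can be sketched in a few lines of Java. This is only a model of the rule as stated in this post, not Solr&#8217;s implementation: the class name, the method name, and the idea of keying analyzers by their &#8216;type&#8217; attribute in a map (with an empty string for the untyped one) are all assumptions.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the "query, then index, then untyped" fallback.
class AnalyzerFallbackSketch {

    // Keys mirror the analyzer 'type' attribute; "" means no type attribute.
    // Returns the key of the chain that would be inspected for multi-term use.
    static String pickSourceChain(Map<String, String> analyzersByType) {
        for (String type : new String[] {"query", "index", ""}) {
            if (analyzersByType.containsKey(type)) {
                return type;
            }
        }
        throw new IllegalArgumentException("field type defines no analyzer");
    }

    public static void main(String[] args) {
        Map<String, String> both = new HashMap<>();
        both.put("index", "index-time chain");
        both.put("query", "query-time chain");
        System.out.println("picked: '" + pickSourceChain(both) + "'");

        Map<String, String> untypedOnly = new HashMap<>();
        untypedOnly.put("", "single shared chain");
        System.out.println("picked: '" + pickSourceChain(untypedOnly) + "'");
    }
}
```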
<h2>What does &#8220;multi-term&#8221; mean anyway?</h2>
<p>I&#8217;ve also sprinkled the phrase &#8220;multi-term&#8221; around, and sometimes &#8220;wildcard&#8221;. It turns out that the simple wildcard case is one member of a broader category of queries, including:</p>
<ul>
<li>wildcard</li>
<li>range</li>
<li>prefix</li>
</ul>
<p>All of these are now handled as above.</p>
<h3>Expert level schema possibilities</h3>
<p>All of the above is automatic, but there are three immediate questions:</p>
<ul>
<li>what about some of the <em>other</em> components?</li>
<li>what if I need the old behavior?</li>
<li>what if I want something completely different?</li>
</ul>
<p>It turns out that all three of these questions have the same answer. But before I outline it, I want to emphasize that <strong>you very probably don&#8217;t need to care about what follows!</strong> You might need to know about this in special cases, so I&#8217;ll mention it here.</p>
<p>In the above explanations, I wrote that &#8220;analysis chain is inspected and any of the above components are recorded to be used on multi-term queries&#8221;. Well, what actually happens is that there&#8217;s a new analysis chain in town that can be specified in the schema.xml file called, you guessed it, &#8220;multiterm&#8221;. You specify it like this as part of a <code>&lt;fieldType&gt;</code>:</p>
<pre>
&lt;analyzer type="multiterm" &gt;
  &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
  &lt;filter class="solr.ASCIIFoldingFilterFactory" /&gt;
  &lt;filter class="solr.YourFavoriteFilterFactoryHere" /&gt;
&lt;/analyzer&gt;
</pre>
<p>You can put <em>any</em> component here that&#8217;s legal in a &#8216;type=&#8221;index&#8221; &#8216; or &#8216;type=&#8221;query&#8221; &#8216; analysis chain. If you wanted, for instance, to enforce the old-style behavior, you could specify</p>
<pre>  &lt;tokenizer class="solr.KeywordTokenizerFactory" /&gt;</pre>
<p>as the entire &#8220;multiterm&#8221; analysis chain. It seems a bit odd to use KeywordTokenizerFactory here, but this applies to the individual terms, not the entire input. So it&#8217;s in effect saying &#8220;don&#8217;t analyze the terms at all&#8221;. Sound familiar? This is just what happened historically.</p>
<h3>How does this relate to the automatic behavior?</h3>
<p>Well, what really happens under the covers is that if you don&#8217;t define your own &#8220;multiterm&#8221; analysis chain, Solr constructs one for you from the analyzers you <em>have</em> defined as outlined above; query, index or default, in that order.</p>
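<p>Conceptually, that construction amounts to walking the chosen chain and keeping only the components that know how to handle multi-term queries. The sketch below models this with plain strings. It is not Solr&#8217;s code: the class and method names are made up, components are represented by their factory names only, and the question of which tokenizer ends up in the derived chain is deliberately left out.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: components are represented by factory name only.
class MultiTermChainSketch {

    // The components this post lists as applied automatically.
    static final Set<String> MULTI_TERM_AWARE = new HashSet<>(Arrays.asList(
            "ASCIIFoldingFilterFactory",
            "LowerCaseFilterFactory",
            "LowerCaseTokenizerFactory",
            "MappingCharFilterFactory",
            "PersianCharFilterFactory"));

    // Keeps the multi-term-aware components, preserving their original order.
    static List<String> deriveMultiTermChain(List<String> sourceChain) {
        List<String> derived = new ArrayList<>();
        for (String component : sourceChain) {
            if (MULTI_TERM_AWARE.contains(component)) {
                derived.add(component);
            }
        }
        return derived;
    }

    public static void main(String[] args) {
        List<String> queryChain = Arrays.asList(
                "WhitespaceTokenizerFactory",
                "WordDelimiterFilterFactory",
                "LowerCaseFilterFactory",
                "ASCIIFoldingFilterFactory",
                "PorterStemFilterFactory");
        // WDDF and the stemmer are dropped; lower-casing and folding remain.
        System.out.println(deriveMultiTermChain(queryChain));
    }
}
```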
<h2>Waaaaay under the covers, down in the code</h2>
<p>All this is accomplished by making components &#8220;multiterm aware&#8221;. This means implementing the &#8220;MultiTermAwareComponent&#8221; interface. Currently, the components listed above are the only ones that implement this interface, but others may be good candidates, and some of these are listed in JIRA <a title="SOLR-2921" href="https://issues.apache.org/jira/browse/SOLR-2921">SOLR-2921</a>. By and large, implementing these in the code <em>may</em> be trivial. What&#8217;s <em>not</em> trivial is understanding what &#8220;the right thing&#8221; is. Some examples of components where that is unclear:</p>
<ul>
<li>stemmers</li>
<li>various language-specific normalization filters</li>
<li>various language-specific lowercase filters</li>
<li>various ICU filters</li>
</ul>
<p>The reason these haven&#8217;t been made &#8220;multi term aware&#8221; is the usual open-source reason; &#8220;What we have is a good step forward, we shouldn&#8217;t delay everything in order to get the last use cases taken care of&#8221;. In other words the implementors (me in this case, with lots of help from others) are tired <img src='http://www.lucidimagination.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> .</p>
<p>Anyone who really understands what the right thing to do is for components that do not yet implement &#8220;MultiTermAwareComponent&#8221;, and who could provide use cases for them, would be giving us a great help, especially by providing examples illustrating the correct inputs and outputs for wildcard cases. And some examples of what should <em>not</em> come out as well. Or even better, a draft JUnit test that shows the expected behavior. Or even better yet, a full patch!</p>
<p>Any modification that potentially produces more than one token needs to be handled with care; see the code for LowerCaseTokenizerFactory for a case in point. Solr will now throw an exception if the transformation produces more than one token, so tread cautiously!</p>
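<p>The &#8220;exactly one token&#8221; rule can be pictured with a toy analyzer. The sketch below is an assumption-laden stand-in, not Solr&#8217;s code: the analysis is just lower-casing plus whitespace splitting, and the class name, method name, and exception type are invented for illustration.</p>

```java
import java.util.Locale;

// Toy stand-in for multi-term analysis: lower-case, then split on whitespace.
class SingleTokenGuardSketch {

    // Returns the single analyzed token, or fails if analysis yields
    // zero tokens or more than one.
    static String analyzeMultiTerm(String term) {
        String[] tokens = term.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (tokens.length != 1 || tokens[0].isEmpty()) {
            throw new IllegalArgumentException(
                    "analysis of '" + term + "' did not produce exactly one token");
        }
        return tokens[0];
    }

    public static void main(String[] args) {
        System.out.println(analyzeMultiTerm("Flea"));
        try {
            analyzeMultiTerm("My Dog");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```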
<p>This change should remove a long-standing point of confusion for Solr users. We&#8217;d be very interested in any feedback from the community, and especially any problems that crop up. SOLR-2438 has patches for both the 3x and 4x code lines, but it&#8217;s probably easier just to get a current 3x or 4x branch (or nightly build) if you want to test this &#8220;in the wild&#8221;; the code has been committed and built. There remains some work to be done to incorporate this change for more analysis components. Anyone want to volunteer?</p>
<h2>Resources:</h2>
<p>This page on the Solr Wiki has the Wiki documentation: <a title="Multi Term Query Analysis" href="http://wiki.apache.org/solr/MultitermQueryAnalysis">Multi Term Query Analysis</a></p>
<p>Main JIRA (already in 3.6 and 4.0 code lines): <a title="SOLR-2438" href="https://issues.apache.org/jira/browse/SOLR-2438">SOLR-2438</a></p>
<p>JIRA for other components not yet &#8220;multi-term aware&#8221; that are possibilities in the future: <a title="SOLR-2921" href="https://issues.apache.org/jira/browse/SOLR-2921">SOLR-2921</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Rich Web Experience</title>
		<link>http://www.lucidimagination.com/blog/2011/11/15/the-rich-web-experience/</link>
		<comments>http://www.lucidimagination.com/blog/2011/11/15/the-rich-web-experience/#comments</comments>
		<pubDate>Tue, 15 Nov 2011 19:16:08 +0000</pubDate>
		<dc:creator>Erik Hatcher</dc:creator>
				<category><![CDATA[Events]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4468</guid>
		<description><![CDATA[[ Tuesday, 29 November 2011 to Friday, 2 December 2011. ] <p><a href="http://therichwebexperience.com/conference/fort_lauderdale/2011/11/home"><img class="alignnone" title="The Rich Web Experience 2011" src="http://therichwebexperience.com/images/2011/header/richwebexperience.png" alt="" width="600" height="91" /></a></p>
<p>I&#8217;ll be speaking at the upcoming <a href="http://therichwebexperience.com/conference/fort_lauderdale/2011/11/home">Rich Web Experience conference</a> in Ft. Lauderdale, presenting an &#8220;Introduction to Solr&#8221;, &#8220;Solr Recipes&#8221;, and &#8220;Lucene for Solr Developers&#8221;.  I&#8217;ll be tying all of these presentations together into a cohesive search/Solr track going from the introduction, to recipes for common tasks, through advanced customization of Solr.&#8230;</p>]]></description>
			<content:encoded><![CDATA[[ Tuesday, 29 November 2011 to Friday, 2 December 2011. ] <p><a href="http://therichwebexperience.com/conference/fort_lauderdale/2011/11/home"><img class="alignnone" title="The Rich Web Experience 2011" src="http://therichwebexperience.com/images/2011/header/richwebexperience.png" alt="" width="600" height="91" /></a></p>
<p>I&#8217;ll be speaking at the upcoming <a href="http://therichwebexperience.com/conference/fort_lauderdale/2011/11/home">Rich Web Experience conference</a> in Ft. Lauderdale, presenting an &#8220;Introduction to Solr&#8221;, &#8220;Solr Recipes&#8221;, and &#8220;Lucene for Solr Developers&#8221;.  I&#8217;ll be tying all of these presentations together into a cohesive search/Solr track going from the introduction, to recipes for common tasks, through advanced customization of Solr.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/11/15/the-rich-web-experience/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bet You Didn&#8217;t Know Lucene Can&#8230;</title>
		<link>http://www.lucidimagination.com/blog/2011/11/14/bet-you-didnt-know-lucene-can/</link>
		<comments>http://www.lucidimagination.com/blog/2011/11/14/bet-you-didnt-know-lucene-can/#comments</comments>
		<pubDate>Mon, 14 Nov 2011 15:43:36 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4418</guid>
		<description><![CDATA[<p>Here are my ApacheCon 2011 slides for my talk &#8220;Bet You Didn&#8217;t Know Lucene Can&#8230;&#8221; :</p>
<p>&#160;</p>
<div id="__ss_10155480" style="width: 425px;"><strong style="display: block; margin: 12px 0 4px;"><a title="Bet you didn't know Lucene can..." href="http://www.slideshare.net/gsingers/bet-you-didnt-know-lucene-can">Bet you didn&#8217;t know Lucene can&#8230;</a></strong>
<div style="padding: 5px 0 12px;">View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/gsingers">gsingers</a>.</div>
&#8230;</div>]]></description>
			<content:encoded><![CDATA[<p>Here are my ApacheCon 2011 slides for my talk &#8220;Bet You Didn&#8217;t Know Lucene Can&#8230;&#8221; :</p>
<p>&nbsp;</p>
<div id="__ss_10155480" style="width: 425px;"><strong style="display: block; margin: 12px 0 4px;"><a title="Bet you didn't know Lucene can..." href="http://www.slideshare.net/gsingers/bet-you-didnt-know-lucene-can">Bet you didn&#8217;t know Lucene can&#8230;</a></strong><object id="__sse10155480" width="425" height="355" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="allowScriptAccess" value="always" /><param name="src" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=lucenecan-111114094003-phpapp01&amp;stripped_title=bet-you-didnt-know-lucene-can&amp;userName=gsingers" /><param name="allowscriptaccess" value="always" /><param name="allowfullscreen" value="true" /><embed id="__sse10155480" width="425" height="355" type="application/x-shockwave-flash" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=lucenecan-111114094003-phpapp01&amp;stripped_title=bet-you-didnt-know-lucene-can&amp;userName=gsingers" allowFullScreen="true" allowScriptAccess="always" allowscriptaccess="always" allowfullscreen="true" /></object></p>
<div style="padding: 5px 0 12px;">View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/gsingers">gsingers</a>.</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/11/14/bet-you-didnt-know-lucene-can/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Performance of Google&#8217;s V8 Javascript engine in Solr</title>
		<link>http://www.lucidimagination.com/blog/2011/11/10/performance-of-googles-v8-javascript-engine-in-solr/</link>
		<comments>http://www.lucidimagination.com/blog/2011/11/10/performance-of-googles-v8-javascript-engine-in-solr/#comments</comments>
		<pubDate>Thu, 10 Nov 2011 15:39:34 +0000</pubDate>
		<dc:creator>Emmanuel Espina</dc:creator>
				<category><![CDATA[Libraries]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4426</guid>
		<description><![CDATA[<p>The use of scripting languages to add new functionality to systems is something that I&#8217;ve always found very helpful. You don&#8217;t have to download the source code of the system; if it has “scriptable” parts you can add simple functionality in minutes without even compiling. Java provides this capability, in particular with Javascript. You can refer to <a href="http://java.sun.com/developer/technicalArticles/J2SE/Desktop/scripting/">http://java.sun.com/developer/technicalArticles/J2SE/Desktop/scripting/</a> for more information on this.</p>
<p>Unfortunately, the only engine included with Java 6 is Rhino, which converts the javascript &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>The use of scripting languages to add new functionality to systems is something that I&#8217;ve always found very helpful. You don&#8217;t have to download the source code of the system; if it has “scriptable” parts you can add simple functionality in minutes without even compiling. Java provides this capability, in particular with Javascript. You can refer to <a href="http://java.sun.com/developer/technicalArticles/J2SE/Desktop/scripting/">http://java.sun.com/developer/technicalArticles/J2SE/Desktop/scripting/</a> for more information on this.</p>
<p>Unfortunately, the only engine included with Java 6 is Rhino, which converts the javascript code into JVM code, and its performance is not good. For reasons like this, scripting languages in general are experiencing something that Java itself went through in the past: the popular belief that they are slow.</p>
<p>The performance is certainly not that bad, but in a critical application people prefer to develop native (Java) components to preserve performance rather than risk losing it to a probably unnecessary script. This entry is about the performance of scripting languages, in particular of other Javascript engines that you can attach to Java.</p>
<p>Google Chrome has surprised everyone with its blazing fast javascript engine: V8.</p>
<p>I downloaded V8 and tested it against the regular Rhino option. Actually I didn&#8217;t implement the necessary wrappers to add V8 to Java, nor the JNI C programming needed as glue to access V8 functions (which are native in the traditional sense, real machine code running on real processors, like in the old days). I used a library that I found on the internet: <a href="http://code.google.com/p/jav8/">http://code.google.com/p/jav8/</a>.</p>
<p>For the first test I downloaded a <a href="https://raw.github.com/espinaemmanuel/Blog-resources/master/lucid-11-10-2011/test.js">benchmark</a> from V8 that computes an RSA encryption of a text using Javascript. I know, you will feel that I&#8217;m cheating by using a benchmark that is not impartial, but the results are pretty conclusive anyway: <a title="http://v8.googlecode.com/svn/data/benchmarks/" href="http://v8.googlecode.com/svn/data/benchmarks/">http://v8.googlecode.com/svn/data/benchmarks/</a>.</p>
<p>The code (download <a href="https://raw.github.com/espinaemmanuel/Blog-resources/master/lucid-11-10-2011/main.java">here</a>) also shows the basic usage of the scripting functionality of Java.</p>
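<pre>import java.io.FileReader;
import java.io.IOException;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class main {

	public static void main(String[] args) throws IOException {

		ScriptEngineManager sm = new ScriptEngineManager();
		// Returns null if no engine named "jav8" is registered on the classpath
		ScriptEngine jsEngine = sm.getEngineByName("jav8");

		// Number of times to evaluate the script, taken from the command line
		int iter = Integer.parseInt(args[0]);

		try {
			long acum = 0;
			for(int i=0; i&lt;iter; i++){
				// eval() consumes the Reader, so open a fresh one for each iteration
				FileReader file = new FileReader("test.js");
				try {
					long start = System.currentTimeMillis();
					jsEngine.eval(file);
					long end = System.currentTimeMillis();

					acum += end - start;
				} finally {
					file.close();
				}
			}
			// Total evaluation time in milliseconds
			System.out.println(acum);
		} catch (ScriptException ex) {
			ex.printStackTrace();
		}
	}
}</pre>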
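<p>Since <code>getEngineByName</code> returns null when the requested engine is not registered, it can be worth checking what is actually available before running a benchmark like this. The following is a small sketch using only the standard <code>javax.script</code> API; the class name is made up, and what it prints depends entirely on your JVM and classpath.</p>

```java
import java.util.ArrayList;
import java.util.List;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;

// Lists the names under which script engines are registered in this JVM.
class ListScriptEngines {

    // Collects every alias of every discovered engine factory.
    static List<String> engineNames() {
        List<String> names = new ArrayList<>();
        for (ScriptEngineFactory factory : new ScriptEngineManager().getEngineFactories()) {
            names.addAll(factory.getNames());
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = engineNames();
        System.out.println("registered engine names: " + names);
        // "jav8" is the name the benchmark above asks for.
        System.out.println("jav8 available: " + names.contains("jav8"));
    }
}
```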
<p>I created a simple bash script to run this many times with different numbers of iterations, and these are the results (X is the number of iterations and Y is the time, so less is better).</p>
<p style="text-align: left;"><a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/11/plot-v8.png"><img class="aligncenter size-medium wp-image-4434" title="plot-v8" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/11/plot-v8-300x202.png" alt="" width="300" height="202" /></a></p>
<p>The performance difference is outstanding. Particularly interesting is the scalability of V8 when we add more iterations. V8 has an advanced cache system, and surely that is helping to keep up the performance as the number of iterations grows.</p>
<p>A real application that performs a set of repetitive tasks is the ScriptTransformer of the Data Import Handler of Solr. This transformer applies a function written in Javascript to each row of a table. This infamous component is something very useful but its performance has always been horrible.</p>
<p>To continue the tests I compared the standard Rhino script engine vs V8 applied to the ScriptTransformer of Solr. I had to modify the ScriptTransformer and remove its “reflection style” implementation (apparently there to keep compatibility with 1.5, but quite ugly anyway) to make jav8 work with it. The test consisted of encrypting 5000 records of text from a database. (modified <a href="https://raw.github.com/espinaemmanuel/Blog-resources/master/lucid-11-10-2011/ScriptTransformer.java">ScriptTransformer.java</a>)</p>
<p>The results:</p>
<table class="aligncenter" width="300">
<tbody>
<tr>
<td valign="top" width="340"><strong>Engine</strong></td>
<td valign="top" width="340"><strong>Time taken (seconds)</strong></td>
</tr>
<tr>
<td valign="top" width="340">V8</td>
<td valign="top" width="340">12,347</td>
</tr>
<tr>
<td valign="top" width="340">Rhino</td>
<td valign="top" width="340">83,255</td>
</tr>
</tbody>
</table>
<p>Again the results show that V8 wins by a big margin.</p>
<p>The conclusion that we extract here is that adding script engines to our systems does not imply that the performance will be damaged. If you accept the use of native libraries (V8 in this case) a script engine can make your systems much more flexible without slowing them down.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/11/10/performance-of-googles-v8-javascript-engine-in-solr/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>From Barcelona to Vancouver with Lucene and Solr</title>
		<link>http://www.lucidimagination.com/blog/2011/10/22/barcelona-vancouver/</link>
		<comments>http://www.lucidimagination.com/blog/2011/10/22/barcelona-vancouver/#comments</comments>
		<pubDate>Sat, 22 Oct 2011 10:14:36 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4364</guid>
		<description><![CDATA[<p>With another <a href="http://lucene-eurocon.com/">Lucene Eurocon</a> successfully behind us (thanks Barcelona, you&#8217;ve been awesome!), it&#8217;s time to say hello to Vancouver for <a href="http://na11.apachecon.com/">ApacheCon</a>.  I&#8217;ll leave it to others to fill in the blanks on the Barcelona conference other than to say that I am continually amazed by the vibrancy of the Lucene/Solr community and especially grateful to all the committers and contributors who take the time to show up and give talks about how they leverage &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>With another <a href="http://lucene-eurocon.com/">Lucene Eurocon</a> successfully behind us (thanks Barcelona, you&#8217;ve been awesome!), it&#8217;s time to say hello to Vancouver for <a href="http://na11.apachecon.com/">ApacheCon</a>.  I&#8217;ll leave it to others to fill in the blanks on the Barcelona conference other than to say that I am continually amazed by the vibrancy of the Lucene/Solr community and especially grateful to all the committers and contributors who take the time to show up and give talks about how they leverage the world&#8217;s premier open source search engine.</p>
<p>For me personally, I&#8217;m on to Vancouver and ApacheCon for two primary things, besides of course the community bits that go with every ApacheCon:</p>
<ol>
<li>Providing ApacheCon&#8217;s first-ever <a href="http://na11.apachecon.com/talks/18395">Apache Mahout training on Monday, November 7th</a>.  It&#8217;s still not too late to sign up!</li>
<li>Giving a talk on alternative uses of Lucene/Solr other than traditional free text search (things like recommendation engines, classification, etc.)</li>
</ol>
<p>For the 2nd item, I&#8217;m also interested in hearing from you, the user, about interesting things you&#8217;ve done with Lucene/Solr that fall outside the norm of free text search.  If you care to share, please leave a comment on this post.</p>
<p>I&#8217;d be remiss if I didn&#8217;t also plug several other Lucid Imagination employees who are speaking at ApacheCon as well:</p>
<ol>
<li><a href="http://na11.apachecon.com/talks/19453">Solr Flair</a> by Erik Hatcher.  Erik will also be doing a <a href="http://na11.apachecon.com/talks/19454">2 day Solr training class</a>.  Registration is still open for this class as well.</li>
<li><a href="http://na11.apachecon.com/talks/19346">Apache Solr: Out of the Box</a> by Chris Hostetter</li>
</ol>
<p>Lucid Imagination is also sponsoring the Lucene/Solr <a href="https://wiki.apache.org/lucene-java/ApacheCon2011NaMeetup">meetup</a> on Wed. November 9th, so if you are in town, please feel free to drop by for a drink and a chat.</p>
<p>With that, I&#8217;ll simply say, I hope to see you in Vancouver in a few weeks!</p>
<p>-Grant</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/10/22/barcelona-vancouver/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Monitoring Apache Solr and LucidWorks with Zabbix</title>
		<link>http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/</link>
		<comments>http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/#comments</comments>
		<pubDate>Sun, 02 Oct 2011 19:42:16 +0000</pubDate>
		<dc:creator>alexey</dc:creator>
				<category><![CDATA[LucidWorks]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4061</guid>
		<description><![CDATA[<p>If you&#8217;re running Apache Solr in production, you count on it to deliver solid performance and expect it to be up at all times. Even if you tested your setup with expected data and query load, things can go wrong. Solving those problems as they appear not only causes service downtime, but is also a very unpleasant task. Imagine sleepless nights trying to figure out why your production system went down with an OutOfMemory error. Similar &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>If you&#8217;re running Apache Solr in production, you count on it to deliver solid performance and expect it to be up at all times. Even if you tested your setup with expected data and query load, things can go wrong. Solving those problems as they appear not only causes service downtime, but is also a very unpleasant task. Imagine sleepless nights trying to figure out why your production system went down with an OutOfMemory error. Similar situations are actually more common than desired &#8211; no free disk space, running out of file descriptors, no free memory for OS level file system cache, high CPU load and so forth.</p>
<p>There is a special class of software called monitoring software that is widely used by system and network administrators. In our case we would like to monitor not only OS level metrics, but also Solr internal parameters, and act accordingly. LucidWorks and Apache Solr provide lots of valuable information through a JMX interface, so you can hook that up to your monitoring tool.</p>
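<p>For readers who have not used JMX directly, here is a minimal Java sketch of what &#8220;reading a JMX attribute&#8221; means. To stay runnable anywhere it reads the JVM&#8217;s own standard memory MBean from inside the same process; the class and method names are made up. A monitoring tool would connect remotely instead, and the MBean names that Solr or LucidWorks expose depend on your installation, so they are not shown here.</p>

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

// Reads one attribute from a standard JVM MBean via the platform MBean server.
class JmxReadSketch {

    // Returns the number of heap bytes currently in use by this JVM.
    static long heapUsedBytes() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Standard object name of the JVM memory MBean.
        ObjectName memory = new ObjectName("java.lang:type=Memory");
        // The attribute is a composite value with init/used/committed/max fields.
        CompositeData usage = (CompositeData) server.getAttribute(memory, "HeapMemoryUsage");
        return (Long) usage.get("used");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("heap used (bytes): " + heapUsedBytes());
    }
}
```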
<p>Zabbix is one of the most popular open source monitoring tools. It has many features: an easy-to-use web interface, different ways to gather metrics data, the ability to keep this data in persistent storage, built-in graphing, notifications and alerting, flexible configuration and many more. One of the most compelling features for integrating it with Apache Solr is built-in JMX support (available only in the Zabbix 2.0 beta release). Using this feature you can easily configure a Zabbix server to pull JMX metrics out of any LucidWorks or Solr application. Because all configuration settings (JMX attributes, graphs, triggers) are stored centrally on the Zabbix server, you can add a new attribute for all monitored servers or change the pulling frequency for servers with a single click.</p>
<p>Here are example graphs you can build in Zabbix:</p>
<p><em>1. Total number of documents in Solr index</em><br />
<a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/TotalNumberOfDocuments1.png"><img class="alignleft" style="padding-left: 0em; padding-top: 0.5em; padding-right: 0em; padding-bottom: 0.5em;" title="Total Number Of Documents" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/TotalNumberOfDocuments1.png" alt="" width="450" height="136" /></a></p>
<p><em>2. Search activity &#8211; number of search requests, errors and timeouts  </em><br />
Solr request handlers provide a cumulative counter for the number of requests, <a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/SearchActivity1.png"><img class="alignright" style="padding-left: 0.5em; padding-top: 0.5em; padding-right: 1em; padding-bottom: 0.5em;" title="Search Activity" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/SearchActivity1.png" alt="" width="225" height="88" /></a> but you are probably more interested in the number of search requests per period of time, such as per minute or per second. The trick here is that Zabbix can be set up to store not the value as-is, but its delta (either the simple change or the speed per second).</p>
<p><em>3. Solr document operations (adds, deletes by id or query)</em><br />
<a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/SolrDocumentOperations1.png"><img class="alignleft" style="padding-left: 0em; padding-top: 0.5em; padding-right: 0em; padding-bottom: 0.5em;" title="Solr Document Operations" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/SolrDocumentOperations1.png" alt="" width="450" height="136" /></a></p>
<p><em>4. Crawling activity</em><br />
LucidWorks provides different connectors/crawlers which you can use to <a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/CrawlingActivity1.png"><img class="alignright" style="padding-left: 0.5em; padding-top: 0.5em; padding-right: 1em; padding-bottom: 0.5em;" title="Crawling Activity" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/CrawlingActivity1.png" alt="" width="225" height="88" /></a> index documents into Solr. It also provides additional statistics about crawler behavior, like total number of documents, new and deleted documents, number of updated documents in iterative crawl, failures, etc.</p>
<p><em>5. Solr index operations (commits, optimizes, rollbacks)</em><br />
<a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/SolrIndexOperations1.png"><img class="alignleft" style="padding-left: 0em; padding-top: 0.5em; padding-right: 0em; padding-bottom: 0.5em;" title="Solr Index Operations" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/SolrIndexOperations1.png" alt="" width="450" height="136" /></a></p>
<p><em>6. Search Average Response Time</em><br />
<a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/SearchAverageResponseTime.png"><img class="alignright" style="padding-left: 0.5em; padding-top: 0.5em; padding-right: 1em; padding-bottom: 0.5em;" title="Search Average Response Time" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/SearchAverageResponseTime.png" alt="" width="225" height="88" /></a>The Solr search request handler provides a cumulative avgTimePerRequest value. The problem with this attribute is that when your application has been running in production for a significant amount of time, current short-term performance problems won&#8217;t have a significant effect on this aggregate metric. The solution is to use a Zabbix <em>calculated</em> item on the delta change of the <em>totalTime</em> and <em>requests</em> attributes. Here&#8217;s the math expression to calculate the average search response time for the last 5 minutes:</p>
<pre>sum("jmx[\"solr/collection1:type=/lucid,id=org.apache.solr.handler.StandardRequestHandler\",\"totalTime\"]",300)/sum("jmx[\"solr/collection1:type=/lucid,id=org.apache.solr.handler.StandardRequestHandler\",\"requests\"]",300)</pre>
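<p>The difference between the cumulative average and the windowed one can be illustrated outside Zabbix. The following is a minimal Python sketch, not the Zabbix implementation: the function name <em>windowed_average</em> and the sample data are hypothetical, and it assumes your monitoring tool has stored periodic readings of the cumulative <em>totalTime</em> and <em>requests</em> counters.</p>

```python
def windowed_average(samples, window_seconds=300):
    """Average time per request over the most recent window.

    samples: list of (timestamp, total_time, requests) tuples, oldest first,
    where total_time and requests are cumulative counters (hypothetical
    readings of the totalTime and requests attributes).
    Returns None when there is not enough data or no new requests.
    """
    if len(samples) < 2:
        return None
    latest = samples[-1]
    cutoff = latest[0] - window_seconds
    # Baseline: the newest sample taken at or before the window start.
    # If none is that old, fall back to the oldest sample available.
    baseline = samples[0]
    for sample in samples:
        if sample[0] <= cutoff:
            baseline = sample
    delta_time = latest[1] - baseline[1]
    delta_requests = latest[2] - baseline[2]
    # No new requests, or the counters were reset (for example by a restart).
    if delta_requests <= 0:
        return None
    return delta_time / delta_requests


# Made-up data: a long-running server, then five slow minutes.
samples = [
    (0, 1000000, 100000),    # lifetime average so far: 10 per request
    (300, 1050000, 100100),  # 100 new requests took 50000 in total
]
print(windowed_average(samples))       # 500.0
print(samples[-1][1] / samples[-1][2])  # lifetime average, still close to 10
```

<p>With this made-up data the lifetime average barely moves, while the average over the last five minutes is fifty times higher. That is the kind of short-term degradation the calculated item is meant to expose.</p>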
<p>&nbsp;</p>
<p><em>7. Solr searcher warmup time</em><br />
This is an important metric <a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/SearcherWarmupTime1.png"><img class="alignright" style="padding-left: 0.5em; padding-top: 0.5em; padding-right: 1em; padding-bottom: 0.5em;" title="Searcher Warmup Time" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/SearcherWarmupTime1.png" alt="" width="225" height="88" /></a>if you are aiming for a fast commit rate (near-real-time indexing) and don&#8217;t want to sacrifice faceting performance. You can configure your monitoring tool to send an alert when warmup time exceeds a predefined threshold.</p>
<p><em>8. Filter, query result, and document cache statistics (cache size, hits, hitratio, evictions, etc.)</em></p>
<p class="alignleft"><a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/FilterCacheSize1.png"><img style="padding-left: 0em; padding-top: 0.5em; padding-right: 0.5em; padding-bottom: 0.5em;" title="Filter Cache Size" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/FilterCacheSize1.png" alt="" width="122" height="44" /></a> <a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/FilterCacheHitRatio1.png"><img style="padding-left: 0em; padding-top: 0.5em; padding-right: 0.5em; padding-bottom: 0.5em;" title="Filter Cache Hit Ratio" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/FilterCacheHitRatio1.png" alt="" width="122" height="44" /></a> <a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/DocumentCacheSize1.png"><img style="padding-left: 0em; padding-top: 0.5em; padding-right: 0.5em; padding-bottom: 0.5em;" title="Document Cache Size" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/DocumentCacheSize1.png" alt="" width="122" height="44" /></a> <a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/DocumentCacheHitRatio1.png"><img style="padding-left: 0em; padding-top: 0.5em; padding-right: 0.5em; padding-bottom: 0.5em;" title="Document Cache Hit Ratio" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/DocumentCacheHitRatio1.png" alt="" width="122" height="44" /></a> <a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/QueryResultCacheSize.png"><img style="padding-left: 0em; padding-top: 0.5em; padding-right: 0.5em; padding-bottom: 0.5em;" title="Query Result Cache Size" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/QueryResultCacheSize.png" alt="" width="122" height="44" /></a> <a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/QueryResultCacheHitRatio.png"><img style="padding-left: 0em; padding-top: 0.5em; padding-right: 0.5em; padding-bottom: 
0.5em;" title="Query Result Cache Hit Ratio" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/QueryResultCacheHitRatio.png" alt="" width="122" height="44" /></a></p>
<p><em>9.  Java Heap Memory Usage</em><br />
<a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/HeapMemoryUsage1.png"><img class="alignleft" style="padding-left: 0em; padding-top: 0.5em; padding-right: 0em; padding-bottom: 0.5em;" title="Heap Memory Usage" src="http://www.lucidimagination.com/blog/wp-content/uploads/2011/09/HeapMemoryUsage1.png" alt="" width="450" height="136" /></a></p>
<p><strong>How would I know if my search server is down?</strong></p>
<p>There are two options. The obvious one is to set up your monitoring tool to issue search requests and verify the response status or specific text on the search results page. The other is to check when your monitoring tool last retrieved an arbitrary JMX attribute from your application and assume the server is down if that was longer ago than expected. Zabbix has a special function, <a href="http://www.zabbix.com/documentation/1.8/manual/config/triggers#example_8">nodata</a>, that you can use for this.</p>
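<p>Both checks are simple enough to sketch in a few lines. The Python below is only an illustration under stated assumptions, not how Zabbix implements them: the function names <em>search_responds</em> and <em>is_presumed_down</em> are hypothetical, and the URL and expected text you pass in depend entirely on your own deployment.</p>

```python
import time
import urllib.error
import urllib.request


def search_responds(url, expected_text, timeout=5.0):
    """Option 1: issue a search request, then verify status and page text.

    url and expected_text are supplied by the caller; nothing here assumes
    a particular Solr handler or results page.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and expected_text in body
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def is_presumed_down(last_update, max_silence_seconds=300, now=None):
    """Option 2: presume the server is down if no monitored value
    (for example a JMX attribute) has arrived for too long.

    last_update and now are Unix timestamps in seconds.
    """
    if now is None:
        now = time.time()
    return (now - last_update) > max_silence_seconds
```

<p>The second check only compares timestamps, so it costs nothing extra on the search server, but it cannot tell a dead server from a broken monitoring connection. The first check exercises the real search path at the price of an extra request per check.</p>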
<p><strong>How would I know if I&#8217;m reaching the limits of my server, so that I can react proactively?</strong></p>
<p>This is a complex issue as there are many things that can go wrong (such as JVM heap memory, CPU load, disk space, file descriptors, etc.) and you should monitor them all. Zabbix has great example templates for OS and Java triggers that allow you to keep an eye on all those parameters.</p>
<p>For more information about Solr and LucidWorks JMX support, instructions on how to configure Zabbix and Nagios, Zabbix configuration templates, and other helpful tips, please see the <a href="http://lucidworks.lucidimagination.com/display/lweug/Integrating+Monitoring+Services">Integrating Monitoring Services</a> section of the Lucid documentation portal.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Happy Anniversary, Lucene!  10 years at the ASF</title>
		<link>http://www.lucidimagination.com/blog/2011/09/18/happy-anniversary-lucene-10-years-at-the-asf-3/</link>
		<comments>http://www.lucidimagination.com/blog/2011/09/18/happy-anniversary-lucene-10-years-at-the-asf-3/#comments</comments>
		<pubDate>Sun, 18 Sep 2011 18:05:38 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[nutch]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4050</guid>
		<description><![CDATA[<p>From a quiet start as a pet project to a giant in the industry, <a href="http://lucene.apache.org">Apache Lucene</a> is definitely the little (search) engine that could.  On September 18th, 2001 (at 16:29:48 UTC) Jason Van Zyl made the first <a href="http://svn.apache.org/viewvc?view=revision&#38;revision=149570">official import</a> of Doug Cutting&#8217;s Lucene project (which started in 1997 and was hosted on SourceForge) into <a href="http://www.apache.org">Apache&#8217;s</a> Jakarta project (check out the <a href="http://web.archive.org/web/20011202174653/http://jakarta.apache.org/">Wayback machine</a>).</p>
<p>And while I wasn&#8217;t around in the beginning, I thought I would &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>From a quiet start as a pet project to a giant in the industry, <a href="http://lucene.apache.org">Apache Lucene</a> is definitely the little (search) engine that could.  On September 18th, 2001 (at 16:29:48 UTC) Jason Van Zyl made the first <a href="http://svn.apache.org/viewvc?view=revision&amp;revision=149570">official import</a> of Doug Cutting&#8217;s Lucene project (which started in 1997 and was hosted on SourceForge) into <a href="http://www.apache.org">Apache&#8217;s</a> Jakarta project (check out the <a href="http://web.archive.org/web/20011202174653/http://jakarta.apache.org/">Wayback machine</a>).</p>
<p>And while I wasn&#8217;t around in the beginning, I thought I would offer up some (little) known tidbits, links, etc. about Lucene as an ode to the search library that has significantly changed the search world, as well as my own career:</p>
<ol>
<li>Lucene was <a href="http://www.lucidimagination.com/devzone/videos-podcasts/podcasts/interview-doug-cutting">Doug&#8217;s way of learning Java</a>!  How&#8217;s that for a start?  It took him 3 months, working 2 days a week to crank out the first version.</li>
<li>At the time, some commercial search engines could not do incremental updates of the index, meaning you had to re-index all your documents anytime you had an update.  Lucene has always had an incremental model, all the way through to today&#8217;s Near Real Time features that power the likes of <a href="http://www.twitter.com">Twitter</a> at 1 billion+ searches and 100M+ new documents per day.</li>
<li><a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/document/Field.java?view=markup&amp;pathrev=149570">Field myField = Field.Text(&#8220;foo&#8221;, &#8220;bar&#8221;)</a>; anyone?  Or how about Field myField = Field.UnIndexed(&#8220;foo&#8221;, &#8220;bar&#8221;);</li>
<li>Back then, Lucene had its own <a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/PorterStemmer.java?view=markup&amp;pathrev=149570">PorterStemmer</a>; now we just use <a href="http://snowball.tartarus.org">Snowball</a>.</li>
<li>Only 1 of the <a href="http://web.archive.org/web/20020213045032/http://jakarta.apache.org/lucene/docs/whoweare.html">original committers</a> still remains somewhat active.</li>
<li>Read the old <a href="http://web.archive.org/web/20020203084504/http://www.lucene.com/cgi-bin/faq/faqmanager.cgi">FAQ</a>!  True as it ever was.  (Mostly)</li>
<li>Lucene 2.3 drastically improved indexing performance thanks to a thorough overhaul of the innards while barely affecting the API.  4.0 will blow the doors off of previous versions in terms of speed and efficiency.</li>
<li>Lucene is Doug&#8217;s wife&#8217;s <a href="http://www.lucidimagination.com/devzone/videos-podcasts/podcasts/interview-doug-cutting">middle name</a>.</li>
<li>Lucene has evolved from offering a single vector space scoring model to one that now <a href="http://www.lucidimagination.com/blog/2011/09/12/flexible-ranking-in-lucene-4/">offers plug-n-play</a> ranking (BM25, anyone?).</li>
<li>Lucene is ubiquitous.  It powers search on everything from mobile devices to web scale engines.  I&#8217;ve seen indexes as small as 15% of the original content.  I&#8217;ve also seen indexes grow to several billion documents in size.  Lucene has been used as a caching store, an ORM, a cross language search engine, the guts of the popular <a href="http://lucene.apache.org/solr">Solr</a> search server, the retrieval engine for IBM&#8217;s <a href="http://www-03.ibm.com/innovation/us/watson/index.html">Watson</a> as well as several commercial search engines and pretty much everything in between.</li>
<li>Did you know <a href="http://hadoop.apache.org">Apache Hadoop</a> started as a subproject of Lucene?  Doug Cutting and Mike Cafarella first built out Hadoop in order to scale out indexing for the <a href="http://nutch.apache.org">Apache Nutch</a> project.  From there it was spun out to be a top level ASF project and has gone on to be the de facto choice for large scale distributed processing, much like Lucene is the de facto choice for search!  Lucene has also spun out <a href="http://mahout.apache.org">Mahout</a>, <a href="http://tika.apache.org">Tika</a>, Lucene.NET and Lucy!</li>
</ol>
<p>As for how Lucene&#8217;s impacted me?  In 2004, I took a job at the <a href="http://www.cnlp.org">Center for Natural Language Processing</a> at Syracuse University working for Dr. Liz Liddy.  My job was to build an Arabic-English cross language search engine.  Within a day or two of starting, <a href="http://www.linkedin.com/profile/view?id=10139209&amp;trk=tyah">Ozgur Yilmazel</a> (my boss at the time) said something to the effect of &#8220;we&#8217;ll be using Lucene for the implementation.  Go learn it.&#8221;  Digging in, I quickly needed a couple of features, the biggest one being Term Vectors, so I updated a patch from an earlier version of Lucene and managed to convince the committers at the time to commit it.  From there, I kept supplying patches.  Eventually, I was asked to be a committer.  Some time after that, Yonik Seeley and Marc Krellenstein approached a bunch of the committers about starting a company and here I am today at the company we (Erik, Yonik, Marc and I) founded back in 2007, <a href="http://www.lucidimagination.com">Lucid Imagination</a>.  I feel fortunate to have the opportunity to work on hard problems in an interesting field and for that, Lucene, in no small part, I thank  you.</p>
<p>But enough of my self-indulgence, how has Lucene impacted you?  When did you first start using it?  What&#8217;s your biggest index or fastest QPS?   What ways have you used Lucene beyond that of a search engine?  Leave a comment and let us know.</p>
<p>Happy 10th Anniversary, Lucene!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/09/18/happy-anniversary-lucene-10-years-at-the-asf-3/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

