<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; query parser</title>
	<atom:link href="http://www.lucidimagination.com/blog/tag/query-parser/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Why Not AND, OR, And NOT?</title>
		<link>http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/</link>
		<comments>http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/#comments</comments>
		<pubDate>Wed, 28 Dec 2011 23:23:24 +0000</pubDate>
		<dc:creator>hossman</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[query parser]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4572</guid>
		<description><![CDATA[I really dislike the so called "Boolean Operators" ("AND", "OR", and "NOT") and generally discourage people from using them.  It's understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it's a good idea to try to "set aside childish things" and start thinking (and encouraging your users to think) in terms of the superior "Prefix Operators" ("+", "-").]]></description>
			<content:encoded><![CDATA[<p><i>The following is written with Solr users in mind, but the principles apply to Lucene users as well.</i></p>
<p>
I really dislike the so called &#8220;Boolean Operators&#8221; (&#8220;AND&#8221;, &#8220;OR&#8221;, and &#8220;NOT&#8221;) and generally discourage people from using them.  It&#8217;s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it&#8217;s a good idea to try to &#8220;set aside childish things&#8221; and start thinking (and encouraging your users to think) in terms of the superior &#8220;Prefix Operators&#8221; (&#8220;+&#8221;, &#8220;-&#8221;).
</p>
<h2>Background: Boolean Logic Makes For Terrible Scores</h2>
<p>
<a href="https://en.wikipedia.org/wiki/Boolean_algebra">Boolean Algebra</a> is (as my father would put it) &#8220;pretty neat stuff&#8221; and the world as we know it most certainly wouldn&#8217;t exist without it.  But when it comes to building a search engine, boolean logic tends to not be very helpful.  Depending on how you look at it, boolean logic is all about truth values and/or set intersections.  In either case, there is no concept of &#8220;relevancy&#8221;: either something is true or it&#8217;s false; either it is in a set, or it is not in the set.
</p>
<p>
When a user is looking for &#8220;all documents that contain the word &#8216;Alligator&#8217;&#8221; they aren&#8217;t going to be very happy if a search system applies simple boolean logic to just identify the <i>unordered set</i> of all matching documents.  Instead, algorithms like TF/IDF are used to try and identify the <i>ordered list</i> of matching documents, such that the &#8220;best&#8221; matches come first.  Likewise, if a user is looking for &#8220;all documents that contain the words &#8216;Alligator&#8217; or &#8216;Crocodile&#8217;&#8221;, a simple boolean logic union of the sets of documents from the individual queries would not generate results as good as a query that took into account the TF/IDF scores of the documents for the individual queries, as well as considering which documents match <i>both</i> queries.  (The user is probably more interested in a document that discusses the similarities and differences between Alligators and Crocodiles than in documents that only mention one or the other a great many times).
</p>
<p>
This brings us to the crux of why I think it&#8217;s a bad idea to use the &#8220;Boolean Operators&#8221; in query strings: because it&#8217;s not how the underlying query structures actually work, and it&#8217;s not as expressive as the alternative for describing what you want.
</p>
<h2>BooleanQuery: Great Class, Bad Name</h2>
<p>
To really understand why the boolean operators are inferior to the prefix operators, you have to start by considering the underlying implementation.  The <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a> class is probably one of the most misleading class names in the entire Lucene code base because it doesn&#8217;t model simple boolean logic query operations at all.  The basic function of a BooleanQuery is:
</p>
<ol>
<li>
A BooleanQuery consists of one or more <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/BooleanClause.html">BooleanClauses</a>, each of which contains two pieces of information:
<ul>
<li>A nested Query</li>
<li>An <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/BooleanClause.Occur.html">Occur</a> flag, which has one of three values</li>
<ul>
<li><code>MUST</code> &#8211; indicating that documents must match this nested Query in order for the document to match the BooleanQuery, and the score from this subquery should contribute to the score for the BooleanQuery</li>
<li><code>MUST_NOT</code> &#8211; indicating that documents which match this nested Query are prohibited from matching the BooleanQuery</li>
<li><code>SHOULD</code> &#8211; indicating that documents which match this nested Query should have their score from the nested query contribute to the score from the BooleanQuery, but documents can be a match for the BooleanQuery even if they do not match the nested query</li>
</ul>
</ul>
</li>
<li>
If a BooleanQuery contains no <code>MUST</code> BooleanClauses, then a document is only considered a match against the BooleanQuery if one or more of the <code>SHOULD</code> BooleanClauses is a match.
</li>
<li>
The final score of a document which matches a BooleanQuery is based on the sum of the scores from all the matching <code>MUST</code> and <code>SHOULD</code> BooleanClauses, multiplied by a &#8220;coord factor&#8221; based on the ratio of the number of matching BooleanClauses to the total number of BooleanClauses in the BooleanQuery.
</li>
</ol>
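<p>
The matching half of these rules can be sketched as a toy model in plain Java.  This is only an illustration of the rules as described above, not the Lucene implementation: the class <code>ToyBooleanQuery</code> and its <code>matches</code> method are hypothetical names, clauses are reduced to their Occur flag, and scoring is left out entirely.
</p>

```java
public class ToyBooleanQuery {
    // Hypothetical stand-in for Lucene's BooleanClause.Occur.
    public enum Occur { MUST, SHOULD, MUST_NOT }

    /**
     * occurs[i] is the flag of clause i; hit[i] says whether the document
     * matched the nested query of clause i.
     */
    public static boolean matches(Occur[] occurs, boolean[] hit) {
        int mustCount = 0;
        int shouldHits = 0;
        for (int i = 0; i < occurs.length; i++) {
            switch (occurs[i]) {
                case MUST:
                    mustCount++;
                    if (!hit[i]) return false;  // a required clause did not match
                    break;
                case MUST_NOT:
                    if (hit[i]) return false;   // a prohibited clause matched
                    break;
                case SHOULD:
                    if (hit[i]) shouldHits++;
                    break;
            }
        }
        // With no MUST clauses, at least one SHOULD clause has to match.
        return mustCount > 0 || shouldHits > 0;
    }

    public static void main(String[] args) {
        // Clause order is X, Y, Z with flags SHOULD, SHOULD, MUST_NOT.
        Occur[] q = { Occur.SHOULD, Occur.SHOULD, Occur.MUST_NOT };
        System.out.println(matches(q, new boolean[] { true, false, false }));
        System.out.println(matches(q, new boolean[] { true, false, true }));
    }
}
```

<p>
In this sketch a document that matches only the first clause is accepted, while one that also matches the prohibited third clause is rejected.
</p>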
<p>
These rules are not exactly simple to understand.  They are certainly more complex than boolean logic truth tables, but that&#8217;s because they are more powerful.  The examples below show how easy it is to implement &#8220;pure&#8221; boolean logic with BooleanQuery objects, but they only scratch the surface of what is possible with the BooleanQuery class:
</p>
<ul>
<li>
<p><b>Conjunction:</b> <code>(X &and; Y)</code></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.MUST);
q.add(Y, Occur.MUST);
</pre>
</li>
<li>
<p><b>Disjunction:</b> <code>(X &or; Y)</code></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Y, Occur.SHOULD);
</pre>
</li>
<li>
<p><b>Negation:</b> <code>(X &not; Z)</code></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
</ul>
<h2>Query Parser: Prefix Operators</h2>
<p>
In the Lucene <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/queryParser/QueryParser.html">QueryParser</a> (and all of the other parsers that are based on it, like DisMax and EDisMax) the &#8220;prefix&#8221; operators &#8220;+&#8221; and &#8220;-&#8221; map directly to the Occur.MUST and Occur.MUST_NOT flags, while the <i>absence</i> of a prefix maps to the Occur.SHOULD flag by default. <i>(If you have any suggestions for a one character prefix syntax that could be used to explicitly indicate Occur.SHOULD, please comment with your suggestions, I&#8217;ve been trying to think of a good one for years.)</i>  So using the prefix syntax, you can express <i>all</i> of the permutations that the BooleanQuery class supports &#8212; not just simple boolean logic:
</p>
<ul>
<li>
<p><b><code>(+X +Y)</code></b> <i>&#8230; Conjunction, ie: (X &and; Y)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.MUST);
q.add(Y, Occur.MUST);
</pre>
</li>
<li>
<p><b><code>(X Y)</code></b> <i>&#8230; Disjunction, ie: (X &or; Y)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Y, Occur.SHOULD);
</pre>
</li>
<li>
<p><b><code>(+X -Z)</code></b> <i>&#8230; Negation, ie: (X &not; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.MUST);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
<li>
<p><b><code>((X Y) -Z)</code></b> <i>&#8230; ((X &or; Y) &not; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
inner.add(X, Occur.SHOULD);
inner.add(Y, Occur.SHOULD);
q.add(inner, Occur.SHOULD);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
<li>
<p><b><code>(X Y -Z)</code></b> <i>&#8230; Not expressible in simple boolean logic</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Y, Occur.SHOULD);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
</ul>
<p>
Note in particular the differences between the last two examples.  <code>(X Y -Z)</code> is a single BooleanQuery object containing three clauses, while <code>((X Y) -Z)</code> is a BooleanQuery containing two clauses, one of which is a nested BooleanQuery containing two clauses.  In both cases a document must match either &#8220;X&#8221; or &#8220;Y&#8221; and it can not match against &#8220;Z&#8221; (so the set of documents matched by each query will be the same) and in both cases the score of a document will be higher if it matches both the &#8220;X&#8221; and &#8220;Y&#8221; clauses; but because of the difference in their structures, the <i>scores</i> for these queries will be different for every document.  In particular, the coord factor will cause documents matching only one of &#8220;X&#8221; or &#8220;Y&#8221; (but not both) to have extremely different scores from each of these queries. <i>(This assumes that the <a href="https://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> is being used; it would be possible to write a custom Similarity to force the scores to be equivalent)</i>
</p>
<h2>Query Parser: &#8220;Boolean Operators&#8221;</h2>
<p>
The query parser also supports the so called &#8220;boolean operators&#8221; which can also be used to express boolean logic, as demonstrated in these examples:
</p>
<ul>
<li>
<p><b><code>(X AND Y)</code></b> <i>&#8230; Conjunction, ie: (X &and; Y)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.MUST);
q.add(Y, Occur.MUST);
</pre>
</li>
<li>
<p><b><code>(X OR Y)</code></b> <i>&#8230; Disjunction, ie: (X &or; Y)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Y, Occur.SHOULD);
</pre>
</li>
<li>
<p><b><code>(X NOT Z)</code></b> <i>&#8230; Negation, ie: (X &not; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
q.add(X, Occur.SHOULD);
q.add(Z, Occur.MUST_NOT);
</pre>
</li>
<li>
<p><b><code>((X AND Y) OR Z)</code></b> <i>&#8230; ((X &and; Y) &or; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
inner.add(X, Occur.MUST);
inner.add(Y, Occur.MUST);
q.add(inner, Occur.SHOULD);
q.add(Z, Occur.SHOULD);
</pre>
</li>
<li>
<p><b><code>((X OR Y) AND Z)</code></b> <i>&#8230; ((X &or; Y) &and; Z)</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
inner.add(X, Occur.SHOULD);
inner.add(Y, Occur.SHOULD);
q.add(inner, Occur.MUST);
q.add(Z, Occur.MUST);
</pre>
</li>
<li>
<p><b><code>(X AND (Y NOT Z))</code></b> <i>&#8230; (X &and; (Y &not; Z))</i></p>
<pre>
BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
inner.add(Y, Occur.MUST);
inner.add(Z, Occur.MUST_NOT);
q.add(X, Occur.MUST);
q.add(inner, Occur.MUST);
</pre>
</li>
</ul>
<p>
Please note how important it is to use parentheses to combine multiple operators in order to generate queries that correctly model boolean logic. As mentioned before, the BooleanQuery class supports an arbitrary number of clauses, so <code>(X OR Y OR Z)</code> is a single BooleanQuery with three clauses; it is not equivalent to either <code>((X OR Y) OR Z)</code> <i>or</i> <code>(X OR (Y OR Z))</code> because those result in a BooleanQuery with two clauses, one of which is a nested BooleanQuery.  As mentioned above when discussing the prefix operators, the scores from each of those queries will all be different depending on which clauses are matched by each document.
</p>
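<p>
A rough way to see why the grouping changes the score is to apply the &#8220;sum times coord factor&#8221; rule by hand.  The sketch below is a simplified model, not Lucene code: <code>CoordSketch</code> and <code>score</code> are hypothetical names, every clause is assumed to be <code>SHOULD</code>, a clause score of 0 stands for &#8220;did not match&#8221;, and everything else a real Similarity does (query normalization, boosts) is ignored.
</p>

```java
public class CoordSketch {
    /**
     * Sum of the matching clause scores, multiplied by the ratio of
     * matching clauses to total clauses (the "coord factor").
     */
    public static double score(double[] clauseScores) {
        double sum = 0.0;
        int matching = 0;
        for (double s : clauseScores) {
            if (s > 0.0) {
                sum += s;
                matching++;
            }
        }
        return sum * matching / clauseScores.length;
    }

    public static void main(String[] args) {
        // The document matches only X, with a raw score of 1.0.
        double x = 1.0, y = 0.0, z = 0.0;

        // (X OR Y OR Z): one query with three clauses.
        double flat = score(new double[] { x, y, z });

        // ((X OR Y) OR Z): the inner query is scored first, then used as a clause.
        double nested = score(new double[] { score(new double[] { x, y }), z });

        System.out.println("flat   = " + flat);
        System.out.println("nested = " + nested);
    }
}
```

<p>
Under these assumptions the flat query scores the document at one third of its raw score (1 of 3 clauses matched), while the nested form scores it at one quarter (1 of 2 inside, then 1 of 2 outside).
</p>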
<p>
Things definitely get very confusing when these &#8220;boolean operators&#8221; are used in ways other than those described above.  In some cases this is because the query parser is trying to be forgiving about &#8220;natural language&#8221; style usage of operators that many boolean logic systems would consider a parse error.  In other cases, the behavior is bizarrely esoteric:
</p>
<ul>
<li>Queries are parsed left to right</li>
<li><code>NOT</code> sets the Occur flag of the clause to its right to <code>MUST_NOT</code></li>
<li><code>AND</code> will change the Occur flag of the clause to its left to <code>MUST</code> unless it has already been set to <code>MUST_NOT</code></li>
<li><code>AND</code> sets the Occur flag of the clause to its right to <code>MUST</code></li>
<li><i>If</i> the <a href="https://wiki.apache.org/solr/SolrQuerySyntax">default operator</a> of the query parser has been set to &#8220;And&#8221;: <code>OR</code> will change the Occur flag of the clause to its left to <code>SHOULD</code> unless it has already been set to <code>MUST_NOT</code></li>
<li><code>OR</code> sets the Occur flag of the clause to its right to <code>SHOULD</code></li>
</ul>
<p>
Practically speaking this means that <code>NOT</code> takes precedence over <code>AND</code> which takes precedence over <code>OR</code> &#8212; but only if the default operator for the query parser has not been changed from the default (&#8220;Or&#8221;).  If the default operator is set to &#8220;And&#8221; then the behavior is just plain weird.
</p>
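<p>
To make the left-to-right rules listed above concrete, here is a small sketch that applies them to a flat string of terms and operators and prints the result in prefix syntax.  It is a toy model of the rules as written in this article, not the real QueryParser: <code>OperatorRules</code> and <code>toPrefix</code> are hypothetical names, parentheses, phrases and field names are not handled, and a <code>NOT</code> is assumed to win over an <code>AND</code> or <code>OR</code> that precedes it.
</p>

```java
public class OperatorRules {
    // Hypothetical stand-in for Lucene's BooleanClause.Occur.
    public enum Occur { MUST, SHOULD, MUST_NOT }

    /** Rewrites e.g. "X AND Y OR Z" into prefix form, scanning left to right. */
    public static String toPrefix(String query, boolean defaultIsAnd) {
        String[] tokens = query.trim().split("\\s+");
        String[] terms = new String[tokens.length];
        Occur[] flags = new Occur[tokens.length];
        int n = 0;
        String conj = null;      // pending AND / OR, if any
        boolean negate = false;  // pending NOT, if any
        for (String tok : tokens) {
            if (tok.equals("AND") || tok.equals("OR")) { conj = tok; continue; }
            if (tok.equals("NOT")) { negate = true; continue; }
            // AND (and OR, when the default operator is "And") reach back to
            // the clause on the left, unless it is already MUST_NOT.
            if (n > 0 && flags[n - 1] != Occur.MUST_NOT) {
                if ("AND".equals(conj)) flags[n - 1] = Occur.MUST;
                if ("OR".equals(conj) && defaultIsAnd) flags[n - 1] = Occur.SHOULD;
            }
            // Then the flag of the clause on the right is chosen.
            Occur occur;
            if (negate) occur = Occur.MUST_NOT;
            else if ("AND".equals(conj)) occur = Occur.MUST;
            else if ("OR".equals(conj)) occur = Occur.SHOULD;
            else occur = defaultIsAnd ? Occur.MUST : Occur.SHOULD;
            terms[n] = tok;
            flags[n] = occur;
            n++;
            conj = null;
            negate = false;
        }
        // Render with "+" for MUST, "-" for MUST_NOT and no prefix for SHOULD.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < n; i++) {
            if (i > 0) out.append(' ');
            if (flags[i] == Occur.MUST) out.append('+');
            if (flags[i] == Occur.MUST_NOT) out.append('-');
            out.append(terms[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPrefix("X AND Y OR Z", false)); // +X +Y Z
        System.out.println(toPrefix("X OR Y AND Z", false)); // X +Y +Z
    }
}
```

<p>
Neither output is a nested query, which is the point: without parentheses the operators only adjust the flags of their neighbours inside one flat BooleanQuery.
</p>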
<h2>In Conclusion</h2>
<p>
I won&#8217;t try to defend or justify the way the query parser behaves when it encounters these &#8220;boolean operators&#8221;, because in many cases I don&#8217;t understand or agree with the behavior myself, but that&#8217;s not really the point of this article.  My goal isn&#8217;t to convince you that the behavior of these operators makes sense.  Quite the contrary: my goal today is to point out that regardless of how these operators are parsed, they aren&#8217;t a good representation of the underlying functionality available in the BooleanQuery class.
</p>
<p>
Do yourself a favor and start <i>thinking</i> about BooleanQuery as a container of arbitrary nested queries annotated with <code>MUST</code>, <code>MUST_NOT</code>, or <code>SHOULD</code> and discover the power that is available to you beyond simple boolean logic.
</p>
<p><i>Many thanks to Bill Dueber whose <a href="http://robotlibrarian.billdueber.com/solr-and-boolean-operators/">recent related blog post</a> reminded me that I had some draft notes on this subject floating around my laptop waiting to be finished up and posted online.</i></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Nested Queries in Solr</title>
		<link>http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/</link>
		<comments>http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/#comments</comments>
		<pubDate>Tue, 31 Mar 2009 21:36:55 +0000</pubDate>
		<dc:creator>yonik</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[function query]]></category>
		<category><![CDATA[local params]]></category>
		<category><![CDATA[nested queries]]></category>
		<category><![CDATA[query parser]]></category>
		<category><![CDATA[query parser plugin]]></category>
		<category><![CDATA[query syntax]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=586</guid>
		<description><![CDATA[<table class="plain" border="0" width="100%" bordercolor="#ffccff">
<tbody>
<tr valign="top">
<td width="60%">The ability to nest an arbitrary query type inside another query type is a useful feature that was quietly added to Solr some time ago, along with the support for query parser plugins to support different query types. I finally got around to <a href="http://issues.apache.org/jira/browse/SOLR-1046" target="_blank">fixing</a> nested queries for the function query parser, and figured it was high time I documented nested queries, along with the <a href="http://wiki.apache.org/solr/LocalParams" target="_blank">LocalParams</a> syntax that allows one to add metadata to a query parameter, </td></tr></tbody>&#8230;</table>]]></description>
			<content:encoded><![CDATA[<table class="plain" border="0" width="100%" bordercolor="#ffccff">
<tbody>
<tr valign="top">
<td width="60%">The ability to nest an arbitrary query type inside another query type is a useful feature that was quietly added to Solr some time ago, along with the support for query parser plugins to support different query types. I finally got around to <a href="http://issues.apache.org/jira/browse/SOLR-1046" target="_blank">fixing</a> nested queries for the function query parser, and figured it was high time I documented nested queries, along with the <a href="http://wiki.apache.org/solr/LocalParams" target="_blank">LocalParams</a> syntax that allows one to add metadata to a query parameter, or even change the type of a query (i.e. which query parser is used to parse the query string).</td>
<td width="5%"> </td>
<td width="35%"><strong><em>You might also be interested in:</em></strong>
<ul>
<li><a href="http://www.lucidimagination.com/Solutions/Webinars/Analyze-This-Tips-and-tricks-getting-LuceneSolr-Analyzer-index-and-search-your-content">Analyze This! Tips and tricks on getting the Lucene/Solr Analyzer to index and search your content right</a> &#8211; On-demand Webinar</li>
<li><a href="http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr">Optimizing Findability</a> &#8211; Tech Article</li>
<li><a href="http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr">Solr 1.4 Download</a></li>
<li><a href="http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr/Reference-Guide">Solr 1.4 Reference Guide</a></li>
</ul>
</td>
</tr>
</tbody>
</table>
<h3>Nested Queries in Lucene Syntax</h3>
<p>To embed a query of another type in a Lucene/Solr query string, simply use the magic field name <strong>_query_</strong>.  The following example embeds a Lucene query <strong>type:poems</strong> into another Lucene query:</p>
<pre><strong>text:"roses are red" AND _query_:"type:poems"</strong></pre>
<p>Now of course this isn&#8217;t too useful on its own, but it becomes very powerful in conjunction with the query parser framework and local params, which allow us to change the types of queries.  The following example embeds a <a href="http://wiki.apache.org/solr/DisMaxRequestHandler" target="_blank">DisMax query</a> in a normal Lucene query:</p>
<pre><strong>text:hi  AND  _query_:"{!dismax qf=title pf=title}how now brown cow"</strong></pre>
<p>And we can further use parameter dereferencing in the local params syntax to make it easier for the front-end to compose the request:</p>
<pre>&amp;<strong>q=text:hi  AND  _query_:"{!dismax qf=title pf=title v=$qq}"</strong>
&amp;<strong>qq=how now brown cow</strong></pre>
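<p>As a minimal sketch of what &#8220;easier for the front-end&#8221; can mean, the snippet below keeps the query template fixed and only URL-encodes the user&#8217;s text into the separate <code>qq</code> parameter.  The class <code>NestedQueryRequest</code> and its <code>build</code> method are hypothetical names for illustration; only the standard <code>java.net.URLEncoder</code> is used, and no Solr client library is assumed.</p>

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class NestedQueryRequest {
    // Encodes one parameter value for use in a URL query string.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Builds the q and qq parameters, with the user's text kept in qq. */
    public static String build(String userText) {
        // The template never changes; v=$qq points at the qq parameter.
        String q = "text:hi AND _query_:\"{!dismax qf=title pf=title v=$qq}\"";
        return "q=" + enc(q) + "&qq=" + enc(userText);
    }

    public static void main(String[] args) {
        System.out.println(build("how now brown cow"));
    }
}
```

<p>Because the user&#8217;s text travels in its own parameter, the front-end does not have to escape quotes or braces inside the nested query string.</p>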
<h3>Nested Queries in Function Query Syntax</h3>
<p>This is the part that was previously broken, and is only fixed/available in Solr 1.4.  You can use the query() function to embed any other type of query in a function query, and do computations on the relevancy scores returned by that query.  Some examples from the Solr wiki are <a href="http://wiki.apache.org/solr/FunctionQuery#head-da96f90c1632609bae6fd86c853b9e13e514ce89" target="_blank">here</a>.</p>
<h3>Pure Nested Query</h3>
<p>There is also a nested query parser plugin that allows one to create pure nested queries.  Is a nested query without any containing query even useful? Surprisingly yes, as it allows further decomposition of query requests.</p>
<p>For example, the following allows an easy way for the client to specify that they want some sort of recency date boost added into the relevancy score, while leaving the exact query type up to the Solr server config (via search handler defaults in solrconfig.xml).</p>
<p>The client query would specify the boost query as $datefunc:</p>
<pre><strong>q=how now brown cow&amp;bq={!query v=$datefunc}</strong></pre>
<p>And the defaults for the handler in solrconfig.xml would contain the actual definition of datefunc as a function query:</p>
<pre>&lt;lst name="defaults"&gt;
   &lt;str name="datefunc"&gt;{!func}recip(rord(date),1,1000,1000)&lt;/str&gt;
   [...]</pre>
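<p>For a feel of what that function does, Solr&#8217;s <code>recip(x,m,a,b)</code> computes <code>a / (m*x + b)</code>.  The sketch below just evaluates that formula for a few sample inputs; <code>RecipBoost</code> is a hypothetical class name, and the sample values standing in for <code>rord(date)</code> (assumed here to be 1 for the most recent date and larger for older ones) are made up for illustration.</p>

```java
public class RecipBoost {
    /** recip(x, m, a, b) = a / (m * x + b). */
    public static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    public static void main(String[] args) {
        // Sample reverse ordinals standing in for rord(date).
        for (int rord : new int[] { 1, 1000, 9000 }) {
            System.out.println(rord + " -> " + recip(rord, 1, 1000, 1000));
        }
    }
}
```

<p>With the constants from the config above, the boost stays close to 1 for the newest documents, falls to 0.5 at a reverse ordinal of 1000, and keeps shrinking for older ones.</p>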
<p>The same idea could be used to allow a client to switch between complex filters, without having to specify what those filters are.</p>
<p>Without the nested query parser type, it would only be possible to specify the query value in a separate place (via local params v=$param) not the type also.</p>
<h3>The Future</h3>
<p>An XML Query Parser is on the way via <a href="https://issues.apache.org/jira/browse/SOLR-839">SOLR-839</a> that will allow expressing arbitrarily complex Lucene queries in XML.  As the number of query parsers grows, being able to mix, match, and nest them will become increasingly important.  One of the first extensions to the XML query parser should be to hook in nested queries of course!</p>
<p>The subclasses of QParserPlugin show <a href="http://lucene.apache.org/solr/api/org/apache/solr/search/QParserPlugin.html" target="_blank">all of the query parsers currently available to Solr</a>.  If you can&#8217;t find the query parser you&#8217;re looking for, you can create your own and register it via solrconfig.xml!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
	</channel>
</rss>

