<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; qparser</title>
	<atom:link href="http://www.lucidimagination.com/blog/tag/qparser/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>What&#8217;s a &#8220;DisMax&#8221; ?</title>
		<link>http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/</link>
		<comments>http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/#comments</comments>
		<pubDate>Sun, 23 May 2010 21:52:52 +0000</pubDate>
		<dc:creator>hossman</dc:creator>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[dismax]]></category>
		<category><![CDATA[qparser]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=2066</guid>
		<description><![CDATA[The term "dismax" gets tossed around on the Solr lists frequently, which can be fairly confusing to new users.  Let's see if we can demystify it a bit.... ]]></description>
			<content:encoded><![CDATA[<p>
The term &#8220;dismax&#8221; gets tossed around on the Solr lists frequently, which can be fairly confusing to new users.  It originated as a shorthand name for the <a href="http://lucene.apache.org/solr/api/org/apache/solr/handler/DisMaxRequestHandler.html">DisMaxRequestHandler</a> (which I named after the <a href="http://lucene.apache.org/solr/api/org/apache/solr/util/SolrPluginUtils.DisjunctionMaxQueryParser.html">DisjunctionMaxQueryParser</a>, which I named after the <a href="http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/search/DisjunctionMaxQuery.html">DisjunctionMaxQuery</a> class that it uses heavily).  In recent years, the DisMaxRequestHandler and the StandardRequestHandler were both refactored into a single SearchHandler class, and now the term &#8220;dismax&#8221; usually refers to the <a href="http://lucene.apache.org/solr/api/org/apache/solr/search/DisMaxQParser.html">DisMaxQParser</a>.
</p>
<p> Clear as mud, right?
</p>
<p> Regardless of whether you use the DisMaxRequestHandler via the <code>qt=dismax</code> parameter, or use the SearchHandler with the DisMaxQParser via <code>defType=dismax</code>, the end result is that your <code>q</code> parameter gets parsed by the DisjunctionMaxQueryParser.
</p>
<p> The <a href="http://search.lucidimagination.com/search/document/dfd8d8739ebc00f4/two_solr_announcements_cnet_product_search_and_dismax">original goals</a> of dismax (whichever meaning you might infer) have never changed:
</p>
<blockquote><p> &#8230; supports a simplified version of the Lucene QueryParser syntax. Quotes can be used to group phrases, and +/- can be used to denote mandatory and optional clauses &#8230; but all other Lucene query parser special characters are escaped to simplify the user experience. The handler takes responsibility for building a good query from the user&#8217;s input using BooleanQueries containing DisjunctionMaxQueries across fields and boosts you specify. It also allows you to provide additional boosting queries, boosting functions, and filtering queries to artificially affect the outcome of all searches. These options can all be specified as default parameters for the handler in your solrconfig.xml or overridden in the Solr query URL.
</p></blockquote>
<p> In short: You worry about what fields and boosts you want to use when you configure it, your users just give you words w/o worrying too much about syntax.
</p>
<p> The magic of dismax (in my opinion) comes from the query structure it produces.  What it essentially boils down to is <a href="http://en.wikipedia.org/wiki/Matrix_multiplication#Relationship_with_the_inner_product_and_the_outer_product">matrix multiplication</a>: a one column matrix of each &#8220;chunk&#8221; of your user&#8217;s input, multiplied by a one row matrix of the <code>qf</code> fields to produce a big matrix of every field:chunk combination.  The matrix is then turned into a BooleanQuery consisting of one DisjunctionMaxQuery for each row in the matrix.  DisjunctionMaxQuery is used because its score is determined by the maximum score of its subclauses (instead of the sum, as in a BooleanQuery), so no one word from the user input dominates the final score.  The best way to explain this is with an example, so let&#8217;s consider the following input&#8230;
</p>
<pre>
defType = dismax
     mm = 50%
     qf = features^2 name^3
      q = +"apache solr" search server
</pre>
<p> First off, we consider the &#8220;markup&#8221; characters of the parser that appear in this <code>q</code> string:
</p>
<ul>
<li>white space &#8211; divides the input string into chunks</li>
<li>quotes &#8211; makes a single phrase chunk</li>
<li>+ &#8211; makes a chunk mandatory</li>
</ul>
<p> So we have 3 &#8220;chunks&#8221; of user input:
</p>
<ul>
<li>&#8220;apache solr&#8221; (must match)</li>
<li>&#8220;search&#8221; (should match)</li>
<li>&#8220;server&#8221; (should match)</li>
</ul>
<p> If we &#8220;multiply&#8221; that with our <code>qf</code> list <code>(features, name)</code> we get a matrix like this&#8230;
</p>
<table>
<tr>
<td style="padding: 1em; border: 1px dotted">features:"apache solr"</td>
<td style="padding: 1em; border: 1px dotted">name:"apache solr"</td>
<td style="padding: 1em; border: none">(must match)</td>
</tr>
<tr>
<td style="padding: 1em; border: 1px dotted">features:"search"</td>
<td style="padding: 1em; border: 1px dotted">name:"search"</td>
<td style="padding: 1em; border: none">(should match)</td>
</tr>
<tr>
<td style="padding: 1em; border: 1px dotted">features:"server"</td>
<td style="padding: 1em; border: 1px dotted">name:"server"</td>
<td style="padding: 1em; border: none">(should match)</td>
</tr>
</table>
<p> If we then factor in the <code>mm</code> param to determine the &#8220;minimum number of &#8216;ShouldMatch&#8217; clauses that (ahem) must match&#8221; (50% of 2 == 1) we get the following query structure (in pseudo-code)&#8230;
</p>
<pre>
q = BooleanQuery(
  minNumberShouldMatch => 1,
  booleanClauses => ClauseList(
    MustMatch(DisjunctionMaxQuery(
      PhraseQuery("features","apache solr")^2,
      PhraseQuery("name","apache solr")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("features","search")^2,
      TermQuery("name","search")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("features","server")^2,
      TermQuery("name","server")^3))
  ));
</pre>
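<p> The chunk-by-field &#8220;multiplication&#8221; above can be mimicked in a few lines of Python.  This is only a rough sketch of the idea, not Solr&#8217;s actual parser: the regular expression, the function names <code>parse_qf</code> and <code>build_structure</code>, and the choice to round a percentage <code>mm</code> down are all assumptions made for illustration, and prohibited (<code>-</code>) chunks are simply skipped.
</p>
<pre>
```python
import re

# Hypothetical chunk pattern: optional +/- prefix, then a quoted phrase or a bare word.
CHUNK_RE = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_qf(qf):
    """Split 'features^2 name^3' into [('features', '2'), ('name', '3')]."""
    fields = []
    for item in qf.split():
        name, _, boost = item.partition("^")
        fields.append((name, boost or "1"))
    return fields


def build_structure(q, qf, mm_percent):
    """Return (must_rows, should_rows, min_should_match) for a dismax-style query."""
    fields = parse_qf(qf)
    must_rows, should_rows = [], []
    for prefix, phrase, word in CHUNK_RE.findall(q):
        kind = "PhraseQuery" if phrase else "TermQuery"
        text = phrase or word
        # One row of the matrix: the same chunk against every qf field.
        row = ['%s("%s","%s")^%s' % (kind, name, text, boost) for name, boost in fields]
        if prefix == "+":
            must_rows.append(row)
        elif prefix == "":
            should_rows.append(row)
        # "-" (prohibited) chunks are ignored in this sketch.
    # A percentage mm applies to the optional clauses only; rounding down is assumed.
    min_should = len(should_rows) * mm_percent // 100
    return must_rows, should_rows, min_should


must, should, min_should = build_structure(
    '+"apache solr" search server', "features^2 name^3", 50
)
for row in must:
    print("MustMatch  ", row)
for row in should:
    print("ShouldMatch", row)
print("minNumberShouldMatch =", min_should)
```
</pre>
<p> Each printed row corresponds to one DisjunctionMaxQuery in the pseudo-code, and the final line is the 50% of 2 == 1 arithmetic.
</p>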
<p>
With me so far, right?
</p>
<p>
Where people tend to get tripped up is in thinking about how Solr&#8217;s per-field analysis configuration (in schema.xml) impacts all of this.  Our example above was pretty straightforward, but let&#8217;s consider for a moment what might happen if:
</p>
<ul>
<li>
    The <code>name</code> field uses the WordDelimiterFilter at query time but <code>features</code> does not.
  </li>
<li>
    The <code>features</code> field is configured so that &#8220;the&#8221; is a stopword, but <code>name</code> is not.
  </li>
</ul>
<p> Now let&#8217;s look at what we get when our input parameters are structurally similar to what we had before, but just different enough for WordDelimiterFilter and StopFilter to come into play&#8230;
</p>
<pre>
defType = dismax
     mm = 50%
     qf = features^2 name^3
      q = +"apache solr" the search-server
</pre>
<p>
  Our resulting query is going to be something like&#8230;
</p>
<pre>
q = BooleanQuery(
  minNumberShouldMatch => 1,
  booleanClauses => ClauseList(
    MustMatch(DisjunctionMaxQuery(
      PhraseQuery("features","apache solr")^2,
      PhraseQuery("name","apache solr")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("name","the")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("features","search-server")^2,
      PhraseQuery("name","search server")^3))
  ));
</pre>
<p> The use of WordDelimiterFilter hasn&#8217;t changed things very much: <code>features</code> is treating &#8220;search-server&#8221; as a single Term, while in the <code>name</code> field we are searching for the phrase &#8220;search server&#8221;.  Hopefully this shouldn&#8217;t surprise anyone given the use of WordDelimiterFilter for the <code>name</code> field (presumably that&#8217;s why it&#8217;s being used).  This DisjunctionMaxQuery still &#8220;makes sense&#8221;, but other fields with odd analysis that produce fewer or more Tokens than a &#8220;typical&#8221; field for the same chunk might produce queries that aren&#8217;t as easy to understand.  In particular, consider what has happened in our example with the word &#8220;the&#8221;:  Because &#8220;the&#8221; is a stop word in the <code>features</code> field, no Query object is produced for that field/chunk combination.  But a Query is produced for the <code>name</code> field, which means the total number of &#8220;ShouldMatch&#8221; clauses in our top level query is still 2, so our minNumberShouldMatch is still 1 (50% of 2 == 1).
</p>
<p> This type of situation tends to confuse a lot of people: since &#8220;the&#8221; is a stop word in one field, they don&#8217;t expect it to matter in the final query &#8212; but as long as at least one <code>qf</code> field produces a Token for it (<code>name</code> in our example) it will be included in the final query, and will contribute to the count of &#8220;ShouldMatch&#8221; clauses.
</p>
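<p> To see why the stop word still counts, here is a toy simulation in Python.  The two analyzer functions are hypothetical stand-ins (a hyphen split in place of WordDelimiterFilter, a one-word stop list in place of StopFilter), and rounding <code>mm</code> down is assumed; real Solr analysis chains do considerably more.
</p>
<pre>
```python
def analyze_features(chunk):
    # Hypothetical analysis for "features": "the" is a stop word, no word splitting.
    token = chunk.lower()
    return [] if token == "the" else [token]


def analyze_name(chunk):
    # Hypothetical analysis for "name": split on hyphens, no stop words.
    return [part for part in chunk.lower().split("-") if part]


ANALYZERS = {"features": analyze_features, "name": analyze_name}


def should_clauses(chunks):
    """Build one clause per chunk, keeping only the fields that produced tokens."""
    clauses = []
    for chunk in chunks:
        per_field = {}
        for field, analyze in ANALYZERS.items():
            tokens = analyze(chunk)
            if tokens:
                per_field[field] = tokens
        # A chunk disappears only if every field analyzes it away.
        if per_field:
            clauses.append(per_field)
    return clauses


clauses = should_clauses(["the", "search-server"])
min_should = len(clauses) * 50 // 100  # mm = 50%, rounding down assumed

for clause in clauses:
    print("ShouldMatch", clause)
print("minNumberShouldMatch =", min_should)
```
</pre>
<p> The chunk &#8220;the&#8221; survives with a single <code>name</code> subquery, so two optional clauses remain and 50% of them is still 1.
</p>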
<p> So, what&#8217;s the take away from all of this?
</p>
<p> DisMax is a complicated creature.  When using it, you need to consider <a href="http://wiki.apache.org/solr/DisMaxRequestHandler">all of its options</a> carefully, and look at the <code>debugQuery=true</code> output while experimenting with different query strings and different analysis configurations to make really sure you understand how queries from your users will be parsed.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Ranges over Functions in Solr 1.4</title>
		<link>http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/</link>
		<comments>http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/#comments</comments>
		<pubDate>Tue, 07 Jul 2009 01:43:02 +0000</pubDate>
		<dc:creator>yonik</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[frange]]></category>
		<category><![CDATA[function query]]></category>
		<category><![CDATA[qparser]]></category>
		<category><![CDATA[range filter]]></category>
		<category><![CDATA[range query]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=778</guid>
		<description><![CDATA[<p>Solr 1.4 contains a new feature that allows range queries or range filters over arbitrary functions.  It&#8217;s implemented as a standard <a href="http://lucene.apache.org/solr/api/org/apache/solr/search/FunctionRangeQParserPlugin.html">Solr QParser plugin</a>, and thus easily available for use any place that accepts the standard <a href="http://wiki.apache.org/solr/SolrQuerySyntax">Solr Query Syntax</a> by specifying the <strong>frange</strong> query type.  Here&#8217;s an example of a filter specifying the lower and upper bounds for a function:</p>
<p><code>fq={!frange l=0 u=2.2}log(sum(user_ranking,editor_ranking))</code></p>
<p>The other interesting use for frange is to trade off memory &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Solr 1.4 contains a new feature that allows range queries or range filters over arbitrary functions.  It&#8217;s implemented as a standard <a href="http://lucene.apache.org/solr/api/org/apache/solr/search/FunctionRangeQParserPlugin.html">Solr QParser plugin</a>, and thus easily available for use any place that accepts the standard <a href="http://wiki.apache.org/solr/SolrQuerySyntax">Solr Query Syntax</a> by specifying the <strong>frange</strong> query type.  Here&#8217;s an example of a filter specifying the lower and upper bounds for a function:</p>
<p><code>fq={!frange l=0 u=2.2}log(sum(user_ranking,editor_ranking))</code></p>
<p>The other interesting use for frange is to trade off memory for speed when doing range queries on any type of single-valued field.  For example, one can use <strong>frange</strong> on a string field provided that there is only one value per field, and that numeric functions are avoided.</p>
<p>For example, here is a filter that only allows authors between martin and rowling, specified using a standard range query:<br />
<code>fq=author_last_name:[martin TO rowling]</code></p>
<p>And the same filter using a function range query (<strong>frange</strong>):<br />
<code>fq={!frange l=martin u=rowling}author_last_name</code></p>
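<p>When sending such a filter from client code, the braces, spaces, and equals signs in the local-params prefix need URL encoding.  A minimal sketch using only the Python standard library follows; the match-all <code>q=*:*</code> is an assumption added for illustration, and no particular Solr host or path is implied.</p>
<pre>
```python
from urllib.parse import parse_qs, urlencode

# Request parameters; the match-all main query is an assumption for this sketch.
params = [
    ("q", "*:*"),
    ("fq", "{!frange l=martin u=rowling}author_last_name"),
]

# urlencode percent-escapes the characters that are not safe in a query string.
query_string = urlencode(params)
print(query_string)

# Decoding the query string gives back the original filter unchanged.
decoded = parse_qs(query_string)
print(decoded["fq"][0])
```
</pre>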
<p>This can lead to significant performance improvements for range queries with many terms between the endpoints, at the cost of memory to hold the un-inverted form of the field in memory (i.e. a FieldCache entry &#8211; same as would be used for sorting).  If the field in question is already being used for sorting or other function queries, there won&#8217;t be any additional memory overhead.</p>
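<p>The trade-off can be pictured with a toy model in Python.  This is a conceptual sketch only, not how Lucene or Solr implement either query: the inverted index is a plain dictionary, the un-inverted form is a plain list indexed by document id, and the sample data and function names are made up.</p>
<pre>
```python
from bisect import bisect_left, bisect_right

# Un-inverted form: one value per document, indexed by document id (made-up data).
values_by_doc = ["adams", "martin", "orwell", "rowling", "tolkien", "orwell"]

# Inverted form: term -> list of document ids, plus the sorted term list.
postings = {}
for doc_id, value in enumerate(values_by_doc):
    postings.setdefault(value, []).append(doc_id)
sorted_terms = sorted(postings)


def range_via_terms(lower, upper):
    """Visit every term between the endpoints; work grows with the number of terms."""
    start = bisect_left(sorted_terms, lower)
    end = bisect_right(sorted_terms, upper)
    matches = set()
    for term in sorted_terms[start:end]:
        matches.update(postings[term])
    return matches


def range_via_uninverted(lower, upper):
    """Check each document's value once; work grows with the number of documents."""
    return {
        doc_id
        for doc_id, value in enumerate(values_by_doc)
        if upper >= value >= lower
    }


print(sorted(range_via_terms("martin", "rowling")))
print(sorted(range_via_uninverted("martin", "rowling")))
```
</pre>
<p>Both functions return the same documents; the second does a fixed amount of work per document however many terms fall inside the range, but it needs the per-document array held in memory.</p>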
<p>The following table shows the results of a test of frange queries vs standard range queries on a string field with 200,000 unique values.  For example, frange was 14 times faster when executing a range query / range filter that covered 20% of the terms in the field.  For narrower ranges that matched less than 5% of the values, the traditional range query performed better.</p>
<table border="1">
<tbody>
<tr>
<th>Percent of terms covered</th>
<th>Fastest implementation</th>
<th>Speedup (how many times faster)</th>
</tr>
<tr>
<td>100%</td>
<td>frange</td>
<td>43.32</td>
</tr>
<tr>
<td>20%</td>
<td>frange</td>
<td>14.25</td>
</tr>
<tr>
<td>10%</td>
<td>frange</td>
<td>8.07</td>
</tr>
<tr>
<td>5%</td>
<td>frange</td>
<td>1.337</td>
</tr>
<tr>
<td>1%</td>
<td>normal range query</td>
<td>3.59</td>
</tr>
</tbody>
</table>
<p>Of course, Solr 1.4 also contains the new <a href="http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/">TrieRange</a> functionality that will generally have the best time/space profile for range queries over numeric fields.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

