<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; BoostingTermQuery</title>
	<atom:link href="http://www.lucidimagination.com/blog/category/boostingtermquery/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Getting Started with Payloads</title>
		<link>http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/</link>
		<comments>http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/#comments</comments>
		<pubDate>Wed, 05 Aug 2009 14:20:52 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[BoostingTermQuery]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Payloads]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=884</guid>
		<description><![CDATA[<p>Mark Miller recently <a href="http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/">posted</a> a brief intro to Span Queries, so I thought I would piggyback on top of his work and show how to get started with <a href="http://www.lucidimagination.com/search/s:lucid?q=payload">Payloads</a> (see also [1]).</p>
<h2>Introduction</h2>
<p>Like Spans, payloads involve the position of terms, but go one step further.  Namely, a Payload in Apache Lucene is an arbitrary byte array stored at a specific position (i.e. a specific token/term) in the index.  A payload can be used to &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Mark Miller recently <a href="http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/">posted</a> a brief intro to Span Queries, so I thought I would piggyback on top of his work and show how to get started with <a href="http://www.lucidimagination.com/search/s:lucid?q=payload">Payloads</a> (see also [1]).</p>
<h2>Introduction</h2>
<p>Like Spans, payloads involve the position of terms, but go one step further.  Namely, a Payload in Apache Lucene is an arbitrary byte array stored at a specific position (i.e. a specific token/term) in the index.  A payload can be used to store weights for specific terms or things like part-of-speech tags or other semantic information.  If you read Brin and Page&#8217;s (you know, the Google guys) original paper <a href="http://infolab.stanford.edu/~backrub/google.html">The Anatomy of a Large-Scale Hypertextual Web Search Engine</a>, they describe what is essentially payload functionality, whereby they store information about font, etc. at a specific position in the index (remember when you could get your pages ranked number one by using really big fonts?) and then use it in search.</p>
<p>There are three parts to taking advantage of payloads in Lucene.  Solr requires a fourth step, which I will explain in a moment.</p>
<ol>
<li>Add a Payload to one or more Tokens during indexing.</li>
<li>Override the Similarity class to handle scoring payloads.</li>
<li>Use a payload-aware Query during your search.</li>
</ol>
<p>For Solr, step 3 requires you to have your own Query Parser, as none of the existing Solr Query Parsers support the BoostingTermQuery.  Thus, the fourth step for Solr is to add a Query Parser that supports payloads (and Spans would be nice, too!  Please donate if you do this!).</p>
<h2>Adding Payloads during indexing</h2>
<p>(I&#8217;m using Lucene 2.9-dev)</p>
<p>I&#8217;m going to use the same indexing code I did for my post on <a href="http://www.lucidimagination.com/blog/2009/05/26/accessing-words-around-a-positional-match-in-lucene/">co-occurrence analysis</a>, but with a few modifications.</p>
<p>First off, I&#8217;m going to change Analyzers to one of my own creation:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">class</span> PayloadAnalyzer <span style="color: #000000; font-weight: bold;">extends</span> Analyzer <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">private</span> PayloadEncoder encoder<span style="color: #339933;">;</span>
&nbsp;
    PayloadAnalyzer<span style="color: #009900;">&#40;</span>PayloadEncoder encoder<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">encoder</span> <span style="color: #339933;">=</span> encoder<span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">public</span> TokenStream tokenStream<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> fieldName, <span style="color: #003399;">Reader</span> reader<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      TokenStream result <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> WhitespaceTokenizer<span style="color: #009900;">&#40;</span>reader<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      result <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> LowerCaseFilter<span style="color: #009900;">&#40;</span>result<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      result <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> DelimitedPayloadTokenFilter<span style="color: #009900;">&#40;</span>result, <span style="color: #0000ff;">'|'</span>, encoder<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #000000; font-weight: bold;">return</span> result<span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span></pre></div></div>

<p>In this Analyzer, I have the basic whitespace tokenizer and a lower case filter, and then I add the recently introduced DelimitedPayloadTokenFilter (DPTF).  The DPTF allows you to add payloads to tokens simply by marking up the tokens with a special character followed by the payload value.  For instance, I changed my sample docs from the co-occurrence example to now include payload information.  Specifically, I said that all nouns should be weighted by 10, all verbs by 5 and all adjectives by 2 (I used <a href="http://l2r.cs.uiuc.edu/~cogcomp/pos_demo.php">http://l2r.cs.uiuc.edu/~cogcomp/pos_demo.php</a> to tag the sentences; any errors are likely mine).  Everything else has no payload.  I also stripped all punctuation. My DOCS array now looks like:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> DOCS <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
           <span style="color: #0000ff;">&quot;The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the lazy|2.0 brown|2.0 dogs|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;The quick red fox jumped over the lazy brown dogs&quot;</span>,<span style="color: #666666; font-style: italic;">//no boosts</span>
          <span style="color: #0000ff;">&quot;The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the old|2.0 brown|2.0 box|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;Mary|10.0 had a little|2.0 lamb|10.0 whose fleece|10.0 was|5.0 white|2.0 as snow|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;Mary had a little lamb whose fleece was white as snow&quot;</span>,
          <span style="color: #0000ff;">&quot;Mary|10.0 takes on Wolf|10.0 Restoration|10.0 project|10.0 despite ties|10.0 to sheep|10.0 farming|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;Mary|10.0 who lives|5.0 on a farm|10.0 is|5.0 happy|2.0 that she|10.0 takes|5.0 a walk|10.0 every day|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;Moby|10.0 Dick|10.0 is|5.0 a story|10.0 of a whale|10.0 and a man|10.0 obsessed|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;The robber|10.0 wore|5.0 a black|2.0 fleece|10.0 jacket|10.0 and a baseball|10.0 cap|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;The English|10.0 Springer|10.0 Spaniel|10.0 is|5.0 the best|2.0 of all dogs|10.0&quot;</span>
  <span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span></pre></div></div>

<p>The DOCS array simply marks each noun, verb and adjective with a | (pipe symbol) and then a float indicating the boost.  I also added some docs that have no boosts at all to demonstrate the differences at query time.  The DPTF will then use this to encode the payloads using the PayloadEncoder. A PayloadEncoder is an interface that tells the DPTF how to convert the payload to a byte array. Also note that Lucene&#8217;s contrib/analysis package contains several other TokenFilters for adding payloads to a Token and, of course, you can write your own as well.  Furthermore, the PayloadHelper class can help encode/decode payloads for common types.</p>
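<p>To make the markup concrete, here is a minimal plain-Java sketch (no Lucene on the classpath) of what conceptually happens to a token such as <code>fox|10.0</code>: split at the delimiter, then turn the float into four bytes. The class and method names are hypothetical, and the big-endian layout is my assumption about how the float is serialized. In real code, let the DPTF and the FloatEncoder do this for you.</p>

```java
import java.nio.ByteBuffer;

// Hypothetical helper, for illustration only; not part of Lucene.
final class PayloadMarkupSketch {
  // Returns the term part of "term|payload", or the whole token if no delimiter is present.
  static String term(String token, char delimiter) {
    int idx = token.indexOf(delimiter);
    return idx == -1 ? token : token.substring(0, idx);
  }

  // Returns the payload as 4 bytes, or null when the token carries no payload.
  static byte[] payload(String token, char delimiter) {
    int idx = token.indexOf(delimiter);
    if (idx == -1) {
      return null;
    }
    float value = Float.parseFloat(token.substring(idx + 1));
    // ByteBuffer is big-endian by default (assumed layout).
    return ByteBuffer.allocate(4).putFloat(value).array();
  }

  public static void main(String[] args) {
    byte[] bytes = payload("fox|10.0", '|');
    // Prints "fox -> 10.0"
    System.out.println(term("fox|10.0", '|') + " -> " + ByteBuffer.wrap(bytes).getFloat());
  }
}
```

<p>Tokens without the delimiter, such as <code>over</code>, come back with a null payload, which mirrors the &#8220;everything else has no payload&#8221; rule used for the sample docs.</p>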
<h2>Overriding the Similarity Class</h2>
<p>The next step, which should happen before indexing, is to override the Similarity class to handle payloads.  While it isn&#8217;t strictly required that this happen before indexing in <strong>THIS</strong> case, it is a good habit in case you have made other changes to the Similarity class that are required during indexing (such as overriding how norms are encoded).</p>
<p>Overriding the Similarity is done on both the IndexWriter and the IndexSearcher.  See [3] below for the full code, including the calls to set the similarity.  My Similarity implementation simply converts the byte array to a float and returns the float, as in:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">class</span> PayloadSimilarity <span style="color: #000000; font-weight: bold;">extends</span> DefaultSimilarity <span style="color: #009900;">&#123;</span>
    @Override
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">float</span> scorePayload<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> fieldName, <span style="color: #000066; font-weight: bold;">byte</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> bytes, <span style="color: #000066; font-weight: bold;">int</span> offset, <span style="color: #000066; font-weight: bold;">int</span> length<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #000000; font-weight: bold;">return</span> PayloadHelper.<span style="color: #006633;">decodeFloat</span><span style="color: #009900;">&#40;</span>bytes, offset<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><span style="color: #666666; font-style: italic;">//we can ignore length here, because we know it is encoded as 4 bytes</span>
    <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>
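<p>If you are curious what the decoding amounts to, the following standalone sketch reads a 4-byte float starting at an offset and falls back to a neutral 1.0 when the payload is missing or too short. The class name, the big-endian byte order and the fallback rule are assumptions made for illustration, not the PayloadHelper source.</p>

```java
// Hypothetical stand-in for a payload-to-score function; not Lucene code.
final class PayloadScoreSketch {
  static float score(byte[] bytes, int offset, int length) {
    if (bytes == null || length < 4) {
      return 1.0f; // assumed neutral factor when there is nothing to decode
    }
    // Assemble four bytes, most significant first (assumed byte order).
    int bits = ((bytes[offset] & 0xFF) << 24)
        | ((bytes[offset + 1] & 0xFF) << 16)
        | ((bytes[offset + 2] & 0xFF) << 8)
        | (bytes[offset + 3] & 0xFF);
    return Float.intBitsToFloat(bits);
  }

  public static void main(String[] args) {
    // 0x41200000 is the IEEE 754 bit pattern for 10.0f; the leading pad byte exercises the offset.
    byte[] buffer = {0, 0x41, 0x20, 0x00, 0x00};
    System.out.println(score(buffer, 1, 4)); // 10.0
    System.out.println(score(null, 0, 0));   // 1.0
  }
}
```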

<h2>Executing the Query</h2>
<p>Currently, Lucene has one payload-aware Query called the BoostingTermQuery (BTQ for short; see [2] for another payload-aware query that may be in Lucene 2.9), which can be used just like any other query.  For instance:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">IndexSearcher searcher <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> IndexSearcher<span style="color: #009900;">&#40;</span>dir, <span style="color: #000066; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
searcher.<span style="color: #006633;">setSimilarity</span><span style="color: #009900;">&#40;</span>payloadSimilarity<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
BoostingTermQuery btq <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> BoostingTermQuery<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Term<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;body&quot;</span>, <span style="color: #0000ff;">&quot;fox&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
TopDocs topDocs <span style="color: #339933;">=</span> searcher.<span style="color: #006633;">search</span><span style="color: #009900;">&#40;</span>btq, <span style="color: #cc66cc;">10</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> topDocs.<span style="color: #006633;">scoreDocs</span>.<span style="color: #006633;">length</span><span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
   ScoreDoc doc <span style="color: #339933;">=</span> topDocs.<span style="color: #006633;">scoreDocs</span><span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
   <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Doc: &quot;</span> <span style="color: #339933;">+</span> doc.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
   <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Explain: &quot;</span> <span style="color: #339933;">+</span> searcher.<span style="color: #006633;">explain</span><span style="color: #009900;">&#40;</span>btq, doc.<span style="color: #006633;">doc</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>In this example, I create the BTQ and hand it to the searcher and then print out the results.  Easy peasy, yet so powerful.</p>
<p>Running this yields:</p>
<blockquote><p>&#8212;&#8212;&#8212;&#8211;<br />
Results for body:fox of type: org.apache.lucene.search.payloads.BoostingTermQuery<br />
Doc: doc=0 score=4.2344446<br />
Explain: 4.234444 = (MATCH) fieldWeight(body:fox in 0), product of:<br />
7.071068 = (MATCH) btq, product of:<br />
0.70710677 = tf(phraseFreq=0.5)<br />
10.0 = scorePayload(&#8230;)<br />
1.9162908 = idf(body: fox=3)<br />
0.3125 = fieldNorm(field=body, doc=0)</p>
<p>Doc: doc=2 score=4.2344446<br />
Explain: 4.234444 = (MATCH) fieldWeight(body:fox in 2), product of:<br />
7.071068 = (MATCH) btq, product of:<br />
0.70710677 = tf(phraseFreq=0.5)<br />
10.0 = scorePayload(&#8230;)<br />
1.9162908 = idf(body: fox=3)<br />
0.3125 = fieldNorm(field=body, doc=2)</p>
<p>Doc: doc=1 score=0.42344445<br />
Explain: 0.42344445 = (MATCH) fieldWeight(body:fox in 1), product of:<br />
0.70710677 = (MATCH) btq, product of:<br />
0.70710677 = tf(phraseFreq=0.5)<br />
1.0 = scorePayload(&#8230;)<br />
1.9162908 = idf(body: fox=3)<br />
0.3125 = fieldNorm(field=body, doc=1)</p></blockquote>
<p>Notice how Doc 0 and Doc 2, whose &#8220;fox&#8221; token carries a payload of 10.0, rank ahead of Doc 1, whose &#8220;fox&#8221; has no payload (its scorePayload factor is 1.0), even though all three have the same term frequency and length.</p>
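<p>The numbers in the explanation can be reproduced by hand. The sketch below simply multiplies the four factors reported for a document; the class and method names are mine, and the factor values are copied from the output above rather than computed from an index.</p>

```java
// Recomputes scores from the factors printed by explain(); illustration only.
final class BtqScoreSketch {
  static float score(float tf, float payload, float idf, float fieldNorm) {
    return tf * payload * idf * fieldNorm;
  }

  public static void main(String[] args) {
    float tf = (float) Math.sqrt(0.5); // tf(phraseFreq=0.5), about 0.7071
    float withPayload = score(tf, 10.0f, 1.9162908f, 0.3125f);
    float withoutPayload = score(tf, 1.0f, 1.9162908f, 0.3125f);
    System.out.println(withPayload);    // roughly 4.2344 (docs 0 and 2)
    System.out.println(withoutPayload); // roughly 0.4234 (doc 1)
  }
}
```

<p>The payload is the only factor that differs between the documents, so the tenfold gap in score comes entirely from scorePayload.</p>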
<p>Running a simple TermQuery (ignoring payloads) with the exact same Term, on the other hand, yields:</p>
<blockquote><p>&#8212;&#8212;&#8212;&#8211;<br />
Results for body:fox of type: org.apache.lucene.search.TermQuery<br />
Doc: doc=0 score=0.59884083<br />
Explain: 0.59884083 = (MATCH) fieldWeight(body:fox in 0), product of:<br />
1.0 = tf(termFreq(body:fox)=1)<br />
1.9162908 = idf(docFreq=3, numDocs=10)<br />
0.3125 = fieldNorm(field=body, doc=0)</p>
<p>Doc: doc=1 score=0.59884083<br />
Explain: 0.59884083 = (MATCH) fieldWeight(body:fox in 1), product of:<br />
1.0 = tf(termFreq(body:fox)=1)<br />
1.9162908 = idf(docFreq=3, numDocs=10)<br />
0.3125 = fieldNorm(field=body, doc=1)</p>
<p>Doc: doc=2 score=0.59884083<br />
Explain: 0.59884083 = (MATCH) fieldWeight(body:fox in 2), product of:<br />
1.0 = tf(termFreq(body:fox)=1)<br />
1.9162908 = idf(docFreq=3, numDocs=10)<br />
0.3125 = fieldNorm(field=body, doc=2)</p></blockquote>
<p>As you can see, in the TermQuery case, all the docs are scored exactly the same.</p>
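<p>The idf value shared by all of these explanations is consistent with the classic formula 1 + ln(numDocs / (docFreq + 1)), which is what I understand DefaultSimilarity to use. Here is a small standalone check, with hypothetical names, using the docFreq=3 and numDocs=10 reported above:</p>

```java
// Standalone arithmetic check; the formula is assumed to match DefaultSimilarity.
final class TermQueryScoreSketch {
  static float idf(int docFreq, int numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
  }

  public static void main(String[] args) {
    float idf = idf(3, 10);             // about 1.9163
    float score = 1.0f * idf * 0.3125f; // tf * idf * fieldNorm, about 0.5988
    System.out.println(idf + " " + score);
  }
}
```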
<h2>Next Steps</h2>
<p>As the examples above show, getting started with Payloads is pretty easy.  In reality, the only hard part is determining what exactly to put in your payload and then how it should factor into your score.  Lucene takes care of the rest.  Open source tools like UIMA and OpenNLP, as well as offerings from proprietary vendors, can often be used to provide higher-level lexical, syntactic and semantic information about tokens, thus giving you the power to create very expressive payloads and richer search applications.</p>
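<p>As a starting point for wiring a tagger into this setup, the sketch below turns already-tagged words into the <code>term|weight</code> markup used in the DOCS array. The class, the method names and the tag labels (NOUN, VERB, ADJ) are made up for the example; the weights follow the 10/5/2 scheme from this post. A real tagger will have its own tag set and API.</p>

```java
// Hypothetical glue code: turns tagged words into "term|weight" markup.
final class PosMarkupSketch {
  // Weights follow the scheme used in this post; tag names are invented for the example.
  static String weightFor(String tag) {
    switch (tag) {
      case "NOUN": return "10.0";
      case "VERB": return "5.0";
      case "ADJ":  return "2.0";
      default:     return null; // everything else gets no payload
    }
  }

  static String markUp(String[] words, String[] tags) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < words.length; i++) {
      if (i != 0) {
        sb.append(' ');
      }
      sb.append(words[i]);
      String weight = weightFor(tags[i]);
      if (weight != null) {
        sb.append('|').append(weight);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String[] words = {"The", "quick", "red", "fox", "jumped"};
    String[] tags = {"DET", "ADJ", "ADJ", "NOUN", "VERB"};
    // Prints: The quick|2.0 red|2.0 fox|10.0 jumped|5.0
    System.out.println(markUp(words, tags));
  }
}
```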
<p>[1] See Michael Busch&#8217;s talk at the last SF Meetup for more details on payloads: <a href="http://www.meetup.com/SFBay-Lucene-Solr-Meetup/files/">http://www.meetup.com/SFBay-Lucene-Solr-Meetup/files/</a></p>
<p>[2] <a href="https://issues.apache.org/jira/browse/LUCENE-1341">https://issues.apache.org/jira/browse/LUCENE-1341</a></p>
<p>[3] Full class:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">package</span> <span style="color: #006699;">com.lucidimagination.noodles</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">junit.framework.TestCase</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.store.Directory</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.store.RAMDirectory</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.index.IndexWriter</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.index.Term</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.document.Document</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.document.Field</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.Analyzer</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.TokenStream</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.WhitespaceTokenizer</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.LowerCaseFilter</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.payloads.PayloadEncoder</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.payloads.FloatEncoder</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.payloads.PayloadHelper</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.DefaultSimilarity</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.IndexSearcher</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.TopDocs</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.ScoreDoc</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.Query</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.TermQuery</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.payloads.BoostingTermQuery</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.spans.SpanNearQuery</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.spans.SpanQuery</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.Reader</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.IOException</span><span style="color: #339933;">;</span>
&nbsp;
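<span style="color: #008000; font-style: italic; font-weight: bold;">/**
 * Indexes a handful of documents whose tokens carry float payloads and
 * compares BoostingTermQuery scoring against a plain TermQuery.
 **/</span>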
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> PayloadTest <span style="color: #000000; font-weight: bold;">extends</span> TestCase <span style="color: #009900;">&#123;</span>
  Directory dir<span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> DOCS <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
          <span style="color: #0000ff;">&quot;The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the lazy|2.0 brown|2.0 dogs|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;The quick red fox jumped over the lazy brown dogs&quot;</span>,<span style="color: #666666; font-style: italic;">//no boosts</span>
          <span style="color: #0000ff;">&quot;The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the old|2.0 brown|2.0 box|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;Mary|10.0 had a little|2.0 lamb|10.0 whose fleece|10.0 was|5.0 white|2.0 as snow|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;Mary had a little lamb whose fleece was white as snow&quot;</span>,
          <span style="color: #0000ff;">&quot;Mary|10.0 takes on Wolf|10.0 Restoration|10.0 project|10.0 despite ties|10.0 to sheep|10.0 farming|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;Mary|10.0 who lives|5.0 on a farm|10.0 is|5.0 happy|2.0 that she|10.0 takes|5.0 a walk|10.0 every day|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;Moby|10.0 Dick|10.0 is|5.0 a story|10.0 of a whale|10.0 and a man|10.0 obsessed|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;The robber|10.0 wore|5.0 a black|2.0 fleece|10.0 jacket|10.0 and a baseball|10.0 cap|10.0&quot;</span>,
          <span style="color: #0000ff;">&quot;The English|10.0 Springer|10.0 Spaniel|10.0 is|5.0 the best|2.0 of all dogs|10.0&quot;</span>
  <span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
  <span style="color: #000000; font-weight: bold;">protected</span> PayloadSimilarity payloadSimilarity<span style="color: #339933;">;</span>
&nbsp;
  @Override
  <span style="color: #000000; font-weight: bold;">protected</span> <span style="color: #000066; font-weight: bold;">void</span> setUp<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">Exception</span> <span style="color: #009900;">&#123;</span>
    dir <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> RAMDirectory<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    PayloadEncoder encoder <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> FloatEncoder<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    IndexWriter writer <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> IndexWriter<span style="color: #009900;">&#40;</span>dir, <span style="color: #000000; font-weight: bold;">new</span> PayloadAnalyzer<span style="color: #009900;">&#40;</span>encoder<span style="color: #009900;">&#41;</span>, <span style="color: #000066; font-weight: bold;">true</span>, IndexWriter.<span style="color: #006633;">MaxFieldLength</span>.<span style="color: #006633;">UNLIMITED</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    payloadSimilarity <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> PayloadSimilarity<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    writer.<span style="color: #006633;">setSimilarity</span><span style="color: #009900;">&#40;</span>payloadSimilarity<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> DOCS.<span style="color: #006633;">length</span><span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #003399;">Document</span> doc <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Document</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #003399;">Field</span> id <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Field</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;id&quot;</span>, <span style="color: #0000ff;">&quot;doc_&quot;</span> <span style="color: #339933;">+</span> i, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Store</span>.<span style="color: #006633;">YES</span>, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Index</span>.<span style="color: #006633;">NOT_ANALYZED_NO_NORMS</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      doc.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>id<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #666666; font-style: italic;">//Analyze the body so the DelimitedPayloadTokenFilter can attach payloads to its tokens</span>
      <span style="color: #003399;">Field</span> text <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Field</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;body&quot;</span>, DOCS<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span>, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Store</span>.<span style="color: #006633;">NO</span>, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Index</span>.<span style="color: #006633;">ANALYZED</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      doc.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>text<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      writer.<span style="color: #006633;">addDocument</span><span style="color: #009900;">&#40;</span>doc<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
    writer.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> testPayloads<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">Exception</span> <span style="color: #009900;">&#123;</span>
    IndexSearcher searcher <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> IndexSearcher<span style="color: #009900;">&#40;</span>dir, <span style="color: #000066; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    searcher.<span style="color: #006633;">setSimilarity</span><span style="color: #009900;">&#40;</span>payloadSimilarity<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><span style="color: #666666; font-style: italic;">//set the similarity.  Very important: the default scorePayload returns 1, so payloads would otherwise not affect the score</span>
    BoostingTermQuery btq <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> BoostingTermQuery<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Term<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;body&quot;</span>, <span style="color: #0000ff;">&quot;fox&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    TopDocs topDocs <span style="color: #339933;">=</span> searcher.<span style="color: #006633;">search</span><span style="color: #009900;">&#40;</span>btq, <span style="color: #cc66cc;">10</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    printResults<span style="color: #009900;">&#40;</span>searcher, btq, topDocs<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    TermQuery tq <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> TermQuery<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Term<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;body&quot;</span>, <span style="color: #0000ff;">&quot;fox&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    topDocs <span style="color: #339933;">=</span> searcher.<span style="color: #006633;">search</span><span style="color: #009900;">&#40;</span>tq, <span style="color: #cc66cc;">10</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    printResults<span style="color: #009900;">&#40;</span>searcher, tq, topDocs<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000066; font-weight: bold;">void</span> printResults<span style="color: #009900;">&#40;</span>IndexSearcher searcher, Query query, TopDocs topDocs<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;-----------&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Results for &quot;</span> <span style="color: #339933;">+</span> query <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot; of type: &quot;</span> <span style="color: #339933;">+</span> query.<span style="color: #006633;">getClass</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">getName</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> topDocs.<span style="color: #006633;">scoreDocs</span>.<span style="color: #006633;">length</span><span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      ScoreDoc doc <span style="color: #339933;">=</span> topDocs.<span style="color: #006633;">scoreDocs</span><span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
      <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Doc: &quot;</span> <span style="color: #339933;">+</span> doc.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Explain: &quot;</span> <span style="color: #339933;">+</span> searcher.<span style="color: #006633;">explain</span><span style="color: #009900;">&#40;</span>query, doc.<span style="color: #006633;">doc</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">class</span> PayloadSimilarity <span style="color: #000000; font-weight: bold;">extends</span> DefaultSimilarity <span style="color: #009900;">&#123;</span>
    @Override
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">float</span> scorePayload<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> fieldName, <span style="color: #000066; font-weight: bold;">byte</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> bytes, <span style="color: #000066; font-weight: bold;">int</span> offset, <span style="color: #000066; font-weight: bold;">int</span> length<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #000000; font-weight: bold;">return</span> PayloadHelper.<span style="color: #006633;">decodeFloat</span><span style="color: #009900;">&#40;</span>bytes, offset<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><span style="color: #666666; font-style: italic;">//we can ignore length here, because we know it is encoded as 4 bytes</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">class</span> PayloadAnalyzer <span style="color: #000000; font-weight: bold;">extends</span> Analyzer <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">private</span> PayloadEncoder encoder<span style="color: #339933;">;</span>
&nbsp;
    PayloadAnalyzer<span style="color: #009900;">&#40;</span>PayloadEncoder encoder<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">encoder</span> <span style="color: #339933;">=</span> encoder<span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">public</span> TokenStream tokenStream<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> fieldName, <span style="color: #003399;">Reader</span> reader<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      TokenStream result <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> WhitespaceTokenizer<span style="color: #009900;">&#40;</span>reader<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      result <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> LowerCaseFilter<span style="color: #009900;">&#40;</span>result<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      result <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> DelimitedPayloadTokenFilter<span style="color: #009900;">&#40;</span>result, <span style="color: #0000ff;">'|'</span>, encoder<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #000000; font-weight: bold;">return</span> result<span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
	</channel>
</rss>

