<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; term vectors</title>
	<atom:link href="http://www.lucidimagination.com/blog/category/term-vectors/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Integrating Apache Mahout with Apache Lucene and Solr &#8211; Part I (of 3)</title>
		<link>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/</link>
		<comments>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/#comments</comments>
		<pubDate>Tue, 16 Mar 2010 15:35:48 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[term vectors]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1851</guid>
		<description><![CDATA[<h1>Introduction</h1>
<p>As <a href="http://lucene.apache.org/mahout">Apache Mahout</a> is about to release its next version (0.3), I thought I would share some thoughts on how it might be integrated with <a href="http://lucene.apache.org/java">Apache Lucene</a> and <a href="http://lucene.apache.org/solr">Apache Solr</a>.  For those who aren&#8217;t aware of Mahout, it is an <a href="http://www.apache.org/">ASF</a> project building out a library of machine learning algorithms that are designed to be scalable (often via Apache Hadoop) and licensed under the Apache Software License (i.e., commercially friendly).  Mahout has a &#8230;</p>]]></description>
			<content:encoded><![CDATA[<h1>Introduction</h1>
<p>As <a href="http://lucene.apache.org/mahout">Apache Mahout</a> is about to release its next version (0.3), I thought I would share some thoughts on how it might be integrated with <a href="http://lucene.apache.org/java">Apache Lucene</a> and <a href="http://lucene.apache.org/solr">Apache Solr</a>.  For those who aren&#8217;t aware of Mahout, it is an <a href="http://www.apache.org/">ASF</a> project building out a library of machine learning algorithms that are designed to be scalable (often via Apache Hadoop) and licensed under the Apache Software License (i.e., commercially friendly).  Mahout has a variety of algorithms already implemented, ranging from clustering to classification and collaborative filtering.  For more on Mahout, see <a href="http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/">my TriJUG talk</a> or my <a href="https://www.ibm.com/developerworks/java/library/j-mahout/index.html">developerWorks article</a>.  Instead of going over the litany of things implemented in Mahout, I&#8217;ll give a quick recap of what the primary features of 0.3 are:</p>
<ol>
<li>New math, collections modules based on the time tested Colt project</li>
<li>LLR (Log-likelihood ratio &#8211; See Lucid Imagination advisor Ted Dunning&#8217;s <a href="http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html">blog entry</a> for more info) co-location implementation</li>
<li>Hadoop-based Lanczos SVD (Singular Value Decomposition) solver &#8212; good for feature reduction, which is a common requirement at scale</li>
<li>Shell scripts for easier running of algorithms, examples</li>
<li>Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning</li>
<li>Parallel Dirichlet process clustering (model-based clustering algorithm)</li>
<li>Parallel co-occurrence based recommender</li>
<li>Code cleanup, many bug fixes and performance improvements</li>
<li>A new Logo:<a href="http://www.lucidimagination.com/blog/wp-content/uploads/2010/03/mahout.png"><img class="alignnone size-medium wp-image-1852" title="mahout" src="http://www.lucidimagination.com/blog/wp-content/uploads/2010/03/mahout-300x126.png" alt="" width="300" height="126" /></a></li>
</ol>
<p>Enough of the background; let&#8217;s get to what we can do right now.  I&#8217;ll break it down into three groups:</p>
<ol>
<li>Lucene/Solr as a Data Source for Mahout batch processing</li>
<li>Document/Results Augmentation (clustering, classification, recommendations)</li>
<li>Learning about your data and your users (log analysis with Apache Mahout)</li>
</ol>
<p>In Part I (this post), I&#8217;m going to focus on #1 as a way for people to get started without having to do any coding.  In Part II, I&#8217;ll focus on #2 and finally, as you might guess, Part III will focus on #3.</p>
<h1>Lucene/Solr as a Data Source for Mahout</h1>
<p>Most Apache Mahout algorithms run off of Feature Vectors.  For those in the Lucene world, a feature vector should feel very familiar.  It is, more or less, a document, or some subset of a document.  Specifically, a feature vector is a tuple of features that are useful for the algorithm.  It is up to you to determine what features work best.  In many cases for Mahout, a vector is simply a tuple of weights for each of the words in a document.  In other cases, the features might be the values from the output of some manufacturing process.  Do note that the features for having good search capabilities are often different from those needed for good machine learning.  For instance, in my experiments with Mahout&#8217;s clustering capabilities, I need far more aggressive stopword removal to get good results than I do for search.  (In fact, for search these days, I often don&#8217;t even remove stopwords, but instead deal with them at query time, but that is a whole other post.)</p>
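<p>To make the idea of a &#8220;tuple of weights for each of the words&#8221; concrete, here is a minimal sketch in plain Java.  It is not Mahout or Lucene code: the class name TermFrequencyVector, the build method, the regular-expression tokenizer and the stopword set are all assumptions made for illustration, and the weights are raw term counts only.</p>

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper, not part of Mahout or Lucene: a raw term-frequency feature vector. */
class TermFrequencyVector {

  /** Maps each non-stopword term in the text to the number of times it occurs. */
  static Map<String, Integer> build(String text, Set<String> stopwords) {
    Map<String, Integer> weights = new TreeMap<>();
    // Naive tokenization (an assumption): lowercase, split on anything that is not a letter or digit.
    for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
      if (token.isEmpty() || stopwords.contains(token)) {
        continue; // drop empty fragments and stopwords
      }
      weights.merge(token, 1, Integer::sum);
    }
    return weights;
  }

  public static void main(String[] args) {
    Set<String> stopwords = Set.of("the", "a", "of");
    System.out.println(build("The quick red fox and the lazy red dogs", stopwords));
  }
}
```

<p>Growing the stopword set is one simple way to experiment with the more aggressive stopword removal mentioned above.</p>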
<p>There are two different ways for Mahout to use Lucene/Solr as a data source:</p>
<ol>
<li>Utilize Lucene&#8217;s term vector capability to create Mahout feature vectors.</li>
<li>Programmatically access low level Lucene features like TermEnum, TermDocs, TermPositions, etc. to construct features.</li>
</ol>
<p>For this post, I&#8217;m going to focus on #1, as I have yet to even have a need for #2, even though in theory it could be done.</p>
<h2>Mahout Vectors from Lucene Term Vectors</h2>
<p>In order for Mahout to create vectors from a Lucene index, the first and foremost thing that must be done is that the index must contain Term Vectors.  A term vector is a document centric view of the terms and their frequencies (as opposed to the inverted index, which is a term centric view) and is not on by default.</p>
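<p>The difference between the two views can be sketched with ordinary Java collections.  This is only an illustration under stated assumptions: IndexViews and its two methods are hypothetical names, the tokenizer is a naive split on non-word characters, and none of it reflects how Lucene actually stores either structure.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration only; Lucene's real structures are far more compact. */
class IndexViews {

  /** Term-centric view (inverted index): term -> ids of the documents that contain it. */
  static Map<String, Set<Integer>> invertedIndex(String[] docs) {
    Map<String, Set<Integer>> index = new TreeMap<>();
    for (int docId = 0; docId < docs.length; docId++) {
      for (String term : docs[docId].toLowerCase().split("\\W+")) {
        if (term.isEmpty()) continue;
        index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
      }
    }
    return index;
  }

  /** Document-centric view (term vectors): one (term -> frequency) map per document. */
  static List<Map<String, Integer>> termVectors(String[] docs) {
    List<Map<String, Integer>> vectors = new ArrayList<>();
    for (String doc : docs) {
      Map<String, Integer> vector = new TreeMap<>();
      for (String term : doc.toLowerCase().split("\\W+")) {
        if (term.isEmpty()) continue;
        vector.merge(term, 1, Integer::sum);
      }
      vectors.add(vector);
    }
    return vectors;
  }
}
```

<p>Answering &#8220;which documents contain this term?&#8221; is a single lookup in the first structure, while &#8220;which terms does this document contain, and how often?&#8221; is a single lookup in the second, which is the question Mahout needs answered for every document.</p>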
<p>For this example, I&#8217;m going to use Solr&#8217;s example, located in &lt;Solr Home&gt;/example</p>
<p>In Solr, storing Term Vectors is as simple as setting termVectors=&#8221;true&#8221; on the field in the schema, as in:</p>
<blockquote><p>&lt;field name=&#8221;text&#8221; type=&#8221;text&#8221; indexed=&#8221;true&#8221; stored=&#8221;true&#8221; termVectors=&#8221;true&#8221;/&gt;</p></blockquote>
<p>For pure Lucene, you will need to set the TermVector option on during Field creation, as in:</p>
<blockquote><p>Field fld = new Field(&#8220;text&#8221;, &#8220;foo&#8221;, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES);</p></blockquote>
<p>From here, it&#8217;s as simple as pointing Mahout&#8217;s new shell script (try running &lt;MAHOUT HOME&gt;/bin/mahout for a full listing of its capabilities) at the index and letting it rip:</p>
<blockquote><p>&lt;MAHOUT HOME&gt;/bin/mahout lucene.vector &#8211;dir &lt;PATH TO INDEX&gt;/example/solr/data/index/ &#8211;output /tmp/foo/part-out.vec &#8211;field title-clustering &#8211;idField id &#8211;dictOut /tmp/foo/dict.out &#8211;norm 2</p></blockquote>
<p>A few things to note about this command:</p>
<ol>
<li>This outputs a single vector file, named part-out.vec, to the /tmp/foo directory.</li>
<li>It uses the title-clustering field.  If you want a combination of fields, then you will have to create a single &#8220;merged&#8221; field containing those fields.  Solr&#8217;s &lt;copyField&gt; syntax can make this easy.</li>
<li>The idField is used to provide a label to the Mahout vector such that the output from Mahout&#8217;s algorithms can be traced back to the actual documents.</li>
<li>The &#8211;dictOut outputs the list of terms that are represented in the Mahout vectors.  Mahout uses an internal, sparse vector representation for text documents (dense vector representations are also available) so this file contains the &#8220;key&#8221; for making sense of the vectors later.  As an aside, if you ever have problems with Mahout, you can often share your vectors with the list and simply keep the dictionary to yourself, since it would be pretty difficult (not sure if it is impossible) to reverse engineer just the vectors.</li>
<li>The &#8211;norm tells Mahout how to <a href="http://en.wikipedia.org/wiki/Lp_space">normalize</a> the vector.  For many Mahout applications, normalization is a necessary process for obtaining good results.  In this case, I am using the Euclidean distance (aka the 2-norm) to normalize the vector because I intend to cluster the documents using the Euclidean distance similarity.  Other approaches may require other norms.</li>
</ol>
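<p>As a rough picture of what normalizing by a p-norm means, the sketch below scales a vector so that its p-norm is 1; with p = 2 the result has unit Euclidean length.  This is not Mahout&#8217;s implementation: the VectorNorms class and its normalize method are hypothetical, and a dense double array stands in for Mahout&#8217;s sparse vectors.</p>

```java
/** Hypothetical helper, not Mahout's implementation: p-norm normalization of a dense vector. */
class VectorNorms {

  /** Returns a copy of v scaled to unit p-norm; an all-zero vector is returned unchanged. */
  static double[] normalize(double[] v, double p) {
    double sum = 0.0;
    for (double x : v) {
      sum += Math.pow(Math.abs(x), p);
    }
    double norm = Math.pow(sum, 1.0 / p); // for p = 2 this is the Euclidean length
    double[] out = v.clone();
    if (norm == 0.0) {
      return out; // avoid dividing by zero
    }
    for (int i = 0; i < out.length; i++) {
      out[i] /= norm;
    }
    return out;
  }

  public static void main(String[] args) {
    double[] unit = normalize(new double[] {3.0, 4.0}, 2.0);
    System.out.println(unit[0] + ", " + unit[1]);
  }
}
```

<p>After 2-norm normalization, a long document and a short document with the same mix of terms end up with similar vectors, which is usually what you want before comparing documents by Euclidean distance.</p>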
<p>Obviously, the script above can be run at any time, but I think it is even more interesting to hook it into Solr&#8217;s event system, with caveats.  For those who aren&#8217;t familiar, Solr provides an event callback system for events like <a href="http://wiki.apache.org/solr/SolrConfigXml#A.22Update.22_Related_Event_Listeners">commit and optimize</a> (see also the LucidWorks <a href="http://www.lucidimagination.com/search/document/CDRG_ch08_8.3.6.2?q=Event%20Listeners">Reference Guide</a>).  Hooking into the event system is as simple as setting up the appropriate event listener.  For this example, I&#8217;m going to hook into the commit listener by having it call out to the Mahout script above:</p>
<blockquote><p>&lt;listener event=&#8221;postCommit&#8221;&gt;<br />
&lt;str name=&#8221;exe&#8221;&gt;/Volumes/User/grantingersoll/projects/lucene/mahout/clean/bin/mahout&lt;/str&gt;<br />
&lt;str name=&#8221;dir&#8221;&gt;.&lt;/str&gt;<br />
&lt;bool name=&#8221;wait&#8221;&gt;false&lt;/bool&gt;<br />
&lt;arr name=&#8221;args&#8221;&gt;<br />
&lt;str&gt;lucene.vector&lt;/str&gt;<br />
&lt;str&gt;&#8211;dir&lt;/str&gt;<br />
&lt;str&gt;./solr/data/index/&lt;/str&gt;<br />
&lt;str&gt;&#8211;output&lt;/str&gt;<br />
&lt;str&gt;/tmp/mahout/vectors/part-out.vec&lt;/str&gt;<br />
&lt;str&gt;&#8211;field&lt;/str&gt;<br />
&lt;str&gt;text&lt;/str&gt;<br />
&lt;str&gt;&#8211;idField&lt;/str&gt;<br />
&lt;str&gt;id&lt;/str&gt;<br />
&lt;str&gt;&#8211;dictOut&lt;/str&gt;<br />
&lt;str&gt;/tmp/mahout/vectors/dict.dat&lt;/str&gt;<br />
&lt;str&gt;&#8211;norm&lt;/str&gt;<br />
&lt;str&gt;2&lt;/str&gt;<br />
&lt;str&gt;&#8211;maxDFPercent&lt;/str&gt;<br />
&lt;str&gt;90&lt;/str&gt;<br />
&lt;/arr&gt;<br />
&lt;/listener&gt;</p></blockquote>
<p>From here, one can easily extrapolate how a script could be written to then call Mahout&#8217;s other methods, namely things like <a href="http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms">clustering and Latent Dirichlet Allocation (LDA) for topic modeling</a>.  Alternatively, one could set up a process to watch for changes to the vector and then spawn a process to go and run the appropriate Mahout tasks.</p>
<p>So, what are the caveats with the above approach?</p>
<ol>
<li>If you are running in a commit heavy environment, you may not want to run Mahout on every commit.  Mahout is designed for batch processing (well, most of it is, anyway) and most of these jobs are designed to run on Hadoop clusters.  In order to do that, you would have to modify the above paths, etc. to have it output to Hadoop&#8217;s HDFS, which I&#8217;ll leave as an exercise to the reader (the Mathematician in me always enjoys saying that!)</li>
<li>If you are running Solr in a distributed environment, you&#8217;re going to have to set things up appropriately on each node.  Hopefully, as the <a href="http://wiki.apache.org/solr/SolrCloud">Solr Cloud</a> stuff matures, this will become even simpler and we should be able to do some really smart things to make Mahout and Solr work together in a distributed environment.  For now, you&#8217;re on your own.</li>
</ol>
<p>In the next posting, I&#8217;ll look at how we can more closely hook Mahout into the indexing and search process.  As a teaser, think about how you could use Mahout to classify and cluster large volumes of text and then have that information available for things like faceting, discovery and filtering on the search side.</p>
<p>As always, let me know if you have any questions or comments.</p>
<h1>References</h1>
<ol>
<li><a href="http://manning.com/owen/"><strong>Mahout In Action</strong></a> by Owen and Anil.  Manning Publications.</li>
<li>Various Solr and Lucene <a href="http://www.lucidimagination.com/developer/documentation">books</a>, all linked via Lucid Imagination.</li>
<li><a href="http://lucene.apache.org/mahout">http://lucene.apache.org/mahout</a></li>
<li><a href="http://cwiki.apache.org/MAHOUT">http://cwiki.apache.org/MAHOUT</a></li>
<li><a href="http://lucene.grantingersoll.com/">Grant&#8217;s Blog</a> has a number of articles on Mahout</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>Accessing words around a positional match in Lucene</title>
		<link>http://www.lucidimagination.com/blog/2009/05/26/accessing-words-around-a-positional-match-in-lucene/</link>
		<comments>http://www.lucidimagination.com/blog/2009/05/26/accessing-words-around-a-positional-match-in-lucene/#comments</comments>
		<pubDate>Tue, 26 May 2009 18:50:15 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[sentiment analysis]]></category>
		<category><![CDATA[Span Queries]]></category>
		<category><![CDATA[term vectors]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=672</guid>
		<description><![CDATA[<p>From time to time, users on the Lucene mailing list ask a variant of the following question:</p>
<blockquote><p>Given a term match in a document, what&#8217;s the best way to get a window of words around that match?</p></blockquote>
<p>Getting a window of words around a match can be useful for a lot of things, including, to name a few:</p>
<ol>
<li>Highlighting (although I&#8217;d recommend using Lucene&#8217;s Highlighter package for that)</li>
<li>Co-occurrence analysis</li>
<li>Sentiment analysis</li>
<li>Question Answering</li>
</ol>
<p>Unfortunately, &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>From time to time, users on the Lucene mailing list ask a variant of the following question:</p>
<blockquote><p>Given a term match in a document, what&#8217;s the best way to get a window of words around that match?</p></blockquote>
<p>Getting a window of words around a match can be useful for a lot of things, including, to name a few:</p>
<ol>
<li>Highlighting (although I&#8217;d recommend using Lucene&#8217;s Highlighter package for that)</li>
<li>Co-occurrence analysis</li>
<li>Sentiment analysis</li>
<li>Question Answering</li>
</ol>
<p>Unfortunately, given how inverted indexes are structured, retrieving content around a match isn&#8217;t efficient without doing some extra work during indexing.  In Lucene, this &#8220;extra work&#8221; involves creating and storing <a href="http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/index/TermFreqVector.html">Term Vectors</a> with position and offset information.</p>
<p>Storing Term Vector info can be done by adding in the appropriate code during Field construction, as in the following indexing example where I create an index from a few dummy documents (complete code is at the bottom of this post):</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">    RAMDirectory ramDir <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> RAMDirectory<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">//Index some made up content</span>
    IndexWriter writer <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> IndexWriter<span style="color: #009900;">&#40;</span>ramDir, <span style="color: #000000; font-weight: bold;">new</span> StandardAnalyzer<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, <span style="color: #000066; font-weight: bold;">true</span>, IndexWriter.<span style="color: #006633;">MaxFieldLength</span>.<span style="color: #006633;">UNLIMITED</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> DOCS.<span style="color: #006633;">length</span><span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span>
      <span style="color: #003399;">Document</span> doc <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Document</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #003399;">Field</span> id <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Field</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;id&quot;</span>, <span style="color: #0000ff;">&quot;doc_&quot;</span> <span style="color: #339933;">+</span> i, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Store</span>.<span style="color: #006633;">YES</span>, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Index</span>.<span style="color: #006633;">NOT_ANALYZED_NO_NORMS</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      doc.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>id<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #666666; font-style: italic;">//Store both position and offset information</span>
      <span style="color: #003399;">Field</span> text <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Field</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;content&quot;</span>, DOCS<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span>, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Store</span>.<span style="color: #006633;">NO</span>, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Index</span>.<span style="color: #006633;">ANALYZED</span>, <span style="color: #003399;">Field</span>.<span style="color: #006633;">TermVector</span>.<span style="color: #006633;">WITH_POSITIONS_OFFSETS</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      doc.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>text<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      writer.<span style="color: #006633;">addDocument</span><span style="color: #009900;">&#40;</span>doc<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
    writer.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>Notice the use of Field.TermVector.WITH_POSITIONS_OFFSETS when constructing the text Field.  This tells Lucene to store term vector information on a per document basis (in other words, not inverted) with both Position and Offset information.  (Do note, other storage options are available, see <a href="http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/document/Field.TermVector.html">Field.TermVector</a>.  Also note, storing Term Vectors will cost you in disk space.)</p>
<p>For completeness, the DOCS array looks like:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #003399;">String</span> <span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> DOCS <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #0000ff;">&quot;The quick red fox jumped over the lazy brown dogs.&quot;</span>,
        <span style="color: #0000ff;">&quot;Mary had a little lamb whose fleece was white as snow.&quot;</span>,
        <span style="color: #0000ff;">&quot;Moby Dick is a story of a whale and a man obsessed.&quot;</span>,
        <span style="color: #0000ff;">&quot;The robber wore a black fleece jacket and a baseball cap.&quot;</span>,
        <span style="color: #0000ff;">&quot;The English Springer Spaniel is the best of all dogs.&quot;</span>
    <span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span></pre></div></div>

<p>Now that we have an index created, we need to do a search.  In our case, we need to do a position-based search as opposed to the more traditional document-based search.  In other words, it is not good enough to simply know whether a term is in a document or not (think TermQuery); we need to know where in the document the match occurred.  Lucene enables position-based search through a series of Query classes collectively known as Span Queries.  (See <a href="http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/spans/SpanQuery.html">SpanQuery</a> and its derivatives in the <a href="http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/spans/package-summary.html">org.apache.lucene.search.spans</a> package.)</p>
<p>Again, an example is warranted.  Assume we wanted to find where the term &#8220;fleece&#8221; occurs.  In this case, let&#8217;s start by doing a &#8220;normal&#8221; search, wherein we submit a query to the index and print out the document id and score:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">    IndexSearcher searcher <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> IndexSearcher<span style="color: #009900;">&#40;</span>ramDir<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">// Do a search using SpanQuery</span>
    SpanTermQuery fleeceQ <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> SpanTermQuery<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Term<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;content&quot;</span>, <span style="color: #0000ff;">&quot;fleece&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    TopDocs results <span style="color: #339933;">=</span> searcher.<span style="color: #006633;">search</span><span style="color: #009900;">&#40;</span>fleeceQ, <span style="color: #cc66cc;">10</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> results.<span style="color: #006633;">scoreDocs</span>.<span style="color: #006633;">length</span><span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      ScoreDoc scoreDoc <span style="color: #339933;">=</span> results.<span style="color: #006633;">scoreDocs</span><span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
      <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Score Doc: &quot;</span> <span style="color: #339933;">+</span> scoreDoc<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span></pre></div></div>

<p>That code looks pretty much like any basic search code with the exception that I substituted in a <a href="http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/spans/SpanTermQuery.html">SpanTermQuery</a> for what is often a TermQuery.  In fact, so far this isn&#8217;t all that interesting and it is likely to be slower than the comparable TermQuery too.</p>
<p>What does make it interesting?  If you look at the SpanQuery API, you will notice a method called getSpans().  The getSpans() method provides positional information about where a match occurred.  Thus, to print out the positional information, one might do:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">    Spans spans <span style="color: #339933;">=</span> fleeceQ.<span style="color: #006633;">getSpans</span><span style="color: #009900;">&#40;</span>searcher.<span style="color: #006633;">getIndexReader</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span>spans.<span style="color: #006633;">next</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">==</span> <span style="color: #000066; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span>
      <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Doc: &quot;</span> <span style="color: #339933;">+</span> spans.<span style="color: #006633;">doc</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot; Start: &quot;</span> <span style="color: #339933;">+</span> spans.<span style="color: #006633;">start</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot; End: &quot;</span> <span style="color: #339933;">+</span> spans.<span style="color: #006633;">end</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span></pre></div></div>

<p>First off, notice getting the Spans is completely independent of running the actual query.  In fact, you need not run the query first.  Second, the start and end values are the positions of the tokens, not the offsets.</p>
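<p>If the distinction between positions and offsets is unfamiliar, the following sketch may help.  It is plain Java rather than Lucene: PositionsAndOffsets and its Token class are hypothetical, and the regular-expression tokenizer is an assumption that, unlike a real analyzer such as StandardAnalyzer, does not remove stopwords, so the position numbers it produces will not necessarily match what Lucene reports for the same text.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of token positions versus character offsets. */
class PositionsAndOffsets {

  static final class Token {
    final String term;
    final int position;    // ordinal of the token within the text: 0, 1, 2, ...
    final int startOffset; // index of the token's first character
    final int endOffset;   // index one past the token's last character

    Token(String term, int position, int startOffset, int endOffset) {
      this.term = term;
      this.position = position;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }

    @Override
    public String toString() {
      return term + " position=" + position + " offsets=" + startOffset + "-" + endOffset;
    }
  }

  /** Naive tokenizer (an assumption): every run of word characters is one token. */
  static List<Token> tokenize(String text) {
    List<Token> tokens = new ArrayList<>();
    Matcher m = Pattern.compile("\\w+").matcher(text);
    int position = 0;
    while (m.find()) {
      tokens.add(new Token(m.group().toLowerCase(), position++, m.start(), m.end()));
    }
    return tokens;
  }

  public static void main(String[] args) {
    for (Token t : tokenize("Mary had a little lamb whose fleece was white as snow.")) {
      System.out.println(t);
    }
  }
}
```

<p>A window of &#8220;two words either side&#8221; is naturally expressed in positions, whereas offsets are what you would use to cut a substring out of the original text.</p>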
<p>Now, given the position information, the question becomes how to get only those tokens around the match.  To answer that, we need a few things:</p>
<ol>
<li>The specification of a window in terms of positions.  For instance, I want the terms within two positions of the start and end of the span.</li>
<li>A <a href="http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/index/TermVectorMapper.html">TermVectorMapper</a> implementation that is aware of both the window and the position.  Think of a TermVectorMapper as the equivalent of a SAX parser for Lucene&#8217;s Term Vectors.  Basically, instead of assuming the data structure (like DOM) it provides callbacks and lets you, the programmer, decide on the data structures.  See the <a href="http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/index/PositionBasedTermVectorMapper.html">PositionBasedTermVectorMapper</a> for a useful implementation.</li>
</ol>
<p>As a quick hack (and it is by no means production quality), I created the following code that modifies the printing code above:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">    WindowTermVectorMapper tvm <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> WindowTermVectorMapper<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000066; font-weight: bold;">int</span> window <span style="color: #339933;">=</span> <span style="color: #cc66cc;">2</span><span style="color: #339933;">;</span><span style="color: #666666; font-style: italic;">//get the words within two of the match, inclusive of the boundaries</span>
    <span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span>spans.<span style="color: #006633;">next</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">==</span> <span style="color: #000066; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Doc: &quot;</span> <span style="color: #339933;">+</span> spans.<span style="color: #006633;">doc</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot; Start: &quot;</span> <span style="color: #339933;">+</span> spans.<span style="color: #006633;">start</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot; End: &quot;</span> <span style="color: #339933;">+</span> spans.<span style="color: #006633;">end</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #666666; font-style: italic;">//build up the window</span>
      tvm.<span style="color: #006633;">start</span> <span style="color: #339933;">=</span> spans.<span style="color: #006633;">start</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> window<span style="color: #339933;">;</span>
      tvm.<span style="color: #006633;">end</span> <span style="color: #339933;">=</span> spans.<span style="color: #006633;">end</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> window<span style="color: #339933;">;</span>
      reader.<span style="color: #006633;">getTermFreqVector</span><span style="color: #009900;">&#40;</span>spans.<span style="color: #006633;">doc</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;content&quot;</span>, tvm<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span>WindowEntry entry <span style="color: #339933;">:</span> tvm.<span style="color: #006633;">entries</span>.<span style="color: #006633;">values</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Entry: &quot;</span> <span style="color: #339933;">+</span> entry<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #009900;">&#125;</span>
      <span style="color: #666666; font-style: italic;">//clear out the entries for the next round</span>
      tvm.<span style="color: #006633;">entries</span>.<span style="color: #006633;">clear</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span></pre></div></div>

<p>Now, in this chunk of code, I first create a WindowTermVectorMapper (WTVM, beautiful name, right?) and then in the Spans loop, I tell the WTVM what my window looks like.  Next up, I ask Lucene&#8217;s IndexReader for the TermVector and pass in my TermVectorMapper.  Finally, I print out the entries.</p>
<p>Of course, the last useful piece is what the WTVM itself looks like.  Here&#8217;s the most relevant snippet of code:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> map<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> term, <span style="color: #000066; font-weight: bold;">int</span> frequency, TermVectorOffsetInfo<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> offsets, <span style="color: #000066; font-weight: bold;">int</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> positions<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> positions.<span style="color: #006633;">length</span><span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><span style="color: #666666; font-style: italic;">//unfortunately, we still have to loop over the positions</span>
      <span style="color: #666666; font-style: italic;">//start is inclusive; end is exclusive, since Spans.end() is one past the match</span>
      <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>positions<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">&gt;=</span> start <span style="color: #339933;">&amp;&amp;</span> positions<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">&lt;</span> end<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span>
        WindowEntry entry <span style="color: #339933;">=</span> entries.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>term<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>entry <span style="color: #339933;">==</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
          entry <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> WindowEntry<span style="color: #009900;">&#40;</span>term<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
          entries.<span style="color: #006633;">put</span><span style="color: #009900;">&#40;</span>term, entry<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
        entry.<span style="color: #006633;">positions</span>.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>positions<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span></pre></div></div>

<p>As you can see, I just loop over the positions of the current term and keep any that fall inside the window, that is, at or after start and before end.  Each term with at least one such position gets a WindowEntry that records those positions.  Obviously, you can do more interesting things here, but I&#8217;ll leave that up to you.  Also note that there are a few TermVectorMapper implementations in the Lucene distribution that you can use as examples.</p>
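<p>To make the window check concrete without pulling in Lucene, here is a minimal, self-contained sketch.  The class name WindowSketch, the method inWindow and the term positions are all made up for illustration (they are not part of Lucene or of the code in this post).  The only thing carried over is the start-inclusive, end-exclusive comparison from the map() method above.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the mapper's window check; no Lucene classes are used.
class WindowSketch {

  // Returns term -> positions for every position p with start <= p < end.
  static Map<String, List<Integer>> inWindow(Map<String, int[]> termPositions, int start, int end) {
    Map<String, List<Integer>> entries = new LinkedHashMap<String, List<Integer>>();
    for (Map.Entry<String, int[]> e : termPositions.entrySet()) {
      for (int position : e.getValue()) {
        if (position >= start && position < end) {
          List<Integer> hits = entries.get(e.getKey());
          if (hits == null) {
            hits = new ArrayList<Integer>();
            entries.put(e.getKey(), hits);
          }
          hits.add(position);
        }
      }
    }
    return entries;
  }

  public static void main(String[] args) {
    // Made-up positions, assuming one position per word and no stop word removal.
    String[] words = "mary had a little lamb whose fleece was white as snow".split(" ");
    Map<String, int[]> termPositions = new LinkedHashMap<String, int[]>();
    for (int i = 0; i < words.length; i++) {
      termPositions.put(words[i], new int[]{i});
    }
    int window = 2;
    int spanStart = 6; // position of "fleece"
    int spanEnd = 7;   // one past the match, mirroring an exclusive span end
    System.out.println(inWindow(termPositions, spanStart - window, spanEnd + window));
    // prints {lamb=[4], whose=[5], fleece=[6], was=[7], white=[8]}
  }
}
```

<p>With fleece matched at position 6 and a window of 2, the sketch is called with start 4 and end 9, so it keeps the two words on either side of the match plus the match itself.</p>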
<p>That about wraps it up.  From here, one can easily imagine different ways to use what the TermVectorMapper collects to process the terms in a window.</p>
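<p>As one example of what you might do with the collected entries, the sketch below turns a term-to-positions map back into the terms in position order, which is handy if you want to display the context around a match.  It assumes the entries do not arrive in position order.  ContextSketch and orderedTerms are hypothetical names of my own, and the input map is a plain Java stand-in for the WindowEntry objects, so treat this as a rough idea rather than anything Lucene provides.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: not part of Lucene or of the code in this post.
class ContextSketch {

  // Flattens term -> positions into a list of terms sorted by position.
  // If two terms share a position, the one seen last wins in this sketch.
  static List<String> orderedTerms(Map<String, List<Integer>> entries) {
    TreeMap<Integer, String> byPosition = new TreeMap<Integer, String>();
    for (Map.Entry<String, List<Integer>> e : entries.entrySet()) {
      for (Integer position : e.getValue()) {
        byPosition.put(position, e.getKey());
      }
    }
    return new ArrayList<String>(byPosition.values());
  }
}
```

<p>A term that occurs more than once in the window simply shows up once per position in the result.</p>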
<p>The full code is below.  It is intended for demonstration purposes only.  Please note the disclaimers, etc.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">package</span> <span style="color: #006699;">com.lucidimagination.noodles</span><span style="color: #339933;">;</span>
<span style="color: #008000; font-style: italic; font-weight: bold;">/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the &quot;License&quot;); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an &quot;AS IS&quot; BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.store.RAMDirectory</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.index.IndexWriter</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.index.Term</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.index.IndexReader</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.index.TermVectorMapper</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.index.TermVectorOffsetInfo</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.standard.StandardAnalyzer</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.document.Document</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.document.Field</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.IndexSearcher</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.TopDocs</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.ScoreDoc</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.spans.SpanTermQuery</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.search.spans.Spans</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.IOException</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.LinkedHashMap</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.List</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.ArrayList</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #008000; font-style: italic; font-weight: bold;">/**
 *  This class is for demonstration purposes only.  No warranty, guarantee, etc. is implied.
 *
 * This is not production quality code!
 *
 *
 **/</span>
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> TermVectorFun <span style="color: #009900;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> DOCS <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
          <span style="color: #0000ff;">&quot;The quick red fox jumped over the lazy brown dogs.&quot;</span>,
          <span style="color: #0000ff;">&quot;Mary had a little lamb whose fleece was white as snow.&quot;</span>,
          <span style="color: #0000ff;">&quot;Moby Dick is a story of a whale and a man obsessed.&quot;</span>,
          <span style="color: #0000ff;">&quot;The robber wore a black fleece jacket and a baseball cap.&quot;</span>,
          <span style="color: #0000ff;">&quot;The English Springer Spaniel is the best of all dogs.&quot;</span>
  <span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> main<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> args<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span> <span style="color: #009900;">&#123;</span>
    RAMDirectory ramDir <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> RAMDirectory<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">//Index some made up content</span>
    IndexWriter writer <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> IndexWriter<span style="color: #009900;">&#40;</span>ramDir, <span style="color: #000000; font-weight: bold;">new</span> StandardAnalyzer<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, <span style="color: #000066; font-weight: bold;">true</span>, IndexWriter.<span style="color: #006633;">MaxFieldLength</span>.<span style="color: #006633;">UNLIMITED</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> DOCS.<span style="color: #006633;">length</span><span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #003399;">Document</span> doc <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Document</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #003399;">Field</span> id <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Field</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;id&quot;</span>, <span style="color: #0000ff;">&quot;doc_&quot;</span> <span style="color: #339933;">+</span> i, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Store</span>.<span style="color: #006633;">YES</span>, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Index</span>.<span style="color: #006633;">NOT_ANALYZED_NO_NORMS</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      doc.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>id<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #666666; font-style: italic;">//Store both position and offset information</span>
      <span style="color: #003399;">Field</span> text <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Field</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;content&quot;</span>, DOCS<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span>, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Store</span>.<span style="color: #006633;">NO</span>, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Index</span>.<span style="color: #006633;">ANALYZED</span>, <span style="color: #003399;">Field</span>.<span style="color: #006633;">TermVector</span>.<span style="color: #006633;">WITH_POSITIONS_OFFSETS</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      doc.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>text<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      writer.<span style="color: #006633;">addDocument</span><span style="color: #009900;">&#40;</span>doc<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
    writer.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">//Get a searcher</span>
    IndexSearcher searcher <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> IndexSearcher<span style="color: #009900;">&#40;</span>ramDir<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">// Do a search using SpanQuery</span>
    SpanTermQuery fleeceQ <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> SpanTermQuery<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Term<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;content&quot;</span>, <span style="color: #0000ff;">&quot;fleece&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    TopDocs results <span style="color: #339933;">=</span> searcher.<span style="color: #006633;">search</span><span style="color: #009900;">&#40;</span>fleeceQ, <span style="color: #cc66cc;">10</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> results.<span style="color: #006633;">scoreDocs</span>.<span style="color: #006633;">length</span><span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      ScoreDoc scoreDoc <span style="color: #339933;">=</span> results.<span style="color: #006633;">scoreDocs</span><span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
      <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Score Doc: &quot;</span> <span style="color: #339933;">+</span> scoreDoc<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
    IndexReader reader <span style="color: #339933;">=</span> searcher.<span style="color: #006633;">getIndexReader</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    Spans spans <span style="color: #339933;">=</span> fleeceQ.<span style="color: #006633;">getSpans</span><span style="color: #009900;">&#40;</span>reader<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    WindowTermVectorMapper tvm <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> WindowTermVectorMapper<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000066; font-weight: bold;">int</span> window <span style="color: #339933;">=</span> <span style="color: #cc66cc;">2</span><span style="color: #339933;">;</span><span style="color: #666666; font-style: italic;">//get the words within two of the match</span>
    <span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span>spans.<span style="color: #006633;">next</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">==</span> <span style="color: #000066; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Doc: &quot;</span> <span style="color: #339933;">+</span> spans.<span style="color: #006633;">doc</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot; Start: &quot;</span> <span style="color: #339933;">+</span> spans.<span style="color: #006633;">start</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot; End: &quot;</span> <span style="color: #339933;">+</span> spans.<span style="color: #006633;">end</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #666666; font-style: italic;">//build up the window</span>
      tvm.<span style="color: #006633;">start</span> <span style="color: #339933;">=</span> spans.<span style="color: #006633;">start</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> window<span style="color: #339933;">;</span>
      tvm.<span style="color: #006633;">end</span> <span style="color: #339933;">=</span> spans.<span style="color: #006633;">end</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> window<span style="color: #339933;">;</span>
      reader.<span style="color: #006633;">getTermFreqVector</span><span style="color: #009900;">&#40;</span>spans.<span style="color: #006633;">doc</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;content&quot;</span>, tvm<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span>WindowEntry entry <span style="color: #339933;">:</span> tvm.<span style="color: #006633;">entries</span>.<span style="color: #006633;">values</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Entry: &quot;</span> <span style="color: #339933;">+</span> entry<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #009900;">&#125;</span>
      <span style="color: #666666; font-style: italic;">//clear out the entries for the next round</span>
      tvm.<span style="color: #006633;">entries</span>.<span style="color: #006633;">clear</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">//Not thread-safe</span>
<span style="color: #000000; font-weight: bold;">class</span> WindowTermVectorMapper <span style="color: #000000; font-weight: bold;">extends</span> TermVectorMapper <span style="color: #009900;">&#123;</span>
&nbsp;
  <span style="color: #000066; font-weight: bold;">int</span> start<span style="color: #339933;">;</span>
  <span style="color: #000066; font-weight: bold;">int</span> end<span style="color: #339933;">;</span>
  LinkedHashMap&lt;String, WindowEntry&gt; entries <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> LinkedHashMap&lt;String, WindowEntry&gt;<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> map<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> term, <span style="color: #000066; font-weight: bold;">int</span> frequency, TermVectorOffsetInfo<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> offsets, <span style="color: #000066; font-weight: bold;">int</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> positions<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> positions.<span style="color: #006633;">length</span><span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><span style="color: #666666; font-style: italic;">//unfortunately, we still have to loop over the positions</span>
      <span style="color: #666666; font-style: italic;">//start is inclusive; end is exclusive, since Spans.end() is one past the match</span>
      <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>positions<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">&gt;=</span> start <span style="color: #339933;">&amp;&amp;</span> positions<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">&lt;</span> end<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span>
        WindowEntry entry <span style="color: #339933;">=</span> entries.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>term<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>entry <span style="color: #339933;">==</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
          entry <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> WindowEntry<span style="color: #009900;">&#40;</span>term<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
          entries.<span style="color: #006633;">put</span><span style="color: #009900;">&#40;</span>term, entry<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
        entry.<span style="color: #006633;">positions</span>.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>positions<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> setExpectations<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> field, <span style="color: #000066; font-weight: bold;">int</span> numTerms, <span style="color: #000066; font-weight: bold;">boolean</span> storeOffsets, <span style="color: #000066; font-weight: bold;">boolean</span> storePositions<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #666666; font-style: italic;">// do nothing for this example</span>
    <span style="color: #666666; font-style: italic;">//See also the PositionBasedTermVectorMapper.</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">class</span> WindowEntry<span style="color: #009900;">&#123;</span>
  <span style="color: #003399;">String</span> term<span style="color: #339933;">;</span>
  <span style="color: #003399;">List</span>&lt;Integer&gt; positions <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">ArrayList</span>&lt;Integer&gt;<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><span style="color: #666666; font-style: italic;">//a term could appear more than once w/in a window</span>
&nbsp;
  WindowEntry<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> term<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">term</span> <span style="color: #339933;">=</span> term<span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  @Override
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #003399;">String</span> toString<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #0000ff;">&quot;WindowEntry{&quot;</span> <span style="color: #339933;">+</span>
            <span style="color: #0000ff;">&quot;term='&quot;</span> <span style="color: #339933;">+</span> term <span style="color: #339933;">+</span> <span style="color: #0000ff;">'<span style="color: #000099; font-weight: bold;">\'</span>'</span> <span style="color: #339933;">+</span>
            <span style="color: #0000ff;">&quot;, positions=&quot;</span> <span style="color: #339933;">+</span> positions <span style="color: #339933;">+</span>
            <span style="color: #0000ff;">'}'</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/05/26/accessing-words-around-a-positional-match-in-lucene/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
	</channel>
</rss>

