• Products
    • Overview
    • LucidWorks Search Platform
      • Features and Benefits
      • Technical Overview
      • Only with LucidWorks
      • LucidWorks and Solr
      • White Papers
      • LucidWorks Enterprise
      • LucidWorks Cloud
    • Certified Distributions
      • Certified Solr
      • Certified Lucene
    • Apache Releases
      • Apache Solr
      • Apache Lucene
  • Support & Services
    • Overview
    • Support
    • Training
    • Solr/Lucene Certification
    • ExpertLink Advisory
    • Consulting
    • Partners
    • Subscriptions
  • Why Lucid?
    • Why Lucid?
    • Technology
    • Technical Leadership
    • Who uses Lucene/Solr?
      • What customers are saying
    • Case Studies
    • Whitepapers
    • Demos
    • Webinars
  • Blog
  • DevZone
    • DevZone Overview
    • Forums (LWE)
    • Videos & Podcasts
      • How To's
      • Screencasts
      • Podcasts
      • Conference Videos
    • Technical Articles
      • Whitepapers
    • Reference Materials
      • Documentation
      • Solr Reference Guide
      • Solr & LucidWorks Matrix
      • Tutorials
    • Events
      • Conferences
      • Meet Ups
    • Code & Test
  • Downloads
  • About Us
    • Management
    • Careers
    • News
      • Media Coverage
      • Press Releases
    • Contact Us
Sign Up or Log In
Home . Blog

May 26, 2009

Accessing words around a positional match in Lucene

Posted by Grant Ingersoll

From time to time, users on the Lucene mailing list ask a variant of the following question:

Given a term match in a document, what’s the best way to get a window of words around that match?

Getting a window of words around a match can be useful for a lot of things, including, to name a few:

  1. Highlighting (although I’d recommend using Lucene’s Highlighter package for that)
  2. Co-occurrence analysis
  3. Sentiment analysis
  4. Question Answering

Unfortunately, given how inverted indexes are structured, retrieving content around a match isn’t efficient without doing some extra work during indexing.  In Lucene, this “extra work” involves creating and storing Term Vectors with position and offset information.

Storing Term Vector info can be done by adding in the appropriate code during Field construction, as in the following indexing example where I create an index from a few dummy documents (complete code is at the bottom of this post):

    RAMDirectory ramDir = new RAMDirectory();
    //Index some made up content
    IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < DOCS.length; i++){
      Document doc = new Document();
      Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
      doc.add(id);
      //Store both position and offset information
      Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();

Notice the use of the Field.TermVector.WITH_POSITIONS_OFFSETS when constructing the text Field.  This tells Lucene to store term vector information on a per document basis (in other words, not inverted) with both Position and Offset information.  (Due note, other storage options are available, see Field.TermVector.  Also note, storing Term Vectors will cost you in disk space.)

For completeness, the DOCS array looks like:

public static String [] DOCS = {
        "The quick red fox jumped over the lazy brown dogs.",
        "Mary had a little lamb whose fleece was white as snow.",
        "Moby Dick is a story of a whale and a man obsessed.",
        "The robber wore a black fleece jacket and a baseball cap.",
        "The English Springer Spaniel is the best of all dogs."
    };

Now that we have an index created, we need to do a search.  In our case, we need to do a position-based search as opposed to the more traditional document-based search.  In other words, it is not good enough to simply know whether a term is in a document or not (think TermQuery), we need to know where in the document the match occurred.  Lucene enables position-based search through a series of Query classes collectively known as Span Queries.  (See SpanQuery and its derivitaves in the org.apache.lucene.search.spans package.)

Again, an example is warranted.  Assume we wanted to find where the term “fleece” occurs.  In this case, let’s start by doing a “normal” search, wherein we submit a query to the index and print out the Dcoument id and Score:

    IndexSearcher searcher = new IndexSearcher(ramDir);
    // Do a search using SpanQuery
    SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
    TopDocs results = searcher.search(fleeceQ, 10);
    for (int i = 0; i < results.scoreDocs.length; i++) {
      ScoreDoc scoreDoc = results.scoreDocs[i];
      System.out.println("Score Doc: " + scoreDoc);
    }

That code looks pretty much like any basic search code with the exception that I substituted in a SpanTermQuery for what is often a TermQuery.  In fact, so far this isn’t all that interesting and it is likely to be slower than the comparable TermQuery too.

What does make it interesting?  If you look at the SpanQuery API, you will notice a method called getSpans().  The getSpans() method provides positional information about where a match occurred.  Thus, to print out the positional information, one might do:

    Spans spans = fleeceQ.getSpans(searcher.getIndexReader());
    while (spans.next() == true){
      System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end());
    }

First off, notice getting the Spans is completely independent of running the actual query. In fact, you need not run the query first. Second, the start and end values are the positions of the tokens, not the offsets.

Now, given the position information, the question becomes how to get only those tokens around the match. To answer that, we need a few things:

  1. The specification of a window in terms of positions.  For instance, I want the terms within two positions of the start and end of the span.
  2. A TermVectorMapper implementation that is aware of both the window and the position.  Think of a TermVectorMapper as the equivalent of a SAX parser for Lucene’s Term Vectors.  Basically, instead of assuming the data structure (like DOM) it provides call backs and let’s you, the programmer, decide on the data structures.  See the PositionBasedTermVectorMapper for a useful implementation.

As a quick hack (and it is by no means production quality), I created the following code that modifies the printing code above:

    WindowTermVectorMapper tvm = new WindowTermVectorMapper();
    int window = 2;//get the words within two of the match, inclusive of the boundaries
    while (spans.next() == true) {
      System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end());
      //build up the window
      tvm.start = spans.start() - window;
      tvm.end = spans.end() + window;
      reader.getTermFreqVector(spans.doc(), "content", tvm);
      for (WindowEntry entry : tvm.entries.values()) {
        System.out.println("Entry: " + entry);
      }
      //clear out the entries for the next round
      tvm.entries.clear();
    }

Now, in this chunk of code, I first create a WindowTermVectorMapper (WTVM, beautiful name, right?) and then in the Spans loop, I tell the WTVM what my window looks like. Next up, I ask Lucene’s IndexReader for the TermVector and pass in my TermVectorMapper.  Finally, I print out the entries.

Of course, the last bit of useful info is what does the WTVM look like.  Here’s the most useful snippet of code:

public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) {
    for (int i = 0; i < positions.length; i++) {//unfortunately, we still have to loop over the positions
      //we'll make this inclusive of the boundaries
      if (positions[i] >= start && positions[i] < end){
        WindowEntry entry = entries.get(term);
        if (entry == null) {
          entry = new WindowEntry(term);
          entries.put(term, entry);
        }
        entry.positions.add(positions[i]);
      }
    }
  }

As you can see, I just look at the positions and check to see if the current term has an entry that is inside the start and end. Obviously, you can do more interesting things here, but I’ll leave that up to you. Also know that there are a few TermVectorMapper implementations in the Lucene distribution that you can use as examples.

That about wraps it up.  From here, one can easily imagine different ways to utilize the information returned from the Term Vector Mapper to process information about the terms in a window.

The full code is below. It is intended for demonstration purposes only. Please note the disclaimers, etc.

package com.lucidimagination.noodles;
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
 
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
 
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.ArrayList;
 
/**
 *  This class is for demonstration purposes only.  No warranty, guarantee, etc. is implied.
 *
 * This is not production quality code!
 *
 *
 **/
public class TermVectorFun {
  public static String[] DOCS = {
          "The quick red fox jumped over the lazy brown dogs.",
          "Mary had a little lamb whose fleece was white as snow.",
          "Moby Dick is a story of a whale and a man obsessed.",
          "The robber wore a black fleece jacket and a baseball cap.",
          "The English Springer Spaniel is the best of all dogs."
  };
 
  public static void main(String[] args) throws IOException {
    RAMDirectory ramDir = new RAMDirectory();
    //Index some made up content
    IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < DOCS.length; i++) {
      Document doc = new Document();
      Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
      doc.add(id);
      //Store both position and offset information
      Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();
    //Get a searcher
    IndexSearcher searcher = new IndexSearcher(ramDir);
    // Do a search using SpanQuery
    SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
    TopDocs results = searcher.search(fleeceQ, 10);
    for (int i = 0; i < results.scoreDocs.length; i++) {
      ScoreDoc scoreDoc = results.scoreDocs[i];
      System.out.println("Score Doc: " + scoreDoc);
    }
    IndexReader reader = searcher.getIndexReader();
    Spans spans = fleeceQ.getSpans(reader);
    WindowTermVectorMapper tvm = new WindowTermVectorMapper();
    int window = 2;//get the words within two of the match
    while (spans.next() == true) {
      System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end());
      //build up the window
      tvm.start = spans.start() - window;
      tvm.end = spans.end() + window;
      reader.getTermFreqVector(spans.doc(), "content", tvm);
      for (WindowEntry entry : tvm.entries.values()) {
        System.out.println("Entry: " + entry);
      }
      //clear out the entries for the next round
      tvm.entries.clear();
    }
  }
 
}
 
//Not thread-safe
class WindowTermVectorMapper extends TermVectorMapper {
 
  int start;
  int end;
  LinkedHashMap entries = new LinkedHashMap();
 
  public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) {
    for (int i = 0; i < positions.length; i++) {//unfortunately, we still have to loop over the positions
      //we'll make this inclusive of the boundaries
      if (positions[i] >= start && positions[i] < end){
        WindowEntry entry = entries.get(term);
        if (entry == null) {
          entry = new WindowEntry(term);
          entries.put(term, entry);
        }
        entry.positions.add(positions[i]);
      }
    }
  }
 
  public void setExpectations(String field, int numTerms, boolean storeOffsets, boolean storePositions) {
    // do nothing for this example
    //See also the PositionBasedTermVectorMapper.
  }
 
}
 
class WindowEntry{
  String term;
  List positions = new ArrayList();//a term could appear more than once w/in a position
 
  WindowEntry(String term) {
    this.term = term;
  }
 
  @Override
  public String toString() {
    return "WindowEntry{" +
            "term='" + term + '\'' +
            ", positions=" + positions +
            '}';
  }
}
  • Share this:
  • Email
  • Facebook
  • Digg
  • Share
  • Print
  • Reddit
  • StumbleUpon

Category: Lucene, sentiment analysis, Span Queries, term vectors

Tags: Span Queries, term vectors

17 Responses to “Accessing words around a positional match in Lucene”

  1. [...] Read more [...]

    May 18, 2010 16:04 — Accessing words around a positional match in Lucene « Widgetti

  2. Great post!,

    I’m trying to run the code with Lucene 3.0.1, but I encounter the following exception:

    Exception in thread “main” java.io.IOException: read past EOF
    at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:255)
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:160)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
    at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:71)
    at org.apache.lucene.store.IndexInput.readLong(IndexInput.java:94)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:232)
    at org.apache.lucene.index.SegmentReader.getTermFreqVector(SegmentReader.java:1170)
    at org.apache.lucene.index.DirectoryReader.getTermFreqVector(DirectoryReader.java:475)

    By using Lucene 2.4.1 all is fine. Do you know hot to resolve the problem?

    Thanks a lot,
    Elia Bruni

    May 21, 2010 04:35 — Elia Bruni

  3. Thanks a lot for this example,

    but I am unable to start it in Netbeas 6.9. In the last for – loop:
    for (WindowEntry entry : tvm.entries.values()) {
    System.out.println(“Entry: ” + entry);
    }

    Entries “WindowEntry entry : tvm.entries.values()” are incompatible!!

    tvm.entries.values() seems to be still java.lang.Object!?

    Any Idea what to do?

    Thanks
    Dimitri

    August 2, 2010 13:09 — Diman

  4. Hi Folks,

    I have problems to compile this code. Because class WindowTermVectorMapper delivers a java.lang.Object

    LinkedHashMap entries = new LinkedHashMap(); and use of this Object is not possible as WindowEntry even if you cast..

    Error message:

    run:
    Score Doc: org.apache.lucene.search.ScoreDoc@765291
    Score Doc: org.apache.lucene.search.ScoreDoc@26e431
    Exception in thread “main” java.lang.RuntimeException: Uncompilable source code – incompatible types
    required: lucenepositionsexample.WindowEntry
    found: java.lang.Object
    at lucenepositionsexample.TermVectorFun.main(TermVectorFun.java:95)
    Java Result: 1
    BUILD SUCCESSFUL (total time: 2 seconds)

    August 4, 2010 06:23 — Diman Getz

  5. Hi Diman,

    What version of Lucene are you using?

    August 4, 2010 10:26 — Grant Ingersoll

  6. Hi Grant,

    I tryed to compile with different Lucene versions: Lucene lucene-core-2.4.1, 3.0.2. Also with tested lucidworks – Lucene-version-3.0.1.

    The problem is because of incompatibility of “LinkedHashMap” “entries” object and WindowEntry “entry” – bject.

    August 4, 2010 12:00 — Diman

  7. Can you change the declaration to:
    LinkedHashMap entries = new LinkedHashMap();

    Here’s my full working example:
    package com.lucidimagination.noodles;
    /**
    * Licensed to the Apache Software Foundation (ASF) under one or more
    * contributor license agreements. See the NOTICE file distributed with
    * this work for additional information regarding copyright ownership.
    * The ASF licenses this file to You under the Apache License, Version 2.0
    * (the “License”); you may not use this file except in compliance with
    * the License. You may obtain a copy of the License at
    *
    * http://www.apache.org/licenses/LICENSE-2.0
    *
    * Unless required by applicable law or agreed to in writing, software
    * distributed under the License is distributed on an “AS IS” BASIS,
    * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    * See the License for the specific language governing permissions and
    * limitations under the License.
    */

    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermVectorMapper;
    import org.apache.lucene.index.TermVectorOffsetInfo;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.search.spans.Spans;
    import org.apache.lucene.util.Version;

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.ArrayList;

    /**
    * This class is for demonstration purposes only. No warranty, guarantee, etc. is implied.
    *

    * This is not production quality code!
    */
    public class TermVectorFun {
    public static String[] DOCS = {
    “The quick red fox jumped over the lazy brown dogs.”,
    “Mary had a little lamb whose fleece was white as snow.”,
    “Moby Dick is a story of a whale and a man obsessed.”,
    “The robber wore a black fleece jacket and a baseball cap.”,
    “The English Springer Spaniel is the best of all dogs.”
    };

    public static void main(String[] args) throws IOException {
    RAMDirectory ramDir = new RAMDirectory();
    //Index some made up content
    IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < DOCS.length; i++) {
    Document doc = new Document();
    Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
    doc.add(id);
    //Store both position and offset information
    Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
    doc.add(text);
    writer.addDocument(doc);
    }
    writer.close();
    //Get a searcher
    IndexSearcher searcher = new IndexSearcher(ramDir);
    // Do a search using SpanQuery
    SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
    TopDocs results = searcher.search(fleeceQ, 10);
    for (int i = 0; i < results.scoreDocs.length; i++) {
    ScoreDoc scoreDoc = results.scoreDocs[i];
    System.out.println("Score Doc: " + scoreDoc);
    }
    IndexReader reader = searcher.getIndexReader();
    Spans spans = fleeceQ.getSpans(reader);
    WindowTermVectorMapper tvm = new WindowTermVectorMapper();
    int window = 2;//get the words within two of the match
    while (spans.next() == true) {
    System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end());
    //build up the window
    tvm.start = spans.start() - window;
    tvm.end = spans.end() + window;
    reader.getTermFreqVector(spans.doc(), "content", tvm);
    for (WindowEntry entry : tvm.entries.values()) {
    System.out.println("Entry: " + entry);
    }
    //clear out the entries for the next round
    tvm.entries.clear();
    }
    }

    }

    //Not thread-safe
    class WindowTermVectorMapper extends TermVectorMapper {

    int start;
    int end;
    LinkedHashMap entries = new LinkedHashMap();

    public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) {
    for (int i = 0; i < positions.length; i++) {//unfortunately, we still have to loop over the positions
    //we'll make this inclusive of the boundaries
    if (positions[i] >= start && positions[i] < end) {
    WindowEntry entry = entries.get(term);
    if (entry == null) {
    entry = new WindowEntry(term);
    entries.put(term, entry);
    }
    entry.positions.add(positions[i]);
    }
    }
    }

    public void setExpectations(String field, int numTerms, boolean storeOffsets, boolean storePositions) {
    // do nothing for this example
    //See also the PositionBasedTermVectorMapper.
    }

    }

    class WindowEntry {
    String term;
    List positions = new ArrayList();//a term could appear more than once w/in a position

    WindowEntry(String term) {
    this.term = term;
    }

    @Override
    public String toString() {
    return “WindowEntry{” +
    “term=’” + term + ‘\” +
    “, positions=” + positions +
    ‘}’;
    }
    }

    August 4, 2010 12:53 — Grant Ingersoll

  8. Thanks, but you postet idetical code like in example. My declarations are identical, also
    LinkedHashMap entries = new LinkedHashMap();

    And according exception, “entry” and “entries” are incompatible types.

    Unfortunatelly I still have errors in

    class WindowTermVectorMapper
    ………

    if (positions[i] >= start
    && positions[i] WindowEntry entry = entries.get(term); for (WindowEntry entry : tvm.entries.values())
    {
    System.out.println(“Entry: ” + entry);
    }
    //clear out the entries for the next round
    tvm.entries.clear();
    }

    Maybe it is because of toString() – Overriding in class WindowEntry ??

    August 4, 2010 16:37 — Diman

  9. Error codelines:

    class WindowTermVectorMapper
    ……

    –> for (WindowEntry entry : tvm.entries.values()) WindowEntry entry = entries.get(term); <–

    August 5, 2010 02:04 — Diman

  10. Error codelines(updated):
    “class TermVectorFun”

    –-!> for (WindowEntry entry : tvm.entries.values()) WindowEntry entry = entries.get(term); <!–-

    August 5, 2010 03:52 — Diman

  11. Error codelines(updated):
    “class TermVectorFun”
    for (WindowEntry entry : tvm.entries.values())

    //—————————————————//

    “class WindowTermVectorMapper”
    WindowEntry entry = entries.get(term);

    August 5, 2010 03:58 — Diman

  12. I think the brackets are being stripped.

    package com.lucidimagination.noodles;
    /**
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements.  See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License"); you may not use this file except in compliance with
     * the License.  You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermVectorMapper;
    import org.apache.lucene.index.TermVectorOffsetInfo;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.search.spans.Spans;
    import org.apache.lucene.util.Version;
    
    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.ArrayList;
    
    /**
     * This class is for demonstration purposes only.  No warranty, guarantee, etc. is implied.
     *
    

    * This is not production quality code! */ public class TermVectorFun { public static String[] DOCS = { "The quick red fox jumped over the lazy brown dogs.", "Mary had a little lamb whose fleece was white as snow.", "Moby Dick is a story of a whale and a man obsessed.", "The robber wore a black fleece jacket and a baseball cap.", "The English Springer Spaniel is the best of all dogs." }; public static void main(String[] args) throws IOException { RAMDirectory ramDir = new RAMDirectory(); //Index some made up content IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED); for (int i = 0; i < DOCS.length; i++) { Document doc = new Document(); Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS); doc.add(id); //Store both position and offset information Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS); doc.add(text); writer.addDocument(doc); } writer.close(); //Get a searcher IndexSearcher searcher = new IndexSearcher(ramDir); // Do a search using SpanQuery SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece")); TopDocs results = searcher.search(fleeceQ, 10); for (int i = 0; i < results.scoreDocs.length; i++) { ScoreDoc scoreDoc = results.scoreDocs[i]; System.out.println("Score Doc: " + scoreDoc); } IndexReader reader = searcher.getIndexReader(); Spans spans = fleeceQ.getSpans(reader); WindowTermVectorMapper tvm = new WindowTermVectorMapper(); int window = 2;//get the words within two of the match while (spans.next() == true) { System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end()); //build up the window tvm.start = spans.start() - window; tvm.end = spans.end() + window; reader.getTermFreqVector(spans.doc(), "content", tvm); for (WindowEntry entry : tvm.entries.values()) { System.out.println("Entry: " + entry); } //clear out the entries for the next round tvm.entries.clear(); } } } //Not thread-safe class WindowTermVectorMapper extends TermVectorMapper { int start; int end; LinkedHashMap<String, WindowEntry> entries = new LinkedHashMap<String, WindowEntry>(); public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) { for (int i = 0; i < positions.length; i++) {//unfortunately, we still have to loop over the positions //we'll make this inclusive of the boundaries if (positions[i] >= start && positions[i] < end) { WindowEntry entry = entries.get(term); if (entry == null) { entry = new WindowEntry(term); entries.put(term, entry); } entry.positions.add(positions[i]); } } } public void setExpectations(String field, int numTerms, boolean storeOffsets, boolean storePositions) { // do nothing for this example //See also the PositionBasedTermVectorMapper. } } class WindowEntry { String term; List<Integer> positions = new ArrayList<Integer>();//a term could appear more than once w/in a position WindowEntry(String term) { this.term = term; } @Override public String toString() { return "WindowEntry{" + "term='" + term + '\'' + ", positions=" + positions + '}'; } }

    August 5, 2010 04:44 — Grant Ingersoll

  13. Wow!! Amazing, it works now!!

    Thanks a lot Grant!!

    P.S.: works with every lucene package from 2.4.1 to 3.0.1

    August 5, 2010 05:13 — Diman

  14. Hello Grant!!

    Thanks again for this great example.
    I am trying to build code with allows me to do same with exact “phrase” match. Is there any way to get word around a phrase? Or at least start end end position of a phrase?

    August 10, 2010 16:31 — Diman

  15. Sure, have a look at using the SpanNearQuery, from which you can get positions, etc.

    August 11, 2010 05:42 — Grant Ingersoll

  16. Thanks alot, really helpful for people who are interested in lucene.

    June 9, 2011 23:52 — Kiran Umadi

  17. Thanks for the example Grant. I adapted your code to find concordances – the concordance was then culled manually to provide input for developing a component to automatically detect age group patterns in a document.

    Here is a link if you are interested:
    http://sujitpal.blogspot.com/2011/08/implementing-concordance-with-lucene.html

    August 23, 2011 22:16 — Sujit Pal

Leave a Reply

Go to Blog Front Page

  • Recent Posts

    • Lucene Revolution 2012 – Call for Participation now open!
    • SolrCloud is Coming (and looking to mix in even more ‘NoSQL’)
    • Our Solr Reference Guide updated for v3.5
    • Enhancing Discovery with Solr and Mahout – session slides now available!
    • Solr and LucidWorks feature matrix available
    • LucidWorks Enterprise latest version 2.0.1 released!
    • Why Not AND, OR, And NOT?
    • Options to tune document’s relevance in Solr
    • Dallas JavaMUG December 14th 2011
    • Apache Mahout user meeting – session slides and videos are now available!
  • Archives

    • January 2012
    • December 2011
    • November 2011
    • October 2011
    • September 2011
    • August 2011
  • Tags

    acts_as_solr apache Apache Mahout best practices chump code4lib dismax drupal enterprise search Erik Hatcher field collapsing function query Grant Ingersoll hoss image isfdb local params Lucene lucene revolution LucidGaze lucid imagination Mahout Marc Krellenstein Mark Miller nested queries nutch Open Source Open Source Search qparser query parser queryparser Rails release result grouping Richmond Ruby schema design sint Solr solr 3.1 solr 4.0 solr cloud sortable Tika VA
  • Contact Us
  • About Lucid Imagination
  • Help & Support
  • Training
  • Privacy Policy
  • Legal Terms of Use
  • Copyrights and Disclaimers
  • Log in

Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2011 Lucid Imagination. All Right reserved.

loading Cancel
Post was not sent - check your email addresses!
Email check failed, please try again
Sorry, your blog cannot share posts by email.