From time to time, users on the Lucene mailing list ask a variant of the following question:
Given a term match in a document, what’s the best way to get a window of words around that match?
Getting a window of words around a match can be useful for a lot of things, including, to name a few:
- Highlighting (although I’d recommend using Lucene’s Highlighter package for that)
- Co-occurrence analysis
- Sentiment analysis
- Question Answering
Unfortunately, given how inverted indexes are structured, retrieving content around a match isn’t efficient without doing some extra work during indexing. In Lucene, this “extra work” involves creating and storing Term Vectors with position and offset information.
Storing Term Vector info can be done by adding in the appropriate code during Field construction, as in the following indexing example where I create an index from a few dummy documents (complete code is at the bottom of this post):
RAMDirectory ramDir = new RAMDirectory(); //Index some made up content IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED); for (int i = 0; i < DOCS.length; i++){ Document doc = new Document(); Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS); doc.add(id); //Store both position and offset information Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS); doc.add(text); writer.addDocument(doc); } writer.close();
Notice the use of the Field.TermVector.WITH_POSITIONS_OFFSETS when constructing the text Field. This tells Lucene to store term vector information on a per document basis (in other words, not inverted) with both Position and Offset information. (Due note, other storage options are available, see Field.TermVector. Also note, storing Term Vectors will cost you in disk space.)
For completeness, the DOCS array looks like:
public static String [] DOCS = { "The quick red fox jumped over the lazy brown dogs.", "Mary had a little lamb whose fleece was white as snow.", "Moby Dick is a story of a whale and a man obsessed.", "The robber wore a black fleece jacket and a baseball cap.", "The English Springer Spaniel is the best of all dogs." };
Now that we have an index created, we need to do a search. In our case, we need to do a position-based search as opposed to the more traditional document-based search. In other words, it is not good enough to simply know whether a term is in a document or not (think TermQuery), we need to know where in the document the match occurred. Lucene enables position-based search through a series of Query classes collectively known as Span Queries. (See SpanQuery and its derivitaves in the org.apache.lucene.search.spans package.)
Again, an example is warranted. Assume we wanted to find where the term “fleece” occurs. In this case, let’s start by doing a “normal” search, wherein we submit a query to the index and print out the Dcoument id and Score:
IndexSearcher searcher = new IndexSearcher(ramDir); // Do a search using SpanQuery SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece")); TopDocs results = searcher.search(fleeceQ, 10); for (int i = 0; i < results.scoreDocs.length; i++) { ScoreDoc scoreDoc = results.scoreDocs[i]; System.out.println("Score Doc: " + scoreDoc); }
That code looks pretty much like any basic search code with the exception that I substituted in a SpanTermQuery for what is often a TermQuery. In fact, so far this isn’t all that interesting and it is likely to be slower than the comparable TermQuery too.
What does make it interesting? If you look at the SpanQuery API, you will notice a method called getSpans(). The getSpans() method provides positional information about where a match occurred. Thus, to print out the positional information, one might do:
Spans spans = fleeceQ.getSpans(searcher.getIndexReader()); while (spans.next() == true){ System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end()); }
First off, notice getting the Spans is completely independent of running the actual query. In fact, you need not run the query first. Second, the start and end values are the positions of the tokens, not the offsets.
Now, given the position information, the question becomes how to get only those tokens around the match. To answer that, we need a few things:
- The specification of a window in terms of positions. For instance, I want the terms within two positions of the start and end of the span.
- A TermVectorMapper implementation that is aware of both the window and the position. Think of a TermVectorMapper as the equivalent of a SAX parser for Lucene’s Term Vectors. Basically, instead of assuming the data structure (like DOM) it provides call backs and let’s you, the programmer, decide on the data structures. See the PositionBasedTermVectorMapper for a useful implementation.
As a quick hack (and it is by no means production quality), I created the following code that modifies the printing code above:
WindowTermVectorMapper tvm = new WindowTermVectorMapper(); int window = 2;//get the words within two of the match, inclusive of the boundaries while (spans.next() == true) { System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end()); //build up the window tvm.start = spans.start() - window; tvm.end = spans.end() + window; reader.getTermFreqVector(spans.doc(), "content", tvm); for (WindowEntry entry : tvm.entries.values()) { System.out.println("Entry: " + entry); } //clear out the entries for the next round tvm.entries.clear(); }
Now, in this chunk of code, I first create a WindowTermVectorMapper (WTVM, beautiful name, right?) and then in the Spans loop, I tell the WTVM what my window looks like. Next up, I ask Lucene’s IndexReader for the TermVector and pass in my TermVectorMapper. Finally, I print out the entries.
Of course, the last bit of useful info is what does the WTVM look like. Here’s the most useful snippet of code:
public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) { for (int i = 0; i < positions.length; i++) {//unfortunately, we still have to loop over the positions //we'll make this inclusive of the boundaries if (positions[i] >= start && positions[i] < end){ WindowEntry entry = entries.get(term); if (entry == null) { entry = new WindowEntry(term); entries.put(term, entry); } entry.positions.add(positions[i]); } } }
As you can see, I just look at the positions and check to see if the current term has an entry that is inside the start and end. Obviously, you can do more interesting things here, but I’ll leave that up to you. Also know that there are a few TermVectorMapper implementations in the Lucene distribution that you can use as examples.
That about wraps it up. From here, one can easily imagine different ways to utilize the information returned from the Term Vector Mapper to process information about the terms in a window.
The full code is below. It is intended for demonstration purposes only. Please note the disclaimers, etc.
package com.lucidimagination.noodles; /** * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ import org.apache.lucene.store.RAMDirectory; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.Term; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.TermVectorMapper; import org.apache.lucene.index.TermVectorOffsetInfo; import org.apache.lucene.analysis.standard.StandardAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.TopDocs; import org.apache.lucene.search.ScoreDoc; import org.apache.lucene.search.spans.SpanTermQuery; import org.apache.lucene.search.spans.Spans; import java.io.IOException; import java.util.LinkedHashMap; import java.util.List; import java.util.ArrayList; /** * This class is for demonstration purposes only. No warranty, guarantee, etc. is implied. * * This is not production quality code! * * **/ public class TermVectorFun { public static String[] DOCS = { "The quick red fox jumped over the lazy brown dogs.", "Mary had a little lamb whose fleece was white as snow.", "Moby Dick is a story of a whale and a man obsessed.", "The robber wore a black fleece jacket and a baseball cap.", "The English Springer Spaniel is the best of all dogs." }; public static void main(String[] args) throws IOException { RAMDirectory ramDir = new RAMDirectory(); //Index some made up content IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED); for (int i = 0; i < DOCS.length; i++) { Document doc = new Document(); Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS); doc.add(id); //Store both position and offset information Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS); doc.add(text); writer.addDocument(doc); } writer.close(); //Get a searcher IndexSearcher searcher = new IndexSearcher(ramDir); // Do a search using SpanQuery SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece")); TopDocs results = searcher.search(fleeceQ, 10); for (int i = 0; i < results.scoreDocs.length; i++) { ScoreDoc scoreDoc = results.scoreDocs[i]; System.out.println("Score Doc: " + scoreDoc); } IndexReader reader = searcher.getIndexReader(); Spans spans = fleeceQ.getSpans(reader); WindowTermVectorMapper tvm = new WindowTermVectorMapper(); int window = 2;//get the words within two of the match while (spans.next() == true) { System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end()); //build up the window tvm.start = spans.start() - window; tvm.end = spans.end() + window; reader.getTermFreqVector(spans.doc(), "content", tvm); for (WindowEntry entry : tvm.entries.values()) { System.out.println("Entry: " + entry); } //clear out the entries for the next round tvm.entries.clear(); } } } //Not thread-safe class WindowTermVectorMapper extends TermVectorMapper { int start; int end; LinkedHashMap entries = new LinkedHashMap(); public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) { for (int i = 0; i < positions.length; i++) {//unfortunately, we still have to loop over the positions //we'll make this inclusive of the boundaries if (positions[i] >= start && positions[i] < end){ WindowEntry entry = entries.get(term); if (entry == null) { entry = new WindowEntry(term); entries.put(term, entry); } entry.positions.add(positions[i]); } } } public void setExpectations(String field, int numTerms, boolean storeOffsets, boolean storePositions) { // do nothing for this example //See also the PositionBasedTermVectorMapper. } } class WindowEntry{ String term; List positions = new ArrayList();//a term could appear more than once w/in a position WindowEntry(String term) { this.term = term; } @Override public String toString() { return "WindowEntry{" + "term='" + term + '\'' + ", positions=" + positions + '}'; } }


[...] Read more [...]
May 18, 2010 16:04 — Accessing words around a positional match in Lucene « Widgetti
Great post!,
I’m trying to run the code with Lucene 3.0.1, but I encounter the following exception:
Exception in thread “main” java.io.IOException: read past EOF
at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:255)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:160)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:71)
at org.apache.lucene.store.IndexInput.readLong(IndexInput.java:94)
at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:232)
at org.apache.lucene.index.SegmentReader.getTermFreqVector(SegmentReader.java:1170)
at org.apache.lucene.index.DirectoryReader.getTermFreqVector(DirectoryReader.java:475)
By using Lucene 2.4.1 all is fine. Do you know hot to resolve the problem?
Thanks a lot,
Elia Bruni
May 21, 2010 04:35 — Elia Bruni
Thanks a lot for this example,
but I am unable to start it in Netbeas 6.9. In the last for – loop:
for (WindowEntry entry : tvm.entries.values()) {
System.out.println(“Entry: ” + entry);
}
Entries “WindowEntry entry : tvm.entries.values()” are incompatible!!
tvm.entries.values() seems to be still java.lang.Object!?
Any Idea what to do?
Thanks
Dimitri
August 2, 2010 13:09 — Diman
Hi Folks,
I have problems to compile this code. Because class WindowTermVectorMapper delivers a java.lang.Object
LinkedHashMap entries = new LinkedHashMap(); and use of this Object is not possible as WindowEntry even if you cast..
Error message:
run:
Score Doc: org.apache.lucene.search.ScoreDoc@765291
Score Doc: org.apache.lucene.search.ScoreDoc@26e431
Exception in thread “main” java.lang.RuntimeException: Uncompilable source code – incompatible types
required: lucenepositionsexample.WindowEntry
found: java.lang.Object
at lucenepositionsexample.TermVectorFun.main(TermVectorFun.java:95)
Java Result: 1
BUILD SUCCESSFUL (total time: 2 seconds)
August 4, 2010 06:23 — Diman Getz
Hi Diman,
What version of Lucene are you using?
August 4, 2010 10:26 — Grant Ingersoll
Hi Grant,
I tryed to compile with different Lucene versions: Lucene lucene-core-2.4.1, 3.0.2. Also with tested lucidworks – Lucene-version-3.0.1.
The problem is because of incompatibility of “LinkedHashMap” “entries” object and WindowEntry “entry” – bject.
August 4, 2010 12:00 — Diman
Can you change the declaration to: entries = new LinkedHashMap ();
LinkedHashMap
Here’s my full working example:
package com.lucidimagination.noodles;
/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the “License”); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an “AS IS” BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.util.Version;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.ArrayList;
/**
* This is not production quality code!* This class is for demonstration purposes only. No warranty, guarantee, etc. is implied.
*
*/
public class TermVectorFun {
public static String[] DOCS = {
“The quick red fox jumped over the lazy brown dogs.”,
“Mary had a little lamb whose fleece was white as snow.”,
“Moby Dick is a story of a whale and a man obsessed.”,
“The robber wore a black fleece jacket and a baseball cap.”,
“The English Springer Spaniel is the best of all dogs.”
};
public static void main(String[] args) throws IOException {
RAMDirectory ramDir = new RAMDirectory();
//Index some made up content
IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED);
for (int i = 0; i < DOCS.length; i++) {
Document doc = new Document();
Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
doc.add(id);
//Store both position and offset information
Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
doc.add(text);
writer.addDocument(doc);
}
writer.close();
//Get a searcher
IndexSearcher searcher = new IndexSearcher(ramDir);
// Do a search using SpanQuery
SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
TopDocs results = searcher.search(fleeceQ, 10);
for (int i = 0; i < results.scoreDocs.length; i++) {
ScoreDoc scoreDoc = results.scoreDocs[i];
System.out.println("Score Doc: " + scoreDoc);
}
IndexReader reader = searcher.getIndexReader();
Spans spans = fleeceQ.getSpans(reader);
WindowTermVectorMapper tvm = new WindowTermVectorMapper();
int window = 2;//get the words within two of the match
while (spans.next() == true) {
System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end());
//build up the window
tvm.start = spans.start() - window;
tvm.end = spans.end() + window;
reader.getTermFreqVector(spans.doc(), "content", tvm);
for (WindowEntry entry : tvm.entries.values()) {
System.out.println("Entry: " + entry);
}
//clear out the entries for the next round
tvm.entries.clear();
}
}
}
//Not thread-safe
class WindowTermVectorMapper extends TermVectorMapper {
int start; entries = new LinkedHashMap ();
int end;
LinkedHashMap
public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) {
for (int i = 0; i < positions.length; i++) {//unfortunately, we still have to loop over the positions
//we'll make this inclusive of the boundaries
if (positions[i] >= start && positions[i] < end) {
WindowEntry entry = entries.get(term);
if (entry == null) {
entry = new WindowEntry(term);
entries.put(term, entry);
}
entry.positions.add(positions[i]);
}
}
}
public void setExpectations(String field, int numTerms, boolean storeOffsets, boolean storePositions) {
// do nothing for this example
//See also the PositionBasedTermVectorMapper.
}
}
class WindowEntry { positions = new ArrayList();//a term could appear more than once w/in a position
String term;
List
WindowEntry(String term) {
this.term = term;
}
@Override
public String toString() {
return “WindowEntry{” +
“term=’” + term + ‘\” +
“, positions=” + positions +
‘}’;
}
}
August 4, 2010 12:53 — Grant Ingersoll
Thanks, but you postet idetical code like in example. My declarations are identical, also
LinkedHashMap entries = new LinkedHashMap();
And according exception, “entry” and “entries” are incompatible types.
Unfortunatelly I still have errors in
class WindowTermVectorMapper
………
if (positions[i] >= start
&& positions[i] WindowEntry entry = entries.get(term); for (WindowEntry entry : tvm.entries.values())
{
System.out.println(“Entry: ” + entry);
}
//clear out the entries for the next round
tvm.entries.clear();
}
Maybe it is because of toString() – Overriding in class WindowEntry ??
August 4, 2010 16:37 — Diman
Error codelines:
class WindowTermVectorMapper
……
–> for (WindowEntry entry : tvm.entries.values()) WindowEntry entry = entries.get(term); <–
August 5, 2010 02:04 — Diman
Error codelines(updated):
“class TermVectorFun”
–-!> for (WindowEntry entry : tvm.entries.values()) WindowEntry entry = entries.get(term); <!–-
August 5, 2010 03:52 — Diman
Error codelines(updated):
“class TermVectorFun”
for (WindowEntry entry : tvm.entries.values())
//—————————————————//
“class WindowTermVectorMapper”
WindowEntry entry = entries.get(term);
August 5, 2010 03:58 — Diman
I think the brackets are being stripped.
August 5, 2010 04:44 — Grant Ingersoll
Wow!! Amazing, it works now!!
Thanks a lot Grant!!
P.S.: works with every lucene package from 2.4.1 to 3.0.1
August 5, 2010 05:13 — Diman
Hello Grant!!
Thanks again for this great example.
I am trying to build code with allows me to do same with exact “phrase” match. Is there any way to get word around a phrase? Or at least start end end position of a phrase?
August 10, 2010 16:31 — Diman
Sure, have a look at using the SpanNearQuery, from which you can get positions, etc.
August 11, 2010 05:42 — Grant Ingersoll