August 5, 2009

Getting Started with Payloads

Posted by Grant Ingersoll

Mark Miller recently posted a brief intro to Span Queries, so I thought I would piggyback on top of his work and show how to get started with Payloads (see also [1]).

Introduction

Like Spans, payloads involve the position of terms, but go one step further. Namely, a Payload in Apache Lucene is an arbitrary byte array stored at a specific position (i.e. a specific token/term) in the index. A payload can be used to store weights for specific terms, or things like part-of-speech tags and other semantic information. If you read Brin and Page’s (you know, the Google guys) original paper, The Anatomy of a Large-Scale Hypertextual Web Search Engine, they describe what is essentially payload functionality, whereby they store information about font, etc. at a specific position in the index (remember when you could get your pages ranked number one by using really big fonts?) and then utilize it in search.

There are three parts to taking advantage of payloads in Lucene.  Solr requires a fourth step, which I will explain in a moment.

  1. Add a Payload to one or more Tokens during indexing.
  2. Override the Similarity class to handle scoring payloads.
  3. Use a Payload-aware Query during your search.

For Solr, step 3 requires you to have your own Query Parser, as none of the existing Solr Query Parsers support the BoostingTermQuery. Thus, the fourth step for Solr is to add a Query Parser that supports payloads (and Spans would be nice, too! Please donate if you do this!).

Adding Payloads during indexing

(I’m using Lucene 2.9-dev)

I’m going to use the same indexing code I did for my post on co-occurrence analysis, but with a few modifications.

First off, I’m going to change Analyzers to one of my own creation:

class PayloadAnalyzer extends Analyzer {
    private PayloadEncoder encoder;
 
    PayloadAnalyzer(PayloadEncoder encoder) {
      this.encoder = encoder;
    }
 
    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream result = new WhitespaceTokenizer(reader);
      result = new LowerCaseFilter(result);
      result = new DelimitedPayloadTokenFilter(result, '|', encoder);
      return result;
    }
  }

In this Analyzer, I have the basic whitespace tokenizer and a lower case filter, but then I add in the recently added DelimitedPayloadTokenFilter (DPTF). The DPTF allows you to add payloads to tokens simply by marking up the tokens with a special character followed by the payload value. For instance, I changed my sample docs from the co-occurrence example to now include payload information. Specifically, I said that all nouns should be weighted by 10, all verbs by 5 and all adjectives by 2. (I used http://l2r.cs.uiuc.edu/~cogcomp/pos_demo.php to tag the sentences; any errors are likely mine.) Everything else has no payload. I also stripped all punctuation. My DOCS array now looks like:

public static String[] DOCS = {
          "The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the lazy|2.0 brown|2.0 dogs|10.0",
          "The quick red fox jumped over the lazy brown dogs",//no boosts
          "The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the old|2.0 brown|2.0 box|10.0",
          "Mary|10.0 had a little|2.0 lamb|10.0 whose fleece|10.0 was|5.0 white|2.0 as snow|10.0",
          "Mary had a little lamb whose fleece was white as snow",
          "Mary|10.0 takes on Wolf|10.0 Restoration|10.0 project|10.0 despite ties|10.0 to sheep|10.0 farming|10.0",
          "Mary|10.0 who lives|5.0 on a farm|10.0 is|5.0 happy|2.0 that she|10.0 takes|5.0 a walk|10.0 every day|10.0",
          "Moby|10.0 Dick|10.0 is|5.0 a story|10.0 of a whale|10.0 and a man|10.0 obsessed|10.0",
          "The robber|10.0 wore|5.0 a black|2.0 fleece|10.0 jacket|10.0 and a baseball|10.0 cap|10.0",
          "The English|10.0 Springer|10.0 Spaniel|10.0 is|5.0 the best|2.0 of all dogs|10.0"
  };

The DOCS array simply marks each noun, verb and adjective with a | (pipe symbol) and then a float indicating the boost. I also added some docs that have no boosts at all to demonstrate the differences at query time. The DPTF will then use this to encode the payloads using the PayloadEncoder. A PayloadEncoder is an interface that tells the DPTF how to convert the payload to a byte array. Also note that Lucene’s contrib/analysis package contains several other TokenFilters for adding payloads to a Token and, of course, you can write your own as well.  Furthermore, the PayloadHelper class can help encode/decode payloads for common types.
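To make the markup concrete, here is a minimal plain-Java sketch of what the analysis chain conceptually does to one of these strings: split on whitespace, lowercase, then separate each term from the value after the delimiter. This is not Lucene's DelimitedPayloadTokenFilter; the class DelimitedPayloadSketch, its WeightedToken type and the parse method are hypothetical names invented for illustration, and the sketch keeps the weight as a Float rather than encoding it to bytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative stand-in only: this is NOT Lucene's DelimitedPayloadTokenFilter.
class DelimitedPayloadSketch {

    // A term plus its optional weight; weight is null when the token had no delimiter.
    static final class WeightedToken {
        final String term;
        final Float weight;

        WeightedToken(String term, Float weight) {
            this.term = term;
            this.weight = weight;
        }
    }

    // Split on whitespace, lowercase, then cut each token at the first delimiter.
    static List<WeightedToken> parse(String text, char delimiter) {
        List<WeightedToken> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // happens only for empty input
            }
            String lower = raw.toLowerCase(Locale.ROOT);
            int idx = lower.indexOf(delimiter);
            if (idx < 0) {
                tokens.add(new WeightedToken(lower, null)); // no payload on this token
            } else {
                String term = lower.substring(0, idx);
                Float weight = Float.valueOf(lower.substring(idx + 1));
                tokens.add(new WeightedToken(term, weight));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = "The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the lazy|2.0 brown|2.0 dogs|10.0";
        for (WeightedToken t : parse(doc, '|')) {
            System.out.println(t.term + " -> " + (t.weight == null ? "no payload" : t.weight));
        }
    }
}
```

In the real filter, the text after the delimiter is handed to the PayloadEncoder and stored as bytes at that token's position instead of being kept as a Float.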

Overriding the Similarity Class

The next step, which should happen before indexing, is to override the Similarity class to handle payloads. While it isn’t strictly required that this happens before indexing in THIS case, it is a good habit in case you have made other changes to the Similarity class that are required during indexing (such as overriding how norms are encoded).

Overriding the Similarity is done on both the IndexWriter and the IndexSearcher.  See [3] below for the full code, including the calls to set the similarity. My Similarity implementation simply converts the byte array to a float and returns the float, as in:

class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(String fieldName, byte[] bytes, int offset, int length) {
      return PayloadHelper.decodeFloat(bytes, offset);//we can ignore length here, because we know it is encoded as 4 bytes
    }
}
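To show what "converts the byte array to a float" means in practice, here is a minimal plain-Java sketch of a round trip between a float and four bytes. It is an assumption-laden illustration, not Lucene's PayloadHelper: the class FloatPayloadSketch and its two methods are hypothetical, and the big-endian byte order is chosen here for the example. For real index data, use PayloadHelper on both the encoding and decoding side.

```java
// Illustrative stand-in only: real code should use Lucene's PayloadHelper.
class FloatPayloadSketch {

    // Pack the IEEE 754 bits of a float into 4 bytes, most significant byte first
    // (the byte order is an assumption made for this sketch).
    static byte[] encodeFloat(float value) {
        int bits = Float.floatToIntBits(value);
        return new byte[] {
            (byte) (bits >>> 24),
            (byte) (bits >>> 16),
            (byte) (bits >>> 8),
            (byte) bits
        };
    }

    // Rebuild the float from the 4 bytes starting at offset.
    // Like the scorePayload example, this needs no length because the width is fixed.
    static float decodeFloat(byte[] bytes, int offset) {
        int bits = ((bytes[offset] & 0xFF) << 24)
                 | ((bytes[offset + 1] & 0xFF) << 16)
                 | ((bytes[offset + 2] & 0xFF) << 8)
                 | (bytes[offset + 3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        byte[] payload = encodeFloat(10.0f);
        System.out.println("bytes: " + payload.length);
        System.out.println("decoded: " + decodeFloat(payload, 0));
    }
}
```

The offset parameter matters because the payload bytes may sit inside a larger shared buffer rather than starting at index 0.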

Executing the Query

Currently, Lucene has one payload aware Query called the BoostingTermQuery (BTQ for short,  see [2] for another Payload aware query that may be in Lucene 2.9), which can be used just like any other query.  For instance:

IndexSearcher searcher = new IndexSearcher(dir, true);
searcher.setSimilarity(payloadSimilarity);
BoostingTermQuery btq = new BoostingTermQuery(new Term("body", "fox"));
TopDocs topDocs = searcher.search(btq, 10);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
   ScoreDoc doc = topDocs.scoreDocs[i];
   System.out.println("Doc: " + doc.toString());
   System.out.println("Explain: " + searcher.explain(btq, doc.doc));
}

In this example, I create the BTQ and hand it to the searcher and then print out the results.  Easy peasy, yet so powerful.

Running this yields:

-----------
Results for body:fox of type: org.apache.lucene.search.payloads.BoostingTermQuery
Doc: doc=0 score=4.2344446
Explain: 4.234444 = (MATCH) fieldWeight(body:fox in 0), product of:
7.071068 = (MATCH) btq, product of:
0.70710677 = tf(phraseFreq=0.5)
10.0 = scorePayload(…)
1.9162908 = idf(body: fox=3)
0.3125 = fieldNorm(field=body, doc=0)

Doc: doc=2 score=4.2344446
Explain: 4.234444 = (MATCH) fieldWeight(body:fox in 2), product of:
7.071068 = (MATCH) btq, product of:
0.70710677 = tf(phraseFreq=0.5)
10.0 = scorePayload(…)
1.9162908 = idf(body: fox=3)
0.3125 = fieldNorm(field=body, doc=2)

Doc: doc=1 score=0.42344445
Explain: 0.42344445 = (MATCH) fieldWeight(body:fox in 1), product of:
0.70710677 = (MATCH) btq, product of:
0.70710677 = tf(phraseFreq=0.5)
1.0 = scorePayload(…)
1.9162908 = idf(body: fox=3)
0.3125 = fieldNorm(field=body, doc=1)

Notice how Doc 0 and Doc 2, which both carry a payload of 10.0 on the word “fox”, rank ahead of doc 1, where “fox” has no payload, even though all three have the same term frequency and length.
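The factors in the explain output multiply out directly, which makes it easy to see where the payload enters. The following sketch only redoes that arithmetic with the numbers printed above; the class PayloadScoreSketch and its fieldWeight method are hypothetical names, not Lucene APIs.

```java
// Worked arithmetic only: the factors are copied from the explain() output above.
class PayloadScoreSketch {

    // Mirrors the nesting in the explain output:
    // fieldWeight = (tf * scorePayload) * idf * fieldNorm
    static double fieldWeight(double tf, double payload, double idf, double fieldNorm) {
        double btq = tf * payload;
        return btq * idf * fieldNorm;
    }

    public static void main(String[] args) {
        double tf = Math.sqrt(0.5);   // tf(phraseFreq=0.5), reported as 0.70710677
        double idf = 1.9162908;       // idf(body: fox=3)
        double fieldNorm = 0.3125;    // fieldNorm(field=body, ...)

        // Docs 0 and 2: "fox" carries a payload of 10.0
        System.out.println("with payload 10.0: " + fieldWeight(tf, 10.0, idf, fieldNorm));
        // Doc 1: no payload, so the explain output shows a payload factor of 1.0
        System.out.println("payload factor 1.0: " + fieldWeight(tf, 1.0, idf, fieldNorm));
    }
}
```

Since every other factor is identical across the three documents, the tenfold difference in score comes entirely from the payload.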

Running a simple TermQuery (ignoring payloads) with the exact same Term, on the other hand, yields:

-----------
Results for body:fox of type: org.apache.lucene.search.TermQuery
Doc: doc=0 score=0.59884083
Explain: 0.59884083 = (MATCH) fieldWeight(body:fox in 0), product of:
1.0 = tf(termFreq(body:fox)=1)
1.9162908 = idf(docFreq=3, numDocs=10)
0.3125 = fieldNorm(field=body, doc=0)

Doc: doc=1 score=0.59884083
Explain: 0.59884083 = (MATCH) fieldWeight(body:fox in 1), product of:
1.0 = tf(termFreq(body:fox)=1)
1.9162908 = idf(docFreq=3, numDocs=10)
0.3125 = fieldNorm(field=body, doc=1)

Doc: doc=2 score=0.59884083
Explain: 0.59884083 = (MATCH) fieldWeight(body:fox in 2), product of:
1.0 = tf(termFreq(body:fox)=1)
1.9162908 = idf(docFreq=3, numDocs=10)
0.3125 = fieldNorm(field=body, doc=2)

As you can see, in the TermQuery case, all the docs are scored exactly the same.
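For comparison, the TermQuery numbers can be reproduced from the DefaultSimilarity formulas as I understand them for this Lucene version: tf is the square root of the term frequency, and idf is 1 + ln(numDocs / (docFreq + 1)). The sketch below is plain Java with hypothetical names (TermQueryScoreSketch and its methods), and it takes the fieldNorm of 0.3125 straight from the output rather than deriving it.

```java
// Worked arithmetic only, under the assumption that DefaultSimilarity uses
// tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)).
class TermQueryScoreSketch {

    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // fieldWeight = tf * idf * fieldNorm, as in the TermQuery explain output
    static double fieldWeight(double freq, int docFreq, int numDocs, double fieldNorm) {
        return tf(freq) * idf(docFreq, numDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        // "fox" occurs once per matching doc, in 3 of the 10 docs
        System.out.println("idf:   " + idf(3, 10));
        System.out.println("score: " + fieldWeight(1.0, 3, 10, 0.3125));
    }
}
```

With no payload factor in the product, nothing distinguishes the three documents, which is why their scores are identical.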

Next Steps

As you can see from above, getting started with Payloads is pretty easy. In reality, the only hard part is determining what exactly to put in your payload and then how it should factor into your score. Lucene takes care of the rest. Open source tools like UIMA and OpenNLP, as well as products from proprietary vendors, can often be used to provide higher-level lexical, syntactic and semantic information about tokens, thus giving you the power to create very expressive payloads and richer search applications.

[1] See Michael Busch’s talk at the last SF Meetup for more details on payloads: http://www.meetup.com/SFBay-Lucene-Solr-Meetup/files/

[2] https://issues.apache.org/jira/browse/LUCENE-1341

[3] Full class:

package com.lucidimagination.noodles;
 
import junit.framework.TestCase;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.PayloadEncoder;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.payloads.BoostingTermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
 
import java.io.Reader;
import java.io.IOException;
 
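/**
 * Indexes a handful of sentences whose terms carry float payloads, then
 * compares BoostingTermQuery and TermQuery results for the same term.
 **/
public class PayloadTest extends TestCase {
  Directory dir;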
 
  public static String[] DOCS = {
          "The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the lazy|2.0 brown|2.0 dogs|10.0",
          "The quick red fox jumped over the lazy brown dogs",//no boosts
          "The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the old|2.0 brown|2.0 box|10.0",
          "Mary|10.0 had a little|2.0 lamb|10.0 whose fleece|10.0 was|5.0 white|2.0 as snow|10.0",
          "Mary had a little lamb whose fleece was white as snow",
          "Mary|10.0 takes on Wolf|10.0 Restoration|10.0 project|10.0 despite ties|10.0 to sheep|10.0 farming|10.0",
          "Mary|10.0 who lives|5.0 on a farm|10.0 is|5.0 happy|2.0 that she|10.0 takes|5.0 a walk|10.0 every day|10.0",
          "Moby|10.0 Dick|10.0 is|5.0 a story|10.0 of a whale|10.0 and a man|10.0 obsessed|10.0",
          "The robber|10.0 wore|5.0 a black|2.0 fleece|10.0 jacket|10.0 and a baseball|10.0 cap|10.0",
          "The English|10.0 Springer|10.0 Spaniel|10.0 is|5.0 the best|2.0 of all dogs|10.0"
  };
  protected PayloadSimilarity payloadSimilarity;
 
  @Override
  protected void setUp() throws Exception {
    dir = new RAMDirectory();
 
    PayloadEncoder encoder = new FloatEncoder();
    IndexWriter writer = new IndexWriter(dir, new PayloadAnalyzer(encoder), true, IndexWriter.MaxFieldLength.UNLIMITED);
    payloadSimilarity = new PayloadSimilarity();
    writer.setSimilarity(payloadSimilarity);
    for (int i = 0; i < DOCS.length; i++) {
      Document doc = new Document();
      Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
      doc.add(id);
      //Analyze the body so the PayloadAnalyzer can attach payloads to its tokens
      Field text = new Field("body", DOCS[i], Field.Store.NO, Field.Index.ANALYZED);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();
  }
 
  public void testPayloads() throws Exception {
    IndexSearcher searcher = new IndexSearcher(dir, true);
    searcher.setSimilarity(payloadSimilarity);//set the similarity.  Very important
    BoostingTermQuery btq = new BoostingTermQuery(new Term("body", "fox"));
    TopDocs topDocs = searcher.search(btq, 10);
    printResults(searcher, btq, topDocs);
 
    TermQuery tq = new TermQuery(new Term("body", "fox"));
    topDocs = searcher.search(tq, 10);
    printResults(searcher, tq, topDocs);
  }
 
  private void printResults(IndexSearcher searcher, Query query, TopDocs topDocs) throws IOException {
    System.out.println("-----------");
    System.out.println("Results for " + query + " of type: " + query.getClass().getName());
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
      ScoreDoc doc = topDocs.scoreDocs[i];
      System.out.println("Doc: " + doc.toString());
      System.out.println("Explain: " + searcher.explain(query, doc.doc));
    }
  }
 
  class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(String fieldName, byte[] bytes, int offset, int length) {
      return PayloadHelper.decodeFloat(bytes, offset);//we can ignore length here, because we know it is encoded as 4 bytes
    }
  }
 
  class PayloadAnalyzer extends Analyzer {
    private PayloadEncoder encoder;
 
    PayloadAnalyzer(PayloadEncoder encoder) {
      this.encoder = encoder;
    }
 
    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream result = new WhitespaceTokenizer(reader);
      result = new LowerCaseFilter(result);
      result = new DelimitedPayloadTokenFilter(result, '|', encoder);
      return result;
    }
  }
}
Category: apache, BoostingTermQuery, Lucene, Payloads, Solr

20 Responses to “Getting Started with Payloads”

  1. Powerful payload – allowing control of boost on term level! Great introduction. I’m wondering good use cases of using payload in real life search engine. what type of searches will benefit from using payload?

    September 24, 2009 11:20 — AJ Chen

  2. The Google paper linked above provides some examples (they stored font and other information in the payload). You could also use it to override specific term weights. I’ve also used it to store part of speech information that allowed me to boost words that were a specific part of speech.

    September 24, 2009 12:02 — Grant Ingersoll

  3. In this example, the weight of each word are known. If I do not want to allow users to

    Inquiries, know the weight of fox | 10. 0 to 10, how can I do?

    November 29, 2009 23:45 — liupeng

  4. [...] has a post introducing Lucene payloads. This is the primary way to get token/term level metadata into a lucene index. See also Michael [...]

    February 4, 2010 21:33 — News of the day: Eclipse AppEngine Plugin, New Chome Beta, Lucene Payloads, …- Information Retrieval Blog

  5. Useful intro. A couple of questions:

    What is a payload of a span supposed to look like? Is it merely the concatenation of the payloads of the constituent term occurrences? I ask because I’m creating new span operators and want to get the payloads correct.

    Second question: I see that NearSpansOrdered allocates a new hashset for every match, to handle potential payloads (in the shrinkToAfterShortestMatch method). Doesn’t that make the operator really slow, even for non-payload queries? Or am I missing something obvious?

    Thanks!

    February 7, 2010 22:56 — Paul Nelson

  6. How to integrate the Lucene payloads with solr katta patch.

    March 31, 2010 20:14 — Jhonleel

  7. Not sure how to integrate w/ Katta/Solr. That being said, the TokenFilter is there to support it, so it should just work from the indexing side.

    April 1, 2010 08:03 — Grant Ingersoll

  8. I am not able to find the BoostingTermQuery class in Lucene 3.0. Can someone explain why this class is not there in 3.0, or where to get it?

    April 16, 2010 01:12 — Ajay

  9. It’s now called the PayloadTermQuery

    April 16, 2010 04:48 — Grant Ingersoll

  10. Thanks Grant but i am struggling to use PayTermQuery because of PayloadFunction. I didnt find any example.
    In the example mention here if I want to extract payload value how can I do that e.g. 10.0 = scorePayload(…)
    I am not sure any method to get this value.
    Could you please help me to get this value ?

    April 18, 2010 04:03 — Ajay

  11. [...] Lucid Imagination » Getting Started with Payloads, I introduced the basics of payloads, but that article is now slightly out of date if you are using [...]

    April 18, 2010 04:34 — Lucid Imagination » Refresh: Getting Started with Payloads

  12. Ajay,

    See http://www.lucidimagination.com/blog/2010/04/18/refresh-getting-started-with-payloads/

    I posted an update to this for Lucene 3.0. If you want the same functionality of the BoostingTermQuery, just use the AveragePayloadFunction.

    April 18, 2010 04:35 — Grant Ingersoll

  13. Hi Grant,
    Thanks for the reply. PayTermQuery works now but I also wants to get some API to get scorepayload value which is currently getting printed in searcher.explain(query, doc.doc) but I want something like getPayLoadValue() which will return payload associated with term.
    is there any open API ?

    April 18, 2010 11:13 — Ajay

  14. [...] using BoostTermQuery but I am not able to extract payload value. I am using example mention in http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/comment-page-1/#commen... Could you please help me to get payload value ? Related Posts:Term offsets for highlighting [...]

    April 18, 2010 11:23 — Payload Example for Lucene 3.0.0 : 28

  15. Ajay,

    Have a look at the TermPositions class, either that or Span.getPayload()

    April 18, 2010 16:42 — Grant Ingersoll

  16. Thanks Grant I am able to manage getpayload value using TermPosition class but now I am having one issue. I have document with doc_id and Title and I want to get payload for a given term in a given document but current API’s takes input as term only. There is no parameter for document id.
    API example:
    Term t = new Term("Title", "Java");
    TermPositions positions = ir.termPositions(t);

    I want to iterate for all documents and wants to get payload for each document separately. Is there any API available ?

    April 19, 2010 05:23 — Ajay

  17. Hey, Grant
    Its a great article. After reading it, i feel payload can have many applications. But One thing i was not able to figure out, in the article u are using a specific term to query. How can i use it with query parser so that i can search multiple terms.

    June 6, 2010 22:54 — Arpit

  18. Hi Arpit,

    Unfortunately, there is no Query Parser support yet, but see https://issues.apache.org/jira/browse/SOLR-1337

    June 7, 2010 04:26 — Grant Ingersoll

  19. In your example you use payloads to add metadata to a term. The metadata you add is weight, which is used to change a term’s weight when returned in a search result.

    I have a different usecase and I’d like you to comment whether my usecase is consistent with the intended use of the payload feature.

    Specifically, I’d like to store a video transcript time-code with each term (the timecode is the number of milliseconds of elapsed time from the start of the video where the term occurs). the timecode is not data that affects Lucene’s indexing or searching behavior, rather it is simply data I want associated with the term.

    June 18, 2010 07:03 — Peter Wilkins

  20. Hi Peter,

    Payloads would work fine for that. You will need to use TermPositions to access it or SpanQuery if you want to get the actual payload out at some point other than as part of a search.

    June 18, 2010 07:37 — Grant Ingersoll

Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2011 Lucid Imagination. All rights reserved.
