
February 17, 2009

Highlighting Highlighter Thoughts

Posted by Mark Miller

I have some Highlighter work that I keep meaning to finish up (basic support for highlighting ConstantScoreQuerys), and so I have the Highlighter on my mind…

The History…

The first Lucene Highlighter was written and contributed to Lucene by Mark Harwood, a longtime Lucene contrib Committer and PMC member. Mark came up with a nice, robust, extensible API and a handful of default implementations for the API. It was a very solid Highlighter implementation that has held up nicely in the face of a lot of complicated Analyzers and Filters. A variety of contributors have enriched the code over the years since then (squashing bugs and making improvements), and the Highlighter is currently fairly capable and heavily used.

The Lucene contrib Highlighter was created with a focus on generating text fragments. This allows you to easily generate ‘keywords in context’ type views (i.e., the results list from your favorite search engine). Eventually, the NullFragmenter was added, allowing you to highlight a full document as well. (You could have used the API to write your own NullFragmenter before Lucene added it, which is one of the nice things about the Highlighter’s fairly pluggable API.)

Scoring and Highlighting…

The Highlighter works with a TokenStream and a Query. A TokenStream is just what it sounds like: a stream of tokens (terms, if that’s easier), possibly with additional metadata attached (position, offsets in the original text, etc.). An Analyzer and a document create a TokenStream: apply the Analyzer to the document’s text, and out pop the tokens. By comparing the tokens from the query with the tokens from the document, the Highlighter can identify which tokens should be highlighted (termFromDoc==termFromQuery? Highlight!). The Highlighter works by feeding tokens from the document one at a time to a Scorer, which assigns a score to each token. The QueryScorer assigns the score based on whether the token matches a term in the query. Fragments are then generated and scored based on the underlying token scores. Generally the token score might just be 0 or 1, but you can do gradient highlighting by expanding the range of the scores (if you pass an IndexReader to QueryScorer, it will use term statistics from the index to adjust the score). Finally, a pluggable Formatter implementation actually inserts the highlight markup (using the score to decide what, if any, text to insert).
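To make the token-by-token flow concrete, here is a minimal sketch in plain Java. It is a simplified model, not the Lucene Highlighter API: the `tokenize`, `score`, and `highlight` names are hypothetical stand-ins for the Analyzer, Scorer, and Formatter roles, and fragment generation is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified model of the scoring loop; none of these names are Lucene classes.
class TokenScoringSketch {

    // A token: the term text plus its character offsets in the original text.
    static final class Token {
        final String term;
        final int start;
        final int end;
        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Stand-in for an Analyzer: lowercases and splits on non-alphanumeric characters.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            tokens.add(new Token(text.substring(start, i).toLowerCase(Locale.ROOT), start, i));
        }
        return tokens;
    }

    // Stand-in for a Scorer: 1 if the token matches a query term, otherwise 0.
    static float score(Token token, Set<String> queryTerms) {
        return queryTerms.contains(token.term) ? 1f : 0f;
    }

    // Stand-in for a Formatter: wraps every token with a positive score in <b> tags.
    static String highlight(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Token t : tokenize(text)) {
            if (score(t, queryTerms) > 0f) {
                out.append(text, last, t.start)
                   .append("<b>").append(text, t.start, t.end).append("</b>");
                last = t.end;
            }
        }
        return out.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        // prints: The <b>quick</b> brown <b>fox</b>
        System.out.println(highlight("The quick brown fox", Set.of("quick", "fox")));
    }
}
```

Returning a float rather than a boolean from `score` mirrors the idea that the formatting step can vary its output with the score, which is what gradient highlighting relies on.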

Obtaining a TokenStream for a Document…

Unfortunately, the index does not store the TokenStream for a document, so when it’s time to highlight, it’s up to the user to get a valid TokenStream for the document text. Generally this means shoving the original text for the document through the Analyzer you used for indexing the document. However, if you stored term vectors in your index, the position and/or offset information can be used to reconstruct the TokenStream from info in the index. Especially for large documents, this can be much faster. The TokenSources class in the Highlighter package will build a TokenStream for you, using the best method based on whether term vectors are available or not.
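The idea of rebuilding a token sequence from stored offsets can be sketched as follows. This is an illustration only: the `Map<String, int[][]>` layout (term mapped to start/end offset pairs) is a hypothetical stand-in for term vector data and is not how Lucene or TokenSources represent it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Simplified model: rebuild document-order tokens from per-term offset lists.
class TermVectorRebuildSketch {

    static final class Token {
        final String term;
        final int start;
        final int end;
        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // termVector maps each term to its {startOffset, endOffset} pairs (assumed layout).
    static List<Token> rebuild(Map<String, int[][]> termVector) {
        List<Token> tokens = new ArrayList<>();
        for (Map.Entry<String, int[][]> entry : termVector.entrySet()) {
            for (int[] offsets : entry.getValue()) {
                tokens.add(new Token(entry.getKey(), offsets[0], offsets[1]));
            }
        }
        // Term vectors are grouped by term, so sort by start offset to recover document order.
        tokens.sort(Comparator.comparingInt((Token t) -> t.start));
        return tokens;
    }

    // Convenience helper for inspecting the rebuilt order.
    static List<String> terms(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        // Offsets for the text "to be or not to be".
        Map<String, int[][]> termVector = Map.of(
            "to", new int[][] {{0, 2}, {13, 15}},
            "be", new int[][] {{3, 5}, {16, 18}},
            "or", new int[][] {{6, 8}},
            "not", new int[][] {{9, 12}});
        System.out.println(terms(rebuild(termVector)));
    }
}
```

No analysis of the original text happens here, which is where the speedup on large documents would come from.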

SpanScorer – adding position sensitive highlighting…

A couple of years ago I became interested in adding positional support to the Highlighter. The QueryScorer implementation just checks that tokens from the query match tokens from the document, and it doesn’t take the position of the tokens into account. The result is that if you use a PhraseQuery, rather than just highlighting the phrase, each term in the phrase will be highlighted everywhere it occurs. Attempts had been made to support PhraseQuery highlighting in the past, but not in a way that took advantage of the current Highlighter framework, and not in a way that supported the other positional queries (SpanQuery, MultiPhraseQuery, etc). I wanted pretty much full Highlighting support, as well as all of the goodness that had been squeezed into the current Highlighter. The result of this desire was a new Scorer implementation called SpanScorer.

The new SpanScorer would put the TokenStream into a fast single doc MemoryIndex, convert the query to a SpanQuery approximation, and call getSpans on the MemoryIndex to get all of the position hits for the document. This info is then used in scoring to filter out query terms that match doc terms, but are not in the correct position. The SpanScorer now supports almost the entire range of Lucene queries, and is just as fast as the QueryScorer for Query clauses that are not position sensitive.
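The difference between term-only matching and position-sensitive matching can be illustrated with a small plain-Java sketch. It does not use MemoryIndex or getSpans; `termPositions` and `phrasePositions` are hypothetical helpers that only model which token positions each approach would mark for an exact phrase.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simplified model of position-insensitive vs. position-sensitive matching.
class PhrasePositionSketch {

    // Position-insensitive: every occurrence of any phrase term is a hit.
    static Set<Integer> termPositions(List<String> docTerms, List<String> phrase) {
        Set<Integer> hits = new TreeSet<>();
        for (int i = 0; i < docTerms.size(); i++) {
            if (phrase.contains(docTerms.get(i))) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Position-sensitive: only positions inside an exact occurrence of the phrase are hits.
    static Set<Integer> phrasePositions(List<String> docTerms, List<String> phrase) {
        Set<Integer> hits = new TreeSet<>();
        for (int i = 0; i + phrase.size() <= docTerms.size(); i++) {
            if (docTerms.subList(i, i + phrase.size()).equals(phrase)) {
                for (int j = 0; j < phrase.size(); j++) {
                    hits.add(i + j);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("new", "york", "is", "far", "from", "new", "jersey");
        List<String> phrase = List.of("new", "york");
        System.out.println(termPositions(doc, phrase));   // prints: [0, 1, 5]
        System.out.println(phrasePositions(doc, phrase)); // prints: [0, 1]
    }
}
```

The stray "new" at position 5 is the kind of match that gets filtered out once position information is taken into account.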

Lucene added the SpanScorer in release 2.4 and Solr has also added support for the SpanScorer in 1.3. To take advantage of the SpanScorer in Lucene, just use the SpanScorer rather than the QueryScorer. You can enable the SpanScorer in Solr by passing hl.usePhraseHighlighter=true with your request.

Other Highlighter Implementations…

In Lucene JIRA there are a couple of other Highlighter implementations as well. The most interesting ideas you will find in them come from the two implementations that require term vectors to be stored for the documents you want to highlight. If you can enforce that requirement (something we don’t yet want to do with the default Highlighter), you can use the approach of looking at just the terms in the query, rather than looking at each of the terms in the document. This can be a very large win on large documents. The downsides are that it’s not easy (and it hasn’t been done that I know of) to highlight based on position (phrase/span queries), and the exposed API for custom hooks is less rich. And of course, you have to store term vectors to use such a Highlighter.
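A rough sketch of the query-driven approach, in plain Java: instead of walking every document token, look up only the query terms in a per-document map of offsets. The map layout and the `matchOffsets` and `highlight` helpers are assumptions made for illustration; they are not taken from any of the JIRA implementations, and overlapping matches are not handled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model: drive highlighting from the query terms, not the document tokens.
class QueryTermLookupSketch {

    // termVector maps each term to its {startOffset, endOffset} pairs (assumed layout).
    static List<int[]> matchOffsets(Map<String, int[][]> termVector, Set<String> queryTerms) {
        List<int[]> matches = new ArrayList<>();
        for (String term : queryTerms) {
            int[][] offsets = termVector.get(term);
            if (offsets == null) {
                continue; // query term does not occur in this document
            }
            for (int[] pair : offsets) {
                matches.add(pair);
            }
        }
        // Sort by start offset so the text can be rewritten in a single pass.
        matches.sort(Comparator.comparingInt((int[] pair) -> pair[0]));
        return matches;
    }

    // Wraps each matched span in <b> tags; assumes the spans do not overlap.
    static String highlight(String text, List<int[]> matches) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] pair : matches) {
            out.append(text, last, pair[0])
               .append("<b>").append(text, pair[0], pair[1]).append("</b>");
            last = pair[1];
        }
        return out.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        String text = "to be or not to be";
        Map<String, int[][]> termVector = Map.of(
            "to", new int[][] {{0, 2}, {13, 15}},
            "be", new int[][] {{3, 5}, {16, 18}},
            "or", new int[][] {{6, 8}},
            "not", new int[][] {{9, 12}});
        // prints: to <b>be</b> or not to <b>be</b>
        System.out.println(highlight(text, matchOffsets(termVector, Set.of("be"))));
    }
}
```

The lookup cost depends on the number of query terms and their occurrences rather than on the total number of tokens in the document.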
