May 23, 2010

What’s a “DisMax” ?

Posted by hossman

The term “dismax” gets tossed around on the Solr lists frequently, which can be fairly confusing to new users. It originated as a shorthand name for the DisMaxRequestHandler (which I named after the DisjunctionMaxQueryParser, which I named after the DisjunctionMaxQuery class that it uses heavily). In recent years, the DisMaxRequestHandler and the StandardRequestHandler were both refactored into a single SearchHandler class, and now the term “dismax” usually refers to the DisMaxQParser.

Clear as mud, right?

Regardless of whether you use the DisMaxRequestHandler via the qt=dismax parameter, or use the SearchHandler with the DisMaxQParser via defType=dismax, the end result is that your q parameter gets parsed by the DisjunctionMaxQueryParser.

The original goals of dismax (whichever meaning you might infer) have never changed:

… supports a simplified version of the Lucene QueryParser syntax. Quotes can be used to group phrases, and +/- can be used to denote mandatory and optional clauses … but all other Lucene query parser special characters are escaped to simplify the user experience. The handler takes responsibility for building a good query from the user’s input using BooleanQueries containing DisjunctionMaxQueries across fields and boosts you specify. It also allows you to provide additional boosting queries, boosting functions, and filtering queries to artificially affect the outcome of all searches. These options can all be specified as default parameters for the handler in your solrconfig.xml or overridden in the Solr query URL.

In short: You worry about what fields and boosts you want to use when you configure it, your users just give you words w/o worrying too much about syntax.

The magic of dismax (in my opinion) comes from the query structure it produces. What it essentially boils down to is matrix multiplication: a one column matrix of each “chunk” of your user’s input, multiplied by a one row matrix of the qf fields to produce a big matrix of every field:chunk combination. The matrix is then turned into a BooleanQuery consisting of DisjunctionMaxQueries for each row in the matrix. DisjunctionMaxQuery is used because its score is determined by the maximum score of its subclauses (instead of the sum, like a BooleanQuery), so no one word from the user input dominates the final score. The best way to explain this is with an example, so let’s consider the following input…

defType = dismax
     mm = 50%
     qf = features^2 name^3
      q = +"apache solr" search server

First off, we consider the “markup” characters of the parser that appear in this q string:

  • white space – divides the input string into chunks
  • quotes – make a single phrase chunk
  • + – makes a chunk mandatory

So we have 3 “chunks” of user input:

  • “apache solr” (must match)
  • “search” (should match)
  • “server” (should match)

If we “multiply” that with our qf list (features, name) we get a matrix like this…

features:"apache solr" name:"apache solr" (must match)
features:"search" name:"search" (should match)
features:"server" name:"server" (should match)
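
The chunking and “multiplication” steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Solr implements it: the split_chunks and multiply helpers and the regular expression are hypothetical, and the real parser deals with many more cases (escaping, unbalanced quotes, and so on).

```python
import re

# Hypothetical tokenizer: an optional +/- prefix, then a quoted phrase or a bare word.
CHUNK_RE = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def split_chunks(q):
    """Split user input into (text, occur) pairs."""
    chunks = []
    for prefix, phrase, word in CHUNK_RE.findall(q):
        occur = {"+": "must match", "-": "must not match"}.get(prefix, "should match")
        chunks.append((phrase or word, occur))
    return chunks

def multiply(chunks, fields):
    """Build one row per chunk and one column per qf field."""
    rows = []
    for text, occur in chunks:
        cells = ['{}:"{}"'.format(field, text) for field in fields]
        rows.append(cells + [occur])
    return rows

chunks = split_chunks('+"apache solr" search server')
for row in multiply(chunks, ["features", "name"]):
    print(" ".join(row[:-1]), "(" + row[-1] + ")")
```

Each printed row corresponds to one DisjunctionMaxQuery in the final query, and the trailing label says whether that row is a mandatory or an optional clause.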

If we then factor in the mm param to determine the “minimum number of ‘ShouldMatch’ clauses that (ahem) must match” (50% of 2 == 1) we get the following query structure (in pseudo-code)…

q = BooleanQuery(
  minNumberShouldMatch => 1,
  booleanClauses => ClauseList(
    MustMatch(DisjunctionMaxQuery(
      PhraseQuery("features","apache solr")^2,
      PhraseQuery("name","apache solr")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("features","search")^2,
      TermQuery("name","search")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("features","server")^2,
      TermQuery("name","server")^3))
  ));
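
The arithmetic behind minNumberShouldMatch can be sketched as follows. The min_should_match helper is hypothetical and handles only two simple forms of mm (a plain integer and a positive percentage). It assumes percentages are rounded down, which is how I understand the mm documentation, so check that against your Solr version.

```python
def min_should_match(mm, optional_clauses):
    """Resolve a simple mm value against the number of 'ShouldMatch' clauses.

    Handles only a plain integer ("2") or a positive percentage ("50%").
    Percentages are rounded down (an assumption of this sketch).
    """
    if mm.endswith("%"):
        required = optional_clauses * int(mm[:-1]) // 100
    else:
        required = int(mm)
    # Never require more clauses than exist, or fewer than zero.
    return max(0, min(required, optional_clauses))

print(min_should_match("50%", 2))  # 50% of 2 == 1
```

Note that the mandatory “apache solr” clause is not counted here: only the two optional chunks feed into the calculation.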

With me so far, right?

Where people tend to get tripped up is in thinking about how Solr’s per-field analysis configuration (in schema.xml) impacts all of this. Our example above was pretty straightforward, but let’s consider for a moment what might happen if:
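
To see why the “max instead of sum” behavior matters, here is a rough sketch using invented per-field scores. The numbers are made up purely for illustration, and the sketch ignores everything else that goes into a real Lucene score (boosts, norms, and so on).

```python
# Invented per-field scores for each chunk; real Lucene scores will differ.
field_scores = {
    "apache solr": {"features": 0.4, "name": 0.9},
    "search":      {"features": 0.3, "name": 0.0},
    "server":      {"features": 0.2, "name": 0.5},
}

def dismax_score(scores_by_field):
    """DisjunctionMaxQuery-style score: the best-scoring field wins."""
    return max(scores_by_field.values())

def sum_score(scores_by_field):
    """What summing the same per-field clauses would give instead."""
    return sum(scores_by_field.values())

# The outer BooleanQuery still adds up the per-chunk scores.
total_dismax = sum(dismax_score(s) for s in field_scores.values())
total_sum = sum(sum_score(s) for s in field_scores.values())
print("max per chunk, then summed: %.1f" % total_dismax)
print("sum per chunk, then summed: %.1f" % total_sum)
```

With the max, a chunk that happens to match in both fields contributes only its best field score, so it cannot outweigh the other chunks just by matching in more places.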

  • The name field uses the WordDelimiterFilter at query time but features does not.
  • The features field is configured so that “the” is a stopword, but name is not.

Now let’s look at what we get when our input parameters are structurally similar to what we had before, but just different enough for WordDelimiterFilter and StopFilter to come into play…

defType = dismax
     mm = 50%
     qf = features^2 name^3
      q = +"apache solr" the search-server

Our resulting query is going to be something like…

q = BooleanQuery(
  minNumberShouldMatch => 1,
  booleanClauses => ClauseList(
    MustMatch(DisjunctionMaxQuery(
      PhraseQuery("features","apache solr")^2,
      PhraseQuery("name","apache solr")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("name","the")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("features","search-server")^2,
      PhraseQuery("name","search server")^3))
  ));

The use of WordDelimiterFilter hasn’t changed things very much: features is treating “search-server” as a single Term, while in the name field we are searching for the phrase “search server”. Hopefully this shouldn’t surprise anyone given the use of WordDelimiterFilter for the name field (presumably that’s why it’s being used). This DisjunctionMaxQuery still “makes sense”, but other fields with odd analysis that produce fewer or more Tokens than a “typical” field for the same chunk might produce queries that aren’t as easy to understand. In particular, consider what has happened in our example with the word “the”: because “the” is a stop word in the features field, no Query object is produced for that field/chunk combination. But a Query is produced for the name field, which means the total number of “ShouldMatch” clauses in our top level query is still 2, so our minNumberShouldMatch is still 1 (50% of 2 == 1).

This type of situation tends to confuse a lot of people: since “the” is a stop word in one field, they don’t expect it to matter in the final query — but as long as at least one qf field produces a Token for it (name in our example) it will be included in the final query, and will contribute to the count of “ShouldMatch” clauses.
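
The clause counting described above can be imitated with a couple of toy analyzers. These are crude stand-ins, not Solr’s StopFilter or WordDelimiterFilter: the function names are hypothetical and the analysis rules are simplified to just what this example needs.

```python
import re

def analyze_features(text):
    """Toy 'features' analysis: whitespace tokens, with "the" as a stopword."""
    return [t for t in text.split() if t.lower() != "the"]

def analyze_name(text):
    """Toy 'name' analysis: also splits on hyphens, loosely like WordDelimiterFilter."""
    return [t for t in re.split(r"[\s-]+", text) if t]

ANALYZERS = {"features": analyze_features, "name": analyze_name}

def clauses_for(chunk):
    """Per-field sub-queries for one chunk, skipping fields that yield no tokens."""
    subqueries = []
    for field, analyze in ANALYZERS.items():
        tokens = analyze(chunk)
        if not tokens:
            continue  # e.g. "the" in features: no Query object at all
        kind = "TermQuery" if len(tokens) == 1 else "PhraseQuery"
        joined = " ".join(tokens)
        subqueries.append('{}("{}","{}")'.format(kind, field, joined))
    return subqueries

optional_chunks = ["the", "search-server"]
# A chunk only disappears if *no* field produces a token for it.
should_clauses = [c for c in map(clauses_for, optional_chunks) if c]
for clause in should_clauses:
    print("ShouldMatch(DisjunctionMaxQuery(" + ", ".join(clause) + "))")
print("ShouldMatch clauses:", len(should_clauses))
```

Removing “the” from the name analysis as well (so that every field drops it) is the only way, in this sketch, for the clause count to fall to 1.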

So, what’s the take away from all of this?

DisMax is a complicated creature. When using it, you need to consider all of its options carefully, and look at the debugQuery=true output while experimenting with different query strings and different analysis configurations to make really sure you understand how queries from your users will be parsed.
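
One low-tech way to run those experiments is to generate the request URLs and open them in a browser. The sketch below assumes a Solr instance at a made-up local address (SOLR_SELECT is hypothetical, so adjust it for your own host, port, and core) and reuses the parameters from the examples above.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust to your own host, port, and core.
SOLR_SELECT = "http://localhost:8983/solr/select"

def debug_url(q, qf="features^2 name^3", mm="50%"):
    """Build a dismax request URL with debugQuery=true for inspection."""
    params = {
        "defType": "dismax",
        "qf": qf,
        "mm": mm,
        "q": q,
        "debugQuery": "true",
    }
    return SOLR_SELECT + "?" + urlencode(params)

for q in ['+"apache solr" search server', '+"apache solr" the search-server']:
    print(debug_url(q))
```

Comparing the parsed query in the debug section of each response is usually the quickest way to spot a field whose analysis produces more or fewer clauses than you expected.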

Category: Solr

Tags: dismax, qparser, Solr

9 Responses to “What’s a “DisMax” ?”

  1. Great article hossman! That really cleared some of my confusions.

    Is it possible that you can elaborate on how multi-word synonyms come into play in your example for both query time and index time synonyms? For example, if we have a synonym setup as follows:

    search server ==> search engine

    And q = +"apache solr" search server
    What would the resulting query look like?

    Thanks a lot for your insight!

    August 12, 2010 08:19 — alexw

  2. [...] more succinctly: everyone else seem to join all terms with ‘AND’, whereas we do a DisMax variant on [...]

    August 18, 2010 08:01 — Solr: Forcing items with all query terms to the top of a Solr search » Robot Librarian

  3. [...] 100% of the terms DID match in the title field. For a mor thorough explanation, see one of several available [...]

    September 17, 2010 21:04 — Creating a unified search experience | Engineering Blog | The OpenSky Project

  4. How can I get fuzzy logic working with DisMax with solr 1.4?

    December 2, 2010 07:39 — Dominic Martino

  5. What a great article. I am only starting to acknowledge the importance of fully understanding the mechanisms present within various requestHandler’s prior to moving on to other ‘tools’ within solr such as faceting. This has made me think logically about how to design my Solr implementation.

    Thankyou
    Lewis

    March 19, 2011 05:22 — Lewis John McGibbney

  6. [...] Biases. In my next post, I hope to continue the topic of improving the user experience by using DisMax to add “Field [...]

    June 20, 2011 15:29 — Lucid Imagination » Solr Powered ISFDB – Part #10: Tweaking Relevancy

  7. [...] is a QParser that I’ve written about before. If you want all the gory details, I suggest you read that article, but for now the quick take away [...]

    August 8, 2011 16:02 — Lucid Imagination » Solr Powered ISFDB – Part #11: Using DisMax

  8. [...] in case you are new to Dismax, see: What’s a “Dismax”? from Lucid [...]

    February 9, 2012 13:25 — Using Solr’s Dismax Tie Parameter « Another Word For It

  9. what happens in pf field case?

    March 21, 2012 21:52 — sagar
