<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; Relevancy</title>
	<atom:link href="http://www.lucidimagination.com/blog/category/relevancy/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Flexible ranking in Lucene 4</title>
		<link>http://www.lucidimagination.com/blog/2011/09/12/flexible-ranking-in-lucene-4/</link>
		<comments>http://www.lucidimagination.com/blog/2011/09/12/flexible-ranking-in-lucene-4/#comments</comments>
		<pubDate>Mon, 12 Sep 2011 18:51:40 +0000</pubDate>
		<dc:creator>robert.muir</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Relevancy]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3951</guid>
		<description><![CDATA[<p>Over the summer I served as a <a title="Google Summer of Code" href="http://code.google.com/soc/">Google Summer of Code</a> mentor for David Nemeskey, PhD student at <a title="Eötvös Loránd University" href="http://www.elte.hu/en">Eötvös Loránd University</a>. David proposed to improve Lucene&#8217;s scoring architecture and implement some state-of-the-art ranking models with the new framework.</p>
<p>These improvements are now committed to Lucene&#8217;s trunk: you can use these models in tandem with all of Lucene&#8217;s features (boosts, slops, explanations, etc) and queries (term, phrase, spans, etc). A <a title="SOLR-2754: create Solr similarity factories for new ranking algorithms" href="https://issues.apache.org/jira/browse/SOLR-2754">JIRA issue</a> has been created &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Over the summer I served as a <a title="Google Summer of Code" href="http://code.google.com/soc/">Google Summer of Code</a> mentor for David Nemeskey, PhD student at <a title="Eötvös Loránd University" href="http://www.elte.hu/en">Eötvös Loránd University</a>. David proposed to improve Lucene&#8217;s scoring architecture and implement some state-of-the-art ranking models with the new framework.</p>
<p>These improvements are now committed to Lucene&#8217;s trunk: you can use these models in tandem with all of Lucene&#8217;s features (boosts, slops, explanations, etc) and queries (term, phrase, spans, etc). A <a title="SOLR-2754: create Solr similarity factories for new ranking algorithms" href="https://issues.apache.org/jira/browse/SOLR-2754">JIRA issue</a> has been created to make it easy to use these models from Solr&#8217;s schema.xml.</p>
<p>Relevance ranking is the heart of the search engine, and I hope the additional models and flexibility will improve the user experience for Lucene: whether you&#8217;ve been frustrated with tuning TF/IDF weights and find an alternative model works better for your case, found it difficult to integrate custom logic that your application needs, or just want to experiment.</p>
<p>I&#8217;ll be giving a talk about how you can practically apply some of the upcoming Lucene 4 search features at <a title="Improved Search with Lucene 4" href="http://2011.lucene-eurocon.org/talks/20851">Lucene Eurocon</a> in October, and at the <a title="What's New in Lucene/Solr Meetup" href="http://www.meetup.com/SFBay-Lucene-Solr-Meetup/events/32514312/">SFBay Apache Lucene/Solr Meetup</a> later this month.</p>
<p>Some bullet points of the new scoring features:</p>
<ul>
<li>New ranking algorithms, in addition to Lucene&#8217;s <a title="Vector Space Model" href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model</a>:
<ul>
<li><a title="Okapi BM25" href="http://en.wikipedia.org/wiki/Okapi_BM25">Okapi BM25 Model</a></li>
<li><a title="Language Model" href="http://en.wikipedia.org/wiki/Language_model">Language Models</a></li>
<li><a title="Probability models for information retrieval based on divergence from randomness" href="http://theses.gla.ac.uk/1570/">Divergence from Randomness Models</a></li>
<li><a title="Information-based models for ad hoc IR" href="http://dl.acm.org/citation.cfm?id=1835490">Information-based Models</a></li>
</ul>
</li>
<li>Added key statistics to the index format to support additional scoring models.
<ul>
<li>Term- and field-level statistics for collection frequencies and deriving averages.</li>
<li>Additional document-level statistics for computing normalization factors.</li>
</ul>
</li>
<li>Decoupled <em>matching</em> from <em>ranking</em> in Lucene&#8217;s core search classes:
<ul>
<li>Customize scoring without digging into the &#8220;guts&#8221;.</li>
<li>Customize <em>explanations</em>: essential for <a title="Debugging Search Application Relevance Issues" href="http://www.lucidimagination.com/devzone/technical-articles/debugging-search-application-relevance-issues">debugging relevance issues</a>.</li>
</ul>
</li>
<li>Powerful low-level <em>Similarity</em> API, supporting advanced use cases:
<ul>
<li>Incorporate per-document values from <a title="Heavy Committing: DocValues aka. Column Stride Fields in Lucene 4.0" href="http://www.lucidimagination.com/files/Willnauer%20Simon%20-%20DocValues%20Column%20Stride%20Fields%20in%20Lucene.pdf">Column Stride Fields</a> into the score.</li>
<li>Use different scoring parameters or algorithms for different fields.</li>
<li>Fuse multiple scoring algorithms into a combined score.</li>
</ul>
</li>
<li>Convenient high-level <em>SimilarityBase</em> for everything else:
<ul>
<li>Write your own scoring function in one Java method.</li>
<li>Easy access to available index statistics.</li>
</ul>
</li>
</ul>
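<p>To give a feel for what &#8220;a scoring function in one method&#8221; built from index statistics can look like, here is a minimal sketch of the textbook Okapi BM25 formula for a single term. It is plain Python rather than the Lucene API, it is not the code that was committed to trunk, and every function and parameter name is hypothetical. The defaults <code>k1=1.2</code> and <code>b=0.75</code> are commonly used values, assumed here for illustration.</p>

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one query term in one document.

    freq        -- occurrences of the term in the document
    doc_len     -- length of the document field, in terms
    avg_doc_len -- average field length across the collection
    doc_freq    -- number of documents containing the term
    num_docs    -- total number of documents
    """
    # Rare terms receive a larger inverse document frequency.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: fields longer than average are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: repeated matches add less and less.
    return idf * freq * (k1 + 1) / (freq + norm)
```

<p>The inputs are exactly the kind of term-, field- and document-level statistics mentioned in the list above: a document frequency, a collection size, a field length and an average field length.</p>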
<p>For more information about this GSOC project, take a look at its <a title="Google Summer of Code 2011 - Implementing State of the Art Ranking for Lucene" href="http://wiki.apache.org/lucene-java/SummerOfCode2011ProjectRanking">wiki page</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/09/12/flexible-ranking-in-lucene-4/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Solr Powered ISFDB – Part #10: Tweaking Relevancy</title>
		<link>http://www.lucidimagination.com/blog/2011/06/20/solr-powered-isfdb-part-10/</link>
		<comments>http://www.lucidimagination.com/blog/2011/06/20/solr-powered-isfdb-part-10/#comments</comments>
		<pubDate>Mon, 20 Jun 2011 23:29:32 +0000</pubDate>
		<dc:creator>hossman</dc:creator>
				<category><![CDATA[Relevancy]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[isfdb]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=3696</guid>
		<description><![CDATA[<p>
This is part 10 in a (never ending?) <a href="http://www.lucidimagination.com/blog/tag/isfdb/">series of articles on Indexing and Searching the ISFDB.org data using Solr</a>.
</p>
<p>
Circumstances have conspired to keep me away from this series longer than I had intended, so today I want to jump right in and talk about improving the user experience by improving relevancy.
</p>
]]></description>
			<content:encoded><![CDATA[<p>
This is part 10 in a (never ending?) <a href="http://www.lucidimagination.com/blog/tag/isfdb/">series of articles on Indexing and Searching the ISFDB.org data using Solr</a>.
</p>
<p>
Circumstances have conspired to keep me away from this series longer than I had intended, so today I want to jump right in and talk about improving the user experience by improving relevancy.
</p>
<p><!-- .............................. --></p>
<p>
<i>(If you are interested in following along at home, you can <a href="https://github.com/lucidimagination/isfdb-solr/">check out the code from GitHub</a>.  I’m starting at the <a href="https://github.com/lucidimagination/isfdb-solr/tree/blog_9">blog_9</a> tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the <a href="https://github.com/lucidimagination/isfdb-solr/tree/blog_10">blog_10</a> tag containing the end result of this article.)</i>
</p>
<h3>Academic vs Practical</h3>
<p>
In academia, people who study IR have historically discussed &#8220;relevancy&#8221; in terms of &#8220;<a href="https://secure.wikimedia.org/wikipedia/en/wiki/Precision_and_recall">Precision vs Recall</a>&#8221; (if these terms aren&#8217;t familiar to you, I highly suggest reading the link), but in my experience those kinds of metrics are just the starting point.  While users tend to care that your &#8220;Recall&#8221; is good (results shouldn&#8217;t be missing), &#8220;Precision&#8221; is usually less important than &#8220;ordering&#8221;: most users (understandably) want the &#8220;best&#8221; results to come first, and don&#8217;t care about the total number of results.
</p>
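<p>For readers who want the two metrics spelled out, here is a minimal sketch in Python. The function name and the document ids in the usage note are hypothetical, and the relevant set is assumed to come from human judgments.</p>

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved -- ids of the documents the search returned
    relevant  -- ids of the documents judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved.intersection(relevant))
    # Precision: what fraction of the returned results is relevant?
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the relevant documents was returned?
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

<p>For example, <code>precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])</code> gives a precision of 0.5 and a recall of about 0.67. Neither number says anything about whether &#8220;a&#8221; and &#8220;c&#8221; were listed first or last, which is the ordering concern raised above.</p>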
<p>
Defining the &#8220;best&#8221; results is where things get tricky. Once again, there are lots of great algorithms out there whose pros and cons academics debate all the time, but frequently the best approach you can take to give your users the &#8220;best&#8221; results first isn&#8217;t to get a PhD in IR, it&#8217;s to &#8220;cheat&#8221;: bias the algorithms and apply &#8220;Domain Specific Knowledge&#8221;. But I&#8217;m getting ahead of myself; let&#8217;s start with a real example.
</p>
<h3>Poor Results in Our Domain</h3>
<p>
Every domain is different, and the key to providing a good search experience is making sure you really understand your domain, and how your users (and data) relate to it.
</p>
<p>
Let&#8217;s look at a specific example with our ISFDB Data.  One of the most famous Sci-Fi short stories ever written is <a href="http://www.isfdb.org/cgi-bin/title.cgi?46434">Nightfall</a> by Isaac Asimov, who later collaborated with Robert Silverberg to expand it into <a href="http://www.isfdb.org/cgi-bin/title.cgi?11852">a novel</a>.  If a user (who knows they are searching the ISFDB) searched for the word &#8220;Nightfall&#8221; they would understandably expect one of those two titles to appear fairly high up in the list of results, but that&#8217;s not quite what they get with <a href="http://localhost:8983/solr/browse?q=nightfall">page #1 of our search</a> as it&#8217;s configured right now&#8230;
</p>
<ol>
<li>Title: Nightfall INTERIORART &#8211; Author: Kolliker</li>
<li>Title: Nightfall: Body Snatchers ANTHOLOGY &#8211; Author: uncredited</li>
<li>Title: Cover: Nightfall One COVERART &#8211; Author: Ken Sequin</li>
<li>Title: Cover: Nightfall One COVERART &#8211; Author: Ken Sequin</li>
<li>Title: Nightfall SHORTFICTION &#8211; Author: Tom Chambers</li>
<li>Title: Glossary (Nightfall at Algemron) ESSAY &#8211; Author: uncredited</li>
<li>Title: Cover: Nightfall One COVERART &#8211; Author: Ken Sequin</li>
<li>Title: Cover: Nightfall Two COVERART &#8211; Author: Ken Sequin</li>
<li>Title: Nightfall POEM &#8211; Author: Susan A. Manchester</li>
<li>Title: Cover: Nightfall One COVERART &#8211; Author: Ken Sequin</li>
</ol>
<p>
These results aren&#8217;t terribly surprising since so far in this series we&#8217;ve put no work into relevancy tuning; we&#8217;re just searching a simple &#8220;catchall&#8221; field. Before we can improve the situation, it&#8217;s important to make sure we understand why we&#8217;re getting what we&#8217;re getting and why we&#8217;re <i>not</i> getting what we want.
</p>
<h3>Score Explanations</h3>
<p>
One of the hardest-to-understand features of Solr is &#8220;Score Explanation&#8221;, not because it&#8217;s hard to use, but because the output assumes you understand the core underpinnings of Lucene/Solr scoring.  When we <a href="http://localhost:8983/solr/browse?q=nightfall&amp;debugQuery=true">enable debugging on our query</a> we get new &#8220;toggle explain&#8221; links for each result that let us see the score and a breakdown of how that score was computed, but that doesn&#8217;t let us compare with documents that aren&#8217;t on page #1.  To do that, we use the <a href="http://wiki.apache.org/solr/CommonQueryParameters#explainOther"><code>explainOther</code></a> option, and switch to the XML view since the Velocity templates don&#8217;t currently display <code>explainOther</code> info.  Now we can <a href="http://localhost:8983/solr/browse?q=nightfall&amp;wt=xml&amp;indent=true&amp;debugQuery=true&amp;explainOther=doc_id:TITLE_46434+doc_id:TITLE_11852">compare the explanations</a> between the two docs we really hoped to find, and the top scoring result&#8230;
</p>
<ul>
<li>TITLE_847094 (Nightfall INTERIORART) <br/>
<pre>
2.442217 = (MATCH) fieldWeight(catchall:nightfall in 274241), product of:
  <b>1.0</b> = tf(termFreq(catchall:nightfall)=<b>1</b>)
  9.768868 = idf(docFreq=98, maxDocs=636658)
  <b>0.25</b> = fieldNorm(field=catchall, doc=274241)
</pre>
</li>
<li>TITLE_11852 (Nightfall NOVEL) <br/>
<pre>
1.7269082 = (MATCH) fieldWeight(catchall:nightfall in 11741), product of:
  <b>1.4142135</b> = tf(termFreq(catchall:nightfall)=<b>2</b>)
  9.768868 = idf(docFreq=98, maxDocs=636658)
  <b>0.125</b> = fieldNorm(field=catchall, doc=11741)
</pre>
</li>
<li>
TITLE_46434 (Nightfall SHORTFICTION) <br/>
<pre>
1.7269082 = (MATCH) fieldWeight(catchall:nightfall in 41784), product of:
  <b>1.4142135</b> = tf(termFreq(catchall:nightfall)=<b>2</b>)
  9.768868 = idf(docFreq=98, maxDocs=636658)
  <b>0.125</b> = fieldNorm(field=catchall, doc=41784)
</pre>
</li>
</ul>
<p>
The devil is in the differences, which I&#8217;ve put in bold.  Without going into a lot of complicated explanation, the crux of the issue is that even though the documents we&#8217;re looking for match the word &#8220;nightfall&#8221; twice in the catchall field we&#8217;re searching (and the top scoring result only matches once), that is offset by the &#8220;fieldNorm&#8221;, which reflects the fact that the catchall field is much longer for our &#8220;good&#8221; docs than for our &#8220;bad&#8221; docs.
</p>
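<p>The arithmetic in those explanations can be reproduced by hand. The sketch below assumes the classic Lucene TF/IDF formulas (tf is the square root of the term frequency, idf is <code>1 + ln(maxDocs / (docFreq + 1))</code>) and takes the fieldNorm values straight from the output above. The function name is hypothetical, and this is plain Python rather than anything Solr runs.</p>

```python
import math


def classic_field_weight(term_freq, doc_freq, max_docs, field_norm):
    """fieldWeight = tf * idf * fieldNorm, as in the explain output."""
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))
    return tf * idf * field_norm


# Top scoring result: one match, fieldNorm 0.25.
top_hit = classic_field_weight(1, 98, 636658, 0.25)
# The docs we wanted: two matches, but fieldNorm 0.125.
wanted = classic_field_weight(2, 98, 636658, 0.125)

print(round(top_hit, 4), round(wanted, 4))  # prints 2.4422 1.7269
```

<p>Doubling the term frequency only multiplies the score by about 1.41, while halving the fieldNorm multiplies it by 0.5, so the longer documents lose.</p>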
<h3>Tweaking Our Scoring</h3>
<p>
This is one of those examples where academic theory doesn&#8217;t always match the reality of your domain.  Typically when using the TF/IDF scoring model used in Lucene/Solr, you need a &#8220;length normalization&#8221; factor to offset the common case where a really long document inherently contains more words, so there is a statistical likelihood that the search terms will appear more times.  In a nutshell: all other things being equal, shorter is better.  This reasoning is generally sound, but the default implementation in Lucene/Solr can be a hindrance in a few common cases:
</p>
<ul>
<li>A corpus full of really short documents &#8211; our ISFDB index isn&#8217;t full books, just a bunch of metadata fields</li>
<li>A corpus where longer really is better &#8211; in the ISFDB data, more popular titles/authors tend to have more data, which means the <code>catchall</code> field is naturally longer.</li>
</ul>
<p>
There are some cool things we could do with tweaking the Similarity class to try and improve this, but the simplest thing to start with is to <a href="https://github.com/lucidimagination/isfdb-solr/commit/2fb0258f257a5678592338c6d186f43ea4ef7661"><code>omitNorms</code> on the catchall field</a> to eliminate this factor from our scoring.  With our new schema, we re-index and see some noticeable changes&#8230;
</p>
<ol>
<li>Title: Nightfall NOVEL &#8211; Authors: Robert Silverberg, Isaac Asimov</li>
<li>Title: Nightfall and Other Stories COLLECTION &#8211; Author: Isaac Asimov</li>
<li>Title: Nightfall SHORTFICTION &#8211; Author: Isaac Asimov</li>
<li>Title: The Legend of Nightfall NOVEL &#8211; Author: Mickey Zucker Reichert</li>
<li>Title: Nightfall NOVEL &#8211; Author: John Farris</li>
<li>Title: The Road to Nightfall COLLECTION &#8211; Author: Robert Silverberg</li>
<li>Title: Road to Nightfall SHORTFICTION &#8211; Author: Robert Silverberg</li>
<li>Title: A Tiger at Nightfall SHORTFICTION &#8211; Author: Harlan Ellison</li>
<li>Title: Nightfall SHORTFICTION &#8211; Author: David Weber</li>
<li>Title: Nightfall SHORTFICTION &#8211; Author: Charles Stross</li>
</ol>
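<p>Why omitting norms flips the ordering can be sketched in a few lines. This assumes the classic length norm of <code>1 / sqrt(number of terms)</code> and ignores the fact that Lucene stores the norm in a lossy single byte. The field lengths of 16 and 64 terms are hypothetical, picked only because they yield the norms 0.25 and 0.125 seen in the explanations earlier.</p>

```python
import math

IDF = 9.768868  # idf for "nightfall", taken from the explain output


def length_norm(num_terms):
    """Classic length normalization: shorter fields get a larger norm."""
    return 1.0 / math.sqrt(num_terms)


def score(term_freq, num_terms, omit_norms=False):
    """tf * idf * norm, where omitNorms pins the norm to 1.0."""
    norm = 1.0 if omit_norms else length_norm(num_terms)
    return math.sqrt(term_freq) * IDF * norm


short_doc = dict(term_freq=1, num_terms=16)  # hypothetical: norm 0.25
long_doc = dict(term_freq=2, num_terms=64)   # hypothetical: norm 0.125

print("with norms:   ", score(**short_doc), score(**long_doc))
print("without norms:", score(omit_norms=True, **short_doc),
      score(omit_norms=True, **long_doc))
```

<p>With norms the short document wins; without them only the term frequency separates the two, so the document matching &#8220;nightfall&#8221; twice comes out ahead.</p>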
<h3>Domain Specific Biases</h3>
<p>
Omitting length norms has helped &#8220;level the field&#8221; for our docs, and in this one example it looks like a huge improvement at first glance, but that&#8217;s mainly a fluke.  If you look at the score explanations now we get a lot of identical scores, and the final ordering is primarily because of the order they were indexed in.
</p>
<p>
This is where adding some Domain Specific Bias can be handy.  If we review our schema, we see the <code>views</code> and <code>annualviews</code> fields which correspond to how many page views a given author/title has received (recently) on the ISFDB web site.  By factoring these page view counts into our scoring, we provide some &#8220;Document Biasing&#8221; to ensure that documents which are more popular will &#8220;win&#8221; (i.e., score higher) in the event of a tie on the basic relevancy score.
</p>
<p>
The most straightforward way to bias scoring is with the <a href="http://lucene.apache.org/solr/api/org/apache/solr/search/BoostQParserPlugin.html">BoostQParser</a>, which will multiply the score of a query for each document by an arbitrary function (on that document).  In its simplest form we can use it directly in our <code>q</code> param to multiply the scores by the simple sum of the two &#8220;views&#8221; fields: <code><a href="http://localhost:8983/solr/browse?q={!boost%20b=sum%28views,annualviews%29}nightfall">q={!boost b=sum(views,annualviews)}nightfall</a></code> and now we get a much more interesting ordering&#8230;
</p>
<ol>
<li>Title: Nightfall SHORTFICTION &#8211; Author: Isaac Asimov</li>
<li>Title: Nightfall NOVEL &#8211; Authors: Robert Silverberg, Isaac Asimov</li>
<li>Title: Nightfall and Other Stories COLLECTION &#8211; Author: Isaac Asimov</li>
<li>Title: Nightfall SHORTFICTION &#8211; Author: Charles Stross</li>
<li>Title: The Return: Nightfall NOVEL &#8211; Author: L. J. Smith</li>
<li>Title: Nightfall SHORTFICTION &#8211; Author: Arthur C. Clarke</li>
<li>Title: Nightfall SHORTFICTION &#8211; Author: David Weber</li>
<li>Title: The Road to Nightfall COLLECTION &#8211; Author: Robert Silverberg</li>
<li>Title: The Legend of Nightfall NOVEL &#8211; Author: Mickey Zucker Reichert</li>
<li>Title: Nightfall Revisited ESSAY &#8211; Authors: Pat Murphy, Paul Doherty</li>
</ol>
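<p>The effect of the boost on tied documents can be sketched as follows. The titles, base scores and view counts below are all made up for illustration; only the rule itself (multiply the query score by <code>sum(views,annualviews)</code>) comes from the query above.</p>

```python
def boosted_score(base_score, views, annualviews):
    """Query score multiplied by the sum of the two view counts."""
    return base_score * (views + annualviews)


# (title, base relevancy score, views, annualviews) -- hypothetical data.
docs = [
    ("Obscure cover art", 13.815, 3, 1),
    ("Famous short story", 13.815, 900, 250),
    ("Well known novel", 13.815, 700, 200),
]

# Identical base scores, so popularity alone decides the order.
ranked = sorted(docs, key=lambda d: boosted_score(d[1], d[2], d[3]), reverse=True)
for title, base, views, annualviews in ranked:
    print(title, boosted_score(base, views, annualviews))
```

<p>One consequence of a purely multiplicative boost is worth keeping in mind: a document whose two view counts are both zero ends up with a score of zero, however well it matches.</p>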
<p>
This new ordering for the page #1 results is much more appropriate for the domain of the ISFDB, and represents a general rule of relevancy biasing: <i>&#8220;Users usually want to see the popular stuff.&#8221;</i>  However, users don&#8217;t usually want to have to type things like <code>{!boost b=sum(views,annualviews)}...</code> into the search box, so we need to encapsulate this into our config.  It&#8217;s very easy to do this using <a href="http://wiki.apache.org/solr/LocalParams">Local Params</a>, but unfortunately it does mean changing our &#8220;main&#8221; query param from <code>q</code> to something else.
</p>
<p>
We start by <a href="https://github.com/lucidimagination/isfdb-solr/commit/75f830caa1a11fd97ab48d6428096cf63f53cb3b">changing the <code>defaults</code> and <code>invariants</code> of our request handler</a> so that our boost function is always used as the <code>q</code> param, but it uses a new <code>qq</code> param as the main query (whose score will be multiplied by the function).  This works fine for our default query, but in order to be useful our <a href="https://github.com/lucidimagination/isfdb-solr/commit/59784a8268ff99d6238e122da45655e869bb811b">UI also needs to be changed</a> to know that the <code>qq</code> param is what is now used for the user input.
</p>
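<p>From the client side, the change looks roughly like the sketch below. It assumes the handler pins <code>q</code> to a boost query that pulls the user input in through Solr&#8217;s local-param dereferencing (<code>v=$qq</code>). The exact value used in the linked commit is not reproduced here, and the function names are hypothetical.</p>

```python
from urllib.parse import urlencode

# Assumed invariant "q" on the request handler: boost whatever is in "qq".
BOOST_Q = "{!boost b=sum(views,annualviews) v=$qq}"


def effective_params(user_input):
    """What the handler effectively works with once its invariant is merged in."""
    return {"q": BOOST_Q, "qq": user_input}


def ui_query_string(user_input):
    """What the UI has to send: only the raw user input, as qq."""
    return urlencode({"qq": user_input})


print(ui_query_string("nightfall"))  # prints qq=nightfall
```

<p>The search box never sees the boost syntax; the UI only needs to send <code>qq</code> where it used to send <code>q</code>.</p>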
<h3>Conclusion (For Now)</h3>
<p>
And that wraps up this latest installment with the <a href="https://github.com/lucidimagination/isfdb-solr/tree/blog_10">blog_10</a> tag.  We&#8217;ve dramatically improved the user experience by tweaking how our relevancy scores are computed based on some knowledge of our domain, particularly via Document Biases.  In my next post, I hope to continue the topic of improving the user experience by using <a href="http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/">DisMax</a> to add &#8220;Field Biases&#8221;. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/06/20/solr-powered-isfdb-part-10/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>The high bar for relevancy?</title>
		<link>http://www.lucidimagination.com/blog/2009/10/04/the-high-bar-for-relevancy/</link>
		<comments>http://www.lucidimagination.com/blog/2009/10/04/the-high-bar-for-relevancy/#comments</comments>
		<pubDate>Mon, 05 Oct 2009 02:25:23 +0000</pubDate>
		<dc:creator>David M. Fishman</dc:creator>
				<category><![CDATA[Enterprise Search]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Relevancy]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1205</guid>
		<description><![CDATA[<p>A big chunk of the billions that go to search-engine marketing and search engine optimization, SEM and SEO, (mostly to you-know-who) are spent on getting to Page 1 of the results. </p>
<p>I won&#8217;t be the first to point out that relevance for in-house search &#8212; i.e., without using <a href="http://en.wikipedia.org/wiki/Pagerank">Pagerank</a> &#8212; is a harder nut to crack. How much harder? A recent study from Aberdeen Group, publicized this week in <a href="http://www.informationweek.com/news/internet/search/showArticle.jhtml?articleID=220300901&#038;pgno=1&#038;queryText=&#038;isPrev=">Information Week</a>, provides the following &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>A big chunk of the billions that go to search-engine marketing and search engine optimization, SEM and SEO, (mostly to you-know-who) are spent on getting to Page 1 of the results. </p>
<p>I won&#8217;t be the first to point out that relevance for in-house search &#8212; i.e., without using <a href="http://en.wikipedia.org/wiki/Pagerank">Pagerank</a> &#8212; is a harder nut to crack. How much harder? A recent study from Aberdeen Group, publicized this week in <a href="http://www.informationweek.com/news/internet/search/showArticle.jhtml?articleID=220300901&#038;pgno=1&#038;queryText=&#038;isPrev=">Information Week</a>, provides the following stat: </p>
<blockquote><p>
At top performing companies [defined as "Best in Class", the top 20% of those surveyed], 67% of searches returned the most relevant results on the first search results page, while lower rated companies saw relevant results on the first page for only 42% of searches. </p></blockquote>
<p>Even <em>at best</em>, 1 out of 3 searches doesn&#8217;t deliver the right result on the first results page. In other words, the best case for search=find is 67%. </p>
<p>Relevancy is as much art as science; the best solutions for the problem are the ones that provide a way to match the art to the science. If you need some background on relevancy, read the <a href="http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr">seminal article on relevancy and findability by Grant Ingersoll</a>, and check out the fine presentation on the subject by Mark Bennett of <a href="http://www.ideaeng.com/">New Idea Engineering</a> delivered at the most recent <a href="http://www.meetup.com/SFBay-Lucene-Solr-Meetup/">SFBay Lucene/Solr Meetup</a> we co-sponsored with the Computer History Museum in early September. </p>
<p>One of the best implementations of findability in Lucene and Solr I&#8217;ve come across is at <a href="http://www.netflix.com">Netflix</a>. There&#8217;s a really nice discussion captured in some <a href="http://www.lucidimagination.com/Community/Marketplace/Meetups">slides</a> by <a href="http://wunder.best.vwh.net/">Walter Underwood</a>, who helped build the Solr search infrastructure at Netflix (a milestone in a very distinguished career in search). He gave a terrific presentation at that same most recent <a href="http://www.meetup.com/SFBay-Lucene-Solr-Meetup/">Meetup</a>.</p>
<p>A key metric Walter used at Netflix to gauge finding (search relevancy effectiveness is such a mouthful) is called Mean Reciprocal Rank, or MRR. Simply put, it gives one point for a click through to the first-ranked item, 1/2 a point to the second-ranked item, 1/3 of a point to the third-ranked, etc. While it may not help find relevancy bugs, it provides a very nice aggregate picture of users&#8217; experience finding what they look for. A good benchmark, or stretch goal, according to Walter: 0.5 MRR, with 85%  of clicks on #1. </p>
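<p>As a rough illustration of the metric, here is a minimal sketch in Python. The function name is hypothetical, and treating a search with no click as zero points is an assumption of this sketch, not something taken from Walter&#8217;s slides.</p>

```python
def mean_reciprocal_rank(clicked_ranks):
    """Average of 1/rank over all searches.

    clicked_ranks -- one entry per search: the 1-based rank of the
                     result that was clicked, or None for no click
                     (assumed here to contribute zero points).
    """
    if not clicked_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in clicked_ranks if rank is not None)
    return total / len(clicked_ranks)
```

<p>Three searches with clicks on results #1, #2 and #3 give <code>(1 + 1/2 + 1/3) / 3</code>, an MRR of about 0.61.</p>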
<p>Let me be quick to say that there is much that is unique about the Netflix search use case (and much that is really, really fun). But the contrast between 85% of results selected at #1 in the results ranking, vs. 2/3 of results <em>on the first page</em> at best in class enterprise search implementations, leads me to wonder: what are others doing to measure relevancy and programmatically build feedback loops, automatic or otherwise? Lucene and Solr provide transparent, rich interfaces for doing this; and according to the Aberdeen study, there&#8217;s plenty of opportunity to do so. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/10/04/the-high-bar-for-relevancy/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

