<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; Grant Ingersoll</title>
	<atom:link href="http://www.lucidimagination.com/blog/tag/grant-ingersoll/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Mahout in Action Review</title>
		<link>http://www.lucidimagination.com/blog/2011/10/15/mahout-in-action-review/</link>
		<comments>http://www.lucidimagination.com/blog/2011/10/15/mahout-in-action-review/#comments</comments>
		<pubDate>Sat, 15 Oct 2011 13:13:18 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Grant Ingersoll]]></category>
		<category><![CDATA[Mahout in Action]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4317</guid>
		<description><![CDATA[<p>You know your (technical) <a href="https://cwiki.apache.org/confluence/display/MAHOUT/MahoutName">baby</a> is (almost) grown up when the book on the project finally comes out.  Such is the case for Apache Mahout, thanks to <a href="http://www.manning.com">Manning Publications</a> shipping <a href="http://affiliate.manning.com/idevaffiliate.php?id=1141_219">Mahout in Action</a> this week.</p>
<p><img src="http://manning.com/owen/owen_cover150.jpg" alt="" width="150" height="187" class="alignright" float="right" />So, before I start into my review, let me first say congratulations to Sean, Robin, Ted, Ellen and Manning for producing such an excellent product.   The simplest praise I can give it is to put it on the same &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>You know your (technical) <a href="https://cwiki.apache.org/confluence/display/MAHOUT/MahoutName">baby</a> is (almost) grown up when the book on the project finally comes out.  Such is the case for Apache Mahout, thanks to <a href="http://www.manning.com">Manning Publications</a> shipping <a href="http://affiliate.manning.com/idevaffiliate.php?id=1141_219">Mahout in Action</a> this week.</p>
<p><img src="http://manning.com/owen/owen_cover150.jpg" alt="" width="150" height="187" class="alignright" float="right" />So, before I start into my review, let me first say congratulations to Sean, Robin, Ted, Ellen and Manning for producing such an excellent product.   The simplest praise I can give it is to put it on the same level as one of the best intro to technology books I know:  <a href="http://www.manning.com/affiliate/idevaffiliate.php?id=1071_147">Lucene In Action</a>.  In other words, it sets the standard by which all other Mahout books will be judged.<br />
<br />
As for the actual book, it is broken down into 3 sections, which I like to call the &#8220;three C&#8217;s&#8221;:</p>
<ol>
<li>Collaborative Filtering</li>
<li>Clustering</li>
<li>Classification</li>
</ol>
<p>So, without further ado, let&#8217;s take a deeper look at the book in this context of the three C&#8217;s.</p>
<h2>Collaborative Filtering</h2>
<p>Collaborative Filtering is one of the most popular parts of Mahout, being used in places like <a href="https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout">Amazon and Foursquare</a>, and this section of the book, via 5 chapters, walks you nicely through both the concepts and the practical aspects of collaborative filtering.   Chapter 2 starts by getting you up and running using the <a href="http://www.grouplens.org/">GroupLens</a> dataset for movie recommendations.  For those unfamiliar with collaborative filtering, this makes for a nice entrance into the subject with data everyone can relate to easily.  Chapter 3 then discusses how to best model your data, while chapter 4 looks at the mechanics of actually generating recommendations from the data. </p>
<p>Chapters 5 and 6 then discuss the ins and outs of taking a recommendation engine into production, including details on how to scale it out using Apache Hadoop.  I found the explanation of the Hadoop based co-occurrence process (via RecommenderJob) especially useful, as I recently committed <a href="https://issues.apache.org/jira/browse/MAHOUT-798">MAHOUT-798</a>, which uses it to build an example recommendation system based on user interaction with email.  In fact, I relied heavily on all of the concepts in this part of the book, as I first had to extract and clean the data, then properly model it before finally running the recommendation task on EC2.</p>
<p>When I first got access to the MEAP for this book (quite some time ago), I did not have a lot of background in collaborative filtering, and these chapters really helped fill in the practical details for me as well as providing a good foundation for the theoretical aspects behind collaborative filtering.  I think this will serve others well who are looking to get started with collaborative filtering, too.</p>
<h2>Clustering</h2>
<p>Similar to collaborative filtering, the clustering section starts off by introducing the basic concepts and then quickly gets you up and running with an example clustering run.  Chapter 8 then gets into how best to do feature selection for clustering.  Feature selection is often one of the keys to successful clustering, so make sure you have a good grasp on the contents of the chapter before moving ahead into chapter 9, which gets into some of Mahout&#8217;s clustering algorithms.  That chapter primarily focuses on K-Means and Dirichlet, but also covers a few others.  Note, Mahout actually has a few other algorithms for clustering than the ones described, like spectral, canopy, meanshift and minhash.  Of course, some of these were added later in the book cycle, so it is hard to complain that they weren&#8217;t incorporated. Chapter 10 then covers, in my experience, one of the harder aspects of clustering, namely how to evaluate the results.  This chapter is a little bit thin, but it seems the overall field is the same, so this is not a put down on the chapter!  There simply aren&#8217;t a lot of great tools available for evaluating clustering.</p>
<p>Chapter 11 then adds some meat onto the bones of taking clustering to production, including information on leveraging clustering in a Hadoop cluster.  Chapter 12 adds some nice concreteness to the sections by looking at clustering of real data sets from <a href="http://www.twitter.com">Twitter</a>, <a href="http://last.fm">Last.fm</a> and <a href="http://www.stackoverflow.com">Stack Overflow</a>.  For those looking to kick the tires with some real data, be sure to check out that chapter.</p>
<h2>Classification</h2>
<p>Classification is very popular these days both in search and beyond, so it is great to see this set of chapters covering the topic so well in practical, accessible terms.  As you would expect, the first chapter (13) gets you up and running as well as introduces the concepts of classification.  This chapter has a great explanation of how classification works and a typical workflow for building a classifier.</p>
<p>Chapter 14 then delves into the details of actually training a classifier using Mahout&#8217;s Stochastic Gradient Descent algorithm as well as its Bayesian classifier.</p>
<p><img src="http://1.bp.blogspot.com/_t0NJvKaO1dI/SjXBGm0DCpI/AAAAAAAAD1M/ISwdVEi7dt4/s400/potatosalad.jpg" alt="" width="243" height="320" class="alignright" float="right" />The next chapter then takes a look at how best to evaluate a classifier as well as some insight into what happens when a classifier goes bad.  Be sure to check this out, as you will no doubt run into many of the issues covered.  As an aside, I couldn&#8217;t help thinking of the classic &#8220;Far Side&#8221; cartoon to the right upon reading that section heading.  The penultimate classification chapter digs into the practical aspects of deploying a classifier in production, including details on working through your scale and speed requirements.  It finishes off with an example Apache Thrift based server which some may find a useful starting point for their applications.  Finally, Mahout in Action finishes off with a Case Study of how <a href="http://www.shopittome.com">Shop It To Me</a> uses a Mahout classifier to provide recommendations of offers to customers.  As with any technical book, it is great to have some concrete discussion of how this stuff functions in the wild.</p>
<h2>What&#8217;s Missing (i.e. When&#8217;s the 2nd edition coming out?)</h2>
<p>Mahout has a number of other interesting things that are in various stages of development like frequent patternset mining, Singular Value Decomposition (feature reduction), evolutionary programming, integration libraries for input/output as well as tools for storing data in Cassandra and Mongo.  Since Mahout is developing pretty quickly, the lack of this being in the book is no fault of the authors; I&#8217;m just putting it up here so that people are aware that Mahout has more to offer, even if the three &#8220;C&#8217;s&#8221; are the most popular.</p>
<p>All in all, Mahout in Action is an excellent introduction to the project.  Naturally I&#8217;m biased, but, pun intended, I highly recommend the book!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/10/15/mahout-in-action-review/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Estimating Memory and Storage for Lucene/Solr</title>
		<link>http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/</link>
		<comments>http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/#comments</comments>
		<pubDate>Wed, 14 Sep 2011 17:27:00 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[disk usage]]></category>
		<category><![CDATA[Grant Ingersoll]]></category>
		<category><![CDATA[memory]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4000</guid>
		<description><![CDATA[<p>Many times, clients ask us to help them estimate memory usage or disk space usage or to share benchmarks as they build out their search system. Doing so is always an interesting process, as I&#8217;ve always been wary of claims about benchmarks (for instance, one of the old tricks of performance benchmark hacking is to &#8220;cat XXX &#62; /dev/null&#8221; to load everything into memory first, which isn&#8217;t what most people do when running their system) &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Many times, clients ask us to help them estimate memory usage or disk space usage or to share benchmarks as they build out their search system. Doing so is always an interesting process, as I&#8217;ve always been wary of claims about benchmarks (for instance, one of the old tricks of performance benchmark hacking is to &#8220;cat XXX &gt; /dev/null&#8221; to load everything into memory first, which isn&#8217;t what most people do when running their system) and/or estimates, because I know there are so many variables involved that it is possible to vary the results quite significantly depending on marketing goals. Thus, I tend to be pragmatic (as I think the Lucene/Solr community is as well) and focus on what my tests show for my specific data and my specific use cases.</p>
<p>For instance, for testing memory, it&#8217;s pretty easy to set up a series of tests that start with a small heap size and successively grow it until no Out Of Memory Errors (OOME) occur. Then, to be on the safe side, add 1 GB of memory to the heap.  It works for the large majority of people. Ironically, for Solr at least, this usually ends up with a heap size somewhere between 6-12 GB for a system doing &#8220;consumer search&#8221; with faceting, etc. and reasonably sized caches on an index in the 10-50 million docs range. Sure, there are systems that go beyond this or are significantly less (I just saw one the other day that has around 200M docs in less than 3 GB of RAM while handling decent search load), but the 6-12 GB seems to be a nice sweet spot for the application and the JVM, especially when it comes to garbage collection, while still giving the operating system enough room to do its job.  Too much heap and garbage may pile up and give you <em>ohmygodstoptheworld</em> full garbage collections at some point in the future.  Too little heap and you get the dreaded OOME.  Also, too much heap relative to total RAM and you choke off the OS.  Besides, that range also has a nice business/operations side effect in that 16 GB of RAM has a nice performance/cost benefit for many people.</p>
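<p>The heap sizing procedure above can be sketched as a small loop.  This is only an illustration of the idea, not a tool that ships with Lucene or Solr: the <code>passes_load_test</code> callback is a hypothetical stand-in for whatever you do to restart the JVM with a given heap size and replay your indexing and query load, and the default limits are assumptions.</p>

```python
def find_heap_size_gb(passes_load_test, start_gb=1, step_gb=1, max_gb=32, safety_gb=1):
    """Grow the heap until the load test survives, then add a safety margin.

    passes_load_test is a hypothetical callback: it takes a heap size in GB
    and returns True when the test run finished without an OutOfMemoryError.
    """
    for heap_gb in range(start_gb, max_gb + 1, step_gb):
        if passes_load_test(heap_gb):
            # First size with no OOME, plus the extra 1 GB "to be on the safe side".
            return heap_gb + safety_gb
    raise RuntimeError("load test still fails at %d GB of heap" % max_gb)


if __name__ == "__main__":
    # Stand-in for a real test harness: pretend the system needs at least 7 GB.
    print(find_heap_size_gb(lambda heap_gb: heap_gb >= 7))
```

<p>In practice each call to the callback is a full test run against your own data and queries, so a coarser step size may be worth the lost precision.</p>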
<p>Recently, however, I thought it would be good to get beyond the inherent hand waving above and attempt to come up with a theoretical (with a little bit of empiricism thrown in) model for estimating memory usage and disk space.   After a few discussions on <a href="http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2011-09-13 for assumptions">IRC</a> with McCandless and others, I put together a <span style="text-decoration: underline;"><strong>DRAFT</strong></span> <a href="http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls">Excel spreadsheet</a> that allows people to model both memory and disk space (based on the formula in <a href="http://www.lucidimagination.com/devzone/references/books-and-publications">Lucene in Action 2nd ed.</a> &#8211; LIA2), after filling in some assumptions about their applications (I put in defaults.)   First a few caveats:</p>
<ol>
<li>This is just an estimate; don&#8217;t mistake it for what you are actually seeing in your system.</li>
<li>It is a DRAFT.  It is likely missing a few things, but I am putting it up here and in Subversion as a means to gather feedback.  I reserve the right to have messed up the calculations.</li>
<li>I feel the values might be a little bit low for the memory estimator, especially the Lucene section.</li>
<li>It&#8217;s only good for <a href="http://svn.apache.org/repos/asf/lucene/dev/trunk">trunk</a>.  I don&#8217;t think it will be right for 3.3 or 3.4.</li>
<li>The goal is to try to establish a model for the &#8220;high water mark&#8221; of memory and disk, not necessarily the typical case.</li>
<li>It inherently assumes you are searching and indexing on the same machine, which is often not the case.</li>
<li>There are still a couple of TODOs in the model.  More to come later.</li>
</ol>
<p>As for using the memory estimator, the primary things to fill in are the number of documents, number of unique terms and information on sorting and indexed fields, but you can also mess with all of the other assumptions.  For Solr, there are entries for estimating cache memory usage.  Keep in mind that the assumption for caching is that the caches are full, which often is not the case and not even feasible.  For instance, your system may only ever employ 5 or 6 different filters.</p>
<p>The disk space estimator is much more straightforward and based on LIA2&#8217;s fairly simple formula of:</p>
<blockquote>
<div>disk space used(original) = 1/3 original for each indexed field + 1 * original for stored + 2 * original per field with term vectors</div>
</blockquote>
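<p>As a worked example of the formula above, here is a minimal sketch in Python.  The function name, the parameter names and the sample sizes are all made up for illustration; the only thing taken from the formula is the weighting of 1/3 for indexed content, 1 for stored content and 2 for content with term vectors.</p>

```python
def estimate_disk_space(indexed_size, stored_size, term_vector_size, indexed_ratio=1.0 / 3):
    """Estimate index size from the original size of the content.

    Each argument is the original (raw) size of the content that is indexed,
    stored, or indexed with term vectors, all in the same unit.
    indexed_ratio defaults to the 1/3 used in the formula.
    """
    return indexed_ratio * indexed_size + 1 * stored_size + 2 * term_vector_size


if __name__ == "__main__":
    # Hypothetical collection: 30 GB indexed, 10 GB of it stored,
    # 5 GB of it with term vectors.
    print(estimate_disk_space(30, 10, 5))
```

<p>With those sample numbers each of the three terms contributes roughly 10 GB, for an estimate of about 30 GB.  Passing a different <code>indexed_ratio</code> lets you model applications where the indexed fields take up less than a third of the original size.</p>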
<p>It will be interesting to see how some of the new flexible indexing capabilities in trunk affect the results of this equation.  Also note, I&#8217;ve seen some applications where the size of the indexed fields is as low as 20%.</p>
<p>Hopefully, people will find this useful as well as enhance it and <a href="https://issues.apache.org/jira/browse/LUCENE-3435">fix any bugs</a> in it.  In other words, feedback is welcome.  As with any model like this, YMMV!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Implementing the Ecommerce Checklist with Apache Solr and LucidWorks</title>
		<link>http://www.lucidimagination.com/blog/2011/01/25/implementing-the-ecommerce-checklist-with-apache-solr-and-lucidworks/</link>
		<comments>http://www.lucidimagination.com/blog/2011/01/25/implementing-the-ecommerce-checklist-with-apache-solr-and-lucidworks/#comments</comments>
		<pubDate>Tue, 25 Jan 2011 15:24:12 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[ecommerce]]></category>
		<category><![CDATA[LucidWorks]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Grant Ingersoll]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=2351</guid>
		<description><![CDATA[<h1>Introduction</h1>
<p>During a past <a href="http://www.lucidimagination.com/blog/2010/04/06/webinar-e-commerce/">ecommerce webinar</a> with Brian Doll of <a href="http://www.sheetmusicplus.com">Sheetmusicplus.com</a>,  I posted a checklist of items that are commonly occurring in many  ecommerce applications and then I waved my hands, due to time  constraints, and said Solr (and now <a href="http://www.lucidimagination.com/lwe/download">LucidWorks</a>) can do almost all of them out of the box and  left the rest as an exercise for the reader.  (Note, the slides are  available <a href="http://www.lucidimagination.com/files/file/Lucid-Sheetmusic-Solr-ECommercePerformance.pdf">here</a>.   Registration required.)  Well, now I &#8230;</p>]]></description>
			<content:encoded><![CDATA[<h1>Introduction</h1>
<p>During a past <a href="http://www.lucidimagination.com/blog/2010/04/06/webinar-e-commerce/">ecommerce webinar</a> with Brian Doll of <a href="http://www.sheetmusicplus.com">Sheetmusicplus.com</a>,  I posted a checklist of items that are commonly occurring in many  ecommerce applications and then I waved my hands, due to time  constraints, and said Solr (and now <a href="http://www.lucidimagination.com/lwe/download">LucidWorks</a>) can do almost all of them out of the box and  left the rest as an exercise for the reader.  (Note, the slides are  available <a href="http://www.lucidimagination.com/files/file/Lucid-Sheetmusic-Solr-ECommercePerformance.pdf">here</a>.   Registration required.)  Well, now I have some time, so let me fill in  the blanks with some more concrete examples about how to do this.</p>
<h1>Setup</h1>
<p>For this example, I am using real estate data freely available from the <a href="http://www.nyc.gov/html/dof/html/property/property_val_sales.shtml">NYC government</a>.  The reason I am interested in this data is that it is:</p>
<ol>
<li>Free.</li>
<li>It has product-like data in it, as in: name, description, a bunch of metadata and price</li>
<li>It&#8217;s mostly real (I embellished it with descriptions and a few other  pieces and filled in some missing pieces of data, see the Indexer class  in the source code.)  In fact, it&#8217;s so real, that when setting up the  app, one quickly sees how noisy the data is in terms of things like  missing values, etc.  For instance, 1804 records don&#8217;t have the year  built specified.</li>
</ol>
<p>I have set up a Solr schema for this data as well as some tools for indexing the data.  To run the demo, you will need:</p>
<ol>
<li>Java 1.6</li>
<li>Ant 1.7.X</li>
<li>Download <a href="http://www.lucidimagination.com/blog/wp-content/uploads/2011/01/ecommerce.zip">ecommerce.zip</a> (73 MB)</li>
</ol>
<p>Once you have the prerequisites in place, take the following steps:</p>
<ol>
<li>Unzip the ecommerce.zip file into the directory of your choice</li>
<li>cd lucid_ecom</li>
<li>In a separate terminal window: cd solr
<ol>
<li>java -jar start.jar (just as if you were running the Solr tutorial.   Note, I am running a relatively recent version of the Solr 3.x branch)</li>
</ol>
</li>
<li>Point your web browser at http://localhost:8983/solr/nyc and take a moment to familiarize yourself with the interface.</li>
</ol>
<p>Once you have completed step 4, you should see something like:</p>
<p><a href="http://www.lucidimagination.com/blog/wp-content/uploads/2010/04/lucid_real_estate.png"><img class="size-full wp-image-2299 alignnone" title="Lucid Real Estate Screenshot" src="http://www.lucidimagination.com/blog/wp-content/uploads/2010/04/lucid_real_estate.png" alt="Lucid Real Estate Screenshot" width="936" height="430" /></a></p>
<p>(NOTE: I&#8217;m not a graphic designer.  I tried to create a reasonable UI  w/o spending a ton of time on every last piece of it.  Also, I used the <a href="http://wiki.apache.org/solr/VelocityResponseWriter"> VelocityResponseWriter</a> built into Solr.  It&#8217;s nice for prototyping, but  it &#8220;ain&#8217;t&#8221; for production use.)</p>
<p>A pre-built index is included in the Zip file, but if you wish to index it yourself, run:</p>
<ol>
<li>ant delete-all (deletes the existing content)</li>
<li>ant index</li>
</ol>
<p>With the working application in place, let&#8217;s take a look at how to implement the various checklist items.</p>
<h1>Implementing the Checklist</h1>
<p>I&#8217;ve broken out each checklist item below and will cover each of them in more detail in the following subsections.</p>
<h2>Keyword search</h2>
<p>There really isn&#8217;t much to be said here other than Solr has built in  support for querying in all the &#8220;usual&#8221; ways that one would expect out  of a search engine.  Keywords, phrases, wildcards, fielded search and  much, much more.  For example, try:</p>
<ol>
<li><a href="http://localhost:8983/solr/nyc?q=tottenville">http://localhost:8983/solr/nyc?q=tottenville</a> or just type tottenville in the search box.</li>
<li><a href="http://localhost:8983/solr/nyc?q=5+bedrooms+%22Staten+Island%22">http://localhost:8983/solr/nyc?q=5+bedrooms+%22Staten+Island%22</a> (5 bedrooms &#8220;Staten Island&#8221;)</li>
<li><a href="http://localhost:8983/solr/nyc?q=5+bedrooms+borough_display%3ABro*">http://localhost:8983/solr/nyc?q=5+bedrooms+borough_display%3ABro*</a> (5 bedrooms borough_display:Bro* &#8212; Should match all 5 bedrooms in either the Bronx or Brooklyn)</li>
</ol>
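<p>If you would rather script these requests than type them into a browser, the URL encoding can be left to a library.  The sketch below uses Python&#8217;s standard <code>urllib.parse</code> module; the <code>search_url</code> helper is a made up name, and the base URL is the demo address used above.</p>

```python
from urllib.parse import urlencode

# Address of the demo application from the setup steps above.
BASE_URL = "http://localhost:8983/solr/nyc"


def search_url(query):
    """Return the demo search URL for a raw user query string."""
    return BASE_URL + "?" + urlencode({"q": query})


if __name__ == "__main__":
    print(search_url("tottenville"))
    print(search_url('5 bedrooms "Staten Island"'))
    print(search_url("5 bedrooms borough_display:Bro*"))
```

<p>Note that <code>urlencode</code> also percent-encodes the trailing <code>*</code> as <code>%2A</code>, so the last URL looks slightly different from the one listed above even though it decodes to the same query.</p>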
<p>Take some time and try out your own queries.  In our example, we are using the <a href="http://wiki.apache.org/solr/DisMaxQParserPlugin">extended Dismax Query Parser</a>, in case you want to learn more about how it works.</p>
<h2>High Quality relevance (precision @ &lt; 10)</h2>
<p>In many search applications, and ecommerce is no exception, users  often abandon searches when the first page of results (often the top 10)  is not relevant to their query.  Thus, it is important that a search  engine return good results on the first page.  While some guidance (more  on this in the coming sections) can help alleviate the abandonment  problem, a strong first showing is often the quickest way to more  clickthroughs.  Since Solr utilizes Lucene, which implements an industry  standard vector space approach to search, results are often quite good  out of the box.  Nevertheless, many ecommerce applications may need one  or more of the tools that Solr/Lucene provide out of the box to tweak  relevance, such as:</p>
<ol>
<li>Document, field, token boosting (i.e. matches in the title field are more important than matches in the description.)</li>
<li>Query term boosting (provide weights for different terms, such as synonyms.)</li>
<li>Disjunction Maximum Query scoring (aka the &#8220;dismax&#8221; parser or the extended dismax parser) for dealing with cross field matches.</li>
<li>Automatic phrase generation from multiword queries even when the user did not explicitly quote the keywords.</li>
<li>The ability to override low-level scoring information such as term  frequency, document frequency, document length normalization and  coordination factors.</li>
<li>Function queries (more later) to allow values in fields (such as price) to be factors in scoring.</li>
<li>Editorial Boosting/Sponsored Results (in Solr-speak it&#8217;s called the  QueryElevationComponent &#8212; more later) to place specific results at the  top.</li>
</ol>
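<p>To make a few of the items above concrete, here is a rough sketch of the request parameters involved, written in Python.  The parameter names (<code>defType</code>, <code>qf</code>, <code>pf</code>, <code>bf</code>) and the <code>recip</code> function are standard Solr; the field names <code>name</code>, <code>description</code> and <code>price</code> and all of the boost values are assumptions for illustration, not what the demo schema actually uses.</p>

```python
from urllib.parse import urlencode


def relevance_params(user_query):
    """Build illustrative edismax parameters for a user query."""
    return {
        "q": user_query,
        # Use the extended dismax parser for cross field matching.
        "defType": "edismax",
        # Field boosting: matches in name count more than in description.
        "qf": "name^10 description^2",
        # Phrase boosting: reward documents where the terms appear together.
        "pf": "name^20",
        # Function query: let the value of price contribute to the score.
        "bf": "recip(price,1,1000,1000)",
    }


if __name__ == "__main__":
    print("http://localhost:8983/solr/nyc?" + urlencode(relevance_params("5 bedrooms")))
```

<p>Treat the numbers as placeholders.  Whatever values you start with, check them against a broad set of real queries rather than a single problem query.</p>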
<p>Relevance tuning is a complex subject and one that is best viewed in  the light of your data.  In summary, make sure you are making decisions  about relevancy based on the big picture and try to avoid any local  minima (i.e. tuning a specific query to the detriment of lots  of other queries.)  In other words, make sure your top money making  queries aren&#8217;t affected by you &#8220;fixing&#8221; one or two bad queries.  To  learn more, see my articles on <a href="http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr">Improving Findability</a> and <a href="http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwww.lucidimagination.com%2FCommunity%2FHear-from-the-Experts%2FArticles%2FDebugging-Relevance-Issues-Search">Debugging Relevance</a>.  With the basics out of the way, it&#8217;s time to take a look at faceting and discovery tools.</p>
<h2>Faceting/Discovery</h2>
<p>One of Solr&#8217;s most appealing features is its out of the box support  for faceting (sometimes called navigators, parametric search, guided  navigation) in a number of different ways (see <a href="http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr">http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr</a> for a primer.  Also see <a href="http://wiki.apache.org/solr/SimpleFacetParameters">http://wiki.apache.org/solr/SimpleFacetParameters</a>)   In the example application, the left hand nav area shows facets for  things like borough (field based faceting), sale price (numeric range  faceting), sale date (date range faceting) and pet friendly (facet by  query).   Solr also supports &#8220;multi-select&#8221; faceting (see <a href="http://search.lucidimagination.com">http://search.lucidimagination.com</a> for an example.)  And, while there isn&#8217;t support for true hierarchical  faceting in Solr yet, there are ways to achieve it through intelligent  modeling of your tokens.  Last, but not least, you may find <a href="https://issues.apache.org/jira/browse/SOLR-792">https://issues.apache.org/jira/browse/SOLR-792</a> useful for doing grouped faceting (color: red, size: large).</p>
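<p>For a feel of what the faceting requests look like, here is a small Python sketch that builds a query with field, range and query facets.  The <code>facet.*</code> parameter names are standard Solr; the field names <code>borough</code>, <code>sale_price</code> and <code>pet_friendly</code> and the range bounds are assumptions based on the facets described above, so check schema.xml for the real names.</p>

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/nyc"


def facet_url(query):
    """Return a search URL that also asks for three kinds of facets."""
    params = [
        ("q", query),
        ("facet", "true"),
        # Field based faceting (field name is an assumption).
        ("facet.field", "borough"),
        # Numeric range faceting with illustrative bounds.
        ("facet.range", "sale_price"),
        ("facet.range.start", "0"),
        ("facet.range.end", "1000000"),
        ("facet.range.gap", "250000"),
        # Facet by query (field name is an assumption).
        ("facet.query", "pet_friendly:true"),
    ]
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(facet_url("tottenville"))
```

<p>A list of pairs is used instead of a dictionary because Solr allows parameters such as <code>facet.field</code> and <code>facet.query</code> to be repeated, once per facet.</p>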
<p>Additionally, helping customers discover items of interest goes well  beyond facets.  Features like Did you mean, Related Items/Searches,  Collaborative Filtering/Recommenders (see Mahout for an open source  solution), Auto Suggest and others can go a long way in increasing the  user&#8217;s ability to purchase items from your store.  Many of these  features I&#8217;ll cover below.</p>
<h2>Flexible language analysis tools</h2>
<div>Lucene and Solr have an extensive, open language analysis framework  that makes it easy to do linguistic analysis.  I won&#8217;t spend too much  time here, as you can have a look at the included schema.xml for  information on the various pieces I used.  Also, have a look at the <a href="http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters">Solr wiki</a> for more info.  Suffice it to say, Solr has many tokenizers, stemmers  and other token modification capabilities.  In many cases, a good search  system will use a variety of techniques (case changes, stemming,  synonyms, etc.) to achieve the desired results.  It is also often useful  to build up a list of protected words for things like product names so  that they do not get confused with other words that share a common  root.  Finally, keep in mind that of all the extension points to Lucene  and Solr, writing your own TokenFilter is one of the easiest things you  can do to extend the capabilities of your application.</div>
<h2>Multilingual support</h2>
<p>Solr contains support for most of the commonly spoken languages in  the world, including English, Chinese, French, Spanish, Korean, German,  Thai and many more.  Lucene and Solr are also Unicode compliant.</p>
<h2>Frequent Incremental Updates</h2>
<p>Lucene, and thus Solr, has supported incremental updates from its  inception without the need to re-index the whole collection.  It is also  very fast at making new documents available for search.  Additionally,  with the combination of recent and upcoming work in Lucene, real time  search should be available soon.  The one piece that is still missing is  individual field update, but for certain types of fields (ratings, for  instance), there may be easy workarounds.</p>
<h2>Ratings and Reviews</h2>
<p>In working with many ecommerce customers on Solr, there are usually  questions around how to incorporate ratings and reviews into search  results without skewing results or introducing too much noise.   On the  ratings side, app developers often want to incorporate the aggregate  rating of an item as a boost factor in the overall score.  I will  discuss how to do this in detail in the section titled Editorial  Relevance Controls below.  Meanwhile, on the review side, it is often  the case that too much noise is introduced by including reviews &#8220;on par&#8221;  with matches in the product title or description.  For instance, if I&#8217;m  selling &#8220;Widget X&#8221; and a review for a different product says something  like &#8220;You should also check out Widget X&#8221;, bringing back a match on that  second product really isn&#8217;t all that useful for a customer searching  for &#8220;Widget X&#8221;.   To deal with this noise, people often take a couple of  different approaches:</p>
<ol>
<li>They weight review matches lower than product matches via boosting (either at query time or indexing time)</li>
<li>They only search reviews if they don&#8217;t feel they have high quality matches for the main product search</li>
</ol>
<p>You could also do some type of post processing analysis (NLP) of the  review to see if it is on topic, but this approach likely isn&#8217;t viable  for most people in most situations due to the processing power required and the limited accuracy of such a solution.  As for #2 above, see my post on <a href="http://www.lucidimagination.com/blog/2009/08/12/fake-and-invisible-queries/">Fake and Invisible Queries</a> for more insight.</p>
<h2>Auto-suggest</h2>
<p>Auto suggest (aka auto complete) is one of the cheapest (in terms of  development costs) mechanisms available for enhancing the chance that  users find what they are looking for.  I&#8217;ve heard of vendors adding  auto-suggest and having it add millions to their bottom line.  Simply by  providing a drop down list of ways of completing what a user has typed  so far, an application can do a number of things:</p>
<ol>
<li>Reduce spelling errors thus leading to lower frustration and better results sooner rather than later</li>
<li>Seed the user with items that they may want but weren&#8217;t explicitly  looking for.  After all, an intelligent auto-suggest box can very easily  not only give completions, but it can also hook in related items too.</li>
<li>Short-circuit search altogether and go directly to a landing page for a specific search</li>
</ol>
<div>
<dl id="attachment_2317">
<dt><a href="http://www.lucidimagination.com/blog/wp-content/uploads/2010/04/ecomm-sample-auto-suggest.png"><img title="Example Auto-Suggest screen" src="http://www.lucidimagination.com/blog/wp-content/uploads/2010/04/ecomm-sample-auto-suggest.png" alt="" width="429" height="340" /></a></dt>
<dd>Example Auto-Suggest Screen Capture</dd>
</dl>
</div>
<p>For the demo, I implemented auto-suggest using SOLR-1316, which should be committed to trunk soon.  Note that there are other ways of doing auto-suggest, including using the TermsComponent and faceting.  Here are the steps I went through to make auto-suggest work:</p>
<ol>
<li>Applied the SOLR-1316 patch to the 3.x branch.  This required a minor tweak to the HighFreqDictionary.java file.  See the patch below.</li>
<li>Added the necessary piece to the solrconfig.xml.  See the /autosuggest SearchComponent in the solrconfig.xml in the appendix.</li>
<li>Decided what fields to use in building the auto-suggest index (see schema.xml).  I then &#8220;copy fielded&#8221; these into a field named suggest.  Note that I used a non-stemming analyzer.  I also used Solr&#8217;s word-based n-gram (shingle) filter with a shingle size of 5 so as to give phrase suggestions too.  Note, this is intended for demonstration purposes: you may prefer not to use shingles and instead append terms as the user types, or you may want to use a different value for n.  Also note, I did not spend much time evaluating what went into the suggest field that is used as a source.  You will want to validate it and make sure it is aligned with your business goals.</li>
<li>Built the auto-suggest data structures via the Spell Checker build command (see the next section).</li>
<li>Modified the jQuery script that is in the Solr VelocityResponseWriter example to use the SOLR-1316 output instead of the TermsComponent output.  See the autocomplete.vm file for details on the JavaScript.  See the next section on Did You Mean for how to make requests to the auto-suggest component, as it uses the same mechanism as the spell checker.</li>
</ol>
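<p>To make step 3 a little more concrete, here is a minimal sketch of what the schema.xml pieces might look like.  The type name textSuggest and the source fields name and description are assumptions for illustration; only the suggest field name, the lack of stemming and the shingle size of 5 come from the steps above:</p>
<blockquote>
<pre>&lt;fieldType name="textSuggest" class="solr.TextField" positionIncrementGap="100"&gt;
 &lt;analyzer&gt;
  &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt;
  &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
  &lt;!-- no stemming filter; shingles of up to 5 words give phrase suggestions --&gt;
  &lt;filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/&gt;
 &lt;/analyzer&gt;
&lt;/fieldType&gt;

&lt;field name="suggest" type="textSuggest" indexed="true" stored="false" multiValued="true"/&gt;

&lt;!-- hypothetical source fields copied into the suggest field --&gt;
&lt;copyField source="name" dest="suggest"/&gt;
&lt;copyField source="description" dest="suggest"/&gt;</pre>
</blockquote>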
<p>Hopefully, from here you will have enough information to build your own auto-suggest capabilities.  If not, see our <a href="http://www.lucidimagination.com/search/?q=autosuggest">search site</a> for more info, including alternate approaches to the SOLR-1316 patch.</p>
<h2>Did You Mean?</h2>
<p>Just like auto-suggest, spell checking can help users find what they are looking for, especially given the propensity of manufacturers/product designers to use incorrectly spelled words in their product names in order to better &#8220;brand&#8221; the product.  Good spell checking goes beyond merely hooking up a dictionary of terms; it is also quite important to know when to suggest a term and when not to suggest a term.  Lucene/Solr has the basics of setting up spell checking covered via the SpellCheckComponent, but a good spell checking application will need to go beyond merely setting up the component in order to achieve good results.  First things first, however: let&#8217;s take a look at getting spell checking set up and then we can examine what is needed to make it better.</p>
<p>First, we need to configure the SpellCheckComponent in the solrconfig.xml file.  There is an example of this in the Solr tutorial example, from which I changed the distance measure from the Levenshtein edit distance to the Jaro-Winkler distance.  I did this based on past experience that users tend to misspell words towards the end of the word and not the beginning, which the Jaro-Winkler distance accounts for.  My configuration looks like:</p>
<blockquote>
<pre>&lt;searchComponent name="spellcheck" class="solr.SpellCheckComponent"&gt;
 &lt;str name="queryAnalyzerFieldType"&gt;textSpell&lt;/str&gt;
 &lt;lst name="spellchecker"&gt;
  &lt;str name="name"&gt;default&lt;/str&gt;
  &lt;str name="field"&gt;spell&lt;/str&gt;
  &lt;str name="spellcheckIndexDir"&gt;./spellchecker&lt;/str&gt;
  &lt;!-- Jaro-Winkler instead of the default Levenshtein edit distance --&gt;
  &lt;str name="distanceMeasure"&gt;org.apache.lucene.search.spell.JaroWinklerDistance&lt;/str&gt;
 &lt;/lst&gt;
 &lt;!-- ... --&gt;
&lt;/searchComponent&gt;</pre>
</blockquote>
<p>The whole point of a SearchComponent such as the SpellCheckComponent is to hook it into the main Solr request processing instead of having to make a separate call.  Thus, I hooked the SpellCheckComponent into the /nyc RequestHandler so that all queries that are submitted to the &#8220;main&#8221; RequestHandler will also be spell checked.  Once the configuration is set up, the spelling index must be built (and maintained).  This is handled by issuing an &amp;spellcheck.build=true command to the spell checker, as in:</p>
<blockquote><p><a href="http://localhost:8983/solr/autosuggest?q=man&amp;spellcheck=true&amp;wt=xml&amp;rows=0&amp;indent=true&amp;spellcheck.build=true">http://localhost:8983/solr/autosuggest?q=man&amp;spellcheck=true&amp;wt=xml&amp;rows=0&amp;indent=true&amp;spellcheck.build=true</a></p></blockquote>
<p>(Note, the &amp;q param can be anything.)</p>
<p>Once the configuration is hooked up and the spell checking data structure is built, the last piece is to hook it into the UI.  (Note, I set up the solrconfig.xml to automatically do spell checking on every query request.)  To hook into the UI, I co-opted the suggest.vm file and spruced it up a bit to provide links, etc.  Other than that, it is exactly the same, since both are just different implementations of spell checking.</p>
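<p>For reference, hooking the component into a handler so that every request is spell checked generally looks something like the sketch below.  This is not the demo&#8217;s full /nyc configuration (see the appendix for that); the spellcheck.count value is an assumption chosen for illustration:</p>
<blockquote>
<pre>&lt;requestHandler name="/nyc" class="solr.SearchHandler"&gt;
 &lt;lst name="defaults"&gt;
  &lt;!-- turn spell checking on for every request to this handler --&gt;
  &lt;str name="spellcheck"&gt;true&lt;/str&gt;
  &lt;!-- assumed value: maximum number of suggestions per term --&gt;
  &lt;str name="spellcheck.count"&gt;5&lt;/str&gt;
 &lt;/lst&gt;
 &lt;!-- run the spellcheck component after the standard components --&gt;
 &lt;arr name="last-components"&gt;
  &lt;str&gt;spellcheck&lt;/str&gt;
 &lt;/arr&gt;
&lt;/requestHandler&gt;</pre>
</blockquote>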
<p>See the Solr wiki on the <a href="http://wiki.apache.org/solr/SpellCheckComponent">SpellCheckComponent</a> for more information.</p>
<h2>Related Searches/Items</h2>
<p>In many ecommerce applications, stores position related items next to a particular item, either to inspire the user to buy an additional item or to offer an alternative.  Naturally, the &#8220;relation&#8221; is determined by the store and might take on a variety of forms, such as: accessories, enhanced versions, cheaper versions, alternatives from different manufacturers or popular items based on other users.  Similarly, a store may wish to give users not only suggestions and spelling corrections, but also alternative search terms or other popular searches.  For instance, if a user searches for TVs, a store may want to suggest they search for &#8220;LCD TVs&#8221; or &#8220;HD TVs&#8221;, etc.</p>
<p>When it comes to related items, many Solr users rely on hand-crafting a second query (given an original query and a particular item) from the original terms of the query and some of the terms that describe the item.  For instance, an application might use the category of the item plus some of the keywords for that item to craft the query, submit it to Solr and then display the first few results.  This approach can also be done automatically using Solr&#8217;s built-in <a href="http://wiki.apache.org/solr/MoreLikeThis">More Like This</a> (MLT) capability, but you may need to do some tuning to get the results you desire.  For the sake of the example, I incorporated MLT into the application.  You can see it on the left-hand side, just below the map, under the &#8220;Similar Properties&#8221; heading.  The configuration of MLT was done in the solrconfig.xml file as part of the /nyc RequestHandler.  Note, in a typical application you may not wish to generate MLT results for a search query, but instead only provide them once a user chooses a particular document, as MLT can add a fair amount of overhead to the process.  Other Solr applications will often calculate related items offline or through some type of collaborative filtering approach (see Apache <a href="https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation">Mahout&#8217;s recommender capability</a> for an open source library to do this) and either add the information to the document and re-index or integrate it at the application level.  In these cases, it&#8217;s not hard to integrate, but it is beyond the scope of this article.</p>
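<p>As a sketch of what the MLT portion of such a handler might contain, the defaults below turn MLT on and tell it which field to mine for &#8220;interesting&#8221; terms.  The field choice and the numeric values are assumptions for illustration rather than the demo&#8217;s actual settings; the field listed in mlt.fl generally needs to be stored or have term vectors enabled:</p>
<blockquote>
<pre>&lt;requestHandler name="/nyc" class="solr.SearchHandler"&gt;
 &lt;lst name="defaults"&gt;
  &lt;str name="mlt"&gt;true&lt;/str&gt;
  &lt;!-- assumed: use the description field as the source of similarity terms --&gt;
  &lt;str name="mlt.fl"&gt;description&lt;/str&gt;
  &lt;!-- assumed: return 3 similar documents per result --&gt;
  &lt;str name="mlt.count"&gt;3&lt;/str&gt;
  &lt;!-- assumed: minimum term frequency and document frequency for a term to be considered --&gt;
  &lt;str name="mlt.mintf"&gt;1&lt;/str&gt;
  &lt;str name="mlt.mindf"&gt;2&lt;/str&gt;
 &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
</blockquote>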
<p>As for the functionality to add related searches, there is not currently support built into Solr, but there is a <a href="https://issues.apache.org/jira/browse/SOLR-2080">JIRA issue</a> open to track the idea.  Related searches can often be determined  through a combination of log analysis (look for patterns in a user  session) and synonyms or via collaborative filtering/recommenders.   Also, have a look at <a href="https://cwiki.apache.org/confluence/display/MAHOUT/Parallel+Frequent+Pattern+Mining">Mahout&#8217;s Frequent Pattern Mining capabilities</a>.  One could also index the queries into another index (Solr core) and simply issue fuzzy queries to it.</p>
<h2>Editorial Relevance Controls</h2>
<div>Whether it&#8217;s called &#8220;editorial controls&#8221;, &#8220;sponsored results&#8221;, &#8220;best bets&#8221; or any other name, the ability to implement business goals as part of search is a fundamental need of any ecommerce solution.  Hidden in the various names is a desire to have total control of search relevance without sacrificing speed or hindering the engine from working well when no business rules are applicable.  Solr and Lucene offer a myriad of mechanisms to achieve business goals, ranging from the typical boost values on documents, fields, tokens and query terms to the hardcore &#8220;gotta have it exactly my way&#8221; option of cracking open the source and adding your own query mechanism.  In between these two extremes is a whole range of things like <a href="http://wiki.apache.org/solr/FunctionQuery">function queries</a>, payloads, the <a href="http://wiki.apache.org/solr/QueryElevationComponent">QueryElevationComponent</a> for setting fixed results as well as excluding specific documents, <a href="http://wiki.apache.org/solr/SolrPlugins#Similarity">similarity adjustments</a>, <a href="http://wiki.apache.org/solr/DisMaxQParserPlugin">augmented queries (such as automatic phrase boosting)</a> and much more.  Of these, most people rely on function queries, the dismax extensions and the QueryElevationComponent to achieve their relevance goals.</div>
<div>In the working example, I made a couple of changes to demonstrate some of the relevance ideas described here:</div>
<div>
<ol>
<li>The /nyc RequestHandler has the QueryElevationComponent hooked in and keyed off of the elevate.xml file.  In that file, I mapped the query &#8220;3 bedroom Brooklyn&#8221; to rank a specific document higher and exclude one other.  See <a href="http://localhost:8983/solr/admin/file/?file=elevate.xml">http://localhost:8983/solr/admin/file/?file=elevate.xml</a> for the mapping.  To compare against the results without elevation, add &amp;enableElevation=false to the query, as in: <a href="http://localhost:8983/solr/nyc?q=3+bedroom+Brooklyn&amp;enableElevation=false">http://localhost:8983/solr/nyc?q=3+bedroom+Brooklyn&amp;enableElevation=false</a></li>
<li>I set up &#8220;phrase boosting&#8221; on the description field, so that documents matching the query as a phrase in the description rank higher.  See the /nyc RequestHandler (it&#8217;s the &#8220;pf&#8221; setting in the solrconfig.xml).</li>
<li>I added a &#8220;boost function&#8221; to rank documents higher based on the commission paid for selling the property (note, I randomly assigned a value to this field for pedagogical reasons).  See the &#8220;bf&#8221; setting in the /nyc RequestHandler.</li>
<li>Also, don&#8217;t forget creative domain modeling:  for instance, if you want to support landing pages and banners, why not just create them as documents in your index (assign a type to them) and make sure they are at the top of the results?  Another possibility is doing two queries: one for landing pages first and then one for the results.</li>
</ol>
</div>
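<div>To give a feel for the first three items, here is a hedged sketch of the two pieces of configuration involved.  The document ids, the commission field name and the boost value are hypothetical; the real values are in the demo&#8217;s elevate.xml and solrconfig.xml.  First, an elevate.xml entry that pins one document to the top for a query and excludes another:</div>
<blockquote>
<pre>&lt;elevate&gt;
 &lt;query text="3 bedroom Brooklyn"&gt;
  &lt;!-- hypothetical ids: the first document is forced to the top, the second is removed --&gt;
  &lt;doc id="LISTING-A"/&gt;
  &lt;doc id="LISTING-B" exclude="true"/&gt;
 &lt;/query&gt;
&lt;/elevate&gt;</pre>
</blockquote>
<div>Second, the dismax &#8220;pf&#8221; and &#8220;bf&#8221; defaults in the handler, again with assumed values:</div>
<blockquote>
<pre>&lt;lst name="defaults"&gt;
 &lt;str name="defType"&gt;dismax&lt;/str&gt;
 &lt;!-- phrase boost: reward documents where the query terms appear together in the description --&gt;
 &lt;str name="pf"&gt;description^10&lt;/str&gt;
 &lt;!-- boost function: add the value of the (hypothetical) numeric commission field to the score --&gt;
 &lt;str name="bf"&gt;commission&lt;/str&gt;
&lt;/lst&gt;</pre>
</blockquote>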
<div>If you are so inclined, you can also extend Solr and Lucene.  Before you do, however, you might want to <a href="http://search.lucidimagination.com">search for your issue</a>, or even ask on the appropriate mailing list.  If that doesn&#8217;t help, I recommend starting with the <a href="http://wiki.apache.org/solr/SolrPlugins">Solr Plugins</a> wiki page and then you can dig into the source from there if necessary.  My advice:  If you think you need a new Query class (a low-level Lucene mechanism for custom scoring), see if you can solve your problem via a FunctionQuery (even a custom one) first and maybe some other mechanisms before going down the Query path.</div>
<h2>Administration</h2>
<p>Administration means many things to many people.  To the IT department, it means easy setup, configuration, monitoring, maintenance, scalability, fault tolerance, etc., while to the business user it means tools for manipulating results, reporting search statistics and following through on business goals.  While the latter is important, I am going to focus on the IT department&#8217;s needs for the sake of this article.  Solr is very easy for an IT person to get set up with a baseline configuration in place.  I&#8217;ve seen customers (without my help) be up and running and searching their data in non-trivial ways in as little as 30 minutes, sometimes less.  As for monitoring, Solr comes with web pages that report status as well as JMX integration.  I&#8217;ve also seen Solr integrated nicely with Nagios, Cacti and other tools.  Lucid Imagination also partners with <a href="http://www.lucidimagination.com/performanceportal">New Relic</a> to offer Solr specific monitoring tools.</p>
<p>As for the big questions of whether Solr can deliver scalability and fault tolerance, the answer is an unequivocal yes.  High traffic ecommerce sites like Zappos, Netflix, CNET, AOL and many others use Solr to serve their search needs.  Solr can be set up to handle both large indexes and high query volumes.  For more information on how to do this, see Mark Miller&#8217;s excellent article on <a href="http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr">scaling Solr</a>.</p>
<h2>Recommendations (See Mahout)</h2>
<p>For both online and offline recommendation calculations, see the <a href="http://mahout.apache.org">Apache Mahout</a> project, which has an excellent collaborative filtering library.   While integration with Solr does not yet exist, Mahout does expose web  services (as well as Java APIs) for its recommender engine, so it is  feasible to integrate it within an application.</p>
<h2>Analytics and other Business Tools</h2>
<p>Analytics are probably Solr&#8217;s weakest area, but that being said, we find that many customers already have platforms in place (like Omniture) that they can easily integrate Solr into.  This often saves business users from having to learn yet another tool.  As for other business tools, Solr likely does not have them (for instance, merchandising tools), but again, many people find it straightforward to integrate Solr into existing tools.  Also, this is an area where LucidWorks, with its administrative UI, really can help.  It has screens and tools for doing log analysis and seeing what popular queries are, as well as popular terms and queries that return zero results.</p>
<h1>Looking Forward</h1>
<p>Solr is a very popular and capable search engine for ecommerce and, looking forward,  it is only getting better.  With a focus on greater features (spatial  search, for instance), the latest Lucene and easier scalability, the  next version of Solr promises to be even better.</p>
<h1>Appendix A</h1>
<p>Items needed here: schema, solrconfig, SOLR-1316 3.x branch patch</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/01/25/implementing-the-ecommerce-checklist-with-apache-solr-and-lucidworks/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>What&#8217;s a shingle in Lucene parlance?</title>
		<link>http://www.lucidimagination.com/blog/2010/12/17/whats-a-shingle-in-lucene-parlance/</link>
		<comments>http://www.lucidimagination.com/blog/2010/12/17/whats-a-shingle-in-lucene-parlance/#comments</comments>
		<pubDate>Fri, 17 Dec 2010 15:47:44 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Grant Ingersoll]]></category>
		<category><![CDATA[ngrams]]></category>
		<category><![CDATA[shingles]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=2808</guid>
		<description><![CDATA[<p>Every now and then we get asked what the heck is a shingle in Lucene, as in the ShingleFilter or the ShingleMatrixFilter, so it seems worthwhile to provide some info on shingles in Lucene, Solr and <a href="http://www.lucidimagination.com/enterprise-search-solutions">LucidWorks Enterprise</a>.  First off, a shingle is just a word-based n-gram, as opposed to a character-based n-gram (NGramTokenizer, NGramTokenFilter, EdgeNGramTokenizer and EdgeNGramTokenFilter provide the latter functionality).  We named it shingles just to differentiate the two when it comes &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>Every now and then we get asked what the heck is a shingle in Lucene, as in the ShingleFilter or the ShingleMatrixFilter, so it seems worthwhile to provide some info on shingles in Lucene, Solr and <a href="http://www.lucidimagination.com/enterprise-search-solutions">LucidWorks Enterprise</a>.  First off, a shingle is just a word-based n-gram, as opposed to a character-based n-gram (NGramTokenizer, NGramTokenFilter, EdgeNGramTokenizer and EdgeNGramTokenFilter provide the latter functionality).  We named it shingles just to differentiate the two when it comes to naming the filters and, well, because like shingles on your roof, they overlap each other.</p>
<p>What are shingles good for?  Many people use them to create &#8220;pseudo-phrases&#8221; during the indexing process since the shingle ends up being a single token, which is then subject to the normal TF-IDF scoring that is used in Lucene.  In many cases, searching for phrases yields relevance improvements, but finding phrases at query-time can be more expensive than normal term queries, so people sometimes try to get ahead of the game and use shingles.</p>
<p>If you want to see shingles in action and compare them to n-grams, add the following field types to a sample Solr schema:</p>
<pre>&lt;fieldType name="shingle" class="solr.TextField"&gt;
 &lt;analyzer&gt;
  &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
  &lt;!-- word-based n-grams of 2 to 5 words --&gt;
  &lt;filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"/&gt;
 &lt;/analyzer&gt;
&lt;/fieldType&gt;
&lt;fieldType name="ngram" class="solr.TextField"&gt;
 &lt;analyzer&gt;
  &lt;!-- character-based n-grams of 2 to 5 characters --&gt;
  &lt;tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="5"/&gt;
 &lt;/analyzer&gt;
&lt;/fieldType&gt;
</pre>
<p>Next, start up your Solr instance, browse to <a href="http://localhost:8983/solr/admin/analysis.jsp">http://localhost:8983/solr/admin/analysis.jsp</a> and do the following steps:</p>
<ol>
<li>In the Field row, choose &#8220;Type&#8221; from the dropdown and enter shingle in the text area</li>
<li>In the Field Value section, choose Verbose output and enter &#8220;The quick red fox jumped over the lazy brown dogs&#8221;.</li>
<li>Hit Submit.  You should see something like:</li>
</ol>
<div id="attachment_2813" class="wp-caption alignnone" style="width: 780px"><a href="http://www.lucidimagination.com/blog/wp-content/uploads/2010/12/shingle.jpg"><img class="size-full wp-image-2813" title="Shingle example" src="http://www.lucidimagination.com/blog/wp-content/uploads/2010/12/shingle.jpg" alt="" width="770" height="540" /></a><p class="wp-caption-text">Output of Solr&#39;s Analysis page </p></div>
<p>As you can see, there are multiple tokens put out for each position, many of which contain multiple words as a single token.</p>
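<p>To spell that out for the first position only: assuming the filter&#8217;s default of also emitting the single words (outputUnigrams), the tokens that start at &#8220;The&#8221; for the sample sentence should be along these lines:</p>
<pre>The
The quick
The quick red
The quick red fox
The quick red fox jumped
</pre>
<p>The same pattern then repeats starting at &#8220;quick&#8221;, &#8220;red&#8221; and so on, which is where the overlap comes from.</p>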
<p>Next, try the same sentence, but switch from &#8220;shingle&#8221; to &#8220;ngram&#8221; for the field type.  This time you should see the words split up into character groups.</p>
<p>For more info, see <a href="http://en.wikipedia.org/wiki/N-gram">http://en.wikipedia.org/wiki/N-gram</a>.  Note, you might also find the Google Books Ngram Viewer interesting: <a href="http://ngrams.googlelabs.com/">http://ngrams.googlelabs.com/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/12/17/whats-a-shingle-in-lucene-parlance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Summary of first ever RTP (Raleigh/Chapel Hill/Durham) Apache Lucene/Solr Meetup</title>
		<link>http://www.lucidimagination.com/blog/2010/09/29/summary-of-first-ever-rtp-raleighchapel-hilldurham-apache-lucenesolr-meetup/</link>
		<comments>http://www.lucidimagination.com/blog/2010/09/29/summary-of-first-ever-rtp-raleighchapel-hilldurham-apache-lucenesolr-meetup/#comments</comments>
		<pubDate>Wed, 29 Sep 2010 12:55:42 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[apache]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[auto suggest]]></category>
		<category><![CDATA[faceting]]></category>
		<category><![CDATA[Grant Ingersoll]]></category>
		<category><![CDATA[solr 4.0]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=2499</guid>
		<description><![CDATA[<p>A week and a day later, I&#8217;ve finally got a chance to put up my thoughts/notes on the first ever RTP <a href="http://lucene.apache.org">Apache Lucene/Solr</a> Meetup hosted by <a href="http://www.lulu.com/">Lulu Press</a> and co-sponsored by Lucid Imagination.</p>
<p>First off, hats off to Lulu for the excellent hosting, coordination and marketing of the event.  You could definitely see the evidence of Lulu&#8217;s &#8220;Be Remarkable&#8221; philosophy in the event. I&#8217;d say we had roughly 30-40 people for the first time event, &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>A week and a day later, I&#8217;ve finally got a chance to put up my thoughts/notes on the first ever RTP <a href="http://lucene.apache.org">Apache Lucene/Solr</a> Meetup hosted by <a href="http://www.lulu.com/">Lulu Press</a> and co-sponsored by Lucid Imagination.</p>
<p>First off, hats off to Lulu for the excellent hosting, coordination and marketing of the event.  You could definitely see the evidence of Lulu&#8217;s &#8220;Be Remarkable&#8221; philosophy in the event. I&#8217;d say we had roughly 30-40 people for the first-time event, with a good mix of developers, technical managers and a few recruiters.  There was even a &#8220;competitor&#8221; from an unnamed proprietary vendor present.  On the application front, there was a large mix of usages represented: ecommerce, publishing, video search, procurement, biopharma, etc.</p>
<p>After some socialization, we kicked off the night with Lulu CEO <a href="http://en.wikipedia.org/wiki/Bob_Young_%28businessman%29">Bob Young</a>, who gave a short intro to Lulu as well as a warm welcome to all.  Next up, I gave a talk (<a href="http://files.meetup.com/1698968/newLuceneSolr-sept2010.pptx">slides</a>) on what&#8217;s coming in Lucene/Solr 3.x and beyond and answered some questions about features and functionality.  After me, Tarun Jain of <a href="http://www.abb.com/">The ABB Group</a>, one of Lucid&#8217;s first customers and the world&#8217;s largest producer of industrial robots as well as a global leader in power and industrial automation with revenues around $33B USD, gave a presentation titled &#8220;Extreme Faceting Using Solr&#8221; (<a href="http://files.meetup.com/1698968/Extreme%20Faceting%20using%20SOLR.ppt">slides</a>) on their move from a legacy proprietary vendor to Solr for searching their entire customer-facing (and internal) product catalog (420K SKUs with 20+ million attributes and over 6M hits per month).  After setting the stage about the content to be searched and faceted, Tarun detailed how they went from wanting to &#8220;do everything in the DB&#8221; to doing nearly everything in Solr because it was that easy.  Moreover, slide 8 details the comparison they did between Solr and a very large proprietary search vendor (one of the so-called top 3).  Here are the bullet points:</p>
<ol>
<li>
<div>Stress test results in Proof of concept</div>
<ol>
<li>
<div>SOLR 35 req/sec vs 2 req/sec</div>
</li>
<li>
<div>Average response times 200 ms vs 1-7 secs</div>
</li>
<li>
<div>CPU usage 2-3% vs 100%</div>
</li>
</ol>
</li>
<li>Sadly matchup was not even close (at least for the scenarios we tested for)</li>
<li>
<div><strong>Conclusion .. Performance of SOLR is inversely proportional to the cost</strong></div>
</li>
<li>
<div>Winner – SOLR by a KO</div>
</li>
</ol>
<p>After Tarun&#8217;s talk, Paul Oakes from Lulu gave an excellent technical presentation (<a href="http://files.meetup.com/1698968/Implementing%20Autocomplete%20with%20Solr%20and%20jQuery.ppt">slides</a>) on implementing auto-suggest in Solr using <a href="http://jquery.com/">jQuery</a>.  Just for grins, he also showed how trivial it was to add Google&#8217;s much hyped &#8220;Instant&#8221; search capability to Solr, simply by making an extra jQuery call.  Naturally, the real work behind &#8220;Instant&#8221; is in capacity planning at scale, not in the programming of a few lines of JavaScript.</p>
<p>As for the RTP meetup in general, I would suspect we will try to meet once a quarter, but maybe more often if the group so desires.</p>
<p>All in all, an excellent night, in my opinion.  Best of all it was a &#8220;home&#8221; event for me, so I didn&#8217;t have to fly anywhere or bum a ride back to a hotel!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2010/09/29/summary-of-first-ever-rtp-raleighchapel-hilldurham-apache-lucenesolr-meetup/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Sorting, Faceting and Schema Design in Solr</title>
		<link>http://www.lucidimagination.com/blog/2009/02/09/sorting-faceting-and-schema-design-in-solr/</link>
		<comments>http://www.lucidimagination.com/blog/2009/02/09/sorting-faceting-and-schema-design-in-solr/#comments</comments>
		<pubDate>Mon, 09 Feb 2009 19:11:09 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[schema]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[best practices]]></category>
		<category><![CDATA[Grant Ingersoll]]></category>
		<category><![CDATA[schema design]]></category>
		<category><![CDATA[sint]]></category>
		<category><![CDATA[sortable]]></category>

		<guid isPermaLink="false">http://blog.lucidimagination.com/?p=42</guid>
		<description><![CDATA[<p>I was recently with a client doing a <a href="http://www.lucidimagination.com/How-We-Can-Help/">Best Practices assessment</a> when I came across a common source of confusion related to sorting, faceting and schema design.</p>
<p>As background, Solr provides a <a href="http://wiki.apache.org/solr/SchemaXml">schema</a> that describes the Fields and Field Types (FT) that are used by an application.  Field Types describe how Solr should handle the information contained in a Field.  For instance, the integer FT tells Solr to treat the contents of any Field of &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>I was recently with a client doing a <a href="http://www.lucidimagination.com/How-We-Can-Help/">Best Practices assessment</a> when I came across a common source of confusion related to sorting, faceting and schema design.</p>
<p>As background, Solr provides a <a href="http://wiki.apache.org/solr/SchemaXml">schema</a> that describes the Fields and Field Types (FT) that are used by an application.  Field Types describe how Solr should handle the information contained in a Field.  For instance, the integer FT tells Solr to treat the contents of any Field of type integer as, you guessed it, an integer.  By integer here, I mean, good old fashioned Java ints.  Solr provides other FTs like long, double, float, string, date, as well as Text (which can be associated with Lucene&#8217;s analysis process).  Additionally, Solr provides several &#8220;sortable&#8221; FTs such as sint, slong, sdouble and sfloat.  Therein lies the confusion.  I think what happens is developers hear the word &#8220;sortable&#8221; and think they should use the sortable FT for any field they want to sort results by.  However, there is some subtlety here.  Namely, &#8220;sortable&#8221; FTs manipulate the content so that the lexicographic order is the same as the numeric order for use during search.  Sortables are thus really meant to be used when doing things like range queries (e.g. price:[2 TO 100]) and not for sorting as it relates to returning results.  Because of this manipulation, sortables take up more space in the index (and in memory) than their non-sortable compadres.</p>
<p>What&#8217;s this got to do with schema design?  Well, this client had three fields, all defined as sortable integer FTs, as in:</p>
<ol>
<li>fieldOriginal &#8211; The source of the content.  This was the main field used for sorting.</li>
<li>fieldSearch &#8211; Copy field of Original, but rounded to the nearest 100.  This was the main field for searching.</li>
<li>fieldFacet &#8211; Copy field of Original, but rounded based on a percentage of the original value so as to provide a sliding scale for faceting.  This was the main field used for faceting.</li>
</ol>
<p>In this case, the client was using the Original for sorting, Search for searching, and Facet for faceting.  They were not doing any range queries, so they did not need fieldSearch to be &#8220;sortable&#8221;.  Furthermore, the Original field had over 1 million unique terms, so sorting on it was taking up a good chunk of memory and disk space.  The other two fields were smaller, so the cost of sortables was not that big of a deal.  Finally, this field &#8220;pattern&#8221; was replicated for several other fields as well, some of which also had a significant number of unique terms.</p>
<p>Thus, simply by changing the Fields to use integers where appropriate, we significantly reduced the memory footprint and the disk space required in this client application.</p>
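<p>As a rough illustration of the change, the relevant schema.xml entries might look something like the following.  The integer and sint type definitions follow the Solr example schema; the indexed and stored attributes on the field are assumptions, since the client&#8217;s actual schema is not shown here:</p>
<pre>&lt;fieldType name="integer" class="solr.IntField" omitNorms="true"/&gt;
&lt;fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/&gt;

&lt;!-- was type="sint"; no range queries are run against it, so a plain integer will do --&gt;
&lt;field name="fieldOriginal" type="integer" indexed="true" stored="true"/&gt;
</pre>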
<p>So, as is always the case, pay close attention to your schema design.  While the Solr example schema is pretty good out of the box, you shouldn&#8217;t just take it as gospel, either.  Spend some time thinking about your needs during design and it will likely save you much time later when debugging and testing your application.</p>
<p>**UPDATE**:  Note, making these changes will require you to re-index.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/02/09/sorting-faceting-and-schema-design-in-solr/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Lucene, Solr, Mahout and Droids ApacheCon EU in Amsterdam March 23-27</title>
		<link>http://www.lucidimagination.com/blog/2009/02/09/lucene-solr-mahout-and-droids-apachecon-eu-in-amsterdam-march-23-27/</link>
		<comments>http://www.lucidimagination.com/blog/2009/02/09/lucene-solr-mahout-and-droids-apachecon-eu-in-amsterdam-march-23-27/#comments</comments>
		<pubDate>Mon, 09 Feb 2009 17:44:20 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Events]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Erik Hatcher]]></category>
		<category><![CDATA[Grant Ingersoll]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=82</guid>
		<description><![CDATA[[ Monday, 23 March 2009 to Friday, 27 March 2009. ] <p><a href="http://lucene.grantingersoll.com/2009/02/09/lucene-and-me-at-apachecon-eu-in-amsterdam-march-23-27/">Lucene and me at ApacheCon EU in Amsterdam March 23-27</a>.</p>
<p>I&#8217;ve posted a Lucene related event schedule on my blog for people who are interested.  Of particular note are the two days of pre-conference training on both Lucene and Solr.  These are shorter ApacheCon versions of our <a href="http://www.lucidimagination.com/How-We-Can-Help/Training/">3 day training classes</a>.  Obviously, we can&#8217;t cover all the material that we do in &#8230;</p>]]></description>
			<content:encoded><![CDATA[[ Monday, 23 March 2009 to Friday, 27 March 2009. ] <p><a href="http://lucene.grantingersoll.com/2009/02/09/lucene-and-me-at-apachecon-eu-in-amsterdam-march-23-27/">Lucene and me at ApacheCon EU in Amsterdam March 23-27</a>.</p>
<p>I&#8217;ve posted a Lucene related event schedule on my blog for people who are interested.  Of particular note are the two days of pre-conference training on both Lucene and Solr.  These are shorter ApacheCon versions of our <a href="http://www.lucidimagination.com/How-We-Can-Help/Training/">3 day training classes</a>.  Obviously, we can&#8217;t cover all the material that we do in our full courses, but they are a means for people who are attending the conference or who are in the vicinity to get some Solr and Lucene training.  Of course, if you can&#8217;t make it to ApacheCon, we&#8217;d be happy to discuss alternative training opportunities.  In fact, last year, I did an on-site training at a company in Europe right around conference time.  Just drop us a line at training@lucidimagination.com and we can see what works out.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/02/09/lucene-solr-mahout-and-droids-apachecon-eu-in-amsterdam-march-23-27/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

