<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lucid Imagination &#187; schema</title>
	<atom:link href="http://www.lucidimagination.com/blog/category/solr/schema/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:12:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>What&#8217;s with lowercasing wildcard (multiterm) queries in Solr?</title>
		<link>http://www.lucidimagination.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/</link>
		<comments>http://www.lucidimagination.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/#comments</comments>
		<pubDate>Tue, 29 Nov 2011 21:37:25 +0000</pubDate>
		<dc:creator>Erick Erickson</dc:creator>
				<category><![CDATA[schema]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[wildcards multiterm queryparser]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=4476</guid>
		<description><![CDATA[<h1><span class="Apple-style-span" style="font-size: 20px;">Wildcard query terms aren&#8217;t analyzed, why is that?</span></h1>
<p>Before the current 3x branch (which will be released as 3.6) and trunk (4.0), Solr users were frequently perplexed that wildcard search terms were not analyzed, which most often showed up as unexpected case sensitivity. Say you have an analysis chain defined as follows in your schema.xml file, and a field named <code>field_lc</code> of this type:</p>
<pre>&#60;fieldType name="lowercase" class="solr.TextField" &#62;
  &#60;analyzer&#62;
    &#60;tokenizer class="solr.WhitespaceTokenizerFactory"/&#62;
    &#60;filter class="solr.LowerCaseFilterFactory" /&#62;
  &#60;/analyzer&#62;
&#60;/fieldType&#62;
</pre>
<p>Now, you index &#8230;</p>]]></description>
			<content:encoded><![CDATA[<h1><span class="Apple-style-span" style="font-size: 20px;">Wildcard query terms aren&#8217;t analyzed, why is that?</span></h1>
<p>Before the current 3x branch (which will be released as 3.6) and trunk (4.0), Solr users were frequently perplexed that wildcard search terms were not analyzed, which most often showed up as unexpected case sensitivity. Say you have an analysis chain defined as follows in your schema.xml file, and a field named <code>field_lc</code> of this type:</p>
<pre>&lt;fieldType name="lowercase" class="solr.TextField" &gt;
  &lt;analyzer&gt;
    &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
    &lt;filter class="solr.LowerCaseFilterFactory" /&gt;
  &lt;/analyzer&gt;
&lt;/fieldType&gt;
</pre>
<p>Now, you index the text &#8220;My Dog Has Fleas&#8221;. So far, so good. Searching on this field as<br />
<code>field_lc:fleas</code> returns the document, as does <code>field_lc:flea*</code>.</p>
<p>But now you search on <code>field_lc:Flea*</code> and you don&#8217;t get any results. What?!?!?! Nearly everyone scratches their head about this, and the question comes up often on the Solr user&#8217;s list: why isn&#8217;t the analysis chain above applied to wildcard queries? It turns out to be trickier than you might think at first. What happens when a single input term gets split into multiple parts? For instance, those of you familiar with WordDelimiterFilterFactory (WDDF) know that it can split on case changes. What does it mean to parse &#8216;fleA*&#8217;? Applying WDDF might well give the two tokens &#8216;fle&#8217; and &#8216;A&#8217;, and possibly &#8216;fleA&#8217;. If a wildcard is present, what tokens should be emitted?</p>
<ol>
<li>&#8216;fleA*&#8217;</li>
<li>&#8216;fle*&#8217;, &#8216;A*&#8217;, &#8216;fleA*&#8217;</li>
<li>&#8216;fle*&#8217;, &#8216;A*&#8217;</li>
<li>&lt;insert your solution here&gt;</li>
</ol>
<p>You can, I daresay, create any rule that suits your fancy. And it&#8217;ll be wrong in some situations. Of particular horror is anything that produces &#8216;A*&#8217; as above. Conceptually, you&#8217;d then have an enormous OR clause consisting of all the terms in your index that start with &#8216;A&#8217;. Unless you had a rule like &#8220;only do this if the preceding fragment was 2 characters or more&#8221;. But then someone would say &#8220;I need three characters&#8221;, so can WDDF provide a &#8220;wildCardMin=#&#8221; parameter? I already have trouble keeping all the WDDF parameters and their interactions in my mind, and going down this path would be a nightmare. And I haven&#8217;t even considered some of the <strong>really</strong> interesting issues, like how proximity would be incorporated in all this.</p>
<h3>Wildcards aren&#8217;t the only issue</h3>
<p>The same issue occurs with accent folding, normalizations, and, really, any other component of an analysis chain that somehow changes the query terms. Past releases have mostly ignored the problem and left it to the application programmer to manually &#8220;do the right thing&#8221; before sending the query to Solr. This often involves operations such as lower-casing and accent folding on the application side whenever a wildcard is encountered.</p>
<h1>The new way of handling these cases</h1>
<p>As of <a title="SOLR-2438" href="https://issues.apache.org/jira/browse/SOLR-2438">SOLR-2438</a>, this is no longer the case for a number of the most common components. A query analysis chain that contains any of the following components will automatically &#8220;do the right thing&#8221; and apply them to multi-term queries. If your analysis chain contains any of these elements, and you want them applied to &#8220;multi-term&#8221; queries, you don&#8217;t have to do anything at all. It will &#8220;just work&#8221;. At query time, the indicated transformations are applied to the query terms and everyone is happy. Or should be. Do note that it&#8217;s an all-or-nothing operation: <strong>all</strong> of the elements below that are found in the query analysis chain are applied to the multi-term terms.</p>
<ul>
<li>ASCIIFoldingFilterFactory</li>
<li>LowerCaseFilterFactory</li>
<li>LowerCaseTokenizerFactory</li>
<li>MappingCharFilterFactory</li>
<li>PersianCharFilterFactory</li>
</ul>
<p>Again, this effectively means you don&#8217;t need to care about these transformations any more. One note of explanation, though. I&#8217;ve talked about the &#8220;query analysis chain&#8221;. But what if you don&#8217;t have one? Remember that your <code>&lt;analyzer&gt;</code> tag can carry a <code>type</code> attribute of &#8220;index&#8221; or &#8220;query&#8221;, or no <code>type</code> at all. If a <code>type="query"</code> analyzer is found, that analysis chain is inspected and any of the above components are recorded to be used on multi-term queries. If no <code>type="query"</code> analyzer is found, then the <code>type="index"</code> analyzer is used. And if no <code>type="index"</code> analyzer is found, then the one with no <code>type</code> attribute is used.</p>
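<p>To make that lookup order concrete, here is a minimal sketch of a field type that defines both kinds of analyzer. The name <code>text_folded</code> and the particular mix of filters are assumptions chosen for illustration, not something taken from a shipped schema:</p>
<pre>&lt;fieldType name="text_folded" class="solr.TextField"&gt;
  &lt;analyzer type="index"&gt;
    &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
    &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
    &lt;filter class="solr.ASCIIFoldingFilterFactory"/&gt;
    &lt;filter class="solr.PorterStemFilterFactory"/&gt;
  &lt;/analyzer&gt;
  &lt;!-- A type="query" analyzer exists, so this is the chain that gets inspected. --&gt;
  &lt;analyzer type="query"&gt;
    &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
    &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
    &lt;filter class="solr.ASCIIFoldingFilterFactory"/&gt;
    &lt;filter class="solr.PorterStemFilterFactory"/&gt;
  &lt;/analyzer&gt;
&lt;/fieldType&gt;
</pre>
<p>Going by the list above, the lower-casing and ASCII folding from the query chain should be applied to a wildcard term on a field of this type, while the stemmer should be left out because it is not on the list. If you deleted the <code>type="query"</code> analyzer, the <code>type="index"</code> one would be inspected instead.</p>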
<h2>What does &#8220;multi-term&#8221; mean anyway?</h2>
<p>I&#8217;ve also sprinkled the phrase &#8220;multi-term&#8221; around, and sometimes &#8220;wildcard&#8221;. It turns out that the simple wildcard case is a specialization of a broader category of queries, including:</p>
<ul>
<li>wildcard</li>
<li>range</li>
<li>prefix</li>
</ul>
<p>All of these are now handled as above.</p>
<h3>Expert level schema possibilities</h3>
<p>All of the above is automatic, but there are three immediate questions:</p>
<ul>
<li>what about some of the <em>other</em> components?</li>
<li>what if I need the old behavior?</li>
<li>what if I want something completely different?</li>
</ul>
<p>It turns out that all three of these questions have the same answer. But before I outline it, I want to emphasize that <strong>you very probably don&#8217;t need to care about what follows!</strong> You might need to know about this in special cases, so I&#8217;ll mention it here.</p>
<p>In the above explanations, I wrote that &#8220;analysis chain is inspected and any of the above components are recorded to be used on multi-term queries&#8221;. Well, what actually happens is that there&#8217;s a new analysis chain in town that can be specified in the schema.xml file called, you guessed it, &#8220;multiterm&#8221;. You specify it like this as part of a <code>&lt;fieldType&gt;</code>:</p>
<pre>
&lt;analyzer type="multiterm" &gt;
  &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
  &lt;filter class="solr.ASCIIFoldingFilterFactory" /&gt;
  &lt;filter class="solr.YourFavoriteFilterFactoryHere" /&gt;
&lt;/analyzer&gt;
</pre>
<p>You can put <em>any</em> component here that&#8217;s legal in a <code>type="index"</code> or <code>type="query"</code> analysis chain. If you wanted, for instance, to enforce the old-style behavior, you could specify</p>
<pre>  &lt;tokenizer class="solr.KeywordTokenizerFactory" /&gt;</pre>
<p>as the entire &#8220;multiterm&#8221; analysis chain. It seems a bit odd to use KeywordTokenizerFactory here, but this applies to the individual terms, not the entire input. So it&#8217;s in effect saying &#8220;don&#8217;t analyze the terms at all&#8221;. Sound familiar? This is just what happened historically.</p>
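<p>Put together with the field type from the start of this post, the old-style setup might look like the sketch below. The explicit <code>&lt;analyzer&gt;</code> wrapper around the index/query chain is my assumption about how the full definition would be written out:</p>
<pre>&lt;fieldType name="lowercase" class="solr.TextField"&gt;
  &lt;!-- Used for ordinary indexing and querying. --&gt;
  &lt;analyzer&gt;
    &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
    &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
  &lt;/analyzer&gt;
  &lt;!-- Used for wildcard, prefix and range terms: pass each term through untouched. --&gt;
  &lt;analyzer type="multiterm"&gt;
    &lt;tokenizer class="solr.KeywordTokenizerFactory"/&gt;
  &lt;/analyzer&gt;
&lt;/fieldType&gt;
</pre>
<p>With a definition along these lines, <code>field_lc:Flea*</code> would go back to matching nothing, because the wildcard term is no longer lower-cased.</p>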
<h3>How does this relate to the automatic behavior?</h3>
<p>Well, what really happens under the covers is that if you don&#8217;t define your own &#8220;multiterm&#8221; analysis chain, Solr constructs one for you from the analyzers you <em>have</em> defined as outlined above; query, index or default, in that order.</p>
<h2>Waaaaay under the covers, down in the code</h2>
<p>All this is accomplished by making components &#8220;multiterm aware&#8221;. This means implementing the &#8220;MultiTermAwareComponent&#8221; interface. Currently, the components listed above are the only ones that implement this interface, but others may be good candidates, and some of these are listed in JIRA <a title="SOLR-2921" href="https://issues.apache.org/jira/browse/SOLR-2921">SOLR-2921</a>. By and large, implementing these in the code <em>may</em> be trivial. What&#8217;s <em>not</em> trivial is understanding what &#8220;the right thing&#8221; is. Some examples:</p>
<ul>
<li>stemmers</li>
<li>various language-specific normalization filters</li>
<li>various language-specific lowercase filters</li>
<li>various ICU filters</li>
</ul>
<p>The reason these haven&#8217;t been made &#8220;multi-term aware&#8221; is the usual open-source reason: &#8220;What we have is a good step forward, we shouldn&#8217;t delay everything in order to get the last use cases taken care of&#8221;. In other words, the implementors (me in this case, with lots of help from others) are tired ;).</p>
<p>Anyone who really understands what the right thing to do is for components that do not yet implement &#8220;MultiTermAwareComponent&#8221;, and who could provide use cases for them, would be giving us a great help, especially by providing examples illustrating the correct inputs and outputs for wildcard cases. And some examples of what should <em>not</em> come out as well. Or even better, a draft JUnit test that would show the expected behavior. Or even better yet, a full patch!</p>
<p>Any modification that potentially produces more than one token needs to be handled with care; see the code for LowerCaseTokenizerFactory for a case in point. Bear in mind that Solr will now throw an exception if the transformation produces more than one token, so tread cautiously!</p>
<p>This change should remove a long-standing point of confusion for Solr users. We&#8217;d be very interested in any feedback from the community, and especially in any problems that crop up. SOLR-2438 has patches for both the 3x and 4x code lines, but it&#8217;s probably easier just to get a current 3x or 4x branch (or nightly build) if you want to test this &#8220;in the wild&#8221;; the code has been committed and built. There remains some work to be done to incorporate this change for more analysis components. Anyone want to volunteer?</p>
<h2>Resources:</h2>
<p>Solr Wiki documentation: <a title="Multi Term Query Analysis" href="http://wiki.apache.org/solr/MultitermQueryAnalysis">Multi Term Query Analysis</a></p>
<p>Main JIRA (already in 3.6 and 4.0 code lines): <a title="SOLR-2438" href="https://issues.apache.org/jira/browse/SOLR-2438">SOLR-2438</a></p>
<p>JIRA for other components not yet &#8220;multi-term aware&#8221; that are possibilities in the future: <a title="SOLR-2921" href="https://issues.apache.org/jira/browse/SOLR-2921">SOLR-2921</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lucene in Action, 2nd edition, available!</title>
		<link>http://www.lucidimagination.com/blog/2009/03/11/lia2/</link>
		<comments>http://www.lucidimagination.com/blog/2009/03/11/lia2/#comments</comments>
		<pubDate>Wed, 11 Mar 2009 14:34:42 +0000</pubDate>
		<dc:creator>Erik Hatcher</dc:creator>
				<category><![CDATA[Enterprise Search]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[schema]]></category>

		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=378</guid>
		<description><![CDATA[<p><a href="http://www.manning.com/hatcher3"><img class="alignleft" title="Lucene in Action, 2nd edition" src="http://www.manning.com/hatcher3/hatcher3_cover150.jpg" alt="" width="150" height="188" /></a><a title="Lucene in Action, 2nd edition" href="http://www.manning.com/hatcher3">Lucene in Action, 2nd edition</a> is now available through the Manning Early Access Program. We&#8217;ve arranged for an exclusive discount, on either printbook+ebook or just the ebook, for our readers. Simply enter the code <em>lucene40</em> and get 40% off the book until April 1, 2009.</p>
<p><a title="Lucene in Action, 2nd edition" href="http://www.manning.com/hatcher3"><em>Lucene in Action, Second Edition</em></a>, completely revises and updates the best-selling first edition and remains the authoritative book on Lucene. This book shows you how to index your documents, &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.manning.com/hatcher3"><img class="alignleft" title="Lucene in Action, 2nd edition" src="http://www.manning.com/hatcher3/hatcher3_cover150.jpg" alt="" width="150" height="188" /></a><a title="Lucene in Action, 2nd edition" href="http://www.manning.com/hatcher3">Lucene in Action, 2nd edition</a> is now available through the Manning Early Access Program. We&#8217;ve arranged for an exclusive discount, on either printbook+ebook or just the ebook, for our readers. Simply enter the code <em>lucene40</em> and get 40% off the book until April 1, 2009.</p>
<p><a title="Lucene in Action, 2nd edition" href="http://www.manning.com/hatcher3"><em>Lucene in Action, Second Edition</em></a>, completely revises and updates the best-selling first edition and remains the authoritative book on Lucene. This book shows you how to index your documents, including types such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, and filtering, and covers the numerous changes to Lucene since the first edition. All source code has been updated to the latest APIs (2.4.x) and aims to be current with the upcoming Lucene 2.9 and 3.0 APIs.</p>
<p><strong>What&#8217;s been updated in the 2nd edition:</strong></p>
<ul>
<li>Updating and deleting documents using IndexWriter</li>
<li>Using the different LockFactory, DeletionPolicy, MergePolicy and MergeScheduler implementations that have been factored out</li>
<li>Using the new autoCommit option in IndexWriter</li>
<li>Understanding simplifications to Lucene&#8217;s locking</li>
<li>Adding payloads to your index and using them with BoostingTermQuery</li>
<li>Using Function queries</li>
<li>Using FieldSelector to speed up loading of stored fields</li>
<li>Using IndexReader.reopen() to efficiently open a new reader from an existing one</li>
<li>Measuring performance using the &#8220;benchmark&#8221; contrib package</li>
<li>Tuning the indexing or searching speed</li>
<li>Using threads to gain concurrency</li>
<li>Managing usage of resources like memory, disk, and file descriptors</li>
<li>Making a backup copy of your index without pausing indexing</li>
<li>Debugging common problems</li>
</ul>
<p>Still to come (and if you purchase now, you will receive the added content through MEAP and in the final print book): a chapter on Solr and a case study on our very own <a title="LucidFind" href="http://www.lucidimagination.com/search">LucidFind</a> search system of the entire Lucene ecosystem.</p>
<p>To purchase, visit <a href="http://www.manning.com/hatcher3">http://www.manning.com/hatcher3</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/03/11/lia2/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Sorting, Faceting and Schema Design in Solr</title>
		<link>http://www.lucidimagination.com/blog/2009/02/09/sorting-faceting-and-schema-design-in-solr/</link>
		<comments>http://www.lucidimagination.com/blog/2009/02/09/sorting-faceting-and-schema-design-in-solr/#comments</comments>
		<pubDate>Mon, 09 Feb 2009 19:11:09 +0000</pubDate>
		<dc:creator>Grant Ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[schema]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[best practices]]></category>
		<category><![CDATA[Grant Ingersoll]]></category>
		<category><![CDATA[schema design]]></category>
		<category><![CDATA[sint]]></category>
		<category><![CDATA[sortable]]></category>

		<guid isPermaLink="false">http://blog.lucidimagination.com/?p=42</guid>
		<description><![CDATA[<p>I was recently with a client doing a <a href="http://www.lucidimagination.com/How-We-Can-Help/">Best Practices assessment</a> when I came across a common source of confusion related to sorting, faceting and schema design.</p>
<p>As background, Solr provides a <a href="http://wiki.apache.org/solr/SchemaXml">schema</a> that describes the Fields and Field Types (FT) that are used by an application.  Field Types describe how Solr should handle the information contained in a Field.  For instance, the integer FT tells Solr to treat the contents of any Field of &#8230;</p>]]></description>
			<content:encoded><![CDATA[<p>I was recently with a client doing a <a href="http://www.lucidimagination.com/How-We-Can-Help/">Best Practices assessment</a> when I came across a common source of confusion related to sorting, faceting and schema design.</p>
<p>As background, Solr provides a <a href="http://wiki.apache.org/solr/SchemaXml">schema</a> that describes the Fields and Field Types (FT) that are used by an application. Field Types describe how Solr should handle the information contained in a Field. For instance, the integer FT tells Solr to treat the contents of any Field of type integer as, you guessed it, an integer. By integer here, I mean good old-fashioned Java ints. Solr provides other FTs like long, double, float, string, date, as well as Text (which can be associated with Lucene&#8217;s analysis process). Additionally, Solr provides several &#8220;sortable&#8221; FTs such as sint, slong, sdouble and sfloat. Therein lies the confusion. I think what happens is developers hear the word &#8220;sortable&#8221; and think they should use the sortable FT for any field they want to sort results by. However, there is some subtlety here. Namely, &#8220;sortable&#8221; FTs manipulate the content so that the lexicographic order is the same as the numeric order for use during search. Sortables are thus really meant to be used when doing things like range queries (e.g. <code>price:[2 TO 100]</code>) and not for sorting as it relates to returning results. Due to these required changes, sortables take up more space in the index (and in memory) than their non-sortable compadres.</p>
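<p>For reference, the two integer flavors are declared roughly as follows. The class names and attributes below are how I recall the example schema.xml of this era, so treat them as an assumption and check them against your own copy:</p>
<pre>&lt;!-- Plain integer: compact, but its indexed terms do not order numerically. --&gt;
&lt;fieldType name="integer" class="solr.IntField" omitNorms="true"/&gt;

&lt;!-- "Sortable" integer: terms are encoded so that lexicographic order matches numeric order. --&gt;
&lt;fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/&gt;
</pre>
<p>The difference matters mostly for range queries such as <code>price:[2 TO 100]</code>, which rely on the order of the indexed terms.</p>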
<p>What&#8217;s this got to do with schema design?  Well, this client had three fields, all defined as sortable integer FTs, as in:</p>
<ol>
<li>fieldOriginal &#8211; The source of the content.  This was the main field used for sorting.</li>
<li>fieldSearch &#8211; Copy field of Original, but rounded to the nearest 100.  This was the main field for searching.</li>
<li>fieldFacet &#8211; Copy field of Original, but rounded based on a percentage of the original value so as to provide a sliding scale for faceting.  This was the main field used for faceting.</li>
</ol>
<p>In this case, the client was using the Original for sorting, Search for searching, and Facet for faceting.  They were not doing any range queries, so they did not need fieldSearch to be &#8220;sortable&#8221;.  Furthermore, the Original field had over 1 million unique terms, so sorting on it was taking up a good chunk of memory and disk space.  The other two fields were smaller, so the cost of sortables was not that big of a deal.  Finally, this field &#8220;pattern&#8221; was replicated for several other fields as well, some of which also had a significant number of unique terms.</p>
<p>Thus, simply by changing the Fields to use integers where appropriate, we significantly reduced the memory footprint and the disk space required in this client application.</p>
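<p>As a sketch of what such a change can look like in schema.xml, using the three field names from the list above: the <code>indexed</code> and <code>stored</code> attributes, and the choice of exactly which fields to switch, are assumptions for illustration rather than the client&#8217;s actual configuration.</p>
<pre>&lt;!-- Before: every field used the sortable type. --&gt;
&lt;field name="fieldOriginal" type="sint" indexed="true" stored="true"/&gt;
&lt;field name="fieldSearch"   type="sint" indexed="true" stored="true"/&gt;
&lt;field name="fieldFacet"    type="sint" indexed="true" stored="true"/&gt;

&lt;!-- After: no range queries are run, so the large fields can use plain integers. --&gt;
&lt;field name="fieldOriginal" type="integer" indexed="true" stored="true"/&gt;
&lt;field name="fieldSearch"   type="integer" indexed="true" stored="true"/&gt;
&lt;!-- Small field: the extra cost of the sortable type is minor either way. --&gt;
&lt;field name="fieldFacet"    type="sint" indexed="true" stored="true"/&gt;
</pre>
<p>The biggest win comes from the field with the most unique terms, which in this case was fieldOriginal.</p>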
<p>So, as is always the case, pay close attention to your schema design.  While the Solr example schema is pretty good out of the box, you shouldn&#8217;t just take it as gospel, either.  Spend some time thinking about your needs during design and it will likely save you much time later when debugging and testing your application.</p>
<p>**UPDATE**:  Note, making these changes will require you to re-index.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lucidimagination.com/blog/2009/02/09/sorting-faceting-and-schema-design-in-solr/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>

