<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Integrating Apache Mahout with Apache Lucene and Solr &#8211; Part I (of 3)</title>
	<atom:link href="http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/</link>
	<description>Exclusively dedicated to Apache Lucene/Solr open source search technology</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:13:03 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
	<item>
		<title>By: Janki</title>
		<link>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/comment-page-1/#comment-8664</link>
		<dc:creator>Janki</dc:creator>
		<pubDate>Thu, 12 Jan 2012 09:34:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1851#comment-8664</guid>
		<description>Hi,

I am new to Mahout and Lucene. I want to cluster users. My data has 7 dimensions (features). I have tried kMeans clustering with data from a CSV file, and now I want to read the data from Lucene. My question: when Lucene documents are converted to vectors, how are the dimensions determined? How should I build the Lucene documents if I want vectors with n dimensions (features)?</description>
		<content:encoded><![CDATA[<p>Hi,</p>
<p>I am new to Mahout and Lucene. I want to cluster users. My data has 7 dimensions (features). I have tried kMeans clustering with data from a CSV file, and now I want to read the data from Lucene. My question: when Lucene documents are converted to vectors, how are the dimensions determined? How should I build the Lucene documents if I want vectors with n dimensions (features)?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark Rosenberg</title>
		<link>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/comment-page-1/#comment-8366</link>
		<dc:creator>Mark Rosenberg</dc:creator>
		<pubDate>Wed, 02 Nov 2011 21:24:30 +0000</pubDate>
		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1851#comment-8366</guid>
		<description>Trying the cookbook example provided by the article with Mahout trunk and Solr 3.4.0. Looks like --field title-clustering doesn&#039;t have enough term vectors so I may be running afoul of https://issues.apache.org/jira/browse/MAHOUT-675.

11/11/02 14:16:41 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for title-clustering
Exception in thread &quot;main&quot; java.lang.IllegalStateException: There are too many documents that do not have a term vector for title-clustering.

If I use --field text then mahout completes normally and writes 17 vectors. The recommendation to use copyField to accumulate field contents in a new title-clustering field appears to be mandatory if the article&#039;s mahout command line is to be used without modification.</description>
		<content:encoded><![CDATA[<p>Trying the cookbook example provided by the article with Mahout trunk and Solr 3.4.0. Looks like --field title-clustering doesn&#8217;t have enough term vectors so I may be running afoul of <a href="https://issues.apache.org/jira/browse/MAHOUT-675" rel="nofollow">https://issues.apache.org/jira/browse/MAHOUT-675</a>.</p>
<p>11/11/02 14:16:41 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for title-clustering<br />
Exception in thread &quot;main&quot; java.lang.IllegalStateException: There are too many documents that do not have a term vector for title-clustering.</p>
<p>If I use --field text then mahout completes normally and writes 17 vectors. The recommendation to use copyField to accumulate field contents in a new title-clustering field appears to be mandatory if the article&#8217;s mahout command line is to be used without modification.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Grant Ingersoll</title>
		<link>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/comment-page-1/#comment-8347</link>
		<dc:creator>Grant Ingersoll</dc:creator>
		<pubDate>Sat, 29 Oct 2011 12:26:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1851#comment-8347</guid>
		<description>Mahout trunk should now be on Lucene 3.4. In general, if you are replacing the jars, I think you need to make sure they are packaged into Mahout&#039;s Job jars correctly.</description>
		<content:encoded><![CDATA[<p>Mahout trunk should now be on Lucene 3.4. In general, if you are replacing the jars, I think you need to make sure they are packaged into Mahout&#8217;s Job jars correctly.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bob Stewart</title>
		<link>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/comment-page-1/#comment-8340</link>
		<dc:creator>Bob Stewart</dc:creator>
		<pubDate>Thu, 27 Oct 2011 15:02:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1851#comment-8340</guid>
		<description>I have the same problem; it seems that the Lucene version is out of sync between Solr and Mahout. The question is, how exactly do I bring them into sync? My Mahout has lucene-core-3.1.0.jar in the mahout/lib directory, and I have Solr 3.4. I downloaded the Lucene 3.4 jar files and replaced the Lucene jars inside mahout/lib, but that did not work (it doesn&#039;t seem that Mahout loads those Lucene jars at all). So how do I make sure they use the same Lucene version? I am somewhat new to the Java/Linux world.</description>
		<content:encoded><![CDATA[<p>I have the same problem; it seems that the Lucene version is out of sync between Solr and Mahout. The question is, how exactly do I bring them into sync? My Mahout has lucene-core-3.1.0.jar in the mahout/lib directory, and I have Solr 3.4. I downloaded the Lucene 3.4 jar files and replaced the Lucene jars inside mahout/lib, but that did not work (it doesn&#8217;t seem that Mahout loads those Lucene jars at all). So how do I make sure they use the same Lucene version? I am somewhat new to the Java/Linux world.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Grant Ingersoll</title>
		<link>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/comment-page-1/#comment-8293</link>
		<dc:creator>Grant Ingersoll</dc:creator>
		<pubDate>Fri, 21 Oct 2011 06:36:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1851#comment-8293</guid>
		<description>You should be fine upgrading Mahout&#039;s version. In fact, we should do it in Mahout. Feel free to open an issue there. Note, though, that the Java 7 issue and the 3.4.0 issue are separate. The 3.4.0 issue was due to an fsync issue in Lucene 3.3.0.</description>
		<content:encoded><![CDATA[<p>You should be fine upgrading Mahout&#8217;s version. In fact, we should do it in Mahout. Feel free to open an issue there. Note, though, that the Java 7 issue and the 3.4.0 issue are separate. The 3.4.0 issue was due to an fsync issue in Lucene 3.3.0.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark Rosenberg</title>
		<link>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/comment-page-1/#comment-8246</link>
		<dc:creator>Mark Rosenberg</dc:creator>
		<pubDate>Thu, 20 Oct 2011 21:59:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1851#comment-8246</guid>
		<description>Hi Grant,

Thanks for the quick response! We seem to be in an awkward situation WRT Mahout and Solr Lucene version dependencies. I&#039;m using Mahout 0.6 snapshot, which has a Lucene 3.3.0 dependency. Due to Oracle Java 7 sabotage, Lucene users are being advised to upgrade to 3.4.0. Do I have an alternative to using the Mahout 0.5 release?</description>
		<content:encoded><![CDATA[<p>Hi Grant,</p>
<p>Thanks for the quick response! We seem to be in an awkward situation WRT Mahout and Solr Lucene version dependencies. I&#8217;m using Mahout 0.6 snapshot, which has a Lucene 3.3.0 dependency. Due to Oracle Java 7 sabotage, Lucene users are being advised to upgrade to 3.4.0. Do I have an alternative to using the Mahout 0.5 release?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Grant Ingersoll</title>
		<link>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/comment-page-1/#comment-8238</link>
		<dc:creator>Grant Ingersoll</dc:creator>
		<pubDate>Thu, 20 Oct 2011 20:24:30 +0000</pubDate>
		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1851#comment-8238</guid>
		<description>Hi Mark,

The issue here is likely a version mismatch between the Lucene version in Mahout and the Lucene version you created your index with. If you sync those up, you should be fine.</description>
		<content:encoded><![CDATA[<p>Hi Mark,</p>
<p>The issue here is likely a version mismatch between the Lucene version in Mahout and the Lucene version you created your index with. If you sync those up, you should be fine.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark Rosenberg</title>
		<link>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/comment-page-1/#comment-8225</link>
		<dc:creator>Mark Rosenberg</dc:creator>
		<pubDate>Thu, 20 Oct 2011 18:14:08 +0000</pubDate>
		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1851#comment-8225</guid>
		<description>I&#039;m having some trouble getting this to work with my own data. I issue the following command:

mahout lucene.vector --dir /home/markr/shgs/apache-solr-3.4.0/example/solr/data/index/ --output /tmp/part-out.vec --field content_encoded --idField id --dictOut /tmp/dict.out --norm 2

My intent is to generate term vectors for the content_encoded field whose schema.xml entry has the termVectors=&quot;true&quot; attribute setting. There is also a field named &#039;id&#039;. My data was imported into a sqlite3 db, and id is &#039;not null&#039;, but content_encoded may be null. When I run, I get the SLF4J multiple binding warning (just a warning?), and then the following exception:

Exception in thread &quot;main&quot; org.apache.lucene.index.CorruptIndexException: unrecognized format -3 in file &quot;_b.fnm&quot;
	at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:351)
	at org.apache.lucene.index.FieldInfos.&lt;init&gt;(FieldInfos.java:71)
	at org.apache.lucene.index.SegmentCoreReaders.&lt;init&gt;(SegmentCoreReaders.java:72)
	at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:114)
	at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:92)
	at org.apache.lucene.index.DirectoryReader.&lt;init&gt;(DirectoryReader.java:113)
	at org.apache.lucene.index.ReadOnlyDirectoryReader.&lt;init&gt;(ReadOnlyDirectoryReader.java:29)
	at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:81)
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:750)
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
	at org.apache.lucene.index.IndexReader.open(IndexReader.java:428)
	at org.apache.lucene.index.IndexReader.open(IndexReader.java:288)
	at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:84)
	at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:250)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)

Advice on how to debug this problem would be greatly appreciated.

Mark</description>
		<content:encoded><![CDATA[<p>I&#8217;m having some trouble getting this to work with my own data. I issue the following command:</p>
<p>mahout lucene.vector --dir /home/markr/shgs/apache-solr-3.4.0/example/solr/data/index/ --output /tmp/part-out.vec --field content_encoded --idField id --dictOut /tmp/dict.out --norm 2</p>
<p>My intent is to generate term vectors for the content_encoded field whose schema.xml entry has the termVectors=&quot;true&quot; attribute setting. There is also a field named &#8216;id&#8217;. My data was imported into a sqlite3 db, and id is &#8216;not null&#8217;, but content_encoded may be null. When I run, I get the SLF4J multiple binding warning (just a warning?), and then the following exception:</p>
<p>Exception in thread &quot;main&quot; org.apache.lucene.index.CorruptIndexException: unrecognized format -3 in file &quot;_b.fnm&quot;<br />
	at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:351)<br />
	at org.apache.lucene.index.FieldInfos.&lt;init&gt;(FieldInfos.java:71)<br />
	at org.apache.lucene.index.SegmentCoreReaders.&lt;init&gt;(SegmentCoreReaders.java:72)<br />
	at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:114)<br />
	at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:92)<br />
	at org.apache.lucene.index.DirectoryReader.&lt;init&gt;(DirectoryReader.java:113)<br />
	at org.apache.lucene.index.ReadOnlyDirectoryReader.&lt;init&gt;(ReadOnlyDirectoryReader.java:29)<br />
	at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:81)<br />
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:750)<br />
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)<br />
	at org.apache.lucene.index.IndexReader.open(IndexReader.java:428)<br />
	at org.apache.lucene.index.IndexReader.open(IndexReader.java:288)<br />
	at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:84)<br />
	at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:250)<br />
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)<br />
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)<br />
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)<br />
	at java.lang.reflect.Method.invoke(Method.java:616)<br />
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)<br />
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)<br />
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)</p>
<p>Advice on how to debug this problem would be greatly appreciated.</p>
<p>Mark</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Moshe Lichman</title>
		<link>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/comment-page-1/#comment-7487</link>
		<dc:creator>Moshe Lichman</dc:creator>
		<pubDate>Tue, 29 Mar 2011 17:41:47 +0000</pubDate>
		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1851#comment-7487</guid>
		<description>Hi, great post.
I have a question, though: I&#039;m running Mahout from Eclipse and I created the vector from my Lucene index. Two files were created:
 1. The vector file.
 2. The Dict file.

When running FuzzyKMeans on the vector file, I got an exception (a NotANumber exception) while the job was parsing it, because the vec file is a &#039;compiled&#039; file. Any ideas how to make it work?</description>
		<content:encoded><![CDATA[<p>Hi, great post.<br />
I have a question, though: I&#8217;m running Mahout from Eclipse and I created the vector from my Lucene index. Two files were created:<br />
 1. The vector file.<br />
 2. The Dict file.</p>
<p>When running FuzzyKMeans on the vector file, I got an exception (a NotANumber exception) while the job was parsing it, because the vec file is a &#8216;compiled&#8217; file. Any ideas how to make it work?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matthew Sacks</title>
		<link>http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/comment-page-1/#comment-7327</link>
		<dc:creator>Matthew Sacks</dc:creator>
		<pubDate>Thu, 10 Feb 2011 04:26:11 +0000</pubDate>
		<guid isPermaLink="false">http://www.lucidimagination.com/blog/?p=1851#comment-7327</guid>
		<description>Hi Grant,
If you eventually wanted to dump results into Solritas (VelocityResponseWriter), what would the flow of data need to look like? Raw Data-&gt;Lucene-&gt;Mahout-&gt;Solr?

Thanks,
Matthew</description>
		<content:encoded><![CDATA[<p>Hi Grant,<br />
If you eventually wanted to dump results into Solritas (VelocityResponseWriter), what would the flow of data need to look like? Raw Data-&gt;Lucene-&gt;Mahout-&gt;Solr?</p>
<p>Thanks,<br />
Matthew</p>
]]></content:encoded>
	</item>
</channel>
</rss>

