The last time I wrote about integrating Apache Nutch with Apache Solr (about two years ago), it was quite difficult to integrate the two components: you had to apply patches, hunt down required components from various places, and so on. Now there is an easier way. The soon to be released Nutch 1.0 contains Solr integration “out of the box”. There are many different ways to take advantage of this new feature, but I am just going to go through one of them here. In this solution, Solr will be used as the only source for serving search results (including snippets). This way you can totally decouple your search application from Nutch and still use Nutch where it is at its best: crawling and extracting the content. Using Solr as the search backend, on the other hand, allows you to use all of the advanced features of a Solr server, such as query spell checking, “more like this” suggestions, data replication and easy query time relevancy tuning, to mention just a few.

Why Nutch instead of a simpler Fetcher?
One possible way to implement something similar to what I present here would be to use a simpler crawler framework such as Apache Droids. But using Nutch gives you some pretty nice advantages. One is that Nutch provides a complete set of features you commonly need for a generic web search application. Another is that it is a highly scalable and relatively feature-rich crawler (this does not mean that you cannot do the same with some other framework). Nutch offers features like politeness (it obeys robots.txt rules), robustness and scalability (Nutch runs on Hadoop, so you can run it on a single machine or on a cluster of 100 machines), quality (you can bias the crawling to fetch “important” pages first) and extensibility (there are many APIs where you can plug in your own functionality). One of the most important single features Nutch provides out of the box is, in my subjective opinion, the link database. You might already know that Nutch tracks links between pages, so the relevancy of search results within a collection of interlinked documents goes well beyond the naive case where you index documents without link information and anchor texts.
Setup
The first step to get started is to download the required software components, namely Apache Solr and Nutch.
1. Download Solr version 1.3.0 or LucidWorks for Solr from the download page
2. Extract Solr package
3. Download Nutch version 1.0 or later (alternatively, download the nightly version of Nutch that contains the required functionality)
4. Extract the Nutch package
tar xzf apache-nutch-1.0.tar.gz
5. Configure Solr
For the sake of simplicity we are going to use the example configuration of Solr as a base.
a. Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (overwriting the existing file)
We want to allow Solr to create the snippets for search results so we need to store the content in addition to indexing it:
b. Change schema.xml so that the stored attribute of field “content” is true.
<field name="content" type="text" stored="true" indexed="true"/>
We want to be able to tweak the relevancy of queries easily, so we’ll create a new dismax request handler configuration for our use case:
d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it
<requestHandler name="/nutch" class="solr.SearchHandler" >
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      content^0.5 anchor^1.0 title^1.2
    </str>
    <str name="pf">
      content^0.5 anchor^1.5 title^1.2 site^1.5
    </str>
    <str name="fl">
      url
    </str>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>
6. Start Solr
cd apache-solr-1.3.0/example
java -jar start.jar
7. Configure Nutch
a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its contents with the following (we specify our crawler name, the active plugins, and limit the maximum URL count for a single host per run to 100):
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace its content with the following:
-^(https|telnet|file|ftp|mailto):

# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# allow urls in lucidimagination.com domain
+^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/

# deny anything else
-.
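The regex URL filter reads these rules from top to bottom and lets the first matching pattern decide: a leading + accepts the URL, a leading - rejects it, and a URL that matches no rule is rejected. The shell function below is only a rough sketch of that behaviour, meant for checking a rule file by hand before a crawl. The name url_allowed and the rule file argument are hypothetical, and the sketch matches with grep -E, whose regex dialect is not identical to the Java regular expressions Nutch itself uses, so treat its answers as an approximation.

```shell
#!/bin/sh
# Approximate the first-match-wins evaluation of a regex-urlfilter.txt style file.
# Usage: url_allowed URL RULES_FILE   (exit status 0 = accepted, 1 = rejected)
url_allowed() {
    url="$1"
    rules="$2"
    while IFS= read -r rule || [ -n "$rule" ]; do
        # Blank lines and comment lines are not rules.
        case "$rule" in
            ''|'#'*) continue ;;
        esac
        # Everything after the leading + or - is the pattern.
        pattern=${rule#?}
        if printf '%s\n' "$url" | grep -Eq -e "$pattern"; then
            # The first rule that matches decides the outcome.
            case "$rule" in
                +*) return 0 ;;
                *)  return 1 ;;
            esac
        fi
    done < "$rules"
    # No rule matched: the URL is rejected.
    return 1
}
```

For example, `url_allowed "http://www.lucidimagination.com/blog/" conf/regex-urlfilter.txt && echo accepted` prints "accepted" only if the first rule that matches the URL is a + rule.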
8. Create a seed list (the initial urls to fetch)
mkdir urls
echo "http://www.lucidimagination.com/" > urls/seed.txt
9. Inject the seed URL(s) into the Nutch crawldb (execute in the Nutch directory)
bin/nutch inject crawl/crawldb urls
10. Generate fetch list, fetch and parse content
bin/nutch generate crawl/crawldb crawl/segments
The above command will generate a new segment directory under crawl/segments that at this point contains files that store the url(s) to be fetched. In the following commands we need the latest segment dir as parameter so we’ll store it in an environment variable:
export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
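If you would rather not depend on file modification times, the lookup can be wrapped in a small function. This is a sketch under one assumption: that segment directory names are timestamps (for example 20090309121314), so that sorting the names also sorts them chronologically. The function name latest_segment is hypothetical.

```shell
#!/bin/sh
# Print the path of the newest segment directory, or fail if there is none.
# Assumes segment names are timestamps, so the last name in sorted order is the newest.
latest_segment() {
    segments_dir="${1:-crawl/segments}"
    newest=$(ls "$segments_dir" 2>/dev/null | sort | tail -1)
    # Nothing found: the directory is missing or holds no segments yet.
    [ -n "$newest" ] || return 1
    printf '%s/%s\n' "$segments_dir" "$newest"
}
```

With this in place, `export SEGMENT=$(latest_segment crawl/segments)` sets the same variable as the command above, and the function returns a non-zero status when no segment exists.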
Now I launch the fetcher that actually goes to get the content:
bin/nutch fetch $SEGMENT -noParsing
Next I parse the content:
bin/nutch parse $SEGMENT
Then I update the Nutch crawldb. The updatedb command will store all new URLs discovered during the fetch and parse of the previous segment into the Nutch database so they can be fetched later. Nutch also stores information about the pages that were fetched so the same URLs won’t be fetched again and again.
bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
Now a full fetch cycle is completed. Next you can repeat step 10 a couple more times to get some more content.
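Repeating step 10 by hand gets tedious, so the cycle can be wrapped in a small script. The following is a minimal sketch that runs the same four commands in a loop; the function names crawl_round and run_rounds and the NUTCH and CRAWL_DIR variables are assumptions of this sketch, not part of Nutch, and it does nothing cleverer than stopping at the first command that fails.

```shell
#!/bin/sh
# Repeat the generate / fetch / parse / updatedb cycle from step 10.
# NUTCH and CRAWL_DIR are hypothetical knobs; the defaults match the paths used above.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# Run one full fetch cycle against the newest segment.
crawl_round() {
    "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" || return 1
    newest=$(ls -tr "$CRAWL_DIR/segments" | tail -1)
    # Stop if generate did not leave a segment behind.
    [ -n "$newest" ] || return 1
    segment="$CRAWL_DIR/segments/$newest"
    "$NUTCH" fetch "$segment" -noParsing || return 1
    "$NUTCH" parse "$segment" || return 1
    "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment" -filter -normalize
}

# Run the cycle the given number of times, stopping at the first failure.
run_rounds() {
    rounds="$1"
    i=1
    while [ "$i" -le "$rounds" ]; do
        echo "fetch cycle $i of $rounds"
        crawl_round || return 1
        i=$((i + 1))
    done
}
```

After injecting the seed list (step 9), calling `run_rounds 3` from the Nutch directory would perform three cycles; steps 11 and 12 are then run once at the end.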
11. Create linkdb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
12. Finally index all content from all segments to Solr
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
Now the indexed content is available through Solr. You can try executing searches from the Solr admin UI at
http://127.0.0.1:8983/solr/admin
or directly with a URL like
http://127.0.0.1:8983/solr/nutch/?q=solr&version=2.2&start=0&rows=10&indent=on&wt=json
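For scripted checks it can be handy to build that URL in one place. The helper below is a small sketch: nutch_query_url and SOLR_BASE are hypothetical names, the handler path and parameters are the ones from the example URL, and the search term is assumed to be already URL-encoded.

```shell
#!/bin/sh
# Build a query URL for the /nutch request handler configured in step 5.
# SOLR_BASE is a hypothetical variable; the default matches the example Solr instance.
SOLR_BASE="${SOLR_BASE:-http://127.0.0.1:8983/solr}"

# Usage: nutch_query_url TERM [ROWS]   (TERM must already be URL-encoded)
nutch_query_url() {
    term="$1"
    rows="${2:-10}"
    printf '%s/nutch/?q=%s&version=2.2&start=0&rows=%s&indent=on&wt=json\n' \
        "$SOLR_BASE" "$term" "$rows"
}
```

With Solr running, `curl "$(nutch_query_url solr 5)"` should return the first five matches as JSON.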
Conclusion
Nutch in combination with Solr is quite a powerful base on which to build your search application. Even though the base is solid, a few things are missing from the stack, and you will notice them soon if you start to index content on a larger scale. One of the missing features is duplicate content removal, but luckily there is an improvement issue for this in the Nutch Jira: https://issues.apache.org/jira/browse/NUTCH-684. Another missing piece, on the Solr side, is a feature called field collapsing (https://issues.apache.org/jira/browse/SOLR-236). Field collapsing could be used when displaying results so that, for example, at most two pages would be shown for a single host.
The setup explained here has one significant caveat you also need to keep in mind: scale. You cannot use this kind of setup with vertical scale (collection size) that goes beyond one Solr box. The horizontal scaling (query throughput) is still possible with the standard Solr replication tools.
I am new to Nutch so, pardon me if this a novice question. What does this sentence mean?
“The setup explained here has one significant caveat you also need to keep in mind: scale. You cannot use this kind of setup with vertical scale (collection size) that goes beyond one Solr box. The horizontal scaling (query throughput) is still possible with the standard Solr replication tools.”
March 9, 2009 12:13 — Taruvai
What I was trying to explain there is that the Nutch Solr indexer can currently only be used with a single Solr instance. That is why you cannot easily handle an “unlimited” number of docs with the setup presented here. Depending on your requirements and hardware you can still fit up to 10-20 million docs, perhaps more, in one Solr instance. If you need to go beyond that you can, for example, enhance the Nutch Solr indexer so it can use multiple Solr instances with the help of consistent hashing or something similar. I am sure that this problem will be tackled rather soon in either Solr or Nutch land.
March 9, 2009 22:20 — Sami Siren
Nice description.. working on first try…
March 28, 2009 22:00 — VIv
Hi, my drupal site uses apache-solr engine to index and query for data. It works great. I just installed nutch 1.0 to take advantage of nutch’s ability to crawl specific external sites for appropriate data and have these data indexed by the same apache-solr engine. I followed the instruction presented here and got my nutch 1.0 to work. Using the provided schema.xml that comes with nutch my site can query and display data coming from nutch crawling. Unfortunately the other data set is now no longer available for query. It is only available when I revert the schema.xml to the original one. Is it possible to combine the two indexes so they can be conveniently available to same apache-solr query. Would you please kindly point me in the right direction. Much thanks.
April 9, 2009 12:23 — Kham
Got it to work. Just a matter of careful merging of the two schema.xml files.
April 9, 2009 17:32 — Kham
You guys are great ! First shot right.
April 13, 2009 23:57 — Ray
Is there a way to get the full markup as opposed to the parsed text as content or another field?
April 18, 2009 07:21 — Ben
Hi, I wonder if there is a command that asks Nutch to return documents that have changed in the index and to show the changes.
Thanks for helping.
April 27, 2009 07:45 — aida
Hello,
How additional index can be added manually to nutch indexes, using solr?
May 12, 2009 14:06 — Alex
Hi, we followed step by step instructions above. When we went to view the results, we have noticed that the content is being translated into “Chinese” or some Asian language. We see results ONLY if we query in Asian language. what are we doing wrong?
May 20, 2009 13:52 — Sai Thumuluri
Sai, can you elaborate your problem a bit, how is the content translated? Do you have an example URL that demonstrates the problem? Thanks.
May 28, 2009 23:38 — Sami Siren
Thanks for the nice description. I have certain queries.
Once I crawl the websites using nutch I want to do the following:
1. Extract entities like people, places, and other master information. Is it possible in nutch?
2. I need the entire DOM of the page that is parsed. Is it possible in nutch? Or do I have to use some other mechanism?
The idea is: I have a project requirement where I have to crawl very specific sites, ‘extract specific values from the html DOM’ and index it. Next I want to search these values using solr.
Will Nutch+Solr suffice, or do I have to use something like Nutch+Solr+(Aperture/Tika)?
Thanks for helping
June 4, 2009 22:22 — Jagdish Yadav
If I follow these instructions I can get everything to work swimmingly, however I need to add some ‘custom’ static fields so I was hoping to employ index-extra plug in. I put the plugin in place, crawl with seemingly no problem, but when I try to commit to solr, I get nothing added to the index, and no error. Has anyone else tried to use index-extra along with the nutch/solr integration?
June 11, 2009 09:06 — Edward
I think our problem is related to having our content in charset UTF-16, can you tell us how we can change charset to UTF-8 during index/crawl?
June 15, 2009 14:45 — Sai Thumuluri
I think I have all the nutch stuff working in that the data is searchable through the nutch webapp but I can’t get any results through Solr
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
This makes the computer work heavily for an hour or so but I can’t see where the solr index goes.
http://127.0.0.1:8983/solr/admin gives me a web interface but I can’t get it to return any results.
Any idea what I should be looking at?
Thanks
July 7, 2009 10:17 — AlexMc
I’m having the same problem as AlexMC too.
July 13, 2009 11:24 — Zaihan
Hi,
I need some help here. In our project I need to pass an XML file as input to Nutch instead of the .txt files given in the example. When I tried to pass an XML input file I got the following exception:
WARN crawl.Injector – Skipping :java.net.MalformedURLException: no protocol:
2009-09-15 18:21:36,556 INFO crawl.Injector – Injector: Merging injected urls into crawl db.
2009-09-15 18:21:38,321 WARN util.NativeCodeLoader – Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Please help me in resolving the same and pls let me know how can I pass .xml file as input.
Thanks
September 15, 2009 05:34 — Ravikiran A
Hello,
I am new to Nutch too, excuse my novice question. What does this sentence mean?
“Now a full Fetch cycle is completed. Next you can repeat step 10 couple of more times to get some more content”.
is there any possibility to have these steps done automatically?
September 21, 2009 06:37 — LPillet
How can we use nutch and solr for multiple sites?
What are the steps I need to take care in configuring Nutch and Solr for multiple sites
November 4, 2009 12:25 — ACG
Adding the fragment mentioned above to apache-solr-1.3.0/example/solr/conf/solrconfig.xml
makes solr fail. I get the following exception and the web site throws an error. Please note solr works with the original solrconfig.xml. The problem starts as soon as I add the fragment.
sorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.mortbay.start.Main.invokeMain(Main.java:183)
at org.mortbay.start.Main.start(Main.java:497)
at org.mortbay.start.Main.main(Main.java:115)
Nov 16, 2009 10:02:11 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: invalid boolean value:
at org.apache.solr.common.util.StrUtils.parseBool(StrUtils.java:237)
at org.apache.solr.common.util.DOMUtil.addToNamedList(DOMUtil.java:140
at org.apache.solr.common.util.DOMUtil.nodesToNamedList(DOMUtil.java:9
at org.apache.solr.common.util.DOMUtil.childNodesToNamedList(DOMUtil.j
a:88)
at org.apache.solr.common.util.DOMUtil.addToNamedList(DOMUtil.java:142
at org.apache.solr.common.util.DOMUtil.nodesToNamedList(DOMUtil.java:9
at org.apache.solr.common.util.DOMUtil.childNodesToNamedList(DOMUtil.j
a:88)
at org.apache.solr.core.PluginInfo.(PluginInfo.java:54)
at org.apache.solr.core.SolrConfig.readPluginInfos(SolrConfig.java:220
at org.apache.solr.core.SolrConfig.loadPluginInfo(SolrConfig.java:212)
at org.apache.solr.core.SolrConfig.(SolrConfig.java:184)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreConta
er.java:134)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.
va:83)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
40)
at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.
va:594)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.j
a:1218)
at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.jav
500)
at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:4
)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
40)
at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollecti
.java:147)
at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextH
dlerCollection.java:161)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
40)
at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollecti
.java:147)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
40)
at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.jav
117)
at org.mortbay.jetty.Server.doStart(Server.java:210)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
40)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImp
java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcc
sorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.mortbay.start.Main.invokeMain(Main.java:183)
at org.mortbay.start.Main.start(Main.java:497)
at org.mortbay.start.Main.main(Main.java:115)
November 16, 2009 22:05 — Neha
Any ideas what might be wrong? Any help would be highly appreciated.
November 16, 2009 22:05 — Neha
Hi,
I had the same Problem,
just change the xml syntax line as follows:
true
this should work
all the best
clausi
November 18, 2009 10:35 — clausi
sorry there is something missing
this line:
<bool name="hl">true</bool>
November 18, 2009 10:36 — clausi
in solrconfig.xml, tag:
<bool hl="true"/>
should be
<bool name="hl">true</bool>
November 20, 2009 08:13 — igor vrdoljak
NOTE: In the request handler config section there are three settings which need to be edited:
- Where it says “int” should be changed to “str”.
- Where it says “float” should be changed to “str”.
- Where it says “boolean” should be changed to “str”.
With those changes the “/nutch” request handler will be configured correctly and Solr will start as expected.
November 25, 2009 22:09 — Jay Hill
Thanks Jay Hill, these instructions broke my Solr configuration until I made the changes you mentioned.
I was also having problems getting Nutch 1.0 to run on top of Xampp Tomcat for Windows and I discovered I needed to use a more recent version of Java.
The Xampp Tomcat distrib had JRE 1.5 rev 20, and kept failing to run Nutch, logging “java.lang.UnsupportedClassVersionError: Bad version number in .class file “. I worked around this by installing a newer Java version (JRE 1.6 rev 16) and pointing my JRE_HOME to the new location.
Thx for a really insightful article
November 28, 2009 17:36 — artemgy
Very useful and straightforward guide. Thank you.
Just one question: does the solrindex command have options or configuration parameters that I can use to, say, index only pages containing/not_containing certain words and discard all the rest?
G
December 15, 2009 03:40 — Giona
How can I make this multi crawler and avoid duplicate content using multi crawler ??
December 29, 2009 16:43 — jason
How can I rank the content after getting the content?
December 29, 2009 16:47 — jason
Do I use “/nutch” or the path to my nutch directory (e.g. /opt/nutch-1.0)?
January 5, 2010 16:18 — Ken
I executed ‘java -jar start.jar’, and got this at the very end:
Started SocketConnector @ 0.0.0.0:8983
I’m able to get solr to load when I go to http://localhost:8983/solr/admin
Does this mean solr is up and running, and that I should open a new terminal to do other things rather than pressing “Ctrl+C”?
January 5, 2010 17:32 — Ken
Hi
Can we use this setup to crawl multiple domains and return results for specific domains depending on which user it is ?
Thanks
January 6, 2010 17:28 — shantanu
[...] free search [...] you can integrate Nutch with Solr, as described in this article: Using Nutch with Solr, in order to get a professional search engine for a business. (mainly [...]
January 11, 2010 14:23 — Christophe Nowicki » Faire son propre moteur de recherche avec Nutch
tnx for the nice and clear howto. Any idea on how to use instead Nutch interface with the Solr capabilities? I mean, fetch by Nutch, search by Nutch interface (very nice and easy) and engine provided by Solr?
thanks
January 21, 2010 05:34 — Davide
Installed solr and nutch, and executed this tutorial. Had an error during the execution of step 9.:
plugin.PluginRepository – Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2010-02-25 17:50:49,346 WARN regex.RegexURLNormalizer – can’t find rules for scope ‘inject’, using default
2010-02-25 17:50:51,649 WARN mapred.LocalJobRunner – job_local_0001
java.lang.NumberFormatException: For input string: “free”
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Long.parseLong(Long.java:410)
at java.lang.Long.parseLong(Long.java:468)
at org.apache.hadoop.fs.DF.parseExecResult(DF.java:122)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:179)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
2010-02-25 17:50:52,471 FATAL crawl.Injector – Injector: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
at org.apache.nutch.crawl.Injector.run(Injector.java:190)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Injector.main(Injector.java:180)
Any idea what this relates to?
February 25, 2010 09:54 — José Grilo
I installed solr and was able to access the admin screen. When I install nutch and follow the command in 5d to paste the fragment in solrconfig.xml, solr’s admin fails.
The error I get is:
HTTP ERROR: 404
missing core name in path
RequestURI=/solr/admin/index.jsp
Why would adding the snippet cause that?
February 25, 2010 13:18 — Ian
Hi,
I tried to submit a nutch crawl via both the crawl -solr option and the solrindex command. All the attempts failed and the solr log said:
SolrException: ERROR: multiple values encountered for non multiValued copy field id: http://www.x.xx.x.
I couldn’t find the problem. I copied schema.xml exactly from the nutch 1.1 config dir to the solr conf dir.
bin/nutch crawl urls -dir datalist -threads 200 -depth 2 -solr http://127.0.0.1:8983/solr
thanks
March 23, 2010 02:01 — LoRdxx
Found a fix for the problems with id field:
remove the line
It appears that the url is used as an id by default.
March 26, 2010 16:37 — Jem
Jem, thanks for the last comment (really really useful)
April 7, 2010 08:44 — elaragon
Hi,
I have integrated nutch with solr. I am using Carrot2 to view the results. I am able to view the title, content and url fields in the output. I do not need the content field to be shown; instead I require a field similar to a description field (only one or two lines to display). Is there any other configuration required to achieve this?
Thanks in advance.
April 16, 2010 04:10 — sharmy
Great intro to nutch and solr. Only had to fix the one (bool) line in solrconfig.xml that clausi noted, and update JDK to 1.6 as artemgy did. (But I didn’t make the 3 changes Jay Hill recommended.) Had the same problem as AlexMC and Zaihan. Everything ran, but nothing was committed to solr. It was caused by missing a step the first time through. Deleted everything under crawl/crawldb, crawl/linkdb, and crawl/segments and went back to step 10. Working nicely now.
Thanks Sami, and all those who posted their experiences with this.
April 22, 2010 14:49 — ck
The article is really nice. Thanks Sami.
I have a huge amount of data already indexed by Nutch. I want the faceting feature to be implemented. For that I decided to install SOLR. I’m going to integrate SOLR with Nutch as per this article. But I still don’t know how to do faceting on data already indexed by Nutch. Can someone help me in this regard? Sami, can you please help me in this regard? Thanks everyone for your time.
April 26, 2010 07:00 — Avi
The line Jem tried to post that was stripped out is the copyField source=”url” dest=”id” tag in the schema.xml file in solr. I had the same “multiple values encountered for non multiValued copy field id” and removing that fixed it.
April 26, 2010 12:20 — Patrick
I followed the instructions and everything worked well.
But I got only one Index in Solr. Where can i define the -depth and -topN?
Is it right here:
bin/nutch generate crawl/crawldb crawl/segments -depth 3 -topN 50 ??
Please help me.
May 18, 2010 05:46 — Olli
This is cool – got it all working with Solr, along with the DataImport stuff to integrate RDBMS too.
Solid article.
May 25, 2010 04:30 — Steven Livingstone
hi,
what about sharding? how to use nutch solrindex with several sharded solr instances?
May 27, 2010 06:44 — denis lihachev
I followed the instructions and everything worked well.
But I got only one Index in Solr. Where can I define the -depth and -topN?
Please Reply.
Thanx
July 12, 2010 11:44 — abhi
Thanks for a great howto, Sami!
I have created a howto for Ubuntu 10.04 Lucid Lynx based on your work, that is available here:
http://ubuntuforums.org/showthread.php?p=9596257
cheers!
Gustavo
July 16, 2010 01:06 — Gustavo Zaera
I have a question if you don’t mind. I’m trying to install Nutch running on Hadoop (on 5 servers) in a way that it produces multiple indices. I want to have distributed search on Solr and I want Nutch to produce 16 different indices and then push them to master Solr. Can you tell me how I can do that?
Thanks a lot!
Sara
August 2, 2010 08:56 — Sara
[...] match ubuntu 10.04 (Lucid). It is based on the work of Sami Siren at Lucid Imagination available on http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ [...]
August 16, 2010 10:13 — The Solr+Nutch on Ubuntu Server HOWTO on… « Information Processing
Sami, In your tutorial – how and where do I specify depth of crawl. Thanks for your help.
August 20, 2010 13:10 — Sai Thumuluri
I am getting stuck from stage 9 onwards. What directory should I be in when I am trying to enter the following command
“bin/nutch inject crawl/crawldb urls”
I am aware that this is probably quite a simple problem to overcome so apologies in advance. An identical problem was posted here January 5, 2010 16:18 — Ken
Also, has anyone got a suggested answer to April 16, 2010 04:10 — sharmy, as this would provide excellent help for my current project.
Thank you
Lewis
August 25, 2010 07:00 — Lewis Mc