
March 9, 2009

Using Nutch with Solr

Posted by Sami Siren

There is an updated version about Nutch Solr integration available at http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/.
The last time I wrote about integrating Apache Nutch with Apache Solr (about two years ago), it was quite difficult to integrate the two components: you had to apply patches, hunt down required components from various places, and so on. Now there is an easier way. The soon-to-be-released Nutch 1.0 contains Solr integration “out of the box”. There are many different ways to take advantage of this new feature, but I am just going to go through one of them here. In this solution, Solr will be used as the only source for serving search results (including snippets). This way you can totally decouple your search application from Nutch and still use Nutch where it is at its best: crawling and extracting the content. Using Solr as the search backend, on the other hand, allows you to use all of the advanced features of a Solr server, such as query spell checking, “more like this” suggestions, data replication and easy query-time relevancy tuning, to mention just a few.

You might also be interested in:

  • Crawling in Open Source
  • Mastering Solr 1.4 – webinar
  • Solr 1.4 Download
  • Solr 1.4 Reference Guide

[Figure: solr-nutch-setup]


Why Nutch instead of a simpler Fetcher?

One possible way to implement something similar to what I present here would be to use a simpler crawler framework such as Apache Droids. But using Nutch gives you some pretty nice advantages. One of these is obviously the fact that Nutch provides a complete set of features you commonly need for a generic web search application. Another benefit of using Nutch is that it is a highly scalable and relatively feature-rich crawler (this does not mean that you cannot do the same with some other framework). Nutch offers features like politeness (it obeys robots.txt rules), robustness and scalability (Nutch runs on Hadoop, so you can run Nutch on a single machine or on a cluster of 100 machines), quality (you can bias the crawling to fetch “important” pages first) and extensibility (there are many APIs into which you can plug your own functionality).

One of the most important features Nutch provides out of the box is, in my subjective opinion, the link database. You might already know that Nutch tracks links between pages so that the relevancy of search results within a collection of interlinked documents goes well beyond the naive case where you index documents without link information and anchor texts.

Setup

The first step to get started is to download the required software components, namely Apache Solr and Nutch.

1. Download Solr version 1.3.0 or LucidWorks for Solr from the download page

2. Extract Solr package

3. Download Nutch version 1.0 or later (alternatively, download the nightly version of Nutch that contains the required functionality)

4. Extract the Nutch package

tar xzf apache-nutch-1.0.tar.gz

5. Configure Solr

For the sake of simplicity we are going to use the example configuration of Solr as a base.

a. Copy the provided Nutch schema (schema.xml) from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (overwriting the existing file)

We want to allow Solr to create the snippets for search results so we need to store the content in addition to indexing it:

b. Change schema.xml so that the stored attribute of field “content” is true.

<field name="content" type="text" stored="true" indexed="true"/>

We want to be able to tweak the relevancy of queries easily, so we’ll create a new dismax request handler configuration for our use case:

d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it

<requestHandler name="/nutch" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
content^0.5 anchor^1.0 title^1.2
</str>
<str name="pf">
content^0.5 anchor^1.5 title^1.2 site^1.5
</str>
<str name="fl">
url
</str>
<str name="mm">
2&lt;-1 5&lt;-2 6&lt;90%
</str>
<int name="ps">100</int>
<bool name="hl">true</bool>
<str name="q.alt">*:*</str>
<str name="hl.fl">title url content</str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.url.hl.fragsize">0</str>
<str name="f.url.hl.alternateField">url</str>
<str name="f.content.hl.fragmenter">regex</str>
</lst>
</requestHandler>

6. Start Solr

cd apache-solr-1.3.0/example
java -jar start.jar

7. Configure Nutch

a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its contents with the following (we specify our crawler name, the active plugins, and limit the maximum URL count for a single host per run to 100):

<?xml version="1.0"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>

b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace its contents with the following:

-^(https|telnet|file|ftp|mailto):
 
# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
 
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
 
# allow urls in lucidimagination.com domain
+^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
 
# deny anything else
-.
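The regex URL filter checks these rules from top to bottom, and the first pattern that matches decides whether a URL is kept (+) or dropped (-). The shell function below is only a rough sketch of that logic for experimenting with URLs before a crawl; it is not part of Nutch. The name url_allowed is made up for illustration, the suffix list is shortened, and the patterns are rewritten for grep -E, whose syntax differs slightly from the Java regular expressions Nutch uses.

```shell
# Hypothetical helper: approximates the filter file above with grep -E.
# Returns 0 if the URL would be kept, 1 if it would be dropped.
url_allowed() {
    url="$1"
    # rule 1: drop unwanted schemes
    printf '%s\n' "$url" | grep -Eq '^(https|telnet|file|ftp|mailto):' && return 1
    # rule 2: drop some suffixes (shortened list, for illustration only)
    printf '%s\n' "$url" | grep -Eq '\.(pdf|PDF|gif|GIF|jpg|JPG|png|PNG|css|js|JS|zip)$' && return 1
    # rule 3: drop URLs that look like queries
    printf '%s\n' "$url" | grep -Eq '[?*!@=]' && return 1
    # rule 4: keep URLs in the lucidimagination.com domain
    printf '%s\n' "$url" | grep -Eq '^http://([a-zA-Z0-9-]*\.)*lucidimagination\.com/' && return 0
    # final rule: drop anything else
    return 1
}
```

For example, url_allowed "http://www.lucidimagination.com/" succeeds, while the same URL with an https scheme or a query string fails, because an earlier rule matches first.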

8. Create a seed list (the initial urls to fetch)

mkdir urls
echo "http://www.lucidimagination.com/" > urls/seed.txt

9. Inject the seed URL(s) into the Nutch crawldb (execute in the Nutch directory, apache-nutch-1.0)

bin/nutch inject crawl/crawldb urls

10. Generate fetch list, fetch and parse content

bin/nutch generate crawl/crawldb crawl/segments

The above command will generate a new segment directory under crawl/segments that at this point contains files that store the url(s) to be fetched. In the following commands we need the latest segment dir as parameter so we’ll store it in an environment variable:

export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`

Now I launch the fetcher that actually goes to get the content:

bin/nutch fetch $SEGMENT -noParsing

Next I parse the content:

bin/nutch parse $SEGMENT

Then I update the Nutch crawldb. The updatedb command will store all new URLs discovered during the fetch and parse of the previous segment in the Nutch database so they can be fetched later. Nutch also stores information about the pages that were fetched so the same URLs won’t be fetched again and again.

bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize

Now a full fetch cycle is complete. Next you can repeat step 10 a couple more times to get some more content.
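Repeating step 10 by hand gets tedious, so the cycle can be wrapped in a small shell function. This is a minimal sketch, not something that ships with Nutch: the names crawl_rounds and latest_segment and the NUTCH variable are made up for illustration, and it assumes you run it from the Nutch directory with the same crawl/ layout used above. It relies on segment directories being named by timestamp, so the newest one sorts last.

```shell
# Path to the nutch launcher; override NUTCH if yours lives elsewhere (assumption).
NUTCH="${NUTCH:-bin/nutch}"

# Print the name of the newest segment directory (empty if there is none).
latest_segment() {
    ls crawl/segments 2>/dev/null | sort | tail -1
}

# Run the generate/fetch/parse/updatedb cycle of step 10 up to N times.
crawl_rounds() {
    rounds="$1"
    i=1
    while [ "$i" -le "$rounds" ]; do
        before="$(latest_segment)"
        "$NUTCH" generate crawl/crawldb crawl/segments || return 1
        after="$(latest_segment)"
        # If generate created no new segment, there is nothing left to fetch.
        [ "$after" = "$before" ] && break
        SEGMENT="crawl/segments/$after"
        "$NUTCH" fetch "$SEGMENT" -noParsing || return 1
        "$NUTCH" parse "$SEGMENT" || return 1
        "$NUTCH" updatedb crawl/crawldb "$SEGMENT" -filter -normalize || return 1
        i=$((i + 1))
    done
}
```

After pasting this into your shell, crawl_rounds 3 runs three cycles, each one picking up the URLs discovered by the previous round. The function stops early if any command fails or if generate produces no new segment.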

11. Create linkdb

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

12. Finally index all content from all segments to Solr

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Now the indexed content is available through Solr. You can try executing searches from the Solr admin UI at

http://127.0.0.1:8983/solr/admin

or directly with a URL like

http://127.0.0.1:8983/solr/nutch/?q=solr&version=2.2&start=0&rows=10&indent=on&wt=json
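If you want to run such queries from a script, a small helper can assemble the URL for the /nutch request handler configured in step 5. This is just a sketch: the function name nutch_search_url and the SOLR_URL variable are made up for illustration, and the query term is inserted as-is, so it only works for simple terms that need no URL encoding.

```shell
# Base URL of the Solr instance started in step 6; override if yours differs.
SOLR_URL="${SOLR_URL:-http://127.0.0.1:8983/solr}"

# Hypothetical helper: print the search URL for a term, with optional
# start offset (default 0) and row count (default 10).
nutch_search_url() {
    printf '%s/nutch/?q=%s&version=2.2&start=%s&rows=%s&indent=on&wt=json\n' \
        "$SOLR_URL" "$1" "${2:-0}" "${3:-10}"
}
```

With curl installed, curl -s "$(nutch_search_url solr)" fetches the first ten results as JSON, and nutch_search_url solr 10 asks for the next page.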

Conclusion

Nutch in combination with Solr is quite a powerful base on which to build your search application. Even though the base is solid, there are a few things missing from the stack that you will soon notice if you start to index content on a larger scale. One of the missing features is duplicate content removal, but luckily there is an improvement issue for this in the Nutch Jira: https://issues.apache.org/jira/browse/NUTCH-684. Another missing piece, on the Solr side, is a feature called field collapsing (https://issues.apache.org/jira/browse/SOLR-236). Field collapsing could be used when displaying results so that, for example, at most two pages would be shown for a single host.

The setup explained here has one significant caveat you also need to keep in mind: scale. You cannot use this kind of setup with vertical scale (collection size) that goes beyond one Solr box. The horizontal scaling (query throughput) is still possible with the standard Solr replication tools.



Category: Open Source, Solr, Uncategorized

71 Responses to “Using Nutch with Solr”

  1. I am new to Nutch so, pardon me if this a novice question. What does this sentence mean?

    “The setup explained here has one significant caveat you also need to keep in mind: scale. You cannot use this kind of setup with vertical scale (collection size) that goes beyond one Solr box. The horizontal scaling (query throughput) is still possible with the standard Solr replication tools.”

    March 9, 2009 12:13 — Taruvai

  2. What I was trying explain there is that the Nutch Solr indexer can currently only be used with a single Solr instance. That is why you cannot easily handle “unlimited” amount of docs with the setup presented here. Depending on your requirements and hardware you can still fit up to 10-20 million, perhaps more, docs in one Solr instance. If you need to go beyond that you can, for example, enhance the Nutch Solr Indexer so it can use multiple Solr instances with help of consistent hashing or something similar. I am sure that this problem will be tackled rather soon either in Solr or Nutch land.

    March 9, 2009 22:20 — Sami Siren

  3. [...] Lucid Imagination » Using Nutch with Solr The soon to be released Nutch 1.0 contains Solr integration “out of the box”. There are many different ways to take advantage of this new feature, but I am just going to go through one of them here. In this solution, Solr will be used as the only source for serving search results (including snippets). This way you can totally decouple your search application from Nutch and still use Nutch where it is at its best: crawling and extracting the content. Using Solr as the search backend, on the other hand, allows you to use all of the advanced features of a Solr server – like query spell checking, “more like this” suggestions, data replication and easy query time relevancy tuning, to mention just a few. (tags: todo opensource searchengine rend solr nutch) [...]

    March 10, 2009 14:02 — Webhamer Weblog: Search & ICT-related blogging » links for 2009-03-10

  4. [...] Lucid Imagination » Using Nutch with Solr The soon to be released Nutch 1.0 contains Solr integration “out of the box”. There are many different ways to take advantage of this new feature, but I am just going to go through one of them here. In this solution, Solr will be used as the only source for serving search results (including snippets). This way you can totally decouple your search application from Nutch and still use Nutch where it is at its best: crawling and extracting the content. Using Solr as the search backend, on the other hand, allows you to use all of the advanced features of a Solr server – like query spell checking, “more like this” suggestions, data replication and easy query time relevancy tuning, to mention just a few. (tags: todo opensource searchengine rend solr nutch) [...]

    March 10, 2009 15:04 — StauthamerNet :: Staut’s Family Blog» Blog Archive » links for 2009-03-10

  5. Nice description.. working on first try…

    March 28, 2009 22:00 — VIv

  6. Hi, my drupal site uses apache-solr engine to index and query for data. It works great. I just installed nutch 1.0 to take advantage of nutch’s ability to crawl specific external sites for appropriate data and have these data indexed by the same apache-solr engine. I followed the instruction presented here and got my nutch 1.0 to work. Using the provided schema.xml that comes with nutch my site can query and display data coming from nutch crawling. Unfortunately the other data set is now no longer available for query. It is only available when I revert the schema.xml to the original one. Is it possible to combine the two indexes so they can be conveniently available to same apache-solr query. Would you please kindly point me in the right direction. Much thanks.

    April 9, 2009 12:23 — Kham

  7. Got it to work. Just a matter of careful merging of the two schema.xml files.

    April 9, 2009 17:32 — Kham

  8. You guys are great ! First shot right.

    April 13, 2009 23:57 — Ray

  9. Is there a way to get the full markup as opposed to the parsed text as content or another field?

    April 18, 2009 07:21 — Ben

  10. Hi, I wonder if there is a command that ask Nutch to return Documents that have been changed in the indexe and show changes.
    Thanks for helping.

    April 27, 2009 07:45 — aida

  11. Hello,

    How additional index can be added manually to nutch indexes, using solr?

    May 12, 2009 14:06 — Alex

  12. Hi, we followed step by step instructions above. When we went to view the results, we have noticed that the content is being translated into “Chinese” or some Asian language. We see results ONLY if we query in Asian language. what are we doing wrong?

    May 20, 2009 13:52 — Sai Thumuluri

  13. Sai, can you elaborate your problem a bit, how is the content translated? Do you have an example URL that demonstrates the problem? Thanks.

    May 28, 2009 23:38 — Sami Siren

  14. Thanks for the nice description. I have certain queries.

    Once I crawl the websites using nutch I want to do the following:
    1. Extract entities like people, places, and other master information. Is it possible in nutch?
    2. I need the entire DOM of the page that is parsed. Is it possible in nutch? Or do I have to use someother mechanism.

    The idea is: I have a project requirement where I have to crawl very specific sites, ‘extract specific values from the html DOM’ and index it. Next I want to search these values using solr.
    Will Nutch+Solr suffice, OR I have to use somthing like Nutch+Solr+(Aperture/Tika).

    Thanks for helping

    June 4, 2009 22:22 — Jagdish Yadav

  15. If I follow these instructions I can get everything to work swimmingly, however I need to add some ‘custom’ static fields so I was hoping to employ index-extra plug in. I put the plugin in place, crawl with seemingly no problem, but when I try to commit to solr, I get nothing added to the index, and no error. Has anyone else tried to use index-extra along with the nutch/solr integration?

    June 11, 2009 09:06 — Edward

  16. I think our problem is related to having our content in charset UTF-16, can you tell us how we can change charset to UTF-8 during index/crawl?

    June 15, 2009 14:45 — Sai Thumuluri

  17. I think I have all the nutch stuff working in that the data is searchable through the nutch webapp but I cant get any results through Solr

    bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

    This makes the computer work heavily for an hour or so but I cant see where the solr index goes.

    http://127.0.0.1:8983/solr/admin gives me a web interface but I cant get it to return any results.

    Any idea what I should be looking at?

    Thanks

    July 7, 2009 10:17 — AlexMc

  18. I’m having the same problem as AlexMC too.

    July 13, 2009 11:24 — Zaihan

  19. Hi,

    I need some help here actually in our project I need to pass XML file as input to the NUTCH instead of .txt files as given in the example.When I tried to pass XML input file I am getting the following exception:-

    WARN crawl.Injector – Skipping :java.net.MalformedURLException: no protocol:
    2009-09-15 18:21:36,556 INFO crawl.Injector – Injector: Merging injected urls into crawl db.
    2009-09-15 18:21:38,321 WARN util.NativeCodeLoader – Unable to load native-hadoop library for your platform… using builtin-java classes where applicable

    Please help me in resolving the same and pls let me know how can I pass .xml file as input.

    Thanks

    September 15, 2009 05:34 — Ravikiran A

  20. Hello,
    I am new to Nutch too, excuse my novice question. What does this sentence mean?
    “Now a full Fetch cycle is completed. Next you can repeat step 10 couple of more times to get some more content”.
    is there any possibility to have these steps done automatically?

    September 21, 2009 06:37 — LPillet

  21. How can we use nutch and solr for multiple sites?
    What are the steps I need to take care in configuring Nutch and Solr for multiple sites

    November 4, 2009 12:25 — ACG

  22. adding the fragment mentioned above to apache-solr-1.3.0/example/solr/conf/solrconfig.xml
    makes solr fail. I get following exception and web site throws error. Please note solr works with original solrconfig.xml. The problem starts as soon as I add the fragement.

    sorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.mortbay.start.Main.invokeMain(Main.java:183)
    at org.mortbay.start.Main.start(Main.java:497)
    at org.mortbay.start.Main.main(Main.java:115)
    Nov 16, 2009 10:02:11 PM org.apache.solr.common.SolrException log
    SEVERE: org.apache.solr.common.SolrException: invalid boolean value:
    at org.apache.solr.common.util.StrUtils.parseBool(StrUtils.java:237)
    at org.apache.solr.common.util.DOMUtil.addToNamedList(DOMUtil.java:140
    at org.apache.solr.common.util.DOMUtil.nodesToNamedList(DOMUtil.java:9

    at org.apache.solr.common.util.DOMUtil.childNodesToNamedList(DOMUtil.j
    a:88)
    at org.apache.solr.common.util.DOMUtil.addToNamedList(DOMUtil.java:142
    at org.apache.solr.common.util.DOMUtil.nodesToNamedList(DOMUtil.java:9

    at org.apache.solr.common.util.DOMUtil.childNodesToNamedList(DOMUtil.j
    a:88)
    at org.apache.solr.core.PluginInfo.<init>(PluginInfo.java:54)
    at org.apache.solr.core.SolrConfig.readPluginInfos(SolrConfig.java:220
    at org.apache.solr.core.SolrConfig.loadPluginInfo(SolrConfig.java:212)
    at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:184)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreConta
    er.java:134)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.
    va:83)
    at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
    40)
    at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.
    va:594)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.j
    a:1218)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.jav
    500)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:4
    )
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
    40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollecti
    .java:147)
    at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextH
    dlerCollection.java:161)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
    40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollecti
    .java:147)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
    40)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.jav
    117)
    at org.mortbay.jetty.Server.doStart(Server.java:210)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
    40)
    at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImp
    java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcc
    sorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.mortbay.start.Main.invokeMain(Main.java:183)
    at org.mortbay.start.Main.start(Main.java:497)
    at org.mortbay.start.Main.main(Main.java:115)

    November 16, 2009 22:05 — Neha

  23. Any ideas what might be wrong? Any help would be highly appreciated.

    November 16, 2009 22:05 — Neha

  24. Hi,

    I had the same Problem,
    just change the xml syntax line as follows:

    true

    this should work

    all the best
    clausi

    November 18, 2009 10:35 — clausi

  25. sorry there is something missing
    this line:

    <bool name="hl">true</bool>

    November 18, 2009 10:36 — clausi

  26. in solrconfig.xml, tag:

    <bool hl="true"/>

    should be

    <bool name="hl">true</bool>

    November 20, 2009 08:13 — igor vrdoljak

  27. NOTE: In the request handler config section there are three settings which need to be edited:
    - Where it says “int” should be changed to “str”.
    - Where it says “float” should be changed to “str”.
    - Where it says “boolean” should be changed to “str”.

    With those changes the “/nutch” request handler will be configured correctly and Solr will start as expected.

    November 25, 2009 22:09 — Jay Hill

  28. Thanks Jay Hill, these instructions broke my Solr configuration until I made the changes you mentioned.

    I was also having problems getting Nutch 1.0 to run on top of Xampp Tomcat for Windows and I discovered I needed to use a more recent version of Java.

    The Xampp Tomcat distrib had JRE 1.5 rev 20, and kept failing to run Nutch, logging “java.lang.UnsupportedClassVersionError: Bad version number in .class file “. I worked around this by installing a newer Java version (JRE 1.6 rev 16) and pointing my JRE_HOME to the new location.

    Thx for a really insightful article :-)

    November 28, 2009 17:36 — artemgy

  29. Very useful and straightforward guide. Thank you.

    Just one question: does the solrindex command have options or configuration parameters that I can use to, say, index only pages containing/not_containing certain words and discard all the rest?

    G

    December 15, 2009 03:40 — Giona

  30. How can I make this multi crawler and avoid duplicate content using multi crawler ??

    December 29, 2009 16:43 — jason

  31. How can I rank the content after getting the content?

    December 29, 2009 16:47 — jason

  32. Do I use “/nutch” or the path to my nutch directory (e.g. /opt/nutch-1.0)?

    January 5, 2010 16:18 — Ken

  33. I executed ‘java -jar start.jar’, and got this at the very end:

    Started SocketConnector @ 0.0.0.0:8983

    I’m about to get solr to load when I url http://localhost:8983/solr/admin

    Does it mean solr is working and running, where I open a new terminal to do others things and not press “Ctlr+C”.

    January 5, 2010 17:32 — Ken

  34. Hi
    Can we use this setup to crawl multiple domains and return results for specific domains depending on which user it is ?

    Thanks

    January 6, 2010 17:28 — shantanu

  35. [...] de recherche libre vous pouvez intégrer Nutch avec Solr, comme décrit dans cette article : Using Nutch with Solr. Afin d’obtenir un moteur de recherche professionnel pour une entreprise. (principalement [...]

    January 11, 2010 14:23 — Christophe Nowicki » Faire son propre moteur de recherche avec Nutch

  36. tnx for the nice and clear howto. Any idea on how to use instead Nutch interface with the Solr capabilities? I mean, fetch by Nutch, search by Nutch interface (very nice and easy) and engine provided by Solr?

    thanks

    January 21, 2010 05:34 — Davide

  37. Installed solr and nutch, and executed this tutorial. Had an error during the execution of step 9.:

    plugin.PluginRepository – Ontology Model Loader (org.apache.nutch.ontology.Ontology)
    2010-02-25 17:50:49,346 WARN regex.RegexURLNormalizer – can’t find rules for scope ‘inject’, using default
    2010-02-25 17:50:51,649 WARN mapred.LocalJobRunner – job_local_0001
    java.lang.NumberFormatException: For input string: “free”
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:410)
    at java.lang.Long.parseLong(Long.java:468)
    at org.apache.hadoop.fs.DF.parseExecResult(DF.java:122)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:179)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
    2010-02-25 17:50:52,471 FATAL crawl.Injector – Injector: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
    at org.apache.nutch.crawl.Injector.run(Injector.java:190)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:180)

    Any idea what this relates to?

    February 25, 2010 09:54 — José Grilo

  38. I installed solr and was able to access the admin screen. When I install nutch and follow the command in 5d to paste the fragment in solrconfig.xml, solr’s admin fails.

    The error I get is:

    HTTP ERROR: 404

    missing core name in path

    RequestURI=/solr/admin/index.jsp

    Why would adding the snippet cause that?

    February 25, 2010 13:18 — Ian

  39. Hi,
    I tried to submit nutch craw via both crawl -solr option and solrindex command. All the attempt failed and solr log said :

    SolrException: ERROR: multiple values encountered for non multiValued copy field id: http://www.x.xx.x.

    I couldnt find the problem. I exactly copied schema.xml from nutch 1.1 config dir to the solr conf dir.

    bin/nutch crawl urls -dir datalist -threads 200 -depth 2 -solr http://127.0.0.1:8983/solr

    thanks

    March 23, 2010 02:01 — LoRdxx

  40. Found a fix for the problems with id field:

    remove the line

    It appears that the url is used as an id by default.

    March 26, 2010 16:37 — Jem

  41. Jem, thanks for the last comment (really really useful)

    April 7, 2010 08:44 — elaragon

  42. Hi,

    I have integrated nutch with solr. I am using carot2 to view the results. I am able to view title,content and url fields in the output. I need not need content feild to be seen instead I required a field which similar to the description field(only one or two lines to display) or do there is any other configuration required to achieve this.

    Thanks in advance.

    April 16, 2010 04:10 — sharmy

  43. Great intro to nutch and solr. Only had to fix the one (bool) line in solrconfig.xml that clausi noted, and update JDK to 1.6 as artemgy did.(But I didn’t make the 3 changes Jay Hill recommended.) Had the same problem as AlexMC and Zaihan. Everything ran, but nothing was committed to solr. It was caused buy missing a step the first time through. Deleted everything under crawl/crawldb, crawl/linkdb, and crawl/segments and went back to step 10. Working nicely now.

    Thanks Sami, and all those who posted their experiences with this.

    April 22, 2010 14:49 — ck

  44. The article is really nice. Thanks Sami.
    I’ve a huge amount of data already indexed by Nutch. I want the faceting feature to be implemented. For that I decided to install SOLR. I’m going to integrate SOLR with Nutch as per this article. But still dont know how do faceting on data already indexed by Nutch. Can someone help me in this regard. Sami, can you please help me in this regard. Thanks everyone for your time.

    April 26, 2010 07:00 — Avi

  45. The line Jem tried to post that was stripped out is the copyField source=”url” dest=”id” tag in the schema.xml file in solr. I had the same “multiple values encountered for non multiValued copy field id” and removing that fixed it.

    April 26, 2010 12:20 — Patrick

  46. I followed the instructions and everything worked well.
    But I got only one Index in Solr. Where can i define the -depth and -topN?
    Is it right here:

    bin/nutch generate crawl/crawldb crawl/segments -depth 3 -topN 50 ??
    Please help me.

    May 18, 2010 05:46 — Olli

  47. This is cool – got it all working with Solr, along with the DataImport stuff to integrate RDBMS too.

    Solid article.

    May 25, 2010 04:30 — Steven Livingstone

  48. hi,

    what about sharding? how to use nutch solrindex with several sharded solr instances?

    May 27, 2010 06:44 — denis lihachev

  49. I followed the instructions and everything worked well.
    But I got only one Index in Solr. Where can I define the -depth and -topN?

    Please Reply.

    Thanx

    July 12, 2010 11:44 — abhi

  50. Thanks for a great howto, Samir!

    I have created a howto for Ubuntu 10.04 Lucid Lynx based on your work, that is available here:

    http://ubuntuforums.org/showthread.php?p=9596257

    cheers!
    Gustavo

    July 16, 2010 01:06 — Gustavo Zaera

  51. I have a question if you don’t mind. I’m trying to install Nutch running on Hadoop (on 5 servers) in a way that it produces multiple indices. I want to have distributed search on Solr and I want Nutch to produce 16 different indices and then push them to master Solr. Can you tell me how I can do that?
    Thanks a lot!
    Sara

    August 2, 2010 08:56 — Sara

  52. [...] match ubuntu 10.04 (Lucid). It is based on the work of Sami Siren at Lucid Imagination available on http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ [...]

    August 16, 2010 10:13 — The Solr+Nutch on Ubuntu Server HOWTO on… « Information Processing

  53. Sami, In your tutorial – how and where do I specify depth of crawl. Thanks for your help.

    August 20, 2010 13:10 — Sai Thumuluri

  54. I am stuck from step 9 onwards. What directory should I be in when I enter the following command?

    bin/nutch inject crawl/crawldb urls

    I am aware that this is probably quite a simple problem to overcome, so apologies in advance. An identical problem was posted here: January 5, 2010 16:18 — Ken

    Also, has anyone got a suggested answer to the question from April 16, 2010 04:10 — sharmy? It would be a great help for my current project.

    Thank you

    Lewis

    August 25, 2010 07:00 — Lewis Mc

  55. I was able to install Nutch 1.0 on Windows successfully. If anyone can suggest an answer to the question from April 16, 2010 04:10 — sharmy, I would be grateful.

    Thanks,
    Suni

    September 9, 2010 01:33 — suni

  56. @July 12, 2010 11:44 — abhi

    I also got only one entry in Solr the first time. The article says to repeat step 10 for more results, and when I ran it again from step 10 I got more entries. Each time I repeat the step I get more and more.

    My question: how many times should I keep running step 10 (until I see a constant fetchQueue value)? Is there any way to get the maximum number of results in a single run? Is there any way to define the depth with this approach (other than using the single crawl command)?

    Thanks for a great article; I look forward to some help with this issue.
    Ram.

    September 21, 2010 10:45 — Ram
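
The repetition Ram describes amounts to a loop: each pass of generate, fetch, parse and updatedb follows links one level further, so the number of passes plays the role of -depth, while -topN on the generate command caps how many URLs a pass selects. A minimal sketch of that loop in Python, under these assumptions: it is run from the Nutch install directory, the crawl/crawldb and crawl/segments paths are the ones quoted in the comments above, and the fetch, parse and updatedb arguments are as remembered from the Nutch 1.0 command line and should be checked against your version. The function names are hypothetical.

```python
import subprocess
from pathlib import Path

NUTCH = "bin/nutch"          # assumption: current directory is the Nutch install directory
CRAWLDB = "crawl/crawldb"    # paths as quoted in the comments above
SEGMENTS = "crawl/segments"


def generate_command(top_n):
    """Command that selects up to top_n URLs for the next round."""
    return [NUTCH, "generate", CRAWLDB, SEGMENTS, "-topN", str(top_n)]


def segment_commands(segment):
    """Commands that fetch and parse one segment, then update the crawldb from it."""
    return [
        [NUTCH, "fetch", segment, "-noParsing"],   # assumed Nutch 1.0 flag
        [NUTCH, "parse", segment],
        [NUTCH, "updatedb", CRAWLDB, segment, "-filter", "-normalize"],
    ]


def newest_segment(segments_dir=SEGMENTS):
    """Segment directories are named by timestamp, so the largest name is the newest."""
    names = sorted(p.name for p in Path(segments_dir).iterdir() if p.is_dir())
    return f"{segments_dir}/{names[-1]}"


def crawl(depth, top_n, run=subprocess.check_call):
    """Repeat the generate/fetch/parse/updatedb round `depth` times."""
    for _ in range(depth):
        run(generate_command(top_n))
        for command in segment_commands(newest_segment()):
            run(command)
```

Calling crawl(depth=3, top_n=50) would then correspond roughly to the -depth 3 -topN 50 arguments of the one-shot crawl command. Because check_call raises on a non-zero exit status, the loop stops at the first command that fails.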

  57. I did a refresh of this article at http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/. That might help.

    September 22, 2010 05:19 — Grant Ingersoll

  58. [...] Check out the original for detail [...]

    October 7, 2010 23:33 — Use Nutch with Solr - Nutch Tutorial

  59. [...] If you’re willing to upgrade to nutch 1.0 you can use the solrindex as described in this article by Lucid Imagination: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/. [...]

    October 8, 2010 00:05 — Using Nutch crawler with Solr - Nutch Tutorial

  60. [...] This HOWTO is updated to match ubuntu 10.04 based on the work of Sami Siren at Lucid Imagination available here: http://www.lucidimagination.com/blog…09/nutch-solr/ [...]

    October 8, 2010 17:49 — Solr+Nutch on Ubuntu Server 10.04 - Nutch Tutorial

  61. [...] I am new to using Nutch with Solr. I followed the link http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for integration. I am getting an error while fetching the url from [...]

    October 10, 2010 17:56 — Nutch Fetching the url from crawldb - Nutch Tutorial

  62. [...] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ [...]

    October 10, 2010 21:54 — nutch solrindex with a pseudo-cluster - Nutch Tutorial

  63. I am getting the exception below:

    Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/icar/Installed/test/nutch-1.2/urls/local.txt
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)

    November 10, 2010 23:20 — Amit

  64. Hello
    Very good post, thanks. Nutch has a distributed search feature and Solr has spell checking and autocomplete features. Is there any way to use both?

    By integrating Nutch with Solr, can I still get the distributed search feature? I also want autocomplete.

    January 31, 2011 14:40 — enrique

  65. Hi, please guide me through my novice questions.
    1) After I paste the fragment into the solrconfig.xml file, Solr doesn’t start. Can you please tell me where to paste it?

    2) Is Tomcat mandatory to run Nutch, and if so, how do we configure Tomcat for Nutch?

    3) From step 9 on we use bin/nutch commands in the nutch directory. Can you please explain how we get the nutch directory set up?

    Please respond ASAP

    February 15, 2011 00:53 — Rahul

  66. This is a very good article.
    I have installed Nutch + Solr and both are working fine. I have a small application that uses the Solr API to execute queries, but some fields in the response are blank or null. Why is that?

    February 28, 2011 20:30 — Amit Kumar Gupta

  67. it worked………………….!

    March 22, 2011 04:38 — vijay

  68. [...] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ [...]

    March 25, 2011 13:01 — Building a Search Engine with Nutch and Solr in 10 minutes | Building Blocks Knowledge Share

  69. I can only get Nutch to crawl lucidimagination.com. When I try other URLs I get the error: “No URLS to fetch – check your seed list and URL filters.”

    May 4, 2011 03:20 — Matthew

  70. [...] But there seem to be decent amount of options, which I did not try out, yet. Some of them – a) Using Nutch with Solr – a LucidImagination article (I did play around with Nutch a while ago, but I did not know [...]

    August 31, 2011 05:11 — On Solr « sowmyawrites ….

  71. Thanks for a great article. It is very useful for a novice like me.
    I created a sample JSP page, like Google, showing title, content and URL via SolrJ. The content field is too large to show on the result page. I would just like to show a small fragment of the content with highlighted matches. How can I do that?

    Thanks

    December 18, 2011 06:32 — yewint


Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2011 Lucid Imagination. All rights reserved.
