Using Nutch with Solr

The last time I wrote about integrating Apache Nutch with Apache Solr (about two years ago), it was quite difficult to integrate the two components: you had to apply patches, hunt down required components from various places, and so on. Now there is an easier way. The soon-to-be-released Nutch 1.0 contains Solr integration “out of the box”. There are many different ways to take advantage of this new feature, but I am just going to go through one of them here. In this solution, Solr will be used as the only source for serving search results (including snippets). This way you can totally decouple your search application from Nutch and still use Nutch where it is at its best: crawling and extracting the content. Using Solr as the search backend, on the other hand, allows you to use all of the advanced features of a Solr server, like query spell checking, “more like this” suggestions, data replication and easy query-time relevancy tuning, to mention just a few.


Why Nutch instead of a simpler Fetcher?

One possible way to implement something similar to what I present here would be to use a simpler crawler framework such as Apache Droids. But using Nutch gives you some pretty nice advantages. One of these is obviously the fact that Nutch provides a complete set of features you commonly need for a generic web search application. Another benefit of using Nutch is that it is a highly scalable and relatively feature-rich crawler (this does not mean that you cannot do the same with some other framework). Nutch offers features like politeness (it obeys robots.txt rules), robustness and scalability (Nutch runs on Hadoop, so you can run it on a single machine or on a cluster of 100 machines), quality (you can bias the crawling to fetch “important” pages first) and extensibility (there are many APIs into which you can plug your own functionality). One of the most important single features Nutch provides out of the box is, in my subjective opinion, the link database. You might already know that Nutch tracks links between pages, so the relevancy of search results within a collection of interlinked documents goes well beyond the naive case where you index documents without link information and anchor texts.

Setup

The first step to get started is to download the required software components, namely Apache Solr and Nutch.

1. Download Solr version 1.3.0 or LucidWorks for Solr from the download page

2. Extract the Solr package

3. Download Nutch version 1.0 or later (alternatively, download the nightly version of Nutch that contains the required functionality)

4. Extract the Nutch package

tar xzf apache-nutch-1.0.tar.gz

5. Configure Solr

For the sake of simplicity we are going to use the example configuration of Solr as a base.

a. Copy the provided Nutch schema (schema.xml) from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (overwrite the existing file)

We want to allow Solr to create the snippets for search results so we need to store the content in addition to indexing it:

b. Change schema.xml so that the stored attribute of field “content” is true.

<field name="content" type="text" stored="true" indexed="true"/>

We want to be able to tweak the relevancy of queries easily, so we’ll create a new dismax request handler configuration for our use case:

d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it (inside the top-level config element)

<requestHandler name="/nutch" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
content^0.5 anchor^1.0 title^1.2
</str>
<str name="pf">
content^0.5 anchor^1.5 title^1.2 site^1.5
</str>
<str name="fl">
url
</str>
<str name="mm">
2&lt;-1 5&lt;-2 6&lt;90%
</str>
<int name="ps">100</int>
<bool name="hl">true</bool>
<str name="q.alt">*:*</str>
<str name="hl.fl">title url content</str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.url.hl.fragsize">0</str>
<str name="f.url.hl.alternateField">url</str>
<str name="f.content.hl.fragmenter">regex</str>
</lst>
</requestHandler>
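
To make the mm value in this handler more concrete, here is a minimal sketch of how I read the spec 2<-1 5<-2 6<90% (shown unescaped; inside solrconfig.xml the < must stay written as &lt;). The function name min_should_match is made up for illustration and is not part of Solr, and the integer arithmetic only approximates the way Solr rounds percentages down.

```shell
# min_should_match N
# Prints how many of N optional query clauses must match
# under the spec "2<-1 5<-2 6<90%". Hypothetical helper, not a Solr tool.
min_should_match() {
  n=$1
  if [ "$n" -le 2 ]; then
    echo "$n"                # 1 or 2 clauses: all are required
  elif [ "$n" -le 5 ]; then
    echo $((n - 1))          # 3 to 5 clauses: all but one
  elif [ "$n" -le 6 ]; then
    echo $((n - 2))          # exactly 6 clauses: all but two
  else
    echo $((n * 90 / 100))   # more than 6 clauses: 90%, rounded down
  fi
}

# Example: print the requirement for queries of 1 to 8 words.
for words in 1 2 3 4 5 6 7 8; do
  echo "$words clause(s) -> $(min_should_match "$words") must match"
done
```

Read this way, a one- or two-word query has to match every word, while longer queries are allowed to miss a word or two.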

6. Start Solr

cd apache-solr-1.3.0/example
java -jar start.jar

7. Configure Nutch

a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its contents with the following (we specify our crawler name and the active plugins, and limit the maximum URL count for a single host per run to 100):

<?xml version="1.0"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>

b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace its contents with the following:

-^(https|telnet|file|ftp|mailto):
 
# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
 
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
 
# allow urls in lucidimagination.com domain
+^http://([a-z0-9\-A-Z]*\.)*lucidimagination\.com/
 
# deny anything else
-.
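
Nutch applies these rules from top to bottom, and the first pattern that matches decides whether a URL is accepted (+) or rejected (-). A rough sketch of that behaviour follows. It assumes a shortened rule list and uses POSIX extended regular expressions via grep -E rather than the Java regular expressions Nutch itself uses; url_allowed is a hypothetical helper, not a Nutch command.

```shell
# url_allowed URL
# Returns 0 if the first matching rule starts with "+", 1 otherwise.
# The rule list is a shortened, POSIX-ERE version of the filter file above.
url_allowed() {
  url=$1
  while IFS= read -r rule; do
    case $rule in
      ''|'#'*) continue ;;   # skip blank lines and comments
      +*) verdict=0 ;;       # accept rule
      -*) verdict=1 ;;       # reject rule
      *) continue ;;
    esac
    pattern=${rule#?}        # drop the leading + or -
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      return "$verdict"      # first match wins
    fi
  done <<'EOF'
-^(https|telnet|file|ftp|mailto):
-\.(gif|jpg|png|css|js|pdf|zip)$
-[?*!@=]
+^http://([a-zA-Z0-9-]*\.)*lucidimagination\.com/
-.
EOF
  return 1                   # no rule matched: reject
}

# Example usage:
for u in "http://www.lucidimagination.com/blog/" "http://example.com/"; do
  if url_allowed "$u"; then echo "accept $u"; else echo "reject $u"; fi
done
```

Because the final rule rejects everything, a URL is only fetched if it gets past the three reject rules and then matches the domain rule.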

8. Create a seed list (the initial urls to fetch)

mkdir urls
echo "http://www.lucidimagination.com/" > urls/seed.txt

9. Inject the seed url(s) into the Nutch crawldb (execute in the Nutch directory)

bin/nutch inject crawl/crawldb urls

10. Generate fetch list, fetch and parse content

bin/nutch generate crawl/crawldb crawl/segments

The above command will generate a new segment directory under crawl/segments that at this point contains files that store the url(s) to be fetched. In the following commands we need the latest segment dir as parameter so we’ll store it in an environment variable:

export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
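
The command above picks the segment by modification time. Since Nutch names segment directories after their creation timestamp, an alternative sketch is to pick the last directory by name, which does not depend on modification times. The helper name latest_segment is my own and not part of Nutch, and the sketch assumes the segments directory contains only timestamp-named directories.

```shell
# latest_segment [SEGMENTS_DIR]
# Prints the segment directory whose name sorts last (names are timestamps).
# Returns non-zero if the directory contains no segments.
latest_segment() {
  segments_dir=${1:-crawl/segments}
  latest=
  for d in "$segments_dir"/*/; do   # the glob expands in sorted order
    [ -d "$d" ] || continue         # also covers the "no match" case
    latest=${d%/}
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Usage (commented out because it needs an existing crawl directory):
# SEGMENT=$(latest_segment crawl/segments) && export SEGMENT
```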

Now I launch the fetcher that actually goes to get the content:

bin/nutch fetch $SEGMENT -noParsing

Next I parse the content:

bin/nutch parse $SEGMENT

Then I update the Nutch crawldb. The updatedb command will store all new urls discovered during the fetch and parse of the previous segment into the Nutch database so they can be fetched later. Nutch also stores information about the pages that were fetched so the same urls won’t be fetched again and again.

bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize

Now a full fetch cycle is complete. Next you can repeat step 10 a couple more times to get some more content.
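
Repeating step 10 by hand gets tedious, so here is a minimal sketch that wraps the same four commands in a loop. The NUTCH variable and the crawl_rounds function are names I made up for this example; the sketch assumes it is run from the Nutch directory and simply stops at the first command that fails.

```shell
# Path to the nutch launcher; override NUTCH to point somewhere else.
NUTCH=${NUTCH:-bin/nutch}

# crawl_rounds N
# Runs the generate / fetch / parse / updatedb cycle from step 10 N times.
crawl_rounds() {
  rounds=$1
  i=1
  while [ "$i" -le "$rounds" ]; do
    "$NUTCH" generate crawl/crawldb crawl/segments || return 1
    # Same idea as the SEGMENT variable above: newest segment directory.
    segment=crawl/segments/$(ls -tr crawl/segments | tail -n 1)
    "$NUTCH" fetch "$segment" -noParsing || return 1
    "$NUTCH" parse "$segment" || return 1
    "$NUTCH" updatedb crawl/crawldb "$segment" -filter -normalize || return 1
    i=$((i + 1))
  done
}

# Usage (commented out because it starts a real crawl):
# crawl_rounds 3
```

Each round can only fetch urls that earlier rounds discovered, so the number of rounds roughly corresponds to how many links deep the crawl goes from the seed list.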

11. Create linkdb

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

12. Finally index all content from all segments to Solr

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Now the indexed content is available through Solr. You can try executing searches from the Solr admin UI at

http://127.0.0.1:8983/solr/admin

or directly with a URL like

http://127.0.0.1:8983/solr/nutch/?q=solr&version=2.2&start=0&rows=10&indent=on&wt=json
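
A quick way to confirm that documents actually reached the index is to ask Solr for the total hit count. The sketch below assumes the default XML response format and the stock /select handler of the Solr example setup; num_found is a hypothetical helper name, not a Solr or Nutch command.

```shell
# num_found
# Reads a Solr XML response on stdin and prints its numFound attribute.
num_found() {
  sed -n 's/.*numFound="\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Usage (commented out because it needs a running Solr):
# curl -s 'http://127.0.0.1:8983/solr/select?q=*:*&rows=0' | num_found
```

If this prints 0 after the solrindex step, the documents most likely never arrived in this Solr instance, and the Nutch and Solr logs are the next place to look.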

Conclusion

Nutch in combination with Solr is quite a powerful base on which to build your search application. Even though the base is solid, there are a few things missing from the stack that you will soon notice if you start to index content on a larger scale. One of the missing features is duplicate content removal, but luckily there is an improvement issue for this in the Nutch Jira: https://issues.apache.org/jira/browse/NUTCH-684. Another missing piece, on the Solr side, is a feature called field collapsing (https://issues.apache.org/jira/browse/SOLR-236). Field collapsing could be used when displaying results so that, for example, at most two pages would be shown for a single host.

The setup explained here has one significant caveat you also need to keep in mind: scale. You cannot use this kind of setup with vertical scale (collection size) that goes beyond one Solr box. The horizontal scaling (query throughput) is still possible with the standard Solr replication tools.



38 Responses to “Using Nutch with Solr”

  1. I am new to Nutch so, pardon me if this a novice question. What does this sentence mean?

    “The setup explained here has one significant caveat you also need to keep in mind: scale. You cannot use this kind of setup with vertical scale (collection size) that goes beyond one Solr box. The horizontal scaling (query throughput) is still possible with the standard Solr replication tools.”

    March 9, 2009 12:13, Taruvai

  2. What I was trying to explain there is that the Nutch Solr indexer can currently only be used with a single Solr instance. That is why you cannot easily handle an “unlimited” amount of docs with the setup presented here. Depending on your requirements and hardware you can still fit up to 10-20 million, perhaps more, docs in one Solr instance. If you need to go beyond that you can, for example, enhance the Nutch Solr Indexer so it can use multiple Solr instances with the help of consistent hashing or something similar. I am sure that this problem will be tackled rather soon either in Solr or Nutch land.

    March 9, 2009 22:20, Sami Siren

  3. [...] Lucid Imagination » Using Nutch with Solr The soon to be released Nutch 1.0 contains Solr integration “out of the box”. There are many different ways to take advantage of this new feature, but I am just going to go through one of them here. In this solution, Solr will be used as the only source for serving search results (including snippets). This way you can totally decouple your search application from Nutch and still use Nutch where it is at its best: crawling and extracting the content. Using Solr as the search backend, on the other hand, allows you to use all of the advanced features of a Solr server – like query spell checking, “more like this” suggestions, data replication and easy query time relevancy tuning, to mention just a few. (tags: todo opensource searchengine rend solr nutch) [...]

    March 10, 2009 14:02, Webhamer Weblog: Search & ICT-related blogging » links for 2009-03-10

  4. [...] Lucid Imagination » Using Nutch with Solr The soon to be released Nutch 1.0 contains Solr integration “out of the box”. There are many different ways to take advantage of this new feature, but I am just going to go through one of them here. In this solution, Solr will be used as the only source for serving search results (including snippets). This way you can totally decouple your search application from Nutch and still use Nutch where it is at its best: crawling and extracting the content. Using Solr as the search backend, on the other hand, allows you to use all of the advanced features of a Solr server – like query spell checking, “more like this” suggestions, data replication and easy query time relevancy tuning, to mention just a few. (tags: todo opensource searchengine rend solr nutch) [...]

    March 10, 2009 15:04, StauthamerNet :: Staut’s Family Blog» Blog Archive » links for 2009-03-10

  5. Nice description.. working on first try…

    March 28, 2009 22:00, VIv

  6. Hi, my drupal site uses apache-solr engine to index and query for data. It works great. I just installed nutch 1.0 to take advantage of nutch’s ability to crawl specific external sites for appropriate data and have these data indexed by the same apache-solr engine. I followed the instruction presented here and got my nutch 1.0 to work. Using the provided schema.xml that comes with nutch my site can query and display data coming from nutch crawling. Unfortunately the other data set is now no longer available for query. It is only available when I revert the schema.xml to the original one. Is it possible to combine the two indexes so they can be conveniently available to same apache-solr query. Would you please kindly point me in the right direction. Much thanks.

    April 9, 2009 12:23, Kham

  7. Got it to work. Just a matter of careful merging of the two schema.xml files.

    April 9, 2009 17:32, Kham

  8. You guys are great ! First shot right.

    April 13, 2009 23:57, Ray

  9. Is there a way to get the full markup as opposed to the parsed text as content or another field?

    April 18, 2009 07:21, Ben

  10. Hi, I wonder if there is a command that ask Nutch to return Documents that have been changed in the indexe and show changes.
    Thanks for helping.

    April 27, 2009 07:45, aida

  11. Hello,

    How additional index can be added manually to nutch indexes, using solr?

    May 12, 2009 14:06, Alex

  12. Hi, we followed step by step instructions above. When we went to view the results, we have noticed that the content is being translated into “Chinese” or some Asian language. We see results ONLY if we query in Asian language. what are we doing wrong?

    May 20, 2009 13:52, Sai Thumuluri

  13. Sai, can you elaborate your problem a bit, how is the content translated? Do you have an example URL that demonstrates the problem? Thanks.

    May 28, 2009 23:38, Sami Siren

  14. Thanks for the nice description. I have certain queries.

    Once I crawl the websites using nutch I want to do the following:
    1. Extract entities like people, places, and other master information. Is it possible in nutch?
    2. I need the entire DOM of the page that is parsed. Is it possible in nutch? Or do I have to use someother mechanism.

    The idea is: I have a project requirement where I have to crawl very specific sites, ‘extract specific values from the html DOM’ and index it. Next I want to search these values using solr.
    Will Nutch+Solr suffice, OR I have to use somthing like Nutch+Solr+(Aperture/Tika).

    Thanks for helping

    June 4, 2009 22:22, Jagdish Yadav

  15. If I follow these instructions I can get everything to work swimmingly, however I need to add some ‘custom’ static fields so I was hoping to employ index-extra plug in. I put the plugin in place, crawl with seemingly no problem, but when I try to commit to solr, I get nothing added to the index, and no error. Has anyone else tried to use index-extra along with the nutch/solr integration?

    June 11, 2009 09:06, Edward

  16. I think our problem is related to having our content in charset UTF-16, can you tell us how we can change charset to UTF-8 during index/crawl?

    June 15, 2009 14:45, Sai Thumuluri

  17. I think I have all the nutch stuff working in that the data is searchable through the nutch webapp but I cant get any results through Solr

    bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

    This makes the computer work heavily for an hour or so but I cant see where the solr index goes.

    http://127.0.0.1:8983/solr/admin gives me a web interface but I cant get it to return any results.

    Any idea what I should be looking at?

    Thanks

    July 7, 2009 10:17, AlexMc

  18. I’m having the same problem as AlexMC too.

    July 13, 2009 11:24, Zaihan

  19. Hi,

    I need some help here actually in our project I need to pass XML file as input to the NUTCH instead of .txt files as given in the example.When I tried to pass XML input file I am getting the following exception:-

    WARN crawl.Injector – Skipping :java.net.MalformedURLException: no protocol:
    2009-09-15 18:21:36,556 INFO crawl.Injector – Injector: Merging injected urls into crawl db.
    2009-09-15 18:21:38,321 WARN util.NativeCodeLoader – Unable to load native-hadoop library for your platform… using builtin-java classes where applicable

    Please help me in resolving the same and pls let me know how can I pass .xml file as input.

    Thanks

    September 15, 2009 05:34, Ravikiran A

  20. Hello,
    I am new to Nutch too, excuse my novice question. What does this sentence mean?
    “Now a full Fetch cycle is completed. Next you can repeat step 10 couple of more times to get some more content”.
    is there any possibility to have these steps done automatically?

    September 21, 2009 06:37, LPillet

  21. How can we use nutch and solr for multiple sites?
    What are the steps I need to take care in configuring Nutch and Solr for multiple sites

    November 4, 2009 12:25, ACG

  22. adding the fragment mentioned above to apache-solr-1.3.0/example/solr/conf/solrconfig.xml
    makes solr fail. I get following exception and web site throws error. Please note solr works with original solrconfig.xml. The problem starts as soon as I add the fragement.

    sorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.mortbay.start.Main.invokeMain(Main.java:183)
    at org.mortbay.start.Main.start(Main.java:497)
    at org.mortbay.start.Main.main(Main.java:115)
    Nov 16, 2009 10:02:11 PM org.apache.solr.common.SolrException log
    SEVERE: org.apache.solr.common.SolrException: invalid boolean value:
    at org.apache.solr.common.util.StrUtils.parseBool(StrUtils.java:237)
    at org.apache.solr.common.util.DOMUtil.addToNamedList(DOMUtil.java:140
    at org.apache.solr.common.util.DOMUtil.nodesToNamedList(DOMUtil.java:9

    at org.apache.solr.common.util.DOMUtil.childNodesToNamedList(DOMUtil.j
    a:88)
    at org.apache.solr.common.util.DOMUtil.addToNamedList(DOMUtil.java:142
    at org.apache.solr.common.util.DOMUtil.nodesToNamedList(DOMUtil.java:9

    at org.apache.solr.common.util.DOMUtil.childNodesToNamedList(DOMUtil.j
    a:88)
    at org.apache.solr.core.PluginInfo.(PluginInfo.java:54)
    at org.apache.solr.core.SolrConfig.readPluginInfos(SolrConfig.java:220
    at org.apache.solr.core.SolrConfig.loadPluginInfo(SolrConfig.java:212)
    at org.apache.solr.core.SolrConfig.(SolrConfig.java:184)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreConta
    er.java:134)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.
    va:83)
    at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
    40)
    at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.
    va:594)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.j
    a:1218)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.jav
    500)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:4
    )
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
    40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollecti
    .java:147)
    at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextH
    dlerCollection.java:161)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
    40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollecti
    .java:147)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
    40)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.jav
    117)
    at org.mortbay.jetty.Server.doStart(Server.java:210)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.jav
    40)
    at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImp
    java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcc
    sorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.mortbay.start.Main.invokeMain(Main.java:183)
    at org.mortbay.start.Main.start(Main.java:497)
    at org.mortbay.start.Main.main(Main.java:115)

    November 16, 2009 22:05, Neha

  23. Any ideas what might be wrong? Any help would be highly appreciated.

    November 16, 2009 22:05, Neha

  24. Hi,

    I had the same Problem,
    just change the xml syntax line as follows:

    true

    this should work

    all the best
    clausi

    November 18, 2009 10:35, clausi

  25. sorry there is something missing
    this line:

    <bool name="hl">true</bool>

    November 18, 2009 10:36, clausi

  26. in solrconfig.xml, tag:

    <bool hl="true"/>

    should be

    <bool name="hl">true</bool>

    November 20, 2009 08:13, igor vrdoljak

  27. NOTE: In the request handler config section there are three settings which need to be edited:
    - Where it says “int” should be changed to “str”.
    - Where it says “float” should be changed to “str”.
    - Where it says “boolean” should be changed to “str”.

    With those changes the “/nutch” request handler will be configured correctly and Solr will start as expected.

    November 25, 2009 22:09, Jay Hill

  28. Thanks Jay Hill, my these instructions broke my Solr configuration until I made the changes you mentioned.

    I was also having problems getting Nutch 1.0 to run on top of Xampp Tomcat for Windows and I discovered I needed to use a more recent version of Java.

    The Xampp Tomcat distrib had JRE 1.5 rev 20, and kept failing to run Nutch, logging “java.lang.UnsupportedClassVersionError: Bad version number in .class file “. I worked around this by installing a newer Java version (JRE 1.6 rev 16) and pointing my JRE_HOME to the new location.

    Thx for a really insightful article :-)

    November 28, 2009 17:36, artemgy

  29. Very useful and straightforward guide. Thank you.

    Just one question: does the solrindex command have options or configuration parameters that I can use to, say, index only pages containing/not_containing certain words and discard all the rest?

    G

    December 15, 2009 03:40, Giona

  30. How can I make this multi crawler and avoid duplicate content using multi crawler ??

    December 29, 2009 16:43, jason

  31. How can I rank the content after getting the content?

    December 29, 2009 16:47, jason

  32. Do I use “/nutch” or the path to my nutch directory (e.g. /opt/nutch-1.0)?

    January 5, 2010 16:18, Ken

  33. I executed ‘java -jar start.jar’, and got this at the very end:

    Started SocketConnector @ 0.0.0.0:8983

    I’m about to get solr to load when I url http://localhost:8983/solr/admin

    Does it mean solr is working and running, where I open a new terminal to do others things and not press “Ctlr+C”.

    January 5, 2010 17:32, Ken

  34. Hi
    Can we use this setup to crawl multiple domains and return results for specific domains depending on which user it is ?

    Thanks

    January 6, 2010 17:28, shantanu

  35. [...] open-source search, you can integrate Nutch with Solr, as described in this article: Using Nutch with Solr, in order to get a professional search engine for a company. (mainly [...]

    January 11, 2010 14:23, Christophe Nowicki » Faire son propre moteur de recherche avec Nutch

  36. tnx for the nice and clear howto. Any idea on how to use instead Nutch interface with the Solr capabilities? I mean, fetch by Nutch, search by Nutch interface (very nice and easy) and engine provided by Solr?

    thanks

    January 21, 2010 05:34, Davide

  37. Installed solr and nutch, and executed this tutorial. Had an error during the execution of step 9.:

    plugin.PluginRepository – Ontology Model Loader (org.apache.nutch.ontology.Ontology)
    2010-02-25 17:50:49,346 WARN regex.RegexURLNormalizer – can’t find rules for scope ‘inject’, using default
    2010-02-25 17:50:51,649 WARN mapred.LocalJobRunner – job_local_0001
    java.lang.NumberFormatException: For input string: “free”
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:410)
    at java.lang.Long.parseLong(Long.java:468)
    at org.apache.hadoop.fs.DF.parseExecResult(DF.java:122)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:179)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
    2010-02-25 17:50:52,471 FATAL crawl.Injector – Injector: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
    at org.apache.nutch.crawl.Injector.run(Injector.java:190)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:180)

    Any idea what this relates to?

    February 25, 2010 09:54, José Grilo

  38. I installed solr and was able to access the admin screen. When I install nutch and follow the command in 5d to paste the fragment in solrconfig.xml, solr’s admin fails.

    The error I get is:

    HTTP ERROR: 404

    missing core name in path

    RequestURI=/solr/admin/index.jsp

    Why would adding the snippet cause that?

    February 25, 2010 13:18, Ian
