
Messages in this thread (From, Date):

  1. Stefano Cherchi, 2010-02-04 11:00
  2. Stefano Cherchi, 2010-02-08 07:26
  3. Julien Nioche, 2010-02-08 08:24
  4. Stefano Cherchi, 2010-02-09 09:51
  5. Julien Nioche, 2010-02-09 11:01

[nutch-user] Nutch + Solr: filtering URL while indexing

Subject: Re: Nutch + Solr: filtering URL while indexing
From: Stefano Cherchi <stefanocherchi@...>
Date: 2010-02-08 07:26

Is there nobody out there who can provide some kind of hint?

I'm really stuck with this problem and I cannot figure out what else I can do.

Thanks

S







----- Original message -----
From: Stefano Cherchi <stefanocherchi@yahoo.it>
To: nutch-user@lucene.apache.org
Sent: Thu, 4 February 2010, 17:00:35
Subject: Nutch + Solr: filtering URL while indexing

Hi everybody.

I have been struggling for three days with what should be a trivial problem, and I have not found a solution.

I need to index a few web sites with the following structure:

Page type 1: list of posts (http://www.website.com/list.html?page=XXx), where XXx is a progressive number from 00 to 999. Each page links to the next and the previous list page.

Page type 2: the actual post page (http://www.website.com/post--x_y_z.html), where x_y_z is an arbitrary string of letters and numbers representing the post title.

Page type 3: other content such as static pages, external links, and other unwanted material.

I need to crawl pages of both type 1 and type 2, but I want to index only type 2. Crawling type 1 pages is the only way to reach type 2 pages, because type 2 pages have unpredictable URLs.

So I am indexing step by step, as follows.

I set the following regular expressions in regex-urlfilter.txt:

    +^http://www.website.com/list.html[?]page[=][0-9]{2,3}$
    +^http://www.website.com/post--
    -.

I run

    inject (http://www.website.com/list.html?page=00)

then I cycle N times through

    generate
    fetch
    parse
    updatedb

and I can see that only type 1 and type 2 pages are actually crawled and fetched. Great.

Then I edit regex-urlfilter.txt, leaving only

    +^http://www.website.com/post--
    -.

and run

    invertlinks (with filtering on)
    solrindex

Now I would expect all type 1 pages to be stripped from the linkdb and only type 2 pages to be added to the Solr index, but when I browse the indexed documents I still find both type 1 and type 2 pages.

Can someone please explain why?

Thank you.

S

----------------------------------
"Anyone proposing to run Windows on servers should be prepared to explain what they know about servers that Google, Yahoo, and Amazon don't."
Paul Graham

"A mathematician is a device for turning coffee into theorems."
Paul Erdos (who obviously never met a sysadmin)
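
For readers following the thread, here is a minimal Python sketch of how the two rule sets quoted above would classify the three page types. It is not Nutch code: it assumes the filter reads rules top to bottom, that the first rule whose pattern is found anywhere in the URL decides, and that a URL matching no rule is rejected. The helper names (parse_rules, accepts) and the about.html URL are hypothetical, chosen only for illustration.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Assumed semantics: the first matching rule decides; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# Rule set used during the crawl cycle (from the message above).
CRAWL_RULES = parse_rules("""
+^http://www.website.com/list.html[?]page[=][0-9]{2,3}$
+^http://www.website.com/post--
-.
""")

# Rule set used before invertlinks and solrindex (from the message above).
INDEX_RULES = parse_rules("""
+^http://www.website.com/post--
-.
""")

urls = {
    "type 1 (list)": "http://www.website.com/list.html?page=00",
    "type 2 (post)": "http://www.website.com/post--x_y_z.html",
    "type 3 (other)": "http://www.website.com/about.html",  # hypothetical URL
}

for name, url in urls.items():
    print(name, accepts(CRAWL_RULES, url), accepts(INDEX_RULES, url))
```

Under these assumptions the second rule set accepts only the type 2 URL, so the regular expressions themselves behave as intended. The sketch says nothing about which Nutch steps actually consult regex-urlfilter.txt, which is the open question in this message.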
