Is there nobody out there who can provide some kind of hint?
I'm really stuck with this problem and I cannot figure out what else I can do.
Thanks
S
----- Original message -----
From: Stefano Cherchi <stefanocherchi@yahoo.it>
To: nutch-user@lucene.apache.org
Sent: Thu 4 February 2010, 17:00:35
Subject: Nutch + Solr: filtering URL while indexing
Hi everybody. I've been struggling for three days now with a seemingly trivial
problem and still have no solution.
I need to index a few web sites with the following structure:
Page type 1: list of posts (http://www.website.com/list.html?page=XXx), where XXx
is a progressive number from 00 to 999. Each page links to the next and
the previous list page.
Page type 2: the actual post page (http://www.website.com/post--x_y_z.html),
where x_y_z is an arbitrary string of letters and numbers representing the post
title.
Page type 3: other content such as static pages, external links, and other
unwanted and useless stuff.
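
To make the three page types concrete, here is a minimal Python sketch that classifies a URL by type. The pattern names and the page_type helper are hypothetical and only mirror the URL shapes described above. The dots are escaped here so that they match a literal "." only.

```python
import re

# Hypothetical patterns mirroring the URL shapes described above.
LIST_PAGE = re.compile(r"^http://www\.website\.com/list\.html\?page=[0-9]{2,3}$")
POST_PAGE = re.compile(r"^http://www\.website\.com/post--")


def page_type(url: str) -> int:
    """Return 1 for a list page, 2 for a post page, 3 for anything else."""
    if LIST_PAGE.search(url):
        return 1
    if POST_PAGE.search(url):
        return 2
    return 3


if __name__ == "__main__":
    for url in (
        "http://www.website.com/list.html?page=00",
        "http://www.website.com/post--x_y_z.html",
        "http://www.website.com/about.html",
    ):
        print(page_type(url), url)
```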
I need to crawl pages of both type 1 and 2 but I want to index only type 2.
Crawling pages of type 1 is the only way to reach type 2 because pages of type 2
have unpredictable URLs. So I'm performing a step-by-step indexing this way:
I set the following regular expressions in regex-urlfilter.txt:
+^http://www.website.com/list.html[?]page[=][0-9]{2,3}$
+^http://www.website.com/post--
-.
inject (http://www.website.com/list.html?page=00)
then I cycle N times
generate
fetch
parse
updatedb
and I can see that only type 1 and type 2 pages are actually crawled and
fetched. Great.
Then I edit the regex-urlfilter.txt leaving only
+^http://www.website.com/post--
-.
and perform
invertlinks (with filtering on)
solrindex
Now I would expect all type 1 pages to be stripped from the linkdb and
only type 2 pages to be added to the Solr index, but when I browse the indexed
documents I still find both type 1 and type 2 pages.
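
The filtering I expect from the two rule sets can be emulated with the small Python sketch below. It assumes that the rules are tried top to bottom, that the first matching rule decides, and that a pattern may match anywhere in the URL. The parse_rules and accepts helpers are hypothetical and are not Nutch code. The sketch only shows which URLs each rule set lets through, not where Nutch actually applies the filter.

```python
import re


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first matching rule wins; no match means reject."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Rule set used while crawling (copied from the message).
CRAWL_RULES = parse_rules("""
+^http://www.website.com/list.html[?]page[=][0-9]{2,3}$
+^http://www.website.com/post--
-.
""")

# Rule set used before invertlinks and solrindex (copied from the message).
INDEX_RULES = parse_rules("""
+^http://www.website.com/post--
-.
""")

if __name__ == "__main__":
    urls = [
        "http://www.website.com/list.html?page=00",
        "http://www.website.com/post--x_y_z.html",
        "http://www.website.com/about.html",
    ]
    for url in urls:
        print(accepts(CRAWL_RULES, url), accepts(INDEX_RULES, url), url)
```

With the second rule set only the post URL is accepted, which is the result I expected to see in the Solr index.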
Can someone please explain why?
Thank you.
S
----------------------------------
"Anyone proposing to run Windows on servers should be prepared to explain
what they know about servers that Google, Yahoo, and Amazon don't."
Paul Graham
"A mathematician is a device for turning coffee into theorems."
Paul Erdos (who obviously never met a sysadmin)