Hi everybody. I've been struggling for three days with what should be a fairly trivial problem, and I still have no solution.
I need to index a few web sites with the following structure:
Page type 1: list of posts (http://www.website.com/list.html?page=XXx), where XXx is a sequential number from 00 to 999. Each page links to the next and the previous list page.
Page type 2: the actual post page (http://www.website.com/post--x_y_z.html), where x_y_z is an arbitrary string of letters and numbers representing the post title.
Page type 3: other content such as static pages, external links, and other unwanted stuff.
I need to crawl pages of both type 1 and type 2, but I want to index only type 2. Crawling type 1 pages is the only way to reach type 2 pages, because type 2 pages have unpredictable URLs. So I'm indexing step by step, as follows.
I set the following regular expressions in regex-urlfilter.txt:
+^http://www.website.com/list.html[?]page[=][0-9]{2,3}$
+^http://www.website.com/post--
-.
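As a sanity check on the intent of these rules, here is a minimal Python sketch that emulates them. It assumes the filter tries the rules top to bottom and lets the first pattern found in the URL decide (`+` keeps, `-` drops). It uses Python's `re` module rather than Nutch itself, and the sample URLs (including `about.html`) are made up for illustration.

```python
import re

# Crawl-phase rules, copied from regex-urlfilter.txt above.
# Each entry is (accept?, pattern); order matters.
CRAWL_RULES = [
    (True, r"^http://www.website.com/list.html[?]page[=][0-9]{2,3}$"),
    (True, r"^http://www.website.com/post--"),
    (False, r"."),
]


def url_passes(url, rules):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumption: first match wins, and a URL matching no rule is rejected.
    """
    for accept, pattern in rules:
        if re.search(pattern, url):
            return accept
    return False


# Hypothetical sample URLs, one per page type.
samples = [
    "http://www.website.com/list.html?page=00",
    "http://www.website.com/post--some_post_title.html",
    "http://www.website.com/about.html",
]
for url in samples:
    print(url, "->", "keep" if url_passes(url, CRAWL_RULES) else "drop")
```

Under these assumptions the list page and the post page are kept and the type 3 page is dropped, which matches what the crawl actually does.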
Next I inject the seed URL (http://www.website.com/list.html?page=00),
then I repeat the following cycle N times:
generate
fetch
parse
updatedb
At this point I can see that only type 1 and type 2 pages are actually crawled and fetched. Great.
Then I edit regex-urlfilter.txt, leaving only:
+^http://www.website.com/post--
-.
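The expectation behind this reduced rule set can be sketched in Python. The snippet parses the two lines in the `+regex` / `-regex` format and applies them to a few hypothetical URLs, assuming the first matching rule decides. It only shows what the rules themselves would accept. Whether invertlinks or solrindex really apply the filter to the documents sent to Solr is exactly what is in question here.

```python
import re

# Index-phase contents of regex-urlfilter.txt, as edited above.
INDEX_FILTER_TEXT = """\
+^http://www.website.com/post--
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def surviving(urls, rules):
    """Keep URLs whose first matching rule is a '+' rule (assumed semantics)."""
    kept = []
    for url in urls:
        for accept, pattern in rules:
            if pattern.search(url):
                if accept:
                    kept.append(url)
                break  # first match decides; ignore later rules
    return kept


# Hypothetical URLs of pages fetched during the crawl phase.
crawled = [
    "http://www.website.com/list.html?page=00",
    "http://www.website.com/list.html?page=01",
    "http://www.website.com/post--some_post_title.html",
]
print(surviving(crawled, parse_rules(INDEX_FILTER_TEXT)))
```

If the filter were applied at indexing time with these semantics, only the post URL would remain.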
Then I run:
invertlinks (with filtering on)
solrindex
I would now expect all type 1 pages to be stripped from the linkdb and only type 2 pages to be added to the Solr index, but when I browse the indexed documents I still find pages of both type 1 and type 2.
Can someone please explain why?
Thank you.
S