Sent 2010-08-23 by Scott Gonyea <me@...>
Were I to guess, the md5 hash isn't a hash of the content but, rather, of
the CrawlDatum object that Nutch stores.
Scott
On Mon, Aug 23, 2010 at 9:11 AM, Andre Pautz wrote:
> Dear list,
>
> I have a problem with removing duplicates from my Nutch index. If I
> understood it rig...
Sent 2010-08-23 by Andre Pautz <a-pautz@...>
Dear list,
I have a problem with removing duplicates from my Nutch index. If I understood it right, the dedup option should do the work for me, i.e. remove entries with the same URL or the same content (MD5 hash). Unfortunately, it doesn't.
The strange thing is that if I check the index wi...
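To make the expectation in this message concrete, here is a minimal Python sketch of deduplication by URL or by MD5 of the content. It is not Nutch's implementation: the `dedup` function, the record layout, and the "keep the first occurrence" policy are assumptions made only for illustration.

```python
import hashlib


def dedup(records):
    """Drop records whose URL or content digest has already been seen.

    `records` is an iterable of (url, content) pairs. Keeping the first
    occurrence is an assumption of this sketch, not Nutch's policy.
    """
    seen_urls = set()
    seen_digests = set()
    kept = []
    for url, content in records:
        # MD5 over the page text, so identical content gives an identical digest.
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if url in seen_urls or digest in seen_digests:
            continue
        seen_urls.add(url)
        seen_digests.add(digest)
        kept.append((url, content))
    return kept


# Hypothetical sample data.
records = [
    ("http://example.com/a", "hello world"),
    ("http://example.com/b", "hello world"),   # same content, different URL
    ("http://example.com/a", "changed text"),  # same URL, different content
    ("http://example.com/c", "something else"),
]
unique = dedup(records)
```

If the digest were computed over something other than the page text (for example, a stored object that also carries fetch metadata), two pages with identical text could still produce different digests and survive this kind of check.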
Sent 2010-08-23 by Scott Gonyea <me@...>
Not to my knowledge. You may want to look at where
"regex-normalize.xml" is used and write a plugin there. It would
be useful, certainly. I'm looking to eventually do the same, but at index
time.
Scott
On Mon, Aug 23, 2010 at 8:11 AM, Ahmad Al-Amri wrote:
...
Sent 2010-08-23 by Ahmad Al-Amri <amri_jo@...>
Hello,
I want to check whether a web page contains certain words and, if so, NOT index it
while crawling, and also prevent the URL from being added to my crawldb.
Is there a plug-in that does this, or something similar that I could
start from?
Thank you.
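The check described in this message can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not a Nutch plug-in: the word list and the names `contains_blocked_word` and `filter_pages` are hypothetical.

```python
import re

# Hypothetical list of words that should keep a page out of the index.
BLOCKED_WORDS = {"casino", "lottery"}


def contains_blocked_word(text, blocked=BLOCKED_WORDS):
    """Return True if any blocked word occurs in the text as a whole word."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return not words.isdisjoint(blocked)


def filter_pages(pages):
    """Keep only the (url, text) pairs whose text has no blocked word."""
    return [(url, text) for url, text in pages if not contains_blocked_word(text)]


# Hypothetical sample pages.
pages = [
    ("http://example.com/news", "Local council approves new library"),
    ("http://example.com/ads", "Win big at the CASINO tonight"),
]
allowed = filter_pages(pages)
```

In a real crawl the same decision would have to be made after the page is parsed, because the text is not known before the page has been fetched.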
Sent 2010-08-23 by Israel <wegols2@...>
Great, Volli, thank you very much. Regards, Israel
Sent 2010-08-23 by "Nemani, Raj" <Raj.Nemani@...>
They are intranet URLs, so I went with a generic description. They are
not available outside.
I start with http://Mydomain.com/guidance/wiki/index.php/sylebook
I think +^http://Mydomain\.com/guidance/ will work for me.
Thank you so much for such a detailed explanation.
Thanks again
Raj
---...
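As a rough check of the rule quoted in this message, the Python sketch below mimics how a list of `+`/`-` regex rules can be evaluated: the first rule whose pattern is found in the URL decides, and a URL that matches no rule is rejected. This mirrors the behaviour of Nutch's regex URL filter only as I understand it; the helper names `compile_rules` and `accepts` are made up for this example.

```python
import re

# The rule from the message above; '+' means accept, '-' means reject.
RULE_LINES = [
    r"+^http://Mydomain\.com/guidance/",
]


def compile_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """First matching rule wins; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = compile_rules(RULE_LINES)
ok = accepts("http://Mydomain.com/guidance/wiki/index.php/sylebook", rules)
other = accepts("http://Mydomain.com/other/page", rules)
```

The escaped dot in `Mydomain\.com` matters: without the backslash, `.` would match any character, so a host such as `MydomainXcom` would also pass.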
Sent 2010-08-23 by Volli <illov@...>
I can't identify your URLs.
"http://mysite .
Mydomain.com/guidance/wiki/index.php/sylebook." ??
"http://mysite .
Mydomain.com/guidance/........" ????
What's the URL you start with? Is it
http://Mydomain.com/guidance/
or
http://Mydomain.com/guidance/wiki/index...
Sent 2010-08-22 by "Nemani, Raj" <Raj.Nemani@...>
All,
I am currently using Nutch to crawl an intranet site. I start the crawl
with one seed url as shown below.
http://mysite .
Mydomain.com/guidance/wiki/index.php/sylebook.
What I would like to do is to tell Nutch to skip all URLs that do
not conform to the fol...
Sent 2010-08-22 by Volli <illov@...>
Addendum to my last post:
After rereading my own post: all crawls worked with the
parse-html parser. I think you don't need to update Nutch.
If not:
==>TODO1<==
In conf/parse-plugins.xml:
--FIND:
Sent 2010-08-22 by Volli <illov@...>
I use Nutch version 1.1 (Released 06 June 2010).
I didn't install any additional plugin!
I think your xml-plugin at NUTCH-185 is outdated:
"Resolution:Won't Fix" and "Affects Version/s: 0.7.2, 0.8,
0.8.1".
Check your nutch version (and update).
Check in "nutch-site.xml" at "plugin.inc...