Sent 2010-08-21 by Israel <wegols2@...>
2010/8/21 Israel
>
> Thanks for your help, please help me with this
>
> Hello, I downloaded the parse plugin from: "
> https://issues.apache.org/jira/browse/NUTCH-185", and I don't know where to
> put this:
>
>
>> Added to "parse-plugins.xml"
>>
>> ...
Sent 2010-08-21 by Israel <wegols2@...>
Thanks for your help, please help me with this
Hello, I downloaded the parse plugin from: "
https://issues.apache.org/jira/browse/NUTCH-185", and I don't know where to put
this:
>
> Added to "parse-plugins.xml"
>
>
>
>
>
Sent 2010-08-20 by Volli <illov@...>
Nutch 1.1.
I tested just with
"http://cnx.org/lenses/ccotp/endorsements/atom"
I added to property "plugin.includes" in "nutch-site.xml"
"...parse-(text|html|js|tika|pdf|rss)|feed|..."
(see added "rss" and "feed"; I don't know which one did it).
Added to "parse-plugins.xml"
Sent 2010-08-20 by Israel <wegols2@...>
Hello, I tried to index pages that use XML, RSS, Atom, or even RDF, but
errors occur. I downloaded the "parse xml" plugin but I don't know how to
use it.
I am indexing these pages:
http://cnx.org/lenses/ccotp/endorsements/atom
http://ocw.nd.edu/courselist/rss
http...
Sent 2010-08-20 by Volli <illov@...>
I found this post. I didn't read it in detail. So, just a maybe.
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg12429.html
On 20.08.2010 20:08, Israel wrote:
> Hello, does anyone know how I can search these RSS feeds:
>
> http://ocw.mit.edu/rss/all/mit-allcourses-1.xml
>
> how do I...
Sent 2010-08-20 by Scott Gonyea <scott@...>
I haven't really focused my time on subdomains. I think I saw some in my crawl data, but can't confirm ATM. One question is, are you putting "www." in your injected urls... Or just http://[domain]?
If that doesn't make a difference, then it would seem to me that the regex handler should be the ta...
Sent 2010-08-20 by AJ Chen <ajchen@...>
It may seem slow if you put 5000 domains or paths in regex-urlfilter. But,
after you try it, you may find the performance acceptable. It works for me
anyway.
-aj
On Fri, Aug 20, 2010 at 12:12 PM, Sonal Goyal wrote:
> Hi,
>
> I have a list of about 5000 URLs which I need...
Sent 2010-08-20 by Sonal Goyal <sonalgoyal4@...>
Hi,
I have a list of about 5000 URLs which I need to crawl and fetch using
Nutch. I want to do a very deep crawl on each and I want subdomains, but I
don't want external links. If I set db.ignore.external.links, I don't get the
subdomains. So I can't use that. If I set the domain in regex-urlfilter...
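The regex-urlfilter approach discussed in this thread could be scripted rather than typed by hand. Below is a minimal Python sketch, not anything posted by the participants: it turns a list of domains into one accept rule per domain (covering the domain and its subdomains) plus a final reject-all rule. The function name `build_filter_lines` and the sample domains are hypothetical, and the rule syntax assumes the usual `+regex` / `-regex` line format of Nutch's regex-urlfilter file.

```python
import re


def build_filter_lines(domains):
    """Return regex-urlfilter style lines for the given domains.

    Each domain gets one '+' rule that accepts the bare domain and any
    subdomain of it. A trailing '-.' rule rejects every other URL.
    (Hypothetical helper; the rule format is an assumption about the
    filter file, not something quoted from the thread.)
    """
    lines = []
    for domain in domains:
        escaped = re.escape(domain)  # treat dots in the domain literally
        lines.append(rf"+^https?://([a-z0-9-]+\.)*{escaped}(/|$)")
    lines.append("-.")  # reject anything not accepted above
    return lines


if __name__ == "__main__":
    # Placeholder domains; a real run would read the ~5000 entries from a file.
    for line in build_filter_lines(["example.org", "example.com"]):
        print(line)
```

With about 5000 domains this produces about 5000 rules, which matches the setup AJ Chen reports as having acceptable performance.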
Sent 2010-08-20 by Israel <wegols2@...>
Hello, does anyone know how I can search these RSS feeds:
http://ocw.mit.edu/rss/all/mit-allcourses-1.xml
How do I configure the "crawl-urlfilter", and should I add plugins to
"nutch-site."
Sent 2010-08-20 by Volli <illov@...>
Hello, this is my first message to the Nutch mailing list. I
hope I am sending it to the right recipient.
In nutch-1.1 I checked nutch-default.xml for new properties.
There I found "crawl.gen.delay" with default value "604800000".
Description says "This value, expressed in days ... Default
value of this is ...
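The quoted default can be checked with quick arithmetic. This is a minimal Python sketch that assumes the number is a count of milliseconds; that unit is an assumption, since the description quoted above is truncated and mentions days. The helper name `ms_to_days` is made up for illustration.

```python
# Default quoted in the message above for "crawl.gen.delay".
CRAWL_GEN_DELAY = 604_800_000

# Assumption: the value is in milliseconds.
MS_PER_DAY = 24 * 60 * 60 * 1000


def ms_to_days(ms):
    """Convert a millisecond count to days."""
    return ms / MS_PER_DAY


if __name__ == "__main__":
    print(ms_to_days(CRAWL_GEN_DELAY))  # 7.0
```

Read as milliseconds, the default is exactly 7 days. Read literally as a number of days, it would be an implausibly long delay.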