Sent 2010-09-02 by Markus Jelsma <markus.jelsma@...>
http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F
On Thursday 02 September 2010 11:25:51 Nayanish Hinge wrote:
> Hi,
> I have a doubt, not sure if anybody has already thought about it
> What if nutch crawler fails during its crawling cycles, could we restart
> the c...
Sent 2010-09-02 by Nayanish Hinge <nayanish.hinge@...>
Hi,
I have a specific use case where I need to know at which level (depth) I
fetched the current url.
Currently the depth can be worked out from the for-loop index in
Crawl.java.
But my use case requires this information to be stored in
crawl-datum. Currently Nutch does not have an...
Sent 2010-09-02 by Nayanish Hinge <nayanish.hinge@...>
Hi,
I have a doubt, not sure if anybody has already thought about it
What if nutch crawler fails during its crawling cycles, could we restart the
crawling right from where we left?
I mean, starting with only the unfetched urls.
Thanks
--
Nayanish
Hyderabad
Sent 2010-09-02 by Markus Jelsma <markus.jelsma@...>
In small crawls, you could parse the document right away. For large crawls, however, there may not be enough resources to fetch and parse at the same time.
-----Original message-----
From: Nayanish Hinge
Sent: Thu 02-09-2010 07:39
To: user@nutch.apache.org;
Subject: ...
Sent 2010-09-02 by Nayanish Hinge <nayanish.hinge@...>
Hi,
I was wondering why Nutch has the option of parsing
1. right within the fetcher and
2. also as a separate map-reduce job
In Crawl.java, there is a separate step for crawling. But also based on
"fetcher.parse" property in nutch-default.xml, Fetcher will also parse the
content.
Thanks
--
Nay...
Sent 2010-09-01 by Mark Stephenson <mstephen@...>
Hi,
I am new to Nutch and I'm trying to understand how it handles
redirects. Let's say I want to fetch the following article from the
New York Times:
http://www.nytimes.com/2010/08/30/opinion/30mon1.html
That is the only URL I put in my 'urls' directory. Then I issue the
following comm...
Sent 2010-09-01 by <onlinespending@...>
Hi,
I'd like to use Nutch to crawl a very limited set of pages. But as it's
crawling I'd like for it to only fetch particular pages and files that
match certain criteria. I'd also like to be alerted somehow when
any of these fetched files have been modified (modify date of the file
or ...
Sent 2010-09-01 by AJ Chen <ajchen@...>
in distributed mode, "generate -topN 1000000 -maxNumSegments 3" creates 3
segments, but the size is very uneven: 1.7M, 0.8M, 0.5M.
I also tried fetcher.timelimit.mins=240 in distributed mode, but the fetcher
did not stop after 4 hours. Any idea?
-aj
On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen
Sent 2010-09-01 by "Nemani, Raj" <Raj.Nemani@...>
All,
I am crawling a site that is heavy in rtf, txt and pdf documents in
addition to pages that embed a lot of images. I am using Nutch 1.1 and
running on Windows 7. I am seeing the following errors in my hadoop
logs.
2010-09-01 15:01:26,509 INFO parse.ParserFactory - The parsing p...
Sent 2010-09-01 by Volli <illov@...>
in my bookmarks I found this:
http://efreedom.com/Question/1-3310050/
very short, but author is mentioned.
I'm (yet?) no Java developer. So, further questions to the
community ;-))
On 01.09.2010 15:20, jitendra rajput wrote:
> Hi,
>
> I have gone through the tutorial about writing plugin i...