Sent 2010-09-02 by Julien Nioche <lists.digitalpebble@...>
Hi,
You could track the depth of a URL from the seeds by implementing a custom
ScoringFilter. ScoringFilters are called at various points of the workflow,
including when outlinks have been found for a page. The logic would be to
simply increment the depth of the current page and generate a metad...
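
A minimal sketch of the depth-tracking idea described above, in plain Java. It does not use the real Nutch ScoringFilter interface, and it assumes (the snippet is cut off at that point) that the incremented depth is stored as metadata on each outlink. The class name DepthSketch, the key "depth" and the method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DepthSketch {
    // Hypothetical metadata key; not a Nutch constant.
    static final String DEPTH_KEY = "depth";

    // A page with no depth metadata is treated as a seed (depth 0).
    static int depthOf(Map<String, String> meta) {
        return Integer.parseInt(meta.getOrDefault(DEPTH_KEY, "0"));
    }

    // For every outlink of a page, create metadata carrying parent depth + 1.
    static Map<String, Map<String, String>> propagate(
            Map<String, String> parentMeta, List<String> outlinks) {
        String childDepth = Integer.toString(depthOf(parentMeta) + 1);
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        for (String url : outlinks) {
            Map<String, String> childMeta = new LinkedHashMap<>();
            childMeta.put(DEPTH_KEY, childDepth);
            result.put(url, childMeta);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> seedMeta = new LinkedHashMap<>();
        Map<String, Map<String, String>> level1 =
                propagate(seedMeta, List.of("http://example.com/a", "http://example.com/b"));
        Map<String, Map<String, String>> level2 =
                propagate(level1.get("http://example.com/a"), List.of("http://example.com/a/c"));
        System.out.println(level1);
        System.out.println(level2);
    }
}
```

In an actual filter, the same increment would presumably happen at the point where outlinks are handed to the scoring code, and a later step could drop URLs whose depth exceeds a limit.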
Sent 2010-09-02 by AJ Chen <ajchen@...>
Another option for reducing the time spent fetching the last 1% of URLs may be
to use a smaller queue size, I think.
In Fetcher class, the queue size is magically determined as threadCount *
50.
feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
Is there any good reason for factor 50? If...
Sent 2010-09-02 by AJ Chen <ajchen@...>
Thanks Ken for the tips. -aj
On Wed, Aug 18, 2010 at 9:17 AM, Ken Krugler wrote:
> Hi AJ,
>
>
> On Aug 18, 2010, at 7:26am, AJ Chen wrote:
>
>> Thanks for the explanation. I'm using HDFS. What config parameters may
>> help speed up shuffling, merging, sorting a...
Sent 2010-09-02 by Sonal Goyal <sonalgoyal4@...>
Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal
---------- Forwarded message ----------
From: Sonal Goyal
Date: Thu, Sep 2, 2010 at 10:33 PM
Subject: Re: Selective Fetching and Notifying When Files Have Been Modifed
Since Last Fetch
To: us...
Sent 2010-09-02 by Gingras Jean-François <Jean-Francois.Gingras@...>
Hi,
You may want to look for the db.max.outlinks.per.page property in your nutch-[default|site].xml configuration file. The default is 100 outlinks in Nutch 1.0. So, if an index page contains more than 100 links to PDF files, at most 100 of them will be processed for each index page.
Als...
Sent 2010-09-02 by Nayanish Hinge <nayanish.hinge@...>
Hi,
Some websites return HTTP 503 when they throttle hits.
I see that I need to re-implement HttpBase.java to handle this as a
special case and add retry logic (with some exponential back-off).
But in order to get HttpBase used by protocol-http and protocol-httpclient,
we need to override th...
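
A minimal sketch of the retry-with-exponential-back-off idea mentioned above, in plain Java. It is not Nutch's HttpBase and makes no claim about how the protocol plugins are wired together. The class BackoffSketch, its method names and the parameters (baseMillis, maxMillis, maxRetries) are all hypothetical, and the request is modelled as a supplier that returns an HTTP status code.

```java
import java.util.function.IntSupplier;

final class BackoffSketch {
    // Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis.
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        int shift = Math.min(attempt, 20); // clamp the exponent to avoid overflow
        return Math.min(maxMillis, baseMillis << shift);
    }

    // Issue the request; while it returns 503, wait and retry up to maxRetries times.
    // Returns the last status code seen.
    static int fetchWithRetry(IntSupplier request, int maxRetries,
                              long baseMillis, long maxMillis) {
        int status = request.getAsInt();
        for (int attempt = 0; attempt < maxRetries && status == 503; attempt++) {
            try {
                Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
                return status;
            }
            status = request.getAsInt();
        }
        return status;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated server: throttles the first two requests, then answers normally.
        int status = fetchWithRetry(() -> calls[0]++ < 2 ? 503 : 200, 5, 10, 1000);
        System.out.println("status=" + status + " after " + calls[0] + " requests");
    }
}
```

A real crawler would probably also honour a Retry-After header when the server sends one, and would apply the delay per host rather than blocking a fetcher thread.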
Sent 2010-09-02 by "Nemani, Raj" <Raj.Nemani@...>
As part of the following problem (I have posted this already and would
appreciate any help), I am trying to apply timeout.patch using patch.exe
(from Unix Utils) on Windows 7 64-bit.
Both the patch.exe and timeout.patch files are in the top-level folder of
the 1.1 source files (i.e. the top-level folder ...
Sent 2010-09-02 by Julien Nioche <lists.digitalpebble@...>
Hi David,
I haven't used the HBase backend with Gora for quite some time, but from what
I can remember you'll need the following things:
* conf/hbase-site.xml => this should correspond to your local configuration
* conf/gora-hbase-mapping.xml => see below
* conf/gora.properties => don't think t...
Sent 2010-09-02 by David Stuart <david.stuart@...>
Hey All,
I have set up the latest version of Nutch from trunk and am running into a few issues with HBase and injecting URLs. When I run the command
runtime/local/bin/nutch inject runtime/local/seed/
I get
InjectorJob: java.lang.RuntimeException: Could not create datastore
at org.apache....
Sent 2010-09-02 by Andrzej Bialecki <ab@...>
On 2010-09-02 02:45, Mark Stephenson wrote:
> Hi,
>
> I am new to Nutch and I'm trying to understand how it handles redirects.
> Let's say I want to fetch the following article from the New York Times:
>
> http://www.nytimes.com/2010/08/30/opinion/30mon1.html
>
> That is the only URL I put in my ...