Sent 2010-08-24 by Henry Noerdlinger <hnoerdlinger@...>
Thank you for the response.
I ran a simple test where I constructed a QueryParams object with a field/value pair of "url" and "http://blahblah.com/",
then added it to a Query object and passed that to my beloved NutchBean to search, like this:
String urlVal = "http://domain.com/webapp/conten...
Sent 2010-08-24 by "emmanuel.csantana" <emmanuel.csantana@...>
"... don't you achieve the same
functionality using the db.ignore.external.links property in
nutch-default.xml?"
I have a similar doubt.
Using db.ignore.external.links won't keep the crawler from reaching external domains
that it gets to through a redirection.
As extracted from:
http://lucene.472066.n3.n...
Sent 2010-08-24 by Ahmad Al-Amri <amri_jo@...>
Thanks;
In my case I don't want to save the content of the page in the segments at all,
to avoid wasting disk space on unneeded data.
I guess it's simpler to do this while indexing, by implementing an index filter that skips
documents that include those words.
Regards;
________________________________
F...
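The index-time idea in the message above can be sketched in plain Java. This is only an illustration of the skip decision, not Nutch's indexing filter API: the class name `SkipWordFilter`, the method `shouldSkip`, and the example word list are all hypothetical, and wiring such a check into an actual Nutch plugin is not shown. Note also that a check at indexing time would presumably not reduce segment size, since the content has already been fetched and stored by then.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: decides whether a document should be left out of the index. */
public class SkipWordFilter {
    // Lower-cased words that cause a document to be skipped (assumed list, not from the thread).
    private final Set<String> blockedWords = new HashSet<>();

    public SkipWordFilter(List<String> words) {
        for (String w : words) {
            blockedWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns true when the text contains any blocked word as a whole token. */
    public boolean shouldSkip(String text) {
        if (text == null) {
            return false;
        }
        // Split on anything that is not a letter or digit, so punctuation does not hide a match.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (blockedWords.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SkipWordFilter filter = new SkipWordFilter(List.of("advert", "spam"));
        System.out.println(filter.shouldSkip("This page is SPAM, really."));
        System.out.println(filter.shouldSkip("A perfectly ordinary page."));
    }
}
```

In a real plugin the equivalent of `shouldSkip` returning true would translate into dropping the document from the index; how that is signalled depends on the Nutch version in use.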
Sent 2010-08-24 by CatOs Mandros <cat.os.mandros@...>
Hi Henry,
If I'm not mistaken, the correct way to handle this is to query your
index. It should have the information about which segment the URL is
located in. Then you should only have to run your code on the segment
returned to get the content.
On Tue, Aug 24, 2010 at 12:24 AM, Henry Noerdlinge...
Sent 2010-08-23 by Israel <wegols2@...>
Hello volley. Please help me one more time. I want to crawl this page, but
it doesn't generate anything... is that possible?
http://uc.princeton.edu/main/index.php?option=com_vodcast&view=feed&format=raw
or:
This page is available as RSS, but it is served as plain text, and the Nutch
search results page shows...
Sent 2010-08-23 by Henry Noerdlinger <hnoerdlinger@...>
I want to loop through URLs which have been crawled / indexed.
I have a (known) subset of URLs that I want to get the (raw) content for.
If I know the segment, I can do something like this:
String segName = "20100817162607";
String url = "http://adomain.com/awebappOfInterest/someCont...
Sent 2010-08-23 by reinhard schwab <reinhard.schwab@...>
Use another signature.
It is tolerant of small changes.
db.signature.class
org.apache.nutch.crawl.TextProfileSignature
The default implementation of a page signature. Signatures
created with this implementation will be used for dup...
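To illustrate why a profile-style signature tolerates small changes where a plain MD5 of the content does not, here is a simplified Java sketch. It is not Nutch's TextProfileSignature implementation: the class `SignatureSketch`, the parameters `minTokenLen` and `quant`, and the sample texts are assumptions made for the example. The idea shown is to hash a quantized word-frequency profile instead of the raw text, so rare tokens (such as a changing counter) drop out before hashing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Simplified illustration only; not the algorithm shipped with Nutch. */
public class SignatureSketch {

    /** Hex-encoded MD5 of the UTF-8 bytes of the given string. */
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Hashes a sorted, quantized token-frequency profile rather than the raw text. */
    static String profileSignature(String text, int minTokenLen, int quant) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys give a stable profile
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (tok.length() >= minTokenLen) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant; // round the count down to a multiple of quant
            if (q > 0) {                            // tokens rarer than quant are dropped
                profile.append(e.getKey()).append(' ').append(q).append('\n');
            }
        }
        return md5Hex(profile.toString());
    }

    public static void main(String[] args) {
        String a = "nutch crawl nutch crawl index index visitors 1041";
        String b = "nutch crawl nutch crawl index index visitors 1042";
        System.out.println("raw MD5 equal:     " + md5Hex(a).equals(md5Hex(b)));
        System.out.println("profile sig equal: "
                + profileSignature(a, 3, 2).equals(profileSignature(b, 3, 2)));
    }
}
```

With these sample texts the two raw MD5 values differ because of the changed counter, while the two profile signatures match, which is the behaviour a dedup step needs in order to treat the pages as duplicates.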
Sent 2010-08-23 by Andrzej Bialecki <ab@...>
On 2010-08-23 18:11, Andre Pautz wrote:
> Dear list,
>
> I have a problem with removing duplicates from my Nutch index. If I understood it right, then the dedup option should do the work for me, i.e. remove entries with the same URL or same content (MD5 hash). But unfortunately it doesn't.
>
> Th...
Sent 2010-08-23 by Scott Gonyea <me@...>
Were I to guess, the md5 hash isn't a hash of the content but, rather, of
the CrawlDatum object that Nutch stores.
Scott
On Mon, Aug 23, 2010 at 9:11 AM, Andre Pautz wrote:
> Dear list,
>
> I have a problem with removing duplicates from my Nutch index. If I
> understood it rig...
Sent 2010-08-23 by Andre Pautz <a-pautz@...>
Dear list,
I have a problem with removing duplicates from my Nutch index. If I understood it right, then the dedup option should do the work for me, i.e. remove entries with the same URL or same content (MD5 hash). But unfortunately it doesn't.
The strange thing is that if I check the index wi...