Lucid Imagination

Found 36,202 results in 0.132 seconds. Displaying page 6 of 3,621.

  1. [nutch-user] RE: find segment for an url

    Sent 2010-08-24 by Henry Noerdlinger <hnoerdlinger@...>

    Thank you for the response. I ran a simple test where I constructed a QueryParams object with a field / value of "url" and "http://blahblah.com/", added this to a Query object, and passed it to my beloved NutchBean to search, like this: String urlVal = "http://domain.com/webapp/conten...

  2. [nutch-user] Re: Staying in Domain

    Sent 2010-08-24 by "emmanuel.csantana" <emmanuel.csantana@...>

    "... don't you achieve the same functionality using the db.ignore.external.links property in nutch-default.xml?" I have a similar doubt. using db.ignore.external.links won't keep it from reaching external domains that it can get from a redirection. as extracted from : http://lucene.472066.n3.n...

  3. [nutch-user] Re: nutch plugin to filter indexing by content!

    Sent 2010-08-24 by Ahmad Al-Amri <amri_jo@...>

    Thanks. In my case I don't want to save the content of the page in segments, so as not to waste disk space on unneeded data. I guess it's simpler to do this while indexing, by implementing an index filter that skips documents that include those words. Regards ...

  4. [nutch-user] Re: find segment for an url

    Sent 2010-08-24 by CatOs Mandros <cat.os.mandros@...>

    Hi Henry, If I'm not mistaken, the correct way to handle this is to query your index. It should have the information about which segment the URL is located in. Then you only have to run your code on the returned segment to get the content. On Tue, Aug 24, 2010 at 12:24 AM, Henry Noerdlinge...

  5. [nutch-user] Re: Crawl atom, rss, xml .... I need any plugin extra?

    Sent 2010-08-23 by Israel <wegols2@...>

    Hello volley, please help me one more time. I want to crawl this page, but it doesn't generate anything. Is it possible? http://uc.princeton.edu/main/index.php?option=com_vodcast&view=feed&format=raw or: This page is available .. rss, but leave type plain text, and the nutch search results page shows...

  6. [nutch-user] find segment for an url

    Sent 2010-08-23 by Henry Noerdlinger <hnoerdlinger@...>

    I want to loop through URLs which have been crawled / indexed. I have a (known) subset of URLs that I want to get the (raw) content for. If I know the segment, I can do something like this: String segName = "20100817162607"; String url = "http://adomain.com/awebappOfInterest/someCont...

  7. [nutch-user] Re: obvious duplicates with different hash-values

    Sent 2010-08-23 by reinhard schwab <reinhard.schwab@...>

    Use another signature; it is tolerant of small changes. db.signature.class org.apache.nutch.crawl.TextProfileSignature The default implementation of a page signature. Signatures created with this implementation will be used for dup...

  8. [nutch-user] Re: obvious duplicates with different hash-values

    Sent 2010-08-23 by Andrzej Bialecki <ab@...>

    On 2010-08-23 18:11, Andre Pautz wrote: > Dear list, > > I have a problem with removing duplicates from my Nutch index. If I understood it right, then the dedup option should do the work for me, i.e. remove entries with the same URL or same content (MD5 hash). But unfortunately it doesn't. > > Th...

  9. [nutch-user] Re: obvious duplicates with different hash-values

    Sent 2010-08-23 by Scott Gonyea <me@...>

    Were I to guess, the MD5 hash isn't a hash of the content but, rather, of the CrawlDatum object that Nutch stores. Scott On Mon, Aug 23, 2010 at 9:11 AM, Andre Pautz wrote: > Dear list, > > I have a problem with removing duplicates from my Nutch index. If I > understood it rig...

  10. [nutch-user] obvious duplicates with different hash-values

    Sent 2010-08-23 by Andre Pautz <a-pautz@...>

    Dear list, I have a problem with removing duplicates from my Nutch index. If I understood it right, then the dedup option should do the work for me, i.e. remove entries with the same URL or same content (MD5 hash). But unfortunately it doesn't. The strange thing is that if I check the index wi...

Apache Solr, Apache Lucene, ApacheCon and their logos are trademarks of the Apache Software Foundation.

© 2010 Lucid Imagination. All rights reserved.