Lucid Imagination

  From / Date
  1. "Withanage, Dulip", 2010-02-03 06:08
  2. Ken Krugler, 2010-02-03 13:52
  3. Alexander Aristov, 2010-02-03 14:59
  4. "Withanage, Dulip", 2010-02-04 04:58
  5. Alexander Aristov, 2010-02-04 06:11

[nutch-user] PDF Parsing

Subject: Re: PDF Parsing
From: Alexander Aristov <alexander.aristov@...>
Date: 2010-02-04 06:11

Your problem has nothing to do with PDFs. Do you see any messages or exceptions
when you are merging the indexes?

Best Regards
Alexander Aristov
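
One way to look for such messages is to scan the log written during the merge. The sketch below is only an illustration: the function name `find_problem_lines` is hypothetical, and the log location (commonly `logs/hadoop.log` in a Nutch 1.x install) is an assumption, not something stated in this thread.

```python
import re
from typing import Iterable, List

# Lines that mention an ERROR/FATAL level or any Java exception class name.
PROBLEM = re.compile(r"\b(ERROR|FATAL)\b|Exception")


def find_problem_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that look like errors or exceptions."""
    return [line.rstrip("\n") for line in lines if PROBLEM.search(line)]


# Example use (path is an assumption, adjust to your install):
#     with open("logs/hadoop.log", encoding="utf-8", errors="replace") as log:
#         for hit in find_problem_lines(log):
#             print(hit)
```

Running this once per merge step would show at which added directory the first exception appears.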


On 4 February 2010 12:58, Withanage, Dulip <withanage@asia-europe.uni-heidelberg.de> wrote:

> Thanks for the initial ideas.
>
>> do they really corrupt or they get corrupted when they are downloaded?
>
> Sorry for my false assumption at the beginning. I am absolutely new to
> both Lucene and Nutch. I think the index is not corrupt. It gets
> corrupted in the mergecrawl process. These are my steps:
>
> 1. I have a PDF index of around 2000 documents on a web server.
> 2. I generate one index for each 100 documents.
> 3. Then I use a modified mergecrawl_script to merge the indexes:
>    http://wiki.apache.org/nutch/MergeCrawl
> 4. I add each directory one after the other to make a complete index.
> 5. The merged Lucene index is corrupt after I encounter an index
>    directory of about 400 MB.
>
> -----Original Message-----
> From: Alexander Aristov [mailto:alexander.aristov@gmail.com]
> Sent: Wednesday, February 03, 2010 9:00 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: PDF Parsing
>
> hi
>
> do they really corrupt or they get corrupted when they are downloaded?
> There is a parameter in Nutch which limits downloaded content size. It
> just cuts files and they become corrupted. Check this setting.
>
> Best Regards
> Alexander Aristov
>
> On 3 February 2010 21:52, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>
>> On Feb 3, 2010, at 3:08am, Withanage, Dulip wrote:
>>
>>> I parse a PDF collection using the web crawler.
>>> Some PDFs are corrupt and it makes the whole Lucene index unusable.
>>> Does anybody have any idea how to get around this problem?
>>
>> How does it make the "whole Lucene index unusable"?
>>
>> Normally a corrupt PDF can cause an exception to be thrown during
>> parsing, or it can cause the parser to hang. It might output a bunch
>> of garbage, but that shouldn't cause the index to become invalid.
>>
>> -- Ken
>>
>>> Best regards,
>>> Dulip Withanage, M.Sc
>>> Cluster of Excellence
>>> Karl Jaspers Centre
>>> Heidelberg
>>> e-mail: withanage@asia-europe.uni-heidelberg.de
>>
>> --------------------------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
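
On the content-size setting mentioned in the quoted reply: in Nutch 1.x this is, as far as I know, the `http.content.limit` property (`file.content.limit` for file: URLs), which defaults to 65536 bytes and disables truncation when negative. The thread does not name the property, so treat that as an assumption. A rough way to tell whether a fetched PDF was cut off at that limit is to check for the `%%EOF` marker that a complete PDF carries near its end. The helper name `looks_truncated` below is hypothetical, and this is a heuristic sketch, not a PDF validator.

```python
def looks_truncated(data: bytes, content_limit: int = 65536) -> bool:
    """Heuristic: True if a fetched PDF was probably cut off at the fetch limit.

    content_limit mirrors the assumed Nutch setting; a negative value
    means "no limit", in which case nothing is reported as truncated.
    """
    if not data.startswith(b"%PDF-"):
        # No PDF header, so this check does not apply.
        return False
    # A complete PDF normally has an %%EOF marker within its last ~1 KB.
    has_eof_marker = b"%%EOF" in data[-1024:]
    # Truncated content is exactly as long as the limit (or longer, if the
    # caller passes a smaller limit than the one used for fetching).
    hit_limit = content_limit >= 0 and len(data) >= content_limit
    return hit_limit and not has_eof_marker
```

If many fetched PDFs trip this check, raising the limit and re-fetching is worth trying before looking further at the merge step.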

Apache Solr, Apache Lucene, ApacheCon and their logos are trademarks of the Apache Software Foundation.

© 2010 Lucid Imagination. All rights reserved.