  1. From  Date
  2. Chris Hostetter  2009-12-07 15:51
  3. Grant Ingersoll  2009-12-07 18:29
  4. Noble Paul നോബിള്‍ नोब्ळ्  2009-12-08 00:22
  5. Grant Ingersoll  2009-12-08 06:17
  6. Noble Paul നോബിള്‍ नोब्ळ्  2009-12-08 10:03
  7. Zacarias  2010-01-05 10:48
  8. Zacarias  2010-01-05 10:49
  9. Zacarias  2010-01-05 13:53
  10. Grant Ingersoll  2010-01-05 13:58
  11. Chris Hostetter  2010-01-05 15:18
  12. Jan Høydahl / Cominvent  2010-01-22 17:37
  13. Jan Høydahl / Cominvent  2010-02-08 15:48

[solr-dev] Solr Cell revamped as an UpdateProcessor?

Subject: Re: Solr Cell revamped as an UpdateProcessor?
From: Jan Høydahl / Cominvent <jan.asf@...>
Date: 2010-02-08 15:48

I created an issue for this improvement idea to make sure it doesn't just die away:
https://issues.apache.org/jira/browse/SOLR-1763

--
Jan Høydahl  - search architect
Cominvent AS - http://www.cominvent.com

On 22 Jan 2010, at 23:37, Jan Høydahl / Cominvent wrote:

On 8 Dec 2009, at 00:29, Grant Ingersoll wrote:

On Dec 7, 2009, at 3:51 PM, Chris Hostetter wrote:

As someone with very little knowledge of Solr Cell and/or Tika, I find myself wondering if ExtractingRequestHandler would make more sense as an extractingUpdateProcessor -- where it could be configured to take either binary fields (or string fields containing URLs) out of the Documents, parse them with Tika, and add the various XPath-matching hunks of text back into the document as new fields. Then ExtractingRequestHandler just becomes a handler that slurps up its ContentStreams, adds them as binary data fields, and adds the other literal params as fields. Wouldn't that make things like SOLR-1358, and using Tika with URLs/filepaths in XML and CSV based updates, fairly trivial?

It probably could, but I am not sure how it would work in a processor chain. Then again, I'm not sure I understand how processor chains work all that well either. I also plan on adding, BTW, a SolrJ client for Tika that does the extraction on the client. In many cases, the ExtractingRequestHandler is really only designed for lighter-weight extraction cases, as one would simply not want to send that much rich content over the wire.

Good match. UpdateProcessors are the way to go for functionality that modifies documents prior to indexing. With this, we can mix and match any type of content source with other processing needs. I think it can be beneficial to have the choice to do extraction on the SolrJ side. But you don't always have that choice: if your source is a crawler without built-in Tika, a base64-encoded field in an XML document, or some other random source, you want to do the extraction at an arbitrary place in the chain.

Examples:

Crawler (httpheaders, binarybody) -> TikaUpdateProcessor (+title, +text, +meta...) -> index
XML (title, pdfurl) -> GetUrlProcessor (+pdfbin) -> TikaUpdateProcessor (+text, +meta) -> index
DIH (city, street, lat, lon) -> LatLon2GeoHashProcessor (+geohash) -> index

I propose to model the document processor chain more after FAST ESP's flexible processing chain, which must be seen as an industry best practice. I'm thinking of starting a Wiki page to model what direction we should go.

--
Jan Høydahl  - search architect
Cominvent AS - http://www.cominvent.com
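
To make the "extraction at an arbitrary place in the chain" idea concrete, here is a minimal toy sketch in Python. It is not Solr's UpdateRequestProcessor API and it does not call Tika. The stage names (decode_base64_processor, fake_extract_processor), the field names (body_b64, body_bin, text), and the run_chain helper are all assumptions made up for illustration. It only shows how each stage reads some fields of a document, adds new ones, and hands the document on to the next stage.

```python
import base64
from typing import Any, Callable, Dict, List

# A document is modelled as a plain dict of field name -> value.
Document = Dict[str, Any]
# A processor takes a document and returns the (possibly modified) document.
Processor = Callable[[Document], Document]


def decode_base64_processor(doc: Document) -> Document:
    """Hypothetical stage: replace a base64 string field with raw bytes."""
    if "body_b64" in doc:
        doc["body_bin"] = base64.b64decode(doc.pop("body_b64"))
    return doc


def fake_extract_processor(doc: Document) -> Document:
    """Stand-in for a Tika stage (assumption: not real Tika).

    Real extraction would parse PDF, Word, and so on; here the bytes
    are simply decoded as UTF-8 and stored in a new 'text' field.
    """
    if "body_bin" in doc:
        doc["text"] = doc.pop("body_bin").decode("utf-8")
    return doc


def run_chain(doc: Document, chain: List[Processor]) -> Document:
    """Pass the document through each processor in order."""
    for processor in chain:
        doc = processor(doc)
    return doc


if __name__ == "__main__":
    source = {
        "title": "Manual",
        "body_b64": base64.b64encode(b"hello").decode("ascii"),
    }
    chain = [decode_base64_processor, fake_extract_processor]
    print(run_chain(source, chain))
```

Because every stage only checks for the fields it needs, the extraction stage can sit after any stage that produces binary content, which is the flexibility the example chains above describe.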
