
September 14, 2009

Posting Rich Documents to Apache Solr using SolrJ and Solr Cell (Apache Tika)

Posted by Grant Ingersoll

Solr Cell, a new feature in the soon-to-be-released Solr 1.4, lets users send rich documents such as MS Word and Adobe PDF files directly to Solr and have them indexed for search.  All of the examples on the Solr Cell wiki page, however, only demonstrate how to send documents using the curl command-line utility, while many Solr users rely on SolrJ, Solr’s Java-based client.  Thus, I thought I would put up a quick example here (and I’ll update the Wiki) demonstrating how to do this.

For this example, I’m using the standard Solr example and this morning’s Solr trunk, which I checked out using SVN:

svn co https://svn.apache.org/repos/asf/lucene/solr/trunk apache-solr

Next, after changing into the directory I checked out to, I built the example using Apache Ant:

ant clean example  //slight overkill, but I did it nonetheless

I then changed into the example directory (trunk/example) and ran:

java -jar start.jar

Solr is now up and running.  See the Solr Tutorial for more info on these steps.

On the code side, I used SolrJ: I created a new SolrServer, constructed the appropriate request containing a ContentStream (essentially a wrapper around a file), and sent it to Solr:

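import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.extraction.ExtractingParams;
public class SolrCellRequestDemo {
  public static void main(String[] args) throws IOException, SolrServerException {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("apache-solr/site/features.pdf"));  // the file to send
    req.setParam(ExtractingParams.EXTRACT_ONLY, "true");     // extract the text only, index nothing
    NamedList<Object> result = server.request(req);
    System.out.println("Result: " + result);
  }
}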

In this code, I did an extraction only, but you can easily substitute your own parameters based on the descriptions on the Solr Cell Wiki page.  The key class to use is ContentStreamUpdateRequest, which makes sure the ContentStreams are set properly; SolrJ takes care of the rest.
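
To make the parameter substitution a bit more concrete, here is a minimal sketch, using only the JDK, of how a set of Solr Cell parameters ends up as the query string on the /update/extract URL. The class and method names (ExtractUrlSketch, buildUrl) are hypothetical and not part of SolrJ, and the id value "doc1" is made up. The parameter names extractOnly, literal.id and commit are the ones used with Solr Cell.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of SolrJ: shows how request parameters
// become the query string that the extracting handler receives.
final class ExtractUrlSketch {

  // Joins URL-encoded key=value pairs onto the handler URL, in insertion order.
  static String buildUrl(String solrBase, String handlerPath, Map<String, String> params) {
    StringJoiner query = new StringJoiner("&");
    for (Map.Entry<String, String> e : params.entrySet()) {
      query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
          + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
    }
    String url = solrBase + handlerPath;
    return params.isEmpty() ? url : url + "?" + query;
  }

  public static void main(String[] args) {
    String base = "http://localhost:8983/solr";

    // Extraction only: Solr returns the extracted text and indexes nothing.
    Map<String, String> extractOnly = new LinkedHashMap<>();
    extractOnly.put("extractOnly", "true");
    System.out.println(buildUrl(base, "/update/extract", extractOnly));

    // Indexing: drop extractOnly, supply an id and ask for a commit.
    Map<String, String> index = new LinkedHashMap<>();
    index.put("literal.id", "doc1"); // "doc1" is a made-up id
    index.put("commit", "true");
    System.out.println(buildUrl(base, "/update/extract", index));
  }
}
```

With SolrJ you would not build this URL yourself; each entry corresponds to one req.setParam(name, value) call on the ContentStreamUpdateRequest.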

I hope that gives people a quick idea of how they can send files to Solr Cell via SolrJ.  Also note that ContentStreamUpdateRequest is not specific to Solr Cell: you can send CSV to the CSV Update handler, or use any other Request Handler that works with Content Streams for updates.
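
As a rough illustration of that last point, the sketch below picks a handler path from the file name: CSV files go to /update/csv and everything else goes to Solr Cell at /update/extract. The class and method names are hypothetical, and routing every non-CSV file to Solr Cell is an assumption made for the example, not a rule Solr imposes.

```java
import java.util.Locale;

// Hypothetical helper, not part of SolrJ: picks the update handler path
// for a file based on its extension.
final class HandlerPathSketch {

  static String handlerFor(String fileName) {
    String lower = fileName.toLowerCase(Locale.ROOT);
    if (lower.endsWith(".csv")) {
      return "/update/csv";   // CSV update handler
    }
    return "/update/extract"; // Solr Cell for everything else (assumption)
  }

  public static void main(String[] args) {
    System.out.println(handlerFor("books.csv"));
    System.out.println(handlerFor("features.pdf"));
  }
}
```

The returned path is what you would pass to the ContentStreamUpdateRequest constructor before calling addFile.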

For completeness, the output from the code above looks like this (some results reformatted for screen width):

Result: {responseHeader={status=0,QTime=1692},null=
Introduction to The Solr Enterprise Search Server Table of contents 1 Solr in a Nutshell... 2 2 Solr Uses the Lucene Search Library and Extends it!... 2 3 Detailed Features..2 3.1 Schema... 2 3.2 Query... 3 3.3 Core.... 3 3.4 Caching...3 3.5 Replication...4 3.6 Admin Interface...4 Copyright © 2007 The Apache Software Foundation. All rights reserved.
1. Solr in a Nutshell Solr is a standalone enterprise search server with a web-services like API. You put documents in it (called "indexing") via XML over HTTP. You query it via HTTP GET and receive XML results. • Advanced Full-Text Search Capabilities • Optimized for High Volume Web Traffic • Standards Based Open Interfaces - XML and HTTP • Comprehensive HTML Administration Interfaces • Server statistics exposed over JMX for monitoring • Scalability - Efficient Replication to other Solr Search Servers • Flexible and Adaptable with XML configuration • Extensible Plugin Architecture 2. Solr Uses the Lucene Search Library and Extends it! • A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys • Powerful Extensions to the Lucene Query Language • Support for Dynamic Faceted Browsing and Filtering • Advanced, Configurable Text Analysis • Highly Configurable and User Extensible Caching • Performance Optimizations • External Configuration via XML • An Administration Interface • Monitorable Logging • Fast Incremental Updates and Snapshot Distribution • Distributed search with sharded index on multiple hosts • XML and CSV/delimited-text update formats • Easy ways to pull in data from databases and XML files from local disk and HTTP sources • Multiple search indices 3. Detailed Features 3.1. Schema • Defines the field types and fields of documents • Can drive more intelligent processing • Declarative Lucene Analyzer specification Introduction to The Solr Enterprise Search Server Page 2 Copyright © 2007 The Apache Software Foundation. All rights reserved.
• Dynamic Fields enables on-the-fly addition of new fields • CopyField functionality allows indexing a single field multiple ways, or combining multiple fields into a single searchable field • Explicit types eliminates the need for guessing types of fields • External file-based configuration of stopword lists, synonym lists, and protected word lists • Many additional text analysis components including word splitting, regex and sounds-like filters 3.2. Query • HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby) • Sort by any number of fields • Advanced DisMax query parser for high relevancy results from user-entered queries • Highlighted context snippets • Faceted Searching based on unique field values and explicit queries • Spelling suggestions for user queries • More Like This suggestions for given document • Constant scoring range and prefix queries - no idf, coord, or lengthNorm factors, and no restriction on the number of terms the query matches. • Function Query - influence the score by a function of a field's numeric value or ordinal • Date Math - specify dates relative to "NOW" in queries and updates • Performance Optimizations 3.3. Core • Pluggable query handlers and extensible XML data format • Document uniqueness enforcement based on unique key field • Batches updates and deletes for high performance • User configurable commands triggered on index changes • Searcher concurrency control • Correct handling of numeric types for both sorting and range queries • Ability to control where docs with the sort field missing will be placed • "Luke" request handler for corpus information 3.4. Caching • Configurable Query Result, Filter, and Document cache instances • Pluggable Cache implementations • Cache warming in background • When a new searcher is opened, configurable searches are run against it in order to Introduction to The Solr Enterprise Search Server Page 3 Copyright © 2007 The Apache Software Foundation. All rights reserved.
warm it up to avoid slow first hits. During warming, the current searcher handles live requests. • Autowarming in background • The most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabing high cache hit rates across index/searcher changes. • Fast/small filter implementation • User level caching with autowarming support 3.5. Replication • Efficient distribution of index parts that have changed via rsync transport • Pull strategy allows for easy addition of searchers • Configurable distribution interval allows tradeoff between timeliness and cache utilization 3.6. Admin Interface • Comprehensive statistics on cache utilization, updates, and queries • Text analysis debugger, showing result of every stage in an analyzer • Web Query Interface w/ debugging output • parsed query output • Lucene explain() document score detailing • explain score for documents outside of the requested range to debug why a given document wasn't ranked higher. Introduction to The Solr Enterprise Search Server Page 4 Copyright © 2007 The Apache Software Foundation. All rights reserved.
,null_metadata={stream_source_info=[null],stream_content_type=[null],stream_size=[13242],\ producer=[FOP 0.20.5],stream_name=[null],Content-Type=[application/pdf]}}

Category: apache, Solr

20 Responses to “Posting Rich Documents to Apache Solr using SolrJ and Solr Cell (Apache Tika)”

  1. Excellent stuff, I’ve been wanting to use the ExtractingRequestHandler like this with SolrJ for awhile. One question: Is there a way to batch requests? I’m used to building a collection of SolrInputDocument objects using SolrJ and then firing them off in batches of several hundred. Is something like that possible when using the ExtractingRequestHandler in SolrJ? Or is it necessary to open a request for each file when processing a large batch of files?

    September 14, 2009 15:43 — Jay Hill

  2. The ContentStream stuff is all put into a list that is sent to Solr Cell. So, you should be able to put multiple at a time. How many is going to depend on size, etc.

    September 15, 2009 03:55 — Grant Ingersoll
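
One way to act on the "depends on size" advice is to group files into batches capped by both count and total bytes, then send one request per batch. The sketch below shows only the grouping step, using the JDK alone. The class name, method name and both limits are hypothetical, and whether several files per request suits a given handler is something to verify, since request parameters apply to the whole request rather than to individual files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical helper: splits items into batches that stay within a maximum
// item count and a maximum total size. An item larger than maxBytes still
// gets a batch of its own.
final class BatchSketch {

  static <T> List<List<T>> batch(List<T> items, ToLongFunction<T> sizeOf,
                                 int maxItems, long maxBytes) {
    List<List<T>> batches = new ArrayList<>();
    List<T> current = new ArrayList<>();
    long currentBytes = 0;
    for (T item : items) {
      long size = sizeOf.applyAsLong(item);
      boolean full = current.size() >= maxItems || currentBytes + size > maxBytes;
      if (full && !current.isEmpty()) {
        // Close the current batch and start a new one.
        batches.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(item);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    // Strings stand in for files here; their length stands in for file size.
    List<String> items = Arrays.asList("aaaa", "bb", "ccc", "d");
    System.out.println(batch(items, String::length, 2, 5));
  }
}
```

With real files, the items would be java.io.File objects and the size function File::length; each resulting batch would get one ContentStreamUpdateRequest with one addFile call per file.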

  4. I am brand new to Solr. Downloaded the 1.4 srcs from subversion.
    Was able to build from sources and integrate into Tomcat 6. Admin page comes up nicely.
    Then tried to post a pdf into the index with some client code as above. It seems to parse it fine but never makes it into the index.
    When I check in Solr. it shows 0 docs. Even tried an explicit commit.
    ie server.commit()
    but still 0 docs in index.. Any thoughts on what is missing.?

    Best regards
    Peter William

    October 4, 2009 10:57 — Peter William

  5. The key to the code above is that it does an Extract Only. Remove that line and I believe it should work.

    October 5, 2009 13:38 — Grant Ingersoll

  6. Grant- Thanks for your note. Yes, exactly. I tweaked it a bit more a couple of days ago, after asking the question, and got it to work by looking at the Solr Cell stuff. This is what worked for me.
    Just including the relevant section…

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    //SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/apache-solr-1.4-dev");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");

    req.addFile(new File("data/ch00.pdf"));
    req.setParam("literal.id", "data/ch00.pdf");
    //req.setParam("uprefix", "attr_");
    // req.setParam("fmap.content", "attr_content");

    req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    —

    Thanks
    Peter

    October 7, 2009 13:26 — Peter William

  7. Hmmm. All I get from this is a lazy loading exception coming from
    org.apache.solr.core.RequestHandlers.java line 249
    (apache-solr-1.4-dev\src\java)

    Anyone else?

    November 12, 2009 06:06 — jayp

  8. Ok, have traced it down to classnotfound: ExtractingRequestHandler. This error also occurs when submitting the update/extract request with curl.

    November 12, 2009 06:28 — jayp

  9. What do I need to do to configure SolrJ with Tomcat 6.0?

    November 23, 2009 05:09 — William Jackson

  10. How do I configure Tomcat 6.0 to use SolrJ?

    November 23, 2009 05:09 — William Jackson

  11. SolrJ is a client side technology. I don’t believe you have to do anything with Tomcat.

    November 23, 2009 07:05 — Grant Ingersoll

  12. Grant, Could you also provide an example of how to use ContentStreamUpdateRequest and remote-stream a local file (this is marked as a TODO on the wiki page). I tried the following 3 options with Solr 1.4 with mixed results:

    Option 1:
    ---------
    Specifying both calls:
    req.addFile(new File("/tmp/features.pdf"));
    req.setParam("stream.file", "/tmp/features.pdf");
    This works, but the dump output shows the ContentStream being added twice, i.e.:
    INFO: add {[/tmp/features.pdf, /tmp/features.pdf]}

    Option 2:
    ---------
    Leaving out the req.setParam call and only using req.addFile(new File("/tmp/features.pdf"));
    This works, and the dump info shows:
    INFO: add [/tmp/features.pdf]
    i.e. the ContentStream only once – but does the Solr Server read the local file directly versus the SolrJ client sending the content over HTTP?

    Option 3:
    ---------
    Leaving out the req.addFile call and only specifying:
    req.setParam("stream.file", "/tmp/features.pdf");
    This causes:
    - a Java NullPointerException at: CommonsHTTPSolrServer.java:381
    - no dump output and
    - indexing fails

    December 2, 2009 12:42 — dorai

  14. Is there a way to merge/associate indexed documents with another document that has been indexed using DIH? For instance, a PDF is associated with a document in a database through a Content Management System – I want to search on a term in the PDF but in the results I want the associated document in the database, *not* the PDF ?!?!?! Any way to do this? SOLRJ ?

    Thanks!

    August 16, 2010 05:39 — Eric Mose

  15. Eric,

    I haven’t tried it yet, but DIH now fully supports Tika as well, so you may be able to deal with it solely from DIH.

    August 16, 2010 06:11 — Grant Ingersoll

  16. Grant – I completely see how to do this … I didn’t realize you could nest the entities in the DIH config XML file. But – Now I get an error :
    org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load EntityProcessor implementation for entity:2859509962269735

    when I try to do:

    
              
    
              
    

    Any ideas?

    August 16, 2010 09:59 — Eric Mose

  17. oops here … you can’t see the code duh

          <entity processor="TikaEntityProcessor" url="NAManual.pdf" dataSource="docDataSource" format="text">
              <!--Do appropriate mapping here  meta="true" means it is a metadata field -->
              <field column="Author" meta="true" name="author"/>
              <field column="title" meta="true" name="docTitle"/>
              <!--'text' is an implicit field emited by TikaEntityProcessor . Map it appropriately-->
              <field column="text" name="articleText" />
            </entity>
    

    August 16, 2010 10:00 — Eric Mose

  18. grant – Ah ! TikaEntityProcessor.class is not in my solr 1.4 war ! Where/how can I get it?

    August 16, 2010 10:12 — Eric Mose

  19. Hi, I am trying to make simple readings over Lucene indexes by using SolrJ. When I try to post my “csv” file through SolrJ, it throws the exception below even though I specified this field in schema.xml. Here is the code that I wrote:

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
    req.addFile(new File("c:/Sample.csv"));
    req.setParam("uprefix", "attr_");
    req.setParam("fmap.content", "attr_content");
    req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    NamedList result = server.request(req);
    System.out.println("Result: " + result);

    When I execute this file I get this exception:

    Exception in thread "main" org.apache.solr.common.SolrException: undefined field isbn request: http://localhost:8983/solr/update/csv?commit=true&waitFlush=true&waitSearcher=true&wt=javabin&version=1
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at Test.main(Test.java:23)

    Here is the schema that I wrote: isbn isbn

    Can anyone please please help me out on this? Thanks.

    December 16, 2010 09:50 — Archana

  20. Hi,

    I have a code as mentioned above.

    SolrServer server = new StreamingUpdateSolrServer("http://localhost:8983/solr",100,100);
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    String fileName = "C:\\test.pdf";
    req.addFile(new File(fileName));
    req.setParam(ExtractingParams.LITERALS_PREFIX+"contentid", "test");
    UpdateResponse resp = req.process(server);
    System.out.println("Result: " + resp.getStatus());
    resp = server.commit();
    System.out.println("Commit: " + resp.getStatus());

    I get both the status as 0, but still i m not able to search. The number shows 0.
    Can u please let me know what i have missed?

    Thanks,
    geeta

    March 8, 2011 14:21 — Geeta
