Solr Cell, a new feature in the soon-to-be-released Solr 1.4, lets users send rich documents such as MS Word and Adobe PDF files directly to Solr and have them indexed for search. All of the examples on the Solr Cell wiki page, however, only demonstrate how to send documents using the curl command-line utility, while many Solr users rely on SolrJ, Solr’s Java-based client. So I thought I would post a quick example here (and I’ll update the Wiki) demonstrating how to do this with SolrJ.
For this example, I’m using the standard Solr example and the Solr trunk version from this morning, which I checked out using SVN:
svn co https://svn.apache.org/repos/asf/lucene/solr/trunk apache-solr
Next, after changing into the directory I checked out to, I built the example using Apache Ant:
ant clean example   # slight overkill, but I did it nonetheless
I then changed into the example directory (apache-solr/example in this checkout) and ran:
java -jar start.jar
Solr is now up and running. See the Solr Tutorial for more info on these steps.
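On the code side, I used SolrJ: I created a new SolrServer, constructed the appropriate request containing a ContentStream (essentially a wrapper around a file), and sent it to Solr:
import java.io.File;
import java.io.IOException;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.extraction.ExtractingParams;

public class SolrCellRequestDemo {
  public static void main(String[] args) throws IOException, SolrServerException {
    // Connect to the example Solr instance started above
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // Target the Solr Cell (ExtractingRequestHandler) endpoint
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("apache-solr/site/features.pdf"));
    // Extract only: return the extracted content without indexing it
    req.setParam(ExtractingParams.EXTRACT_ONLY, "true");
    NamedList<Object> result = server.request(req);
    System.out.println("Result: " + result);
  }
}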
In this code, I did an extraction only (nothing is added to the index), but you can easily substitute your own parameters based on the descriptions on the Solr Cell Wiki page. The key class to use is ContentStreamUpdateRequest, which makes sure the ContentStreams are set properly; SolrJ takes care of the rest.
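To make the parameter substitution a little more concrete, here is a minimal JDK-only sketch that assembles the kind of URL the curl examples post to, from the same name/value pairs you would otherwise pass to req.setParam(). The class ExtractUrlSketch and its buildUrl helper are hypothetical names made up for illustration and are not part of SolrJ. The parameter values (doc1, attr_) are placeholders too. The sketch only builds a string; it does not contact Solr.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractUrlSketch {

    // Hypothetical helper (not a SolrJ API): appends URL-encoded request
    // parameters to a handler URL, in the order the map iterates them.
    public static String buildUrl(String handlerUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(handlerUrl);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator)
               .append(encode(p.getKey()))
               .append('=')
               .append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder values; pick parameters from the Solr Cell wiki page
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("literal.id", "doc1");
        params.put("uprefix", "attr_");
        params.put("commit", "true");
        // prints http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&commit=true
        System.out.println(buildUrl("http://localhost:8983/solr/update/extract", params));
    }
}
```

Each entry in the map corresponds to one req.setParam(name, value) call on the SolrJ side, so a curl example from the wiki can be carried over one parameter at a time.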
I hope that gives people a quick idea of how they can send files to Solr Cell via SolrJ. Also note that ContentStreamUpdateRequest is not specific to Solr Cell: you can use it to send CSV to the CSV Update handler, or to send content to any other Request Handler that works with Content Streams for updates.
For completeness, the output from the code above looks like this (some results reformatted for screen width):
Result: {responseHeader={status=0,QTime=1692},null=
Introduction to The Solr Enterprise
Search Server
Table of contents
1 Solr in a Nutshell... 2
2 Solr Uses the Lucene Search Library and Extends it!... 2
3 Detailed Features..2
3.1 Schema... 2
3.2 Query... 3
3.3 Core.... 3
3.4 Caching...3
3.5 Replication...4
3.6 Admin Interface...4
Copyright © 2007 The Apache Software Foundation. All rights reserved.
1. Solr in a Nutshell
Solr is a standalone enterprise search server with a web-services like API. You put
documents in it (called "indexing") via XML over HTTP. You query it via HTTP GET and
receive XML results.
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces - XML and HTTP
• Comprehensive HTML Administration Interfaces
• Server statistics exposed over JMX for monitoring
• Scalability - Efficient Replication to other Solr Search Servers
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture
2. Solr Uses the Lucene Search Library and Extends it!
• A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
• Powerful Extensions to the Lucene Query Language
• Support for Dynamic Faceted Browsing and Filtering
• Advanced, Configurable Text Analysis
• Highly Configurable and User Extensible Caching
• Performance Optimizations
• External Configuration via XML
• An Administration Interface
• Monitorable Logging
• Fast Incremental Updates and Snapshot Distribution
• Distributed search with sharded index on multiple hosts
• XML and CSV/delimited-text update formats
• Easy ways to pull in data from databases and XML files from local disk and HTTP
sources
• Multiple search indices
3. Detailed Features
3.1. Schema
• Defines the field types and fields of documents
• Can drive more intelligent processing
• Declarative Lucene Analyzer specification
Introduction to The Solr Enterprise Search Server
Page 2
Copyright © 2007 The Apache Software Foundation. All rights reserved.
• Dynamic Fields enables on-the-fly addition of new fields
• CopyField functionality allows indexing a single field multiple ways, or combining
multiple fields into a single searchable field
• Explicit types eliminates the need for guessing types of fields
• External file-based configuration of stopword lists, synonym lists, and protected word
lists
• Many additional text analysis components including word splitting, regex and
sounds-like filters
3.2. Query
• HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby)
• Sort by any number of fields
• Advanced DisMax query parser for high relevancy results from user-entered queries
• Highlighted context snippets
• Faceted Searching based on unique field values and explicit queries
• Spelling suggestions for user queries
• More Like This suggestions for given document
• Constant scoring range and prefix queries - no idf, coord, or lengthNorm factors, and no
restriction on the number of terms the query matches.
• Function Query - influence the score by a function of a field's numeric value or ordinal
• Date Math - specify dates relative to "NOW" in queries and updates
• Performance Optimizations
3.3. Core
• Pluggable query handlers and extensible XML data format
• Document uniqueness enforcement based on unique key field
• Batches updates and deletes for high performance
• User configurable commands triggered on index changes
• Searcher concurrency control
• Correct handling of numeric types for both sorting and range queries
• Ability to control where docs with the sort field missing will be placed
• "Luke" request handler for corpus information
3.4. Caching
• Configurable Query Result, Filter, and Document cache instances
• Pluggable Cache implementations
• Cache warming in background
• When a new searcher is opened, configurable searches are run against it in order to
Introduction to The Solr Enterprise Search Server
Page 3
Copyright © 2007 The Apache Software Foundation. All rights reserved.
warm it up to avoid slow first hits. During warming, the current searcher handles live
requests.
• Autowarming in background
• The most recently accessed items in the caches of the current searcher are
re-populated in the new searcher, enabing high cache hit rates across index/searcher
changes.
• Fast/small filter implementation
• User level caching with autowarming support
3.5. Replication
• Efficient distribution of index parts that have changed via rsync transport
• Pull strategy allows for easy addition of searchers
• Configurable distribution interval allows tradeoff between timeliness and cache
utilization
3.6. Admin Interface
• Comprehensive statistics on cache utilization, updates, and queries
• Text analysis debugger, showing result of every stage in an analyzer
• Web Query Interface w/ debugging output
• parsed query output
• Lucene explain() document score detailing
• explain score for documents outside of the requested range to debug why a given
document wasn't ranked higher.
Introduction to The Solr Enterprise Search Server
Page 4
Copyright © 2007 The Apache Software Foundation. All rights reserved.
,null_metadata={stream_source_info=[null],stream_content_type=[null],stream_size=[13242],\
producer=[FOP 0.20.5],stream_name=[null],Content-Type=[application/pdf]}}


Excellent stuff, I’ve been wanting to use the ExtractingRequestHandler like this with SolrJ for a while. One question: Is there a way to batch requests? I’m used to building a collection of SolrInputDocument objects using SolrJ and then firing them off in batches of several hundred. Is something like that possible when using the ExtractingRequestHandler in SolrJ? Or is it necessary to open a request for each file when processing a large batch of files?
September 14, 2009 15:43 — Jay Hill
The ContentStream stuff is all put into a list that is sent to Solr Cell, so you should be able to add multiple streams at a time. How many is going to depend on their size, etc.
September 15, 2009 03:55 — Grant Ingersoll
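Since the practical limit depends on size, one way to approach batching is to group files by both count and total bytes before building each request. The following is only a sketch under that assumption: BatchSketch and partition are hypothetical names, the limits are arbitrary, and it uses strings in place of files so that it runs without Solr. Each resulting batch would then be added to a single ContentStreamUpdateRequest with repeated addFile calls; how the handler treats per-document parameters across several streams in one request is something to verify against your Solr version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToLongFunction;

public class BatchSketch {

    // Hypothetical helper: splits items into consecutive batches holding at
    // most maxItems entries and at most maxBytes in total. An item that is
    // larger than maxBytes on its own still gets a batch to itself.
    public static <T> List<List<T>> partition(List<T> items, ToLongFunction<T> sizeOf,
                                              int maxItems, long maxBytes) {
        List<List<T>> batches = new ArrayList<List<T>>();
        List<T> current = new ArrayList<T>();
        long currentBytes = 0;
        for (T item : items) {
            long size = sizeOf.applyAsLong(item);
            boolean wouldOverflow = current.size() >= maxItems
                    || currentBytes + size > maxBytes;
            if (wouldOverflow && !current.isEmpty()) {
                // Close the current batch and start a new one
                batches.add(current);
                current = new ArrayList<T>();
                currentBytes = 0;
            }
            current.add(item);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Strings stand in for files; their length stands in for file size
        List<String> items = Arrays.asList("aaaa", "bb", "ccc", "d");
        // prints [[aaaa], [bb, ccc], [d]]
        System.out.println(partition(items, String::length, 2, 5));
    }
}
```

With real files, the same call would take a List<File> and File::length as the size function.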
I am brand new to Solr. Downloaded the 1.4 srcs from subversion.
Was able to build from sources and integrate into Tomcat 6. Admin page comes up nicely.
Then tried to post a pdf into the index with some client code as above. It seems to parse it fine but never makes it into the index.
When I check in Solr, it shows 0 docs. I even tried an explicit commit, i.e.
server.commit()
but there are still 0 docs in the index. Any thoughts on what is missing?
Best regards
Peter William
October 4, 2009 10:57 — Peter William
The key to the code above is that it does an Extract Only. Remove that line and I believe it should work.
October 5, 2009 13:38 — Grant Ingersoll
Grant, thanks for your note. Yes, exactly. I tweaked it a bit more a couple of days ago, after asking the question, and got it to work by looking at the Solr Cell stuff. This is what worked for me.
Just including the relevant section:
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
//SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/apache-solr-1.4-dev");
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("data/ch00.pdf"));
req.setParam("literal.id", "data/ch00.pdf");
//req.setParam("uprefix", "attr_");
//req.setParam("fmap.content", "attr_content");
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
—
Thanks
Peter
October 7, 2009 13:26 — Peter William
Hmmm. All I get from this is a lazy loading exception coming from
org.apache.solr.core.RequestHandlers.java line 249
(apache-solr-1.4-dev\src\java)
Anyone else?
November 12, 2009 06:06 — jayp
Ok, have traced it down to classnotfound: ExtractingRequestHandler. This error also occurs when submitting the update/extract request with curl.
November 12, 2009 06:28 — jayp
What do I need to do to configure SolrJ with Tomcat 6.0?
November 23, 2009 05:09 — William Jackson
How do I configure Tomcat 6.0 to use SolrJ?
November 23, 2009 05:09 — William Jackson
SolrJ is a client side technology. I don’t believe you have to do anything with Tomcat.
November 23, 2009 07:05 — Grant Ingersoll
Grant, could you also provide an example of how to use ContentStreamUpdateRequest and remote-stream a local file? (This is marked as a TODO on the wiki page.) I tried the following 3 options with Solr 1.4, with mixed results:
Option 1:
Specifying both calls:
req.addFile(new File("/tmp/features.pdf"));
req.setParam("stream.file", "/tmp/features.pdf");
This works, but the dump output shows the ContentStream being added twice, i.e.:
INFO: add {[/tmp/features.pdf, /tmp/features.pdf]}
Option 2:
Leaving out the req.setParam call and only using req.addFile(new File("/tmp/features.pdf"));
This works, and the dump info shows:
INFO: add [/tmp/features.pdf]
i.e. the ContentStream appears only once. But does the Solr server read the local file directly, or does the SolrJ client send the content over HTTP?
Option 3:
Leaving out the req.addFile call and only specifying:
req.setParam("stream.file", "/tmp/features.pdf");
This causes:
- a Java NullPointerException at CommonsHttpSolrServer.java:381
- no dump output
- indexing to fail
December 2, 2009 12:42 — dorai