Solr Cell, a new feature in the soon-to-be-released Solr 1.4, allows users to send rich documents such as MS Word and Adobe PDF files directly to Solr and have them indexed for search. All of the examples on the Solr Cell wiki page, however, only demonstrate how to send documents using the curl command-line utility, while many Solr users rely on SolrJ, Solr’s Java-based client. Thus, I thought I would post a quick example here (and I’ll update the Wiki) demonstrating how to do this.
For this example, I’m using the standard Solr example and the Solr trunk version from this morning, which I got using SVN:
svn co https://svn.apache.org/repos/asf/lucene/solr/trunk apache-solr
Next, after changing into the directory I checked out to, I built the example using Apache Ant:
ant clean example   # slight overkill, but I did it nonetheless
I then changed into the example directory (apache-solr/example in my checkout) and ran:
java -jar start.jar
Solr is now up and running. See the Solr Tutorial for more info on these steps.
On the code side, I used SolrJ: I created a new SolrServer, constructed the appropriate request containing a ContentStream (essentially a wrapper around a file), and sent it to Solr:
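import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.extraction.ExtractingParams;

public class SolrCellRequestDemo {
  public static void main(String[] args) throws IOException, SolrServerException {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("apache-solr/site/features.pdf"));
    // Extract only: return the extracted content instead of indexing it
    req.setParam(ExtractingParams.EXTRACT_ONLY, "true");
    NamedList<Object> result = server.request(req);
    System.out.println("Result: " + result);
  }
}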
In this code, I did an extraction only, but you can easily substitute your own parameters based on the Solr Cell descriptions on the Wiki page. The key class to use is ContentStreamUpdateRequest, which makes sure the ContentStreams are set properly; SolrJ takes care of the rest.
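As a rough sketch of what substituting your own parameters might look like, the JDK-only snippet below assembles one possible parameter set for indexing a file rather than only extracting it. The parameter names (literal.id, uprefix, fmap.content, commit) are Solr Cell and update parameters; the values, the class name SolrCellParamsSketch, and its two helper methods are hypothetical and exist only for illustration. The uprefix and fmap.content values assume a schema with a matching dynamic field such as attr_*.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Illustrative only: one possible parameter set for an indexing Solr Cell request. */
final class SolrCellParamsSketch {

    /** Builds the parameters; the id value is supplied by the caller. */
    static Map<String, String> indexingParams(String id) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", id);               // value for the schema's unique key field
        params.put("uprefix", "attr_");             // prefix for extracted fields not in the schema (assumed dynamic field)
        params.put("fmap.content", "attr_content"); // rename the extracted "content" field
        params.put("commit", "true");               // commit as part of this request
        return params;
    }

    /** Encodes the parameters the way they would appear in a curl-style URL. */
    static String toQueryString(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = indexingParams("data/ch00.pdf");
        // With SolrJ, each entry would instead be passed as
        // req.setParam(entry.getKey(), entry.getValue()) in place of the EXTRACT_ONLY line.
        System.out.println("http://localhost:8983/solr/update/extract?" + toQueryString(params));
    }
}
```

The map keeps insertion order so the printed URL is easy to compare against a curl example from the Wiki; the order of parameters should not matter to Solr itself.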
I hope that gives people a quick idea of how they can send files to Solr Cell via SolrJ. Also note that ContentStreamUpdateRequest is not specific to Solr Cell: you can use it to send CSV to the CSV Update handler, or to send content to any other Request Handler that works with Content Streams for updates.
For completeness, the output from the code above looks like (some results reformatted for screen width):
Result: {responseHeader={status=0,QTime=1692},null=
Introduction to The Solr Enterprise
Search Server
Table of contents
1 Solr in a Nutshell... 2
2 Solr Uses the Lucene Search Library and Extends it!... 2
3 Detailed Features..2
3.1 Schema... 2
3.2 Query... 3
3.3 Core.... 3
3.4 Caching...3
3.5 Replication...4
3.6 Admin Interface...4
Copyright © 2007 The Apache Software Foundation. All rights reserved.
1. Solr in a Nutshell
Solr is a standalone enterprise search server with a web-services like API. You put
documents in it (called "indexing") via XML over HTTP. You query it via HTTP GET and
receive XML results.
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces - XML and HTTP
• Comprehensive HTML Administration Interfaces
• Server statistics exposed over JMX for monitoring
• Scalability - Efficient Replication to other Solr Search Servers
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture
2. Solr Uses the Lucene Search Library and Extends it!
• A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
• Powerful Extensions to the Lucene Query Language
• Support for Dynamic Faceted Browsing and Filtering
• Advanced, Configurable Text Analysis
• Highly Configurable and User Extensible Caching
• Performance Optimizations
• External Configuration via XML
• An Administration Interface
• Monitorable Logging
• Fast Incremental Updates and Snapshot Distribution
• Distributed search with sharded index on multiple hosts
• XML and CSV/delimited-text update formats
• Easy ways to pull in data from databases and XML files from local disk and HTTP
sources
• Multiple search indices
3. Detailed Features
3.1. Schema
• Defines the field types and fields of documents
• Can drive more intelligent processing
• Declarative Lucene Analyzer specification
Introduction to The Solr Enterprise Search Server
Page 2
Copyright © 2007 The Apache Software Foundation. All rights reserved.
• Dynamic Fields enables on-the-fly addition of new fields
• CopyField functionality allows indexing a single field multiple ways, or combining
multiple fields into a single searchable field
• Explicit types eliminates the need for guessing types of fields
• External file-based configuration of stopword lists, synonym lists, and protected word
lists
• Many additional text analysis components including word splitting, regex and
sounds-like filters
3.2. Query
• HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby)
• Sort by any number of fields
• Advanced DisMax query parser for high relevancy results from user-entered queries
• Highlighted context snippets
• Faceted Searching based on unique field values and explicit queries
• Spelling suggestions for user queries
• More Like This suggestions for given document
• Constant scoring range and prefix queries - no idf, coord, or lengthNorm factors, and no
restriction on the number of terms the query matches.
• Function Query - influence the score by a function of a field's numeric value or ordinal
• Date Math - specify dates relative to "NOW" in queries and updates
• Performance Optimizations
3.3. Core
• Pluggable query handlers and extensible XML data format
• Document uniqueness enforcement based on unique key field
• Batches updates and deletes for high performance
• User configurable commands triggered on index changes
• Searcher concurrency control
• Correct handling of numeric types for both sorting and range queries
• Ability to control where docs with the sort field missing will be placed
• "Luke" request handler for corpus information
3.4. Caching
• Configurable Query Result, Filter, and Document cache instances
• Pluggable Cache implementations
• Cache warming in background
• When a new searcher is opened, configurable searches are run against it in order to
Introduction to The Solr Enterprise Search Server
Page 3
Copyright © 2007 The Apache Software Foundation. All rights reserved.
warm it up to avoid slow first hits. During warming, the current searcher handles live
requests.
• Autowarming in background
• The most recently accessed items in the caches of the current searcher are
re-populated in the new searcher, enabing high cache hit rates across index/searcher
changes.
• Fast/small filter implementation
• User level caching with autowarming support
3.5. Replication
• Efficient distribution of index parts that have changed via rsync transport
• Pull strategy allows for easy addition of searchers
• Configurable distribution interval allows tradeoff between timeliness and cache
utilization
3.6. Admin Interface
• Comprehensive statistics on cache utilization, updates, and queries
• Text analysis debugger, showing result of every stage in an analyzer
• Web Query Interface w/ debugging output
• parsed query output
• Lucene explain() document score detailing
• explain score for documents outside of the requested range to debug why a given
document wasn't ranked higher.
Introduction to The Solr Enterprise Search Server
Page 4
Copyright © 2007 The Apache Software Foundation. All rights reserved.
,null_metadata={stream_source_info=[null],stream_content_type=[null],stream_size=[13242],\
producer=[FOP 0.20.5],stream_name=[null],Content-Type=[application/pdf]}}
Excellent stuff, I’ve been wanting to use the ExtractingRequestHandler like this with SolrJ for a while. One question: Is there a way to batch requests? I’m used to building a collection of SolrInputDocument objects using SolrJ and then firing them off in batches of several hundred. Is something like that possible when using the ExtractingRequestHandler in SolrJ? Or is it necessary to open a request for each file when processing a large batch of files?
September 14, 2009 15:43 — Jay Hill
The ContentStream stuff is all put into a list that is sent to Solr Cell. So, you should be able to send several at a time. How many is going to depend on size, etc.
September 15, 2009 03:55 — Grant Ingersoll
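To make the "how many depends on size" point a little more concrete, here is a minimal JDK-only sketch that groups files into batches under a byte budget. The class SizeBatcherSketch, the batchBySize helper, and the 10 MB budget are all hypothetical. Following the reply above, each resulting batch could then be attached to a single ContentStreamUpdateRequest with repeated addFile calls; that step appears only as a comment because it was not tested here.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

/** Illustrative only: greedy grouping of items into batches under a size budget. */
final class SizeBatcherSketch {

    /**
     * Walks the items in order and starts a new batch whenever adding the next
     * item would push the current batch over maxBytes. An item larger than
     * maxBytes ends up alone in its own batch.
     */
    static <T> List<List<T>> batchBySize(List<T> items, ToLongFunction<T> sizeOf, long maxBytes) {
        List<List<T>> batches = new ArrayList<>();
        List<T> current = new ArrayList<>();
        long currentBytes = 0;
        for (T item : items) {
            long size = sizeOf.applyAsLong(item);
            if (!current.isEmpty() && currentBytes + size > maxBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(item);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        // File paths come from the command line; with no arguments nothing is printed.
        List<File> files = new ArrayList<>();
        for (String path : args) {
            files.add(new File(path));
        }
        long budget = 10L * 1024 * 1024; // assumed 10 MB per request
        for (List<File> batch : batchBySize(files, File::length, budget)) {
            // With SolrJ: create one ContentStreamUpdateRequest("/update/extract")
            // and call req.addFile(file) for each file in this batch.
            System.out.println("Batch of " + batch.size() + " file(s): " + batch);
        }
    }
}
```

Keep in mind that parameters set with setParam apply to the whole request, so per-file values such as a literal id would likely need separate handling.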
I am brand new to Solr. I downloaded the 1.4 sources from Subversion.
I was able to build from source and integrate it into Tomcat 6. The admin page comes up nicely.
Then I tried to post a PDF into the index with some client code as above. It seems to parse fine, but the document never makes it into the index.
When I check in Solr, it shows 0 docs. I even tried an explicit commit,
i.e. server.commit(),
but there are still 0 docs in the index. Any thoughts on what is missing?
Best regards
Peter William
October 4, 2009 10:57 — Peter William
The key to the code above is that it does an Extract Only. Remove that line and I believe it should work.
October 5, 2009 13:38 — Grant Ingersoll
Grant, thanks for your note. Yes, exactly. I tweaked it a bit more a couple of days ago, after asking the question, and got it to work by looking at the Solr Cell stuff. This is what worked for me.
Just including the relevant section:
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
//SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/apache-solr-1.4-dev");
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("data/ch00.pdf"));
req.setParam("literal.id", "data/ch00.pdf");
//req.setParam("uprefix", "attr_");
//req.setParam("fmap.content", "attr_content");
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
—
Thanks
Peter
October 7, 2009 13:26 — Peter William
Hmmm. All I get from this is a lazy loading exception coming from
org.apache.solr.core.RequestHandlers.java line 249
(apache-solr-1.4-dev\src\java)
Anyone else?
November 12, 2009 06:06 — jayp
Ok, have traced it down to classnotfound: ExtractingRequestHandler. This error also occurs when submitting the update/extract request with curl.
November 12, 2009 06:28 — jayp
What do I need to do to configure SolrJ with Tomcat 6.0?
November 23, 2009 05:09 — William Jackson
How do I configure Tomcat 6.0 to use SolrJ?
November 23, 2009 05:09 — William Jackson
SolrJ is a client side technology. I don’t believe you have to do anything with Tomcat.
November 23, 2009 07:05 — Grant Ingersoll
Grant, could you also provide an example of how to use ContentStreamUpdateRequest to remote-stream a local file? (This is marked as a TODO on the wiki page.) I tried the following 3 options with Solr 1.4, with mixed results:
Option 1:
———
Specifying both calls:
req.addFile(new File("/tmp/features.pdf"));
req.setParam("stream.file", "/tmp/features.pdf");
This works, but the dump output shows the ContentStream being added twice, i.e:
INFO: add {[/tmp/features.pdf, /tmp/features.pdf]}
Option 2
——–
Leaving out the req.setParam call and only using req.addFile(new File("/tmp/features.pdf"));
This works, and the dump info shows:
INFO: add [/tmp/features.pdf]
i.e. the ContentStream only once – but does the Solr Server read the local file directly versus the SolrJ client sending the content over HTTP?
Option 3:
Leaving out the req.addFile call and only specifying:
req.setParam("stream.file", "/tmp/features.pdf");
This causes:
- a Java NullPointerException at CommonsHttpSolrServer.java:381
- no dump output, and
- indexing fails
December 2, 2009 12:42 — dorai
Is there a way to merge/associate indexed documents with another document that has been indexed using DIH? For instance, a PDF is associated with a document in a database through a Content Management System – I want to search on a term in the PDF but in the results I want the associated document in the database, *not* the PDF ?!?!?! Any way to do this? SOLRJ ?
Thanks!
August 16, 2010 05:39 — Eric Mose
Eric,
I haven’t tried it yet, but DIH now fully supports Tika as well, so you may be able to deal with it solely from DIH.
August 16, 2010 06:11 — Grant Ingersoll
Grant – I completely see how to do this … I didn’t realize you could nest the entities in the DIH config XML file. But – Now I get an error :
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load EntityProcessor implementation for entity:2859509962269735
when I try to do:
Any ideas?
August 16, 2010 09:59 — Eric Mose
oops here … you can’t see the code duh
<entity processor="TikaEntityProcessor" url="NAManual.pdf" dataSource="docDataSource" format="text">
  <!-- Do appropriate mapping here; meta="true" means it is a metadata field -->
  <field column="Author" meta="true" name="author"/>
  <field column="title" meta="true" name="docTitle"/>
  <!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it appropriately -->
  <field column="text" name="articleText"/>
</entity>
August 16, 2010 10:00 — Eric Mose
grant – Ah ! TikaEntityProcessor.class is not in my solr 1.4 war ! Where/how can I get it?
August 16, 2010 10:12 — Eric Mose
Hi, I am trying to do simple reads over Lucene indexes using SolrJ. When I try to post my CSV file via SolrJ, it throws this exception even though I specified the field in schema.xml. Here is the code that I wrote:
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
req.addFile(new File("c:/Sample.csv"));
req.setParam("uprefix", "attr_");
req.setParam("fmap.content", "attr_content");
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
NamedList result = server.request(req);
System.out.println("Result: " + result);
When I execute this, I get this exception:
Exception in thread "main" org.apache.solr.common.SolrException: undefined field isbn
request: http://localhost:8983/solr/update/csv?commit=true&waitFlush=true&waitSearcher=true&wt=javabin&version=1
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at Test.main(Test.java:23)
Here is the schema that I wrote: isbn isbn
Can anyone please help me out on this? Thanks.
December 16, 2010 09:50 — Archana
Hi,
I have code as mentioned above:
SolrServer server = new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 100);
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
String fileName = "C:\\test.pdf";
req.addFile(new File(fileName));
req.setParam(ExtractingParams.LITERALS_PREFIX + "contentid", "test");
UpdateResponse resp = req.process(server);
System.out.println("Result: " + resp.getStatus());
resp = server.commit();
System.out.println("Commit: " + resp.getStatus());
I get a status of 0 for both, but I am still not able to search. The count shows 0.
Can you please let me know what I have missed?
Thanks,
geeta
March 8, 2011 14:21 — Geeta