January 8, 2010

Library Love

Posted by Erik Hatcher

I’ve long had a passion for improving findability within libraries. The richness of the cultural artifacts that one can find with a bit of foraging astonishes the imagination. I had the pleasure of working with the Applied Research in Patacriticism group at the University of Virginia. While building the first version of Collex (collect/exhibit) for NINES, I was approached by Bess Sadler asking about the viability of using Solr for searching and faceting on library data. The library world was just seeing scalable faceting take the stage with NCSU’s Endeca installation, but the price tag kept almost every other institution from enjoying faceting. With the prodding from Bess, I learned a bit about MARC, created some Ruby scripts, invented Solr Flare, and was able to pretty much match what NCSU was doing with only a handful of evenings of hacking. I presented my work in an all-day preconference class on Solr and a keynote at the 2007 code4lib conference.

A lot of things have happened since, in large part because of this initial work. Solr Flare spun off into Blacklight, a Ruby on Rails front-end being used by Stanford’s SearchWorks effort, UVa’s “VIRGO Beta”, and a number of other institutions. VUFind, a PHP-based front-end, is also popular, and there are several other OPACs (online public access catalogs, a fancy name for “website with a search box”) that reside on top of Solr.

VUFind and Blacklight both share a common indexer, SolrMarc.  SolrMarc provides a flexible, extensible tool for mapping the complex standard library MARC format into Solr documents.

Recently it was reported that SolrMarc indexing performance needed help (Stanford reported 12 hours to index 6M records). I couldn’t help but want to help. So I grabbed the latest SolrMarc (version 2.1, in development) and a publicly available MARC file containing 5.7M records, and gave it a try. First I ran SolrMarc against the file, and I killed the job after 9 hours. Rather than looking too deeply into the code to see what might be wrong, I decided to get a baseline on how fast indexing MARC could be using the simplest thing that could possibly work. I created a custom MarcEntityProcessor, a hook into Solr’s DataImportHandler. Using the MARC4J library directly, and indexing only the id and a toString() of the entire MARC record, I was able to index the same dataset in 55 minutes. It went from 22 docs/s to 1,745 docs/s! To be fair, my implementation didn’t do the fancy mappings needed in the real world, and there is still a bit of work to do in order to fully flesh out a DataImportHandler refactoring, but hopefully this new approach will be embraced by the library Solr community.
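For readers who want a feel for how little the "id plus the whole record" baseline involves, here is a minimal stand-alone sketch. It is not the MarcEntityProcessor described above, and it uses neither MARC4J nor the DataImportHandler: it hand-parses the ISO 2709 framing that MARC files use (a 24-byte leader followed by 12-byte directory entries) in plain Java. The class name `MarcSketch`, its methods, and the `id` and `raw` field names are all hypothetical, and the `raw` value is simply the record's bytes decoded as UTF-8 rather than MARC4J's `toString()` output.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: maps each MARC record to a two-field document. */
public class MarcSketch {
    private static final int LEADER_LEN = 24;     // fixed-size record leader
    private static final int DIR_ENTRY_LEN = 12;  // tag(3) + length(4) + start(5)
    private static final byte FIELD_TERMINATOR = 0x1E;

    /** Reads every ISO 2709 record from the stream, one document per record. */
    public static List<Map<String, String>> readAll(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<Map<String, String>> docs = new ArrayList<>();
        byte[] leader = new byte[LEADER_LEN];
        int first;
        while ((first = data.read()) >= 0) {      // -1 means clean end of input
            leader[0] = (byte) first;
            data.readFully(leader, 1, LEADER_LEN - 1);
            int recordLength = parseInt(leader, 0, 5);  // leader bytes 0-4
            if (recordLength < LEADER_LEN) {
                throw new IOException("Bad record length: " + recordLength);
            }
            byte[] record = new byte[recordLength];
            System.arraycopy(leader, 0, record, 0, LEADER_LEN);
            data.readFully(record, LEADER_LEN, recordLength - LEADER_LEN);
            docs.add(toDocument(record));
        }
        return docs;
    }

    /** Pulls control field 001 out as the id and keeps the rest unparsed. */
    public static Map<String, String> toDocument(byte[] record) {
        int base = parseInt(record, 12, 5);       // leader bytes 12-16: start of field data
        String id = null;
        for (int pos = LEADER_LEN;
             pos + DIR_ENTRY_LEN <= base && record[pos] != FIELD_TERMINATOR;
             pos += DIR_ENTRY_LEN) {
            String tag = new String(record, pos, 3, StandardCharsets.US_ASCII);
            if (tag.equals("001")) {
                int length = parseInt(record, pos + 3, 4);  // includes the field terminator
                int start = parseInt(record, pos + 7, 5);   // offset relative to base
                id = new String(record, base + start, length - 1, StandardCharsets.UTF_8);
                break;
            }
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", id);
        doc.put("raw", new String(record, StandardCharsets.UTF_8));
        return doc;
    }

    /** Builds a tiny synthetic record holding only an 001 field, for demonstration. */
    public static byte[] sampleRecord(String id) {
        String field = id + "\u001E";
        int fieldLen = field.getBytes(StandardCharsets.UTF_8).length;
        int base = LEADER_LEN + DIR_ENTRY_LEN + 1;  // one directory entry plus its terminator
        int total = base + fieldLen + 1;            // plus the record terminator
        String leader = String.format(Locale.ROOT, "%05dnam a22%05d a 4500", total, base);
        String directory = String.format(Locale.ROOT, "001%04d%05d", fieldLen, 0) + "\u001E";
        return (leader + directory + field + "\u001D").getBytes(StandardCharsets.UTF_8);
    }

    private static int parseInt(byte[] bytes, int offset, int len) {
        return Integer.parseInt(new String(bytes, offset, len, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(sampleRecord("rec42"));
        for (Map<String, String> doc : readAll(in)) {
            System.out.println("id=" + doc.get("id"));
        }
    }
}
```

Each returned map stands in for one Solr document. A real indexer would hand these to Solr and, as noted above, would still need the field mappings that a production catalog requires.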

This was a long-winded way of saying… I’m devoting a chunk of my next couple of months to the needs of the Solr-using library community. My favorite conference of all time, code4lib, is coming up, and I’m getting ramped up. Naomi Dushay (of Stanford) and I are leading a Solr Blackbelt preconference where we’ll be going through heavy topics like query parsing and improving relevancy.

Stay tuned for lots more light being shed on Solr in the Library!

Category: Libraries, Open Source, Solr

Tags: library

3 Responses to “Library Love”

  1. [...] p. 236: EmbeddedSolrServer – Streaming locally available content into Solr is not, IMO, an attractive use of embedded Solr.  It misses being able to run an indexer on a separate machine and multiprocess/thread easily.  I’ll grant that both the rich client application and upgrading from a pure Lucene-based solution make sense for embedded, but these are pretty exceptional use cases.  And the advantages to using Solr over HTTP are pretty extensive (scaling, separation, replication, distributed search).  Also, SolrMarc is not a good example, IMO, of using embedded Solr server.  See my recent foray back into this world. [...]

    January 11, 2010 09:15 — Lucid Imagination » Book Review: Solr 1.4 Enterprise Search Server (Packt) by David Smiley and Eric Pugh

  2. [...] will make your head spin, and unforgiving relevancy for the masses (our own Erik Hatcher hails from that space and is still quite active in it).  Chances are if you’re running into a tough search [...]

    March 8, 2010 08:12 — Lucid Imagination » A billion here, a billion there… pretty soon

  3. Hi,

    Have you profiled SolrMarc to see why it is slow? Did you run it in Embedded or remote mode?

    Checking the source, it appears that SolrMarc does not use the StreamingUpdateSolrServer. It would be a nice test to simply swap out CommonsHttpSolrServer with StreamingUpdateSolrServer with 5 threads in SolrRemoteProxy (http://code.google.com/p/solrmarc/source/browse/tags/release-2.1.1/lib/solrmarc/src/org/solrmarc/solr/SolrRemoteProxy.java) and see whether it helps.

    July 1, 2010 04:52 — Jan Høydahl

Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2011 Lucid Imagination. All rights reserved.
