January 27, 2010

Search by the Book

Posted by Lance Norskog

By now, many of you have had the opportunity to use the online, searchable version of the LucidWorks Certified Distribution Reference Guide for Solr 1.4. In this post, I’ll describe how we took the original document version of the Reference Guide (LWCDRG) and transformed it into an online resource searchable through Solr.

I hope you find this useful if you are faced with creating a similar online, searchable service from existing documents.

The Content

The LWCDRG itself was composed and edited in OpenOffice Writer (OOW), chosen not least for its open source provenance. While the primary design goal was to create a single downloadable PDF of the full LWCDRG, the use of OOW gave us a helpful starting point for the transformation into its searchable counterpart.

Each chapter in the guide was written as a single OOW file (e.g., Chapter1.sxw, Chapter2.sxw, etc.). To create a truly useful index of the book and simplify navigation of the content, we decided against indexing each chapter as a single document and instead set out to index each section as an independent text document. Thus, Chapter 2 was to be indexed as the following 12 documents:

  • 2 Getting Started
  • 2.1 Installing LucidWorks for Solr
  • 2.1.1 Got Java?
  • 2.1.2 Downloading the LucidWorks for Solr Installer
  • 2.1.3 Running the Installer
  • 2.2 Running LucidWorks for Solr
  • 2.2.1 Fire Up the Server
  • 2.2.2 Add Documents
  • 2.2.3 Ask Questions
  • 2.2.4 Clean Up
  • 2.3 A Quick Overview
  • 2.4 A Step Closer
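The section-per-document split above can be sketched in a few lines of Python. This is only an illustration, not the tooling we actually used: the `Section` type and `parse_heading` helper are hypothetical names, and the sketch assumes every heading starts with a dotted section number followed by its title.

```python
import re
from typing import NamedTuple, Optional


class Section(NamedTuple):
    number: str            # dotted section number, e.g. "2.1.3"
    title: str             # heading text, e.g. "Running the Installer"
    parent: Optional[str]  # enclosing section, e.g. "2.1"; None for the chapter


# A dotted number ("2", "2.1", "2.1.3"), whitespace, then the title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")


def parse_heading(line: str) -> Section:
    """Split a numbered heading into its number, title, and parent number."""
    match = HEADING.match(line.strip())
    if match is None:
        raise ValueError(f"not a numbered heading: {line!r}")
    number, title = match.groups()
    # Everything before the last dot is the parent; a chapter has none.
    parent = number.rpartition(".")[0] or None
    return Section(number, title, parent)


headings = [
    "2 Getting Started",
    "2.1 Installing LucidWorks for Solr",
    "2.1.1 Got Java?",
    "2.2.4 Clean Up",
]
sections = [parse_heading(h) for h in headings]
```

Keeping the parent number alongside each section makes it straightforward to rebuild the chapter hierarchy later, for navigation or for showing a section in context.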

Document Pipeline

The first step of the conversion sequence was to create HTML files. We used Writer2LaTeX, an open source program, to generate XHTML files. Writer2LaTeX turns a LaTeX or OpenOffice document into a navigable sequence of web pages. The pages include HTML page titles, previous/next/up buttons, and basic formatting. We created a separate XHTML page for every section in the book.

Next, we had to spend some time adjusting the configuration of Writer2LaTeX to suit our purpose; its defaults assume it is producing a browsable sequence of web pages, which didn’t square with our intent. In the end, we had to post-process each page as well. (More on the final output format later.)

With the post-processed output in hand, we indexed it using basic Solr. We now had a simple Solr index that could return the desired documents accurately in response to standard Solr queries.
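For readers who have not indexed into Solr before, here is a rough sketch of what one section might look like as a Solr XML update message. The `<add><doc><field name="...">` structure is Solr's standard XML update format; the field names `id`, `title`, and `body` and the `build_add_message` helper are assumptions made for illustration, not our actual schema or code.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Build a Solr XML <add> message from a list of field dictionaries."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            # ElementTree escapes markup in the text, so an HTML fragment
            # is stored as a plain string value in the field.
            field.text = value
    return ET.tostring(add, encoding="unicode")


message = build_add_message([
    {
        "id": "CDRG_ch02_2.3",            # id format taken from the permalinks
        "title": "A Quick Overview",      # hypothetical field name
        "body": "<p>Section text</p>",    # hypothetical field name and content
    }
])
```

A message like this would be posted to the Solr update handler and followed by a commit before the documents become searchable.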

The Search Application

Next, we had some work to do to turn the Solr application into a full-fledged web application that would fetch the indexed HTML and render it within the presentation layer you now see on the search display page (our internal name for this application is LucidFind). And, since LucidFind already included faceting, we used Solr’s basic faceting capabilities to add facets for the LWCDRG along with our website content and blog posts.

At this point we ran into some challenges with the formatting of the documents, which required some work in post-processing. Some heavy scripting and HTML tweaking were required to get the desired, visually consistent behavior for table formatting, image framing, and CSS creation; problems included paragraph spacing, font sizing, and similar HTML presentation issues. We also encountered a bug in Writer2LaTeX, in which the tool didn’t properly capture chapter numbers from the OOW metadata; when we contacted the author, he fixed it for us (gotta love open source!).
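As a rough illustration of the faceting side, the sketch below builds a faceted Solr query URL. The `facet`, `facet.field`, `facet.mincount`, and `wt` parameters are standard Solr request parameters; the base URL, the `source` field name, and the `facet_query_url` helper are assumptions for the example and not LucidFind's actual configuration.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_field):
    """Return a Solr select URL that asks for facet counts on one field."""
    params = [
        ("q", query),
        ("facet", "true"),             # turn faceting on
        ("facet.field", facet_field),  # field to count values for
        ("facet.mincount", "1"),       # hide values with no matching documents
        ("wt", "json"),                # ask for a JSON response
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr instance and a hypothetical "source" field that
# distinguishes the reference guide from website content and blog posts.
url = facet_query_url("http://localhost:8983/solr", "installer", "source")
```

With a field like this populated at index time, the facet counts in the response tell the user how many hits came from the guide, the website, and the blog.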

Some Content Management Challenges

We also placed the book’s images in a static content directory. This took a little extra work, as the original LWCDRG design did not account for this centralized approach: for editing the book and creating the PDF, it is more convenient to bind the images into each chapter. In retrospect, we might have been able to do this with some mechanism using external links.

The id for the document includes the chapter number and section number. The actual document text is the XHTML body element text without the surrounding <body></body> tags.

This makes it easy to pull in the parent sections and include them for context, wrapping the sections in one large HTML page. For testing, an XSL script did exactly this.

Finally, we tackled the matter of permanent links for search output. Today, the link for each section is formed using the chapter number and the section number, like so:
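The same idea can be sketched in Python rather than XSL. This is not the test script itself: the `ancestors` and `page_with_context` helpers are hypothetical, and the sketch assumes the stored fragments are kept in a dictionary keyed by section number.

```python
def ancestors(number):
    """List the enclosing sections of a section number, outermost first."""
    parts = number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def page_with_context(number, fragments):
    """Wrap a section and its parent sections in one HTML page.

    `fragments` maps section numbers to body-less XHTML fragments, so
    they can simply be concatenated inside a single <body> element.
    """
    order = ancestors(number) + [number]
    body = "\n".join(fragments[n] for n in order if n in fragments)
    return f"<html><body>\n{body}\n</body></html>"


# Hypothetical fragments for a chapter, a section, and a subsection.
fragments = {
    "2": "<h1>2 Getting Started</h1>",
    "2.2": "<h2>2.2 Running LucidWorks for Solr</h2>",
    "2.2.1": "<h3>2.2.1 Fire Up the Server</h3>",
}
page = page_with_context("2.2.1", fragments)
```

Because the fragments carry no `<body>` tags of their own, the combined page stays well-formed no matter how many parent sections are included.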

http://www.lucidimagination.com/search/document/CDRG_ch02_2.3
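A small sketch of how such a link could be assembled is shown here. It generalizes from the single example above, so the two-digit zero padding of the chapter number and the handling of other section depths are assumptions; `document_id` and `permalink` are hypothetical helper names.

```python
BASE_URL = "http://www.lucidimagination.com/search/document/"


def document_id(section_number):
    """Build an id such as "CDRG_ch02_2.3" from a section number like "2.3"."""
    # The chapter is the first component of the dotted section number.
    chapter = int(section_number.split(".")[0])
    # Assumption: chapters are zero-padded to two digits, as in "ch02".
    return f"CDRG_ch{chapter:02d}_{section_number}"


def permalink(section_number):
    """Return the permanent link for a section of the guide."""
    return BASE_URL + document_id(section_number)


link = permalink("2.3")
```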

In future editions of the book, it is quite possible that section numbers will change; it’s less likely that the chapter numbers will. A more durable link might instead be a search string for the subsection, presumably its title. It might include the chapter and parent sections as low-boost additions.
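To make that idea concrete, here is one way such a search string might be built. It is purely a sketch of a possible future design: the `title` field name, the boost value of 0.2, and the `title_query` helper are all assumptions. Only the `^` boost operator and quoted phrases are standard Lucene/Solr query syntax.

```python
def quote(phrase):
    """Wrap a phrase in double quotes, escaping backslashes and quotes."""
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def title_query(title, context_titles, context_boost=0.2):
    """Build a query for a section title with low-boost parent titles.

    The section's own title carries the default boost; the chapter and
    parent section titles are added with a smaller boost so they only
    nudge the ranking.
    """
    clauses = [f"title:{quote(title)}"]
    clauses += [f"title:{quote(t)}^{context_boost}" for t in context_titles]
    return " OR ".join(clauses)


query = title_query("A Quick Overview", ["Getting Started"])
```

A link built this way would keep working after a renumbering as long as the section title survives, at the cost of occasionally ranking a similarly titled section first.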

Lance Norskog is a search engineer at Lucid Imagination.


