Solr and LucidWorks: How to Choose the Right Platform

If the LucidWorks Platform is built on Solr, how do you know which one to use in your own circumstances? This article describes the differences among using straight Solr, using the LucidWorks user interface, and using the LucidWorks ReST API to accomplish various common tasks, so you can see which fits your situation at a given moment.

In today's world, building the perfect product is a lot like trying to repair a set of train tracks while the train is barreling down on you. The world just keeps moving, with great ideas and new possibilities tempting you every day. And to make things worse, innovation doesn't just show its face for you; it regularly visits your competitors as well.

That's why you use open source software in the first place. You have smart people; does it make sense to have them building search functionality when Apache Solr already provides it? Of course not. You'd rather rely on the solid functionality that's already been built by the community of Solr developers, and let your people spend their time building innovation into your own products. It's simply a more efficient use of resources.

But what if you need search-related functionality that's not available in straight Solr? In some cases, you may be able to fill those holes and lighten your load with the LucidWorks Platform. Built on Solr, the LucidWorks Platform starts by simplifying the day-to-day tasks involved in using Solr, and then adds features that can help free up your development team for work on your own applications. But how do you know which path would be right for you?

The answers depend on a number of different factors, including the size and scope of the project, your available resources, and even the types of data involved, so let's look at some of the more common tasks involved in running a search-related application. This way, you can make up your own mind as to what's best for you. Let's start by getting the lay of the land.

A search application is basically the same no matter what product you're using, involving the following steps:

  1. Define your indexed content
  2. Define your data
  3. Define user/data domains
  4. Index your data
  5. Build your application (including facets, autocomplete, etc.)
    or
    Issue queries through the built-in query interface
  6. Keep your indexes updated
  7. Control access
  8. Monitor the system

Solr-based applications (including those built as an extension to LucidWorks, rather than using the built-in query page) get their data by issuing an HTTP call and parsing the results. That means they can be used in virtually any programming environment, using virtually any programming language, so language won't be a factor in your decision. Scale will be, though: a single programmer looking to index a single source of documents will likely have different needs than a large corporation with heterogeneous data and multiple stakeholders with different data requirements.
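To make "issuing an HTTP call and parsing the results" concrete, here is a minimal Python sketch. It assumes a Solr instance at http://localhost:8983/solr with the standard /select handler and JSON output (wt=json); the address, the query, and the sample response are illustrative assumptions, not values from this article.

```python
import json
from urllib.parse import urlencode

# Assumed address of a local Solr instance; adjust for your installation.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(query, rows=10):
    """Build the URL for a basic query against Solr's /select handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{SOLR_BASE}/select?{urlencode(params)}"


def extract_docs(raw_json):
    """Pull the list of matching documents out of a JSON query response."""
    body = json.loads(raw_json)
    return body["response"]["docs"]


if __name__ == "__main__":
    print(build_select_url("title:lucene"))
    # A hand-written sample response, standing in for what Solr would return.
    sample = '{"response": {"numFound": 1, "start": 0, "docs": [{"id": "doc1"}]}}'
    print(extract_docs(sample))
```

The same two steps (build a URL, parse the response) apply in any language with an HTTP client and a JSON or XML parser.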

At the most basic level, we have three different options to look at:

  • Straight Solr: In this case, you've got the engine and all of the flexibility that provides, but there's a lot you will need to do yourself. You may get more fine-grained control, but it comes at a price for ongoing maintenance, and might require higher-skilled resources to invest their attention in more mundane work.
  • LucidWorks web-based user interface: The LucidWorks UI provides easy ways to accomplish common tasks, but there are times you might need to go beyond that to the ReST API. A UI is not scriptable, but it's easy to point and click, particularly if some of your administrators are not comfortable with scripting.
  • LucidWorks ReST API: All of the capabilities the LucidWorks UI offers are also available via ReST API. Simply put, the ReST API enables you to accomplish tasks programmatically using HTTP calls, with some scripting; the scripting, in turn, enables creation of repeatable processes and automation.

So let's look at some of the more common tasks you'll need to accomplish when building search into your applications, and what they mean in each of these three environments.

 

Installation

Before you do anything else, of course you're going to have to install the software.

Apache Solr is largely a roll-your-own system, a web application that you need to configure yourself. The actual installation is pretty simple -- just unzip the software. But while the standard distribution includes several example environments, they're not meant for production use, so you'll need to decide how you want your particular instance configured, and run it in a proper web application server.

LucidWorks Platform comes with both a GUI installer and a command-line installer, both of which can be run with the default inputs to create a production system.

Solr:
  1. Unzip the download
  2. If necessary, install within a servlet-capable web server
  3. Configure the Solr instance using solrconfig.xml
  4. Start the application

LW GUI:
  • Run the installer

LW ReST:
  • Included in standard installation

 

Defining fields

The engine behind both Solr and LucidWorks is built on Apache Lucene, and it lets you create various document "fields", each of which can have its own requirements and can be treated differently when it is indexed or searched.

Adding new fields and field types to Solr involves editing the installation's schema.xml file. In this file, you can add new field types, and create fields that use those types.

The LucidWorks Platform includes a web-based user interface that enables you to add additional field types and fields and specify properties such as whether they should be used for spell-checking.

The LucidWorks Platform ReST API includes the ability to add, edit, and delete field types and fields programmatically, so if your requirements change, you can dynamically alter field definitions, for example to add a new field needed for additional data.

Solr:
  1. Determine appropriate filters and tokenizers
  2. Define the field type
  3. Add the type to the schema.xml file
  4. Add the field to the schema.xml file, referencing the type
  5. Validate schema.xml
  6. Restart Solr

LW GUI:
  1. Go to Indexing -> Field Types
  2. Add the field type with appropriate filters and tokenizers or modify an existing field type
  3. Go to Indexing -> Fields
  4. Add the field using the provided web form and specify the appropriate attributes

LW ReST:
  1. Create the field type with a POST request to the FieldType API
  2. Create the field with a POST request to the Fields API
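For the straight-Solr path, the two schema.xml additions (a field type, then a field that references it) might look like the output of the sketch below. The type name text_simple, the field name summary, and the choice of a whitespace tokenizer with a lowercase filter are assumptions made for illustration; your own analysis chain will depend on your data.

```python
import xml.etree.ElementTree as ET


def build_field_type(name):
    """Build a <fieldType> element with a simple analyzer chain."""
    field_type = ET.Element("fieldType", {"name": name, "class": "solr.TextField"})
    analyzer = ET.SubElement(field_type, "analyzer")
    # Split on whitespace, then lowercase each token.
    ET.SubElement(analyzer, "tokenizer", {"class": "solr.WhitespaceTokenizerFactory"})
    ET.SubElement(analyzer, "filter", {"class": "solr.LowerCaseFilterFactory"})
    return field_type


def build_field(name, type_name, stored=True):
    """Build a <field> element that references an existing field type."""
    attributes = {
        "name": name,
        "type": type_name,
        "indexed": "true",
        "stored": "true" if stored else "false",
    }
    return ET.Element("field", attributes)


if __name__ == "__main__":
    # Hypothetical names; paste the output into the matching schema.xml sections.
    print(ET.tostring(build_field_type("text_simple"), encoding="unicode"))
    print(ET.tostring(build_field("summary", "text_simple"), encoding="unicode"))
```

Generating the fragments does not replace the remaining steps: you would still need to validate schema.xml and restart Solr.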

 



Adding a new field using the LucidWorks UI


Editing a field type using the LucidWorks UI

 

Indexing Solr XML data

Originally, the only way to add data for indexing to Solr was to convert it into a specific XML form, called Solr XML. In this format, each doc element has a series of field elements with name attributes that identify the field.

Adding Solr XML documents to Solr is pretty straightforward, involving sending documents to the server using an HTTP POST request. The standard download includes post.jar, which you can run to post individual files (or multiple files, using wild cards) to Solr. (You can also add comma-delimited data this way.)

The LucidWorks Platform includes a user interface for adding Solr XML data by the file or by the directory by creating a "data source". It also includes a scheduler that enables you to set the system to re-index the data source on a regular basis to pick up changes or additions.

The LucidWorks ReST API lets you create a Solr XML data source and schedule it programmatically.

Solr:
  1. Open a command prompt
  2. Run post.jar on the file(s)
  3. Add functionality to your application to periodically re-index the data to keep it current

LW GUI:
  1. Create a new data source pointing to the file or directory
  2. Schedule the new source to run periodically

LW ReST:
  1. Create the data source with POST request to the Data Sources API
  2. Update the indexing schedule with PUT request to the Data Source Schedules API
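If you would rather post Solr XML from your own code than run post.jar, the request can be sketched as follows. The sketch assumes Solr's update handler is reachable at http://localhost:8983/solr/update and that the field names id and title exist in your schema; both are assumptions for illustration. It only builds the request, so nothing is sent until you pass it to urllib.request.urlopen.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed update handler address; commit=true makes the documents searchable right away.
UPDATE_URL = "http://localhost:8983/solr/update?commit=true"


def build_add_xml(docs):
    """Serialize a list of dicts as a Solr XML <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", {"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode").encode("utf-8")


def build_update_request(payload):
    """Wrap the XML payload in an HTTP POST request (not yet sent)."""
    headers = {"Content-Type": "text/xml; charset=utf-8"}
    return urllib.request.Request(UPDATE_URL, data=payload, headers=headers, method="POST")


if __name__ == "__main__":
    xml_payload = build_add_xml([{"id": "doc1", "title": "Hello Solr"}])
    print(xml_payload.decode("utf-8"))
    request = build_update_request(xml_payload)
    print(request.get_method(), request.full_url)
```

Scheduling is still up to you in this case; a cron job or a timer in your application would have to call this code periodically.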


Adding a Solr data source using the LucidWorks UI

 

Indexing non-XML data

A large percentage of data isn't XML, and it's not easily converted to XML (such as that large collection of Word documents and PDF files you've got to search). Solr includes Apache Tika, which enables you to index the content of Microsoft Office files, PDFs, text formats, HTML, and many more. This includes images, audio, and video, though in that case it's mostly the metadata that's indexed.

Indexing non-Solr XML content with Solr involves using cURL to make HTTP POST requests to the Solr engine for each document you want to index.

The LucidWorks Platform includes a number of pre-configured connectors to give you the ability to create a data source that includes entire directories and subdirectories, and the ability to schedule them to be updated periodically to pick up changes and additions.

The LucidWorks ReST API lets you create a filesystem data source and schedule it programmatically.

Solr:
  1. Install cURL
  2. Open a command-line window
  3. For each document you want to index, use cURL to issue a POST request that:
    • Creates a unique identifier for the document
    • Sends the file to Solr to be indexed
  4. Add functionality to your application to periodically re-index the data to keep it current

LW GUI:
  1. Create a new data source pointing to the directory of files
  2. Schedule the new source to run periodically

LW ReST:
  1. Create the data source with POST request to the Data Sources API
  2. Update the indexing schedule with PUT request to the Data Source Schedules API
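The per-document request in the Solr steps above can also be built in code instead of typed into cURL. This sketch assumes the extracting request handler is registered at /update/extract, as it is in Solr's example configuration, and it derives the unique identifier from a hash of the file path, which is one possible choice rather than a requirement. The file path is hypothetical.

```python
import hashlib
from urllib.parse import urlencode

# Assumed handler path, taken from Solr's example configuration.
EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def document_id(path):
    """Derive a stable unique identifier from the file path."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()


def build_extract_url(path, commit=True):
    """Build the URL that the file's bytes would be POSTed to."""
    params = {
        "literal.id": document_id(path),  # sets the document's id field
        "commit": "true" if commit else "false",
    }
    return f"{EXTRACT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical file; the file contents would be sent as the POST body.
    print(build_extract_url("/data/reports/q3-summary.pdf"))
```

Because the identifier is derived from the path, re-indexing the same file replaces the earlier copy instead of creating a duplicate.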


Adding local files of various formats using the LucidWorks UI

 

Crawling web data

The actual indexing of a web page is easy; all three environments make it possible to easily index an HTML file. It's the retrieval of that web page, and all of the web pages linked from it (and linked from them, and so on) that's difficult.

Solr does not actually come with the ability to crawl the web. Instead, you will need to integrate it with another tool, such as Apache Nutch, which will crawl the web for you and retrieve the files, which you can then pass to Solr for indexing.

The LucidWorks Platform has web-crawling already built in, so all you will need to do is create a web data source and specify where you want it to start and how deep you want it to go.

The LucidWorks ReST API lets you create a web data source and schedule it programmatically.

Solr:
  1. Install Nutch
  2. Copy the Nutch schema.xml file to the Solr installation
  3. Edit schema.xml to set the content field as "stored"
  4. Validate the schema.xml file
  5. Add the nutch request handler to the solrconfig.xml file
  6. Validate the solrconfig.xml file
  7. Restart Solr
  8. Configure Nutch
  9. Create a seed list
  10. Inject the seed URLs into the Nutch databases
  11. Generate the fetch list
  12. Set the appropriate environment variables so that Nutch can find the segments directory
  13. Launch the fetcher
  14. Parse the content
  15. Update the Nutch databases
  16. Create linkdb
  17. Send the content from Nutch to Solr for indexing
  18. Add functionality to your application to periodically re-index the data to keep it current

LW GUI:
  1. Create a new data source specifying the URLs to crawl
  2. Schedule the new source to run periodically

LW ReST:
  1. Create the data source with POST request to the Data Sources API
  2. Update the indexing schedule with PUT request to the Data Source Schedules API
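The two ReST calls (a POST to create the data source, then a PUT to set its schedule) follow a pattern that can be sketched as below. Everything specific here is a placeholder: the base URL, the resource paths, and the JSON property names are hypothetical and are not taken from the LucidWorks documentation, so check the ReST API reference for the real ones. Only the overall shape (JSON over HTTP, POST to create, PUT to update) comes from this article.

```python
import json
import urllib.request

# Hypothetical base URL for the LucidWorks ReST API.
API_BASE = "http://localhost:8888/api"


def json_request(method, path, payload):
    """Build (but do not send) a JSON request against the API."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(f"{API_BASE}{path}", data=body, headers=headers, method=method)


def create_web_data_source(collection, start_url, depth):
    """POST a new web data source; path and property names are hypothetical."""
    payload = {"type": "web", "url": start_url, "crawl_depth": depth}
    return json_request("POST", f"/collections/{collection}/datasources", payload)


def schedule_data_source(collection, source_id, period_seconds):
    """PUT a recurring schedule; path and property names are hypothetical."""
    payload = {"period": period_seconds, "active": True}
    path = f"/collections/{collection}/datasources/{source_id}/schedule"
    return json_request("PUT", path, payload)


if __name__ == "__main__":
    create = create_web_data_source("collection1", "http://www.example.com/", 2)
    schedule = schedule_data_source("collection1", 7, 86400)
    for request in (create, schedule):
        print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

Once the real paths and properties are filled in, the same two functions would serve for the Solr XML, filesystem, and database data sources as well, since each of those is also created with a POST and scheduled with a PUT.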


Crawling the web and indexing content using the LucidWorks UI

 

Indexing database data

These days much of the data you want to index isn't in a file at all. Instead, it's in a database. The advantage here is that it's already identified; you don't need to parse it into fields. The disadvantage is that you have to go and get it.

Solr ships with the DataImportHandler, which enables you to connect to a database and index the content by query. You just need to configure it and create a request handler to call.

The LucidWorks Platform has the DataImportHandler preconfigured, so you just need to create a database data source and specify the content you want.

The LucidWorks ReST API lets you create a database data source and schedule it programmatically.

Solr:
  1. Create the data configuration XML file
  2. Validate the data configuration XML file
  3. Create a new request handler in the solrconfig.xml file
  4. Validate the solrconfig.xml file
  5. Restart Solr
  6. Run the new request handler
  7. Have your application monitor the output and parse status messages
  8. Add functionality to your application to periodically re-index the data to keep it current

LW GUI:
  1. Create a new data source specifying a query in the database
  2. Schedule the new source to get updated data

LW ReST:
  1. Create the data source with POST request to the Data Sources API
  2. Update the indexing schedule with PUT request to the Data Source Schedules API
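Steps 6 and 7 of the Solr path (run the handler, then watch its status) might be sketched like this. The sketch assumes the DataImportHandler was registered at /dataimport in solrconfig.xml and that its JSON status response carries a top-level "status" value of "idle" or "busy"; treat the sample responses as assumptions and compare them with what your own Solr version returns.

```python
import json
from urllib.parse import urlencode

# Assumes the request handler was registered under the name /dataimport.
DIH_URL = "http://localhost:8983/solr/dataimport"


def build_dih_url(command):
    """Build a DataImportHandler URL for a command such as full-import or status."""
    return f"{DIH_URL}?{urlencode({'command': command, 'wt': 'json'})}"


def import_finished(raw_json):
    """Return True when the handler reports that it is no longer importing."""
    return json.loads(raw_json).get("status") == "idle"


if __name__ == "__main__":
    print(build_dih_url("full-import"))  # start the import
    print(build_dih_url("status"))       # poll this URL until the import finishes
    # Hand-written sample responses, standing in for real handler output.
    print(import_finished('{"status": "busy"}'))
    print(import_finished('{"status": "idle"}'))
```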


Adding a database data source using the LucidWorks UI

 

Provide Spell-checking

One of the advantages of using Solr or an engine built on Solr is that you get the ability to access some of Solr's advanced features, such as spell-checking and MoreLikeThis.

When a Solr request references the SpellCheckComponent, Solr looks for words in the query that aren't in the index, and if it finds any, it looks at the index and provides likely substitutes. The results are returned as part of the query response so you can build them into your application.

The LucidWorks user interface allows you to simply turn on spell-checking and test it with the built-in search UI. Enabling fields to be used as the basis of the spell-check index can also be done by editing field properties.

The LucidWorks ReST API enables you to turn spell-checking on and off or modify field properties (if necessary) programmatically.

Solr:
  1. Create a request handler to reference the SpellCheckComponent
  2. Validate solrconfig.xml
  3. Restart Solr
  4. Issue a query to build the spell check index
  5. For each standard query:
    • Issue query
    • Determine whether any words were found to be wrong
    • Display a message offering suggestions
    • Build link to alternate query

LW GUI:
  1. Edit the appropriate field(s) and check "Index for Spell Checking"
  2. Reindex the collection (if content had been previously crawled)
  3. Turn on spell-checking in Querying -> Settings

LW ReST:
  1. Edit fields to be used for spell checking with a PUT request to the Fields API
  2. Turn on spell checking with a PUT request to the Settings API
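The per-query work in step 5 of the Solr path could be sketched as follows. The sketch assumes the spellcheck suggestions arrive as a flat list of alternating names and values, which is how the JSON writer rendered them in Solr releases of this article's era; newer releases structure the response differently, so inspect a real response before relying on this parser. The sample data is made up.

```python
def build_spellcheck_params(query):
    """Query parameters that ask Solr to return spelling suggestions."""
    return {"q": query, "spellcheck": "true", "spellcheck.collate": "true", "wt": "json"}


def parse_suggestions(flat):
    """Turn an alternating [word, details, word, details, ...] list into a dict."""
    suggestions = {}
    for name, value in zip(flat[0::2], flat[1::2]):
        # Entries such as "collation" carry a plain string, not a details dict.
        if isinstance(value, dict):
            suggestions[name] = value.get("suggestion", [])
    return suggestions


def corrected_query(query, suggestions):
    """Swap each misspelled word for its first suggestion to build the alternate query."""
    fixed = []
    for word in query.split():
        options = suggestions.get(word)
        fixed.append(options[0] if options else word)
    return " ".join(fixed)


if __name__ == "__main__":
    # Hand-written sample in the assumed flat format.
    flat = [
        "delll", {"numFound": 1, "startOffset": 0, "endOffset": 5, "suggestion": ["dell"]},
        "collation", "dell monitor",
    ]
    found = parse_suggestions(flat)
    if found:
        print("Did you mean:", corrected_query("delll monitor", found))
```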


Turning on spell check using the LucidWorks UI

 

MoreLikeThis

Another advantage of using Solr or the LucidWorks Platform is Solr's MoreLikeThis functionality, which lets you show users similar results that might not match the actual query they entered.

Enabling MoreLikeThis can be done on a query-by-query basis in Solr by adding the appropriate parameters to each query and parsing the results to extract the appropriate links and information for your users.

The default search interface in the LucidWorks Enterprise user interface has MoreLikeThis built-in; you just have to turn it on and make sure you have the appropriate fields marked "Show 'find similar' links", and it will add a "Find Similar" link below each response for which it sees similar results.

The LucidWorks ReST API enables you to turn "Find Similar" on and off programmatically for the default search interface.

Solr:
  1. Determine appropriate MoreLikeThis parameters
  2. Format request to include MoreLikeThis parameters
  3. For each result document:
    • Display result document
    • Correlate MoreLikeThis information to the document
    • Add MoreLikeThis information to the display

LW GUI:
  1. Edit the appropriate field(s) and check "Use in 'Find Similar'"
  2. Turn on "Show 'find similar' links" in Querying -> Settings

LW ReST:
  1. Edit fields to be used for "find similar" with a PUT request to the Fields API
  2. Turn on "show_similar" with a PUT request to the Settings API
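For the Solr path, the correlation step is the part that usually needs code. The sketch below assumes a JSON response in which the moreLikeThis section is an object keyed by each result document's unique id; the field names (id, title) and the sample response are illustrative assumptions.

```python
def build_mlt_params(query, fields, count=3):
    """Query parameters that switch on MoreLikeThis for a standard query."""
    return {
        "q": query,
        "mlt": "true",
        "mlt.fl": ",".join(fields),  # fields used to judge similarity
        "mlt.count": count,          # similar documents to return per result
        "wt": "json",
    }


def attach_similar(body, id_field="id"):
    """Pair each result document with its similar documents, matched by id."""
    similar = body.get("moreLikeThis", {})
    results = []
    for doc in body["response"]["docs"]:
        matches = similar.get(doc[id_field], {}).get("docs", [])
        results.append({"doc": doc, "similar": matches})
    return results


if __name__ == "__main__":
    # Hand-written sample response in the assumed shape.
    sample = {
        "response": {"numFound": 1, "start": 0, "docs": [{"id": "a1", "title": "Solr basics"}]},
        "moreLikeThis": {"a1": {"numFound": 1, "start": 0, "docs": [{"id": "b2", "title": "Solr intro"}]}},
    }
    for entry in attach_similar(sample):
        print(entry["doc"]["title"], "->", [d["title"] for d in entry["similar"]])
```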


Enabling a field to be used for MoreLikeThis using the LucidWorks UI

 

Autocomplete

Another much-appreciated feature in search engines is the "autocomplete" function, in which the user starts to type, and the application provides a drop-down list of potential terms in order to make their lives easier.

Solr includes the ability to build this kind of feature, in that you can request the likely terms using the TermsComponent, but there's no existing infrastructure to implement it in your application, so you'll need to wire together the JavaScript yourself.

The default search user interface in the LucidWorks Platform has autocomplete built-in. As long as it's turned on, users using the default search interface will be able to see likely choices based on what they've typed. If you're building your own application on LucidWorks data, the initial steps of configuring the fields and building the autocomplete index are handled by the user interface.

The LucidWorks ReST API enables you to turn "Autocomplete" on and off programmatically for the default search interface.

Solr:
  1. Create a servlet that takes the entered terms and makes a Solr request for TermsComponent data
  2. Have the servlet return the data in a form your JavaScript expects
  3. Create JavaScript that monitors your input box and sends an Ajax query to the servlet
  4. Have the JavaScript parse the returned data and display the suggestions
  5. Make the suggestions clickable so users can complete their query by clicking a suggestion

LW GUI:
  1. Edit the appropriate field(s) and check "Index for Autocomplete"
  2. Turn on Autocomplete in the Querying -> Settings screen
  3. Create the Autocomplete index

LW ReST:
  1. Edit fields to be used for autocomplete with a PUT request to the Fields API
  2. Turn on "auto_complete" with a PUT request to the Settings API
  3. Create the autocomplete index with a POST request to the Activities API
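The server-side half of the Solr path (steps 1 and 2) comes down to building a TermsComponent request and reshaping the answer for the browser. This Python sketch stands in for the servlet logic. It assumes a /terms request handler like the one in Solr's example configuration and a JSON response in which each field's terms arrive as a flat list of alternating terms and counts; the field name and the sample data are made up.

```python
from urllib.parse import urlencode

# Assumed handler path, taken from Solr's example configuration.
TERMS_URL = "http://localhost:8983/solr/terms"


def build_terms_url(field, prefix, limit=10):
    """Ask the TermsComponent for indexed terms that start with what the user typed."""
    params = {"terms.fl": field, "terms.prefix": prefix, "terms.limit": limit, "wt": "json"}
    return f"{TERMS_URL}?{urlencode(params)}"


def to_suggestions(body, field):
    """Reshape [term, count, term, count, ...] into a list that front-end code can loop over."""
    flat = body["terms"][field]
    return [{"term": term, "count": count} for term, count in zip(flat[0::2], flat[1::2])]


if __name__ == "__main__":
    print(build_terms_url("title", "lu"))
    # Hand-written sample response in the assumed flat format.
    sample = {"terms": {"title": ["lucene", 12, "lucid", 4]}}
    print(to_suggestions(sample, "title"))
```

The JavaScript side would then call this endpoint as the user types and render the returned list as clickable suggestions.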

 

Notify users of new content

Enterprise Alerts enable users to request notification by email when new data has been added to the system.

Solr doesn't support Enterprise Alerts.

The default LucidWorks Platform search interface includes Enterprise Alerts right on the query page; users can enter an email address and determine the frequency with which they want to be notified. You can also use this capability in your own applications built on LucidWorks data.

The LucidWorks ReST API is the primary means by which Enterprise Alerts can be configured or used from an external application.

Solr:
  • NOT AVAILABLE

LW GUI:
  • Configure a mail server to send email notifications
  • Update System Settings screen with mail server details
  • Users configure their own alerts

LW ReST:
  1. Configure a mail server to send email notifications
  2. Update System Settings screen with mail server details
  3. Create a new alert by sending a POST request to the Alerts API
  4. Get alert information by sending a GET request to the Alerts API


Creating a new alert using the LucidWorks UI

 

Improve relevance with Click Scoring

Much of the fuss behind search applications is the difficulty of determining just what document best matches a particular query. There are all kinds of heuristics we can use, but possibly the best method is to use the human brain. We can do that by watching what links are consistently clicked for a particular query. Once we have that information, we can use it to "boost" those documents for that query. In the LucidWorks Platform, this is known as Click Scoring.

Solr does not support click scoring.

Using Click Scoring from LucidWorks can be pretty straightforward; simply turn it on and schedule the Click Scoring task to run periodically so that the logged data is integrated into the document boost calculations. It's also possible to supply a manually created file of boost data if user click information is gathered in another way.

The LucidWorks ReST API provides the means to turn click scoring on and off, and to schedule analysis of the logs.

Solr:
  • NOT AVAILABLE

LW GUI:
  1. Turn on Click Scoring
  2. Schedule the Click Scoring task ("Process click logs") to run periodically

LW ReST:
  1. Turn on click scoring with a PUT request to the Settings API
  2. Schedule analysis with a PUT request to the Activities API


Turning on Click Scoring using the LucidWorks UI


Running the Click Scoring logs using the LucidWorks UI

 

Manage Users

If you've just got a simple web application with data that anyone can access, you probably don't worry too much about managing users. If, on the other hand, you need to control who has access to what data, or to what functionality, then you really do need to think about it.

Solr is focused on the content, and doesn't look at user information at all.

The LucidWorks Platform provides two ways to handle user information. You can either integrate your LucidWorks installation with an LDAP database, or you can manually add users via the user interface. In either case, you also have the ability to control what data each group of users sees by creating search filters.

The LucidWorks ReST API can provide all of your user management needs, including creating users, assigning them to groups, and validating their password on login. You can also use it to create search filters, so all of those tasks can be handled programmatically.

Solr:
  • NOT AVAILABLE

LW GUI:
  1. Manually managed users:
    1. Create users and groups manually using the user interface
    2. Create Search Filters
  2. LDAP Integration:
    1. Edit the ldap.yml file with the LDAP server information
    2. Configure the LDAP section of the System Settings screen
    3. Enable LDAP via the master.conf configuration file
    4. Create Search Filters

LW ReST:
  1. Create users and groups with a POST request to the Users API
  2. Add Search Filters with a POST request to the Roles API


Creating filters using the LucidWorks UI

 

Monitor

Once your system is up and running, you'll want to keep an eye on it. How's performance? What kind of queries are users submitting? Are there particular things your users are looking for that they're not finding?

Solr does not include any monitoring tools but can easily be integrated with Java Management Extensions (JMX), which can be used in a variety of clients to get necessary reporting.

The LucidWorks Platform user interface includes several sources of information about the system, including the Dashboard for an overall look at indexes and queries. LucidWorks also indexes all of its own log files, which can then be queried either manually or with regular alerts.

The MBeans provided with Solr have been expanded to cover LucidWorks Platform activities, and these can be integrated with a variety of clients such as JConsole or JMXTerm. A Zabbix configuration file is also provided for local customization for integration with Zabbix.

The LucidWorks ReST API can provide you access to the raw information through the use of several APIs, including Data Source Status, Data Source History and Activity History.

Solr:
  1. Integrate MBeans with a JMX-compliant client

LW GUI:
  1. Monitor the system using the Dashboard or the Queries Summary page
  2. Integrate MBeans with a JMX-compliant client
  3. Integrate with Zabbix or Nagios

LW ReST:
  1. Monitor activity through GET requests to various APIs


The LucidWorks Platform collection dashboard

 

Managing multiple collections

In an ideal world, all of your data would be in one place, and in one format, with a single set of user requirements. Unfortunately, search applications rarely happen in an ideal world. In many cases, you will see diverse sets of information, each with its own idiosyncrasies and needs.

Solr uses cores to manage content that requires different schemas, fields, and other configuration options. Core creation is done manually by defining the location of essential files in a configuration file.

The LucidWorks Platform user interface provides the ability to create new collections and the configuration of the Solr core is done automatically. You can then assign different fields, data sources, filters, and so on to different collections, and manage and search them individually through the user interface. Customizations of one collection can be saved as a template for use in creating new collections if multiple collections need to be created with similar settings.

By default, LucidWorks is installed with a default collection, collection1, and a Logs collection where system logs are indexed.

The LucidWorks ReST API is based on the concept of collections, with most calls including the name of the collection so the request can be routed and handled properly.

Solr:
  • Modify the solr.xml file to define the location of core schema and configuration files

LW GUI:
  1. Create a new collection
  2. Start administration tasks from the Collections tab

LW ReST:
  1. Create a new collection with a POST request to the Collections API
  2. Include proper collection name with other API requests


Create or manage a collection using the LucidWorks UI

 

Summary

All right, so what have we learned here?

We've learned that for small installations with simple requirements, straight Solr might be your best bet, particularly if you have the development resources to get through the additional configuration and programming requirements.

When, on the other hand, you have more complex requirements, such as complicated indexing, collection, or user management issues, you will want to use the LucidWorks Platform to take advantage of those features.

When you have limited programming resources, the LucidWorks user interface is likely to satisfy all, or at least most, of your needs. The default LucidWorks search interface also includes several search-related bells and whistles, such as autocomplete and Enterprise Alerts, which can make the search experience more pleasant for your users.

When you are building an application that has complex requirements, but you're not going to use the default search interface, you'll want to combine the LucidWorks user interface for management with straight web programming and the ReST API to build your application.

Finally, when you need to script many of the configurational aspects of your search application -- for example, to dynamically alter field definitions or create new collections from your own application -- you will want to focus on the LucidWorks ReST API.

In other words, all three platforms provide the ability to add search to your application. Which one is best for you at any given time will depend on the requirements of your current project, the resources you have at your disposal, and where their time can be best spent.


Search syntax comparisons

Submitted by bob.boeri on Wed, 2011-12-28 18:34.

This is a very persuasive case to use LucidWorks.  However, I have one question: Is the search query syntax you describe for LW the same as Solr? Thanks!

 

bboeri@guident.com

 


Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2012 Lucid Imagination. All rights reserved.