Solr and LucidWorks: How to Choose the Right Platform
If the LucidWorks Platform is built on Solr, how do you know which one to use in your own circumstances? This article describes the difference between using straight Solr, using the LucidWorks user interface, and using the LucidWorks ReST API to accomplish various common tasks, so you can see which fits your situation at a given moment.
In today's world, building the perfect product is a lot like trying to repair a set of train tracks while the train is barreling down on you. The world just keeps moving, with great ideas and new possibilities tempting you every day. And to make things worse, innovation doesn't just show its face for you; it regularly visits your competitors as well.
That's why you use open source software in the first place. You have smart people; does it make sense to have them building search functionality when Apache Solr already provides it? Of course not. You'd rather rely on the solid functionality that's already been built by the community of Solr developers, and let your people spend their time building innovation into your own products. It's simply a more efficient use of resources.
But what if you need search-related functionality that's not available in straight Solr? In some cases, you may be able to fill those holes and lighten your load with the LucidWorks Platform. Built on Solr, the LucidWorks Platform starts by simplifying the day-to-day tasks involved in using Solr, and then adds features that can help free up your development team for work on your own applications. But how do you know which path would be right for you?
The answers depend on a number of different factors, including the size and scope of the project, your available resources, and even the types of data involved, so let's look at some of the more common tasks involved in running a search-related application. This way, you can make up your own mind as to what's best for you. Let's start by getting the lay of the land.
A search application is basically the same no matter what product you're using, involving the following steps:
- Define your indexed content
- Define your data
- Define user/data domains
- Index your data
- Build your application (including facets, autocomplete, etc.) or issue queries through the built-in query interface
- Keep your indexes updated
- Control access
- Monitor the system
Solr-based applications (including those built as an extension to LucidWorks, rather than using the built-in query page) get their data by issuing an HTTP call and parsing the results. Because that works in virtually any programming environment and with virtually any programming language, your choice of language won't be a factor in your decision. What will matter is scale: a single programmer looking to index a single source of documents will likely have different needs than a large corporation with heterogeneous data and multiple stakeholders with different data requirements.
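To make the "HTTP call and parse" pattern concrete, here is a minimal Python sketch. It assumes a Solr instance at `http://localhost:8983/solr` with the standard `/select` handler and a JSON response; the base URL, the query, and the sample response are illustrative assumptions, not part of any particular installation.

```python
import json
from urllib.parse import urlencode

# Assumed base URL of a local Solr instance; adjust for your installation.
SOLR_URL = "http://localhost:8983/solr"


def build_select_url(query, rows=10, base_url=SOLR_URL):
    """Return the URL for a basic search against Solr's /select handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


def parse_results(body):
    """Extract the hit count and the matching documents from a JSON response."""
    response = json.loads(body)["response"]
    return response["numFound"], response["docs"]


# A hand-written sample response, standing in for what the server would return.
sample = '{"response": {"numFound": 1, "start": 0, "docs": [{"id": "doc1"}]}}'
url = build_select_url("title:solr")
found, docs = parse_results(sample)
print(url)
print(found, docs)
```

In a real application you would fetch `url` with `urllib.request.urlopen` (or any HTTP client) and pass the response body to `parse_results`. The same two steps apply in any other language.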
At the most basic level, we have three different options to look at:
- Straight Solr: In this case, you've got the engine and all of the flexibility that provides, but there's a lot you will need to do yourself. You may get more fine-grained control, but it comes at a price for ongoing maintenance, and might require higher-skilled resources to invest their attention in more mundane work.
- LucidWorks web-based user interface: The LucidWorks UI provides easy ways to accomplish common tasks, but there are times you might need to go beyond that to the ReST API. A UI is not scriptable, but it's easy to point and click, particularly if some of your administrators are not comfortable with scripting.
- LucidWorks ReST API: All of the capabilities the LucidWorks UI offers are also available via ReST API. Simply put, the ReST API enables you to accomplish tasks programmatically using HTTP calls, with some scripting; the scripting, in turn, enables creation of repeatable processes and automation.
So let's look at some of the more common tasks you'll need to accomplish when building search into your applications, and what they mean in each of these three environments.
Installation
Before you do anything else, of course you're going to have to install the software.
Apache Solr is largely a roll-your-own system, a web application that you need to configure yourself. The actual installation is pretty simple -- just unzip the software. But while the standard distribution includes several example environments, they're not meant for production use, so you'll need to decide how you want your particular instance configured, and run it in a proper web application server.
LucidWorks Platform comes with both a GUI installer and a command-line installer, both of which can be run with the default inputs to create a production system.
Defining fields
The engine behind both Solr and LucidWorks is built on Apache Lucene and lets you create various document "fields", each of which can have its own requirements and can be treated differently when it is indexed or searched.
Adding new fields and field types to Solr involves editing the installation's schema.xml file. In this file, you can add new field types, and create fields that use those types.
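As a rough illustration of what a field definition in schema.xml looks like, the following Python sketch generates one `<field>` element. The field name `summary` is hypothetical, and `text` is assumed to be a field type your schema already defines; you would paste the resulting line into the schema by hand.

```python
import xml.etree.ElementTree as ET


def field_element(name, field_type, indexed=True, stored=True):
    """Build a <field> element of the kind schema.xml holds for each field."""
    return ET.Element("field", {
        "name": name,
        "type": field_type,
        "indexed": str(indexed).lower(),  # schema.xml expects "true"/"false"
        "stored": str(stored).lower(),
    })


# "summary" is a hypothetical field; "text" is an assumed, already defined type.
snippet = ET.tostring(field_element("summary", "text"), encoding="unicode")
print(snippet)
```

Solr reads schema.xml at startup, so after editing the file you typically need to reload or restart Solr and re-index any documents that should use the new field.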
The LucidWorks Platform includes a web-based user interface that enables you to add field types and fields and to specify properties such as whether they should be used for spell-checking.
The LucidWorks Platform ReST API includes the ability to add, edit, and delete field types and fields programmatically, so if your requirements change, you can dynamically alter field definitions, for example to add a new field needed for additional data.
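The general shape of such a programmatic call might look like the Python sketch below. Treat every specific in it as an assumption made for illustration: the host and port, the URL path, and the JSON keys are hypothetical and should be checked against the LucidWorks ReST API documentation for your version. The sketch only prepares the request; it does not send it.

```python
import json
from urllib.request import Request

# ASSUMPTION: host, port, path, and JSON keys are illustrative placeholders,
# not confirmed details of the LucidWorks ReST API.
API_BASE = "http://localhost:8888/api"


def new_field_request(collection, name, field_type):
    """Prepare (but do not send) a POST asking the API to create a field."""
    payload = {"name": name, "field_type": field_type}
    return Request(
        f"{API_BASE}/collections/{collection}/fields",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "summary" and "text_en" are hypothetical field and field type names.
req = new_field_request("collection1", "summary", "text_en")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. Because the call is just an HTTP request, it can live in a deployment script and be repeated for every environment you manage.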

Adding a new field using the LucidWorks UI

Editing a field type using the LucidWorks UI
Indexing Solr XML data
Originally, the only way to add data for indexing to Solr was to convert it into a specific XML format, called Solr XML, in which each doc element has a series of field elements with name attributes that identify the field.
Adding Solr XML documents to Solr is pretty straightforward, involving sending documents to the server using an HTTP POST request. The standard download includes post.jar, which you can run to post individual files (or multiple files, using wild cards) to Solr. (You can also add comma-delimited data this way.)
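For readers who would rather script this than run post.jar, here is a minimal Python sketch that builds a Solr XML document and prepares the HTTP POST. The update URL assumes a local single-core Solr instance, and the field names `id` and `title` are examples that your schema would need to define.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # assumed local instance


def to_solr_xml(docs):
    """Serialize dicts as Solr XML: one <doc> per dict, one <field> per key."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            ET.SubElement(doc_el, "field", {"name": name}).text = str(value)
    return ET.tostring(add, encoding="unicode")


def update_request(xml_body):
    """Prepare (but do not send) the HTTP POST that submits the documents."""
    return Request(SOLR_UPDATE_URL, data=xml_body.encode("utf-8"),
                   headers={"Content-Type": "text/xml"}, method="POST")


xml_body = to_solr_xml([{"id": "doc1", "title": "Hello Solr"}])
req = update_request(xml_body)
print(xml_body)
```

Sending `req` with `urllib.request.urlopen` adds the document to the index; it becomes searchable once a commit has been issued, for example by posting `<commit/>` to the same URL.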
The LucidWorks Platform includes a user interface for adding Solr XML data by the file or by the directory by creating a "data source". It also includes a scheduler that enables you to set the system to re-index the data source on a regular basis to pick up changes or additions.
The LucidWorks ReST API lets you create a Solr XML data source and schedule it programmatically.

Adding a Solr data source using the LucidWorks UI
Indexing non-XML data
A large percentage of data isn't XML, and it's not easily converted to XML (such as that large collection of Word documents and PDF files you've got to search). Solr includes Apache Tika, which enables you to index the content of Microsoft Office files, PDFs, text formats, HTML, and many more. This includes images, audio, and video, though in that case it's mostly the metadata that's indexed.
Indexing non-Solr XML content with Solr involves using cURL to make HTTP POST requests to the Solr engine for each document you want to index.
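The same per-document request can be prepared from a script instead of cURL. This Python sketch assumes the extracting request handler is registered at `/update/extract`, as in the example configuration, and that `id` is your unique key field; the document ID and the file bytes are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

EXTRACT_URL = "http://localhost:8983/solr/update/extract"  # assumed handler path


def extract_request(file_bytes, doc_id, content_type="application/octet-stream"):
    """Prepare (but do not send) a POST handing a binary file to Solr for extraction."""
    # literal.id supplies the unique key; commit=true commits after this request.
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    return Request(f"{EXTRACT_URL}?{params}", data=file_bytes,
                   headers={"Content-Type": content_type}, method="POST")


# Placeholder bytes stand in for the contents of a real PDF file.
req = extract_request(b"placeholder PDF bytes", "report-2012", "application/pdf")
print(req.full_url)
```

To index a directory you would loop over its files, read each one in binary mode, and send one request per file, which is exactly the bookkeeping the LucidWorks connectors take off your hands.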
The LucidWorks Platform includes a number of pre-configured connectors to give you the ability to create a data source that includes entire directories and subdirectories, and the ability to schedule them to be updated periodically to pick up changes and additions.
The LucidWorks ReST API lets you create a filesystem data source and schedule it programmatically.

Adding local files of various formats using the LucidWorks UI
Crawling web data
The actual indexing of a web page is easy; all three environments make it possible to easily index an HTML file. It's the retrieval of that web page, and all of the web pages linked from it (and linked from them, and so on) that's difficult.
Solr does not actually come with the ability to crawl the web. Instead, you will need to integrate it with another tool, such as Apache Nutch, which will crawl the web for you and retrieve the files, which you can then pass to Solr for indexing.
The LucidWorks Platform has web-crawling already built in, so all you will need to do is create a web data source and specify where you want it to start and how deep you want it to go.
The LucidWorks ReST API lets you create a web data source and schedule it programmatically.

Crawling the web and indexing content using the LucidWorks UI
Indexing database data
These days much of the data you want to index isn't in a file at all. Instead, it's in a database. The advantage here is that it's already identified; you don't need to parse it into fields. The disadvantage is that you have to go and get it.
Solr ships with the DataImportHandler, which enables you to connect to a database and index the content by query. You just need to configure it and register a request handler to call.
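Once the handler is registered, an import is started by calling it over HTTP. The Python sketch below builds those URLs; it assumes the handler was registered at `/dataimport`, which is a common choice but depends entirely on your own solrconfig.xml.

```python
from urllib.parse import urlencode

DIH_URL = "http://localhost:8983/solr/dataimport"  # assumed handler path


def dataimport_url(command, **options):
    """Return the URL that sends a command to the DataImportHandler."""
    params = {"command": command, **options}
    return f"{DIH_URL}?{urlencode(params)}"


# Rebuild the index from the database, clearing old documents first.
full_import = dataimport_url("full-import", clean="true", commit="true")
# Ask how the current or most recent import is going.
status = dataimport_url("status")
print(full_import)
print(status)
```

Fetching `full_import` starts the import and returns right away, so a script would normally poll `status` afterwards to find out when it has finished.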
The LucidWorks Platform has the DataImportHandler preconfigured, so you just need to create a database data source and specify the content you want.
The LucidWorks ReST API lets you create a database data source and schedule it programmatically.

Adding a database data source using the LucidWorks UI
Provide Spell-checking
One of the advantages of using Solr or an engine built on Solr is that you get the ability to access some of Solr's advanced features, such as spell-checking and MoreLikeThis.
When a Solr request references the SpellCheckComponent, Solr looks for words in the query that aren't in the index, and if it finds any, it looks at the index and provides likely substitutes. The results are returned as part of the query response so you can build them into your application.
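Here is a rough Python sketch of both halves: asking for suggestions and pulling them out of the response. It assumes the request goes to a handler that has the spell-check component configured, and that the `suggestions` section uses the flat `[word, details, word, details]` layout. That layout varies with the Solr version and the `json.nl` setting, so treat the parser as an assumption to verify against your own responses.

```python
from urllib.parse import urlencode


def spellcheck_params(query, count=5):
    """Return query-string parameters that request spelling suggestions."""
    return urlencode({"q": query, "spellcheck": "true",
                      "spellcheck.count": count, "wt": "json"})


def suggestions(spellcheck_section):
    """Map each misspelled word to its suggested replacements.

    Assumes a flat list that alternates between a word and a dict of details.
    """
    flat = spellcheck_section.get("suggestions", [])
    result = {}
    for word, info in zip(flat[0::2], flat[1::2]):
        if isinstance(info, dict):  # skip entries such as collations
            result[word] = info.get("suggestion", [])
    return result


# Hand-written sample of the "spellcheck" part of a response.
sample = {"suggestions": ["serch", {"numFound": 1, "startOffset": 0,
                                    "endOffset": 5, "suggestion": ["search"]},
                          "collation", "search"]}
params = spellcheck_params("serch")
fixes = suggestions(sample)
print(params)
print(fixes)
```

An application would typically use `fixes` to show a "Did you mean" link above the results.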
The LucidWorks user interface allows you to simply turn on spell-checking and test it with the built-in search UI. Enabling fields to be used as the basis of the spell-check index can also be done by editing field properties.
The LucidWorks ReST API enables you to turn spell-checking on and off or modify field properties (if necessary) programmatically.

Turning on spell check using the LucidWorks UI
MoreLikeThis
Another advantage of using Solr or the LucidWorks Platform is Solr's MoreLikeThis functionality, which lets you offer users similar results that might not match the actual query they entered.
Enabling MoreLikeThis can be done on a query-by-query basis in Solr by adding the appropriate parameters to each query and parsing the results to extract the appropriate links and information for your users.
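As a sketch of what "adding the appropriate parameters" means in practice, this Python function builds the query string for a MoreLikeThis request. The field names `title` and `body` are assumptions; use whichever fields in your schema should drive the similarity comparison.

```python
from urllib.parse import urlencode


def mlt_params(query, fields, count=3):
    """Return query-string parameters that ask Solr for similar documents."""
    return urlencode({
        "q": query,
        "mlt": "true",               # switch MoreLikeThis on for this query
        "mlt.fl": ",".join(fields),  # fields used to judge similarity
        "mlt.count": count,          # similar documents per result
        "wt": "json",
    })


# "title" and "body" are hypothetical field names.
params = mlt_params("id:doc1", ["title", "body"])
print(params)
```

The response then carries an extra section of similar documents for each hit, which your application has to read and render itself.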
The default search interface in the LucidWorks Enterprise user interface has MoreLikeThis built-in; you just have to turn it on and make sure you have the appropriate fields marked "Show 'find similar' links", and it will add a "Find Similar" link below each response for which it sees similar results.
The LucidWorks ReST API enables you to turn "Find Similar" on and off programmatically for the default search interface.

Enabling a field to be used for MoreLikeThis using the LucidWorks UI
Autocomplete
Another much-appreciated feature in search engines is the "autocomplete" function, in which the user starts to type, and the application provides a drop-down list of potential terms in order to make their lives easier.
Solr includes the ability to build this kind of feature, in that you can request the likely terms using the TermsComponent, but there's no existing infrastructure to implement it in your application, so you'll need to wire together the JavaScript yourself.
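The server side of that wiring can be sketched in a few lines of Python. The sketch assumes a `/terms` handler as found in the example configuration, a hypothetical field called `name`, and a flat `[term, count, term, count]` layout in the response; the layout depends on the Solr version and response settings, so check it against your own output.

```python
from urllib.parse import urlencode

TERMS_URL = "http://localhost:8983/solr/terms"  # assumed handler path


def terms_url(field, prefix, limit=10):
    """Return the URL that asks for indexed terms starting with a prefix."""
    params = {"terms": "true", "terms.fl": field,
              "terms.prefix": prefix, "terms.limit": limit, "wt": "json"}
    return f"{TERMS_URL}?{urlencode(params)}"


def completions(terms_section, field):
    """Return candidate terms, assuming a flat [term, count, ...] list."""
    flat = terms_section.get(field, [])
    return flat[0::2]


# Hand-written sample of the "terms" part of a response.
sample = {"name": ["solar", 4, "solr", 9]}
url = terms_url("name", "sol")
choices = completions(sample, "name")
print(url)
print(choices)
```

Your page's script would call something like this on each keystroke and fill the drop-down list with `choices`.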
The default search user interface in the LucidWorks Platform has autocomplete built-in. As long as it's turned on, users using the default search interface will be able to see likely choices based on what they've typed. If you're building your own application on LucidWorks data, the initial steps of configuring the fields and building the autocomplete index are handled by the user interface.
The LucidWorks ReST API enables you to turn "Autocomplete" on and off programmatically for the default search interface.
Notify users of new content
Enterprise Alerts enable users to request notification by email when new data has been added to the system.
Solr doesn't support Enterprise Alerts.
The default LucidWorks Platform search interface includes Enterprise Alerts right on the query page; users can enter an email address and determine the frequency with which they want to be notified. You can also use this capability in your own applications built on LucidWorks data.
The LucidWorks ReST API is the primary means by which Enterprise Alerts can be configured or used from an external application.

Creating a new alert using the LucidWorks UI
Improve relevance with Click Scoring
Much of the fuss behind search applications is the difficulty of determining just what document best matches a particular query. There are all kinds of heuristics we can use, but possibly the best method is to use the human brain. We can do that by watching what links are consistently clicked for a particular query. Once we have that information, we can use it to "boost" those documents for that query. In the LucidWorks Platform, this is known as Click Scoring.
Solr does not support click scoring.
Using Click Scoring from LucidWorks can be pretty straightforward; simply turn it on and schedule the Click Scoring task to run periodically so that the logged data is integrated into the document boost calculations. It's also possible to supply a manually created file of boost data if user click information is gathered in another way.
The LucidWorks ReST API provides the means to turn click scoring on and off, and to schedule analysis of the logs.
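To give a feel for the idea behind click-based boosting, here is a toy Python sketch that turns a click log into per-query boost factors. It is an illustration only, not the calculation LucidWorks performs: the log format and the formula (one plus the document's share of clicks for that query) are assumptions chosen for simplicity.

```python
from collections import Counter, defaultdict


def click_boosts(click_log):
    """Turn (query, doc_id) click events into per-query boost factors."""
    clicks = defaultdict(Counter)
    for query, doc_id in click_log:
        clicks[query.lower()][doc_id] += 1  # treat "Solr" and "solr" alike
    boosts = {}
    for query, counts in clicks.items():
        total = sum(counts.values())
        # Boost = 1 + the share of this query's clicks the document received.
        boosts[query] = {doc: 1 + n / total for doc, n in counts.items()}
    return boosts


# A made-up click log: each entry is (query text, ID of the clicked document).
log = [("solr", "doc1"), ("solr", "doc1"), ("solr", "doc2"), ("Solr", "doc1")]
boosts = click_boosts(log)
print(boosts)
```

With straight Solr you would have to collect such a log yourself and apply the boosts at index or query time, which is the work that Click Scoring automates.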

Turning on Click Scoring using the LucidWorks UI

Running the Click Scoring logs using the LucidWorks UI
Manage Users
If you've just got a simple web application with data that anyone can access, you probably don't worry too much about managing users. If, on the other hand, you need to control who has access to what data, or to what functionality, then you really do need to think about it.
Solr is focused on the content, and doesn't look at user information at all.
The LucidWorks Platform provides two ways to handle user information. You can either integrate your LucidWorks installation with an LDAP database, or you can manually add users via the user interface. In either case, you also have the ability to control what data each group of users sees by creating search filters.
The LucidWorks ReST API can provide all of your user management needs, including creating users, assigning them to groups, and validating their password on login. You can also use it to create search filters, so all of those tasks can be handled programmatically.

Creating filters using the LucidWorks UI
Monitor
Once your system is up and running, you'll want to keep an eye on it. How's performance? What kind of queries are users submitting? Are there particular things your users are looking for that they're not finding?
Solr does not include any monitoring tools but can easily be integrated with Java Management Extensions (JMX), which can be used in a variety of clients to get necessary reporting.
The LucidWorks Platform user interface includes several sources of information about the system, including the Dashboard for an overall look at indexes and queries. LucidWorks also indexes all of its own log files, which can then be queried either manually or with regular alerts.
The MBeans provided with Solr have been expanded to cover LucidWorks Platform activities, and these can be integrated with a variety of clients such as JConsole or JMXTerm. A Zabbix configuration file is also provided for local customization for integration with Zabbix.
The LucidWorks ReST API can provide you access to the raw information through the use of several APIs, including Data Source Status, Data Source History and Activity History.

The LucidWorks Platform collection dashboard
Managing multiple collections
In an ideal world, all of your data would be in one place, and in one format, with a single set of user requirements. Unfortunately, search applications rarely happen in an ideal world. In many cases, you will see diverse sets of information, each with its own idiosyncrasies and needs.
Solr uses cores to manage content that requires different schemas, fields, and other configuration options. Core creation is done manually by defining the location of essential files in a configuration file.
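As an illustration of that configuration file, the Python sketch below generates a multi-core definition in the legacy solr.xml layout, with one `<core>` element per core. The core names are hypothetical, each core's directory is assumed to share the core's name, and newer Solr releases may discover cores differently, so treat this as a sketch of the idea rather than a template.

```python
import xml.etree.ElementTree as ET


def solr_xml(core_names):
    """Build a legacy-style solr.xml that lists one <core> per name."""
    solr = ET.Element("solr", {"persistent": "true"})
    cores = ET.SubElement(solr, "cores", {"adminPath": "/admin/cores"})
    for name in core_names:
        # Assumes each core's files live in a directory named after the core.
        ET.SubElement(cores, "core", {"name": name, "instanceDir": name})
    return ET.tostring(solr, encoding="unicode")


# "products" and "support" are hypothetical core names.
config = solr_xml(["products", "support"])
print(config)
```

Each instance directory still needs its own conf folder with a schema.xml and solrconfig.xml before Solr can load the core.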
The LucidWorks Platform user interface provides the ability to create new collections and the configuration of the Solr core is done automatically. You can then assign different fields, data sources, filters, and so on to different collections, and manage and search them individually through the user interface. Customizations of one collection can be saved as a template for use in creating new collections if multiple collections need to be created with similar settings.
By default, LucidWorks is installed with a default collection, collection1, and a Logs collection where system logs are indexed.
The LucidWorks ReST API is based on the concept of collections, with most calls including the name of the collection so the request can be routed and handled properly.

Create or manage a collection using the LucidWorks UI
Summary
All right, so what have we learned here?
We've learned that for small installations with simple requirements, straight Solr might be your best bet, particularly if you have the development resources to get through the additional configuration and programming requirements.
When, on the other hand, you have more complex requirements, such as complicated indexing, collection, or user management issues, you will want to use the LucidWorks Platform to take advantage of those features.
When you have limited programming resources, the LucidWorks user interface is likely to satisfy all, or at least most, of your needs. The default LucidWorks search interface also includes several search-related bells and whistles, such as autocomplete and Enterprise Alerts, which can make the search experience more pleasant for your users.
When you are building an application that has complex requirements, but you're not going to use the default search interface, you'll want to use a combination of the LucidWorks user interface for management and straight web programming and the ReST API to build your application.
Finally, when you need to script many of the configuration aspects of your search application -- for example, to dynamically alter field definitions or create new collections from your own application -- you will want to focus on the LucidWorks ReST API.
In other words, all three platforms provide the ability to add search to your application. Which one is best for you at any given time will depend on the requirements of your current project, the resources you have at your disposal, and where their time can be best spent.
