Solr Searching with RDBMS
In many shops some of the most common queries used in large scale RDBMS systems such as Oracle are for pattern searches within ranges of criteria, typically targeted searches for data by users to answer and meet certain business needs. Writing standardized reports or simple relational queries can answer the questions, but such mechanisms can be inflexible and costly to maintain. One more efficient way to address these challenges is through the power of Solr.
By Altan Khendup
Introduction
In many shops some of the most common queries used in large scale RDBMS systems such as Oracle are for pattern searches within ranges of criteria. These are not the extensive searches or analytics typically used in Hadoop environments, nor are they the very constrained searches found in data warehouses. Rather, they are more targeted searches for data by users to answer and meet certain business needs. Often standardized reports or simple relational queries written to meet these needs can answer the questions, but can be inflexible and require a significant amount of effort to maintain and modify over time. One more efficient way to address these challenges is through the power of Solr.
Background
One of my most recent projects happened to be at a large North American telco; it involved providing a way to allow users of an existing legacy CRM and ordering application to find information quickly, efficiently and without adding any undue load to the existing infrastructure. While we had access to a variety of existing facilities such as data warehouses and reporting tools, every solution required a lot of extra resources for hardware, software licenses, and operational overhead. What's more, requests would take days to process, and that did not meet business requirements for information turnaround.
The existing CRM application was fairly large, with over 6TB worth of data stored in an Oracle RDBMS instance that was clustered. It contained all the operational and historical data that was needed by the business units. Typically, a business analyst would be looking for specific metrics to complete a report for her business unit, or to run a report of how many service requests were fulfilled in specific regions during the preceding month, quarter or past year. However, because the historical data was incremented relatively infrequently, using a transactional approach to keeping all relevant data synchronized turned out to be overkill. From an implementation perspective with the DBMS, it would involve creating an external warehouse, creating trigger-based update mechanisms to achieve low-level data changes, and synchronization techniques to insure data changes were properly captured.
With Solr and specifically the use of the DataImportHandler, however, it was possible to extract the static dump of the necessary data from the Oracle database, index it, and query the index. Initially the Solr implementatoin was designed to replace certain large query requests, but that would take a significant toll on the existing production. However, as time went by, capabilities in Solr such as the Schema Browser and faceted search quickly replaced many of the SQL queries -- due to the speed, ease of use, and ability to integrate the search functions and results into existing interfaces.
For example faceted search was tremendously useful to find, say, the regions with the most activity for certain orders. Or for a specific product, it could quickly narrow which geos and which promo codes were the most frequently used. This helped users get the top terms and occurrences of terms quickly, rather than incurring the overhead of writing SELECT statements and waiting for long periods of time for queries to be scheduled and/or run.
Getting the Data
After installing Solr onto a standalone server outside of the production complex, the next step involved actually configuring Solr so that we could get the data we needed. A few decisions were made at this point. The first decision happened to be about the data itself. I decided to target many of the existing information structures within the application which had been simplified to meet other business reporting needs. Additionally by using these structures it would make configuration easier later on. The second decision involved whether to store the data values in the index itself. While ideally the data would have been accessed from the production database instance, I decided instead to store the data within the index for easier retrieval and to reduce the queries against the production database instance itself. The final decision involved how much of the data could be safely retrieved via the DataImportHandler and stored within Solr. This actually turned out to be pretty simple. The Oracle constructs only held a week work's of data, per an agreement with the business users. I would start with that amount and from there determine how much further could be held within the Solr instance.
Configuring Solr
This part was actually pretty straightforward given the good instructions at the Solr DataImportHandler wiki page. Once the basic configurations were made, I made changes to the data-config.xml for my target Oracle data that resembled something found below:
I also insured that the schema.xml was also appropriately changed to reflect the fields that would be captured in the DataImportHandler:
Once all the configurations were in place I ran the dataimport process against the data. Overall the process was fairly painless other than the waiting period which had taken about 45 minutes. I had attempted several different sizes of data with the help of the Oracle DBAs and found that the large the dataset the longer the wait. For example while processing a weeks worth of data was manageable, 3 month and 6 month snapshots finished in hours as opposed to under the one hour timeframe. However this changed once the Solr instances were placed in the same data center complexes as the production instance. Due to their close proximity and larger hardware capabilities, it was possible to import the same amount of data within approximately 1 hour.
Searching with Solr
The data once imported was not very large, only 50GB worth of data overall. This again could be managed by adjusting the field types, whether data had to be stored or not, and the amount of historical information to be imported. Now that the data was available, searches could be executed on the data.
I also found the packaged Schema Browser was very handy. Admittedly, the Schema Browser takes a while to process all the fields in the index so if you have a lot of data this can take a while. However the benefit is that it can provide answers to some of the more common questions that could be asked such as: the number of documents per value which can help for groups of items such as types of orders; how many documents actually have parent accounts; how orders are provided by various sending systems;how many orders are for a given state or postal code; etc. The data can also yield additional insights from more advanced searches such as faceted searches, such as what postal codes are responding to which advertising or product promotions; which areas have the most activity for certain types of orders; or, how many domains are covered per type of account. And the list goes on.
Operationally speaking, the Solr instances were managed in one of two ways: periodic updates from the main production instances or continual updates with application code not only adding data to the Oracle database but inserting them into the Solr index as well. Hence the operations against the existing production instances could be managed to minimize impacts and eliminate any unnecessary processing.
Conclusion
With these new capabilities, answers to key questions can be found in seconds. Data can be mined quickly, efficiently and flexibly without a lot of specialized training for business users. Additionally, the indexes could be managed in such a way such that additional data could be added for to increase the scope of analysis, or subsets of data could be indexed and searched for specific business reasons such as service outages or legal reasons.
In the end the users were quite happy with the new capabilities provided by Solr which allowed them to address business needs much more quickly and explore new patterns that had eluded them before, and operations was happy since the new capabilities came with little additional hardware or operational costs.
Altan Khendup has worked professionally in the technology field for more than 20 years with companies including Toyota, GE Capital, KPMG, PeopleSoft, Sun Microsystems, AT&T, and several startups. His experience spans hands-on development and coding, architecture, scalability and performance, project management ranging from small projects global initiatives.
