While I was working at Lucid Imagination, a customer once asked me how they could modify the score of documents in Solr so that the most relevant results appear higher in the results list. While trying to answer the question, I realized that there are many different options, and that not all of them are easy to understand, so I decided to write some notes summarizing the most commonly used ways to …
Read more
By yonik, September 15, 2011
Background
I needed a really good hash function for the distributed indexing we’re implementing for Solr. Since it will be used for partitioning documents, it needed to be really high quality (well distributed), because we don’t want uneven shards. It also needed to be cross-platform, so that a client could calculate the hash value themselves, if desired, to predict which node holds a given document.
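To make the idea concrete, here is a minimal sketch of how a client might compute such a hash itself and map a document to a shard. It assumes the 32-bit x86 variant of MurmurHash3, UTF-8 encoding of the document id, and a plain modulo over the shard count; the `shard_for` helper and those choices are illustrative assumptions, not necessarily the scheme Solr ends up using.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def _fmix32(h: int) -> int:
    """Final avalanche step of MurmurHash3 (32-bit)."""
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def murmurhash3_x86_32(data: bytes, seed: int = 0) -> int:
    """Return the unsigned 32-bit MurmurHash3 (x86_32 variant) of data."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h1 = seed & MASK32
    nblocks = len(data) // 4

    # Body: mix in each 4-byte little-endian block.
    for i in range(nblocks):
        k1 = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k1 = (k1 * c1) & MASK32
        k1 = _rotl32(k1, 15)
        k1 = (k1 * c2) & MASK32
        h1 ^= k1
        h1 = _rotl32(h1, 13)
        h1 = (h1 * 5 + 0xE6546B64) & MASK32

    # Tail: mix in the remaining 0 to 3 bytes.
    tail = data[nblocks * 4:]
    k1 = 0
    if len(tail) >= 3:
        k1 ^= tail[2] << 16
    if len(tail) >= 2:
        k1 ^= tail[1] << 8
    if len(tail) >= 1:
        k1 ^= tail[0]
        k1 = (k1 * c1) & MASK32
        k1 = _rotl32(k1, 15)
        k1 = (k1 * c2) & MASK32
        h1 ^= k1

    # Finalization: fold in the length and avalanche the bits.
    h1 ^= len(data)
    return _fmix32(h1)


def shard_for(doc_id: str, num_shards: int) -> int:
    """Hypothetical helper: pick a shard index in [0, num_shards).

    Assumes UTF-8 ids and modulo partitioning (both are assumptions).
    """
    return murmurhash3_x86_32(doc_id.encode("utf-8")) % num_shards


if __name__ == "__main__":
    for doc_id in ("doc-1", "doc-2", "doc-3"):
        print(doc_id, "->", shard_for(doc_id, 4))
```

Because the function depends only on the bytes of the id and the seed, any client that implements it the same way should arrive at the same shard for the same document, which is the cross-platform property described above.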
MurmurHash3
MurmurHash3 is one of the top favorite new hash functions …
Read more
Clients often ask us to help them estimate memory or disk space usage, or to share benchmarks, as they build out their search system. Doing so is always an interesting process, as I’ve always been wary of claims about benchmarks (for instance, one of the old tricks of performance benchmark hacking is to “cat XXX > /dev/null” to load everything into memory first, which isn’t what most people do when running their system) …
Read more
By yonik, September 7, 2011
Solr took another step toward increasing its NoSQL datastore capabilities, with the addition of realtime get.
Background
As readers probably know, Lucene/Solr search works off of point-in-time snapshots of the index. After changes have been made to the index, a commit (or a new Near Real Time softCommit) needs to be done before those changes are visible. Even with Solr’s new NRT (Near Real Time) capabilities, it’s probably not advisable to reopen the …
Read more
Let’s install Solr as a service on Linux. I’m using Ubuntu 11.04.
First, download the latest version of Solr (3.3 as of this writing) from: http://www.apache.org/dyn/closer.cgi/lucene/solr/
Extract the compressed zip or tgz file to where you would like Solr to live.
Currently, I like using runit to run Linux services. http://smarden.org/runit/
Install runit with: sudo apt-get install runit

Create a new service directory.

Create a new shell …
Read more
Monday, 15 August 2011, 18:00 to 21:00
If you’re in central VA, or even in the northern VA / DC area, come join us for the inaugural “Charlottesville Solr and Lucene Meetup”. Charlottesville is home to the co-authors of Manning’s “Lucene in Action” and Packt’s “Solr 1.4 Enterprise Search Server” books. This area is a hotbed of search activity thanks to NGIC and DIA calling Charlottesville home, and the many gov’t subcontractors …
Read more
Tuesday, 12 July 2011 to Friday, 15 July 2011
I had the honor and pleasure of being invited to speak at Überconf last week in the Denver, CO area.
The annual conference is organized by Jay Zimmerman of No Fluff, Just Stuff fame. Überconf has the same top-notch quality, at a grander scale – 10 concurrent tracks (woah!), full day pre-conference trainings (mobile, anyone?), food (full breakfast! that’s a REAL hearty bonus!), and …
Read more
Back in the 1990s, Carnegie Mellon University developed the Capability Maturity Model, a scale for determining how prepared a contractor’s processes were for a particular task. If you’ve ever written software for anyone but yourself, you’ll recognize some of these definitions, which call to mind the famous characterization of the evolution of software.
Sensis, “the search engine for Australians”, uses a modified version of this model to assess their own search processes. It …
Read more
Lucid Imagination founder Marc Krellenstein kicked off the Lucene Revolution yesterday with a keynote address covering the history of search. Here are the slides, followed by some highlights:
Much as we might think of search technology as a 21st-century internet thing, it dates back to when IBM was sued by the US government. By the early days of the Internet, search engines (Lycos, Infoseek, Excite, and Alta Vista) began to accelerate the virtuous cycle of requirements and innovation. …
Read more
Imagine that you have to integrate and search data from 200 different sources, each of which uses a different structure (if they use a structure at all). Your data may be incomplete, the same information is represented in different ways by different sources, and it’s often vague. Oh, and if a user can’t find the correct result using a simple Google-like search, someone may literally get away with murder.
Welcome to Ronald Mayer’s world. In …
Read more