Apache Lucene Eurocon 2011 Barcelona Presentations
Keynotes Presentation Abstracts
Search + Big Data: It's (still) All About the User
Presented by Grant Ingersoll, Lucid Imagination
Thanks in no small part to Lucene, quality keyword search is easily obtainable. Likewise, tools like Apache Hadoop and its ecosystem have made it easier to store and process large quantities of data. Besides being fun for engineers to geek on and pundits to talk about, what does the big data movement mean for the real problem at hand: helping users find relevant content as quickly and cost effectively as possible? In this talk, we'll look at the opportunities the Lucene ecosystem provides to offer better search, discovery and analytics capabilities to developers in order to better enable users.
| Watch session video. |
Architecting the Future of Big Data & Search
Presented by Eric Baldeschwieler, Hortonworks
Apache Hadoop has rapidly become the primary framework of choice for enterprises that need to store, process and manage large data sets. It helps companies to derive more value from existing data as well as collect new data, including unstructured data from server logs, social media channels, call center systems and other data sets that present new opportunities for analysis. This keynote will provide insight into how Apache Hadoop is being leveraged today and how it is evolving to become a key component of tomorrow's enterprise data architecture. This presentation will also provide a view into the important intersection between Apache Hadoop and search.
| Watch session video. |
Realtime Search at Twitter
Presented by Michael Busch, Twitter
At Twitter we serve more than 1.5 billion queries per day from Lucene indexes, while appending more than 200 million tweets per day in realtime. Additionally we recently launched image, video and relevance search on the same engine.
This talk will explain the changes we made to Lucene to support this high load and the changes and improvements we made in the last year.
| Watch session video. |
Day 1 Presentation Abstracts
Portable Lucene Index Format & Applications
Presented by Andrzej Bialecki, Lucid Imagination
This talk will present a design and implementation of a flexible, version-independent serialization format for Lucene indexes and its applications in index upgrades / downgrades, in distributed document analysis, in distributed indexing, and in integration with external indexing pipelines. This format enables submitting pre-analyzed documents to Lucene/Solr, and transferring parts of indexes between nodes in a distributed setup.
| Watch session video. |
Adapting Ajax-Solr to Compare Document sets
Presented by Joan Codina, Barcelona Media – Centre d’Innovació
One of the main features of Solr is Faceted Search. Facets are the top terms present in the results of a query. But facets do not indicate the most statistically relevant terms of a query, that is, those terms that are more frequent in the documents selected by the query than in the rest of the collection. A critical factor in making such statistical insights broadly useful is to make them visual, i.e., using charts and graphs that display these quantitative relationships. We will present how to adapt Ajax-Solr to find the most prominent terms of a query compared to the full set or just another query. We will present an example of how this can be used to find current topics in the news, and extract that information into visually communicative charts and graphs.
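As a rough illustration of the comparison described above, the sketch below ranks terms by how much more frequent they are in a query's result set than in the whole collection. The function name `prominent_terms`, the sample counts, and the simple frequency-ratio score are all assumptions made for illustration; the abstract does not say which statistic the talk uses.

```python
def prominent_terms(query_counts, query_total, collection_counts, collection_total, top_n=3):
    """Rank terms by the ratio of their document frequency in the query
    results to their document frequency in the whole collection."""
    scores = {}
    for term, q_count in query_counts.items():
        c_count = collection_counts.get(term, 0)
        if c_count == 0:
            continue  # no collection-wide count available for this term; skip it
        q_rate = q_count / query_total
        c_rate = c_count / collection_total
        scores[term] = q_rate / c_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical facet counts: number of documents containing each term.
query_counts = {"the": 95, "election": 60, "budget": 30}
collection_counts = {"the": 9500, "election": 400, "budget": 1000}

ranked = prominent_terms(query_counts, 100, collection_counts, 10000)
print(ranked)
```

A plain facet count would put "the" first because it is the most common term in the results; the ratio instead favours terms that are over-represented relative to the collection.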
| Watch session video. |
Configuring Mahout Clustering Jobs
Presented by Frank Scholten, JTeam
For more than a decade internet search engines have helped users find documents they are looking for. However, what if users aren't looking for anything specific but want a summary of a large document collection and want to be surprised? One solution to this problem is document clustering. Clustering algorithms group documents that have similar content. Real-life examples of clustering are the clustered search results of Google News, or tag clouds which group documents under a shared label. Apache Mahout is a framework for scalable machine learning on top of Apache Hadoop and can be used for large-scale document clustering. This talk introduces clustering in general and shows you step-by-step how to configure Mahout clustering jobs to create a tag cloud from a document collection. This talk is suitable for people who have some experience with Hadoop and perhaps Mahout. Knowledge of clustering is not required.
Topics include
- Clustering introduction
- Clustering in Mahout
- Text pre-processing & analysis
- Tag cloud demo
- Tips & tricks
| Watch session video. |
Solr 4 Highlights
Presented by Mark Miller, Lucid Imagination
In this talk, Lucene/Solr committer Mark Miller will discuss some of the new features and advancements that users can look forward to in Solr 4. The list of topics will include: performance optimizations, further support for near-realtime search, SolrCloud, DirectSolrSpellChecker, and more.
| Watch session video. |
Improving Solr's Update Chain
Presented by Jan Høydahl, Cominvent AS
Solr features a little-known internal document processing pipeline called the UpdateRequestProcessorChain or simply the UpdateChain.
In this talk we'll discuss the importance of document processing, when the UpdateChain works well and what limitations it has. We'll then go on to propose a range of possible improvements.
Topics include:
- Examples of use with demo
- How to write your own UpdateProcessor, best practices
- Example: Tika as an UpdateProcessor
- A vision for future improvements
| Watch session video. |
Archive-It: Scaling Beyond a Billion Archival Webpages
Presented by Aaron Binns, Internet Archive
This session describes Archive-It, the Internet Archive's subscription, self-serve web archiving service, with a focus on its full-text search system. With nearly 200 partners and over 2000 collections, the custom Lucene-based system handles 3+ million index updates per day across an index that totals over 1.3 billion documents. The session will give a detailed description of the architecture and implementation of the Archive-It search system, highlighting many of the challenges that arise from its scale and its complex use cases.
| Watch session video. |
Search Analytics: Business Value & BigData NoSQL Backend
Presented by Otis Gospodnetic, Sematext International
Search is increasingly the primary information access mechanism, so knowing how your search is doing often has direct business impact. You’ve indexed your data and people are searching it. But how do you know if they are happy with the results? How do you know if they are finding what they need? Regardless of whether you are using Solr, Lucene, Elastic Search or some other search solution, you should be paying attention to what your users are telling you through their queries and clicks.
In the first part of this presentation we’ll talk about what Search Analytics is, why it's valuable, and how it can be used to answer questions like:
- Are too many users getting the dreaded “no matches” results?
- How deep into search results do people dig?
- Which hits are they clicking on, or what percentage of them don’t click on any hits?
- How much do they use the “Did You Mean” or “Auto-Complete” suggestions?
We’ll explore what specific search analytics reports tell us and what specific actions you should take based on those reports.
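To make the questions above concrete, here is a minimal sketch of how a few such numbers could be derived from a query log. The event layout (`hits`, `clicked_rank`) and the function name `search_metrics` are hypothetical and are not taken from the service described in the talk.

```python
def search_metrics(events):
    """Summarise a query log. Each event is a dict with 'hits' (number of
    results returned) and 'clicked_rank' (1-based rank clicked, or None)."""
    total = len(events)
    if total == 0:
        return {"zero_result_rate": 0.0, "no_click_rate": 0.0, "mean_click_rank": None}
    zero_results = sum(1 for e in events if e["hits"] == 0)
    clicks = [e["clicked_rank"] for e in events if e["clicked_rank"] is not None]
    return {
        # share of searches that returned "no matches"
        "zero_result_rate": zero_results / total,
        # share of searches where the user clicked nothing
        "no_click_rate": (total - len(clicks)) / total,
        # how deep into the results users click, on average
        "mean_click_rank": sum(clicks) / len(clicks) if clicks else None,
    }


# Hypothetical log entries.
log = [
    {"hits": 12, "clicked_rank": 1},
    {"hits": 0, "clicked_rank": None},
    {"hits": 40, "clicked_rank": 5},
    {"hits": 3, "clicked_rank": None},
]
metrics = search_metrics(log)
print(metrics)
```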
In the second part of the presentation we'll talk about how we've used Flume, Hadoop, MapReduce, and HBase to build a scalable Search Analytics service.
| Watch session video. |
Just the Job: Employing Solr for Recruitment Search
Presented by Charlie Hull, Flax
Using a case study on a major European executive recruitment company, we will show how we used Apache Lucene/Solr to build powerful, flexible, accurate and scalable search services over tens of millions of CVs and candidate records, allowing the company to completely restructure their IT provision for both local and national offices.
| Watch session video. |
Multilingual Search and Text Analytics with Solr
Presented by Steve Kearns, Basis Technology
The power of the Solr search engine has rapidly gained it acceptance as an alternative to commercial search solutions for many applications. Organizations require many features to serve their diverse communities; among these is the ability to deliver search excellence in users' native languages. Delivering quality multilingual search involves careful understanding of data and design of schemas, and selection of the best linguistic approaches for the supported languages.
This talk will explore the challenges of multilingual search, including language-specific issues (such as N-gram segmentation vs. morphological analysis, stemming vs. lemmatization, and language identification) and the various approaches to configuring your Solr schema. We will also discuss the integration strategies for common text analytics capabilities and the impact of multilingual content on application design.
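The trade-off between N-gram segmentation and morphological analysis can be seen in a small sketch. The helper `char_ngrams` below is a hypothetical illustration of overlapping character bigrams; it is not Solr's own tokenizer.

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams; text no longer than n is returned whole."""
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# "東京都庁" is the Tokyo Metropolitan Government Office.
bigrams = char_ngrams("東京都庁")
print(bigrams)
```

The bigrams include "京都" (Kyoto), so a bigram index would match a query for Kyoto against a document about Tokyo. A morphological analyzer that segments on word boundaries would avoid that false match, at the cost of needing a dictionary or model for each language.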
| Watch session video. |
Relevance at Cengage: English & Non-English Content
Presented by Ivan Provalov, Cengage Learning
In this session we describe relevance improvements we have implemented in our Lucene-based search system for English and Chinese content, and the tests we have performed for Arabic and Spanish content based on TREC data. We will also describe our relevance feedback web app, which lets end users rank the results of various queries. The presentation will cover the usage data we analyze to improve relevance. We will also touch upon our OCR data indexing challenges for English and non-English content.
| Watch session video. |
Scaling Search at Trovit with Solr & Hadoop
Presented by Marc Sturlese, Trovit
Trovit is a global classified advertising service covering real estate, jobs and more in 27 countries worldwide. Until recently, our distributed Lucene/Solr search indexes used a customized Data Import Handler to draw data out of MySQL, but that approach no longer handles our volumes with acceptable performance. We have moved to building Lucene/Solr indexes with MapReduce, an approach that has been in production for several months. Here at Trovit, we deal with many countries and different business categories, each with its own index -- and not all of them have similar size or structure.
I'll present our experience as a combined use case/tutorial, beginning with a brief introduction about the main Solr features we use at Trovit, and then move to the more complex part:
- Brief explanation of the data pipeline handled by Hadoop before our ads are indexed, with implementation details of the indexing process, deploying indexes from HDFS, etc.
- Tuning performance parameters to improve indexing speed as much as possible and keep good search performance
- Managing the effect of GC at search time as much as we can as we deal with shards
- Moving indexing time Solr features like DeDuplication to MapReduce.
- Using Solr analyzers to analyze large amounts of text outside of an indexing process
I'll also talk about how we used the phased indexing strategy to manage indexes across countries and verticals (jobs, autos, etc.) and how we worked around limitations in SOLR-1301.
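As an illustration of moving deduplication out of indexing time and into a map/reduce step, the sketch below groups documents by a content signature and keeps one document per group. It is plain Python rather than Hadoop code, and the field names, the MD5-based signature, and the "keep the first document" rule are assumptions; the abstract does not describe Trovit's actual scheme.

```python
import hashlib
from collections import defaultdict


def signature(doc, fields):
    """Hash the chosen fields after lowercasing and collapsing whitespace."""
    parts = (" ".join(str(doc.get(f, "")).lower().split()) for f in fields)
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()


def map_phase(docs, fields):
    """Map step: emit (signature, document) pairs."""
    for doc in docs:
        yield signature(doc, fields), doc


def reduce_phase(pairs):
    """Reduce step: group by signature and keep the first document seen."""
    groups = defaultdict(list)
    for sig, doc in pairs:
        groups[sig].append(doc)
    return [group[0] for group in groups.values()]


# Hypothetical ads; the second differs from the first only in case and spacing.
ads = [
    {"id": 1, "title": "Flat in Barcelona", "description": "Two bedrooms, sunny"},
    {"id": 2, "title": "flat in  barcelona", "description": "Two bedrooms, sunny"},
    {"id": 3, "title": "House in Madrid", "description": "Garden"},
]
unique = reduce_phase(map_phase(ads, ["title", "description"]))
print([doc["id"] for doc in unique])
```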
| Watch session video |
Lessons Learned: Refactoring a Solr-Based API App
Presented by Torsten Koester, smatch.com/ Shopping24
In this case study I'll discuss architectural lessons learned from refactoring an existing REST API backed by Apache Solr. The initial goal of the refactoring was to speed up data access while scaling from 5m documents to 20-50m documents stored in Solr. Under consideration were the hosting infrastructure, the REST API Java code and the Solr documents and configuration. In this talk I'll give a brief review of the results.
"Pimping" the Solr configuration, the client access and the document structure achieved better results. But the elementary lesson learned was that a significant increase in data access speed can only be realized with a functional redesign and a simplification of the REST API. I'll explain how this led us directly to distinct Solr cores and why we dropped the introduction of Solr shards or a breathing cloud infrastructure.
| Watch session video. |
Solr + Greenplum = MPP Solr
Presented by George Chitouras, EMC
The overall topic is computation on structured and unstructured data with "Big Data Text Analytics" as the goal. In this session we will look at Solr integrated with an MPP database, which is a shared-nothing RDBMS architecture that scales out linearly. One feature of Greenplum (our MPP database) is "external tables", which allows the definition of RDBMS tables based on files, processes/pipes, or HTTP-accessible resources. These external tables feature parallel I/O utilizing the aggregate I/O (disk and network), memory, and CPU of the cluster. Apache Solr is a natural choice for such an integration due to its REST-like HTTP API, which is a clean fit for the framework. We have deployed Solr in a highly parallel manner to achieve high-performance, scalable text indexing and search, integrated with the MPP database to provide advanced analytics.
Natural Language Search in Solr
Presented by Tommaso Teofili, Sourcesense
This presentation aims to showcase how to build and implement a search engine that can understand a query written in a way much nearer to spoken language than to keyword-based search, using Apache Lucene/Solr and Apache UIMA. A system which can recognize semantics in natural language can be very handy for non-expert users, e-learning systems, customer care systems, etc. With such a system it's possible to submit queries such as "hotels near Rome" or "people working at Google" without having to manually transform a user-entered natural language query into a Lucene/Solr query.
The Solr - UIMA integration (since Solr 3.1.0) can help in building such intelligent systems using NLP / text mining algorithms on documents being indexed and on queries written by the user.
This module gives Solr the ability to call UIMA pipelines when documents are indexed to trigger automatic extraction of metadata (e.g. named entities like people, places, organizations, etc.) using existing and custom algorithms as UIMA analysis engines. The talk will cover:
- The Solr - UIMA integration
- Introducing UIMA to Lucene's analysis phase
- Running existing open source NLP algorithms in Lucene/Solr
- Orchestrating blocks to build a sample system able to understand natural language queries
We'll introduce these points using examples (architectures & code) and a sample demo system.
| Watch session video. |
Securing Documents in Solr with Manifold CF
Presented by Karl Wright, Nokia, Inc
This talk combines a brief presentation with a panel Q&A session with a number of key experts in the field of open source content acquisition and security. We'll start by familiarizing Solr users with the capabilities of Apache Manifold Connector Framework, concentrating on how Manifold CF (MCF) can be used to project a repository's security into Solr search results through the use of Manifold CF's Authority Service and a custom Solr search component. We'll then transition to a panel discussion designed to explore case studies of how this security architecture has worked out when deployed in the field, and take questions from the audience. If you have questions in advance you would like us to consider for the panel discussion, we'd welcome them. You may submit questions ranging from 'how-to' to the MCF roadmap to kwright(at)apache.org.
| Watch session video |
Improved Search with Lucene 4
Presented by Robert Muir, Lucid Imagination
This talk describes how you can practically apply some of Lucene 4's new features (such as flexible indexing, scoring improvements, column-stride fields) to improve your search application.
The talk will give a brief description of these new features, along with practical example use cases you can try yourself. We'll cover how you can configure Solr to:
- Set up the schema to use the Pulsing or Memory codec for a primary key field
- Avoid a separate spellcheck index, controlling character-level swaps from the query processor
- Sort with a different locale
- Use per-field similarity configurations, such as a non-vector-space algorithm
| Watch session video. |
Enterprise Search: FAST ESP to Lucene Solr
Presented by Michael McIntosh, TNR Global, LLC
This presentation will discuss migration from FAST ESP to a Lucene Solr search platform. Illustrated through actual case studies, the presentation will include challenges and concerns, and present solutions and work-arounds to overcome migration issues. There are many reasons that an IT department with a large scale search installation would want to move from a proprietary platform to Lucene Solr. In the case of FAST Search, the company's purchase by Microsoft and discontinuation of the Linux platform has created an urgency for FAST users.
| Watch session video. |
More Powerful Solr Search with Semaphore
Presented by Jeremy Bentley, Smartlogic
Metadata is widely understood to be a critical element of search, discovery and classification. But with the preponderance of unstructured data addressed by search technology, consistent native metadata is often in short supply. Organizations often find that the quality and depth of contextual metadata (what documents are about) can make or break search relevancy, precision and recall.
Semaphore is an enterprise semantic platform that uniquely captures an organization's subjects and topics into a taxonomy or ontology (model), in a manner that adds context for enhanced navigation and findability. Semaphore augments traditional information management systems like Solr search by adding advanced content classification, metadata and navigation capabilities to deliver a more complete, higher quality enterprise information management experience. This talk will focus on the following:
- Deep dive into the technical integration of Semaphore with Apache Solr (including the connection points between Semaphore and Solr)
- Discuss the Semaphore modules (Ontology Manager, Classification Server, Semantic Enhancement Server and Search Application Server) and how they provide better findability
- Share a demonstration of Solr in action
- Present a client case study (Nordyske).
| Watch session video |
Stump the Chump
Presented by Chris Hostetter, Lucid Imagination
Do you have a tough problem with your Solr application? Facing challenges that you'd like some advice on?
Looking for new approaches to overcome a Solr issue? Not sure how to get the results you expected? Don't know where to get started? Then this session is for you. Get Chris Hostetter (aka Hoss) to come up with an immediate answer to your challenge or interesting problem.
During the session, Hoss will see the questions for the first time - and then will provide his approach to the problem. Our panel of judges will decide if he has provided an effective solution. Prizes will be awarded by the panel for the best question - and for those deemed to have "stumped the chump".
| Watch session video. |
Day 2 Presentation Abstracts
Lucene Today, Tomorrow & Beyond
Presented by Simon Willnauer, JTeam / Apache Lucene
Apache Lucene has grown into one of the most widely used Open Source search technologies. For more than a decade Lucene has been used to retrieve search results for millions of users, from mobile phones to world-scale applications with billions of queries every day. This talk introduces the current state of the Lucene ecosystem from a technical perspective and tries to provide a future vision of the project even beyond the next revolutionary major release.
| Watch session video. |
Designing Mobile Search
Presented by Tyler Tate, TwigKit
About 15% of searches in 2011 have been performed on mobile devices, with an estimated rise to one in every four by next year. And people aren't just searching Google: restaurants, cars, electronics, and even enterprise content are all being searched by people on the move.
How should we design Lucene- and Solr-powered search experiences for phones and tablets? To be sure, very different rules apply; small screens, slow connections, limited attention, and location awareness all afford very different user interfaces between desktop and mobile devices.
This talk will examine design patterns for mobile search, including approaches to faceted navigation, autocomplete, sorting, breadcrumbs, recent history, and bookmarks, as well as how these design patterns fit together as a whole.
| Watch session video. |
Solr on Windows: Does it Work? Does it Scale?
Presented by Teun Duynstee, Funda
We will present a case study about running Solr on Windows and using Solr from Windows.
funda.nl is a Dutch household name. Our website is by far the largest real estate search engine in the Netherlands. Searching for homes for sale is the main functionality and used to be implemented as a home-grown, SQL Server-based solution. This worked fine performance-wise, but it was not very flexible when making changes to facets and searching/sorting. Over the last year we have migrated this solution to Solr.
funda serves 10M pageviews daily and most of these pages involve searching and faceting. Every month, 3M unique visitors visit the site (nearly 20% of the nationwide population of 16M). Our systems are built on the .NET platform in C# and this is also the skill set of the development and operations teams.
We'll discuss:
- What kind of problems did we encounter when connecting to the Solr service from .NET?
- Scalability: we will give many metrics about our solution, including searches per second, indexing speed, effects of caching under load, indexer/searcher topology, etc.
- How caching of results in memcached compared to using the internal caching in Solr
- Choices we made in running Solr on Windows. We use Tomcat and run it as a Windows system service. Results of stress and load tests we did
- How we introduced Solr into the organization, taking away risks and uncertainties by doing a phased transition
| Watch session video. |
The Many Facets of Apache Solr
Presented by Yonik Seeley, Lucid Imagination
Faceted Searching is a must-have feature for enhancing findability and user engagement in enterprise search UI. The Faceted Searching features of Apache Solr have been a major factor in its popularity, but many Solr users don't fully appreciate all of the capabilities that are available. In this session we will deep dive into the different types of data facets that Solr supports, discussing in detail the various options that can be used to explore them. We will also review some specific techniques for dealing with several complex use cases, and discuss some performance "gotchas" and how to avoid them.
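For readers who have not used Solr facets, the sketch below assembles a request that asks for two field facets and one query facet using Solr's standard `facet`, `facet.field`, `facet.query` and `facet.mincount` parameters. The host, handler path, and field names (`category`, `brand`, `price`) are hypothetical, and the helper `facet_url` is only an illustration, not part of any Solr client library.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def facet_url(base_url, query, facet_fields, facet_queries=()):
    """Build a Solr select URL that returns facet counts and no documents."""
    params = [
        ("q", query),
        ("rows", "0"),            # only the facet counts are wanted
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", "1"),  # hide facet values with zero matches
    ]
    params += [("facet.field", field) for field in facet_fields]
    params += [("facet.query", fq) for fq in facet_queries]
    return base_url + "?" + urlencode(params)


url = facet_url(
    "http://localhost:8983/solr/select",  # assumed local Solr instance
    "laptop",
    ["category", "brand"],
    ["price:[0 TO 500]"],
)
parsed = parse_qs(urlsplit(url).query)
print(url)
```

Field facets count documents per indexed term in a field, while a query facet counts documents matching an arbitrary query, which is useful for ad hoc buckets such as a price band.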
| Watch session video. |
Search, REST & Play! Video Discovery
Presented by James Alexander, Open University
The Open University has been creating programmes in partnership with the BBC for over forty years. The resulting video archive contains over 9,000 programmes and 70,000 tapes of raw footage. The Access to Video Assets (AVA) project has been making this collection accessible to facilitate reuse and digitally preserving content: about 100TB of data so far.
Making video discoverable presents challenges that go beyond text-based search, and AVA has used Solr in unlocking this archive and bringing order to a collection that contains a plethora of physical and digital formats: these include over twenty types of tape, BBC holdings records, library catalogue and transmission data, subtitles, transcripts, rights contracts and other documents.
The resulting interface is designed to make video and metadata held in a repository accessible in innovative and intuitive ways for use by non-specialists. Its features include tag clouds hyperlinked to play out video content as well as other visualization and exploration tools including image storyboards, dynamic menus and facets.
This presentation outlines how Solr's REST API has been used to rapidly develop a Drupal interface and how indexing can be used to produce 'in-video' search.
| Watch session video. |
Text Analytics in Enterprise Search
Presented by Daniel Ling, Findwise
Text analytics is a large and interesting subject, covering a wide range of topics. In the world of enterprise search, however, the usual application of text analytics rarely ranges beyond extracting semi-structured information from the source data. Yet some of the more advanced concepts in text analytics, such as automatic text categorization, can easily be leveraged to turn a search installation from a search tool into a tool for discovery.
This talk will focus on how Findwise was recently able to combine entity extraction and categorization techniques with full text search in order to generate real business value for our customers. This was accomplished using a mix of technologies and readily available tools, from simple linguistic models to machine learning.
| Watch session video. |
Heavy Committing: Flexible Indexing in Lucene 4
Presented by Uwe Schindler, SD DataSolutions GmbH
Apache Lucene's next major release, 4, will introduce lots of flexibility into indexing, but also fundamental changes to the well-known APIs: It features a new and consistent, 4-dimensional iteration API on top of a low-level, pluggable codec API giving applications full control over the postings data. Terms are now arbitrary opaque bytes enabling users to store terms in any encoding, not necessarily UTF-8, natively in the index (e.g. numeric fields). Currently under development is a higher performance postings iteration API, enabling interesting codecs based on recent encoding algorithms to work effectively. Several codecs have already been created, including the default "standard" codec, which enables sizable RAM reduction for searchers, and a "pulsing" codec that inlines postings data directly into the terms dictionary, which provides a solid performance boost for primary key fields. A lot of new codecs are under development. In this talk, Uwe presents an overview of all of these exciting changes, as well as several concrete, real-world examples of how applications can tap into these new features.
| Watch session video. |
Understanding & Visualising Solr 'explain' Information
Presented by Rafal Kuc, Solr.pl
This talk covers how to use, understand and visualize Solr 'explain' information, the essential output from Solr that lets you better tune and debug your search application. I'll show free software, currently in development, that visualizes Solr 'explain' information, such as how a document's score was calculated, which factors it was derived from, which tokens mattered the most, and so on.
| Watch session video. |
Solr at Virgin Money Giving
Presented by Robin Bramley, Ixxus
Virgin Money Giving is a UK-based not-for-profit business that was launched as a result of Virgin Money’s sponsorship of the London Marathon and raised over £10 million for charity in the first 6 months of operation. The aim of the business is to provide a better deal for charities than that provided by the for-profit companies that previously dominated the sector. Search is of critical importance to the business to help connect fundraisers with charities and fundraising events, as well as allowing donors to find charities to donate to or fundraisers to sponsor.
The architectural vision was to build a service-oriented architecture leveraging Open Source software for cost effectiveness and flexibility. Ixxus, a Lucid Imagination partner, helped Virgin Money Giving to realise their overall vision including designing and implementing a search architecture that met the following goals:
- Not polluting business logic or tightly coupling it to a search engine API
- Asynchronous ‘fire and forget’ indexing
- Read-only replica search nodes for scale out
- High Availability / Disaster Recovery
This presentation will describe how the combination of Solr, the Spring Framework and JMS was successfully used on Virgin Money Giving, a medal winning project in the British Computer Society 2010 Computing awards. This session is aimed at architects and will cover the event-driven approach employed, the Solr features utilised and some of the alternative solutions that might be considered now.
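The "fire and forget" goal listed above can be sketched without any messaging middleware. In the example below an in-process queue and a worker thread stand in for JMS, and `index_fn` stands in for the call that sends a document to Solr; the class name `AsyncIndexer` and the sample documents are hypothetical and are not taken from the Virgin Money Giving implementation.

```python
import queue
import threading


class AsyncIndexer:
    """Accept documents immediately and index them on a background thread."""

    def __init__(self, index_fn):
        self._queue = queue.Queue()
        self._index_fn = index_fn
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, document):
        """Called by business logic; returns without waiting for indexing."""
        self._queue.put(document)

    def _run(self):
        while True:
            document = self._queue.get()
            try:
                self._index_fn(document)
            except Exception:
                pass  # a real system would log the failure and retry or dead-letter
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until everything queued so far has been handled (tests, shutdown)."""
        self._queue.join()


indexed = []  # stands in for the search index
indexer = AsyncIndexer(indexed.append)
indexer.submit({"id": "charity-1", "name": "Example Charity"})
indexer.submit({"id": "event-7", "name": "Example Marathon"})
indexer.flush()
print([doc["id"] for doc in indexed])
```

Because the business code only ever calls `submit`, it has no dependency on the search engine API, which is the decoupling the first goal in the list asks for.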
The application built for Virgin Money Giving was awarded a medal by the BCS (http://bcs.org) for an outstanding community IT project. See http://bit.ly/Virgin-Solr-BCS for details.
| Watch session video |
Randomized Continuous Testing: Solr & Lucene Use Case
Presented by Dawid Weiss, Carrot Search s.c
We have been taught that unit tests should be repeatable, and most people (including the author) for a long time considered this equivalent to "static", single-path execution. Solr and Lucene employ an interesting JUnit runner strategy where tests are randomized -- run with various data, various implementations of allowed interfaces, various configurations. The number of combinations makes running them all impossible, but execution randomization proves very successful at pinpointing implementation and regression bugs. This talk will provide an overview of this approach and practical considerations on when and how to port it to your own projects. The techniques that stem from Lucene/Solr are not directly tied to search or document retrieval and can be reused in other projects. This session is probably best suited to developers and CTOs.
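The idea of randomized yet repeatable tests can be sketched in a few lines. The example below is plain Python, not the Lucene/Solr JUnit runner: it draws random inputs from a seeded generator, checks properties that must hold for any input, and reports the seed so a failing run can be replayed exactly. The function under test, `dedupe_sorted`, and the helper `run_randomized_test` are hypothetical.

```python
import random


def dedupe_sorted(values):
    """Function under test: the distinct values in ascending order."""
    return sorted(set(values))


def run_randomized_test(seed=None, iterations=200):
    """Check dedupe_sorted against random inputs; return the seed that was used."""
    if seed is None:
        seed = random.randrange(2**32)  # fresh seed on every normal run
    rng = random.Random(seed)
    for _ in range(iterations):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 30))]
        result = dedupe_sorted(data)
        # Properties that must hold for any input; the seed goes into the message.
        assert all(a < b for a, b in zip(result, result[1:])), f"not strictly increasing, seed={seed}"
        assert set(result) == set(data), f"elements differ, seed={seed}"
    return seed


seed = run_randomized_test()
print(f"passed with seed {seed}; call run_randomized_test(seed={seed}) to replay this run")
```

Each run explores a different slice of the input space, while passing the reported seed back in reproduces the same inputs, which keeps failures debuggable.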
| Watch session video. |
Using Solr Cloud, For Real!
Presented by Jon Gifford, Loggly
Loggly is a cloud-based logging service. It helps you collect, index, and store all your log data and then makes it accessible through search for analysis and reporting. All this is done without having to download or install anything on your servers. We have hundreds of customers, each of whom may have dozens of shards, quickly growing to thousands of individual indexes. To manage this explosion of indexes, I'll describe how we're using Solr Cloud to manage each and every index - from creation, through migration from box to box, and finally destruction. I'll describe some of the performance issues we had to deal with, especially with ZooKeeper.
| Watch session video. |
Solr on EC2
Presented by Erick Erickson, Lucid Imagination
"Cloud computing" is all the rage recently, and Amazon's EC2 is one of the major players. The idea of spinning up a new instance of Solr in seconds to accommodate increased load is very attractive, especially as it can be done on demand, without heavy infrastructure investment. But how does that actually work?
This talk will (very) briefly outline creating a ready-to-deploy image containing a Solr instance. From there we'll discuss the various considerations to keep in mind when running Solr on EC2, including replication concerns, monitoring and integration with CloudWatch, indexing, and cost.
We'll also explore autoscaling (automatically increasing search capacity in response to the current load) and some of the issues that need to be considered when planning for autoscaling that are specific to Solr.
Finally, we'll consider the possibilities that EC2 offers in terms of answering the persistently difficult-to-answer question: "how many documents can I put on my server?"
| Watch session video. |
Better Search Engine Testing
Presented by Eric Pugh, OpenSource Connections LLC
"I know it when I see it".
This phrase was coined by a Supreme Court Justice in reference to obscenity, but he might as well have been talking about relevancy and search engine results. Testing search engines is rarely a binary process of "it works, it doesn't work"; instead it draws on our human skills to design tests that capture the intangibles that make up a great search engine implementation! The behavior of a search engine changes as the data changes, so a search that returns one set of results today will return a different set tomorrow. Is that a bug? Or just a finely tuned search engine responding to changes in the data it searches? Search Engine testing often focuses on the very first layer of functionality, "Do I get results?", without digging deeper into "Do I get great relevant results?".
Search Engine implementation projects are typically less about writing new code, and more about integrating disparate existing data sets, turning knobs and levers to tune relevancy, and really understanding your data. Testing Search Engines really is a holistic activity.
You will leave this session armed with an overview of what search engines are and how they work, and with real-life techniques to apply both to exploratory testing of search and to automated testing. You will also leave with a good grasp of using SolrMeter to quickly test the impact of configuration changes.
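As one illustration of what an automated relevancy check can look like, here is a minimal sketch that is not taken from the talk. It assumes you have a list of document IDs returned by a query and a set of IDs that a human has judged relevant; the function names, document IDs, and the threshold are all hypothetical.

```python
def precision_at_k(results, relevant, k):
    """Return the fraction of the top-k results that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = results[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k


def check_relevancy(results, relevant, k, threshold):
    """Return True when precision@k meets an agreed minimum (an assumed threshold)."""
    return precision_at_k(results, relevant, k) >= threshold


# Hypothetical judged query: IDs returned by the engine, and the judged-relevant set.
returned = ["d3", "d1", "d7", "d2"]
judged_relevant = {"d1", "d2"}

print(precision_at_k(returned, judged_relevant, 4))  # 2 of the top 4 are relevant
print(check_relevancy(returned, judged_relevant, 4, 0.5))
```

Because results shift as the indexed data changes, a threshold on a metric like this tends to be more stable than asserting an exact result list.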
| Watch session video. |
Solr @ Etsy
Presented by Giovanni Fernandez-Kincade, Etsy.com
Search at Etsy poses significant challenges. Our marketplace is filled with millions of unique, short-lived items and people trying to find them over 10 million times a day. In this session we'll discuss many of the solutions we've engineered to meet these challenges. These include:
- Infrastructure approaches like using Thrift as our interface to Solr and writing our own load-balancer.
- Writing custom code that inter-operates with Solr, from QParserPlugins to in-house Stemmers.
- Internationalization efforts, including tailoring the search experience to user language and location, and on-the-fly query translation – all a natural result of our efforts to create a global marketplace.
Finally, no talk about Etsy would be complete without some exploration of our incremental development strategy and the tools that empower it. We'll describe in detail how we continuously deploy our search stack, instantly change server configuration, and measure the impact of algorithmic changes using side-by-side user tests and live A/B tests.
| Watch session video. |
Lightning Talks
Morphological Analysis and Named Entity Recognition for your Lucene/Solr Search Applications
Presented by Christoph Goller, Intrafind AG
This talk will show how the relevance of search results can be improved by using morphological analysis and named entity recognition. After briefly explaining the purpose of morphological analysis and of named entity recognition, we will demonstrate their potential advantages for search, faceting, and clustering of search results in a live demo.
| Watch session video. |
Java 7 and Lucene: the story behind the story
Presented by Uwe Schindler, SD DataSolutions GmbH
The recent release of Java 7 and its testing with Lucene exposed some significant problems with the released Java code. We'll give a brief synopsis of the technical state of play of these issues and tell you what you need to know. (A full accounting of the chronology is available at http://blog.thetaphi.de/2011/07/real-story-behind-java-7-ga-bugs.html).
| Watch session video. |
Navigating Subdocuments with Solr
Presented by Mikhail Khludnev, Grid Dynamics Consulting Services
Solr does great filtering and faceting, but often your documents are not flat at all and have some kind of structure, i.e., items or sub-documents. Ignoring their identity leads to a poor navigation experience. Lucene can model such sub-documents via its intrinsic abilities: TermPositions and SpanQueries. Unfortunately, Solr doesn't support them for filtering, faceting, and sorting. We show how Solr can be easily extended for the ultimate sub-document navigation experience, including filtering, faceting, and sorting. More details are at http://blog.griddynamics.com/search/label/Solr
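To see why ignoring sub-document identity hurts, consider the following minimal sketch in plain Python. It does not use Lucene or Solr and is not the speaker's implementation; the product, its fields, and both function names are hypothetical. It contrasts matching against flattened multivalued fields with matching against individual items.

```python
# A hypothetical product with two sub-documents (items).
product = {
    "id": "shirt-1",
    "items": [
        {"color": "red", "size": "S"},
        {"color": "blue", "size": "M"},
    ],
}


def flat_match(doc, **criteria):
    """Match as if every item field were flattened into one multivalued field."""
    flattened = {}
    for item in doc["items"]:
        for field, value in item.items():
            flattened.setdefault(field, set()).add(value)
    return all(value in flattened.get(field, set())
               for field, value in criteria.items())


def item_match(doc, **criteria):
    """Match only when a single item satisfies every criterion."""
    return any(
        all(item.get(field) == value for field, value in criteria.items())
        for item in doc["items"]
    )


print(flat_match(product, color="red", size="M"))  # True: a false positive
print(item_match(product, color="red", size="M"))  # False: no single item is red and M
```

The flattened view reports a red shirt in size M even though no such item exists, which is the kind of misleading filter result that sub-document-aware matching avoids.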
| Watch session video. |
Powered by Lucene: IBM Content Analytics with Enterprise Search
Presented by Wolfgang Jung, IBM
See and hear how IBM applies Lucene in its commercial software offerings. Hear about the development experience and the advantages of this approach.
| Watch session video. |
Searching in more than 140 Years of Newspaper Articles
Presented by Nicola Provenzano, Bassilichi Group
Bassilichi Group worked on the implementation of the oldest Italian newspaper historical archive, that of "La Stampa di Torino", covering 1867 to 2006. Lucene technologies have powered this success story to highlight the content of over 5,000,000 articles captured from 2,000,000 pages, printed in an unstructured layout and recognized with a semantic analysis approach. An example of the implementation may be found at http://devlastampa.bdadoc.it/.
| Watch session video. |
Solr Performance Monitoring
Presented by Otis Gospodnetic, Sematext International
This talk shows how SPM for Solr, a Solr performance monitoring SaaS developed by Sematext International, can be used to monitor one or more Solr instances and clusters. The service monitors all existing Solr metrics; JVM garbage collection and memory metrics; OS-level metrics such as CPU, memory, load, and disk and network IO; Lucene/Solr index size and the number of deleted documents, files, and segments; and the performance of key Lucene and Solr components such as IndexWriter, QueryComponent, and HighlighterComponent. SPM for Solr can monitor Solr 1.4.*, 3.*, and 4.*. The service is currently completely free: http://sematext.com/spm/solr-performance-monitoring/index.html
| Watch session video. |
Using Lucene Payloads in Solr
Presented by Neil Hooey
Solr supports per-document boosts at index time, which can improve relevance and help reduce the presence of spammy documents in search results. If you want to support per-value boosts for a multivalued field, you will need to use a Lucene feature called "payloads". In this talk I will show the value of this feature and demonstrate its implementation in Solr.
Basically I'm going to demonstrate some queries using payloads and quickly go through the code and configuration necessary to access this feature from within Solr.
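The idea of a per-value boost can be sketched without Solr at all. The plain-Python illustration below is not the code from the talk. It assumes field values are written as `term|weight` tokens, a delimited style chosen here for illustration, and that a term without a weight defaults to 1.0; the function names and sample values are hypothetical.

```python
def parse_payload_field(raw, delimiter="|", default=1.0):
    """Split 'term|weight' tokens into a {term: weight} mapping."""
    weights = {}
    for token in raw.split():
        term, sep, weight = token.partition(delimiter)
        # A token with no delimiter gets the assumed default weight.
        weights[term] = float(weight) if sep else default
    return weights


def payload_score(raw, query_term):
    """Score a field by the weight attached to the matching value, or 0.0."""
    return parse_payload_field(raw).get(query_term, 0.0)


# Hypothetical multivalued field: each value carries its own boost.
tags = "red|2.0 blue|0.5 green"

print(parse_payload_field(tags))
print(payload_score(tags, "blue"))
```

A per-document boost would apply one number to every value in the field, whereas here each value contributes its own weight to the score.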
| Watch session video. |
