I've been talking a lot about search engine scoring and continuous improvement cycles, and we’ve been going through this process in detail during a client project. It's been completely fascinating. We get together every day or every other day and spend two hours laboriously modifying query structures, optimizing weights, reviewing results, looking at Solr Explain scores, and scoring the search engine's performance at every step to track our progress.
Test your queries, measure your relevance score, then test again
The experience has been a revelation. First, because it’s working as a continuous improvement cycle in exactly the way we always envisioned, and because of how unforgiving the process is. When you are ruled by your metrics (as we are), it's humbling, because lots of great ideas may amount to very little improvement.
Second, because the Query Processing Language (QPL) is such a star. You can build just about any query structure or weighting in QPL and see the results instantly. It's a pleasure to work with, and we’ve been able to do some really sophisticated things with it relatively easily. With QPL, we can run one-day sprints, which would've been impossible otherwise.
So far, we've run almost 100 different experiments. Here's a graph of the search engine score for each test, showing the increase in the scores from test to test.
[Figure: Improving search engine scores from test to test]
The process has felt more like being a baseball manager than like any other computer programming task I have ever experienced. It’s like you are playing the averages. You put your players (new query structures) into the game and see how they do. On average, do they win more games than they lose? What is their WAR (Wins Above Replacement)?
Testing and measuring help us refine those query structures: tweaking, adjusting weights, double-checking, and looking at scores. This is a lot like working with each of your players and helping each one reach their full potential, so that when they're all in the game, they work smoothly together and win games for your team.
Be a surgeon when it comes to search relevancy tuning
Next come the daily calls, which are a lot like being in an operating theater: as many as 10 people are on the call, watching as we rip open the search engine and operate on the query structures.
The other thing which makes this a lot like being a surgeon is just how careful you have to be about everything. You can’t leave anything to chance. You need to double check every single query, every single construction, and every single statistic. We found all sorts of problems and bugs throughout the system, including dozens of cases where the query structures weren’t quite right, the weights weren’t set, issues with date mathematics, a bug in the lemmatizer, acronym lemmatization, random thesaurus expansion issues, document security issues, and so on.
Paying attention to detail allows you to spot oddities that are normally undetectable. For example, we created an algorithm that finds the least frequent terms in a particular query, identifies them as ‘key’ terms, and then boosts titles containing those key terms. After adding the algorithm, we got a modest increase in the engine score. Success!
But then we started looking at what was ‘key’ according to the algorithm for the most popular queries, and discovered that the choices often looked pretty odd. For example, with |powered hang gliders|, the key terms were “powered” and “hang.” For |world war| the key term was “world,” etc. So we modified the algorithm to look for phrase expansions from the thesaurus: if phrase expansions exist for the query, those expansions become ‘key.’ Now, rather than just “world,” the key term for |world war| is actually the phrase “world war,” which makes a lot more sense.
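To make the key-term idea concrete, here is a minimal Python sketch. It is not our production code (that lives in QPL); the function name `key_terms`, the `n_key` parameter, the document frequencies, and the set of thesaurus phrases are all assumptions made up for illustration.

```python
from typing import Dict, List, Set


def key_terms(query: str,
              doc_freq: Dict[str, int],
              thesaurus_phrases: Set[str],
              n_key: int = 1) -> List[str]:
    """Pick 'key' terms for a query (illustrative sketch only).

    If any multi-word span of the query is a known thesaurus phrase,
    those phrases are 'key'. Otherwise fall back to the n_key least
    frequent single terms.
    """
    terms = query.lower().split()

    # Look for thesaurus phrases, longest spans first.
    phrases = []
    for size in range(len(terms), 1, -1):
        for start in range(len(terms) - size + 1):
            candidate = " ".join(terms[start:start + size])
            if candidate in thesaurus_phrases:
                phrases.append(candidate)
    if phrases:
        return phrases

    # Fallback: rarest terms first (unknown terms count as frequency 0).
    return sorted(terms, key=lambda t: doc_freq.get(t, 0))[:n_key]


# Hypothetical frequencies, chosen only so that "world" is rarer than "war".
freq = {"world": 50, "war": 200}
before = key_terms("world war", freq, set())          # frequency only
after = key_terms("world war", freq, {"world war"})   # with thesaurus phrase
```

With frequency alone, the sketch picks “world”; once the thesaurus knows the phrase, it picks “world war” instead, which mirrors the fix described above.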
You have to look at every little thing and constantly double check. We spend a lot of time looking at query results and asking ourselves: “Does this look good? Is it better than before? Why or why not?” Carefully looking at Solr Explain (to find out why documents are where they are in the results list) helps a lot, too.
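If you haven't used Solr Explain before, the sketch below shows one way to ask for it: adding Solr's `debugQuery=true` parameter to a select request. The base URL, the collection name `docs`, and the helper name `explain_url` are assumptions for illustration, not details of our client's setup.

```python
from urllib.parse import urlencode


def explain_url(base_url: str, collection: str, query: str, rows: int = 10) -> str:
    """Build a Solr select URL that also returns score explanations."""
    params = {
        "q": query,
        "rows": rows,
        "fl": "id,score",      # return only the id and relevance score
        "debugQuery": "true",  # ask Solr to include debug/explain output
        "wt": "json",
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical local Solr instance and collection name.
url = explain_url("http://localhost:8983/solr", "docs", "world war")
```

The JSON response then carries a debug section with an explain entry per returned document, which is what we pore over on the calls.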
Identify the Good, the Bad, and the Ugly through search engine scoring
Finally, search engine scoring has proven very effective at helping us find cases to fix. We can find cases where the old engine was good and the new engine is bad. Then we can individually look at documents and identify relevant documents which are missing or showing much too low in the search results. Sometimes, we look at the documents and think to ourselves, “Eh, the new results are better,” but sometimes we realize something is wrong, or can be made better. This process leads us to developing new query structures and other adjustments for a more intelligent search engine.
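Finding those cases is mostly bookkeeping. Here is a rough Python sketch, assuming you already have a per-query score for each engine (how that score is computed is outside this snippet). The function name `find_regressions`, the `min_drop` threshold, and the sample scores are all invented for illustration.

```python
from typing import Dict, List, Tuple


def find_regressions(old_scores: Dict[str, float],
                     new_scores: Dict[str, float],
                     min_drop: float = 0.1) -> List[Tuple[str, float]]:
    """Return (query, drop) pairs where the new engine scores worse.

    Only queries scored by both engines are compared. Results are
    sorted with the largest drop first, so the worst cases get
    looked at first.
    """
    drops = []
    for query, old in old_scores.items():
        if query not in new_scores:
            continue
        drop = old - new_scores[query]
        if drop >= min_drop:
            drops.append((query, drop))
    return sorted(drops, key=lambda pair: pair[1], reverse=True)


# Made-up per-query scores for two engine configurations.
old = {"world war": 0.9, "hang gliders": 0.5, "solar panels": 0.4}
new = {"world war": 0.3, "hang gliders": 0.6, "solar panels": 0.2}
worst_first = [query for query, _ in find_regressions(old, new)]
```

The list that comes out is a to-do list, not a verdict: each query still needs a human to look at the documents and decide whether the new results really are worse.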
It seems clear: Old methods of search engine accuracy tuning are dead. You must use search engine scoring as a key component of your relevancy tuning – this is the only way to do it. Once you have (the right) log data, this process is so powerful, and so much more fun to go through, for boosting your search engine performance! Getting an incremental improvement in the score is a real accomplishment, backed up by real data (just how we like it). When you get a 0.02 improvement in the score, everyone celebrates!
We've entered the age of enlightenment for search engine relevancy, and it’s a wonderful thing.