Monday, June 16, 2008

How Google Measures Search Quality

This post continues my prior post Are Machine-Learned Models Prone to Catastrophic Errors. You can think of these as a two-post series based on my conversation with Peter Norvig. As that post describes, Google has not cut over to the machine-learned model for ranking search results, preferring a hand-tuned formula. Many of you wrote insightful comments on this topic; here I'll give my take, based on some other insights I gleaned during our conversation.

The heart of the matter is this: how do you measure the quality of search results? One of the essential requirements to train any machine learning model is a a set of observations (in this case, queries and results) that are tagged with "scores" that measure the goodness of the results. (Technically this requirement applies only to so-called "supervised learning" approaches, but those are the ones we are discussing here.) Where to get this data?

Given Google's massive usage, the simplest way to get this data is from real users. Try different ranking models on small percentages of searches, and collect data on how users interacted with the results. For example, how does a new ranking model affect the fraction of users who click on the first result? The second? How many users click to page 2 of results? Once a user clicks out to result page, how long before they click the back button to come back to the search results page?

Peter confirmed that Google does collect such data, and has scads of it stashed away on their clusters. However -- and here's the shocker -- these metrics are not very sensitive to new ranking models! When Google tries new ranking models, these metrics sometimes move, sometimes not, and never by much. In fact Google does not use such real usage data to tune their search ranking algorithm. What they really use is a blast from the past. They employ armies of "raters" who rate search results for randomly selected "panels" of queries using different ranking algorithms. These manual ratings form the gold-standard against which ranking algorithms are measured -- and eventually released into service.

It came as a great surprise to me that Google relies on a small panel of raters rather than harness their massive usage data. But in retrospect, perhaps it is not so surprising. Two forces appear to be at work. The first is that we have all been trained to trust Google and click on the first result no matter what. So ranking models that make slight changes in ranking may not produce significant swings in the measured usage data. The second, more interesting, factor is that users don't know what they're missing.

Let me try to explain the latter point. There are two broad classes of queries search engines deal with:

  • Navigational queries, where the user is looking for a specific uber-authoritative website. e.g., "stanford university". In such cases, the user can very quickly tell the best result from the others -- and it's usually the first result on major search engines.
  • Informational queries, where the user has a broader topic. e.g., "diabetes pregnancy". In this case, there is no single right answer. Suppose there's a really fantastic result on page 4, that provides better information any of the results on the first three pages. Most users will not even know this result exists! Therefore, their usage behavior does not actually provide the best feedback on the rankings.

Such queries are one reason why Google has to employ in-house raters, who have been instructed to look at a wider window than the first 10 results. But even such raters can only look at a restricted window of results. And using such raters also makes the training set much, much smaller than could be gathered from real usage data. This fact might explain Google's reluctance to fully trust a machine-learned model. Even tens of thousands of professionally rated queries might not be sufficient training data to capture the full range of queries that are thrown at a search engine in real usage. So there are probably outliers (i.e., black swans) that might throw a machine-learned model way off.

I'll close with an interesting vignette. A couple of years ago, Yahoo was making great strides in search relevance, while Google apparently was not improving as fast. Recall then that Yahoo trumpeted data showing their results were better than Google's. Well, the Google team was quite amazed, because their data showed just the opposite: their results were better than Yahoo's. They couldn't both be right -- or could they? It turns out that Yahoo's benchmark contained queries drawn from Yahoo search logs, and Google's benchmark likewise contained queries drawn from Google search logs. The Yahoo ranking algorithm performed better on the Yahoo benchmark and the Google algorithm performed better on the Google benchmark.

Two learnings from this story: one, the results depend quite strongly on the test set, which again speaks against machine-learned models. And two, Yahoo and Google users differ quite significantly in the kinds of searches they do. Of course, this was a couple of years ago, and both companies have evolved their ranking algorithms since then.

No comments: