[Categorical search] Add documentation for score normalizer

The forthcoming score normalization algorithm, which is a tweak of adafang@'s
previous work, is complicated enough that it warrants some documentation in
the codebase. This CL adds a .md, and follow-up CLs will implement the
algorithm itself.

Bug: 1199206
Change-Id: I87a77e1579c65422c00d0f908916b56a7e35cb4a
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3189836
Commit-Queue: Tony Yeoman <[email protected]>
Reviewed-by: Rachel Wong <[email protected]>
Cr-Commit-Position: refs/heads/main@{#925647}

`chrome/browser/ui/app_list/search/ranking/score_normalizer.md`

# Score normalizer

The score normalizer is a heuristic algorithm that attempts to map an incoming
stream of numbers drawn from some unknown distribution to a uniform
distribution. This allows us to directly compare scores produced by different
underlying distributions, namely, by different search providers.

## Overview and terminology

The input to the normalizer is a stream of scores, `x`, drawn from an
underlying continuous distribution, `P(x)`. Its goal is to learn a function
`f(x)` such that `f(x ~ P)` is a uniform distribution over `[0,1]`. (In the
ideal case, `f` is the cumulative distribution function of `P`: feeding a
continuous distribution's own samples through its CDF yields a uniform
distribution on `[0,1]`.)
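
For intuition, here is a small self-contained Python check (illustrative only,
not part of the Chromium code) that applying a distribution's CDF to its own
samples gives roughly uniform output; an exponential distribution stands in
for an arbitrary `P`.

```
import math
import random

# Stand-in for P: an exponential distribution with rate 1. Its CDF is
# f(x) = 1 - exp(-x), which is the ideal normalizer for this P.
samples = [random.expovariate(1.0) for _ in range(100000)]
normalized = [1.0 - math.exp(-x) for x in samples]

# The normalized scores should be roughly uniform on [0, 1]: each tenth of
# the interval should hold about 10% of them.
for i in range(10):
  lo, hi = i / 10, (i + 1) / 10
  frac = sum(lo <= y < hi for y in normalized) / len(normalized)
  print(f"[{lo:.1f}, {hi:.1f}): {frac:.3f}")
```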

## Reservoir sampling

Reservoir sampling is a method for computing `f`. Its simplest form is:

- Take the first `N` (e.g. 100) scores as a 'reservoir'.
- Define `f(x) = sum(y < x for y in reservoir) / N`

In other words, we take the first `N` scores as a reservoir, and then
normalize a new score by computing its quantile within the reservoir, as
sketched below.
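
As a concrete sketch of this simplest form (illustrative Python, not the
actual Chromium class; the name `SimpleNormalizer` is made up):

```
class SimpleNormalizer:
  """Normalizes scores against a reservoir of the first N scores."""

  def __init__(self, n=100):
    self.n = n
    self.reservoir = []

  def update(self, score):
    # Only the first N scores are kept; in this simplest form, later scores
    # never change the reservoir.
    if len(self.reservoir) < self.n:
      self.reservoir.append(score)

  def normalize(self, score):
    # f(x) = sum(y < x for y in reservoir) / N: the quantile of the score
    # within the reservoir.
    if not self.reservoir:
      return 0.0
    return sum(y < score for y in self.reservoir) / len(self.reservoir)
```

Feeding every incoming score to `update` and then calling `normalize` on a new
score returns its approximate quantile in `[0,1]`.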

We've constructed `f` by choosing a series of dividers (the reservoir scores)
which we hope are a representative sample of `P`. Consequently, a new `x ~ P`
will fall into any 'bucket' between two dividers with equal probability. So,
the index of the bucket `x` falls into is the quantile of `x` within `P`. Thus,
with some allowance for the fact that `f` is discrete, `f(x ~ P) ~= U(0,1)`.

The drawback of this is that the reservoir needs to be quite large before
this is effective.

## Bin-entropy reservoir sampling

We're going to trade off some accuracy for efficiency, and build on the core
idea of reservoir sampling. We will store 'bins' that cover the real line, and
try to keep the number of elements we've seen in each bin approximately equal.
Then we can compute `f(x)` by using the bin index of `x` as a quantile.

__Data structure.__ We store a vector of `N` 'bins', each one with a count and
a lower divider. The bottom bin always has a lower divider of negative
infinity. `bins[i].count` roughly tracks the number of scores we've seen
between `bins[i].lower_divider` and `bins[i+1].lower_divider`.
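
A minimal sketch of this data structure in Python (`Bin` matches the
pseudocode below; `bin_index` and `normalize` are illustrative helpers, not
the Chromium implementation):

```
from dataclasses import dataclass

@dataclass
class Bin:
  lower_divider: float  # Lower bound of the bin; -inf for the bottom bin.
  count: float          # Approximate number of scores seen in this bin.

def bin_index(bins, score):
  # Index of the bin whose range [bins[i].lower_divider,
  # bins[i+1].lower_divider) contains the score.
  i = 0
  while i + 1 < len(bins) and score >= bins[i + 1].lower_divider:
    i += 1
  return i

def normalize(bins, score):
  # Use the bin index as a coarse quantile estimate in [0, 1].
  return bin_index(bins, score) / (len(bins) - 1) if len(bins) > 1 else 0.0
```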

__Algorithm overview.__ Each time a score is recorded, we increment the count
of the appropriate bin. Then, we propose a repartition of the bins as follows:

- Split the bin corresponding to the new score into two bins, divided by the
  new score. Split the old bin's count evenly between the two new bins.

- Merge together the two smallest contiguous bins, so that the total number of
  bins stays the same.

We accept this repartition if it increases the evenness across all bins, which
we measure using entropy. For simplicity, we prevent the merged bins from
containing either of the split bins.

__Entropy.__ Given a collection of bins, its entropy is:

```
H(p) = -sum( bin.count/total * log(bin.count/total) for bin in bins )
```
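
This translates directly into Python (illustrative; `entropy` takes the list
of bin counts):

```
import math

def entropy(counts):
  # H(p) = -sum(p * log(p)) with p = count / total. Empty bins contribute
  # zero, matching the limit of p * log(p) as p approaches 0.
  total = sum(counts)
  if total <= 0:
    return 0.0
  h = 0.0
  for count in counts:
    if count > 0:
      p = count / total
      h -= p * math.log(p)
  return h

# Even counts give higher entropy than skewed ones:
# entropy([5, 5, 5, 5]) > entropy([17, 1, 1, 1])
```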

Entropy has two key properties:
1. The discrete distribution with maximum entropy is the uniform distribution.
2. Entropy is a concave function of the bin probabilities.

As such, we can make our bins as uniform as possible by seeking to maximize
entropy. This is implemented by only accepting a repartition if it increases
entropy.

__Fast entropy calculations.__ Suppose we are considering a repartition that:
- splits bin `s` into `s_l` and `s_r`,
- merges bins `m_l` and `m_r` into `m`.

We don't need to compute the full entropy over the old and new sets of bins,
because the entropy contributions of bins not related to the split or merge
cancel out. As such, we can calculate this with only six terms as follows,
where `p(bin) = bin.count / total`.

```
H(new) - H(old) = [ -p(s_l) log p(s_l) - p(s_r) log p(s_r) - p(m) log p(m) ]
                - [ -p(s) log p(s) - p(m_l) log p(m_l) - p(m_r) log p(m_r) ]
```
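
A sketch of this shortcut in Python (illustrative; the arguments are the six
bin counts involved, and `total` is the overall count, which the split and
merge leave unchanged):

```
import math

def term(count, total):
  # Entropy contribution -p * log(p) of a single bin, with p = count / total.
  if count <= 0:
    return 0.0
  p = count / total
  return -p * math.log(p)

def entropy_delta(s, s_l, s_r, m, m_l, m_r, total):
  # H(new) - H(old), using only the bins touched by the split and merge;
  # every other bin keeps its count, so its contribution cancels out.
  new_terms = term(s_l, total) + term(s_r, total) + term(m, total)
  old_terms = term(s, total) + term(m_l, total) + term(m_r, total)
  return new_terms - old_terms
```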

__Algorithm pseudocode.__ This leaves us with the following algorithm for
an update.

```
def update(score):
  if there aren't N bins yet:
    insert Bin(lower_divider=score, count=1)
    return
  s = score's bin
  s.count++
  m_l, m_r = smallest contiguous pair of bins separate from s
  m = Bin(lower_divider=m_l.lower_divider, count=m_l.count + m_r.count)
  s_l = Bin(lower_divider=s.lower_divider, count=s.count/2)
  s_r = Bin(lower_divider=score, count=s.count/2)
  if H(new) > H(old):
    replace s with s_l and s_r
    replace m_l and m_r with m
```
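
For concreteness, here is one possible runnable Python rendering of this
pseudocode. It is a sketch under stated assumptions (float counts, a linear
scan to find a score's bin, a warm-up that simply inserts the first `n_bins`
scores), not the Chromium implementation:

```
import math
from dataclasses import dataclass

@dataclass
class Bin:
  lower_divider: float  # -inf for the bottom bin.
  count: float

def _term(count, total):
  # Entropy contribution -p * log(p) of one bin.
  if count <= 0:
    return 0.0
  p = count / total
  return -p * math.log(p)

def update(bins, score, n_bins):
  # Warm-up: build up to n_bins bins, one per observed score.
  if len(bins) < n_bins:
    divider = float("-inf") if not bins else score
    bins.append(Bin(lower_divider=divider, count=1.0))
    bins.sort(key=lambda b: b.lower_divider)
    return

  # s is the bin containing the score; record the observation.
  i = 0
  while i + 1 < len(bins) and score >= bins[i + 1].lower_divider:
    i += 1
  s = bins[i]
  s.count += 1.0

  # Smallest contiguous pair of bins that doesn't involve s.
  candidates = [j for j in range(len(bins) - 1) if j != i and j + 1 != i]
  if not candidates:
    return
  j = min(candidates, key=lambda k: bins[k].count + bins[k + 1].count)
  m_l, m_r = bins[j], bins[j + 1]

  # Proposed repartition: split s at the score, merge m_l and m_r.
  s_l = Bin(lower_divider=s.lower_divider, count=s.count / 2.0)
  s_r = Bin(lower_divider=score, count=s.count / 2.0)
  m = Bin(lower_divider=m_l.lower_divider, count=m_l.count + m_r.count)

  # Accept only if entropy increases. Untouched bins cancel out and the
  # total count is unchanged, so six terms suffice.
  total = sum(b.count for b in bins)
  delta = (_term(s_l.count, total) + _term(s_r.count, total) +
           _term(m.count, total) - _term(s.count, total) -
           _term(m_l.count, total) - _term(m_r.count, total))
  if delta > 0:
    bins[i:i + 1] = [s_l, s_r]
    j = j + 1 if j > i else j  # The split shifted later bins by one.
    bins[j:j + 2] = [m]
```

`bins` starts empty and is updated in place; the earlier `normalize` sketch
can then read a quantile off the bin index.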

## Performance

We did some benchmarking under these conditions:
- Two tests: samples from `Beta(2,5)`, and a power-law distribution.
- The bin-entropy algorithm targeting 5 bins.
- The 'simple' algorithm from above storing `N` samples.
- The 'better-simple' algorithm that stores the last `N` samples and uses
  them to decide the dividers for 5 bins.

Results showed that the bin-entropy algorithm performs about as well as the
better-simple algorithm when `N = 150`, and actually slightly outperforms it
for `N < 150`. This may be due to the split/merge making it less sensitive to
noise. It's difficult to compare the results of the simple algorithm because
the number of bins is different, but its results were significantly noisier
than those of the better-simple algorithm.

## Ideas for improvement

Here are some things that anecdotally improve performance:

- The algorithm loses information when a bin is split, because splitting the
  counts evenly is uninformed. Tracking moments of the distribution within
  each bin can help make this a more informed choice.

- We only perform one split/merge operation per insert, but it's possible
  several split/merges should be done after one insert.

- Allowing the merged bins to contain one of the split bins provides a small
  improvement in performance.