If you've read this blog so far, you're probably thinking "where's the maths?" So, let's get started. Today's paper is The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond.
We all know and love the multi-armed bandit problem, not least because Myna is based on it. The stochastic variant (typically) boils down to estimating a confidence region for the expected reward of each arm, and better algorithms have better ways of estimating this confidence region. KL-UCB is a new algorithm that seems to have a particularly good way of estimating these confidence regions.
Let's set up the problem:

- the game is played over time steps \( t = 1, 2, \ldots, n \)
- at each time \(t\), an action \(A_t \in \{1, \ldots, K \}\) is chosen
- the rewards \( X_t \) are i.i.d. conditional on \( A_t \), with expectation \( \mu_{A_t} \)
- given the history \( (A_1, X_1, A_2, X_2, \ldots, A_{t-1}, X_{t-1}) \), the policy chooses the next action \( A_t \)
- the optimal arm \( a^* \) has expected reward \( \mu_{a^*} \)
- performance is measured by the regret \( R_n \), the difference between the sum of rewards up to horizon \( t = n \) and the reward that could have been accumulated if only the optimal arm had been chosen (written out in symbols below).
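In symbols (my notation, using the usual expected-regret convention rather than quoting the paper directly):

\[ \mathbb{E}[R_n] = n\,\mu_{a^*} - \mathbb{E}\left[ \sum_{t=1}^{n} X_t \right] \]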
The definition of the rewards (i.i.d.) means we're dealing with a stochastic bandit problem (cf. non-stochastic and non-stationary problems). Within this class of bandit problems there are parametric and non-parametric variants. In the parametric case we assume \(X_t\) given \(A_t = a\) belongs to some parametric family \( \{ p_\theta, \theta \in \Theta_a \} \). In this case a lower bound on performance, and an optimal policy, have been known since the dawn of time. Somewhat later, a lower bound was proven on the expected number of times a sub-optimal arm is chosen. Let \(a\) be a sub-optimal arm with reward distribution \(p_{\theta_a}\), \(n\) the time step, and \(N_a(n)\) the number of times arm \(a\) is chosen. Then,
\[ \mathbb{E}[N_a(n)] \geq \left( \frac{1}{\inf_{\theta \in \Theta_a: E[p_\theta] > \mu_{a^*}} KL(p_{\theta_a}, p_\theta)} + o(1) \right) \log(n) \]
It's worth spending some time unpacking this statement. The infimum is, informally, the minimum value of a set. Here the infimum ranges over all parameter settings \(\theta \in \Theta_a\) giving an arm with expected reward greater than the optimal arm's, and it picks out the one whose distribution is as close as possible to our sub-optimal arm. \( KL(p_1, p_2) \) is the Kullback-Leibler divergence between distributions \(p_1\) and \(p_2\). It's not a true metric, but we can think of it as measuring the distance between two distributions in some sense. So the denominator measures the "distance" between this sub-optimal arm and the nearest distribution that would beat the optimal arm. What this expression ends up saying is that we'll play an arm with frequency roughly proportional to its similarity to the optimal arm. Makes sense: obviously bad arms don't get played as much, but arms that look like the optimal arm do.
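To get a feel for the numbers, here is a small sketch in Python (Bernoulli rewards and made-up arm means, so purely illustrative). For Bernoulli arms the infimum works out to be the Bernoulli KL divergence between the sub-optimal arm's mean and the optimal mean, so the coefficient on \(\log(n)\) is just its reciprocal:

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    result = 0.0
    if p > 0:
        result += p * math.log(p / q)
    if p < 1:
        result += (1 - p) * math.log((1 - p) / (1 - q))
    return result

mu_star = 0.5                  # expected reward of the optimal arm (made up)
for mu in (0.45, 0.1):         # a nearly-optimal arm and an obviously bad one
    coefficient = 1.0 / bernoulli_kl(mu, mu_star)
    print(f"arm with mean {mu}: at least ~{coefficient:.0f} * log(n) plays")
```

The arm with mean 0.45 must be played roughly \(200 \log(n)\) times before it can be ruled out, while the arm with mean 0.1 needs fewer than \(3 \log(n)\) plays: hard-to-distinguish arms are exactly the ones a good algorithm is forced to spend time on.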
From this we can easily derive a lower bound on regret:
\[ \liminf_{n \rightarrow \infty} \frac{\mathbb{E}[R_n]}{\log(n)} \geq \sum_{a : \mu_a < \mu_{a^*}} \frac{\mu_{a^*} - \mu_a}{\inf_{\theta \in \Theta_a: E[p_\theta] > \mu_{a^*}} KL(p_{\theta_a}, p_\theta)} \]
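As in the numeric sketch above, for Bernoulli rewards the infimum is attained in the limit \(\theta \rightarrow \mu_{a^*}\), so (if I've specialised it correctly) the bound reduces to

\[ \liminf_{n \rightarrow \infty} \frac{\mathbb{E}[R_n]}{\log(n)} \geq \sum_{a : \mu_a < \mu_{a^*}} \frac{\mu_{a^*} - \mu_a}{d(\mu_a, \mu_{a^*})} \]

where \(d\) is the Bernoulli KL divergence that reappears in the algorithm below.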
In the non-parametric case we only assume the rewards are bounded. KL-UCB is a non-parametric algorithm, yet it still manages to match the lower bound for the parametric case when the rewards are binary, and it can be extended to a large class of parametric reward distributions.
KL-UCB is an upper confidence bound algorithm. As the name suggests, these algorithms compute an upper confidence bound on the reward of each arm, and choose the arm with the highest bound at each time step. This approach is sometimes called optimism in the face of uncertainty, which is a catchy phrase, but we should remember that the phrase was derived from the development of optimal algorithms, not vice versa.
We'll start by introducing the algorithm and then delve into how and why it works. The algorithm is simple to state. We define:
- the number of times arm \(a\) has been chosen is \(N[a]\)
- the total reward arm \(a\) has received is \(S[a]\)
- the (Bernoulli) Kullback-Leibler divergence \(d(p,q) = p \log \frac{p}{q} + (1 - p) \log \frac{1-p}{1-q} \)
- a constant \(c\), which the paper recommends setting to 0

Then:
- Play each arm once to initialise it.
- Once all arms are initialised, at each time step \(t\) we calculate an upper confidence bound for each arm, given by
\[ \max \left\{ q \in [0,1] : N[a] d \left( \frac{S[a]}{N[a]}, q \right) \leq \log(t) + c\log(\log(t)) \right\} \]
and play the arm with the highest upper confidence bound.
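To make these steps concrete, here is a minimal sketch of the whole algorithm, assuming Bernoulli (0/1) rewards. It finds the largest admissible \(q\) by bisection, one convenient approach since \(d(p, q)\) is increasing in \(q\) for \(q \geq p\); the helper names, horizon, and arm means are my own, not the paper's:

```python
import math
import random

def bernoulli_kl(p, q):
    """The divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)   # avoid log(0) at the boundaries
    result = 0.0
    if p > 0:
        result += p * math.log(p / q)
    if p < 1:
        result += (1 - p) * math.log((1 - p) / (1 - q))
    return result

def kl_ucb_index(s, n, t, c=0.0):
    """Largest q in [0, 1] with n * d(s/n, q) <= log(t) + c * log(log(t))."""
    p_hat = s / n
    threshold = math.log(t) + c * math.log(math.log(t))
    lo, hi = p_hat, 1.0
    for _ in range(50):             # bisection: 50 halvings is plenty
        mid = (lo + hi) / 2
        if n * bernoulli_kl(p_hat, mid) > threshold:
            hi = mid
        else:
            lo = mid
    return lo

def kl_ucb(arms, horizon, c=0.0):
    """Run KL-UCB against a list of pull() callables returning rewards in [0, 1]."""
    k = len(arms)
    counts = [0] * k        # N[a]: times arm a has been chosen
    rewards = [0.0] * k     # S[a]: total reward arm a has received
    for a in range(k):      # initialisation: play each arm once
        counts[a] += 1
        rewards[a] += arms[a]()
    for t in range(k + 1, horizon + 1):
        # play the arm with the highest upper confidence bound
        bounds = [kl_ucb_index(rewards[a], counts[a], t, c) for a in range(k)]
        a = max(range(k), key=lambda i: bounds[i])
        counts[a] += 1
        rewards[a] += arms[a]()
    return counts

# Two Bernoulli arms with made-up means 0.5 and 0.45.
arms = [lambda: float(random.random() < 0.5),
        lambda: float(random.random() < 0.45)]
print(kl_ucb(arms, horizon=10_000))   # the first arm should typically dominate
```

Running it a few times, the pulls of the 0.45 arm should grow only logarithmically with the horizon, which is roughly the behaviour the lower bound above describes.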