The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms

By Mohsen Bayati, Nima Hamidi, Ramesh Johari, Khashayar Khosravi

March 2022 | Working Paper No. 4153

We study a Bayesian k-armed bandit problem in the many-armed regime, where k ≥ √T and T is the time horizon. We first show that subsampling is critical for designing optimal policies. Specifically, the standard UCB algorithm is sub-optimal, while a subsampled UCB (SS-UCB), which samples Θ(√T) arms and executes UCB on that subset, is rate-optimal. Despite its theoretically optimal regret, SS-UCB performs worse numerically than a greedy algorithm that pulls the current empirically best arm each time. These empirical insights hold in a contextual setting as well, using simulations on real data. These results suggest a new form of free exploration in the many-armed regime that benefits greedy algorithms. We theoretically show that this source of free exploration is deeply connected to the distribution of a tail event for the prior distribution of arm rewards. This is a fundamentally distinct phenomenon from free exploration due to variation in covariates, as discussed in the recent literature on contextual bandits. Building on this result, we prove that the subsampled greedy algorithm is rate-optimal for Bernoulli bandits in the many-armed regime, and achieves sublinear regret with more general distributions. Taken together, our results suggest that practitioners may benefit from using greedy algorithms in the many-armed regime.
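
To make the two subsampled policies concrete, the following is a minimal sketch for Bernoulli arms, not the authors' implementation. Several details are assumptions made for illustration: the function name `subsampled_bandit`, the subsample size of exactly ⌈√T⌉ arms (the abstract only states Θ(√T)), the initial round that pulls each subsampled arm once, the UCB1-style exploration bonus, and measuring regret against the best of all k arms.

```python
import math
import random


def subsampled_bandit(means, horizon, policy="greedy", seed=0):
    """Simulate a subsampled greedy or UCB policy on Bernoulli arms.

    means   : true success probabilities of the k arms (unknown to the policy)
    horizon : number of pulls T
    policy  : "greedy" or "ucb" (hypothetical labels for this sketch)
    Returns the cumulative expected regret against the best of all k arms.
    """
    if policy not in ("greedy", "ucb"):
        raise ValueError("policy must be 'greedy' or 'ucb'")
    rng = random.Random(seed)
    k = len(means)
    # Assumption: subsample ceil(sqrt(T)) arms, capped at k.
    m = min(k, max(1, math.ceil(math.sqrt(horizon))))
    arms = rng.sample(range(k), m)
    pulls = [0] * m
    successes = [0] * m
    best_mean = max(means)
    regret = 0.0

    for t in range(horizon):
        if t < m:
            # Assumption: initialize by pulling each subsampled arm once.
            i = t
        elif policy == "greedy":
            # Greedy: pull the arm with the highest empirical mean.
            i = max(range(m), key=lambda j: successes[j] / pulls[j])
        else:
            # UCB1-style index: empirical mean plus an exploration bonus.
            i = max(
                range(m),
                key=lambda j: successes[j] / pulls[j]
                + math.sqrt(2.0 * math.log(t + 1) / pulls[j]),
            )
        reward = 1 if rng.random() < means[arms[i]] else 0
        pulls[i] += 1
        successes[i] += reward
        regret += best_mean - means[arms[i]]
    return regret


if __name__ == "__main__":
    # Assumption: arm means drawn from a uniform prior on [0, 1].
    prior_rng = random.Random(42)
    arm_means = [prior_rng.random() for _ in range(1000)]
    for name in ("greedy", "ucb"):
        print(name, subsampled_bandit(arm_means, horizon=10_000, policy=name))
```

Running both policies over many seeds and priors is one way to explore the comparison described above; a single run of this sketch is not evidence for or against the paper's numerical findings.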