Adaptive Sequential Experiments with Unknown Information Flows

By Yonatan Gur, Ahmadreza Momeni
2018, Working Paper No. 3693

An agent facing sequential decisions with partial feedback must strike a balance between maximizing immediate payoffs based on available information and acquiring new information that may be essential for maximizing future payoffs. This trade-off is captured by the multi-armed bandit (MAB) framework, which has been studied and applied under a strong assumption on the information collection process: at each time epoch, a single observation is collected on the action selected at that epoch. In many practical settings, however, additional information may become available between pulls. This information can be relevant to actions that were not selected recently and may be essential for achieving good performance. We introduce a generalized MAB formulation that relaxes this assumption and in which auxiliary information on each arm may arrive arbitrarily over time. By obtaining matching lower and upper bounds, we characterize the (regret) complexity of this family of MAB problems as a function of the information flows, and we study how salient characteristics of the information flows impact policy design and achievable performance. We introduce a broad adaptive exploration approach for designing policies that, without any prior knowledge of the information arrival process, attain the best performance (in terms of regret rate) achievable when that process is known a priori. Our approach adjusts MAB policies designed to perform well in the absence of auxiliary information, using dynamically customized virtual time indexes to endogenously control the exploration rate of the policy. We demonstrate the effectiveness of the adaptive exploration approach by establishing performance bounds and by numerically evaluating the performance of well-known MAB policies adjusted in this manner.
Our study demonstrates how decision-making policies designed to perform well with very little information can be adjusted to also guarantee optimality in more information-abundant settings.
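To make the virtual-time-index idea concrete, the following is a minimal illustrative sketch, not the authors' algorithm: an epsilon_t-greedy policy whose exploration rate decays with a virtual clock, where each auxiliary observation both updates the arm's reward estimate and advances the clock, so exploration shrinks as information accumulates. All names (`AdaptiveEpsilonGreedy`, `absorb_auxiliary`, the constant `c`) are hypothetical.

```python
import random


class AdaptiveEpsilonGreedy:
    """Illustrative sketch (not the paper's exact policy): an
    epsilon_t-greedy bandit whose exploration rate is driven by a
    virtual time index that also advances on auxiliary information
    arrivals, endogenously reducing exploration."""

    def __init__(self, n_arms, c=1.0):
        self.n_arms = n_arms
        self.c = c                    # exploration-rate constant (hypothetical tuning knob)
        self.counts = [0] * n_arms    # observations per arm (pulls + auxiliary)
        self.sums = [0.0] * n_arms    # cumulative rewards per arm
        self.virtual_t = 1            # virtual time index

    def record(self, arm, reward):
        """Record one reward observation on an arm."""
        self.counts[arm] += 1
        self.sums[arm] += reward

    def absorb_auxiliary(self, arm, reward):
        """Auxiliary information arrival between pulls: update the arm's
        estimate and advance the virtual clock, lowering future epsilon."""
        self.record(arm, reward)
        self.virtual_t += 1

    def select(self):
        """Pick an arm: explore with probability epsilon_t, else exploit
        the current empirical-mean leader."""
        self.virtual_t += 1
        eps = min(1.0, self.c * self.n_arms / self.virtual_t)
        if random.random() < eps or 0 in self.counts:
            return random.randrange(self.n_arms)
        means = [s / n for s, n in zip(self.sums, self.counts)]
        return means.index(max(means))
```

In this sketch, a burst of auxiliary observations on some arm inflates `virtual_t`, so the policy behaves as if more time had elapsed and explores less, which is the intuition the abstract describes for adapting to unknown information flows.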

Keywords
Data-driven decisions, learning, sequential optimization