Technology & AI

Experiments Inspired by Slot Machines Promise Bigger Research Payoffs

“Multi-armed bandits” can reduce some of the uncertainty and inefficiency of A/B testing.

May 01, 2024

by Katie Gilbert

The concept for bandit experiments comes from the image of a gambler facing a row of slot machines, choosing which “one-armed bandit” to play. | Reuters/Peter Cziborra

As researchers at Stanford Graduate School of Business have pushed past the limitations of A/B testing, another area of experimentation has focused on “multi-armed bandits.” Mohsen Bayati, a professor of operations, information, and technology who has been exploring these problems for the past 15 years, explains the basics of this approach and how it can pay off.

What are multi-armed bandits?

Mohsen Bayati: You can think of multi-armed bandits as a class of decision-making scenarios individuals or algorithms face when choosing between multiple options with uncertain outcomes. The name comes from the metaphor of a gambler facing a row of slot machines and choosing which arms to pull in order to maximize their total payout. They don’t know which arm is best in advance, so they need to experiment. Once they figure out which arm is a good one, they want to stick with it.

Now, in the context of experiment design — in marketing, healthcare, website optimization, et cetera — this offers a powerful framework for dynamically allocating resources among competing strategies.

Can you give an example of how this might work?

Bayati: Let’s say that the competing strategies are two designs for a website. Think of these two designs as two arms: one of them performs better when you integrate it into an organization’s workflow — but we don’t know in advance which it is. Let’s say we decide to run an experiment for two weeks. Traditionally, A/B testing solves this by randomly splitting the population into two groups; half of the users are assigned to design A, and the other half to design B. After the experiment ends, we’d pick the better design based on the data we collected.

But this approach can be inefficient because it commits resources to the less-effective option for the whole duration of the test. One of the two designs is necessarily less effective, and for the full two weeks of the experiment, half of my users are assigned to it. That represents a cost: a user-experience cost, a revenue cost, and other costs we collectively refer to as the “opportunity cost” of experimenting.

In contrast, a multi-armed bandit approach adjusts the allocation of resources to these two different options, A and B, over time. The idea is that we are going to potentially benefit early in the experiment by expanding the use of the option that seems to be better — without giving up on the other option too early because we don’t want to compromise the experiment’s quality. It’s adaptive allocation rather than an experiment with fixed decisions at the outset. In fact, another term commonly used for multi-armed bandits is “adaptive experimentation.”
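The adaptive allocation described above can be sketched with Thompson sampling, one standard bandit algorithm (the interview does not name a specific algorithm, and the conversion rates below are made up for illustration). Each design keeps a Beta posterior over its unknown conversion rate, and each arriving user is routed to whichever design looks best under a fresh posterior draw:

```python
import random

def thompson_sampling(true_rates, n_users, seed=0):
    """Adaptively allocate users to design A or B via Thompson sampling.

    Each arm keeps a Beta(successes + 1, failures + 1) posterior over its
    conversion rate; each new user is sent to the arm whose posterior
    sample is highest. true_rates are the conversion probabilities
    (unknown to the algorithm) used only to simulate user responses.
    """
    rng = random.Random(seed)
    successes = [0, 0]
    failures = [0, 0]
    for _ in range(n_users):
        # Draw one sample from each arm's posterior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(2)]
        arm = samples.index(max(samples))
        # Simulate the user's response to the chosen design.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    pulls = [successes[i] + failures[i] for i in range(2)]
    return pulls, successes

pulls, successes = thompson_sampling([0.05, 0.08], n_users=5000)
print("users per design:", pulls)
print("conversions per design:", successes)
```

Because the posterior for the weaker design quickly concentrates on lower values, traffic typically drifts toward the stronger design during the experiment, rather than only after it ends; that drift is the reduced opportunity cost.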

When did this experimentation design first emerge?

Bayati: This has a long history. Multi-armed bandits trace back to the 1930s, but the main research started in the 1970s and ’80s. The main motivation then was clinical-trial types of experiments. You can see that there is an opportunity cost for half of the patients in a traditional clinical drug trial, who are potentially getting an inferior version of a drug treatment. So there was this concern: what about these patients? Can we do anything for them?

How did you first start working with bandits?

Bayati: I do a lot of research in healthcare, and my motivation was to reduce the opportunity cost of experiments. I helped build a machine-learning algorithm to address a common issue: a hospital wants to figure out how best to assist patients when they are discharged. Patient education is one intervention that’s needed. It could take the form of a one-hour education session, or of giving patients a tablet to take home so they can make video calls with nurses. Another intervention could be a health app that makes a nurse available at the press of a button.

“In my work with multi-armed bandits I’m trying to minimize randomization as much as possible.”

A hospital has these multiple interventions, and they want to figure out which one is best for patients. I was dealing with this question about 10 years ago and said, “Let’s just run an experiment.” Immediately, we ran into challenges getting approval because in the healthcare setting, the opportunity cost needs to be vetted: Is this going to cause any harm? The bureaucratic process was making it nearly impossible.

That was the moment that got me thinking about bandit problems. Bandits could mean I could experiment less. I wondered, “Is there any way that I can still run my experiment and at the same time reduce the opportunity cost to near zero?” The answer is yes — when the patient population is diverse enough.

Before the experiment began, I had already collected a small amount of data about the allocated interventions and the patient population. Then, early in the experiment, when a new patient arrived, I already knew a little bit about which intervention was better for which patient. For example, if it was a less tech-savvy patient, then maybe the in-person education session would be better for them. Because it was such a diverse patient population, when I kept repeating this, I ended up naturally experimenting with different interventions on different types of patients, and that meant that over time, I improved my estimates of the patient-specific benefits of each one. And then, over the course of my experiment, I had not only given everyone what was best for them based on my information up to that point, I was also able to update my information about what was best for everyone.
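This patient-aware allocation is what the bandit literature calls a contextual bandit. A minimal epsilon-greedy sketch is below; the two patient types, the intervention names, and the success rates are all illustrative assumptions, not figures from the study:

```python
import random

def contextual_bandit(n_patients=4000, eps=0.1, seed=1):
    """Toy contextual bandit for the hospital-discharge example.

    Two hypothetical patient types and three interventions; the success
    probabilities are invented for illustration. Estimates are kept
    separately per patient type, so each type is steered toward the
    intervention that works best for patients like them.
    """
    rng = random.Random(seed)
    interventions = ["education session", "take-home tablet", "nurse app"]
    # Hypothetical success probability of each intervention per type.
    true_rates = {"tech-savvy": [0.45, 0.70, 0.75],
                  "less tech-savvy": [0.70, 0.45, 0.40]}
    wins = {c: [0, 0, 0] for c in true_rates}
    tries = {c: [0, 0, 0] for c in true_rates}
    for _ in range(n_patients):
        ctx = rng.choice(sorted(true_rates))   # a new patient arrives
        if rng.random() < eps:
            arm = rng.randrange(3)             # occasional exploration
        else:
            # Exploit: pick the best-looking arm FOR THIS PATIENT TYPE;
            # untried arms get an optimistic estimate so they get tried.
            est = [wins[ctx][a] / tries[ctx][a] if tries[ctx][a] else 1.0
                   for a in range(3)]
            arm = est.index(max(est))
        tries[ctx][arm] += 1
        wins[ctx][arm] += rng.random() < true_rates[ctx][arm]
    # Report which intervention each patient type was mostly steered to.
    return {c: interventions[tries[c].index(max(tries[c]))] for c in tries}

print(contextual_bandit())
```

Because the population contains both types, the algorithm naturally experiments across interventions early on, then routes most patients of each type to the option its estimates favor for them, which is how the opportunity cost can approach zero when the population is diverse enough.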

In other words, in my work with multi-armed bandits I’m trying to minimize randomization as much as possible — and this is all motivated by the healthcare setting.

Does the bandit design come with any drawbacks?

Bayati: Yes. It might seem obvious that this design is better, so why isn’t everybody using it? The number-one challenge with these techniques is the absence of a rigorous, statistically grounded framework for decision-making. Because classical randomized experiments have a very clean setup, you can do rigorous mathematical analysis once the experiment ends. You can generate careful statistics like p-values and type I and type II error rates (these tell you precisely how accurate the conclusions of the experiment are). But bandit experiments are more complex, and the statistics and mathematics that justify their decision-making are still not fully developed.
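One way to see the statistical difficulty: on adaptively collected data, even the ordinary sample mean can be a biased estimator, because an arm that gets unlucky early is pulled less afterward and its low estimate gets frozen in. The simulation below (an illustrative setup, not from the interview) runs an epsilon-greedy experiment on two arms with the same true success rate and averages the naive estimation error for one arm over many replications:

```python
import random

def replicate_bias(n_reps=500, horizon=300, eps=0.1, seed=2):
    """Average the naive sample-mean error for arm 0 across many
    epsilon-greedy experiments where BOTH arms truly succeed with
    probability 0.5. Under a fixed 50/50 split this estimator would be
    unbiased; under adaptive allocation it tends to be biased, which is
    why classical p-value machinery cannot be applied as-is.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_reps):
        wins, tries = [0, 0], [0, 0]
        for t in range(horizon):
            if t < 2:                      # try each arm once to start
                arm = t
            elif rng.random() < eps:
                arm = rng.randrange(2)     # explore
            else:                          # exploit the better estimate
                est = [wins[a] / tries[a] for a in range(2)]
                arm = est.index(max(est))
            tries[arm] += 1
            wins[arm] += rng.random() < 0.5
        errors.append(wins[0] / tries[0] - 0.5)  # estimate minus truth
    return sum(errors) / len(errors)

print("average estimation error for arm 0:", replicate_bias())
```

The average error tends to come out negative here, reflecting the known downward bias of sample means on adaptively collected data; correcting inference for this kind of distortion is exactly the part of the theory that is still being developed.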

