Technology & AI

A/B Testing Gets an Upgrade for the Digital Age

GSB researchers are pioneering experimentation methods to keep up with technology and help solve real-world challenges.

May 01, 2024

by Katie Gilbert

The increasing complexity of online platforms has revealed split testing’s limitations. | Fien Jorissen and Franz Lang

Which email subject line is more likely to persuade potential customers to open it: the one with the exclamation point or the one with the emoji? Which landing page drives more click-throughs to a company’s website: the one with more text or the one with less? The answers to these routine quandaries may not be immediately obvious. Yet there’s a simple solution: A/B testing.

The A/B testing model has helped shape the online world as we know it — and the way marketing, website design, and all kinds of user experiences work within it. “These types of experiments are the bread and butter of most tech companies; this is how pretty much every feature is vetted to decide whether to launch it or not platform-wide,” says Gabriel Weintraub, a professor of operations, information, and technology (OIT) at Stanford Graduate School of Business. Whenever you go online, you’re likely becoming an unwitting participant in an A/B test, as designers, engineers, and marketers throw different scenarios at you to see what most effectively persuades you to click, buy, or stream.

The concept behind this experimental design, also known as split testing, is straightforward: If you’re trying to hone your email’s subject line, for example, you randomly split your recipients into two groups. Group A receives the email with the exclamation point in the subject line, and Group B gets the one with the emoji. Compare the two groups’ average open rates, and there you have it: the subject line that gets more opens.
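The mechanics of that comparison can be sketched in a few lines of code. This is a minimal simulation with invented open rates, not data from any real campaign:

```python
import random

random.seed(0)

# Hypothetical campaign: subject line A (exclamation point) opens 18% of
# the time, subject line B (emoji) 22%. Both rates are invented.
TRUE_RATE_A, TRUE_RATE_B = 0.18, 0.22

# Randomly split 10,000 recipients into two equal groups.
recipients = list(range(10_000))
random.shuffle(recipients)
group_a, group_b = recipients[:5_000], recipients[5_000:]

# Simulate whether each recipient opens the email, then compare open rates.
opens_a = sum(random.random() < TRUE_RATE_A for _ in group_a)
opens_b = sum(random.random() < TRUE_RATE_B for _ in group_b)
rate_a, rate_b = opens_a / len(group_a), opens_b / len(group_b)

print(f"A: {rate_a:.3f}  B: {rate_b:.3f}  winner: {'A' if rate_a > rate_b else 'B'}")
```

The random split is what makes the comparison causal: because neither group differs systematically from the other, any gap in open rates can be attributed to the subject line itself.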

“There’s been quite a lot of A/B testing going on,” says Guido Imbens, a professor of economics at Stanford GSB. That’s an understatement: Google, Microsoft, and other tech giants reportedly run more than 10,000 A/B tests apiece annually. Mountains of research and anecdotes emphasize the importance and effectiveness of A/B testing for marketing, advertising, and user experience, “conveying the message that really, experimentation is very easy,” Imbens says.

However, the increasing complexity of online platforms has revealed A/B testing’s limitations. A raft of research by GSB faculty members — many in collaboration with each other — is looking beyond traditional split testing and pushing the boundaries of what’s possible in experimental design and analysis both online and off. “There’s a very rich set of problems that call for more complex experiments where we don’t actually know what the optimal thing to do is,” says Imbens, who heads the Stanford Causal Science Center and shared the 2021 Nobel Prize in economics for his work on experiment design and causality. “We should all be aware that there’s much more you can do beyond the standard experiments.”

From Plots to Platforms

Although A/B testing has thrived in the internet age as a tool for bringing clarity to decision-making, it predates computers by several decades. An A/B test is another term for a simple randomized controlled trial, or RCT, a concept codified by statistician and geneticist R. A. Fisher in his 1925 book Statistical Methods for Research Workers. Many of Fisher’s experiments focused on agriculture: He randomly allocated fertilizers throughout farm plots to see which one yielded the healthiest crops. At the time, the idea that an experimental treatment should be randomized — rather than managed as transparently and tightly as possible — was revolutionary.



RCTs quickly took hold in biomedical settings, where they became the go-to design for experiments testing the effectiveness of drugs. In such a trial, a group of subjects is randomly divided into two subgroups; one receives the drug (the treatment group) while the other (the control group) receives a placebo. None of the subjects know which group they’re assigned to. Then, the outcomes within both groups are observed, averaged, and compared.
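The arithmetic behind such a trial is a difference in group means, plus a standard error to gauge whether the difference is real. A minimal sketch, with assumed improvement rates standing in for real trial data:

```python
import math
import random

random.seed(1)

# Hypothetical drug trial: outcome is 1 if the patient improves, 0 otherwise.
# Assumed improvement probabilities: 0.30 on placebo, 0.45 on the drug.
patients = list(range(2_000))
random.shuffle(patients)                       # random assignment
treatment, control = patients[:1_000], patients[1_000:]

y_t = [1 if random.random() < 0.45 else 0 for _ in treatment]
y_c = [1 if random.random() < 0.30 else 0 for _ in control]

p_t, p_c = sum(y_t) / len(y_t), sum(y_c) / len(y_c)
ate = p_t - p_c                                # average treatment effect

# Standard error of the difference in proportions, for a 95% interval.
se = math.sqrt(p_t * (1 - p_t) / len(y_t) + p_c * (1 - p_c) / len(y_c))
print(f"Estimated effect: {ate:.3f} ± {1.96 * se:.3f}")
```

Note that this estimate is an average across everyone in the trial, a limitation the researchers profiled below return to.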

By the early 2000s, RCTs had proved essential for drawing eyeballs and driving engagement online. Google ran its first A/B test in 2000 to figure out the optimal number of search results to show its users. By the time Susan Athey, PhD ’95, a professor of economics at Stanford GSB, became Microsoft’s chief economist in 2008, the engineers behind the firm’s Bing search engine were running thousands of A/B tests each year to guide decisions about, for example, which results should appear at the top of the page.

Yet, as A/B testing became ubiquitous, it became clear that it had to evolve to keep up with the intricacies of the applications it was evaluating. For example, Bing’s experiments were focused on how short-term changes affected users. Athey noticed that this user-focused experimentation was not well suited to studying advertisers, the main source of revenue for Microsoft’s search business. “Understanding the advertiser side of the market — how to model their behavior, how long it took for them to respond to changes — these challenges were really acute,” she says.

Hitting the Jackpot

Another area of GSB research that’s pushing past the limitations of A/B testing has focused on “multi-armed bandits.” Professor Mohsen Bayati explains the basics of this approach.

Admittedly, advertisers were much more difficult to experiment on. As a sample, they were diverse to the point of being unwieldy: Some were multibillion-dollar companies with teams working to optimize every pixel, while others were small businesses without the resources to obsess over their ad buys. What’s more, they were competing with each other. Over the rest of her time at Microsoft, Athey put together a set of ideas for addressing this and other obstacles to traditional randomized experiments. After leaving the company in 2013, she began collaborating with a number of coauthors, including GSB colleagues, to formalize the mathematics and theory around some of these new concepts, and to come up with novel ways to run ever more sophisticated experiments.

Not as Easy as A/B

One of the thorniest problems to emerge across all sorts of platforms is interference: When you run an experiment on, or “treat,” one group of users on an online platform, it’s likely to affect the untreated users, too.

Consider the example of a ride-sharing app: If its engineers want to test a policy that would give drivers higher tips, the A/B testing model would dictate that the change is applied to some drivers and not others. Yet if it turns out that, during the experiment, the new policy makes driving more lucrative and encourages the treated drivers to spend more time on the road, that will affect the untreated drivers, who suddenly face more competition in finding passengers. At this point, the experiment cannot accurately discern what would happen if the new tipping policy were applied to all drivers.
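The bias this creates shows up even in a deliberately simple model. The sketch below assumes a fixed pool of ride requests divided among drivers in proportion to their hours on the road; every number in it is hypothetical:

```python
# Toy model of the ride-sharing example; all numbers are hypothetical.
# A fixed pool of ride requests is split among drivers in proportion to
# their hours on the road; the new tipping policy raises treated drivers' hours.
TOTAL_RIDES = 10_000
BASE_HOURS, TREATED_HOURS = 1.0, 1.3   # treated drivers drive 30% more

def rides_per_driver(hours_by_driver):
    total_hours = sum(hours_by_driver)
    return [TOTAL_RIDES * h / total_hours for h in hours_by_driver]

n = 1_000
# A/B test: half the drivers get the new policy, half don't.
ab = rides_per_driver([TREATED_HOURS] * (n // 2) + [BASE_HOURS] * (n // 2))
ab_effect = sum(ab[: n // 2]) / (n // 2) - sum(ab[n // 2 :]) / (n // 2)

# What the platform actually wants to know: full rollout vs. no rollout.
all_on = rides_per_driver([TREATED_HOURS] * n)
all_off = rides_per_driver([BASE_HOURS] * n)
true_effect = sum(all_on) / n - sum(all_off) / n

print(f"A/B estimate of extra rides per treated driver: {ab_effect:.2f}")
print(f"True effect of rolling out to everyone:         {true_effect:.2f}")
```

Because total demand is fixed, rolling the policy out to everyone adds no rides at all, yet the A/B comparison reports a large per-driver gain: the treated drivers are simply taking rides from the control drivers, which is precisely the interference effect described above.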


“It’s so important to understand how biases like interference are affecting experimental results and decision-making,” says Weintraub, who encountered this problem while advising Airbnb on solving market design puzzles, one of his areas of expertise.

Typically, he explains, market designers are seeking a specific objective — say, maximizing bookings. In experimenting with the best ways to do so, a platform has a dizzying array of levers it can control, such as tweaking fees or sharing more or less information about properties. What’s more, companies like Airbnb are two-sided platforms, enabling sellers and customers to interact directly to make deals. And that means there are two groups of users that can be observed making decisions.



Weintraub explains that the canonical online A/B tests run by two-sided marketplaces must choose between randomizing listings and randomizing customers. However, when platforms ran experiments by, say, randomly including better photos for some listings, the treated pages “cannibalized” demand from the control group. This type of interference effect muddied the experimental results, Weintraub says. “That violates a key assumption in an A/B test: It’s assumed that the assignment of one unit to treatment or control doesn’t affect the outcome of any other unit.”

Experimenters also noticed interference on the customer side: For example, when they randomly assigned some customers to a group that saw cheaper prices, that exposed the control group to more competition for listings — because subjects in that group couldn’t select properties that the treated group had snatched up.


In the paper that emerged from this puzzle, Weintraub and his colleagues present a model to help experimenters determine which side of the marketplace to randomize to minimize interference and bias. And — crucially — they add that if supply and demand are mostly balanced, the answer is to randomize both listings and customers at the same time, using a novel experimental design they call “two-sided randomization.” The approach doesn’t eliminate competition between the treatment and control groups, but it allows its effects to be approximately observed and factored into the results.
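The core assignment rule can be sketched in a few lines. This is a toy illustration of the two-sided idea under assumed treatment probabilities, not the authors' actual design or estimator: the new feature is active only when both the customer and the listing are in their treatment groups, which splits interactions into four observable cells:

```python
import random

random.seed(3)

# Randomize both sides of the marketplace independently (50/50 here is an
# assumption; the paper studies how to choose these proportions).
listings = {l: random.random() < 0.5 for l in range(200)}    # listing treated?
customers = {c: random.random() < 0.5 for c in range(500)}   # customer treated?

def is_treated(customer, listing):
    # The feature is on only for treated-customer / treated-listing pairs.
    return customers[customer] and listings[listing]

# Every customer-listing interaction falls into one of four cells, so
# spillovers between cells can be observed rather than ignored.
cells = {(True, True): 0, (True, False): 0, (False, True): 0, (False, False): 0}
treated_interactions = 0
for c in customers:
    for l in random.sample(list(listings), 10):   # each customer views 10 listings
        cells[(customers[c], listings[l])] += 1
        treated_interactions += is_treated(c, l)

for (cust_t, list_t), count in cells.items():
    print(f"customer treated={cust_t!s:5}  listing treated={list_t!s:5}: {count}")
```

Comparing outcomes across the four cells is what lets the competition between treatment and control be approximately measured instead of silently biasing the estimate.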

What Weintraub didn’t know at the time was that, just down the hall, Imbens — who had been working as a consultant for Amazon — was independently percolating a similar idea for the same types of online platforms. In their paper, Imbens and his coauthors refer to these experimental structures as “multiple randomization designs.” But the idea is the same. “I remember a hallway conversation with Guido,” Weintraub says, “where we realized we totally independently and simultaneously came out with this multiple-side randomization idea.”

Imbens emphasizes that these new randomization designs could prove useful beyond the digital marketplace. He points to experiments in development economics that seek to track the spread of health education, for example. In these contexts, interference can confound results because of the difficulty of maintaining a control group. Imbens hopes this new type of experiment can be part of the solution.

The Other Side of the Equation

Much of the collaboration at Stanford GSB around new types of experimentation has been far from accidental — in fact, OIT professors Kuang Xu and Stefan Wager say that working together has been essential. Wager is a statistician who focuses on the intersection of causal inference, optimization, and statistical learning. Xu is an operations researcher and a probabilist who uses stochastic modeling to capture the dynamics of real-world applications where information is scarce. Both say that building bridges between their disciplines has been imperative for tackling the types of problems they’re trying to solve.

Wager says this became obvious during the height of the pandemic. “During the lockdown, I felt like I was able to work on existing projects — but I didn’t have any new ideas,” he recalls. “So Kuang and I started going on semiregular ‘research hikes.’ And actually one of the recent papers with Kuang that’s bridging ideas between engineering and statistics started this way.”


When Xu and Wager consider how to improve experimental methods, they’re looking at a different side of experimentation design. They’re focused on ways to crunch the data gathered in experiments to yield clearer insights. “Drawing insight from data obviously has two elements: How do you collect data, and how do you analyze the data that you collect?” Xu says. “When you innovate and try to crack new problems, you can attack both — or either, potentially.” After all, he says, “changing the way you run experiments can be difficult — so, maybe you collect data the way you used to, but you analyze them very differently.” Xu and Wager have collaborated on topics including experimental interference caused by congestion in online marketplaces.

Another fertile area of collaboration on this side of the equation is the difficulty of identifying the types of individuals who benefit (or not) from an experimental treatment. “A company or a scientist doesn’t just want to know that, on average, a given treatment worked,” says Athey, the director of the Golub Capital Social Impact Lab. “It’s really important to know if it helped some people and hurt other people. And if you can analyze that, then you can give the treatment to the people it helped and not give it to the people it hurt.” In 2016, she and Imbens introduced a data-driven method for grouping individuals who experience different “treatment effects.” In the process, they laid some of the foundations for connecting traditional machine learning, which focused on prediction, with the challenge of estimating the outcomes of randomized experiments.
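The problem can be illustrated with a toy dataset in which the true effect differs by subgroup. All numbers here are invented, and this is not the Athey-Imbens method itself, only the motivation for it:

```python
import random

random.seed(4)

# Hypothetical treatment that helps one subgroup and hurts another by the
# same amount, so the average effect across everyone is near zero.
def outcome(group, treated):
    effect = 0.2 if group == "young" else -0.2   # assumed true effects
    return 0.5 + (effect if treated else 0.0) + random.gauss(0, 0.05)

data = []
for _ in range(4_000):
    group = random.choice(["young", "old"])
    treated = random.random() < 0.5              # randomized assignment
    data.append((group, treated, outcome(group, treated)))

def avg_effect(rows):
    t = [y for _, tr, y in rows if tr]
    c = [y for _, tr, y in rows if not tr]
    return sum(t) / len(t) - sum(c) / len(c)

print(f"Average effect overall: {avg_effect(data):+.3f}")
for g in ["young", "old"]:
    print(f"Effect for {g!r} subgroup: {avg_effect([r for r in data if r[0] == g]):+.3f}")
```

The overall average lands near zero and masks a real benefit for one subgroup and a real harm for the other. Here the subgroups are known in advance; the contribution of methods like those Athey and Imbens introduced is to discover such groups from the data itself.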

Best of Both Worlds

Combining experiments with machine learning can help policymakers reach the right people. Professors Susan Athey and Jann Spiess explain how.

Around that time, Wager — then “a star PhD student in the Stanford statistics department,” in Athey’s words — was getting interested in this area. Working with Athey, he developed a more flexible approach to understanding variations in treatment effects, proving theoretical results about algorithms known as random forests that had proved elusive for several decades. Their papers on “causal forests,” published in 2018 and 2019, are some of the most cited statistics papers of the past few years. Their methods have been widely adopted in academia and industry, including by technology companies such as Airbnb and Uber.

Wager presented new applications for this research in a recent paper that examines the impact of psychiatric hospitalization. With his coauthors, Wager examined five years of data from the U.S. Department of Veterans Affairs about more than 100,000 vets who had arrived at emergency departments because of suicidal ideation or suicide attempts. Focusing on those patients who were subsequently hospitalized for psychiatric treatment, the researchers sought to determine how effective a hospital stay was at preventing suicide attempts over the following year.

However, averaging their findings across the whole group would have obscured those vets who experienced increased suicidality after hospitalization. Instead, the results were broken down into granular subgroups based on factors like psychiatric diagnosis, past medical history, and family situations.

“We showed that you can reliably find groups of patients who benefit from hospitalization and others who seem to be hurt by hospitalization,” Wager says. Using machine learning tools to help synthesize the results, his team found that an individualized approach to treatment could reduce future suicide attempts in the 12 months following a hospital visit by 16% and hospitalization by 13%. “In order to do this, we really had to go beyond the kind of classical causal inference methods where you just look at whether a treatment works on average for everyone and identify a few subgroups,” Wager says.

He’s hopeful about the potential of moving beyond a one-size-fits-all approach toward more personalized outcomes. “We see this paper as an early proof-of-concept, showing that something could be done, and we’re hoping we’ll be able to work with the VA to actually build a tool that they could use. That’s the end goal of this.”

A Community of Collaborators



All of these researchers agree that Stanford is flourishing as a hub of research on experimental design and analysis methods. The campus’s proximity to Silicon Valley is part of the reason, Imbens points out. “We get a lot of exposure to the kinds of questions the tech companies have and the kinds of problems they’re wrestling with,” he notes.

Yet, as his colleagues’ research on psychiatric patients and college students has demonstrated, this research agenda has applications far beyond streamlining apps and platforms. “The sweet spot is to find research that’s relevant to the tech companies — while realizing that, actually, these problems are much more general, and what we’re doing has relevance for other contexts as well,” Imbens says.

Another result of the increased collaboration in these areas is the dissolution of walls between separate disciplines, which Athey emphasizes is a win. “These three distinct fields — statistics and econometrics and machine learning — weren’t really talking to each other that much,” she says. Within this prolific group of researchers, however, those fields are now in close conversation.

“You might think, ‘How can each of these people be the pioneer in the same thing, experimentation design and analysis?’” Athey says. “But Stanford is the pioneer. We built a group of people excited about these problems — and so it’s not an accident that we’re all here.”

