A/B Testing 101: The Power of Experimentation
The $100M Experiment: How A/B Testing Transformed Bing’s Revenue
"One accurate measurement is worth more than a thousand expert opinions."
– Admiral Grace Hopper
In the digital age, where data reigns supreme, the ability to measure the impact of changes accurately is crucial. The famous quote by Admiral Grace Hopper encapsulates the essence of why businesses like Microsoft's Bing have turned to A/B testing to drive innovation and profitability.
The Bing Case Study
In 2012, a small yet transformative idea within Microsoft’s Bing team provided a striking example of the power of data-driven experimentation. A simple suggestion to change ad headlines was overlooked for months, buried under bigger projects. When finally implemented, this idea became Bing’s most significant revenue-generating change, highlighting the profound impact of rigorous online controlled experiments (A/B tests).
The proposed change involved extending ad title lines by combining them with text from the line below, creating a more engaging headline. Initially deemed low priority, it wasn’t until a developer implemented it—due to its coding simplicity—that the idea was tested on real users. A/B testing split users between the existing and modified layouts, tracking interactions, clicks, and revenue metrics.
Within hours of launch, an anomaly emerged: revenue surged unexpectedly. Teams braced for a potential bug, but upon verification, the results proved legitimate. The new ad layout increased Bing’s revenue by 12%, equating to over $100 million annually in the U.S. alone. The result underscored several key lessons:
Value Assessment: Even seemingly minor ideas can have significant impacts, yet their value is often underestimated or overlooked.
Impact of Small Changes: A small alteration can lead to massive financial returns if it aligns well with user behavior and expectations.
Rarity of Big Wins: Not every change will yield such dramatic results; this was one of the few blockbuster successes from thousands of experiments conducted annually.
Efficiency in Experimentation: The overhead of running an experiment must be small. Bing’s engineers had access to ExP, Microsoft’s experimentation system, which made it easy to scientifically evaluate the idea.
What is A/B Testing?
A/B testing, also known as split testing, is a method of comparing two versions of a webpage, email, or other marketing asset to determine which one performs better. By splitting your audience into two groups and showing each group a different version (Version A vs. Version B), you can measure which variation drives more conversions, clicks, or other desired outcomes.
Companies like Airbnb, Amazon, Booking.com, eBay, Facebook, Google, LinkedIn, Lyft, Microsoft, and Netflix rely heavily on online controlled experiments (Gupta et al., 2019). These organizations conduct thousands to tens of thousands of experiments annually, often involving millions of users. They test a wide range of elements, such as user interface (UI) changes, relevance algorithms (including search, ads, personalization, and recommendations), latency and performance, content management systems, customer support systems, and more. Experiments span various channels, including websites, desktop applications, mobile apps, and email.
For example, if you’re testing a landing page, you might experiment with two different headlines. Group A sees “Save Big on Your Next Adventure,” while Group B sees “Discover Affordable Travel Deals.” By analyzing which headline leads to more sign-ups or sales, you gain valuable insights into what resonates with your audience.
In typical online controlled experiments, users are randomly assigned to different variants in a consistent manner, ensuring they experience the same variant across multiple visits. In the Bing example, the Control group saw the original ad display, while the Treatment group saw ads with longer titles. User interactions with the Bing website were tracked and logged, enabling the calculation of metrics from the logged data. These metrics were then used to evaluate the differences between the variants.
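To make that last step concrete, the sketch below compares a click-through metric between Control and Treatment with a two-proportion z-test. The counts are made up for illustration (they are not Bing’s data), and it assumes the statsmodels package is available.

```python
# Hypothetical logged counts, not Bing's data: clicks and exposed users per variant.
from statsmodels.stats.proportion import proportions_ztest

clicks = [10_230, 10_580]          # Control, Treatment
exposures = [500_000, 500_000]     # users exposed to each variant

# Two-proportion z-test: is the difference in click-through rate
# larger than random chance would explain?
z_stat, p_value = proportions_ztest(count=clicks, nobs=exposures)

ctr_control = clicks[0] / exposures[0]
ctr_treatment = clicks[1] / exposures[1]
print(f"Control CTR:   {ctr_control:.3%}")
print(f"Treatment CTR: {ctr_treatment:.3%}")
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
```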
Understanding A/B Testing Terminology
A/B testing, or controlled experiments, involves several terms:
Overall Evaluation Criterion (OEC): A metric that encapsulates the experiment's goal, such as an increase in revenue balanced against user-experience metrics. The OEC should be measurable within the short timeframe of an experiment while being expected to causally influence long-term strategic goals.
Experiments may have multiple objectives, and analysis can employ a balanced scorecard approach. However, it is highly recommended to select a single metric, potentially as a weighted combination of these objectives, to streamline evaluation (Roy 2001, 50, 405-429).
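As an illustration of such a weighted-combination OEC, the snippet below folds a few per-user metric deltas into a single number. The metric names and weights are hypothetical; each team has to choose ones that reflect its own strategy.

```python
# Hypothetical weighted OEC: combine relative metric deltas (Treatment vs. Control)
# into one number. Metric names and weights are illustrative, not a standard.
def oec_delta(deltas: dict, weights: dict) -> float:
    """Weighted sum of relative metric deltas."""
    return sum(weights[name] * deltas[name] for name in weights)

deltas = {
    "revenue_per_user": 0.020,    # +2.0% vs. Control
    "sessions_per_user": -0.010,  # -1.0% vs. Control
    "task_success_rate": 0.005,   # +0.5% vs. Control
}
weights = {"revenue_per_user": 0.5, "sessions_per_user": 0.3, "task_success_rate": 0.2}

print(f"OEC delta: {oec_delta(deltas, weights):+.4f}")  # +0.0080
```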
Parameters: A controllable experimental variable, often called a parameter, is a factor that can be adjusted during an experiment and is believed to influence the Overall Evaluation Criterion (OEC) or other key metrics of interest. These parameters are assigned specific values or levels, representing the variations being tested. Understanding how these parameters affect outcomes allows experimenters to optimize for the best results.
Simple A/B Tests: In an A/B test, there is typically one parameter with two levels. A single parameter like button color might have two levels: blue (A) and red (B), with the goal of determining which generates higher click-through rates.
Univariable Tests with Multiple Levels: A test with a single parameter that has more than two levels. A parameter like button placement could have levels such as top of the page (A), middle (B), bottom (C), or sidebar (D), to identify the most effective location.
Multivariable Tests (Multivariate Tests, MVTs): Multiple parameters are tested simultaneously, making it possible to analyze how their interactions impact results. For example, button shape (round, square) and button text (“Buy Now,” “Add to Cart”) can be combined to evaluate which combination maximizes conversions and to uncover interactions between the parameters.
While simple A/B tests or univariable designs are effective for evaluating straightforward changes, MVTs are particularly useful when multiple factors interact in non-obvious ways. For instance, a font size that performs well with one color might perform poorly with another. By testing these combinations, experimenters can discover a global optimum—the best overall combination of changes that maximizes the desired outcome.
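The snippet below sketches how an MVT enumerates its variant space, using the hypothetical button-shape and button-text parameters from the example above.

```python
# Enumerate all combinations of parameter levels for a multivariate test (MVT).
# Parameters and levels are the hypothetical ones from the text.
from itertools import product

parameters = {
    "button_shape": ["round", "square"],
    "button_text": ["Buy Now", "Add to Cart"],
}

# Each combination of levels defines one variant to expose to users.
variants = [dict(zip(parameters, combo)) for combo in product(*parameters.values())]

for i, variant in enumerate(variants):
    print(f"Variant {i}: {variant}")
# 2 shapes x 2 texts = 4 variants; testing every combination is what lets the
# analysis detect interactions between the two parameters.
```

Because the number of combinations grows multiplicatively with each added parameter, full-factorial MVTs need correspondingly more traffic to reach the same statistical sensitivity per variant.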
Variants: A user experience being tested is defined by assigning specific values to parameters, creating distinct variants for comparison. In a typical A/B test, these variants are labeled as Control and Treatment, with the Control representing the existing version (baseline) and the Treatment reflecting the modified version being tested. While some literature uses "variant" solely for the Treatment, the Control is also considered a critical variant, serving as the benchmark for measuring changes in key metrics.
For example, in an experiment testing button colors, the Control might feature the current blue button, while the Treatment introduces a red button. If a bug or unexpected issue arises during the experiment, it is standard practice to abort the test and ensure all users are reverted to the Control variant, thereby minimizing the potential for adverse impacts. This process safeguards the user experience while maintaining the integrity of the baseline performance data for future analysis.
Randomization Unit: Randomization is a critical component of controlled experiments. A pseudo-randomization process, such as hashing, is applied to units (e.g., users or pages) to assign them to different variants. Proper randomization ensures that the populations assigned to each variant are statistically similar, enabling causal effects to be determined with high confidence. The mapping of units to variants must be both persistent and independent. For example, if a user is the randomization unit, that user should consistently experience the same variant throughout the experiment, and their assignment should reveal nothing about how other users are assigned.
Using users as the randomization unit is highly recommended for online experiments targeting audiences across websites, apps, or other digital platforms. However, alternative randomization units are sometimes employed based on the experiment's goals. These can include:
Pages: Randomizing content displayed on specific pages.
Sessions: Assigning a variant to a single user session but allowing different experiences across sessions.
User-Days: Ensuring a consistent experience for a user within a specific 24-hour period defined by the server.
Proper randomization is essential to maintain the integrity of the experiment. In cases where each variant is assigned an equal proportion of users, every user must have an equal probability of being assigned to any variant. This deliberate process eliminates potential biases that could distort results.
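A minimal sketch of persistent, independent assignment is shown below: the randomization unit (a user ID here) is hashed together with an experiment ID, so the same user always lands in the same variant of a given experiment, while assignments across experiments are unrelated, and a uniform hash gives each variant an equal share of users. The hashing scheme is illustrative, not a description of any particular production system.

```python
# Deterministic, persistent assignment of a randomization unit to a variant.
# Hashing the user ID with an experiment ID keeps assignments stable within an
# experiment and independent across experiments. Illustrative sketch only.
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "treatment")) -> str:
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# Stable across calls and visits for the same user and experiment...
assert assign_variant("user-42", "longer-ad-titles") == assign_variant("user-42", "longer-ad-titles")
# ...and knowing this assignment says nothing about a different experiment.
print(assign_variant("user-42", "longer-ad-titles"))
print(assign_variant("user-42", "button-color"))
```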
Why is A/B Testing Important?
The challenge with data interpretation lies in distinguishing correlation from causation. Observational data can mislead; for instance, users experiencing more errors might show lower churn rates, not because errors are beneficial, but because they are heavy users. Controlled experiments help establish causality by systematically varying one thing at a time while keeping everything else constant.
Example: Microsoft Office 365
Consider the case of Microsoft Office 365, a subscription-based software service. Observational data might reveal that users who encounter more error messages and software crashes have lower churn rates compared to others. At a glance, one might conclude that introducing more errors or lowering the software's quality could reduce customer churn, which is clearly not the logical step to take.
Correlation: The data shows a correlation between seeing error messages, experiencing crashes, and lower churn rates.
Misleading Interpretation: One might mistakenly infer that errors or crashes somehow keep users engaged or less likely to unsubscribe.
Actual Causation: However, the real underlying factor here is "usage". Heavy users of the product are more likely to encounter errors due to the frequency and intensity of their use. These heavy users also tend to have lower churn rates because they find more value in the product or are more invested in it, not because the errors are beneficial.
This example with Microsoft Office 365 illustrates why jumping to conclusions from mere correlations can lead to flawed strategies. Controlled experiments provide a structured approach to validate hypotheses, ensuring that decisions are based on causal relationships rather than coincidental correlations. By isolating variables and observing the direct impact of changes, businesses can avoid costly mistakes and invest in strategies that genuinely contribute to better user experiences and business outcomes.
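To make the confounding concrete, the simulation below uses entirely synthetic numbers (not Office 365 data): errors are given no causal effect on churn, yet the naive observational comparison still shows lower churn among users who saw errors, because heavy usage drives both.

```python
# Synthetic illustration of confounding by "usage": errors have no causal effect
# on churn here, yet a naive observational comparison makes them look protective.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

usage_hours = rng.exponential(scale=10, size=n)          # heavy users use the product more
errors_seen = rng.poisson(lam=0.2 * usage_hours)         # more usage -> more errors encountered
churn_prob = 1 / (1 + np.exp(0.3 * usage_hours - 1.0))   # more usage -> lower churn probability
churned = rng.random(n) < churn_prob

print("Churn rate, users who saw errors:   ", round(churned[errors_seen > 0].mean(), 3))
print("Churn rate, users who saw no errors:", round(churned[errors_seen == 0].mean(), 3))
# Errors never influenced churn in this model; usage drives both quantities,
# which is exactly the trap a controlled experiment avoids.
```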
Key Elements of an A/B Test
For A/B testing to be effective:
Experimental Units (or Users): Must be assignable to different variants without interference.
Sufficient Scale: Enough units to detect even small effects statistically (see the power-analysis sketch after this list).
Metrics: Key metrics like an OEC must be clearly defined and measurable.
Ease of Change: Software changes should be implementable without significant overhead.
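As a rough illustration of the sufficient-scale requirement, the power analysis below estimates how many users per variant are needed to detect a small lift in a conversion rate. The baseline rate, lift, significance level, and power are hypothetical choices, and it assumes statsmodels is available.

```python
# Rough sample-size estimate for detecting a small lift in a conversion rate.
# Baseline, lift, alpha, and power below are hypothetical choices.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.020    # 2.0% conversion in Control
treatment_rate = 0.021   # detect a 5% relative lift (to 2.1%)

effect_size = proportion_effectsize(treatment_rate, baseline_rate)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,           # 5% false-positive rate
    power=0.8,            # 80% chance of detecting the lift if it is real
    alternative="two-sided",
)
print(f"Approximately {n_per_variant:,.0f} users needed per variant")
```

Even this modest relative lift requires on the order of a hundred thousand users per variant, which is why large experimentation platforms lean on high traffic volumes.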
Lessons for Organizations
Data-Driven Decision Making: Establish a clear OEC that aligns with strategic goals, ensuring measurable and actionable insights.
Invest in Experimentation Infrastructure: Reliable systems are crucial for managing experiments, logging data, and deriving trustworthy conclusions. Also, implementing software changes with minimal overhead is a critical enabler for effective experimentation. Agile systems and processes should allow teams to make swift updates, test hypotheses, and adapt based on findings. This approach reduces delays, fosters innovation, and ensures experiments can iterate rapidly to achieve meaningful insights without bottlenecks.
Embrace Humility in Idea Assessment: Most ideas fail to deliver the anticipated impact. Controlled experiments help validate assumptions, reducing wasted effort and guiding resources toward high-value initiatives.
Strategic and Tactical Benefits
Strategy Validation: Experiments can affirm or challenge strategic directions through tangible outcomes.
Tactical Optimization: Small, iterative changes can lead to substantial cumulative gains.
Pivoting: When experiments hint at strategic misalignments, they can lead to strategic pivots or reassessments.
Conclusion
The Bing example serves as a powerful illustration of how A/B testing can revolutionize digital products and experiences. It highlights the importance of fostering a culture that prioritizes data-driven decision-making, backed by robust infrastructure to conduct rigorous experiments at scale. Moreover, it emphasizes the value of intellectual humility—recognizing that even the most promising ideas must be validated through empirical evidence, as our initial intuitions are often flawed.
As digital platforms continue to evolve in complexity and scale, the reliance on controlled experiments will become even more indispensable. These experiments not only validate innovations but also ensure they deliver measurable value to both businesses and users. By systematically testing and learning, organizations can refine their offerings, reduce risks, and make decisions grounded in objective insights rather than assumptions. In this way, A/B testing acts as a critical enabler of innovation, ensuring that progress is not only creative but also meaningful and impactful.
References
Kohavi, Ron, Diane Tang, and Ya Xu. 2020. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
Gupta, Somit, Ronny Kohavi, Diane Tang, Ya Xu, et al. 2019. “Top Challenges from the first Practical Online Controlled Experiments Summit.” Edited by Xin Luna Dong, Ankur Teredesai, and Reza Zafarani. SIGKDD Explorations (ACM) 21 (1). https://bit.ly/OCESummit1.
Roy, Ranjit K. 2001. Design of Experiments Using the Taguchi Approach: 16 Steps to Product and Process Improvement. John Wiley & Sons, Inc.