<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Inferentia: Experimentation]]></title><description><![CDATA[Experimentation is the engine of discovery and growth. On this page, you’ll learn how to design tests, analyse outcomes, and refine user experiences using data-driven insights—all to help you innovate faster and optimise results.]]></description><link>https://www.inferentia.in/s/experimentation</link><image><url>https://substackcdn.com/image/fetch/$s_!Ydl9!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20523561-1faf-45ef-81af-96c7aeb07db6_1120x1120.png</url><title>Inferentia: Experimentation</title><link>https://www.inferentia.in/s/experimentation</link></image><generator>Substack</generator><lastBuildDate>Sat, 11 Apr 2026 05:54:18 GMT</lastBuildDate><atom:link href="https://www.inferentia.in/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Piyush Ranjan]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[inferentia@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[inferentia@substack.com]]></itunes:email><itunes:name><![CDATA[Piyush Ranjan]]></itunes:name></itunes:owner><itunes:author><![CDATA[Piyush Ranjan]]></itunes:author><googleplay:owner><![CDATA[inferentia@substack.com]]></googleplay:owner><googleplay:email><![CDATA[inferentia@substack.com]]></googleplay:email><googleplay:author><![CDATA[Piyush Ranjan]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Running and Analysing Experiments: An End-to-End Example]]></title><description><![CDATA[From Hypothesis to Insight]]></description><link>https://www.inferentia.in/p/running-and-analysing-experiments</link><guid isPermaLink="false">https://www.inferentia.in/p/running-and-analysing-experiments</guid><dc:creator><![CDATA[Piyush Ranjan]]></dc:creator><pubDate>Sun, 26 Jan 2025 14:22:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XYod!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21af22e0-9276-4b71-8c73-2a39a69d74b4_870x334.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Introduction</h3><p>In today's data-driven world, making informed decisions about product changes is crucial. A/B testing offers a powerful framework for gathering actionable insights and making evidence-based decisions, particularly in web and app development. In this blog post, we delve into the essential principles of designing, executing, and analyzing an A/B test using a practical, real-world example.</p><h3>The Experimental Context: A Widget Store's Coupon Code Dilemma</h3><p>Our example centers on a fictional online commerce platform specializing in selling widgets. In an effort to boost sales, the marketing team proposes a strategy involving promotional emails that include a coupon code for discounts on widgets. This proposal marks a potential shift in the company's business model, as they have never offered coupon codes before. The team is cautious, however, due to findings from prior studies&#8212;such as Dr. 
Footcare's revenue loss following the introduction of coupon codes (Kohavi, Longbotham et al., 2009) and evidence from GoodUI.org suggesting that removing coupon codes can be beneficial (Linowski, 2018). These insights raise concerns that simply adding a coupon code field to the checkout page might negatively impact revenue, even if no valid codes are available. Users might become distracted searching for codes or even abandon their purchases altogether.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!opMW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c268feb-2507-4678-bc79-f089dc43f97f_776x296.png"><img src="https://substackcdn.com/image/fetch/$s_!opMW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c268feb-2507-4678-bc79-f089dc43f97f_776x296.png" width="776" height="296" alt=""></a><figcaption class="image-caption">Control and Treatment 1</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XYod!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21af22e0-9276-4b71-8c73-2a39a69d74b4_870x334.png"><img src="https://substackcdn.com/image/fetch/$s_!XYod!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21af22e0-9276-4b71-8c73-2a39a69d74b4_870x334.png" width="870" height="334" alt=""></a><figcaption class="image-caption">Treatment 2</figcaption></figure></div><p>To assess these potential effects, we adopt a &#8220;painted door&#8221; approach. This involves implementing a superficial change&#8212;adding a non-functional coupon code field to the checkout page.
When users input any code, the system responds with &#8220;Invalid Coupon Code.&#8221; By focusing on this minimal implementation, we aim to evaluate the psychological and behavioral impact of merely displaying a coupon code field.</p><p>Given the simplicity of this change, we&#8217;ll test two distinct UI designs to compare their effectiveness. Testing multiple treatments alongside a control allows us to discern not only whether the idea of adding a coupon code is viable, but also which specific implementation is most effective. This targeted A/B test is a crucial step toward determining the feasibility of adopting coupon codes as part of the company&#8217;s broader business strategy.</p><h4>Hypothesis</h4><p><em>Adding a coupon code field to the checkout page will degrade revenue-per-user for users who start the purchase process.</em></p><p>To test this hypothesis, we consider two UI implementations. This A/B test is a critical step in assessing the feasibility of this business model.</p><h4>Goal Metrics</h4><p>The primary metric, or Overall Evaluation Criterion (OEC), is revenue-per-user, normalized to account for sample size variability. To measure the impact of the change, we define success metrics carefully. While revenue is an obvious choice, using the total revenue sum is not recommended due to sample size variations across variants. Instead, revenue-per-user provides a more accurate and normalized metric. For this metric, it is critical to determine the denominator:</p><ol><li><p>Including all site visitors introduces significant noise, as a large portion of users never initiate checkout. These users do not interact with the modified checkout process and, therefore, cannot provide meaningful data on the impact of the change. Including them dilutes the results, making it harder to detect any real differences caused by the experiment.</p></li><li><p>Restricting to only users who complete the purchase process presents a skewed perspective, as it inherently assumes that the modification influences purchase amounts without considering its effect on conversion rates. This approach excludes users who might abandon the process due to the change, potentially missing a critical aspect of the experiment's impact.</p></li><li><p>The best choice is users who start the purchase process, as they are directly exposed to the change at the checkout stage. This refined approach improves test sensitivity by ensuring that only the users who interact with the modified UI are analyzed. By excluding unaffected users, such as those who browse without adding items to the cart or initiating a purchase, the metric becomes more precise.
This specificity helps isolate the true impact of the coupon code field, avoiding noise introduced by broader user behaviors that are irrelevant to the experiment.</p></li></ol><p>With this setup, our refined hypothesis becomes: &#8220;<strong>Adding a coupon code field to the checkout page will degrade revenue-per-user for users who start the purchase process.</strong>&#8221;</p><h4>Hypothesis Testing</h4><p>Before we can design, run, or analyze our experiment, let us go over a few foundational concepts relating to statistical hypothesis testing. First, we characterize the metric by understanding the baseline mean value&#8212;the average of the metric under normal, unaltered conditions&#8212;and the standard error of the mean. The standard error provides insight into the variability of our metric estimates and is crucial for determining the required sample size to detect meaningful differences. By accurately estimating this variability, we can size our experiment properly and assess statistical significance during analysis.</p><p>For most metrics, we measure the mean, but alternative summary statistics, such as medians or percentiles, may be more appropriate in specific contexts, such as highly skewed data distributions. Sensitivity, or the ability to detect statistically significant differences, improves when the standard error of the mean is reduced. This can be achieved by either increasing the traffic allocated to the experimental variants or running the experiment for an extended period. However, running longer experiments may yield diminishing returns after a few weeks due to sub-linear growth in unique users (caused by repeat visitors) and potential increases in variance for certain metrics over time.</p><p>To evaluate the impact of the experiment, we analyze revenue-per-user estimates from the Control and Treatment samples by computing the p-value for their difference. The p-value represents the probability of observing the measured difference, or a more extreme one, under the assumption that the null hypothesis&#8212;that there is no true difference&#8212;is correct. A sufficiently small p-value allows us to reject the null hypothesis and infer that the observed effect is statistically significant. But what constitutes a small enough p-value?</p><p>Typically, the scientific benchmark is a p-value less than 0.05. This threshold means there is less than a 5% probability of incorrectly concluding there is an effect when none actually exists (a 5% false positive rate). Furthermore, another approach to determine significance is through confidence intervals. A 95% confidence interval defines a range where the true difference between Treatment and Control lies 95% of the time. If this interval does not include zero, it reinforces the conclusion that the effect is statistically significant. These tools collectively help establish the robustness of experimental findings, ensuring decisions are data-driven and reliable.</p>
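<p>To make these ideas concrete, here is a minimal sketch in Python of how such a comparison might be run. The revenue figures are simulated stand-ins rather than data from the experiment, and the exponential distribution is just a convenient way to mimic skewed revenue:</p><pre><code>import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical revenue-per-user samples for users who started checkout.
control = rng.exponential(scale=3.8, size=10_000)
treatment = rng.exponential(scale=3.6, size=10_000)

# Welch two-sample t-test for the difference in means.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Approximate 95% confidence interval for the difference in means.
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"diff = {diff:.3f}, p-value = {p_value:.4f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
</code></pre><p>If the interval excludes zero, the difference is statistically significant at the 5% level, matching the confidence-interval reading described above.</p>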
<h4>Statistical Power</h4><p>Statistical power measures the ability of an experiment to detect a meaningful difference between variants when such a difference truly exists. In simple terms, it is the probability of correctly rejecting the null hypothesis when there is an actual effect. For instance, if a retailer is testing a new homepage layout, statistical power ensures that subtle but real increases in sales do not go unnoticed.</p><p>To achieve reliable results, experiments are often designed with 80-90% power, meaning there is a high likelihood of detecting true changes. Power is influenced by factors such as sample size and effect size; larger sample sizes tend to improve power, but overly small differences might still evade detection. For example, while a large e-commerce platform like Amazon might be interested in detecting a 0.2% increase in revenue-per-user due to its massive scale, a smaller startup might focus only on changes exceeding 5-10% because such increases are critical for their growth.</p><p>While statistical significance helps us understand whether an observed difference is likely due to chance, it does not always translate into practical significance. Practical significance asks a more business-oriented question: Is the observed change large enough to matter? For example, a 0.2% increase in revenue-per-user might be meaningful for billion-dollar platforms like Google or Bing, but for a small startup seeking rapid growth, a 2% increase might still fall short of expectations. Setting clear business thresholds for what constitutes a meaningful change is essential. For our hypothetical widget store, we define practical significance as a 1% or larger change in revenue-per-user, recognising this as the minimum impact needed to justify potential costs or risks of implementation.</p><h3>Designing the Experiment</h3><p>We are now ready to design our experiment. We have a hypothesis, a practical significance boundary, and we have characterised our metric. We will use this set of decisions to finalize the design:</p><ol><li><p><strong>What is the randomisation unit?</strong> The randomisation unit for this experiment is the user.</p></li><li><p><strong>What population of randomisation units do we want to target?</strong> We will target all users and analyse those who visit the checkout page. Targeting a specific population allows for more focused results. For instance, if the new text in a feature is only available in certain languages, you would target users with those specific interface locales. Similarly, attributes such as geographic region, platform, and device type can guide targeting.</p></li><li><p><strong>How large does our experiment need to be?</strong> The experiment size directly impacts the precision of results. To detect a 1% change in revenue-per-user with 80% power, we will conduct a power analysis to determine the sample size (a sketch of this calculation follows the list below). The following considerations influence size:</p><ul><li><p>Using a binary metric like purchase indicator (yes/no) instead of revenue-per-user can reduce variability, allowing for a smaller sample size.</p></li><li><p>Increasing the practical significance threshold&#8212;for example, detecting only changes larger than 1%&#8212;can also reduce sample size requirements.</p></li><li><p>Lowering the p-value threshold, such as from 0.05 to 0.01, increases sample size needs.</p></li></ul></li><li><p><strong>How long should the experiment run?</strong> To ensure robust results, we will run the experiment for at least one week to capture weekly cycles and account for day-of-week effects.
External factors like seasonality and primacy or novelty effects are also important:</p><ul><li><p>User behavior can vary during holidays or promotional periods, affecting external validity.</p></li><li><p>Novelty effects (e.g., initial enthusiasm for a new feature) and adoption effects (e.g., gradual user adoption) may impact results over time.</p></li></ul></li></ol>
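<p>Here is a rough sketch of the power analysis from step 3, using statsmodels. The baseline mean and standard deviation are made-up numbers for our hypothetical store (revenue data is typically highly skewed, so the standard deviation is often several times the mean):</p><pre><code>from statsmodels.stats.power import TTestIndPower

baseline_mean = 3.80   # assumed revenue-per-user for checkout starters, in $
baseline_sd = 9.50     # assumed standard deviation
mde = 0.01 * baseline_mean        # minimum detectable effect: a 1% change
effect_size = mde / baseline_sd   # standardized effect size (Cohen's d)

# Solve for the per-variant sample size at alpha = 0.05 and 80% power.
n_per_variant = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"users needed per variant: {n_per_variant:,.0f}")
</code></pre><p>With these assumptions the answer is on the order of a million users per variant, which illustrates why detecting small changes in a noisy metric like revenue-per-user demands so much traffic, and why the variance-reduction options listed above matter.</p>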
<h3>Final Experiment Design</h3><ol><li><p><strong>Randomization Unit</strong>: User</p></li><li><p><strong>Target Population</strong>: All users visiting the checkout page</p></li><li><p><strong>Experiment Size</strong>: Determined via power analysis to achieve 80% power for detecting a 1% change</p></li><li><p><strong>Experiment Duration</strong>: Minimum of one week to capture weekly cycles, extended if novelty or primacy effects are detected</p></li><li><p><strong>Traffic Split</strong>: 34/33/33% for Control, Treatment 1, and Treatment 2</p></li></ol><p>By carefully designing the experiment with these considerations, we can ensure that the results are both statistically and practically significant. Overpowering an experiment is often beneficial, as it allows for detailed segment analysis (e.g., by geographic region or platform). This approach not only improves the robustness of results but also helps businesses uncover nuanced insights. For example, identifying specific trends in user behavior across different demographics can inform future product iterations and marketing strategies.</p><p>Furthermore, running a well-structured A/B test fosters a culture of data-driven decision-making within the organization. By investing in comprehensive experiment design and analysis, businesses can mitigate risks, allocate resources effectively, and achieve sustainable growth. Ultimately, the insights derived from this experiment will not only validate the feasibility of introducing coupon codes but also set a benchmark for future experimentation, reinforcing the importance of innovation and customer-centric strategies in a competitive market.</p><h3>Running the Experiment and Getting Data</h3><p>Now let us run the experiment and gather the necessary data. To run an experiment, we need both:</p><ul><li><p><strong>Instrumentation</strong> to get log data on how users interact with your site and which experiments those interactions belong to.</p></li><li><p><strong>Infrastructure</strong> to be able to run an experiment, ranging from experiment configuration to variant assignment.</p></li></ul><h3>Interpreting the Results</h3><p>The data collected from the experiment is the foundation for actionable insights, but ensuring its reliability is critical. Before diving into the revenue-per-user analysis, it is essential to validate the experiment's execution by examining invariant metrics, also known as guardrail metrics. These metrics serve two primary purposes:</p><ol><li><p><strong>Trust-related Guardrails</strong>: Metrics such as sample size consistency and cache-hit rates ensure that the Control and Treatment groups align with the experiment configuration. Any deviation here might indicate issues in randomization or assignment (a sketch of such a check follows this list).</p></li><li><p><strong>Organizational Guardrails</strong>: Metrics like latency or system performance, which are crucial to business operations, should remain stable across variants. For example, significant changes in checkout latency would signal underlying problems unrelated to the coupon code test.</p></li></ol><p>If these metrics show unexpected changes, it suggests flaws in the experiment design, infrastructure, or data processing pipeline. Addressing these issues before analyzing the core results is vital to maintaining trust in the findings.</p>
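<p>The most common trust-related guardrail is a sample ratio mismatch (SRM) check: a chi-squared test of the observed variant counts against the configured 34/33/33 split. A minimal sketch, with made-up counts:</p><pre><code>from scipy import stats

# Hypothetical user counts for Control, Treatment 1, and Treatment 2.
observed = [34_512, 33_108, 32_966]
expected_ratios = [0.34, 0.33, 0.33]
total = sum(observed)
expected = [r * total for r in expected_ratios]

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
# A tiny p-value (commonly p below 0.001) means the traffic split deviates
# from the configuration; results should not be trusted until it is fixed.
print(f"SRM check: chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
</code></pre>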
<p>Once the guardrails are validated, the next step is to analyze and interpret the results with precision. For example, if the p-value for revenue-per-user in both Treatment groups is below 0.05, we reject the null hypothesis and conclude that the observed differences are statistically significant. However, statistical significance alone is insufficient. Practical significance&#8212;the magnitude of the observed effect&#8212;determines whether the change is worth implementing.</p><h4>Results Table</h4><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!seBd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7ac163-f4b4-4bdf-83b0-ab7945bec07f_1190x420.png"><img src="https://substackcdn.com/image/fetch/$s_!seBd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7ac163-f4b4-4bdf-83b0-ab7945bec07f_1190x420.png" width="1190" height="420" alt="Results table comparing revenue-per-user for Control, Treatment 1, and Treatment 2"></a></figure></div><p>From the table, we observe that both treatments significantly reduce revenue-per-user compared to the control group. While the p-values confirm statistical significance, the negative impact on revenue highlights the need to reassess introducing coupon codes.</p><h4>Decision-Making Framework</h4><p>Consider the context of your experiment:</p><ul><li><p><strong>Short-Term vs. Long-Term Impact</strong>: Changes with minimal downside risks, such as testing promotional headlines, may allow for lower thresholds of significance. Conversely, introducing high-cost features like a coupon code system requires higher thresholds due to long-term resource commitments.</p></li><li><p><strong>Balancing Metrics</strong>: A decrease in revenue may be acceptable if offset by an increase in user engagement, but only if the net impact aligns with organizational goals.</p></li></ul><p>Ultimately, the results must translate into a clear decision framework. For example:</p><ol><li><p>If the results are both statistically and practically significant, the decision to launch is straightforward.</p></li><li><p>If statistically significant but not practically meaningful, the change may not justify further investment.</p></li><li><p>If results are inconclusive, consider increasing sample size or re-evaluating the design.</p></li><li><p>If the result is statistically significant and likely practically significant, the call is less clear-cut.
Like prior examples, it is possible that the change is not practically significant. In this situation, repeating the test with greater power is advisable. However, from a launch/no-launch perspective, choosing to launch is a reasonable decision. It is crucial to explicitly document the factors influencing this decision, particularly how they align with the practical and statistical significance boundaries. This clarity not only supports current decision-making but also establishes a solid foundation for future analyses.</p></li></ol><p>By grounding decision-making in robust analysis and broader business considerations, organizations can confidently use A/B testing as a tool for sustainable growth and innovation.</p>
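<p>As a closing illustration, here is a sketch of that decision logic in code, comparing a confidence interval for the relative change against our 1% practical significance boundary. The interval bounds are illustrative inputs, not outputs of this experiment:</p><pre><code>def launch_decision(ci_low, ci_high, practical_sig=0.01):
    """Classify a result; bounds are relative changes, e.g. 0.02 means +2%."""
    if ci_low &lt;= 0.0 &lt;= ci_high:
        return "no detectable effect: iterate or increase power"
    if ci_low &gt;= practical_sig:
        return "launch: statistically and practically significant"
    if ci_high &lt;= -practical_sig:
        return "do not launch: practically significant degradation"
    return "significant but near the boundary: consider repeating with more power"

# e.g., Treatment 1's revenue-per-user change estimated between -2.3% and -1.1%:
print(launch_decision(-0.023, -0.011))  # do not launch
</code></pre>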
]]></content:encoded></item><item><title><![CDATA[A/B Testing 101: The Power of Experimentation]]></title><description><![CDATA[The $100M Experiment: How A/B Testing Transformed Bing&#8217;s Revenue]]></description><link>https://www.inferentia.in/p/ab-testing-101-a-beginners-guide</link><guid isPermaLink="false">https://www.inferentia.in/p/ab-testing-101-a-beginners-guide</guid><dc:creator><![CDATA[Piyush Ranjan]]></dc:creator><pubDate>Wed, 15 Jan 2025 03:30:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0230b97a-bb06-485c-bcb2-fa5e0b7a5304_2770x1038.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>"One accurate measurement is worth more than a thousand expert opinions."</em></p><p><em>&#8211; Admiral Grace Hopper</em></p><p>In the digital age, where data reigns supreme, the ability to measure the impact of changes accurately is crucial. The famous quote by Admiral Grace Hopper encapsulates the essence of why businesses like Microsoft's Bing have turned to A/B testing to drive innovation and profitability.</p><h4>The Bing Case Study</h4><p>In 2012, a small yet transformative idea within Microsoft&#8217;s Bing team provided a striking example of the power of data-driven experimentation. A simple suggestion to change ad headlines was overlooked for months, buried under bigger projects. When finally implemented, this idea became Bing&#8217;s most significant revenue-generating change, highlighting the profound impact of rigorous online controlled experiments (A/B tests).</p><p>The proposed change involved extending ad title lines by combining them with text from the line below, creating a more engaging headline. Initially deemed low priority, it wasn&#8217;t until a developer implemented it&#8212;due to its coding simplicity&#8212;that the idea was tested on real users. A/B testing split users between the existing and modified layouts, tracking interactions, clicks, and revenue metrics.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MK3T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f8b6056-482f-414f-ac3b-2c463c9e6ada_970x1256.png"><img src="https://substackcdn.com/image/fetch/$s_!MK3T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f8b6056-482f-414f-ac3b-2c463c9e6ada_970x1256.png" width="970" height="1256" alt=""></a></figure></div><p>Within hours of launch, an anomaly emerged: revenue surged unexpectedly. Teams braced for a potential bug, but upon verification, the results proved legitimate. The new ad layout increased Bing&#8217;s revenue by 12%, equating to over $100 million annually in the U.S. alone, highlighting several key lessons:</p><ul><li><p><strong>Value Assessment:</strong> Even seemingly minor ideas can have significant impacts, yet their value is often underestimated or overlooked.</p></li><li><p><strong>Impact of Small Changes:</strong> A small alteration can lead to massive financial returns if it aligns well with user behavior and expectations.</p></li><li><p><strong>Rarity of Big Wins:</strong> Not every change will yield such dramatic results; this was one of the few blockbuster successes from thousands of experiments conducted annually.</p></li><li><p><strong>Efficiency in Experimentation:</strong> The overhead of running an experiment must be small. Bing&#8217;s engineers had access to ExP, Microsoft&#8217;s experimentation system, which made it easy to scientifically evaluate the idea.</p></li></ul><h4>What is A/B Testing?</h4><p>A/B testing, also known as split testing, is a method of comparing two versions of a webpage, email, or other marketing asset to determine which one performs better. By splitting your audience into two groups and showing each group a different version (Version A vs. Version B), you can measure which variation drives more conversions, clicks, or other desired outcomes.</p><p>Companies like Airbnb, Amazon, Booking.com, eBay, Facebook, Google, LinkedIn, Lyft, Microsoft, and Netflix rely heavily on online controlled experiments (Gupta et al., 2019). These organisations conduct thousands to tens of thousands of experiments annually, often involving millions of users. They test a wide range of elements, such as user interface (UI) changes, relevance algorithms (including search, ads, personalization, and recommendations), latency and performance, content management systems, customer support systems, and more. Experiments span various channels, including websites, desktop applications, mobile apps, and email.</p><p>For example, if you&#8217;re testing a landing page, you might experiment with two different headlines.
Group A sees &#8220;Save Big on Your Next Adventure,&#8221; while Group B sees &#8220;Discover Affordable Travel Deals.&#8221; By analysing which headline leads to more sign-ups or sales, you gain valuable insights into what resonates with your audience.</p><p>In typical online controlled experiments, users are randomly assigned to different variants in a consistent manner, ensuring they experience the same variant across multiple visits. In the Bing example, the Control group saw the original ad display, while the Treatment group saw ads with longer titles. User interactions with the Bing website were tracked and logged, enabling the calculation of metrics from the logged data. These metrics were then used to evaluate the differences between the variants.</p><p><strong>Understanding A/B Testing Terminology</strong></p><p>A/B testing, or controlled experiments, involves several terms:</p><ul><li><p><strong>Overall Evaluation Criterion (OEC):</strong> A metric that encapsulates the experiment's goal, such as an increase in revenue balanced with user experience metrics. The OEC should be measurable within the short timeframe of an experiment while being expected to causally influence long-term strategic goals.</p><p>Experiments may have multiple objectives, and analysis can employ a balanced scorecard approach. However, it is highly recommended to select a single metric, potentially as a weighted combination of these objectives, to streamline evaluation (Roy 2001, 50, 405&#8722;429).</p></li><li><p><strong>Parameters:</strong> A controllable experimental variable, often called a parameter, is a factor that can be adjusted during an experiment and is believed to influence the Overall Evaluation Criterion (OEC) or other key metrics of interest. These parameters are assigned specific values or levels, representing the variations being tested. Understanding how these parameters affect outcomes allows experimenters to optimize for the best results.</p><ul><li><p>Simple A/B Tests:<br>In an A/B test, there is typically one parameter with two levels. A single parameter like button color might have two levels: blue (A) and red (B), with the goal of determining which generates higher click-through rates.</p></li><li><p>Univariable Tests with Multiple Levels:<br>A test with a single parameter that has more than two levels. A parameter like button placement could have levels such as top of the page (A), middle (B), bottom (C), or sidebar (D), to identify the most effective location.</p></li><li><p>Multivariable Tests (Multivariate Tests - MVTs):<br>MVTs test multiple parameters simultaneously, making it possible to analyze how their interactions impact results. Multiple parameters are tested together, such as button shape (round, square) and button text (&#8220;Buy Now,&#8221; &#8220;Add to Cart&#8221;), to evaluate combinations and uncover interactions that maximize conversions.</p></li></ul><p>While simple A/B tests or univariable designs are effective for evaluating straightforward changes, MVTs are particularly useful when multiple factors interact in non-obvious ways. For instance, a font size that performs well with one color might perform poorly with another. By testing these combinations, experimenters can discover a <strong>global optimum</strong>&#8212;the best overall combination of changes that maximizes the desired outcome.</p></li><li><p><strong>Variants:</strong> A user experience being tested is defined by assigning specific values to parameters, creating distinct variants for comparison.
In a typical A/B test, these variants are labeled as Control and Treatment, with the Control representing the existing version (baseline) and the Treatment reflecting the modified version being tested. While some literature uses "variant" solely for the Treatment, the Control is also considered a critical variant, serving as the benchmark for measuring changes in key metrics.</p><p>For example, in an experiment testing button colors, the Control might feature the current blue button, while the Treatment introduces a red button. If a bug or unexpected issue arises during the experiment, it is standard practice to abort the test and ensure all users are reverted to the Control variant, thereby minimizing the potential for adverse impacts. This process safeguards the user experience while maintaining the integrity of the baseline performance data for future analysis.</p></li><li><p><strong>Randomisation Unit:</strong> Randomization is a critical component of controlled experiments. A pseudo-randomization process, such as hashing, is applied to units (e.g., users or pages) to assign them to different variants. Proper randomization ensures that the populations assigned to each variant are statistically similar, enabling causal effects to be determined with high confidence. The mapping of units to variants must be both persistent and independent. For example, if a user is the randomization unit, that user should consistently experience the same variant throughout the experiment, and their assignment should reveal nothing about how other users are assigned.</p><p>Using users as the randomization unit is highly recommended for online experiments targeting audiences across websites, apps, or other digital platforms. However, alternative randomization units are sometimes employed based on the experiment's goals. These can include:</p><ul><li><p><strong>Pages</strong>: Randomising content displayed on specific pages.</p></li><li><p><strong>Sessions</strong>: Assigning a variant to a single user session but allowing different experiences across sessions.</p></li><li><p><strong>User-Days</strong>: Ensuring a consistent experience for a user within a specific 24-hour period defined by the server.</p></li></ul><p>Proper randomisation is essential to maintain the integrity of the experiment. In cases where each variant is assigned an equal proportion of users, every user must have an equal probability of being assigned to any variant. 
This deliberate process eliminates potential biases that could distort results.</p></li></ul>
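<p>A minimal sketch of what persistent, independent assignment might look like, hashing a user ID salted with an experiment name (both hypothetical identifiers):</p><pre><code>import hashlib

def assign_variant(user_id, experiment, variants):
    """Deterministically map a randomization unit (a user) to a variant."""
    # Salting with the experiment name keeps assignments independent
    # across experiments for the same user.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 100 equal-probability buckets
    cutoff = 100 // len(variants)   # equal split; uneven splits need custom cut-offs
    return variants[min(bucket // cutoff, len(variants) - 1)]

# The same user always sees the same variant for this experiment:
print(assign_variant("user-42", "longer-ad-titles", ["control", "treatment"]))
</code></pre>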
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h4>Why is A/B Testing Important?</h4><p>The challenge with data interpretation lies in distinguishing correlation from causation. Observational data can mislead; for instance, users experiencing more errors might show lower churn rates, not because errors are beneficial, but because they are heavy users. Controlled experiments help establish causality by systematically varying one thing at a time while keeping everything else constant.</p><p><strong>Example: Microsoft Office 365</strong></p><p>Consider the case of Microsoft Office 365, a subscription-based software service. Observational data might reveal that users who encounter more error messages and software crashes have lower churn rates compared to others. At a glance, one might conclude that introducing more errors or lowering the software's quality could reduce customer churn, which is clearly not the logical step to take.</p><ul><li><p><strong>Correlation:</strong> The data shows a correlation between seeing error messages, experiencing crashes, and lower churn rates.</p></li><li><p><strong>Misleading Interpretation:</strong> One might mistakenly infer that errors or crashes somehow keep users engaged or less likely to unsubscribe.</p></li><li><p><strong>Actual Causation:</strong> However, the real underlying factor here is "usage". Heavy users of the product are more likely to encounter errors due to the frequency and intensity of their use. These heavy users also tend to have lower churn rates because they find more value in the product or are more invested in it, not because the errors are beneficial.</p></li></ul><p>This example with Microsoft Office 365 illustrates why jumping to conclusions from mere correlations can lead to flawed strategies. 
<strong>Controlled experiments provide a structured approach to validate hypotheses, ensuring that decisions are based on causal relationships rather than coincidental correlations.</strong> By isolating variables and observing the direct impact of changes, businesses can avoid costly mistakes and invest in strategies that genuinely contribute to better user experiences and business outcomes.</p><h4>Key Elements of an A/B Test</h4><p>For A/B testing to be effective:</p><ol><li><p><strong>Experimental Units (</strong>or Users<strong>):</strong> Must be assignable to different variants without interference.</p></li><li><p><strong>Sufficient Scale:</strong> Enough units to detect even small effects statistically.</p></li><li><p><strong>Metrics:</strong> Key metrics like an OEC must be clearly defined and measurable.</p></li><li><p><strong>Ease of Change:</strong> Software changes should be implementable without significant overhead.</p></li></ol><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.inferentia.in/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.inferentia.in/subscribe?"><span>Subscribe now</span></a></p><p></p><h4>Lessons for Organizations</h4><ol><li><p><strong>Data-Driven Decision Making:</strong> Establish a clear OEC that aligns with strategic goals, ensuring measurable and actionable insights.</p></li><li><p><strong>Invest in Experimentation Infrastructure:</strong> Reliable systems are crucial for managing experiments, logging data, and deriving trustworthy conclusions. Also, implementing software changes with minimal overhead is a critical enabler for effective experimentation. Agile systems and processes should allow teams to make swift updates, test hypotheses, and adapt based on findings. This approach reduces delays, fosters innovation, and ensures experiments can iterate rapidly to achieve meaningful insights without bottlenecks.</p></li><li><p><strong>Embrace Humility in Idea Assessment:</strong> Most ideas fail to deliver the anticipated impact. Controlled experiments help validate assumptions, reducing wasted effort and guiding resources toward high-value initiatives.Common A/B Testing Mistakes to Avoid</p><p></p></li></ol><h4><strong>Strategic and Tactical Benefits</strong></h4><ul><li><p><strong>Strategy Validation:</strong> Experiments can affirm or challenge strategic directions through tangible outcomes.</p></li><li><p><strong>Tactical Optimization:</strong> Small, iterative changes can lead to substantial cumulative gains.</p></li><li><p><strong>Pivoting:</strong> When experiments hint at strategic misalignments, they can lead to strategic pivots or reassessments.</p></li></ul><h4><strong>Conclusion</strong></h4><p>The Bing example serves as a powerful illustration of how A/B testing can revolutionize digital products and experiences. It highlights the importance of fostering a culture that prioritizes data-driven decision-making, backed by robust infrastructure to conduct rigorous experiments at scale. Moreover, it emphasizes the value of intellectual humility&#8212;recognizing that even the most promising ideas must be validated through empirical evidence, as our initial intuitions are often flawed.</p><p>As digital platforms continue to evolve in complexity and scale, the reliance on controlled experiments will become even more indispensable. 
These experiments not only validate innovations but ensure they deliver measurable value to both businesses and users. By systematically testing and learning, organizations can refine their offerings, reduce risks, and make decisions grounded in objective insights rather than assumptions. In this way, A/B testing acts as a critical enabler of innovation, ensuring that progress is not only creative but also meaningful and impactful.</p><h3>References</h3><ol><li><p>Kohavi, Ron, Diane Tang, and Ya Xu. 2020. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.</p></li><li><p>Gupta, Somit, Ronny Kohavi, Diane Tang, Ya Xu, et al. 2019. &#8220;Top Challenges from the first Practical Online Controlled Experiments Summit.&#8221; Edited by Xin Luna Dong, Ankur Teredesai, and Reza Zafarani. SIGKDD Explorations (ACM) 21 (1). https://bit.ly/OCESummit1.</p></li><li><p>Roy, Ranjit K. 2001. Design of Experiments Using the Taguchi Approach: 16 Steps to Product and Process Improvement. John Wiley &amp; Sons, Inc.</p></li></ol>]]></content:encoded></item></channel></rss>