Worked Example: A/B Testing an AI Feature
Walk through a complete answer to 'How would you set up an A/B test for an AI feature?' covering the unique challenges of testing non-deterministic systems.
The Question
Here is the question: 'You have shipped an AI-powered feature that suggests meeting times based on participants' calendars and preferences. How would you set up an A/B test to measure its impact?' This question tests a specific and important technical skill: A/B testing non-deterministic systems. AI features introduce challenges that traditional A/B tests do not account for, and interviewers want to see that you understand these challenges.
The key challenge with A/B testing AI features is that the output is not deterministic. The same input (a meeting request) might produce different suggestions depending on the model version, sampling randomness, or subtle differences in input processing. This makes measuring the causal impact of the feature harder than testing a deterministic UI change.
Worked Answer: Experimental Design
"Let me start with the experimental design. The randomization unit should be the user, not the meeting or the session. If we randomize at the meeting level, the same user might get the AI suggestion for some meetings but not others, which creates a confusing experience and contaminates the treatment effect. User-level randomization means a user is either in the treatment group (sees AI suggestions) or control group (sees the existing scheduling experience) for the entire test duration."
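In practice, user-level assignment like this is usually implemented with deterministic hashing, so the same user always lands in the same variant without storing per-user state. A minimal sketch (the experiment name `ai_scheduling_v1` is a made-up example):

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "ai_scheduling_v1",
                   treatment_fraction: float = 0.5) -> str:
    """Deterministic user-level assignment: hashing (experiment, user_id)
    gives the same variant on every request, with no stored state."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform-ish value in [0, 1]
    return "treatment" if bucket < treatment_fraction else "control"
```

Salting the hash with the experiment name means a user's bucket in this test is independent of their bucket in any other concurrent experiment.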
"The treatment group sees the AI meeting time suggestions as the primary scheduling interface. The control group sees the existing manual scheduling flow with no AI suggestions. I would not use a 'degraded AI' control (e.g., random suggestions) because it would annoy users and not represent the true counterfactual."
"Sample size and duration: I would run a power analysis targeting a minimum detectable effect of 5 percentage points in scheduling completion rate, which is our primary metric. Assuming a baseline completion rate of 60%, a significance level of 0.05, and 80% power, we need approximately 1,500 users per group. I would run the test for 3 weeks to capture variation across weekdays and meeting patterns, and to account for the novelty effect (users might engage more with AI suggestions initially, then revert)."
[Interviewer note: Correct randomization unit choice with clear justification. The decision against a degraded-AI control group shows practical judgment. The power analysis is appropriate, and the 3-week duration to capture weekly patterns and novelty effects is thoughtful. This is a strong start.]
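Sample-size figures like these can be sanity-checked with the standard normal-approximation formula for a two-sided two-proportion z-test. An illustrative helper, assuming the MDE is a 5-percentage-point absolute lift from the 60% baseline:

```python
from statistics import NormalDist

def sample_size_two_proportions(p1: float, p2: float,
                                alpha: float = 0.05,
                                power: float = 0.80) -> float:
    """Per-group sample size for a two-sided two-proportion z-test
    (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
    z_beta = NormalDist().inv_cdf(power)           # ~0.84
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ((z_alpha + z_beta) ** 2 * variance) / (p1 - p2) ** 2

print(round(sample_size_two_proportions(0.60, 0.65)))  # 1468 per group
```

A smaller MDE (e.g. a 5% relative lift, 60% to 63%) roughly triples the required sample, which is why pinning down whether the MDE is absolute or relative matters before quoting a number.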
Worked Answer: Metrics and AI-Specific Challenges
"Primary metric: scheduling completion rate (percentage of meeting scheduling flows where a time is successfully booked). Secondary metrics: time-to-schedule (how long it takes from initiating a meeting request to confirming a time), suggestion acceptance rate (what percentage of AI suggestions are accepted without modification), participant response rate (do participants accept the meeting faster?), and meeting no-show rate (a downstream quality metric)."
"Now, the AI-specific challenges. First, non-determinism. The model might suggest different times for similar inputs depending on minor variations. To manage this, I would log the model's confidence score for every suggestion and analyze results stratified by confidence level. This tells us not just whether the feature works overall, but in which scenarios it works well and where it struggles."
"Second, the cold-start problem. Users who are new to the platform have less calendar data, so the AI suggestions will be worse for them. I would segment the A/B test results by user tenure: new users (less than 30 days), established users (30-180 days), and power users (180+ days). If the feature helps established users but hurts new users, we know we need a minimum data threshold before showing AI suggestions."
"Third, network effects. Meeting scheduling involves multiple participants. If the organizer is in the treatment group but participants are in the control group, the treatment effect is diluted. I would account for this by also measuring at the meeting level: for meetings where all participants are in the treatment group, is the effect stronger? This requires careful analysis but gives us a more accurate treatment effect estimate."
[Interviewer note: The metrics are well-chosen with a clear primary metric and meaningful secondary metrics. The three AI-specific challenges (non-determinism, cold-start, network effects) are exactly the right challenges to raise for this use case. The stratified analysis by confidence level and user tenure shows analytical sophistication. The network effects consideration is a subtle point that most candidates miss. Score: 4.5/5.]
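The stratified analysis described above (by tenure segment or confidence bucket) reduces to grouping outcomes by segment and comparing completion rates per group. A toy sketch over hypothetical records, each a (segment, completed) pair:

```python
from collections import defaultdict

def completion_rate_by_segment(records):
    """records: iterable of (segment, completed) pairs.
    Returns {segment: completion_rate}."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [completed, total]
    for segment, completed in records:
        totals[segment][0] += int(completed)
        totals[segment][1] += 1
    return {seg: done / n for seg, (done, n) in totals.items()}

# Hypothetical data: tenure segments from the answer above
records = [("new", True), ("new", False),
           ("established", True), ("established", True),
           ("power", False)]
print(completion_rate_by_segment(records))
# {'new': 0.5, 'established': 1.0, 'power': 0.0}
```

The same function works for confidence-level strata; the only change is what you pass as the segment key.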
Worked Answer: Analysis and Guardrails
"For the analysis plan, I would look at three things: the overall treatment effect on the primary metric with a confidence interval; heterogeneous treatment effects by user segment, device type, and meeting size; and a time-series analysis of the treatment effect over the 3-week period to check for novelty effects. If the effect is strong in week 1 but declines in weeks 2 and 3, the feature is novel but not sustainably valuable."
"Guardrail metrics that would trigger stopping the test early: if the treatment group's scheduling abandonment rate increases by more than 10% relative, if user complaints about incorrect suggestions exceed 1% of treatment users, or if system latency for the scheduling flow increases by more than 500ms P95 (indicating the AI model is too slow)."
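Guardrails like these are straightforward to encode as an automated check that a monitoring job could evaluate daily; the metric names below are hypothetical placeholders, not a real monitoring API:

```python
def guardrails_breached(metrics: dict) -> list:
    """Return the list of breached guardrails, mirroring the thresholds
    in the answer above. Empty list means the test may continue."""
    breaches = []
    if metrics["abandonment_lift_relative"] > 0.10:
        breaches.append("scheduling abandonment up >10% relative")
    if metrics["complaint_rate"] > 0.01:
        breaches.append("complaints exceed 1% of treatment users")
    if metrics["p95_latency_increase_ms"] > 500:
        breaches.append("P95 scheduling latency up >500ms")
    return breaches

snapshot = {"abandonment_lift_relative": 0.04,
            "complaint_rate": 0.015,
            "p95_latency_increase_ms": 120}
print(guardrails_breached(snapshot))
# ['complaints exceed 1% of treatment users']
```

Returning the specific breached guardrails, rather than a bare boolean, makes the stop decision auditable after the fact.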
"One last consideration: I would include a holdout group. Even after the test concludes and we ship the feature, I would keep 5% of users in a permanent holdout (no AI suggestions) for 3 months. This lets us measure the long-term impact of the feature, including effects on user retention and meeting quality that take time to manifest. It also protects against slowly degrading model performance that a short A/B test would not catch."
[Interviewer note: The analysis plan is thorough. Checking for novelty effects via time-series analysis is important for AI features. The guardrail metrics are specific and actionable. The permanent holdout group is a best practice that shows the candidate thinks about long-term measurement, not just launch decisions. Overall: 4.5/5, Strong Hire.]
Key Takeaways
- Randomize at the user level for AI features, not at the event level. This avoids contaminating the user experience
- Run AI feature A/B tests for at least 3 weeks to capture weekly patterns and account for novelty effects
- Stratify results by user tenure and model confidence level to understand where the feature works and where it does not
- Account for network effects when the AI feature involves multiple users (like meeting scheduling)
- Use a permanent holdout group (5%) post-launch to measure long-term impact and detect model degradation