Worked Example: Improving Smart Reply
Walk through a complete answer to 'How would you improve Gmail Smart Reply?' showing how to analyze an existing AI feature and propose measured improvements.
The Question
Here is the question: 'You are a PM on the Gmail team at Google. How would you improve Smart Reply?' This is a different flavor of product sense question. Instead of designing a new AI product, you are improving an existing one. The bar is higher here because the interviewer expects you to understand how the current product works, identify specific weaknesses, and propose improvements that are technically feasible within the existing architecture.
Smart Reply is Google's feature that suggests short responses to emails. It uses a sequence-to-sequence model trained on email data to generate three reply suggestions. The feature launched in 2017 and has been iterated on since. Knowing this context before walking into the interview is table stakes.
Worked Answer: Audience and Problem Analysis
"Before proposing improvements, I want to understand who uses Smart Reply and where it falls short. Based on public data, Smart Reply is used in about 12% of Gmail mobile replies. The primary users are mobile users who want to reply quickly without typing. I see three pain points from user research signals."
"First, the suggestions are often too generic: 'Sounds good!' or 'Thanks!' for emails that require a substantive response. Second, the suggestions do not reflect the user's personal voice. A senior executive and a college student get the same casual tone. Third, Smart Reply does not handle complex emails well. For emails with multiple questions, it only addresses one, leaving the sender's other questions unanswered."
"I want to focus on the second problem: personalization of tone and style. This is the highest-impact improvement for three reasons: it addresses a pain point across all user segments, not just power users; it directly increases the acceptance rate of suggestions, because users are more likely to send a reply that sounds like them; and it leverages a capability that has become more tractable with recent advances in few-shot learning and style transfer."
[Interviewer note: Strong problem identification. The candidate cited a real usage metric (12% of mobile replies is approximately correct), identified three genuine pain points, and selected one with a clear rationale tied to user impact and technical feasibility. This shows they did their homework on the existing product.]
Worked Answer: Intelligence and Approach
"The current Smart Reply model is a sequence-to-sequence model that generates replies based on the incoming email content. To add personalization, I would propose a two-part approach. First, build a user style profile by analyzing each user's sent emails to extract: average reply length, formality level, common phrases, greeting and sign-off patterns, and emoji usage. This profile would be generated offline and updated weekly."
"Second, modify the generation step to condition on the user style profile. In practice, this means using a fine-tuned model that takes both the incoming email and the user's style profile as input. With LLMs, this could be implemented as in-context learning: include 3-5 examples of the user's actual past replies as few-shot examples in the prompt. This approach avoids the cost of per-user fine-tuning while still capturing individual voice."
"The key technical risk is privacy. We are analyzing users' sent emails to build style profiles. This must be done on-device or with explicit consent and clear data retention policies. I would propose an on-device approach where the style profile is computed locally and only a compact embedding (not raw text) is sent to the server for generation conditioning. This mirrors how Google handles Gboard personalization."
[Interviewer note: The candidate addressed the technical approach at the right level of abstraction, proposed a realistic implementation (few-shot learning with style profiles), and proactively raised the privacy concern with a concrete mitigation. The on-device processing mention shows knowledge of how Google handles sensitive data in production.]
Worked Answer: Design and Evaluation
"For the user experience, I would make personalization implicit by default. Users should not have to configure anything. The system learns their style from their sent emails automatically. For users who want control, add a setting in Gmail preferences: 'Smart Reply tone' with options like 'Match my style (recommended),' 'Professional,' 'Casual,' and 'Brief.' This gives advanced users control without adding complexity for the default case."
"The failure mode I am most concerned about is the model generating replies that match the user's informal style in a professional context, or vice versa. For example, using casual language to reply to your VP. I would add context awareness: detect the relationship between sender and recipient (using email frequency, organizational data from Google Workspace, and email thread formality) and adjust the formality level accordingly. If the model is not confident about the appropriate formality, it falls back to the current non-personalized suggestions."
"For evaluation, I would measure: acceptance rate (percentage of Smart Reply suggestions that users tap), edit rate (percentage of accepted suggestions that users modify before sending, lower is better), and per-user acceptance rate over time (which should increase as the style profile improves). The success threshold is a 15% relative increase in acceptance rate and a 10% relative decrease in edit rate. I would A/B test with a 4-week holdout: style profiles need time to build, so a 1-week test would underestimate the treatment effect."
[Interviewer note: Good metric selection. The acceptance rate and edit rate combination is exactly how Smart Reply is evaluated internally. The 4-week test duration shows awareness that personalization features need longer evaluation periods. The formality context-awareness is a thoughtful design addition. Score: 4/5.]
Final Score and Debrief
Overall score: 4.0/5 (Hire). This answer shows strong product sense applied to an existing AI feature. The candidate demonstrated knowledge of the current product, identified a real and high-impact improvement, proposed a technically sound approach, addressed privacy proactively, and defined clear evaluation criteria.
To reach 4.5+: The candidate could have discussed the multi-question problem as a follow-up improvement (addressing all questions in a complex email), explored international considerations (Smart Reply personalization works differently across languages and cultures), or proposed a specific experiment for measuring whether personalized replies affect email recipient satisfaction (not just sender convenience). The best answers show second-order thinking about how changes affect the broader ecosystem.
Key Takeaways
- For 'improve an existing AI product' questions, demonstrate knowledge of how the current product works before proposing changes
- Pick one focused improvement rather than listing many. Depth beats breadth in interview answers
- Proactively address privacy concerns for any feature that uses personal data. This shows production maturity
- Design for context-aware behavior: the same AI feature should behave differently based on the situation (professional vs. casual email)
- For personalization features, use longer A/B test durations (4+ weeks) because the model needs time to learn user patterns