Worked Example: Building a Recommendation System
A complete worked answer to 'Walk me through building a recommendation system,' covering problem framing, model selection, evaluation, and deployment.
The Question
The question: 'Walk me through how you would build a recommendation system for a streaming platform.' This is a classic technical question that tests whether you can think through the full ML product lifecycle: problem framing, data, model selection, evaluation, and deployment. It is commonly asked at Netflix, Spotify, YouTube, and any company with a content feed.
The trap in this question is that most candidates jump straight to 'collaborative filtering.' The interviewer wants to see you frame the problem as a product problem first, then connect it to a technical approach.
Worked Answer: Problem Framing
"Before I discuss models, I want to frame the recommendation problem as a product problem. On a streaming platform, recommendation serves three distinct use cases: what to watch next (immediate intent, high engagement signal), homepage discovery (no immediate intent, exploring the catalog), and re-engagement (bringing back churning users with personalized notifications). Each use case requires a different optimization target and potentially a different model."
"I will focus on homepage discovery because it is the highest-impact use case: it determines the first impression every time a user opens the app. Netflix has publicly shared that 80% of viewing hours come from recommendations, and the homepage is where most of those recommendations surface. The product goal is: increase the percentage of homepage sessions where the user starts watching something within 2 minutes."
[Interviewer note: Strong problem framing. The candidate separated recommendation into three distinct use cases with different optimization targets, rather than treating it as a monolithic problem. Picking homepage discovery and citing the Netflix 80% stat shows domain knowledge. Defining a specific product goal ('start watching within 2 minutes') connects the technical work to a measurable outcome.]
Worked Answer: Data and Model Selection
"The data we need falls into three categories. Implicit signals: viewing history, watch time, completion rate, browsing patterns, search queries. These are the richest signals because they represent actual behavior. Explicit signals: ratings, thumbs up/down, 'my list' additions. These are sparser but higher intent. Content features: genre, cast, director, release year, synopsis embeddings, thumbnail features. These help with cold-start for new content."
"For the model architecture, I would propose a two-stage system. Stage 1 is candidate generation: a broad retrieval model that narrows the full catalog (say, 50,000 titles) down to 500 candidates. This would use a two-tower neural network: one tower encodes the user (from their implicit and explicit signals) and one tower encodes the content (from content features). The model is trained to bring user and content embeddings close together when the user watched and enjoyed the content. This is efficient at scale because the content embeddings can be pre-computed."
"Stage 2 is ranking: a more complex model that takes the 500 candidates and ranks them. This would be a deep learning ranking model that combines the user and content features with contextual features: time of day, device type, day of week, and how recently the user finished watching something. The ranking model optimizes for predicted watch time rather than just click-through, because we want engagement quality, not just clicks."
"I considered a simpler approach: pure collaborative filtering using matrix factorization. The advantage is simplicity and interpretability. I rejected it as the primary model because it struggles with new content (cold start) and cannot incorporate contextual features. However, I would use collaborative filtering signals as input features to the neural ranking model, so we get the best of both approaches."
[Interviewer note: This is a well-reasoned technical answer at the right level of abstraction. The two-stage architecture (candidate generation + ranking) is industry standard for recommendation at scale. The two-tower model for retrieval and the contextual ranking model are appropriate choices. The candidate discussed data requirements clearly, considered and rejected a simpler approach with justification, and showed awareness of practical considerations like pre-computing content embeddings. Score: 4.5/5 on technical depth.]
Worked Answer: Evaluation and Deployment
"For offline evaluation, I would use: precision@K and recall@K (does the model surface items the user actually watches?), NDCG (does the model rank better items higher?), catalog coverage (are we recommending a diverse set of titles, or just the same popular content?), and novelty (are we surfacing content the user would not have found on their own?). I would evaluate on a held-out test set of the most recent 2 weeks of data, since recommendation models need to be evaluated on temporal holdouts, not random splits."
"For online evaluation, I would A/B test against the current recommendation system with these metrics: primary: watch-within-2-minutes rate (our product goal). Secondary: average watch time per session, catalog diversity of viewed content, and churn rate. Guardrails: the new system must not increase content-start abandonment rate (user starts watching but stops within 5 minutes, indicating the recommendation was misleading). The A/B test should run for at least 3 weeks to capture weekly patterns."
"Deployment considerations: the candidate generation model needs to refresh daily to incorporate new viewing data. The ranking model should serve predictions in under 100ms. We need a fallback to popularity-based recommendations for new users (cold start). And we should monitor for feedback loops: if the model only recommends popular content, popular content gets more views, which makes the model recommend it more. I would add an exploration component (10% of recommendations are deliberately diverse) to prevent this."
[Interviewer note: Strong evaluation section. Temporal holdout for offline eval is a subtle but important detail that many candidates miss. The online metrics are well-chosen with a clear primary metric and relevant guardrails. The deployment section shows production awareness: latency requirements, daily refresh, cold start handling, and feedback loop mitigation. The exploration component shows awareness of recommendation system dynamics. Overall: 4.5/5.]
Key Takeaways
- Frame the recommendation problem as a product problem first: different use cases (next-watch, discovery, re-engagement) need different optimization targets
- A two-stage architecture (candidate generation + ranking) is industry standard for recommendation at scale
- Evaluate recommendation models with temporal holdouts, not random splits. Recent data is the right test set
- Monitor for feedback loops in recommendation systems: popular content getting more popular creates a diversity problem
- Cold-start handling (new users, new content) must be part of the design. Fall back to popularity-based recommendations and add an exploration component