This is one of my favorite interviews, and one in which you can shine and level up your career. A few important notes:
- Remember, the goal of the ML system design interview is NOT to measure your deep, detailed knowledge of different ML algorithms, but your ability to zoom out and design a production-level ML system that can be deployed as a service within a company's ML infrastructure.
- Deploying deep learning models in production is challenging, and it goes beyond training models with good performance. Several distinct components need to be designed and developed in order to deploy a production-level deep learning system.
- For more insight on the different components above, you can check out the following resources:
- Full Stack Deep Learning course
- Production Level Deep Learning
- Machine Learning Systems Design
- Stanford course on ML system design [TBA]
Approaching an ML system design problem follows a flow similar to generic software system design. For more insight on the general system design interview, you can check out, e.g., Grokking the System Design Interview and the System Design Primer.
I developed the following design flow that worked pretty well during my own interviews:
- Problem Formulation
    - What does it mean?
    - Use cases
    - Requirements
    - Assumptions
    - Do we need ML to solve this problem?
        - Trade-off between impact and cost
            - Costs: data collection, data annotation, compute
        - If yes, go to the next topic; if no, follow a general system design flow.
- ML Metrics (Offline and Online)
    - Accuracy metrics
        - Imbalanced data?
    - Latency
    - Problem-specific metrics (e.g. CTR)
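On imbalanced data, raw accuracy can be misleading: a classifier that always predicts the majority class looks accurate while catching nothing. A minimal sketch of precision, recall, and F1 in plain Python (the toy 95/5 class split is illustrative only):

```python
def precision_recall_f1(y_true, y_pred):
    # counts over positive-class predictions; labels are 0/1
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 95% negatives: always predicting 0 scores 95% accuracy but 0 on all three metrics
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0)
```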
- MVP Logic (High-Level Design)
    - Model-based vs. rule-based logic
        - Pros and cons, and decision
        - Note: always start as simple as possible and iterate
    - Propose a simple model (e.g. a binary logistic regression classifier)
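As a sketch of such an MVP model, here is binary logistic regression trained with plain stochastic gradient descent. This is pure Python for illustration; in an interview you would normally just name a library implementation (e.g. scikit-learn) rather than write it out:

```python
import math

def train_logreg(X, y, lr=0.1, epochs=500):
    # plain SGD on the logistic loss; X is a list of feature vectors, y holds 0/1 labels
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - yi                     # gradient of log-loss w.r.t. z
            for j, xj in enumerate(xi):
                w[j] -= lr * err * xj
            b -= lr * err
    return w, b

def predict(w, b, xi):
    # decision boundary at z = 0, i.e. probability 0.5
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 if z >= 0 else 0
```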
- Data Pipeline
    - Needs
        - Type (e.g. image, text, video) and volume
    - Sources
        - Availability and cost
    - Labeling (if needed)
        - Labeling cost
    - Feature Generation
        - What to choose as features, and how to choose them
        - Feature representation
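One feature-representation trick worth knowing here is the hashing trick, which maps arbitrary categorical features into a fixed-size vector without maintaining a vocabulary. A minimal sketch (the bucket count and feature names are illustrative):

```python
import hashlib

def hash_features(tokens, n_buckets=16):
    # hashing trick: each categorical feature string is hashed to a bucket index,
    # so the vector size stays fixed regardless of how many distinct features exist
    vec = [0] * n_buckets
    for tok in tokens:
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % n_buckets
        vec[idx] += 1
    return vec

hash_features(["user:42", "country:us"])  # fixed-length count vector
```

The trade-off is occasional hash collisions in exchange for bounded memory, which is why this shows up in large-scale pipelines (e.g. CTR models).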
- Training
    - Data splits (train, dev, test)
        - Portions
        - How to choose a test set
    - Debugging
    - Iterate over the MVP model (if needed)
        - Data augmentation
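A minimal sketch of a shuffled train/dev/test split (the 80/10/10 portions and fixed seed are illustrative choices; for time-dependent data you would split by time instead of shuffling):

```python
import random

def split_data(examples, dev_frac=0.1, test_frac=0.1, seed=42):
    # shuffle once with a fixed seed for reproducibility, then cut into portions
    rng = random.Random(seed)
    examples = examples[:]  # avoid mutating the caller's list
    rng.shuffle(examples)
    n = len(examples)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = examples[:n_test]
    dev = examples[n_test:n_test + n_dev]
    train = examples[n_test + n_dev:]
    return train, dev, test
```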
- Inference (Online)
    - Data processing and verification
    - Prediction module
    - Serving infra
    - Web app
- Scaling, Monitoring, and Updates
    - Scaling for increased demand (same as in distributed systems)
        - Scaling the web app and serving system
        - Data partitioning
    - Data parallelism
    - Model parallelism
    - A/B testing and deployment
        - How to A/B test?
            - What portion of users?
            - Control and test groups
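One common way to read out an A/B test on a rate metric (e.g. CTR or conversion rate) is a pooled two-proportion z-test between the control and test groups; |z| > 1.96 corresponds to significance at the 5% level. A minimal sketch, assuming independent samples:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    # pooled two-proportion z-statistic: group A is control, group B is test;
    # conv_* are conversion counts, n_* are group sizes
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# 10% vs 15% conversion on 1000 users each: clearly significant
two_proportion_z(100, 1000, 150, 1000)
```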
I have observed that certain topics come up frequently, or can be used as part of the system's logic. Here are some of the important ones:
- Collaborative Filtering (CF)
    - User-based, item-based
    - Cold start problem
    - Matrix factorization
- Content-based filtering
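Matrix factorization can be sketched as SGD on the observed (user, item, rating) triples, learning low-dimensional user and item factors whose dot product approximates the rating. The hyperparameters and toy ratings below are illustrative:

```python
import random

def matrix_factorization(ratings, n_users, n_items, k=2, lr=0.02, reg=0.01, epochs=2000):
    # ratings: list of (user, item, rating) triples; learn latent factors U, V by SGD
    rng = random.Random(0)
    U = [[rng.uniform(0.05, 0.15) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(0.05, 0.15) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                # gradient step on squared error with L2 regularization
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V
```

Unobserved (user, item) cells are then scored with the same dot product, which is what makes this useful for recommendation.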
- Preprocessing
    - Normalization, tokenization, stop words
- Word Embeddings
    - Word2Vec, GloVe, ELMo, BERT
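A minimal preprocessing sketch covering normalization, tokenization, and stop-word removal (the stop-word list here is a tiny illustrative subset; real pipelines use a library list or learn one from frequencies):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "and", "of"}  # illustrative subset only

def preprocess(text):
    # lowercase, strip punctuation via a simple token regex, drop stop words
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

preprocess("The cat is on the mat!")  # ['cat', 'on', 'mat']
```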
- Text classification and sentiment analysis
- NLP specialist topics:
    - Language Modeling
    - Part-of-speech tagging
        - POS HMM
        - Viterbi algorithm and beam search
    - Named entity recognition
    - Topic modeling
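Viterbi decoding for a POS-tagging HMM can be sketched as dynamic programming over tag sequences: keep the best score for each state at each step, plus a backpointer to recover the path. Probabilities are kept in linear space for brevity; real implementations use log-space to avoid underflow. The toy noun/verb model in the usage is illustrative:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            best_prev, best_score = max(
                ((prev, V[t - 1][prev] * trans_p[prev][s]) for prev in states),
                key=lambda x: x[1])
            V[t][s] = best_score * emit_p[s][obs[t]]
            back[t][s] = best_prev
    # trace backpointers from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

Beam search is the approximate cousin: keep only the top-k states per step instead of all of them.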
- Speech Recognition Systems
    - Feature extraction, MFCCs
    - Acoustic modeling
        - HMMs for AM
        - CTC algorithm (advanced)
    - Language modeling
        - N-grams vs. deep learning models (trade-offs)
        - Out-of-vocabulary problem
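A minimal bigram language model with add-one (Laplace) smoothing, which also gives unseen histories and words a non-zero probability, one simple answer to the out-of-vocabulary problem (the toy corpus is illustrative):

```python
from collections import Counter

def train_bigram(corpus):
    # corpus: list of token lists; add-one smoothing over the training vocabulary
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent          # sentence-start marker
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    def prob(prev, word):
        # Counter returns 0 for unseen keys, so OOV histories still get 1 / vocab_size
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob

prob = train_bigram([["the", "cat", "sat"], ["the", "dog", "ran"]])
```

This is the trade-off the bullet points at: n-grams are trivial to train and serve, but smoothing tricks like this are what keep them from assigning zero probability, whereas neural models generalize better at higher serving cost.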
- Dialog systems and chatbots
- Machine Translation
    - Seq2seq models, NMT
Note: the reason I list more topics here is that this was my focus in my own interviews.
- CTR prediction
- Ranking algorithms
- Search
    - PageRank
    - Autocomplete for search
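PageRank can be sketched as power iteration over the random-surfer model: with probability d the surfer follows an outgoing link, otherwise jumps to a random page (d = 0.85 is the classic damping choice; the toy link graph is illustrative):

```python
def pagerank(links, damping=0.85, iters=50):
    # links: {page: [pages it links to]}; returns a rank per page summing to 1
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}  # random-jump mass
        for p, outs in links.items():
            if not outs:
                # dangling page: spread its rank uniformly over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```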
- Image classification
- Object Tracking
- Popular architectures (AlexNet, VGG, ResNet)
- [TBD]
- Why and when to use transfer learning
- How to do it
    - Depending on dataset size and similarity
Once you learn the basics, I highly recommend checking out different companies' blogs on ML systems. You can refer to some of those resources in the ML at Companies subsection below.
- AI at LinkedIn
    - Intro to AI at LinkedIn
    - Building The LinkedIn Knowledge Graph
    - The AI Behind LinkedIn Recruiter search and recommendation systems
    - A closer look at the AI behind course recommendations on LinkedIn Learning, Part 1
    - A closer look at the AI behind course recommendations on LinkedIn Learning, Part 2
    - Communities AI: Building communities around interests on LinkedIn
    - LinkedIn's follow feed
    - XNLT for A/B testing
- ML at Google
    - ML pipelines with TFX and KubeFlow
    - How Google Search works
        - PageRank algorithm (intro to PageRank, the algorithm that started Google)
    - TFX production components
    - Google Cloud Platform Big Data and Machine Learning Fundamentals
- Scalable ML using AWS
- ML at Facebook
    - Machine Learning at Facebook Talk
    - Scaling AI Experiences at Facebook with PyTorch
    - Understanding text in images and videos
    - Protecting people
    - Ads
        - Ad CTR prediction
        - Practical Lessons from Predicting Clicks on Ads at Facebook
    - Newsfeed Ranking
    - Photo search
    - Social graph search
    - Recommendation
    - Live videos
    - Large Scale Graph Partitioning
    - TAO: Facebook's Distributed Data Store for the Social Graph (Paper)
    - NLP at Facebook
- ML at Netflix