Suchismita Sahu
Jun 24, 2022

Machine Learning System Design

We need to go through the following phases while designing an ML system:
- Clarifying Requirements
- Architecture
- Data
- Model
- Serving

Clarifying Requirements
The first thing we should do is clarify the requirements. For example, the prompt might be: “Design a system that recommends our products to users who have a profile.”

Some questions that we should ask to understand the scope:
- How much data would we have access to? [for smaller datasets, less complex models are more appropriate; for bigger datasets, larger models such as deep neural networks work well]
- How much time do we have to complete the task?
- What are the hardware constraints, and how much compute power is available? [if we are limited on hardware, we should use simpler models]
- Do we need a model that is quick to respond to a request, or do we need a model that is extremely accurate? [deep models are usually slower but more accurate than traditional ML models; this question shows the interviewer that you are thinking about the trade-offs]
- Do we need to think about retraining the model?

Metrics
Now that we have a clear idea of the use case and have asked a few clarifying questions, we can use this information to determine the best metric(s) to use when modeling. We should always give at least two metrics: one offline and one online.

Offline metrics are those we use to score the model while we are building it, before it is put into production and shown to users. This is the typical scenario in research or tutorials, where you split your dataset into three sets: train, eval, and test. Some examples of offline metrics are AUC, F1, R², and MSE.
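As a quick illustration, here is how these metrics might be computed with scikit-learn; the label and score arrays are hypothetical placeholders:

```python
from sklearn.metrics import roc_auc_score, f1_score, r2_score, mean_squared_error

# Hypothetical held-out labels and model outputs for a classifier.
y_true = [0, 1, 1, 0, 1]                          # ground-truth labels
y_score = [0.2, 0.8, 0.6, 0.3, 0.9]               # predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # thresholded predictions

print("AUC:", roc_auc_score(y_true, y_score))  # ranking quality
print("F1: ", f1_score(y_true, y_pred))        # balance of precision and recall

# For regression tasks, R² and MSE play the analogous role.
y_reg_true = [3.0, 1.5, 4.2]
y_reg_pred = [2.8, 1.7, 4.0]
print("R²: ", r2_score(y_reg_true, y_reg_pred))
print("MSE:", mean_squared_error(y_reg_true, y_reg_pred))
```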

Online metrics are the scores we get from the model once it is in production serving requests. Online metrics could be the click-through rate or how long a user spends watching a recommended video. These metrics are use-case specific: we need to think about how a company would evaluate whether the model was useful to its users once it is in production.
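For example, click-through rate is simply clicks divided by impressions, aggregated from serving logs; a minimal sketch, assuming hypothetical log records:

```python
# Hypothetical serving-log records: one entry per recommendation shown.
logs = [
    {"item": "A", "shown": 1, "clicked": 1},
    {"item": "B", "shown": 1, "clicked": 0},
    {"item": "A", "shown": 1, "clicked": 0},
]

impressions = sum(r["shown"] for r in logs)
clicks = sum(r["clicked"] for r in logs)
ctr = clicks / impressions if impressions else 0.0
print(f"CTR: {ctr:.2%}")  # fraction of shown recommendations that were clicked
```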

Another useful set of metrics is non-functional metrics.

Non-functional metrics:
- Training speed and scalability to very large datasets
- Extensibility to new techniques
- Tooling for easy training, debugging, evaluation and deployment
- Cost effectiveness
- Performance
- Reliability

Architecture
This phase includes the following steps:
- Data Collection and Target Identification
- Data Preprocessing
- Model

First, identify the target variable and how you would collect and label it. In the recommendations example, the target variable would be whether, historically, a user liked a product from the company. There are typically two ways to collect this target value: implicitly or explicitly. An example of explicit target collection would be checking our logs for whether someone bought a product; buying it means they liked it enough to pay for it. Implicit target collection, on the other hand, would be a user saving a product for later or viewing a product a certain number of times. Note that explicit collection is usually the best way to gather the target variable. If we think we can collect the target variable implicitly, we should discuss this with the interviewer and talk through the pros and cons of each implicit suggestion. A sketch of both approaches follows.
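To make this concrete, here is one way the target could be derived from interaction logs; the event names and the view threshold are hypothetical:

```python
from collections import Counter

# Hypothetical interaction logs: (user_id, product_id, event).
events = [
    ("u1", "p1", "purchase"),
    ("u2", "p1", "view"),
    ("u2", "p1", "view"),
    ("u2", "p1", "view"),
    ("u3", "p2", "save_for_later"),
]

view_counts = Counter((u, p) for u, p, e in events if e == "view")

def label(user: str, product: str) -> int:
    """1 = the user liked the product, 0 = no evidence of interest."""
    for u, p, e in events:
        if (u, p) == (user, product):
            # Explicit signal: the user bought the product.
            if e == "purchase":
                return 1
            # Implicit signals: a save, or repeated views above a threshold.
            if e == "save_for_later" or view_counts[(u, p)] >= 3:
                return 1
    return 0

print(label("u1", "p1"), label("u2", "p1"), label("u3", "p2"))  # 1 1 1
```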
Second: Discuss features and possible feature crosses. Some possible features for the recommendations example are user location, user age, previously viewed items, item title, item freshness, etc. A small feature-cross sketch follows.
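A feature cross combines two or more features into one so the model can capture their interaction; a minimal sketch crossing user location with an age bucket (both hypothetical):

```python
def age_bucket(age: int) -> str:
    # Coarse, hypothetical age buckets.
    return "<25" if age < 25 else "25-40" if age <= 40 else ">40"

def cross(location: str, age: int) -> str:
    # The crossed feature captures interactions a linear model would miss,
    # e.g. users in one city under 25 behaving unlike either group alone.
    return f"{location}_x_{age_bucket(age)}"

print(cross("Delhi", 22))   # Delhi_x_<25
print(cross("Mumbai", 35))  # Mumbai_x_25-40
```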
Third: Feature Engineering:
- Train-test split.
- Handle missing values and outliers: usually we can drop outliers, and if there is a lot of data we can also drop rows with missing values; if data is limited, we can impute missing values with the average (or another imputation method).
- Balance positive and negative training examples: if we expect a large class imbalance, we should discuss ways to address it, such as over-sampling, under-sampling, and techniques like SMOTE.
- Normalize certain columns.
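A condensed sketch of these steps with pandas and scikit-learn; the DataFrame and its columns are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Hypothetical feature table with a binary target column "liked".
df = pd.DataFrame({
    "age": [22, 35, None, 41, 29, 120],  # None is missing, 120 is an outlier
    "views": [3, 1, 7, 2, 5, 4],
    "liked": [1, 0, 1, 0, 0, 0],
})

# Drop outliers, then impute missing values with the column mean.
df = df[df["age"].isna() | (df["age"] < 100)].copy()
df["age"] = df["age"].fillna(df["age"].mean())

# Split before fitting any statistics, to avoid leakage into the test set.
train, test = train_test_split(df, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Over-sample the minority class in the training set only.
majority = train[train["liked"] == 0]
minority = train[train["liked"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
train = pd.concat([majority, minority_up])

# Normalize numeric columns: fit the scaler on train, apply it to test.
scaler = StandardScaler()
train[["age", "views"]] = scaler.fit_transform(train[["age", "views"]])
test[["age", "views"]] = scaler.transform(test[["age", "views"]])
```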
Fourth: Feature Selection:
If we are using a deep neural network, we do not need explicit feature selection. Otherwise, tree-based estimators can be used to compute feature importances. Additionally, we can use L1 regularization so that some feature coefficients are driven to zero.
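For instance, with scikit-learn; the feature matrix and labels below are synthetic placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # 5 hypothetical features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # only features 0 and 1 matter

# Tree-based feature importances.
forest = RandomForestClassifier(random_state=0).fit(X, y)
print("importances:", forest.feature_importances_.round(2))

# L1 regularization drives coefficients of irrelevant features toward zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("L1 coefficients:", lasso.coef_.round(2))
```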
Fifth: Additional Considerations:
- Biases: Are we sampling from a large enough subset of demographics? If not, we can keep the largest groups and map the rest to an OOV (out-of-vocabulary) bucket, as sketched below.
- Any concerns with privacy or laws? We may need to anonymize or remove data depending on privacy requirements.
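One simple way to implement that grouping; the demographic values and the cutoff are hypothetical:

```python
from collections import Counter

# Hypothetical demographic values observed in the training data.
demographics = ["US"] * 50 + ["IN"] * 40 + ["BR"] * 5 + ["NZ"] * 2 + ["IS"] * 1

counts = Counter(demographics)
MIN_COUNT = 10  # keep only groups we sampled well

def bucket(value: str) -> str:
    # Rare groups are mapped to a shared OOV bucket instead of being dropped.
    return value if counts[value] >= MIN_COUNT else "OOV"

print(sorted({bucket(v) for v in demographics}))  # ['IN', 'OOV', 'US']
```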
