Production Testing methods for Machine Learning features

2 min read · Jun 24, 2022

Batch testing: validates the model in an environment that is different from its training environment. Inference is run on a batch of sample data and scored with metrics of choice, such as accuracy or RMSE. Batch testing can run on various kinds of compute, for example in the cloud, on a remote server, or on a test server. The model is usually delivered as a serialized file, which is loaded as an object and used for inference on the test data.
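As a rough illustration of the load-and-score flow described above, here is a minimal sketch. It assumes the "model" is a toy linear model stored as a pickled dictionary; the names `predict`, `rmse`, `batch_test`, and the sample values are hypothetical and only stand in for a real model and test set.

```python
import math
import pickle
import tempfile
from pathlib import Path


def predict(model, features):
    """Toy linear model (an assumption): weighted sum of features plus a bias."""
    return sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def batch_test(model_path, samples, targets):
    """Load the serialized model from disk and score it on a batch of test data."""
    with open(model_path, "rb") as fh:
        model = pickle.load(fh)
    predictions = [predict(model, x) for x in samples]
    return rmse(targets, predictions)


# Stand-in for the training side: serialize a toy model to a file.
model_file = Path(tempfile.mkdtemp()) / "model.pkl"
with open(model_file, "wb") as fh:
    pickle.dump({"weights": [2.0, -1.0], "bias": 0.5}, fh)

# Test environment: load the file and evaluate on made-up samples.
test_samples = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
test_targets = [2.0, 4.0, -2.5]
batch_rmse = batch_test(model_file, test_samples, test_targets)
```

In practice the serialized file would come from the training pipeline, and the metric would be whichever one suits the use case. Only unpickle files from a trusted source.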

A/B testing: This method is often used in service design (websites, mobile apps, and so on) and for assessing marketing campaigns. For models, statistical techniques chosen to fit the business or operational context are applied to the results to determine which model performs better in production. A/B testing is usually conducted in this manner:
1. Real-time (live) data is split into two sets, Set A and Set B.
2. Set A data is routed to the old model, and Set B data is routed to the new model.
3. To evaluate whether the new model (model B) performs better than the old model (model A), performance metrics (for example, accuracy or precision) are computed for each model, depending on the business use case or operations.
4. Then, we use statistical hypothesis testing:
a. The null hypothesis asserts that the new model does not increase the average value of the monitored business metrics.
b. The alternate hypothesis asserts that the new model increases the average value of the monitored business metrics.
5. Ultimately, we evaluate whether the new model drives a statistically significant boost in specific business metrics.
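The steps above can be sketched roughly as follows. The article does not name a specific test, so this example assumes the business metric is a conversion rate and uses a one-sided two-proportion z-test. The hash-based `assign_group` routing, the 0.05 significance level, and all counts are hypothetical.

```python
import hashlib
import math
from statistics import NormalDist


def assign_group(user_id: str) -> str:
    """Deterministically route a user to 'A' (old model) or 'B' (new model)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def one_sided_z_test(success_a, n_a, success_b, n_b):
    """Return (z, p_value) for the alternate hypothesis rate_B > rate_A.

    Uses the pooled standard error of a two-proportion z-test.
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 1 - NormalDist().cdf(z)  # upper-tail probability
    return z, p_value


# Made-up counts: conversions observed for each model's share of traffic.
z, p_value = one_sided_z_test(success_a=200, n_a=2000, success_b=260, n_b=2000)

# Reject the null hypothesis only if the improvement is statistically significant.
reject_null = p_value < 0.05
```

A different metric (for example, average revenue per user) would call for a different test, such as a t-test on means. The hypotheses keep the same shape.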

Stage test or shadow test: Before a model is deployed to production, it is tested in a replicated, production-like environment (a staging environment). This is especially important for testing the robustness of the model and assessing its performance on real-time data. It is done by deploying the develop branch, or the model to be tested, on a staging server and running inference on the same data as the production pipeline. The only shortcoming is that end users do not see the results of the develop branch, and no business decisions are made from the staging server. The results from the staging environment are evaluated statistically, using suitable metrics, to determine the robustness and performance of the model.
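To make the idea concrete, here is a simplified sketch of shadow testing. It assumes both models run in one process rather than on a separate staging server, which is a simplification of the setup described above. `production_model`, `candidate_model`, and `handle_request` are hypothetical stand-ins.

```python
def production_model(x):
    """Stand-in for the model currently serving users."""
    return 2.0 * x


def candidate_model(x):
    """Stand-in for the model under test (the develop branch)."""
    return 2.0 * x + 0.1


shadow_log = []


def handle_request(x):
    """Serve the production prediction; record the candidate's output on the side."""
    served = production_model(x)
    try:
        shadow = candidate_model(x)
        shadow_log.append({"input": x, "served": served, "shadow": shadow})
    except Exception:
        # A failing candidate must never affect the user-facing response.
        pass
    return served


# Both models receive the same inputs; users only ever see `responses`.
responses = [handle_request(x) for x in [1.0, 2.0, 3.0]]

# Offline evaluation of the logged shadow results with a simple metric.
mean_abs_diff = sum(abs(r["served"] - r["shadow"]) for r in shadow_log) / len(shadow_log)
```

The logged pairs can then be scored against ground truth, once it is available, with whatever metrics suit the model.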


Written by Suchismita Sahu