The bread and butter of machine learning (ML) are data and models. Because data science research and competitions focus mostly on improving ML models and algorithms, the data is often overlooked. This creates an artificial division between the data and the model in an ML system, which frames two separate approaches to AI: model-centric and data-centric.
The benefits of excellent models
A famous quote often attributed to the statistician George Box says that all models are wrong but some are useful. By extension, some models are extremely useful, and some are, let’s face it, useless. To build a good ML solution, you need a model that captures the underlying dependencies in the data, filtering out the idiosyncratic noise and performing well on new, unseen data.
A model can be improved in various ways. While there are many common recipes and tools for model optimization, in many applications modelling remains as much an art as a science. The usual workflow includes:
- testing various model architectures and specifications, different objective functions and optimization techniques
- fine-tuning the hyper-parameters defining the model structure and the model-training process.
The model-centric approach means dedicating time and resources to iterating on the model. The goal is to improve the accuracy of the ML solution while keeping the training data set fixed.
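A minimal sketch of such a model-centric loop might look like the following. It assumes a toy k-nearest-neighbours classifier and a tiny hand-made dataset; the names `TRAIN`, `VALID`, `knn_predict` and `grid_search` are hypothetical and only serve the illustration. The data never changes between runs; only the hyper-parameter `k` does.

```python
from collections import Counter

# Toy, fixed dataset (hypothetical values): (feature_1, feature_2) -> label.
# The last training point is deliberately noisy to make the search non-trivial.
TRAIN = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.8, 1.1), "a"),
    ((3.0, 3.0), "b"), ((3.2, 2.9), "b"), ((2.9, 3.3), "b"),
    ((1.05, 0.95), "b"),
]
VALID = [((1.1, 0.9), "a"), ((3.1, 3.1), "b"), ((0.9, 1.2), "a"), ((2.8, 3.0), "b")]


def knn_predict(train, point, k):
    """Predict the majority label among the k nearest training points."""
    by_distance = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, valid, k):
    """Share of validation points that are classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)


def grid_search(train, valid, candidate_ks):
    """Model-centric iteration: the data stays fixed, only k changes."""
    scores = {k: accuracy(train, valid, k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)  # first best candidate wins ties
    return best_k, scores


best_k, scores = grid_search(TRAIN, VALID, candidate_ks=[1, 3, 5])
print(best_k, scores)
```

In a real project the same loop would typically run over a proper model library and a cross-validation scheme, but the structure stays the same: fixed data, varying model settings.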
The closer one gets to the realistic limit of model performance, the less room remains for model improvement, and the marginal return on the time and resources spent diminishes. This does not mean that the ML solution as a whole has reached its potential. There may still be vast room for improvement elsewhere.
The benefits of high-quality data
Once you see that you have reached the potential of your model on the given dataset, the usual go-to is the universal “get more training data.” This is often all you need to reach your performance goals. Sometimes, though, what you need is not more data but better data.
The data-centric approach is concerned with improving the overall performance of the ML solution by focusing on the quality and sufficiency of the data while keeping the model training part fixed. It does not propose anything novel or revolutionary. It is a reminder that no model can be better than the data it was trained on, and that improving the quality of the data can lead to much higher performance gains for the overall ML solution.
Data consistency, data coverage, label consistency, feedback timeliness and thoroughness, and model metadata are some of the aspects of the data that can improve your ML solution.
- Data consistency - consistent data is data; anything else is confusion and ambiguity. Are the ETL (extract, transform and load) pipelines providing the clean and systematic data your ML applications need? If the answer is no, more effort should go into improving the relevant processes.
- Data coverage - is the sample you are training your model on representative of the population the model will be used on? If some subpopulations or classes are underrepresented, evaluate the likely effect and, if needed, think about how to compensate. Data filtering, rebalancing, or data augmentation often help. Another aspect of coverage is content: are all the characteristics needed to discriminate between observations present in your dataset? Do you need additional features for your ML task, and can you get them?
- Label consistency - this is a huge issue for any supervised ML task. From the correct definition of the labels to the accurate labelling of the dataset, every aspect can strongly affect the outcome of model training. There are multiple strategies and techniques for improving the labels in your project, and it is always a good idea to spend some time checking label quality manually, even on a very small subset of the data.
- Monitoring data - once deployed to production, the ML system is not done. Model performance will inevitably deteriorate due to data or concept drift. Setting up good monitoring for your model is the first line of defence against this. Often one cannot foresee how the input data may shift or how the performance of the model may degrade, so monitoring a wider range of indicators and subpopulations may reveal underlying changes faster.
- Model metadata - a high-quality ML system also depends on transparency and reproducibility. Model performance metrics and the means of reproducing results can together be called model metadata, and they also ease the work of model experimentation and optimization.
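Two of the checks above, data coverage and label consistency, can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the expected population shares, the 10-percentage-point tolerance, and the function names `coverage_gaps` and `conflicting_labels` are all hypothetical.

```python
from collections import Counter, defaultdict


def class_shares(labels):
    """Return the share of each class in a list of labels."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def coverage_gaps(train_labels, population_shares, tolerance=0.10):
    """Classes whose training share differs from the expected share by more than `tolerance`."""
    train_shares = class_shares(train_labels)
    gaps = {}
    for label, expected in population_shares.items():
        observed = train_shares.get(label, 0.0)
        if abs(observed - expected) > tolerance:
            gaps[label] = (observed, expected)
    return gaps


def conflicting_labels(records):
    """Find inputs that are identical after normalization but carry different labels."""
    labels_by_input = defaultdict(set)
    for text, label in records:
        labels_by_input[text.strip().lower()].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_input.items()
        if len(labels) > 1
    }


# Hypothetical labelled records and assumed population shares.
records = [
    ("Great product", "positive"),
    ("great product ", "negative"),
    ("Arrived late", "negative"),
    ("Works fine", "positive"),
]
population_shares = {"positive": 0.5, "negative": 0.3, "neutral": 0.2}

gaps = coverage_gaps([label for _, label in records], population_shares)
conflicts = conflicting_labels(records)
print(gaps)
print(conflicts)
```

Here the "neutral" class is missing from the training sample altogether, and one input appears twice with contradictory labels. Both findings point at data work rather than model work.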
Business and analytic tradeoffs
How do you strike the right balance between improving your code and improving the quality of your data? As with any other decision, you can put some data to use.
Analyze your processes and measure the ratio of time spent working on data to time spent working on code when improving the accuracy of your ML applications. Time-box the model optimization part, put the model into production once you reach satisfactory results, and start collecting feedback to gain insight into your model and improve your data set. Have the MLOps team prioritize high-quality data throughout all phases of the ML project.
It might also be worth reconsidering the composition of your ML teams. How many data engineers and analysts do you have compared with ML engineers and modellers?
This can be generalized further at an organizational level for any decision concerning your data assets and ML projects. Build and maintain better data infrastructure instead of investing in more ML projects. And consider how better data quality and infrastructure can improve the profitability of the undertaken ML projects.
Where to go from here?
Starting from the investigation phase of the project, spend some time estimating the upper feasible limit on the performance of the model you are going to build. If this is a frequently occurring ML task, check the literature for the level other data scientists have already achieved. Alternatively, take a small sample and measure human-level performance on it. Either can serve as a guideline for feasible model performance on the task at hand.
Once realistic benchmarks for the output of your ML project are set up front and the first model prototype is ready, carefully analyze what is missing to reach them. A quick analysis of your model's errors, a comparison with human-level performance, and a look into the potential gaps can tell you whether it is worth continuing to train and optimize your model or whether it is better to spend more time on collecting additional data, better labelling, or feature creation. Iterate.
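As a rough sketch of this kind of error analysis, one can compute the model's error rate per slice of the data and compare it with a human-level benchmark for the same slice. The slice names, predictions, and human error rates below are invented for illustration, and `error_rate_by_slice` and `gaps_to_benchmark` are hypothetical helper names.

```python
from collections import defaultdict


def error_rate_by_slice(examples):
    """examples: iterable of (slice_name, true_label, predicted_label)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for slice_name, truth, prediction in examples:
        totals[slice_name] += 1
        errors[slice_name] += int(truth != prediction)
    return {name: errors[name] / totals[name] for name in totals}


def gaps_to_benchmark(model_errors, human_errors):
    """Model error minus human-level error per slice, largest gap first."""
    gaps = {
        name: model_errors[name] - human_errors.get(name, 0.0)
        for name in model_errors
    }
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Hypothetical validation results: (slice, true label, model prediction).
examples = [
    ("clean_audio", "yes", "yes"), ("clean_audio", "no", "no"),
    ("clean_audio", "yes", "yes"), ("clean_audio", "no", "yes"),
    ("clean_audio", "yes", "yes"),
    ("car_noise", "yes", "no"), ("car_noise", "no", "no"),
    ("car_noise", "yes", "no"), ("car_noise", "no", "no"),
]
# Assumed human-level error rates measured on a small sample per slice.
human_errors = {"clean_audio": 0.0, "car_noise": 0.25}

model_errors = error_rate_by_slice(examples)
ranked = gaps_to_benchmark(model_errors, human_errors)
print(ranked)
```

The slice with the largest gap to the human benchmark is a natural candidate for collecting more data, relabelling, or adding features before any further model tuning.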
What will help in moving through these phases effectively is a data-centric infrastructure for the ML solution. What you need here is an automated retraining and deployment process with integrated model monitoring, so that feedback on your model and new increments of training data arrive quickly and can trigger model retraining or reworking. For this purpose, the project requires a mature MLOps infrastructure that provides timely, consistent, high-quality data for your system. Tools and expertise for building full MLOps pipelines are accumulating quickly to meet the new requirements and demand in the field of production ML.
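One possible shape for such a monitoring trigger is sketched below, using the Population Stability Index (PSI) to compare a reference sample of one input feature with a live sample. The choice of PSI, the bin edges, and the 0.2 threshold (a common rule of thumb, not a universal constant) are assumptions made for this example, as are the function names.

```python
import math


def bin_shares(values, bin_edges, eps=1e-6):
    """Share of values per bin; empty bins get `eps` to avoid log(0)."""
    counts = [0] * (len(bin_edges) + 1)
    for value in values:
        index = sum(value >= edge for edge in bin_edges)  # bin the value falls into
        counts[index] += 1
    return [max(count / len(values), eps) for count in counts]


def psi(reference, live, bin_edges):
    """Population Stability Index between a reference and a live sample."""
    ref_shares = bin_shares(reference, bin_edges)
    live_shares = bin_shares(live, bin_edges)
    return sum(
        (l - r) * math.log(l / r) for r, l in zip(ref_shares, live_shares)
    )


def needs_retraining(reference, live, bin_edges, threshold=0.2):
    """Flag the feature as drifted when PSI exceeds the assumed threshold."""
    return psi(reference, live, bin_edges) > threshold


# Hypothetical feature values at training time and in production.
reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
shifted = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
edges = [3.5, 6.5]

print(needs_retraining(reference, reference, edges))
print(needs_retraining(reference, shifted, edges))
```

In a full pipeline a check like this would run on a schedule for many features and subpopulations, and a positive result would open a review or start the retraining job rather than just print a flag.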
Prioritize data quality over data quantity. Prioritizing the creation and maintenance of systematic, high-quality data for your business would unlock the potential for better analytics and better ML solutions for your organization. Instead of investing in models for each of the many use cases you want to address, put your data at the centre of your decision-making. Build the data infrastructure that allows you to create cutting-edge ML solutions, reach the quality needed to make the ML investment profitable, and protect your solutions from deteriorations in performance that are costly or hard to fix.
And know that you are not alone in this. Andrew Ng is on a quest for higher data awareness, and more and more useful content on the topic can be found on the Data-Centric AI Resource Hub.
The data should show the way
The data-centric approach isn’t anything new. Applied data scientists and ML practitioners have always known that data is the guiding light, the main ingredient in their recipes. What the data-centric approach emphasizes is that, in many applications, the marginal return on data-quality work may be higher than that on model-related investment.
Let your data show you the way and allow a gradual shift from a model-centric to a data-centric mindset to help you rethink how ML projects are formulated and implemented.