Harnessing the power of analytics and data is a competitive advantage that has proven its tremendous potential for business development. A McKinsey discussion paper from 2017 claims that organizations that managed to put AI and machine learning into production saw between a 3% and 15% profit margin increase. Moreover, in the past couple of decades, multiple companies like Netflix and Amazon rose to the status of global market leaders almost entirely due to their investment and restructuring around the use of analytics and data. Nowadays, many companies recognize this as a profitable path for business growth, but few are aware of the difficulties one might face as they develop data science at scale.
Progress can get out of hand
Around the turn of the century, the financial industry was one of the first sectors to use machine learning models on a regular basis. The typical setup was a small or medium-sized team responsible for a limited number of models with relatively long lifecycles (model reviews done once every 6 or 12 months). This is a manageable situation, but also one that leaves the team plenty of room for manual work and little incentive to automate.
Nowadays, companies that start developing their analytics capacity can quickly find themselves in a situation where a small or medium-sized team has to manage a growing number of complex models with short lifecycles. As the number grows, manually reviewing and deploying new versions of models becomes impossible. To avoid ending up in such a situation, it is recommended practice to adopt MLOps and automation by design.
What is MLOps?
MLOps is a set of practices focused on the operationalization of ML models. Its main aim is to deploy and maintain machine learning models in production reliably and efficiently. The name combines "machine learning" ("ML" for short) with the continuous-development practice of DevOps from the software development field. Machine learning models are often developed and validated in experimental environments; when a new algorithm is ready to be launched, MLOps helps transition it to production systems.
Similar to DevOps, MLOps seeks to automate the process and improve the quality of production models, while also addressing business and regulatory requirements. While MLOps started as a set of best practices, it is slowly evolving into an independent and, at this point, indispensable approach to ML lifecycle management. MLOps applies to the entire model lifecycle: from integration, model versioning, orchestration, and deployment, to monitoring, governance, and business metrics.
Often the effort to bring a model successfully to production can outweigh the effort of actually developing the model, because machine learning models differ from traditional pieces of software in many aspects. In response to these challenges, it has become common practice for data science teams to designate MLOps engineers as the people specifically responsible for this process.
The model lifecycle
Developing and operationalizing a machine learning model has several sequential phases (see the diagram below). The initial phase is model and model-pipeline development. First, data scientists and machine learning engineers work together to create a working prototype that addresses a specific business use case. This step is mostly manual work and aims to produce a working proof of concept. Next, the team develops the automated pipeline that will bring the model to the end-user. In a sense, this pipeline consists of the same tasks the team performed manually during prototyping, but in its final state the process is meant to be far more robust and fail-safe. This second step is where MLOps engineers step in to automate the process, typically collaborating with DevOps experts, privacy and security experts, data engineers, and data scientists, given the complexity and implications of the work.
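To make the idea concrete, here is a minimal sketch of such a pipeline: the same steps the team ran by hand during prototyping, encoded as one repeatable, testable sequence with a quality gate before deployment. All names, the toy linear model, and the error threshold are hypothetical, for illustration only; a real pipeline would typically run under an orchestrator and push artifacts to a serving environment.

```python
# Illustrative automated ML pipeline: each manual prototyping step
# becomes a function, chained into one repeatable sequence.

def load_data():
    # In practice: pull fresh data from a warehouse or feature store.
    return [(x, 2 * x + 1) for x in range(100)]

def train(data):
    # Toy model: fit a line y = slope * x + intercept by least squares.
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    n = len(data)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in data) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return {"slope": slope, "intercept": mean_y - slope * mean_x}

def evaluate(model, data):
    # Mean absolute error of the fitted line on held data.
    return sum(
        abs(model["slope"] * x + model["intercept"] - y) for x, y in data
    ) / len(data)

def deploy(model):
    # In practice: push the model artifact to a serving environment.
    print(f"deployed model: {model}")

# The pipeline itself: load, train, evaluate, and deploy only if
# the model clears a quality gate.
data = load_data()
model = train(data)
error = evaluate(model, data)
if error < 0.1:  # hypothetical quality threshold
    deploy(model)
```

The key design point is that each stage is a separate, individually testable unit, which is what makes the automated version more robust than the manual prototype it replaces.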
So far the model has not reached the end-user; it is only being prepared to get there. The next phase is bringing the model and the automated ML pipeline to production, once the pipeline has been extensively tested and the model validated. This is where job scheduling and orchestration software comes in handy. Since the machine learning process is a sequence of tasks, it is important to have testing procedures in place that validate the input and output of each step, which is usually some collection of data points. Once the process is implemented, the pipeline can either run on a regular schedule, depending on when new input data arrives, or be triggered by end-user actions. An example of the latter would be a video streaming platform that retrains and deploys a new model every time a user watches new content, in order to serve them the most relevant suggestions.
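The per-step validation described above can be as simple as schema and range checks on the batch of data passed between stages. The sketch below is hypothetical (field names and bounds are invented for illustration); production pipelines often delegate this to a dedicated data-validation library instead.

```python
# Illustrative input/output validation between pipeline steps:
# fail fast if a batch of records violates basic expectations.

def validate_batch(rows):
    assert len(rows) > 0, "empty batch: upstream step produced no data"
    for row in rows:
        # Schema check: every record must have exactly these fields.
        assert set(row) == {"user_id", "watch_minutes"}, "unexpected schema"
        # Type and range checks on the values.
        assert isinstance(row["user_id"], int), "user_id must be an int"
        assert 0 <= row["watch_minutes"] <= 24 * 60, "watch time out of range"
    return rows

# A downstream step only receives data that passed these checks.
batch = validate_batch([
    {"user_id": 1, "watch_minutes": 42},
    {"user_id": 2, "watch_minutes": 180},
])
```

Placing a check like this at every hand-off point means a bad batch stops the pipeline at the step that produced it, rather than silently corrupting the model further down the line.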
The feedback loop
One final step in the process of model operationalization is to collect feedback from the end-users of the model in order to understand whether the model is useful to them. Very often models perform well on experimental datasets but significantly worse on real-world data. Computer vision applications frequently run into this issue: the initial training set consists of high-resolution, high-quality images, but the real-world data comes from sources that cannot reproduce the same image quality. Collecting end-user feedback serves two purposes: one is to monitor the model in real time and intervene if the performance is not satisfactory, and the other is to use the collected data to improve the model.
How to achieve this step depends on the use case at hand, and there is usually no single right way to do it. If the product utilizing the model is a software product, one would have to work within the UI to make feedback collection as effortless as possible for the user. If the use case is manufacturing, for example, one might have to verify predictions manually: it can be hard to know whether a model detected a manufacturing defect correctly unless a human double-checks it.
Transform your organization with MLOps
Having dedicated MLOps engineers helps your organization build automation in by design, reducing the risk of technical debt as you lay the foundation of your data science practice. Because machine learning processes are complex, it is possible to make a decision early on that will require a complete redesign later. When processes are automated by design, such a scenario can be accommodated, but caution in planning is still advised.
Read more: Data Science Insights: all you need to know
Understanding and defining the model lifecycle is the first step in coming up with the relevant MLOps process. There is no need for unnecessary complexity where the use case does not require it. For example, there is no point in setting up real-time model training, which is resource-consuming and requires performance optimization, if the model is good enough with regularly scheduled offline training. The technical stack you choose also matters, because it will almost certainly become a long-term dependency.
However, running into any of these challenges means one is on the right path to process automation and greater efficiency. MLOps has established itself as the right approach to managing a growing collection of machine learning models: it ensures reliability and robustness and allows for continuous integration and deployment. It is, therefore, the way forward in managing data science at scale.
Do you need a partner to help you transform your organization with MLOps?
Book a consultation with our team of experts to start your data science journey efficiently, with the right team on your side.