Lessons from building a small MLOps pipeline

[Cover image taken from the MLOps organisation's website.]
Recently I found myself doing yet another quirky machine learning side project (this one), and it occurred to me that making the code as production-ready as possible would be an interesting exercise and a good way to get familiar with some of the latest MLOps frameworks. Keep in mind that this was not an industry project, so there were no hard requirements and no serious consideration of things like monitoring or deployment. Still, as in any research project, reproducibility and fast-paced development are always implicit requirements, even though they can contradict each other at times; the dataset size also justified some use of cloud technologies I was happy to play with. Along the way I learned some good lessons about automating ML pipelines.
Stages in the ML lifecycle should be loosely coupled
An ML pipeline, excluding deployment, can be roughly divided into ingestion, data validation, preprocessing, model training and model validation; the image above, from the book Building Machine Learning Pipelines by Hapke and Nelson, is an example of the approach the TFX library takes. Each of these stages should be a standalone set of Python modules, organised inside a single package. As in any proper Python package, every module should have documentation, error handling and logging. The stages should be loosely coupled, so that they can be run individually, and should only interact with each other through the mediation of storage, or through a higher-layer abstraction that takes the outputs of one stage and passes them to the next.
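To make the idea concrete, here is a minimal sketch (with made-up module names and a placeholder transformation) of what a standalone preprocessing stage could look like: its only contract with the rest of the pipeline is what it reads from and writes to storage.

```python
# preprocessing/run.py -- hypothetical module; every stage gets one like it
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def run(input_path: str, output_path: str) -> None:
    """Run the preprocessing stage end to end.

    The stage knows nothing about ingestion or training; it only agrees
    with them on where data lives and what it looks like.
    """
    logger.info("Reading raw data from %s", input_path)
    raw = pd.read_parquet(input_path)
    processed = raw.dropna()  # stand-in for the real transformations
    processed.to_parquet(output_path)
    logger.info("Wrote %d rows to %s", len(processed), output_path)
```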
Configuration schemas and parsers should be a priority
Interactive sessions shouldn't be necessary to run anything. Instead, everything is better run through a couple of entrypoint scripts that take a small number of parameters. A handy way of doing this is to define a comprehensive configuration schema for each stage in a human-readable format (say, YAML), and then pass a valid execution configuration file to the entrypoint scripts. This is in spirit the same approach that Kubernetes takes: the user specifies a desired state in a configuration file and leaves everything else to the control plane, or in this case, the project's package. Adopting a declarative approach to running ML pipelines is much clearer and more easily reproducible.
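As a sketch of what this looks like in practice (the package name and the run_training function below are hypothetical), an entrypoint script can be as thin as this:

```python
# run_training.py -- hypothetical entrypoint script for the training stage
import argparse

import yaml

from my_project.training import run_training  # assumed project function


def main() -> None:
    parser = argparse.ArgumentParser(description="Run the training stage")
    parser.add_argument("--config", required=True,
                        help="Path to a YAML execution configuration file")
    args = parser.parse_args()

    with open(args.config) as f:
        config = yaml.safe_load(f)

    # The config file fully describes the desired run; nothing else is needed.
    run_training(config)


if __name__ == "__main__":
    main()
```

Running an experiment then becomes something like `python run_training.py --config configs/train.yaml`.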
The parser classes should do all the heavy lifting of 1) reading the configuration file and matching it against the corresponding stage schema, which entails checking optional and mandatory parameters, checking dependencies between parameters (e.g., par1 <= par2) and making sure the values make sense, and 2) instantiating execution dependencies (e.g., a model with the specified architecture) and passing those objects to the stage modules to be used for execution. A good pair of packages to do some of this are strictyaml and pydantic.
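A minimal sketch of what such a schema and parser might look like, using pydantic's v1-style validators and entirely made-up parameter names:

```python
from pydantic import BaseModel, validator
import strictyaml

EXAMPLE_CONFIG = """
learning_rate: 0.001
par1: 10
par2: 100
"""  # in practice this comes from the execution config file


class TrainingConfig(BaseModel):
    learning_rate: float = 1e-3  # optional parameter with a default
    par1: int                    # mandatory
    par2: int                    # mandatory

    @validator("par2")
    def par2_not_smaller_than_par1(cls, v, values):
        # cross-parameter dependency check: par1 <= par2
        if "par1" in values and v < values["par1"]:
            raise ValueError("par2 must be >= par1")
        return v


raw = strictyaml.load(EXAMPLE_CONFIG).data  # strictyaml parses values as strings
config = TrainingConfig(**raw)              # pydantic coerces types and validates
```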
Something I found handy was to include every parameter I could expect to change in the configuration schema, even the model architecture, optimizers and so on. This way, running experiments is just a matter of changing a line in a file, without messing with the code. Besides enabling grid search over the project's entire parameter space, this partly solves the problem of reproducibility, as it forces you to keep a written record of the exact configuration you just ran; it also prevents you from making volatile changes to the pipeline in an interactive session. Using an experiment tracking framework such as MLFlow on top of this makes it trivial to log everything needed for full reproducibility.
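For instance, here is roughly how the whole configuration can be attached to an MLFlow run; a sketch, assuming the config is a flat dictionary loaded from a hypothetical configs/train.yaml:

```python
import mlflow
import yaml

with open("configs/train.yaml") as f:
    config = yaml.safe_load(f)

with mlflow.start_run():
    # Log every parameter so the run can be reproduced from the tracking UI alone
    mlflow.log_params({k: str(v) for k, v in config.items()})
    # Keep the raw file as an artifact too, in case the schema evolves
    mlflow.log_artifact("configs/train.yaml")
    # ... run the training stage here ...
```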
Comprehensive I/O abstractions should be defined early on
As a data scientist, one is used to having data in specific file formats, with specific schemas, at specific locations (e.g. local or remote storage), but every one of those things can change: prototyping happens locally, but training at scale may need to happen remotely; in a long-running project, input file formats and schemas can evolve; compression formats can change too. I/O is so fundamental to everything else that making these changes at a later stage can be painful, so it's better to decouple I/O from the processing logic early on, and to have extensible abstractions that handle the foreseeable variations, such as local vs. remote I/O. The ideal scenario is to have pluggable, specialized classes that perform data fetching, serialization, decompression and schema formatting without the processing logic having anything to do with it; all the processing classes need to do is expose their desired data formats and schemas.
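A rough sketch of what those pluggable classes could look like (the S3 backend and the CSV assumption are just illustrative choices):

```python
import io
from abc import ABC, abstractmethod

import pandas as pd


class DataSource(ABC):
    """Fetches raw bytes from somewhere; callers never care where."""

    @abstractmethod
    def read_bytes(self, path: str) -> bytes:
        ...


class LocalSource(DataSource):
    def read_bytes(self, path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()


class S3Source(DataSource):
    def __init__(self, bucket: str):
        import boto3  # only needed for the remote backend
        self._bucket = boto3.resource("s3").Bucket(bucket)

    def read_bytes(self, path: str) -> bytes:
        buffer = io.BytesIO()
        self._bucket.download_fileobj(path, buffer)
        return buffer.getvalue()


def load_dataframe(source: DataSource, path: str) -> pd.DataFrame:
    # Processing code only states the format it wants (here, a CSV read into
    # a DataFrame); fetching and storage details stay behind DataSource.
    return pd.read_csv(io.BytesIO(source.read_bytes(path)))
```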
Preprocessing logic should work on both batches and streams
It should be possible to use the exact same codebase for batch and stream data preprocessing, as opposed to somehow replicating one codebase in another. This helps prevent hard-to-debug issues caused by training-serving skew, and it even makes for faster development iterations by letting you open a console and try things out to see what's happening.
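One way of getting this is to write the transformation once, per record, and wrap it for both cases; a minimal sketch with a made-up text field:

```python
from typing import Dict, Iterable, Iterator, List


def preprocess_record(record: Dict) -> Dict:
    """The single place where the transformation logic lives."""
    return {**record, "text": record["text"].strip().lower()}


def preprocess_batch(records: List[Dict]) -> List[Dict]:
    # Training-time path: a whole dataset (or chunk) at once
    return [preprocess_record(r) for r in records]


def preprocess_stream(records: Iterable[Dict]) -> Iterator[Dict]:
    # Serving-time path: records arrive one by one
    for r in records:
        yield preprocess_record(r)
```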
There are already plenty of good frameworks to scale up as much as you need
I found that once the points above are taken care of and the codebase is solid, integrating the project with cloud-based services or building on top of it is much easier, and feels almost like playing with Lego blocks. There are lots of good MLOps frameworks out there that can give you a big boost without much extra work; my experience with MLFlow has been very positive so far, and others look fantastic as well, such as ZenML or ClearML; specialized open-source deployment tools such as Seldon and BentoML can also help you offload some work in the late stages of a project. Having a solid codebase means you can almost immediately wrap it in existing high-quality frameworks and grow to whatever scale you need.