A step-by-step guide to using MLFlow Recipes to refactor messy notebooks
Table of contents:
- What is MLFlow Recipes?
- Why use MLFlow Recipes? What are the benefits?
- It’s easy enough to use MLFlow Recipes for a new project. What about retrofitting an existing project to the framework? Will it be difficult?
- Steps to using MLFlow Recipes to refactor a messy notebook
Code repository for this post is here: you can see the MLFlow Recipes template in the main branch and the filled-in template on the fill-in-steps branch.
MLFlow Recipes
The announcement of MLFlow 2.0 included a new framework called MLFlow Recipes. For a data scientist, using MLFlow Recipes means cloning a git repository, or “template”, that comes with a ready-to-go folder structure for any regression or binary classification problem. This folder structure includes everything needed to make a data science project reproducible and production-ready: library requirements, configuration, notebooks and tests.
It’s easy to start a new project with MLFlow Recipes: git clone a template from the MLFlow repository, and you are good to go.
Benefits of MLFlow Recipes
MLFlow Recipes is meant to solve several pain points in a data scientist’s workflow:
- Reduce mental overhead: Previously, we would have to create project subfolders, modularise code, and configure testing for every new project. Now, with a standard “recipe” to follow, data scientists can make use of “pre-fabricated” files and folders to reduce the manual toil of getting a project started and into production.
- Standardised development practices: Without a pre-defined template to work with, individual team members default to developing personal, sometimes idiosyncratic ways of organising their code. As the team grows, these uncoordinated standards mean that one data scientist cannot easily pick up a colleague’s work.
- Easily switch between development environments: Because code inside the notebooks folder is driven by external configuration files, we can change parameters and settings, such as file locations, to run our model in different dev, staging and prod environments without touching the code for our model. We can also move between local environments and the cloud. For example, we can point to a local csv file when quickly prototyping on a laptop, and then point to a delta table on S3 when we want our model to run on a hosted workspace such as Databricks (see the sketch after this list).
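As a minimal sketch of this idea (assuming a local.yaml and a databricks.yaml profile exist in the profiles folder, as described later in this post), the same driver code can be pointed at either environment simply by changing the profile name:

from mlflow.recipes import Recipe

# Prototype locally against the csv file declared in profiles/local.yaml
local_recipe = Recipe(profile="local")
local_recipe.run("ingest")

# Run the same, unchanged code against the Delta table declared in profiles/databricks.yaml
databricks_recipe = Recipe(profile="databricks")
databricks_recipe.run("ingest")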
Getting started is easy. What about retrofitting an existing project to MLFlow Recipes?
This is a valid question. Most teams have existing code that they would like to standardise and improve on. These same teams may also have projects in the pipeline that are at various levels of maturity. Structuring these projects before they become too complex is a good way of paying down technical debt.
MLFlow Recipes does not have to be used only on new projects. With some refactoring, existing projects can also benefit from what the framework has to offer.
The following section covers how one might approach this refactoring.
Starting point
We have a notebook that we have been using for exploratory data analysis and for prototyping a house price prediction model on the Ames housing dataset. We’re fairly satisfied with the model we have created. However, all of our code sits inside one notebook and it’s not well organised. Although this was fine while we were ideating, we’d now like to clean things up and move the project towards a more production-ready state. We can start by cloning the template repository that fits our use case (in our case, regression):
git clone https://github.com/mlflow/recipes-regression-template.git
Repository Structure
We can see this repository has several folders. To see how these folders connect with each other, a diagram is helpful:
steps folder
The first folder to look at is the steps folder. Inside this folder, there is a .py file for each stage of the model development process. According to the README.md of the MLFlow Recipes template, these stages are: ingest -> split -> transform -> train -> evaluate -> register. Hence, we see a corresponding file for ingestion (ingest.py), data splitting (split.py), data transformation (transform.py), and model training (train.py). For the evaluation step, we can define custom metrics inside the custom_metrics.py file (a sketch of such a custom metric follows), or indicate that we would like to use a metric that comes included with the framework. Finally, the register step does not have its own Python file, but we can customise our model registry location and model name inside the recipe.yaml file.
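For illustration, a custom metric might look like the sketch below. This is a hypothetical example that assumes, as in the template’s own example metric, that the evaluation DataFrame exposes prediction and target columns; the metric would also need to be declared in recipe.yaml before it is used.

from pandas import DataFrame
from sklearn.metrics import mean_squared_error

def weighted_mean_squared_error(eval_df: DataFrame, builtin_metrics: dict) -> float:
    # Hypothetical custom metric: an MSE that down-weights expensive houses.
    # Assumes eval_df contains "prediction" and "target" columns, as in the template's example.
    return mean_squared_error(
        eval_df["prediction"],
        eval_df["target"],
        sample_weight=1 / eval_df["prediction"].values,
    )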
profiles folder
As mentioned previously, to make our code more flexible, parameters such as the file location and file format of our training data go into yaml files inside the profiles folder. For example, we can specify that we want to read in data as a csv file from our local data folder using the load_file_as_dataframe function inside our ingest.py file (a sketch of this loader follows the config below):
INGEST_CONFIG:
  using: "csv"
  location: "./data/ames.csv"
  loader_method: load_file_as_dataframe
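As a rough sketch (assuming, as in the template, that the loader receives a file path plus a format string and returns a pandas DataFrame), load_file_as_dataframe inside ingest.py might look like this:

import pandas as pd

def load_file_as_dataframe(file_path: str, file_format: str) -> pd.DataFrame:
    # Load the dataset referenced by INGEST_CONFIG as a pandas DataFrame.
    # Only csv is handled in this sketch; other formats would raise.
    if file_format == "csv":
        return pd.read_csv(file_path)
    raise NotImplementedError(f"Unsupported file format: {file_format}")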
recipe.yaml file and notebooks folder
The recipe.yaml file can be seen as the “master” configuration file that takes in, as variables, the values we have defined inside the profiles folder. For instance, if we want to be able to switch between local.yaml and a Databricks workspace (specified as databricks.yaml), we can refer to the ingest location using the Jinja templating language (for those unfamiliar with Jinja, this is the variable inside the curly brackets):
steps:
  # Specifies the dataset to use for model development
  ingest: {{INGEST_CONFIG}}
Eventually, we will rely on a notebook (whether a Jupyter notebook or a Databricks notebook) to pull together everything we have worked on. Effectively, this notebook acts as the “driver” that reads in the desired configurations and steps, and then executes them as part of a DAG that can be visualised.
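As a minimal sketch of such a driver notebook (assuming the local profile defined above), running the full recipe and visualising the DAG takes only a few lines:

from mlflow.recipes import Recipe

r = Recipe(profile="local")
r.run()      # execute the full ingest -> split -> transform -> train -> evaluate -> register DAG
r.inspect()  # display a visual summary of the recipe's DAG and the latest results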
Now that we know how the various components of an MLFlow Recipes repository work together, we can start our refactoring process.
Refactoring Steps
Step 1: Extract code into the steps folder
Inside our original notebook, we specified our data loading, pre-processing and transformation logic linearly, all in one place. The first thing we should do is separate each step into its relevant .py file inside the steps folder (note that at this time it’s not possible to customise what the steps are, or to add additional steps to those already provided).
For example, we originally defined a scikit-learn pipeline that chained numerical and categorical transformations with a LassoCV model at the end.
However, inside MLFlow Recipes, any transformations, such as data imputation and one-hot encoding, should be part of the transform step, while the machine learning algorithm forms part of the train step. Therefore, we will need to break up the original pipeline: the ColumnTransformer stage goes into the transform step, while the LassoCV goes into the train step (a sketch of the resulting train.py appears after the transform.py code below).
Our original code:
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
import numpy as np

# column selectors for categorical and numerical features (also shown in transform.py below)
cat_selector = make_column_selector(dtype_include=object)
num_selector = make_column_selector(dtype_include=np.number)

cat_linear_processor = OneHotEncoder(handle_unknown="ignore")
num_linear_processor = make_pipeline(
    StandardScaler(), SimpleImputer(strategy="mean", add_indicator=True)
)
linear_preprocessor = make_column_transformer(
    (num_linear_processor, num_selector), (cat_linear_processor, cat_selector)
)
lasso_pipeline = make_pipeline(linear_preprocessor, LassoCV())
Eventual transform.py:
"""
This module defines the following routines used by the 'transform' step of the regression recipe:
- ``transformer_fn``: Defines customizable logic for transforming input data before it is passed
to the estimator during model inference.
"""
def transformer_fn():
    """
    Returns an *unfitted* transformer that defines ``fit()`` and ``transform()`` methods.
    The transformer's input and output signatures should be compatible with scikit-learn
    transformers.
    """
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import StandardScaler
    from sklearn.compose import make_column_transformer, make_column_selector
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    import numpy as np

    # get the list of column names of categorical variables
    cat_selector = make_column_selector(dtype_include=object)
    # get the list of column names of numerical variables
    num_selector = make_column_selector(dtype_include=np.number)

    cat_linear_processor = OneHotEncoder(handle_unknown="ignore")
    num_linear_processor = make_pipeline(
        StandardScaler(), SimpleImputer(strategy="mean", add_indicator=True)
    )

    linear_preprocessor = make_column_transformer(
        (num_linear_processor, num_selector), (cat_linear_processor, cat_selector)
    )
    return linear_preprocessor
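For the training side of the broken-up pipeline, train.py exposes an estimator_fn that returns the unfitted model. A rough sketch, assuming (as in the template) that estimator_fn simply needs to return an unfitted scikit-learn estimator:

def estimator_fn():
    """
    Returns an *unfitted* estimator that defines ``fit()`` and ``predict()`` methods,
    compatible with scikit-learn estimators.
    """
    from sklearn.linear_model import LassoCV

    # The LassoCV from the original notebook pipeline, now living in the train step
    return LassoCV()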
MLFlow Recipes will automatically implement best practices for us by only applying the transformation to the training and validation datasets created in the split step, while leaving the test set unchanged. By doing this, we prevent data leakage by making sure that information from the test set is not implicitly used to transform the training and validation data.
Another detail to note is that data cleaning and filtering are considered separate from the transform step, and belong instead to the split step, where a custom post-processing method can be applied to the training, validation and test sets (a sketch of such a filter follows the config below).
A further note about the training logic: inside the original notebook, we evaluated our model with cross-validation. Cross-validation is not available inside MLFlow Recipes for now (a static training, validation and test split is used instead). We can, however, implement hyperparameter tuning using hyperopt.
split:
  split_ratios: [0.75, 0.125, 0.125]
  post_split_filter_method: create_dataset_filter
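As a sketch of what such a post-split filter could look like (a hypothetical cleaning rule, assuming, as in the template, that the method receives a DataFrame and returns a boolean Series marking the rows to keep):

from pandas import DataFrame, Series

def create_dataset_filter(dataset: DataFrame) -> Series:
    # Hypothetical cleaning rule for the Ames data: keep only rows with a positive,
    # non-missing sale price.
    return dataset["SalePrice"].notna() & (dataset["SalePrice"] > 0)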
Step 2: Specify configurations as YAML files in the profiles folder
Inside the profiles folder we can specify environment-specific variables. For instance, when working on a local machine, we can specify inside local.yaml that we want to load our data from a csv file inside a folder on our laptop:
INGEST_CONFIG:
  using: "csv"
  location: "./data/ames.csv"
  loader_method: load_file_as_dataframe
Then, inside a databricks.yaml file, we specify that the data should be read from a Delta table using a Spark SQL query:
INGEST_CONFIG:
  using: spark_sql
  sql: SELECT * FROM delta.`dbfs:/ames_housing`
Step 3: Specify non-environment-specific configuration inside the recipe.yaml file
The recipe.yaml file can reference profiles using curly brackets; for example, our different ingestion methods are referred to using {{INGEST_CONFIG}}. The recipe.yaml file is also where we reference any functions created as part of the different steps.
Note that there is also some logic we can specify around how the model is registered. In our case, by indicating allow_non_validated_model: false, a model that does not meet a pre-specified performance threshold will not be registered in the MLFlow Model Registry.
recipe: "regression/v1"
target_col: "SalePrice"
primary_metric: "root_mean_squared_error"
steps:
ingest: {{INGEST_CONFIG}}
split:
split_ratios: [0.75, 0.125, 0.125]
transform:
using: "custom"
transformer_method: transformer_fn
train:
estimator_method: estimator_fn
evaluate:
validation_criteria:
- metric: root_mean_squared_error
threshold: 10
register:
# Indicates whether or not a model that fails to meet performance thresholds should still
# be registered to the MLflow Model Registry
allow_non_validated_model: false
It is the profiles and the recipe.yaml files that give MLFlow Recipes its flexibility. Instead of modifying the steps when we want to change parameters, we change the configuration files instead. By leaving the code unchanged, we can reassure ourselves that the coding and modelling logic remain the same between experiment runs.
Step 4: Test run a step using a notebook in the notebooks folder
Finally, to test our refactoring, we can run a step using one of the notebooks inside the notebooks folder:
from mlflow.recipes import Recipe
r = Recipe(profile="local")
r.run("ingest")
From experience, it’s easier to refactor by focusing on one step at a time. Modify a step file and its corresponding section inside the configuration file, make sure the step works by running it from a notebook, and only then move on.
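A rough sketch of this step-by-step workflow (again assuming the local profile from earlier):

from mlflow.recipes import Recipe

r = Recipe(profile="local")

# Work through the recipe one step at a time, checking each result before moving on
r.run("split")
r.run("transform")
r.run("train")

# Inspect the output of the step we just refactored
r.inspect("train")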
Step 5: Make a change
To make any updates (for example, to add hyperparameter tuning or a different evaluation metric), modify the relevant configuration yaml or the relevant Python file inside the steps folder.
Hopefully, this makes migrating to MLFlow Recipes more concrete, and you can benefit from the standardised layouts and outputs that the framework has to offer.