A step-by-step guide to using MLFlow Recipes to refactor messy notebooks

littlereddotdata
9 min read · Dec 26, 2022


Table of contents:

  • What is MLFlow Recipes?
  • Why use MLFlow Recipes? What are the benefits?
  • It’s easy enough to use MLFlow Recipes for a new project. What about retrofitting an existing project to the framework? Will it be difficult?
  • Steps to using MLFlow to refactor a messy notebook

The code repository for this post is here: the MLFlow Recipes template is on the main branch, and the filled-in template is on the fill-in-steps branch.

MLFlow Recipes

The announcement of MLFlow 2.0 included a new framework called MLFlow Recipes. For a Data Scientist, using MLFlow Recipes means cloning a git repository, or “template”, that comes with a ready-to-go folder structure for any regression or binary classification problem. This folder structure includes everything needed to make a data science project reproducible and production-ready: library requirements, configuration, notebooks and tests.

It’s easy to start a new project with MLFlow Recipes — git clone a template from the MLFlow repository, and you are good to go.

Repository structure for MLFlow Recipes

Benefits to MLFlow Recipes

MLFlow Recipes is meant to solve several pain points in a data scientist’s workflow:

  • Reduce mental overhead: Previously, we would have to create project subfolders, modularise code, and configure testing for every new project. Now, with a standard “recipe” to follow, data scientists can make use of “pre-fabricated” files and folders to reduce the manual toil of getting a project started and into production.
  • Standardised development practices: Without a pre-defined template to work with, individual team members will default to developing personal, sometimes idiosyncratic, processes for organising their code. As the team grows, uncoordinated standards mean that one data scientist cannot easily pick up the work of his or her colleague.
  • Easily switch between development environments: Because code inside the notebooks folder is driven by external configuration files, we can change parameters and settings, such as file locations, for running our model in different dev, staging and prod environments without touching the code for our model. We can also move between local environments and the cloud. For example, we can point to a local csv file when quickly prototyping on a laptop, and then point to a delta table on S3 when we want our model to run on a hosted workspace such as Databricks.

Getting started is easy. What about retrofitting an existing project to MLFlow Recipes?

This is a valid question. Most teams have existing code that they would like to standardise and improve on. These same teams may also have projects in the pipeline that are at various levels of maturity. Structuring these projects before they become too complex is a good way of paying down technical debt.

MLFlow Recipes does not have to be used only on new projects. With some refactoring, existing projects can also benefit from what the framework has to offer.

The following section covers how one might approach this refactoring.

Starting point

We have a notebook that we have been using for exploratory data analysis and for prototyping a house price prediction model on the Ames housing dataset. We’re fairly satisfied with the model we have created. However, our code all sits inside one notebook and it’s not well organised. Although this was okay while we were ideating, we’d now like to clean things up and bring the project to a more production-ready state. We can start by cloning a template repository that fits our use case (in our case, regression).

git clone https://github.com/mlflow/recipes-regression-template.git

Repository Structure

We can see this repository has several folders. To see how these folders connect with each other, a diagram is helpful:

How MLFlow Recipes repo folders fit together
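
In text form, the parts of the template this post touches are roughly the following (the folder comments are my own summary; see the template’s README.md for the full, authoritative layout):

recipes-regression-template/
├── recipe.yaml      # “master” configuration; pulls in values from profiles/
├── profiles/        # environment-specific settings (local.yaml, databricks.yaml)
├── steps/           # one .py file per stage (ingest, split, transform, train, ...)
├── notebooks/       # driver notebooks that run the recipe
├── tests/           # unit tests for the step code
└── data/            # local data referenced by local.yaml (e.g. ames.csv)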

steps folder

The first folder to look at is the steps folder. Inside this folder, there is a .py file for each stage of the model development process. According to the README.md in the MLFlow docs, these stages are:

ingest -> split -> transform -> train -> evaluate -> register. Hence, we see a corresponding file for ingestion (ingest.py), data splitting (split.py), data transformation (transform.py), and model training (train.py). For the evaluation step, we can define custom metrics inside the custom_metrics.py file, or indicate that we would like to use a metric that comes included with the framework. Finally, the register step does not have its own Python file, but we can customise our model registry location and model name inside the recipe.yaml file.
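
As an illustration, a custom metric in steps/custom_metrics.py is just a plain function that scores the evaluation DataFrame. The sketch below assumes the evaluation DataFrame exposes prediction and target columns and that the metric is wired up by name in recipe.yaml; check the docstring inside custom_metrics.py for the exact signature your template version expects.

def weighted_mean_squared_error(eval_df, builtin_metrics):
    """
    Example custom metric: mean squared error weighted by the target value,
    so that errors on expensive houses count for more.
    """
    import numpy as np

    error = eval_df["prediction"] - eval_df["target"]
    weights = eval_df["target"] / eval_df["target"].sum()
    return float(np.sum(weights * error ** 2))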

profiles folder

As mentioned previously, to make our code more flexible, parameters such as the file location and file format of our training data go into YAML files inside the profiles folder. For example, we can specify that we want to read in data as a csv file from our local data folder, using the load_file_as_dataframe function inside our ingest.py file:

INGEST_CONFIG:
  using: "csv"
  location: "./data/ames.csv"
  loader_method: load_file_as_dataframe
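
For reference, a minimal load_file_as_dataframe in steps/ingest.py could look like the sketch below. I am assuming here that the loader receives a file path and a format string and returns a pandas DataFrame; the template ships with a stub whose signature you should keep.

import pandas as pd


def load_file_as_dataframe(file_path: str, file_format: str) -> pd.DataFrame:
    """Load the dataset referenced in the active profile into a pandas DataFrame."""
    if file_format == "csv":
        # the Ames housing extract is a plain comma-separated file
        return pd.read_csv(file_path)
    raise NotImplementedError(f"Unsupported file format: {file_format}")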

recipe.yaml file and notebooks folder

The recipe.yaml file can be seen as the “master” configuration file, which takes in as variables the values we have defined inside the profiles folder. For instance, if we want to be able to switch between local.yaml and a Databricks workspace (specified as databricks.yaml), we can refer to the ingest location using the Jinja templating language (for those unfamiliar with Jinja, this is the variable inside the curly brackets).

steps:
  # Specifies the dataset to use for model development
  ingest: {{INGEST_CONFIG}}

Eventually, we will rely on a notebook (whether a Jupyter notebook or a Databricks notebook) to pull together everything we have worked on.

Effectively, this notebook acts as the “driver” that will read in the desired configurations and steps, and then execute these as part of a DAG that can be visualised.
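
As a sketch, the driver boils down to a few calls on the Recipe object; switching environments is then just a matter of passing a different profile name (for example "databricks" instead of "local"):

from mlflow.recipes import Recipe

# load the recipe with the environment-specific settings from profiles/local.yaml
r = Recipe(profile="local")

# run the full ingest -> split -> transform -> train -> evaluate -> register DAG
r.run()

# render the DAG and the status of each step
r.inspect()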

Now that we know how the various components of an MLFlow Recipes repository work together, we can start our refactoring process.

Refactoring Steps

Step 1: Extract code into the steps folder

Inside our original notebook, we have specified our data loading, pre-processing and transformation logic linearly within one notebook. The first thing we should do is to separate each step into its relevant .py file inside the steps folder (note that at this time it’s not possible to customise what we want our steps to be, or to add additional steps to what is already provided).

For example, we originally defined a scikit-learn pipeline that includes numerical and categorical transformations, with a LassoCV model at the end.

Original scikit-learn pipeline

However, inside MLFlow Recipes, any transformations, such as data imputation and one-hot encoding, should be part of the transform step. Then, the machine learning algorithm forms part of the train step.

Therefore, we will need to break up the original pipeline. The ColumnTransformer stage goes into the transform step, while the LassoCV goes into the train step.

Our original code:

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.impute import SimpleImputer
import numpy as np

# select categorical and numerical columns by dtype
cat_selector = make_column_selector(dtype_include=object)
num_selector = make_column_selector(dtype_include=np.number)

cat_linear_processor = OneHotEncoder(handle_unknown="ignore")

num_linear_processor = make_pipeline(
    StandardScaler(), SimpleImputer(strategy="mean", add_indicator=True)
)

linear_preprocessor = make_column_transformer(
    (num_linear_processor, num_selector), (cat_linear_processor, cat_selector)
)

lasso_pipeline = make_pipeline(linear_preprocessor, LassoCV())

Eventual transform.py:

"""
This module defines the following routines used by the 'transform' step of the regression recipe:

- ``transformer_fn``: Defines customizable logic for transforming input data before it is passed
to the estimator during model inference.
"""

def transformer_fn():
"""
Returns an *unfitted* transformer that defines ``fit()`` and ``transform()`` methods.
The transformer's input and output signatures should be compatible with scikit-learn
transformers.
"""
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
import numpy as np

# get the list of columns names of categorical variables
cat_selector = make_column_selector(dtype_include=object)
# get the list of column names of numerical variables
num_selector = make_column_selector(dtype_include=np.number)


cat_linear_processor = OneHotEncoder(handle_unknown="ignore")

num_linear_processor = make_pipeline(
StandardScaler(), SimpleImputer(strategy="mean", add_indicator=True)
)

linear_preprocessor = make_column_transformer(
(num_linear_processor, num_selector), (cat_linear_processor, cat_selector)
)

return linear_preprocessor
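
On the training side, the LassoCV model moves into steps/train.py as an estimator_fn (the function name referenced later in recipe.yaml). A minimal sketch, assuming the framework passes an optional dict of hyperparameters that we simply forward to the estimator:

def estimator_fn(estimator_params=None):
    """
    Returns an *unfitted* estimator with scikit-learn compatible ``fit()`` and
    ``predict()`` methods; MLFlow Recipes fits it on the transformed training
    data produced by the transform step.
    """
    from sklearn.linear_model import LassoCV

    # estimator_params is assumed to be an optional dict of hyperparameters
    # supplied via recipe.yaml; fall back to LassoCV defaults when absent.
    return LassoCV(**(estimator_params or {}))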

MLFlow Recipes will automatically implement best practices for us by only applying the transformation to the training and validation datasets created in the split step, while leaving the test set unchanged. By doing this, we prevent data leakage by making sure that information from the test set is not implicitly used to transform the training and validation data.

Another detail to note is that data cleaning and filtering are considered to be separate from the transform step, and belong instead to the split step, where a custom post-processing method can be applied to the training, validation and test sets.

A further note about the training logic: inside the original notebook, we evaluate our model with cross-validation. Cross-validation is not available inside MLFlow Recipes for now (we use a static training, validation and test split). We can, however, implement hyperparameter tuning using hyperopt.

split:
  split_ratios: [0.75, 0.125, 0.125]
  post_split_filter_method: create_dataset_filter
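
The create_dataset_filter referenced above lives in steps/split.py and returns a boolean mask of rows to keep. A minimal sketch, assuming we simply want to drop rows with a missing or non-positive sale price (the target column name matches the one used in recipe.yaml):

import pandas as pd


def create_dataset_filter(dataset: pd.DataFrame) -> pd.Series:
    """Return a boolean Series marking which rows of each split to keep."""
    return dataset["SalePrice"].notna() & (dataset["SalePrice"] > 0)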

Step 2: Specify configurations as YAML files in the profiles folder

Inside the profiles folder, we can specify environment-specific variables. For instance, when working on a local machine, we can specify inside local.yaml that the data should be loaded from a csv file in a folder on our laptop.

INGEST_CONFIG:
  using: "csv"
  location: "./data/ames.csv"
  loader_method: load_file_as_dataframe

Then, inside a databricks.yaml file, we specify that the data should be read from a Delta table using a Spark SQL query:

INGEST_CONFIG:
  using: spark_sql
  sql: SELECT * FROM delta.`dbfs:/ames_housing`

Step 3: Specify non-environment-specific configuration inside the recipe.yaml file

The recipe.yaml can reference profiles using curly brackets; for example, our different ingestion methods are referenced using {{INGEST_CONFIG}}. The recipe.yaml is also where we reference any functions created as part of the different steps.

Note that there is also some logic we can specify around how the model is registered. In our case, setting allow_non_validated_model: false means that a model which does not meet the pre-specified performance threshold will not be registered in the MLFlow Model Registry.

recipe: "regression/v1"
target_col: "SalePrice"
primary_metric: "root_mean_squared_error"
steps:
  ingest: {{INGEST_CONFIG}}
  split:
    split_ratios: [0.75, 0.125, 0.125]
  transform:
    using: "custom"
    transformer_method: transformer_fn
  train:
    estimator_method: estimator_fn
  evaluate:
    validation_criteria:
      - metric: root_mean_squared_error
        threshold: 10
  register:
    # Indicates whether or not a model that fails to meet performance thresholds should still
    # be registered to the MLflow Model Registry
    allow_non_validated_model: false

It is the profiles and the recipe.yaml files that give MLFlow Recipes its flexibility. Instead of modifying the steps when we want to change parameters, we instead change the configuration files. By leaving the code unchanged, we can reassure ourselves that the coding and modelling logic will remain the same between experiment runs.

Step 4: Test run a step using a notebook in the notebooks folder

Finally, to test our refactoring, we can run a step using one of the notebooks inside the notebooks folder.

from mlflow.recipes import Recipe
r = Recipe(profile="local")
r.run("ingest")

From experience, it’s easier to refactor by focusing on one step at a time. Modify a step file and its corresponding section inside the configuration file, make sure the step works by running it from a notebook, and only then move on.

Step 5: Make a change

To make any updates (for example, to add hyperparameter tuning or a different evaluation metric), modify the relevant configuration YAML or the relevant Python file inside steps, as in the sketch below.
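
For instance, tightening the evaluation gate is a one-line change to the validation criteria in recipe.yaml (the threshold value here is purely illustrative):

evaluate:
  validation_criteria:
    - metric: root_mean_squared_error
      threshold: 5  # stricter than the previous value of 10

After editing, re-run the affected step from the driver notebook, for example r.run("evaluate"), to confirm the recipe still passes.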

Hopefully, this makes migrating to MLFlow Recipes more concrete, and you can benefit from the standardised layouts and outputs that the framework has to offer.
