Putting ethical ML systems into production with Databricks

littlereddotdata
Mar 31, 2023


Introduction

When Cass Sunstein, Daniel Kahneman and Olivier Sibony released their book Noise: A Flaw in Human Judgment, they touched on an uncomfortable truth: humans are remarkably inconsistent decision makers. In a summary published in the Harvard Business Review, the authors share how a wide range of professionals, from “appraisers in credit-rating agencies, physicians in emergency rooms, underwriters of loan and insurance” had their judgements “strongly influenced by irrelevant factors, such as their current mood, the time since their last meal, and the weather”.

The book offers a remedy — rely more on algorithms, which are noise-free. Yet, as those familiar with computational systems know, algorithms, especially those that rely on probabilistic machine learning, are neither completely consistent, nor free from bias. In fact, when deployed, algorithmic systems often end up encoding and exacerbating existing biases and inequalities in society.

There is a lot of good work on how to measure a machine learning model for bias; see here for an example.

In this post, however, we want to highlight another dimension of mitigating bias: how to design a production-ready system that can reliably operationalise these bias-mitigation techniques. As FICO’s AI research arm puts it:

“Enforcing fairness for production-ready ML systems in Fintech requires specific engineering commitments at different stages of ML system life cycle”

We will focus on credit scoring as an example use case, though similar principles could be applied to other use cases as well.

Production-ready ethical machine learning systems with Delta, Databricks Feature Store, and MLflow

A machine learning system in production that seeks to track and mitigate bias needs several features incorporated into it, which we outline below. Firstly, the system needs to be reproducible so that it can be revisited and audited in the future. This means that the parameters, code, features, and data sources needed to train and reproduce the model are made available.

Secondly, fairness metrics should be tracked and used to inform whether or not a model is ready to be deployed. Additionally, any protected variables included in the data, such as gender and race, should be monitored appropriately. This ensures that model developers are fully aware of the effect these variables have on their models, and can decide on appropriate trade-offs between model performance, business profits, and fairness.

Lastly, deployed models, and even archived ones, should remain searchable for auditing purposes.

The diagram below shows what such a system might look like on Databricks:

Overview of use case

For our example, we will work with a classification model built with the Credit Card dataset from the UCI Machine Learning Repository. This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

We will use this dataset to train a binary classifier that predicts whether or not a customer will default on his or her credit card payment in the next month. In the real world, this is the sort of model that might influence a loan decision-making process or a credit score.

The full code for this example can be found in this GitHub repository.

Feature Store

At this point, we have already ingested our data into the bronze layer and performed some data cleaning. Now, we will derive two features from our original dataset and store these features inside a Feature Store table.

For the PAY_* columns (-2 = no credit to pay, -1 = pay duly, 0 = meeting the minimum payment, 1 = payment delay for one month, 2 = payment delay for two months, … 8 = payment delay for eight months, 9 = payment delay for nine months and above), we create a feature called num_paym_unmet that sums up the total number of times, over 6 months, that a customer had a delayed payment.

We also create a second feature called num_paym_met that sums up the total number of times, over 6 months, that a customer met their payment obligations.

from functools import reduce
from operator import add
from pyspark.sql import functions as F

pay_cols = ["PAY_1", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]

def num_paym_unmet_fn(df):
    """
    Sums up the total number of times, over 6 months, that a customer had a delayed payment.
    """
    def count_unmet(col):
        # a positive PAY_* value means the payment was delayed that month
        return F.when(F.col(col) > 0, 1).otherwise(0)

    feature_df = df.select("USER_ID", *pay_cols)
    feature_df = feature_df.withColumn(
        "NUM_PAYM_UNMET", reduce(add, [count_unmet(c) for c in pay_cols])
    ).select("USER_ID", "NUM_PAYM_UNMET")
    return feature_df

def num_paym_met_fn(df):
    """
    Sums up the total number of times, over 6 months, that a customer met their payment obligations.
    """
    def count_met(col):
        # a PAY_* value of 0 or below means the payment obligation was met
        return F.when(F.col(col) <= 0, 1).otherwise(0)

    feature_df = df.select("USER_ID", *pay_cols)
    feature_df = feature_df.withColumn(
        "NUM_PAYM_MET", reduce(add, [count_met(c) for c in pay_cols])
    ).select("USER_ID", "NUM_PAYM_MET")
    return feature_df

num_paym_unmet_feature = num_paym_unmet_fn(silver)
num_paym_met_feature = num_paym_met_fn(silver)

Later, because we log and serve our model as a Feature Store model, the same featurization logic is applied during training and serving, which minimizes our risk of training-serving skew.
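As a rough sketch of what this might look like with the Databricks Feature Store client — the table name, label column, label_df, and the fitted model below are illustrative placeholders, not part of the repository code:

import mlflow
from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

# Register the derived features in a Feature Store table keyed on USER_ID
fs.create_table(
    name="credit_default.payment_features",  # illustrative database.table name
    primary_keys=["USER_ID"],
    df=num_paym_unmet_feature.join(num_paym_met_feature, on="USER_ID"),
    description="Counts of met / unmet payment obligations over the last 6 months",
)

# At training time the features are looked up by key, so the same
# featurization logic is reused when the model is served
feature_lookups = [
    FeatureLookup(
        table_name="credit_default.payment_features",
        feature_names=["NUM_PAYM_UNMET", "NUM_PAYM_MET"],
        lookup_key="USER_ID",
    )
]

training_set = fs.create_training_set(
    df=label_df,                  # USER_ID plus the default-payment label (placeholder)
    feature_lookups=feature_lookups,
    label="DEFAULT_PAYMENT",      # illustrative label column name
)

# Log the fitted classifier together with its feature metadata
fs.log_model(
    model,
    artifact_path="model",
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name="credit_scoring",
)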

Fairness metrics tracking

There are multiple open-source packages available for measuring fairness metrics, for example Microsoft’s Fairlearn and Google’s Model Cards. In this example, we will use the open-source Veritas package released by the Monetary Authority of Singapore, since it shows how to incorporate fairness considerations into a production system based on recommendations and tools produced by an actual financial regulator.

In order to use this package, we first define what are called protected variables: input features that should not unduly bias our model. Here, we set a person’s SEX and MARRIAGE (marital status) to be protected variables. Within each protected variable, we also choose the privileged group: for the SEX variable, this is the “male” segment, and for the MARRIAGE variable, the “married” segment.

p_var = ['SEX', 'MARRIAGE']
p_grp = {'SEX': [1], 'MARRIAGE':[1]}

After initializing an MLflow run with mlflow.start_run(), we pass the training data, protected variables and groups, and our model into the Veritas ModelContainer object. Then, we specify that we want to use this container for a credit scoring evaluation, and pass the metrics that we want to focus on into the CreditScoring object.

cre_sco_obj = CreditScoring(model_params=[container], fair_threshold=0.43, fair_concern="eligible", fair_priority="benefit", fair_impact="significant", perf_metric_name="balanced_acc", fair_metric_name="equal_opportunity")
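For context, here is a rough sketch of how the container might be assembled inside the MLflow run before the CreditScoring object above is created. The ModelContainer keyword names follow the Veritas Toolkit examples and may differ between versions of the package, and the train/test variables are placeholders:

import mlflow
from veritastool.model import ModelContainer  # module path may vary by veritastool version

with mlflow.start_run(run_name="credit_scoring_fairness"):
    # Bundle the data, protected variables/groups and the fitted model for Veritas
    container = ModelContainer(
        y_true=y_test,        # ground-truth default labels for the test set
        y_pred=y_pred,        # model predictions
        y_prob=y_prob,        # predicted probabilities
        y_train=y_train,
        x_train=x_train,
        x_test=x_test,
        p_grp=p_grp,          # privileged groups defined above
        model_object=model,   # the fitted classifier
        model_type="credit",
        model_name="credit_scoring",
    )
    # ...the CreditScoring object shown above is then built from this container,
    # and its evaluation results are logged to the same run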

Calling the evaluate method on this CreditScoring object generates fairness-oriented reports and evaluations that we can save as MLflow artifacts for later analysis.

cre_sco_obj.evaluate()
cre_sco_obj.tradeoff(output=False)
mlflow.log_artifact("/dbfs/FileStore/fairness_artifacts/cre_sco_obj.pkl")
mlflow.log_dict(cre_sco_obj.fair_conclusion, "fair_conclusion.json")
mlflow.log_dict(cre_sco_obj.get_fair_metrics_results(), "fair_metrics.json")
mlflow.log_dict(cre_sco_obj.get_perf_metrics_results(), "perf_metrics.json")
mlflow.log_dict(get_tradeoff_metrics(cre_sco_obj), "tradeoff.json")

So, at this point, we’ve managed to cover the reproducibility aspect of a production-ready credit scoring system. Having a reproducible system means that our model training process is transparent and auditable to regulators.

But the productionisation process doesn’t end at the model training stage. At this point, a Data Scientist or Machine Learning Engineer will typically transition a model into a staging environment and run checks and validation there before transitioning the model into production. For our application, we want to focus on checks related to the fairness of our model. Checking our model before it is deployed helps us avoid inadvertently releasing a model that disadvantages certain groups.

To automate this process, we make use of Databricks Model Registry webhooks.

Job and HTTP Webhook setup

Job webhooks allow us to trigger a notebook job when a model transition request is made in the MLflow Model Registry.

In our case, we set up an HTTP webhook that sends us a Slack notification when a transition request is made. We also set up a job webhook that triggers a notebook job to generate a fairness report.
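As a sketch, webhooks like these can be registered with the databricks-registry-webhooks client; the job ID, workspace URL, access token, and Slack URL below are placeholders:

from databricks_registry_webhooks import RegistryWebhooksClient, JobSpec, HttpUrlSpec

webhooks_client = RegistryWebhooksClient()

# Job webhook: run the fairness validation notebook job whenever a
# transition request is created for this model
job_webhook = webhooks_client.create_webhook(
    model_name="credit_scoring",
    events=["TRANSITION_REQUEST_CREATED"],
    job_spec=JobSpec(
        job_id="<fairness-validation-job-id>",
        workspace_url="https://<workspace-url>",
        access_token="<token>",
    ),
    description="Trigger fairness validation notebook",
    status="ACTIVE",
)

# HTTP webhook: post a notification to a Slack incoming-webhook URL
http_webhook = webhooks_client.create_webhook(
    model_name="credit_scoring",
    events=["TRANSITION_REQUEST_CREATED"],
    http_url_spec=HttpUrlSpec(url="https://hooks.slack.com/services/<...>"),
    description="Notify Slack of transition requests",
    status="ACTIVE",
)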

Transition (UI-based)

Now that we have trained a model, registered it, and set up our webhooks, we will proceed to make a request to transition the model to Production.
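We make this request through the Model Registry UI, but it can also be made programmatically. Here is a rough sketch against the Databricks transition-requests REST endpoint, with the workspace URL, token, model name, and version as placeholders:

import requests

host = "https://<workspace-url>"
token = "<personal-access-token>"

response = requests.post(
    f"{host}/api/2.0/mlflow/transition-requests/create",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "credit_scoring",  # registered model name (placeholder)
        "version": "1",            # model version to transition (placeholder)
        "stage": "Production",
        "comment": "Requesting promotion after fairness review",
    },
)
response.raise_for_status()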

This should then trigger our fairness validation notebook to run as an automated job.
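Inside the triggered notebook, we need to know which model and version to validate. A minimal sketch of reading the webhook payload, assuming the job passes it to the notebook as an event_message parameter (the exact payload keys depend on the event type):

import json

# The job webhook passes its payload to the triggered notebook as a job
# parameter; here we assume it arrives under the name "event_message"
event = json.loads(dbutils.widgets.get("event_message"))

# Fields such as these are then used by the validation checks further down
model_name = event.get("model_name")
version = event.get("version")
to_stage = event.get("to_stage")
print(model_name, version, to_stage)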

Now I’m going to put myself in the shoes of a model approver, who has to validate the testing notebook’s output. For example, our notebook can output various performance metrics for evaluation.

Fairness validation notebook job

So let’s look a little deeper into the fairness evaluations we can conduct for our model.

Parity

We can analyse our model based on certain fairness metrics such as demographic parity, equal opportunity, and equal odds. Here, we look at the differences in these metrics for the underprivileged group relative to the privileged group.

Although we evaluate this chart manually here, a portion of this validation process can be automated. For example, we can set tests to make sure that our parity measures do not exceed a certain threshold, and save the results of these tests as MLflow model version tags that we can refer to inside the MLflow Model Registry UI.

# `pvars` holds the protected variables and `df` the parity results
# computed by the fairness validation notebook
for p in pvars:
    # the check passes when every parity measure for this protected
    # variable stays within the 0.1 threshold
    if all(df[df['protected_var'] == p]['parity'] <= 0.1):
        client.set_model_version_tag(name=dict["model_name"], version=dict["version"], key=f"parity_check_{p}", value=1)
        print(f"Parity checks passed for {p}")
    else:
        client.set_model_version_tag(name=dict["model_name"], version=dict["version"], key=f"parity_check_{p}", value=0)

Performance-fairness tradeoff

What tradeoffs in performance might we expect if we calibrate our model according to fairness constraints? These are reports that we can surface to our business SMEs.
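For example, a reviewer could pull the tradeoff report we logged earlier back out of the MLflow run for discussion. A minimal sketch, with the run ID as a placeholder:

import json
from mlflow.tracking import MlflowClient

client = MlflowClient()
run_id = "<run-id-of-the-training-run>"

# Download the tradeoff report that was logged with mlflow.log_dict earlier
local_path = client.download_artifacts(run_id, "tradeoff.json")
with open(local_path) as f:
    tradeoff = json.load(f)

print(tradeoff)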

Finally, we save metadata around our model’s deployment history to a separate table so we can track this information should we need to audit the model.
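A minimal sketch of what this could look like, appending one row per deployment event to a Delta table (the table name and columns are illustrative):

from datetime import datetime
from pyspark.sql import Row

deployment_record = Row(
    model_name="credit_scoring",
    version="1",
    stage="Production",
    deployed_at=datetime.utcnow().isoformat(),
    approved_by="model_approver@example.com",
)

# Append to an audit table so every deployment decision stays queryable
(spark.createDataFrame([deployment_record])
    .write.format("delta")
    .mode("append")
    .saveAsTable("credit_scoring_audit.deployment_history"))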

Conclusion

There are many methods for measuring and mitigating bias in machine learning systems, something we have also addressed in another post. However, additional challenges remain: making sure a system’s fairness metrics are quantified and closely tracked during training and pre-deployment, and making sure the entire model lifecycle is reproducible and auditable. Only when these challenges are addressed can we move towards machine learning systems that are fair towards the groups to which they are applied.
