I built an awesome ML model with Spark, Delta and MLflow. How do I get the right people to use it?

littlereddotdata
3 min read · Feb 1, 2023


When we want to train machine learning models on datasets that are too large to fit on a single machine, MLflow, Spark and Delta are a great toolset. Delta gives us ACID transactions on data stored in our data lake, Spark lets us run distributed algorithms, and MLflow helps us track our experiments and select our best-performing model.
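As a minimal sketch of that training loop (the table path, feature columns and label are hypothetical placeholders), we might read a Delta table into Spark, fit a distributed model, and log it with MLflow:

```python
import mlflow
import mlflow.spark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Delta gives training a consistent, ACID-compliant snapshot of the data
train_df = spark.read.format("delta").load("/mnt/lake/customer_features")  # hypothetical path

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["recency", "frequency", "monetary"], outputCol="features"),
    LogisticRegression(labelCol="will_reorder", featuresCol="features"),
])

with mlflow.start_run():
    model = pipeline.fit(train_df)                         # Spark distributes the training
    mlflow.spark.log_model(model, artifact_path="model")   # MLflow tracks the fitted model
```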

But not all users or downstream applications operate within this ecosystem. A high-availability web service might need to read data from a low-latency NoSQL store. A business user may want to create reports in PowerBI. Even for a single downstream use case, there are often multiple connectors to choose from. Here’s a short decision tree to clarify our options!

Making model predictions in Delta available to downstream consumers

The first question we can ask is: are our consumers humans or machines? The difference is important because these two entities process information very differently! A human might need a dashboard with contextual information and visual coherence; a machine would need clear APIs for making requests and receiving responses.

In the former case, an example of a human consumer might be a marketing analyst. This analyst would want to plan a promotion campaign that targets customers who are likely to reorder high-value products. Another example of a human user would be an engineer in a power plant who wants to see which machines are predicted to need maintenance soon. For both these personas, our model’s predictions could be stored in Delta, and then presented in a DBSQL dashboard or connected to a visualization tool like PowerBI. With a well-designed dashboard, downstream human analysts are empowered to use our model’s output to make sound decisions.
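A minimal sketch of that batch-scoring step could look like the snippet below, assuming the model trained earlier is registered in MLflow under a hypothetical name and that the dashboard reads from a hypothetical predictions table:

```python
import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

customers = spark.read.format("delta").load("/mnt/lake/customer_features")  # hypothetical path

# Load the registered model as a Spark UDF so scoring runs as a distributed batch job
predict = mlflow.pyfunc.spark_udf(spark, "models:/reorder_propensity/Production")

scored = customers.withColumn("reorder_score", predict(*customers.columns))

# Write to the Delta table that the DBSQL or PowerBI dashboard queries
(scored.write
       .format("delta")
       .mode("overwrite")
       .save("/mnt/lake/reorder_predictions"))
```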

Our considerations will change if our downstream consumers are machine systems. In this scenario, it will be our system’s performance requirements that will guide our decisions. For example, we may be using our model to provide product recommendations via a mobile shopping app. Low latency is a must in this case — our app needs to surface recommendations swiftly to engage customers. Additionally, if there are peak shopping periods, the app needs to be able to serve a large number of recommendations at a time (throughput needs to be high).

Therefore, we need a backend database built for these kinds of workloads: ideally a transactional database that can handle row-level queries with low latency and high throughput. Although Delta is great at analytical, column-oriented queries, it may not be suitable here. Instead, a transactional database like Cosmos DB or DynamoDB might power the product recommender in our example.
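One way to bridge the two is to read the predictions out of Delta with Spark and push them into the transactional store. The sketch below assumes AWS credentials are configured and that a hypothetical DynamoDB table named product_recommendations, keyed by customer_id, already exists; for very large tables, a dedicated Spark-to-DynamoDB connector or a bulk import job would be a better fit:

```python
from decimal import Decimal

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
preds = spark.read.format("delta").load("/mnt/lake/reorder_predictions")  # hypothetical path

def write_partition(rows):
    # One DynamoDB client per Spark partition; batch_writer handles batching and retries
    table = boto3.resource("dynamodb").Table("product_recommendations")
    with table.batch_writer() as batch:
        for row in rows:
            batch.put_item(Item={
                "customer_id": row["customer_id"],
                # DynamoDB does not accept Python floats, so convert to Decimal
                "reorder_score": Decimal(str(row["reorder_score"])),
            })

preds.select("customer_id", "reorder_score").foreachPartition(write_partition)
```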

Another question we might consider is whether we need Spark to post-process our model’s predictions. This is usually a good idea if we have large amounts of data that would benefit from being processed in a distributed manner. If this is the case, then we can make use of the Delta Lake APIs (available in Python, Scala and Java) to read, write and otherwise handle our data. If we don’t need to create a Spark session (maybe our data can fit into a single machine’s RAM), then we can bypass Spark and consider a range of other native connectors.
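For the Spark route, a minimal sketch using the Delta Lake Python API might upsert each new batch of predictions into a serving table (the paths and join key are hypothetical):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Latest batch of predictions, e.g. written by the scoring job above
new_preds = spark.read.format("delta").load("/mnt/lake/reorder_predictions_staging")

target = DeltaTable.forPath(spark, "/mnt/lake/reorder_predictions")

# MERGE keeps the serving table current without rewriting unchanged rows
(target.alias("t")
       .merge(new_preds.alias("s"), "t.customer_id = s.customer_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```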

One native connector available is the Delta Rust API, which gives Rust developers “low-level access to Delta tables”. Developed by engineers at Scribd, the library exists because they did not want the overhead of creating a Spark session just to pass small amounts of data from Delta to their applications. The Rust API also has a Python wrapper, which means developers who prefer Python can also access Delta tables without Apache Spark.
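With the Python wrapper (the deltalake package), reading a small Delta table looks roughly like this; the path is a hypothetical placeholder, and note that the whole table is pulled into local memory:

```python
from deltalake import DeltaTable

# No Spark session needed: the Rust engine reads the Delta transaction log directly
dt = DeltaTable("/mnt/lake/reorder_predictions")
df = dt.to_pandas()  # fine for small tables; everything lands in local RAM

top_customers = df.sort_values("reorder_score", ascending=False).head(100)
```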

In a similar way, Java and Scala developers can use the Delta Standalone library to read from and write to Delta tables without spinning up a Spark cluster. Keep in mind, however, that this approach will be slower and less scalable than reading from one of the transactional databases we mentioned earlier.

The Delta Lake connector ecosystem is diverse. There are options for almost every popular programming language, as well as different deployment scenarios and different audiences. While this diversity is great, it can also be hard to navigate. Hopefully, this flowchart gives you a starting point for evaluating your options!
