Metrics that matter in production for LLMs: a simple framework

littlereddotdata
8 min read · Sep 17, 2024

The difference between a prototype and an actual application

It only takes a developer an afternoon of pecking away at a laptop to create an LLM chatbot that answers questions over a set of documents. But it can take months for this prototype to be ready for end-users, with the safety filters, latency guarantees and privacy guardrails that users need. Moving from “prototype to production” (or implementing LLMOps) can be painfully slow. Often it is slow not because the work itself is time-consuming, but because it takes teams a long time to understand and plan for all the different components involved. This post outlines a framework that can help team members align on and prioritize the LLMOps metrics and processes that really matter.

How different is LLMOps from MLOps?

  • LLM systems are usually compound systems, meaning that they include different moving parts that need to be managed together. While a traditional ML system might involve pre-processing and post-processing steps that can be bundled into a single model call, an LLM system might involve multiple chains with nested logic, or an agent framework that uses different tools depending on a user’s request.
  • LLM outputs are open-ended. Traditional machine learning applications might involve tagging a post with appropriate topics (multi-label classification) or forecasting future prices of a product (regression). In contrast, LLMs deal mainly in natural-language inputs and outputs, and might give different answers to the same question that nonetheless have similar meanings. This means that a ground truth is usually not the only “right” answer.
  • LLM evaluations are usually subjective. While summarization, entity extraction and translation use cases can make use of traditional numeric metrics such as ROUGE scores, precision, recall and BLEU scores, other, more modern LLM applications such as generating SQL from natural language questions do not lend themselves to numeric scores. For example, there can be more than one way of writing a SQL statement that gives the same dataframe output (one way of handling this case is sketched after this list). Here, we may need to handle the subjectivity by bringing a human evaluator into our development loop and/or using an LLM judge.
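
For the SQL case, one option is to compare the generated query with a reference query by executing both and comparing the rows they return, rather than comparing the SQL text itself. Below is a minimal sketch that uses an in-memory SQLite table as a stand-in for a real warehouse:

```python
import sqlite3

def results_match(conn, generated_sql: str, reference_sql: str) -> bool:
    """Compare two SQL statements by the rows they return, not their text."""
    generated_rows = conn.execute(generated_sql).fetchall()
    reference_rows = conn.execute(reference_sql).fetchall()
    # Sort so that equivalent queries with different row ordering still match.
    return sorted(generated_rows) == sorted(reference_rows)

# Toy example: two differently written queries that return the same data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

generated = "SELECT id, amount FROM orders WHERE amount >= 10"
reference = "SELECT id, amount FROM orders WHERE NOT amount < 10"
print(results_match(conn, generated, reference))  # True
```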

A framework categorizing metrics that matter for LLMOps

Model quality metrics

Not all metrics are created equal. When developing an LLM application, our first priority is optimizing for quality. Without reaching a standard of quality that our users can accept, we cannot begin to optimize for other metrics such as latency and safety.

How we define quality depends on the type of application we are building. For RAG applications, we would be interested in metrics such as the precision and recall of our document retrieval step and how grounded our LLM’s responses are in the context that is retrieved. For summarization use cases, we may define a custom quality scale from 1–5 and use an LLM judge to rate the quality of our summarizations. Since an LLM’s outputs are usually open-ended, it becomes imperative that we have an evaluation set with our expected inputs, outputs and (if necessary) relevant documents to benchmark against. Also, although LLM judges are useful for scaling our evaluations to many examples, human feedback is still crucial for judging quality.
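
As a minimal sketch of the retrieval side of this, precision and recall can be computed against a small, hand-labelled evaluation set. The questions, document IDs and the retrieve function below are hypothetical placeholders:

```python
# Hand-labelled evaluation set: each question lists the documents that should be retrieved.
eval_set = [
    {"question": "What is our refund policy?", "relevant_docs": {"doc_12", "doc_47"}},
    {"question": "How do I reset my password?", "relevant_docs": {"doc_03"}},
]

def retrieval_metrics(retrieve, eval_set, k: int = 5):
    """Average precision and recall of the retrieval step over the evaluation set.

    `retrieve` is a hypothetical stand-in for the RAG system's retrieval function;
    it should return the IDs of the top-k documents for a question.
    """
    precisions, recalls = [], []
    for example in eval_set:
        retrieved = set(retrieve(example["question"], k=k))
        relevant = example["relevant_docs"]
        hits = len(retrieved & relevant)
        precisions.append(hits / max(len(retrieved), 1))
        recalls.append(hits / len(relevant))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Usage (with your own retriever):
# precision_at_5, recall_at_5 = retrieval_metrics(my_retriever, eval_set)
```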

Operational metrics (e.g. latency and throughput)

Once we have a quality application, then we can begin to optimize for operational metrics such as latency and throughput. Optimizing for latency may involve reducing the number of steps our model needs to take to reach a response, or cutting down on specific steps that bottleneck the pipeline (such as a lengthy database call). During prototyping, we typically do not focus on latency bottlenecks related to request load or network latency, as this will be something we address as part of the production process.
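
A minimal sketch of how per-step latency can be surfaced during prototyping follows; the pipeline steps are placeholders standing in for a real retrieval call and LLM call:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(step_name: str):
    """Record wall-clock time per pipeline step so bottlenecks are visible."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step_name] = time.perf_counter() - start

# Placeholder steps standing in for a real retrieval call and LLM call.
def search_vector_store(question):
    time.sleep(0.2)  # simulate a slow database call
    return ["doc_12"]

def call_llm(question, docs):
    time.sleep(0.1)  # simulate model inference
    return "placeholder answer"

with timed("retrieval"):
    docs = search_vector_store("What is our refund policy?")
with timed("generation"):
    answer = call_llm("What is our refund policy?", docs)

print(timings)  # e.g. {'retrieval': 0.2, 'generation': 0.1}
```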

Safety metrics

Safety, though important, also comes after quality. If we are referring to safety in terms of a model’s responses, then we may implement a topic filter or content moderation filter either as a pre-processing or post-processing step in our pipeline. A filter acting as a pre-processor might filter out irrelevant or toxic questions from users; a filter included as part of a post-processing step might also filter out undesirable responses, but from the model instead of the user.

In these cases, we may assign a toxicity score or another metric (perhaps professionalism) to our inputs and outputs, and flag instances where this score exceeds a certain threshold.
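
A minimal sketch of such a threshold-based filter follows; the scoring function here is a deliberately crude placeholder for whatever toxicity classifier or LLM judge is actually used:

```python
TOXICITY_THRESHOLD = 0.7  # tune on a labelled sample of real traffic

def score_toxicity(text: str) -> float:
    """Placeholder: swap in a real toxicity classifier or LLM judge."""
    blocked_terms = {"idiot", "stupid"}
    return 1.0 if any(term in text.lower() for term in blocked_terms) else 0.0

def moderate(text: str, stage: str) -> str:
    """Flag and replace text whose toxicity score exceeds the threshold."""
    score = score_toxicity(text)
    if score > TOXICITY_THRESHOLD:
        # Log the flagged instance for review, then return a safe fallback.
        print(f"flagged ({stage}): score={score:.2f}")
        return "I'm sorry, I can't help with that request."
    return text

user_question = moderate("Why is this product so stupid?", stage="pre-processing")
model_answer = moderate("Here is how the warranty works...", stage="post-processing")
```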

Safety may not only refer to questions and generated responses. We will also want to ensure that our data is safe — this means that a model may not be allowed to use Personally Identifiable Information (PII) when generating a response, or that only certain users have access to a model. For this aspect of safety, we would rely on data access controls through a governance tool. We may also want to proactively quarantine sensitive data when it appears upstream in our pipeline so downstream systems are not affected.
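
As a rough illustration of the PII angle, obvious patterns can be masked before text reaches the model or a logging table. The regexes below are illustrative only; a production system would rely on a proper PII scanner and governance tooling:

```python
import re

# Illustrative patterns only; production systems should use a dedicated PII scanner.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def redact_pii(text: str) -> tuple[str, bool]:
    """Mask PII and report whether anything was found, so the record can be quarantined."""
    found = False
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found = True
            text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text, found

clean_text, had_pii = redact_pii("Contact me at jane.doe@example.com or +65 9123 4567")
```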

Usage and cost metrics

While the previously mentioned metrics are usually the responsibility of a team’s data scientists, usage and cost metrics are different: they are typically managed by a platform engineer or governance team instead. Here, what matters is that costs are attributable to a specific business unit, and that usage does not exceed limits set by the organization. A bill that comes in higher than expected is an unpleasant surprise. To prevent this, platform administrators and governance teams may implement rate limits or cost alerting per business team to manage resources prudently.
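
A minimal sketch of attributing token usage to a business unit and enforcing a simple budget follows; the prices and limits are made-up numbers:

```python
from collections import defaultdict

# Made-up numbers for illustration: price per 1K tokens and monthly token budgets per unit.
PRICE_PER_1K_TOKENS = 0.002
MONTHLY_TOKEN_BUDGET = {"marketing": 5_000_000, "support": 20_000_000}

usage_by_unit: dict[str, int] = defaultdict(int)

def record_usage(business_unit: str, tokens: int) -> None:
    """Attribute token usage to a business unit and enforce its budget."""
    usage_by_unit[business_unit] += tokens
    if usage_by_unit[business_unit] > MONTHLY_TOKEN_BUDGET.get(business_unit, 0):
        # In practice this would alert the platform team or throttle requests.
        raise RuntimeError(f"{business_unit} exceeded its monthly token budget")

def estimated_cost(business_unit: str) -> float:
    return usage_by_unit[business_unit] / 1000 * PRICE_PER_1K_TOKENS

record_usage("support", 12_000)
print(f"support spend so far: ${estimated_cost('support'):.4f}")
```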

How to iterate on metrics

Available tools

With an open-source machine learning framework such as MLflow, evaluation can be managed with the MLflow Evaluate library. With MLflow Evaluate, we can define custom metrics such as a bespoke toxicity score, set up custom LLM judges, and present our results side-by-side to a human evaluator in the MLflow UI. If we are using Databricks’ Mosaic AI Agent Framework or another front-end tool such as Gradio, we can also create a UI for domain experts to provide feedback on the quality of our system’s responses.
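
Below is a hedged sketch of what this might look like with MLflow’s LLM evaluation support. The column names, judge endpoint and the RAG chain are assumptions, and exact argument names can vary across MLflow versions:

```python
import mlflow
import pandas as pd
from mlflow.metrics.genai import make_genai_metric

# Hypothetical evaluation set: questions plus reference answers.
eval_df = pd.DataFrame({
    "inputs": ["What is our refund policy?"],
    "ground_truth": ["Refunds are available within 30 days of purchase."],
})

# A bespoke LLM-judged metric graded on a 1-5 scale; the judge endpoint is a placeholder.
professionalism = make_genai_metric(
    name="professionalism",
    definition="How professional and appropriate the tone of the answer is.",
    grading_prompt="Score 1 (unprofessional) to 5 (fully professional).",
    model="endpoints:/my-judge-endpoint",
    greater_is_better=True,
)

def my_rag_chain(inputs: pd.DataFrame) -> list[str]:
    """Placeholder for the real RAG chain being evaluated."""
    return ["Refunds are available within 30 days of purchase."] * len(inputs)

results = mlflow.evaluate(
    model=my_rag_chain,
    data=eval_df,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[professionalism],
)
print(results.metrics)
```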

For identifying latency bottlenecks in our pipeline and for troubleshooting areas that need quality improvement, MLflow also offers MLflow Tracing. MLflow Tracing presents a visual breakdown of LangChain and LlamaIndex pipelines with a simple mlflow.langchain.autolog() or mlflow.llamaindex.autolog() one-liner. For more bespoke applications that use the more flexible mlflow.pyfunc flavor, we can also define custom spans and traces to be logged.
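
A short sketch of both styles of tracing follows, assuming a recent MLflow version with tracing support; the span names and placeholder logic are illustrative:

```python
import mlflow

# One-liner autologging for LangChain pipelines (LlamaIndex has an equivalent).
mlflow.langchain.autolog()

# For bespoke pipelines, spans can be defined explicitly; names here are illustrative.
@mlflow.trace(name="answer_question")
def answer_question(question: str) -> str:
    with mlflow.start_span(name="retrieval") as span:
        docs = ["doc_12"]  # placeholder for a vector store lookup
        span.set_inputs({"question": question})
        span.set_outputs({"n_docs": len(docs)})
    with mlflow.start_span(name="generation"):
        return "placeholder answer"  # placeholder for the LLM call

answer_question("What is our refund policy?")
```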

Flexibility is key

During the prototyping or POC stages of a project, it is the Data Scientist who will usually be iterating on metrics related to model quality and LLM application latency. At this stage, flexibility is key. Different models give different responses to similar questions; a data scientist will need to select the model that provides the best quality answers, for a reasonable latency and cost. To achieve this, all available tooling should be used, from traces, to scalable LLM judges, to human feedback.

Underpinning all of this work is the governance layer. The governance layer ensures data is protected, and that API usage and costs are kept under control.

Sample development setup using Databricks Mosaic AI Agent Framework and Databricks Mosaic AI Evaluation

Extending this framework to LLMOps

What LLMOps means

When we talk about “LLMOps”, it is usually in the context of how we want to “move from POC to production”. More concretely, what we usually mean is that we want a defined set of steps to migrate an application from our “development” environment to our “production” environment.

This migration matters because development environments do not have the service-level agreements, robust testing and governance guardrails in place that keep an application up and running without 404 errors, security vulnerabilities and latencies that are unacceptable to an end user. A development environment is flexible and useful for iterating quickly to reach our required quality bar. But once we reach this bar, we need to plan for how to make our application robust enough that it meets “production standards”.

In practice, turning a prototype into something of “production standard” might involve:

  • translating one-off notebook evaluation code into repeatable workflows that can run on schedule
  • implementing a data pipeline to log incoming requests and responses, with their corresponding traces
  • implementing a monitoring feature on logged requests and responses, and defining appropriate thresholds for important metrics such as latency
  • having a process for upgrading and / or replacing older models with newer, more performant ones (this could involve A/B testing two models with a traffic split between them; a simple sketch of such a split follows this list).
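
For the A/B testing item above, here is a minimal sketch of a weighted traffic split between a current and a candidate model. The model names and split are made up, and the model call is a placeholder:

```python
import random

# Made-up traffic split for a model upgrade: 90% to the current model, 10% to the candidate.
TRAFFIC_SPLIT = [("prod_model_v1", 0.9), ("candidate_model_v2", 0.1)]

def route_request(question: str) -> tuple[str, str]:
    """Pick a model for this request and tag the response so evaluations can compare arms."""
    models, weights = zip(*TRAFFIC_SPLIT)
    chosen = random.choices(models, weights=weights, k=1)[0]
    answer = f"[{chosen}] placeholder answer"  # placeholder for the real model call
    return chosen, answer

model_used, answer = route_request("What is our refund policy?")
```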

Where possible, we can also translate human feedback into automated checks. For example, if human feedback was used to evaluate whether a model’s responses were off-topic, we may automate this evaluation step with a traditional topic model that is faster to run than adding another LLM call.
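
As one hedged example, human “off-topic” labels can be used to train a lightweight classifier (here a toy scikit-learn pipeline standing in for a topic model) that runs on every response far more cheaply than an extra LLM call; the training examples are toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy examples of past responses labelled by human reviewers (1 = on-topic, 0 = off-topic).
texts = [
    "Your refund will be processed within 30 days.",
    "Our warranty covers manufacturing defects for one year.",
    "Here is my opinion on the latest election results.",
    "Let me tell you about a great movie I watched.",
]
labels = [1, 1, 0, 0]

# A cheap classifier that approximates the human "is this off-topic?" judgment.
topic_check = make_pipeline(TfidfVectorizer(), LogisticRegression())
topic_check.fit(texts, labels)

def is_on_topic(response: str) -> bool:
    return bool(topic_check.predict([response])[0])

print(is_on_topic("Refunds usually take two weeks to appear on your statement."))
```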

This is a sample process of what might happen in production, using the Databricks Mosaic AI tech stack as an example. The initial evaluations developed by the data scientist are still used, but this time they are defined as a repeatable workflow that evaluates our model’s responses to inputs on a set schedule.

Sample production setup using the Databricks platform

The role of the ML Engineer in LLMOps

The machine learning engineer is usually tasked with defining the assets that we outlined in the previous section (automated workflows, pipelines, logging tables). To make the process repeatable, these assets might be created with Infrastructure-as-Code tools such as Terraform. They may also have their configurations defined in YAML files and saved to version control, so that change management becomes easy via pull and merge requests.

Creating a flywheel

With our system handling user requests in a production environment, we are now in a position to create a “flywheel” that can lead to continuous system improvements. As user requests come in, we log the requests to a table, run evaluations on the model response quality and output our evaluation metrics to a dashboard. Over time, we gather more and more instances of edge cases, unexpected questions and also feedback on the overall user experience of our application.

User inputs can form the basis of an ever-expanding evaluation set to benchmark our system against. Unexpected inputs and edge cases allow us to add tests to our pre-deployment checks to increase our system’s robustness. User feedback allows us to redesign and rethink our interfaces so they are more user-friendly.

Conclusion

LLMOps is a tough topic mainly because it contains many moving parts. LLMOps covers many types of metrics, from quality metrics, operational metrics, to safety and also governance metrics. Additionally, orchestrating LLMOps well involves not only technology and tools, but people and processes as well. Having a framework to categorize the different metrics and stages of this process is a first step to making sense of this complexity.
