How MLflow Traces allow me to improve my NBA stats SQL Agent

littlereddotdata
4 min read · Jul 24, 2024


💻 code for this article is here


It’s quite common to use unstructured data to augment the contextual knowledge of a Large Language Model. But it’s less common to do the same with structured data. I got curious about this, and so built a SQL Agent to answer questions about NBA games based on NBA game statistics.

Sports data analysis is actually a really rich field! One site I drew inspiration from is https://sltsportonomix.substack.com/p/beyond-the-court-assessing-player, which has a few really in-depth posts on how data can “decode the dynamics of comeback victories” and provide new ways of understanding a player’s contribution to their team.

My use of the data is much simpler. I’m sure if I were a basketball enthusiast I would be able to ask many more interesting questions about the data. But we’ll leave that for another day.

The Agent Itself

I used Langchain to build the Agent and logged the Agent to MLflow for further analysis. The code for that is here, and there is more information on using MLflow with Langchain. But the focus here will be on the subsequent step of analysing the Agent’s performance.
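For orientation, here is a minimal sketch of what that step can look like, assuming a SQLite copy of the games table and an OpenAI chat model (both placeholders, not the article’s exact setup; the linked repo has the real code). Depending on your MLflow version, logging an AgentExecutor directly may need extra handling.

import mlflow
from langchain_community.agent_toolkits import create_sql_agent
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI

# Placeholder database URI and LLM; swap in your own connection and model
db = SQLDatabase.from_uri("sqlite:///nba_games.db")
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Build a ReAct-style SQL Agent from the Langchain SQL toolkit
agent = create_sql_agent(llm=llm, db=db, verbose=True)

# Log the Agent so it can be loaded later for evaluation
with mlflow.start_run() as run:
    mlflow.langchain.log_model(agent, artifact_path="chain")
    mlflow_run_id = run.info.run_id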

Evaluation Dataset

We can’t analyse performance without an evaluation dataset, so here is one. The evaluation dataset follows the schema expected by Mosaic AI Agent Evaluation. It’s a schema that’s really useful for structuring how we think about LLM evaluation: specifically, that we need not only questions and answers, but also additional context for those answers. For unstructured data, this could be the documents we retrieve from a vector database. For structured data (our case), it is the actual SQL query we would have needed to answer a question about the data. For this particular use case, we also add an extra field, difficulty, to the schema so we can evaluate easier and harder questions separately.

eval_data = [
    {
        "request": "How many away games did the Boston Celtics win in 2023",
        "difficulty": "easy",
        "expected_response": "Boston Celtics won 20 away games in 2023",
        "expected_retrieved_context": [
            {
                "doc_uri": """SELECT count(season_id) as total_wins, year(game_date) as year
                FROM nba_games
                WHERE wl_away = "W" and team_name_away = "Boston Celtics" and year(game_date) = 2023
                GROUP BY year(game_date)"""
            },
        ],
    },
]

Starting the MLflow run for evaluation:

# Assumed step: build the eval_df DataFrame from the eval_data list above
import pandas as pd

eval_df = pd.DataFrame(eval_data)

with mlflow.start_run(run_id=mlflow_run_id):
    # Evaluate
    eval_results = mlflow.evaluate(
        data=eval_df,
        model=f"runs:/{mlflow_run_id}/chain",
        model_type="databricks-agent",
    )

Analyzing Traces

MLflow Traces logs all the Langchain calls that the Langchain SQL Agent Executor makes to answer our question. We can parse these Traces and extract the outputs of the ReActSingleInputOutputParser calls to see the sequence of steps our Agent went through to answer our question. To help with parsing, we define an AgentAnalyzer() class that contains helper methods for our analysis. The final outputs can be saved into a pandas DataFrame, from which we can see which tools the Agent selected as well as the Agent’s reasoning for selecting each tool.
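As a rough illustration (not the repo’s actual AgentAnalyzer code), the MLflow tracing client can pull these spans out of each trace; the experiment ID and the span-name filter below are placeholder assumptions.

import pandas as pd
from mlflow import MlflowClient

client = MlflowClient()

# Pull the traces logged during evaluation (placeholder experiment ID)
traces = client.search_traces(experiment_ids=["<experiment_id>"])

rows = []
for trace in traces:
    for span in trace.data.spans:
        # Keep the output-parser spans, which record the Agent's chosen
        # action and its reasoning at each step
        if "OutputParser" in span.name:
            rows.append(
                {
                    "request_id": trace.info.request_id,
                    "span_name": span.name,
                    "outputs": span.outputs,
                }
            )

steps_df = pd.DataFrame(rows)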

Is our Agent able to select tools appropriately?

🔍 In this example, we can see that the SQL Agent uses the tools available to it in a logical sequence. It checks what tables are available with sql_db_list_tables, gets the table schema with sql_db_schema, then crafts a SQL query, checks it, and runs it.
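Continuing the same sketch, the tool spans in each trace give the order in which the Agent called its tools:

# Print, per request, the sequence of tool spans in the trace
for trace in traces:
    tool_sequence = [
        span.name for span in trace.data.spans if span.span_type == "TOOL"
    ]
    print(trace.info.request_id, "->", tool_sequence)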

How many times does our Agent need to call a tool before getting an answer?

Sometimes, when an Agent is unable to construct a valid SQL query or find the right table to query, it can end up in a loop where it continually queries the data until a timeout happens. That doesn’t seem to be the case here, though. From the correctness feedback given by the LLM-as-a-judge from MLflow Evaluate, we can see that the number of steps taken by the LLM is fairly even across requests, and there aren’t really any calls that take many steps.
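One quick way to check this from the trace-parsing sketch above is to count the parser steps per request (this relies on the hypothetical steps_df built earlier):

# Count intermediate steps per request using the steps_df built earlier
step_counts = steps_df.groupby("request_id").size().rename("n_steps")
print(step_counts.sort_values(ascending=False))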

Does the Agent’s final output make sense?

Although in this case we only have 3 questions, if we were to keep changing chain parameters such as the LLM model and the prompt, manually evaluating the LLM output would take a lot of time. So instead of manual evaluation, we can extract the response/llm_judged/correctness/rating and response/llm_judged/correctness/rationale columns from the LLM-as-a-judge evaluation to check the correctness of the Agent’s final output. Here, we see room for improvement. Although our previous analyses show that the Agent is able to call the right tools to answer our question, the final output is still incorrect 2 out of 3 times. This shows us we may need to provide more context to our model, either by using the prompt to elaborate on the columns the model should query or by providing the model with some example queries.
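As a sketch, these columns can be read off the eval_results object returned by mlflow.evaluate; the "eval_results" table name below is the one Agent Evaluation typically exposes, so treat it as an assumption.

# Per-question judge verdicts from the evaluation results
results_df = eval_results.tables["eval_results"]
print(
    results_df[
        [
            "request",
            "response/llm_judged/correctness/rating",
            "response/llm_judged/correctness/rationale",
        ]
    ]
)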
