Ideas for thinking about LLM quality
“Your question is not related to any of the datasets in my database. I am happy to help with questions related to sales and product, however. Feel free to ask me about those!”
This was a recent response I got from an LLM. I was stunned for a moment. I had come to expect long strings of text streaming onto the screen a few seconds after I asked a question. But after a while, I realised that this guardrail was probably a good idea. Outputs should be controlled and scoped properly.
And indeed, as more teams move from POC to production, quality guarantees are a must-have.
The Vanishing Gradients podcast recently covered this topic with a series of episodes featuring well-known practitioners and academics such as Hamel Husain and Shreya Shankar (full links to all the resources at the end of this post). The episode with Shreya Shankar in particular was very informative. After listening to the podcast, I decided to take a closer look at the two papers discussed.
https://vanishinggradients.fireside.fm/32
This post offers a summary of those two papers and some takeaways for those interested in LLM quality:
- SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preference
Paper Summaries
SPADE starts with the premise that, during the prompt engineering process, developers take a prompt and then add additional instructions to it based on what they care about. For example, adding the instruction, “always address the user politely” may mean that professionalism is important. Therefore, there should be a test that measures the professionalism of every LLM response.
Using this insight, the paper proposes an algorithmic process for automatically creating quality assertions based on how prompts are elaborated on, corrected and changed (something the authors call “prompt deltas”).
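To make this concrete, here is a minimal sketch (my own illustration, not code from the paper) of what a synthesized assertion for the delta “always address the user politely” might look like. The `call_llm` function is a hypothetical stand-in for whatever LLM client you use as a judge.

```python
# Hypothetical sketch of an assertion synthesized from a prompt delta.
# Not taken from the SPADE codebase; names and prompts are illustrative.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM provider's client."""
    raise NotImplementedError("Replace with a real LLM call")

def assert_polite_tone(response: str) -> bool:
    """Ask an LLM judge whether the response addresses the user politely."""
    judge_prompt = (
        "Answer only 'yes' or 'no'. Is the following response polite "
        f"and professional in tone?\n\n{response}"
    )
    return call_llm(judge_prompt).strip().lower().startswith("yes")
```

The point is simply that each instruction a developer adds to a prompt implies a check like this that can be run against every response.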
The SPADE paper focuses on the technical underpinnings of automating quality checks for LLM applications. A later paper, Who Validates the Validators, takes an HCI perspective on the problem by discussing how users interact with an application to iteratively improve generated assertions.
This paper observes the process that study participants go through to iteratively select and refine tests using a tool the paper’s authors developed (rather than relying entirely on an automated process, as the SPADE paper does). Through these observations, the paper then gives some insights into how developer tools should be designed to ensure application quality.
Takeaways
Taking an empirical approach
What I found most valuable from these papers was their empirical approach to quality assertions. The authors didn’t just formulate an algorithm or build a tool based on what they thought would be important to users. Rather, they observed existing prompts and user behaviour, and then designed their tools accordingly.
Categorizing prompt deltas
In fact, it was super helpful to see the SPADE paper categorize the different instructions that people include in their prompts. In short, amongst the prompts analysed, ~35% of the changes were “structural”, while ~65% were “content-based”. Structural changes are relatively minor changes that do not alter the task definition (for example, returning a response in JSON format, or requesting newlines between statements). Content-based changes alter the task definition, for example outlining the sequence of steps an LLM should take (do A, then do B) or giving instructions such as “maintain a professional tone”. Knowing overall how prompts are structured across applications and developers can eventually help with better prompt design.
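As an illustration of the two categories (again my own sketch, not examples taken from the paper), a structural delta like “return your answer as JSON” maps to a simple format check, while a content-based delta like “do A, then do B” maps to a check on the task itself:

```python
import json

def assert_valid_json(response: str) -> bool:
    """Structural: the delta 'return your answer as JSON' implies a format check."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def assert_steps_in_order(response: str, steps: list[str]) -> bool:
    """Content-based: the delta 'do A, then do B' implies the named steps
    should appear in the response, in the stated order."""
    positions = [response.find(step) for step in steps]
    return all(p >= 0 for p in positions) and positions == sorted(positions)
```

Structural checks are usually cheap, deterministic code like the above; content-based checks often need an LLM judge, as in the earlier sketch.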
The iterative process of prompt refinement
Another insight from the Who Validates the Validators paper is the observation that “users need criteria to grade outputs, but grading outputs helps users define criteria”.
The analogy I can think of is writing: you need a plan to write an article, but the process of writing the article itself helps you refine your plan. Criteria, then, start off as something like rough sketches that are progressively filled out and colored in.
Human judgement plus automation works better than automation alone
It would save developers a lot of time if the entire quality assertion process could be fully automated (much as developer time would be saved if software tests could write themselves). But unfortunately, full automation still falls short of a hybrid human + automated approach. For instance, for one particular pipeline, the study authors observed that a hybrid approach produced an assertion set less than half the size of the fully automated one, and increased coverage from 49% to 73%.
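To make those numbers a bit more concrete, here is a rough sketch of the kind of coverage metric this comparison implies: the fraction of known-bad outputs that at least one assertion in the set flags. The paper’s exact definition may differ; treat this as an assumption for illustration.

```python
from typing import Callable

def coverage(assertions: list[Callable[[str], bool]], bad_outputs: list[str]) -> float:
    """Fraction of known-bad outputs flagged (i.e. failed) by at least one assertion.

    Assumes each assertion returns True for a passing output and False for a failing one.
    """
    if not bad_outputs:
        return 1.0
    caught = sum(
        any(not check(output) for check in assertions)
        for output in bad_outputs
    )
    return caught / len(bad_outputs)
```

Under a metric like this, a smaller assertion set that catches more of the known failures is strictly preferable, which is why the hybrid result is notable.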
Does this smell like data labelling?
The whole interactive process of allowing developers to iteratively optimise is quite reminiscent of UIs for data labelling. But instead of correcting outputs directly, we are correcting the functions and criteria that check outputs. It’s like if Great Expectations had a user interface that intelligently surfaced rules for you to optimise.
Conclusion
By themselves, these papers don’t give recommendations for what to do to improve an LLM system (for example, ways of improving the prompt, or how to judge the context provided by a retrieval step). But they are a great starting point for thinking systematically about how to measure and assure the quality of LLM applications.
Resources
- SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preference
- https://www.youtube.com/watch?v=eGVDKegRdgM
- https://www.sh-reya.com/blog/ai-engineering-flywheel/
- https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/
- https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-ii/