A simple playbook for evaluating LLMs for your language-specific use case

littlereddotdata
10 min read · May 30, 2023


code for this post can be found here

Why non-English Natural Language Processing is hard

My favourite part about living in South-East Asia is its cultural diversity. Each country, from Malaysia and Singapore to Thailand, Indonesia and Vietnam, has its own food, ways of living and culture. This variety is great. Sometimes, though, language barriers can be tricky. Imagine asking for directions when travelling in a country where you do not speak the language. In fact, issues with language apply not only when humans go travelling, but also when we build natural language applications meant for non-English speakers!

Photo by Andrew Ly on Unsplash

Non-English languages pose several challenges for machine learning models, even sophisticated ones like the Large Language Models (ChatGPT, Alpaca and BLOOM) that are getting so much attention.

Firstly, the number of languages that exist is simply enormous; no single model or dataset can cover them all. For example, the creators of the ROOTS corpus put a lot of thought into the languages represented in their data (there are 46 in total, and you can search for each here). Even so, it is quite likely that someone will not find their language of interest in the data (there is no Japanese or Korean, for example). The same goes for the SEAHORSE dataset, a multilingual dataset for evaluating the quality of machine-generated summaries. While the authors took care to ensure a diverse representation of languages, only six languages are covered. It is more likely than not that one's specific language of interest is not present in the data.

*note: for simplicity, we are not considering claims about how well a model is able to generalise to unseen languages. We just consider whether or not a language has been explicitly accounted for.

Secondly, a non-English NLP practitioner doesn't just have to consider his or her language of interest. Among the many LLMs available now, some work better for certain applications than others. For instance, the first iteration of the 176B-parameter BLOOM model was an autoregressive LLM, which means it was trained to predict the next word in a sentence.

"The cats ate a can of" <fill-in-the-blank> # complete this sentence

Later, BigScience released the BLOOMZ & mT0 family of models, which have been fine-tuned to follow human instructions and to generalise to unseen languages.

example instruction: "Describe the sound an owl makes."
output: "The sound that an owl makes is commonly known as 'hoot.' However, it's important to note that different owl species produce varying vocalizations, and not all of them hoot. The classic hooting sound associated with owls is typically produced by the larger species."

While the first model is suitable for free-form text generation, it would not be suitable for use cases that rely on a user's instructions, for example responding to a prompt such as "based on Wikipedia, tell me what is Charles Dickens's most famous novel". To compound the problem, even if an LLM has not been trained specifically to handle multiple languages, it has probably seen non-English data during training. As a result, it is not always clear upfront what cross-lingual capabilities a model will have.

Implications for practitioners

If your users' native language is not English and you would like to build an NLP-powered application for them, it can be hard to predict upfront whether a particular model will perform satisfactorily. Will it respond in English when asked a question in Indonesian? Will it respond in the correct language, but give an answer that is not relevant? And if one model does not give us the quality we are looking for, should we invest effort into improving it, or try a different one? In these cases, development is not as simple as downloading a model from HuggingFace, integrating it with a suitable langchain Agent and rolling it out to users. Before starting these steps, we first need a protocol for quickly evaluating whether a model (usually an LLM nowadays) will provide the quality we need.

This is what the following section will cover.

Simple playbook for evaluating LLMs for your language-specific use case

  1. Clearly define your application domain and use case — even the best models on a leaderboard may not perform well in a niche context

When a model is first introduced in a research paper, it is usually evaluated against recognised benchmarks. The benefit of benchmarks such as the Massive Multitask Language Understanding (MMLU) benchmark is that they aim for both depth and breadth. The drawback is that a model can report high overall average scores yet still underperform in the application domain we are interested in.

For example, even though LLMs now show unprecedented abilities to answer questions on topics ranging from math to code to literature, the creators of the MMLU benchmark report that even state-of-the-art models can show near-random accuracy on topics such as law. If we were building a legal-tech application, we would rightly be concerned. Therefore, in the interest of being properly critical, we first need to define what domain we are working in (whether it is manufacturing, luxury retail or software engineering). Then, we need to consider whether certain specifics of our domain (for example, a high amount of jargon) might influence our model's performance.

Indeed, the details become even more important when we drill down from the domain level to the use-case level. A common general benchmark in multilingual NLP is the Wino-X schema, which tests for multilingual commonsense inference. In other words, the task tests whether a model understands that a statement such as "a woman is typing" naturally follows from a statement such as "a woman is sitting in front of a computer". While this schema is a good starting point for evaluating machine intelligence, once we consider the specifics of our industry use case, we may notice important caveats.

For instance, we may have business users in our organisation who do not understand SQL. We would like to build an application where these users can key in their questions in their native language, such as Mandarin, and the application will translate these questions into SQL and return the results. At this point, we have at least three aspects of language to consider:

  • the user's native language
  • the SQL query language (which is closest to English)
  • SQL table metadata (this may be a combination of metadata expressed in a machine-readable format such as JSON, and native-language comments included in the DDL descriptions)

In this case, we cannot definitively say that a model that is good at commonsense reasoning will also be good at parsing all these different kinds of data. Therefore, as part of our evaluation process, we may need to test separately whether including or excluding this data helps or hurts our model.
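To make this concrete, here is a hypothetical sketch of what a single text2SQL evaluation example might look like once all three aspects are gathered together; the table, columns, comments and reference query are invented for illustration.

example = {
    # The user's question in their native language (Mandarin here)
    "question_zh": "上个月每个地区的总销售额是多少?",  # "What were the total sales per region last month?"
    # Table metadata: machine-readable DDL with native-language comments
    "table_metadata": """
        CREATE TABLE sales (
            region VARCHAR,   -- 地区 (region)
            amount DECIMAL,   -- 销售额 (sales amount)
            sale_date DATE    -- 销售日期 (date of sale)
        );
    """,
    # A reference SQL query (the aspect closest to English)
    "reference_sql": "SELECT region, SUM(amount) AS total_sales FROM sales "
                     "WHERE sale_date >= date_trunc('month', CURRENT_DATE) - INTERVAL '1 month' "
                     "AND sale_date < date_trunc('month', CURRENT_DATE) "
                     "GROUP BY region;",
}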

The key takeaway: start by being clear about your domain, your use case, and the use-case-specific data a model needs to understand. Sometimes, these three things can be quite different from the data used to benchmark the model in the research literature.

2. Define how we want to evaluate our model

Defining an evaluation protocol can range from straightforward to complex depending on our goals. Here we will start with the simple case and work towards more complex ones.

  • Use established methods

Sometimes, the metrics that are already widely used in the research literature are enough. If we want to improve the translation capability of our application, a recognised machine translation metric such as SacreBLEU may be all we need.

Then, after establishing this metric, we can focus on varying certain dimensions such as the source and target language, the prompts used, and the model size. Indeed, Bawden et al. investigate the effect of these variables on model performance when they report the multilingual capabilities of BLOOM. The paper provides a good template to apply to our own data.
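As a minimal illustration of how this looks in code, the sacrebleu package computes the metric directly from lists of hypotheses and references; the sentences below are placeholders, not data from the BLOOM paper.

import sacrebleu

# Model translations (hypotheses) and their reference translations (placeholder sentences)
hypotheses = ["The cat sat on the mat.", "He went to the market yesterday."]
references = [["The cat is sitting on the mat.", "He went to the market yesterday."]]

# corpus_bleu expects a list of hypotheses and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"SacreBLEU score: {bleu.score:.2f}")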

  • Combine established and customised methods

In this paper, Liu et al. report on the ability of ChatGPT to generate SQL statements from a natural language question expressed either in English or Mandarin Chinese.

Some of the metrics used, such as a SQL statement's validity, whether it executes, and its performance on the test set, would be applicable to most text2SQL evaluations. However, SQL statements can vary substantially in complexity, and even a statement that executes correctly may not be optimised.

Hence, in this case, our evaluation can combine established and customised protocols. As a starting point, we can create an automated test suite that checks whether our LLM can generate executable queries when given a question expressed in a certain language. Then, a human can use a rating scale to judge the query's complexity and the extent to which it is optimised for a specific SQL dialect like Postgres.
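A minimal sketch of the automated half of such a test suite, assuming the generated queries can be run against a local SQLite copy of the schema (the database path and the generated_queries list are placeholders); the complexity and optimisation ratings would then be added by a human reviewer.

import sqlite3

def is_executable(generated_sql: str, db_path: str = "test_schema.db") -> bool:
    """Return True if the generated query runs without error against a test database."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(generated_sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

# Automated check; a human later adds e.g. 1-5 ratings for complexity and optimisation
results = [
    {"question": question, "sql": sql, "executable": is_executable(sql)}
    for question, sql in generated_queries  # placeholder list of (question, generated SQL) pairs
]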

  • Use mainly customised evaluation methods

In other cases, model evaluation remains subjective. This can be the case for summarisation or for conversational LLMs. As outlined by lmsys.org, evaluating a chatbot's response involves multiple aspects such as "accuracy, helpfulness, relevance and detail". Designing a metric to capture such subjective and multifaceted dimensions of quality is not easy.

Given the difficulty of designing concrete metrics, old-fashioned, customised human evaluation may be more feasible.

In the simplest case, we can assign a binary yes/no score depending on whether the model's response is on par with a human-generated response. Then we calculate the percentage of yes's over the entire test dataset.

Alternatively, we can define several dimensions of quality (such as comprehensiveness and quality of references) and ask a human to rate the model's responses on a Likert scale.
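As a sketch of how such ratings could be aggregated once collected, assuming one row per rated response (the numbers below are placeholders, not results from this post):

import pandas as pd

# Placeholder human ratings: a binary "acceptable" flag plus 1-5 Likert scores per dimension
ratings = pd.DataFrame({
    "acceptable": [1, 0, 1, 1],          # yes/no versus a human-written response
    "comprehensiveness": [4, 2, 5, 3],   # Likert scale, 1-5
    "reference_quality": [3, 1, 4, 4],   # Likert scale, 1-5
})

print("Percentage acceptable:", 100 * ratings["acceptable"].mean())
print(ratings[["comprehensiveness", "reference_quality"]].mean())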

3. Real-world testing

Once we have defined our use case and our evaluation protocol, we are ready to do some real-world testing. Here, we will include a simple case study and code example to illustrate what such a test might look like.

3.1 Gather a representative dataset

In this case study, we want to test a range of models on general instruction-following tasks expressed in Mandarin Chinese. We also want the model to respond in Mandarin.

We will use the Chinese-LLaMA-Alpaca fine-tuning dataset since, on inspection, it contains a decent variety of instructions and questions, spanning coding, math and general knowledge.

*note: If a language is low-resource, one option is to use machine translation to translate an existing dataset, and then have the translation verified before using it.
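As a sketch, assuming the instruction data has been downloaded locally in the usual Alpaca-style JSON format of instruction/input/output records (the file name is a placeholder), we can load it and sample a manageable evaluation set:

import pandas as pd

# Placeholder file name; the records are assumed to have "instruction", "input" and "output" fields
dataset_df = pd.read_json("alpaca_data_zh.json")

# Sample roughly 100 examples for a first fine-grained human evaluation pass
eval_df = dataset_df.sample(n=100, random_state=42)
print(eval_df[["instruction", "input", "output"]].head())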

3.2 Identify potential models on HuggingFace

For a simple comparison, we choose two models from the BLOOMZ & mT0 Model Family available from HuggingFace. One is the bloomz-7b1-mt model and one is the mt0-xxl-mt model, both of which have been trained on non-English prompts and multilingual datasets.

3.3 Download the HuggingFace models and insert them into a langchain LLMChain()

import torch
from transformers import pipeline
from langchain import LLMChain, PromptTemplate
from langchain.llms import HuggingFacePipeline


def build_llm_chain(model_name):
    torch.cuda.empty_cache()

    # Text-generation pipeline for the multilingual instruction-tuned model
    instruct_pipeline = pipeline(model=model_name, model_kwargs={"device_map": "auto", "load_in_8bit": True})

    # English version of the prompt, kept for reference; the Mandarin version below is the one used
    template = """Below is an instruction and an optional input. Based on the instruction and optional input, generate a suitable response
Instruction: {instruction}
Input: {input}
Response:
"""

    # Mandarin version of the same prompt, since we want the model to respond in Mandarin
    zh_template = """以下是说明和可选输入。根据指令和可选输入,生成合适的响应
指令:{instruction}
输入: {input}
响应:
"""

    prompt = PromptTemplate(input_variables=["instruction", "input"], template=zh_template)

    hf_pipe = HuggingFacePipeline(pipeline=instruct_pipeline)
    # Set verbose=True to see the full prompt:
    return LLMChain(llm=hf_pipe, prompt=prompt, verbose=True)


llm_chain = build_llm_chain("bigscience/mt0-xxl-mt")

4. Run inference

If our evaluation dataset is large, we can use a distributed computing platform such as Spark, together with Pandas UDFs, to parallelise our inference across multiple machines.

import os
from typing import Iterator
import pandas as pd
import pyspark.sql.functions as F

def follow_instructions_udf(inputs: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    os.environ["TRANSFORMERS_CACHE"] = hugging_face_cache  # cache path defined elsewhere
    for i in inputs:
        # Run the chain row by row; LLMChain returns a dict whose "text" key holds the generated answer
        i["generated_answer"] = i.apply(lambda row: llm_chain({"instruction": row["instruction"], "input": row["input"]})["text"], axis=1)
        yield i

dataset.select(F.col("instruction"), F.col("input")).mapInPandas(follow_instructions_udf, schema="instruction string, input string, generated_answer string").show()

However, it's worth noting that in a fine-grained human evaluation setting like this, roughly 100 examples is enough to identify patterns in the model's responses and discern scenarios where the model may not give a quality response.

5. Conduct an error analysis

When doing error analysis, after a first pass through the model's predictions, one useful technique is to categorise the predictions into archetypes that we observe. For example, the model's answer may be too short compared to the original "gold" response, or the model might repeat itself endlessly (this tends to happen when text-generation models are used in an instruction-following setting).

For illustration, in the dataset we used, we can see that the model output (middle column) is much shorter than the original output.

An example comparison between a human-generated answer (rightmost column) and an answer generated by a multilingual LLM (middle column). The machine-generated answer in this case is relevant, but not nearly as comprehensive as the original answer.

Once we have tagged the responses according to pre-defined categories, it becomes easier to visualise them on a bar graph and to compare models.
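A minimal sketch of this tagging-and-plotting step; the categories and counts below are placeholders for illustration, not the actual results of this evaluation.

import pandas as pd
import matplotlib.pyplot as plt

# Placeholder tags and counts assigned during error analysis for the two models
tags = pd.DataFrame({
    "model": ["bloomz-7b1-mt"] * 3 + ["mt0-xxl-mt"] * 3,
    "category": ["acceptable", "too short", "repeats itself",
                 "acceptable", "too short", "repeats itself"],
    "count": [55, 30, 15, 48, 40, 12],
})

# Compare the distribution of response categories across the two models
tags.pivot(index="category", columns="model", values="count").plot.bar()
plt.ylabel("Number of responses")
plt.tight_layout()
plt.show()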

Interestingly, for this sample, the 7B model has a higher proportion of acceptable answers than the 13B model, which is counterintuitive, since a model with more parameters would be expected to perform better.

Multilingual Bloom 7B evaluation
Multilingual mt0-xxl-mt 13B evaluation

At this point, we can choose to evaluate more data so we know whether our results are merely an artefact of a small sample size. Alternatively, we can move on to testing other models, or consider what we might do to improve the model's responses for our use case. For instance, for an internal question-answering bot, we could store documents in a vector database for our model to reference, which might improve the quality of the generated answers.
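As a rough sketch of that last idea, using langchain's FAISS vector store and a multilingual sentence-embedding model (both choices, and the documents and question, are assumptions for illustration, not something prescribed by this post):

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline

# Placeholder internal documents; in practice this would be your own knowledge base
docs = [
    "公司的退款政策是 30 天内全额退款。",  # "Our refund policy is a full refund within 30 days."
    "产品保修期为一年。",  # "The product warranty lasts one year."
]

# Embed and index the documents for retrieval
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
db = FAISS.from_texts(docs, embeddings)

# Wrap the same mt0 model used earlier as a langchain LLM and let it answer with retrieved context
llm = HuggingFacePipeline.from_model_id(model_id="bigscience/mt0-xxl-mt", task="text2text-generation")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=db.as_retriever())
print(qa_chain.run("退款政策是什么?"))  # "What is the refund policy?"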

Conclusion

When it comes to evaluating LLMs, despite benchmarks such as the HuggingFace Open LLM Leaderboard and OpenAI Evals, ultimately a model needs to work for your specific language, use case and domain. A general leaderboard cannot offer that specificity, and this is where a step-by-step framework such as the one outlined above can help. Hopefully, it helps you choose and tune the right LLM for your own scenario.

code for this post can be found here
