How to Evaluate LLM Applications: The Complete Guide

Suchismita Sahu
9 min read · Nov 3, 2024

ChatGPT has soared in popularity over the past year thanks to the seemingly omniscient GPT-4. Its ability to generate coherent and poetic responses to previously unseen contexts has accelerated the development of other foundational large language models (LLMs), such as Anthropic’s Claude, Google’s Bard, and Meta’s open-source LLaMA model. Consequently, this has enabled ML engineers to build retrieval-based LLM applications around proprietary data like never before. But these applications continue to suffer from hallucinations, struggle to keep up to date with the latest information, and don’t always give relevant responses to prompts.

In this article, I will outline how to evaluate LLM and retrieval pipelines, different workflows you can employ for evaluation, and the common pitfalls when building RAG applications that evaluation can solve.

Evaluation is (not) Eyeballing Outputs

Before we begin, does your current approach to evaluation look something like the code snippet below? You loop through a list of prompts, run your LLM application on each one of them, wait a minute or two for it to finish executing, manually inspect everything, and try to evaluate the quality of the output based on each input.

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

# Build a vector index over the documents in ./data
service_context = ServiceContext.from_defaults(chunk_size=1000)
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=5)

def query(user_input):
    return query_engine.query(user_input).response

prompts = [...]
# Print every response for manual inspection
for prompt in prompts:
    print(query(prompt))

Evaluation as a Multi-Step, Iterative Process

Evaluation is an involved process but has huge downstream benefits as you look to iterate on your LLM application. Building an LLM system without evaluations is akin to building a distributed backend system without any automated testing — although it might work at first, you’ll end up wasting more time fixing breaking changes than building the actual thing.

To evaluate LLMs, you need several components: an evaluation dataset (that improves over time), up to a handful of evaluation metrics chosen and implemented for criteria relevant to your use case, and some evaluation infrastructure to continuously run real-time evaluations throughout the lifetime of your LLM application.

Step One — Creating an Evaluation Dataset

The first step to any successful evaluation workflow for LLM applications is to create an evaluation dataset, or at least have a vague idea of the type of inputs your application is going to get. It might sound fancy and a lot of work, but the truth is you’re probably already doing it as you’re eyeballing outputs.

Let’s consider the eyeballing example above. Correct me if I’m wrong, but what you’re really trying to do is to judge an output based on what you’re expecting. You probably already know something about the knowledge base you’re working with and are likely aware of what retrieval results you expect to see should you also choose to print out the retrieved text chunks in your retrieval pipeline. The initial evals dataset doesn’t have to be comprehensive, but start by writing down a set of QAs with the relevant context:

dataset = [
    {
        "input": "...",
        "expected_output": "...",
        # context is a list of strings that represents ideally the
        # additional context your LLM application will receive at query time
        "context": ["..."]
    },
    ...
]

Here, the “input” is mandatory, but “expected_output” and “context” are optional (you’ll see why later).

If you wish to automate things, you can try generating an evals dataset by looping through your knowledge base (which could be in a vector database like Qdrant) and asking GPT-3.5 to generate a set of QAs instead of writing them yourself. GPT-3.5 is flexible, versatile, and fast, but limited by the data it was trained on. (Ironically, you’re more likely to care about evaluation if you’re building in a domain that requires deep expertise, since such an application relies more on the retrieval pipeline than on the foundational model itself.)
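A minimal sketch of that automation follows. It assumes a hypothetical `ask_llm` callable that sends a prompt to whatever model you use (GPT-3.5 or otherwise) and returns its text reply, and it assumes the knowledge base has already been split into text chunks. The prompt wording and the JSON reply format are illustrative assumptions, not part of any library:

```python
import json
from typing import Callable

# Hypothetical prompt; the wording and the JSON reply format are assumptions.
PROMPT_TEMPLATE = (
    "Write one question that the following passage answers, together with "
    "the answer. Reply with JSON containing the keys 'question' and 'answer'.\n\n"
    "Passage:\n{chunk}"
)


def generate_dataset(chunks: list[str], ask_llm: Callable[[str], str]) -> list[dict]:
    """Build an evals dataset with one QA pair per knowledge-base chunk."""
    dataset = []
    for chunk in chunks:
        reply = ask_llm(PROMPT_TEMPLATE.format(chunk=chunk))
        try:
            qa = json.loads(reply)
            question, answer = qa["question"], qa["answer"]
        except (json.JSONDecodeError, KeyError, TypeError):
            # Skip replies that are not in the requested format.
            continue
        dataset.append(
            {"input": question, "expected_output": answer, "context": [chunk]}
        )
    return dataset
```

Generated QAs are only as good as the model that wrote them, so it is worth reviewing a sample by hand before relying on the dataset.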

Step Two — Identify Relevant Metrics for Evaluation

The next step in evaluating LLM applications is to decide on the set of metrics you want to evaluate your LLM application on. Some examples include:

  • factual consistency (how factually correct your LLM application is based on the respective context in your evals dataset)
  • answer relevancy (how relevant your LLM application’s outputs are based on the respective inputs in your evals dataset)
  • coherence (how logical and consistent your LLM application’s outputs are)
  • toxicity (whether your LLM application is outputting harmful content)
  • RAGAS (for RAG pipelines)
  • bias (pretty self-explanatory)

I’ll write about all the different types of metrics in another article, but as you can see, different metrics require different components in your evals dataset to reference against one another. Factual consistency doesn’t care about the input, and toxicity only cares about the output.
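One way to make these requirements explicit is a small lookup table of which fields each metric reads. The table below is an illustrative assumption: the metric names and their field requirements are simplified, and `actual_output` stands for the response your application produced. It is not the definition used by any particular library.

```python
# Hypothetical mapping from metric name to the fields it needs.
REQUIRED_FIELDS = {
    "factual_consistency": {"actual_output", "context"},
    "answer_relevancy": {"input", "actual_output"},
    "toxicity": {"actual_output"},
}


def applicable_metrics(data_point: dict) -> list[str]:
    """Return the metrics that can be computed for this data point."""
    # A field counts as available only if it is present and non-empty.
    available = {key for key, value in data_point.items() if value}
    return sorted(
        name for name, fields in REQUIRED_FIELDS.items() if fields <= available
    )
```

A data point without context can still be scored for answer relevancy and toxicity, which is why only the input is mandatory in the dataset.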

Step Three — Implement a Scorer to Compute Metric Scores

This step involves taking all the relevant metrics you’ve previously identified and implementing a way to compute a score for each data point in your evals dataset. Here’s an example of how you might implement a scorer for factual consistency (code taken from DeepEval):

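from scipy.special import softmax
from sentence_transformers import CrossEncoder

# Label order documented at https://huggingface.co/cross-encoder/nli-deberta-base:
# ["contradiction", "entailment", "neutral"]
model = CrossEncoder('cross-encoder/nli-deberta-v3-large')

def predict(text_a: str, text_b: str) -> float:
    # Score the pair in both directions
    scores = model.predict([(text_a, text_b), (text_b, text_a)])
    # Convert the logits of each pair into probabilities
    softmax_scores = softmax(scores, axis=1)
    score = softmax_scores[0][1]
    second_score = softmax_scores[1][1]
    return max(score, second_score)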

Here, we used a natural language inference model from Hugging Face to compute an entailment score ranging from 0–1 to measure factual consistency. It doesn’t have to be this particular implementation, but you get the point — you’ll have to decide how you want to compute a score for each metric and find a way to implement it. One thing to note is that LLM outputs are probabilistic in nature, so your implementation of the scorer should take this into account and not penalize outputs that are equally correct but different from what you expect.

We use a combination of model-based, statistical, and LLM-based scorers depending on the type of metric we’re trying to evaluate. For example, we use a model-based approach to evaluate metrics such as factual consistency (NLI models) and answer relevancy (cross-encoders), while for more nuanced metrics such as coherence, we implemented a framework called G-Eval (which applies LLMs with Chain-of-Thought) for evaluation using GPT-4. In fact, the authors of the paper found that G-Eval outperforms traditional scores such as:

  • BLEU (compares n-grams of the machine-generated text to n-grams of a reference translation and counts the number of matches)
  • BERTScore (a metric for evaluating text generation based on BERT embeddings)
  • ROUGE (a set of metrics for evaluating automatic summarization of texts as well as machine translation)
  • MoverScore (computes the distance between the contextual embeddings of words in the machine-generated text and those in a reference text)
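To make the n-gram matching behind BLEU concrete, here is a minimal sketch of clipped n-gram precision, the building block of BLEU, against a single reference. It omits the brevity penalty and the averaging over several n-gram orders that the full metric uses, and it tokenizes by whitespace for simplicity:

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams matched in the reference, with clipping."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each count so a repeated n-gram cannot be matched more often
    # than it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total
```

A paraphrase that is correct but shares few n-grams with the reference scores poorly under this kind of metric, which is the weakness that model-based and LLM-based scorers try to address.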

Lastly, you’ll need to define a passing criterion for each metric; the passing criterion is the threshold which the metric score will need to meet in order for your LLM application output to be deemed satisfactory for a given input. For example, a passing criterion for the factual consistency metric implemented above could be 0.6, since the metric outputs a score ranging from 0 to 1. (Similarly, the passing criterion might be 1 for a metric that outputs a 0 or 1 binary score.)

Step Four — Apply each Metric to your Evaluation Dataset

With everything in place, you can now loop through your evaluation dataset and evaluate each data point individually. The algorithm looks something like this:

  • Loop through your evaluation dataset.
  • For each data point, run your LLM application based on the given input.
  • Once your LLM application has finished generating an output for a given data point, compute a score for each of the metrics you’ve previously defined.
  • Identify and log failing metrics (metrics where the passing criterion wasn’t met).
  • Iterate on your LLM application based on these failing metrics.
  • Repeat the steps above until no metrics are failing.
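The loop above can be sketched as follows. Everything here is a simplified assumption: `app` stands for your LLM application, each metric is a plain function that returns a score between 0 and 1, and `thresholds` holds the passing criterion for each metric.

```python
from typing import Callable

# A metric takes (data_point, actual_output) and returns a score in [0, 1].
Metric = Callable[[dict, str], float]


def run_evaluation(
    dataset: list[dict],
    app: Callable[[str], str],
    metrics: dict[str, Metric],
    thresholds: dict[str, float],
) -> list[dict]:
    """Run the app on every data point and collect the failing metrics."""
    failures = []
    for data_point in dataset:
        actual_output = app(data_point["input"])
        for name, metric in metrics.items():
            score = metric(data_point, actual_output)
            # A metric fails when its score is below the passing criterion.
            if score < thresholds[name]:
                failures.append(
                    {"input": data_point["input"], "metric": name, "score": score}
                )
    return failures
```

An empty list of failures corresponds to the state where no metrics are failing; anything else tells you which inputs and which metrics to look at next.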

Now, you can stop eyeballing outputs and ensure that having confidence in your LLM application is as easy as having passing test cases.

Step Five — Integrate Evaluations as Unit Tests in CI/CD Pipelines

Having everything set up is great, but to take automated evaluations a step further, you can include evaluations in the form of unit tests in CI/CD pipelines such as GitHub Actions, which you can do through DeepEval, the open-source LLM evaluation framework. DeepEval, something I’ve been working on to help other developers automate eyeballing LLM outputs, offers 14+ LLM evaluation metrics to cover almost any use case you may have.

First install DeepEval:

pip install deepeval

Then, create a test file, similar to Pytest:

touch test_llm.py

Write a simple test case:

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output of your LLM application
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."]
    )
    assert_test(test_case, [answer_relevancy_metric])

Which you can execute via the CLI:

deepeval test run test_llm.py

That’s all! To take unit testing to CI/CD pipelines, simply include a test file with your test cases and execute this same command in your pipeline configuration, for example in a YAML workflow file if you’re using GitHub Actions.

Step Six — Continuous Evaluations in Production

The final step involves evaluating LLM outputs in real-time. This is vital as it allows you to be alerted of any unsatisfactory responses and iterate on them as quickly as possible.
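A minimal sketch of such a real-time check, assuming a reference-free scorer (a function that needs only the input and the output, since production traffic has no expected output) and a hypothetical `alert` callback that you would wire to your own logging or paging system:

```python
from typing import Callable


def monitored(
    app: Callable[[str], str],
    scorer: Callable[[str, str], float],
    threshold: float,
    alert: Callable[[str, str, float], None],
) -> Callable[[str], str]:
    """Wrap an LLM application so every response is scored as it is served."""

    def wrapper(user_input: str) -> str:
        output = app(user_input)
        score = scorer(user_input, output)
        if score < threshold:
            # Flag the unsatisfactory response; the user still receives it.
            alert(user_input, output, score)
        return output

    return wrapper
```

This sketch scores each response synchronously for clarity. In a real deployment you would probably run the scorer in the background so that evaluation does not add latency to the response.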

Evaluation Helps You Iterate Towards the Optimal Hyperparameters

There are several benefits of setting up an evaluation framework that would allow you to rapidly iterate and improve on your LLM application/retrieval pipeline:

  • Taking a RAG-based application as an example, you can now run several nested for loops to find the optimal combination of hyperparameters such as chunk size, top k retrieval, embedding model, and prompt template that would yield the highest metric scores for your evaluation dataset.
  • You’ll be able to make marginal improvements without worrying about unnoticed breaking changes.
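The nested loops from the first point can be written compactly with `itertools.product`. This is a sketch under assumptions: `evaluate` is a hypothetical function that rebuilds your pipeline with the given settings, runs it over the evaluation dataset, and returns an aggregate metric score, and the hyperparameter names are placeholders.

```python
from itertools import product
from typing import Any, Callable


def grid_search(
    grid: dict[str, list[Any]],
    evaluate: Callable[..., float],
) -> tuple[dict[str, Any], float]:
    """Try every combination in the grid and return the best one with its score."""
    best_config, best_score = {}, float("-inf")
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        # evaluate is called with keyword arguments such as chunk_size=512.
        score = evaluate(**config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

For example, a grid of `{"chunk_size": [256, 512, 1000], "top_k": [3, 5]}` results in six evaluation runs. The number of runs grows multiplicatively with every hyperparameter you add, so keep the grid small if each run calls an LLM.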

Evaluation is Not Bullet-Proof Though

Although your evaluation framework is now in place, it is flimsy and fragile, especially in the early days of deploying to production. This is because your users will start prompting your application in ways you’ve never expected, but that’s okay. To build a truly robust LLM application, you should:

  • Identify unsatisfactory outputs, mark them for reproducibility, and add them to your evaluation dataset. This is known as continuous evaluation and without it, you’ll find that your LLM application will slowly become out of touch with what your users care most about. There are several ways you can identify bad outputs, but the most foolproof way would be to use humans as evaluators.
  • Identify on a component level which part of your LLM pipeline is causing unsatisfactory outputs. This is known as evaluating with tracing and without it, you’ll find yourself making unnecessary changes because you “think”, for example, that the retrieval component is not retrieving the relevant text chunks when it’s actually the prompt template that’s the problem.
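The first point can be sketched as a small helper that folds flagged production outputs back into the evaluation dataset. The record layout (`input`, `expected_output`, `context`) follows the dataset from Step One; skipping inputs that are already present is an assumption made here for illustration. The expected output is left empty on purpose, since a human still has to write the correct answer.

```python
def add_failures(dataset: list[dict], flagged: list[dict]) -> int:
    """Append flagged production cases to the dataset, skipping known inputs."""
    known_inputs = {data_point["input"] for data_point in dataset}
    added = 0
    for case in flagged:
        if case["input"] in known_inputs:
            continue
        dataset.append(
            {
                "input": case["input"],
                # To be filled in by a reviewer who knows the right answer.
                "expected_output": None,
                "context": case.get("context", []),
            }
        )
        known_inputs.add(case["input"])
        added += 1
    return added
```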

Other Approaches to Evaluation

Another way to evaluate LLM applications could be an auto-evaluation approach where LLMs are used as judges for picking the best output when presented with several different choices. In fact, data from Databricks claims that LLM-as-a-judge agrees with human grading on over 80% of judgments. There are several points to note when using LLM-as-a-judge:

  • GPT-3.5 works, but only if you provide an example.
  • GPT-4 works well even without an example.
  • Use coarse grading scales like 1–5 or a binary scale, instead of going for something like 1–100, so that the judge’s grades stay consistent.

A possible approach to auto-evaluation is to:

  • Generate outputs on all different combinations of hyperparameters.
  • Ask GPT-4 to compare and pick the best set of outputs in a pairwise fashion.
  • Identify the set of hyperparameters for the best set of outputs GPT-4 has chosen.
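A sketch of this procedure, assuming a hypothetical `judge` callable that wraps the GPT-4 call and returns 0 if it prefers the first set of outputs and 1 if it prefers the second. Each pair of configurations is compared once, and the configuration with the most wins is returned:

```python
from collections import Counter
from itertools import combinations
from typing import Callable


def pick_best(
    outputs: dict[str, list[str]],
    judge: Callable[[list[str], list[str]], int],
) -> str:
    """Return the configuration whose outputs win the most pairwise comparisons.

    Assumes `outputs` contains at least one configuration.
    """
    wins = Counter({name: 0 for name in outputs})
    for first, second in combinations(outputs, 2):
        preferred = judge(outputs[first], outputs[second])
        wins[second if preferred == 1 else first] += 1
    # Ties are resolved in favor of the configuration listed first.
    return wins.most_common(1)[0][0]
```

Here the keys of `outputs` identify a hyperparameter combination and the values hold the outputs it produced on the evaluation inputs. The number of judge calls grows quadratically with the number of combinations.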

A problem I have with this approach, and why we haven’t implemented a way to do this at Confident AI, is that it leaves nothing actionable for subsequent iteration and improvement.

Conclusion

Evaluating LLM pipelines is essential to building robust applications, but evaluation is an involved and continuous process that requires a lot of work. If all you need is a quick, informal evaluation, print statements are a great choice. For anything longer-lived, DeepEval lets you write an evaluation as a unit test:

from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

def test_hallucination():
    metric = HallucinationMetric(threshold=0.5)
    # HallucinationMetric checks the output against the supplied context
    test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
    assert_test(test_case, [metric])

DeepEval also comes with a platform that allows you to log and debug historical evaluation results, centralize evaluation datasets, and run real-time evaluations in production.

Written by Suchismita Sahu

Working as a Technical Product Manager at Jumio corporation, India. Passionate about Technology, Business and System Design.
