How I Think About LLM Testing

Large language models are transforming how we build software. But as we integrate them into critical systems, a fundamental question emerges: how do we know they’re reliable?

Traditional software testing relies on a clear contract: given input $x$ , the correct output is $y$ . If the program produces something else, it’s a bug. This paradigm has served us well for decades.

LLMs break this contract.

Why Testing LLMs is Different

Consider a code generation task. You ask an LLM to “write a function that sorts a list.” There isn’t one correct answer — there are dozens of valid implementations (quicksort, mergesort, timsort…), each with different trade-offs. Traditional testing, which compares against a single expected output, can’t handle this multiplicity.

The Oracle Problem is the most fundamental challenge in testing AI systems. An oracle is a mechanism for determining whether a test has passed or failed. For deterministic software, the oracle is straightforward: compare the actual output to the expected output. For LLMs, no such oracle exists.

This doesn’t mean we should give up on testing. It means we need different approaches.

Metamorphic testing offers a way around the oracle problem by checking whether the system behaves consistently under transformations, rather than checking for a single correct output.

Three Approaches I’ve Found Useful

1. Metamorphic Relations

A metamorphic relation is a property that should hold across multiple executions. For example:

If an LLM correctly answers “What is the capital of France?”, it should also correctly answer “Paris is the capital of which country?”
If an LLM summarizes a document in 100 words, giving it the same document with irrelevant sentences appended should not change the summary’s key points.

By checking these relations rather than exact outputs, we can detect inconsistencies without knowing the “right” answer.

\text{MR}: \forall x, x'. \ \text{transform}(x) = x' \implies \text{relation}(f(x), f(x'))

2. Differential Testing

Run the same prompt through multiple models (or multiple versions of the same model) and compare outputs. Disagreement doesn’t necessarily mean a bug, but it flags areas worth investigating.

3. Invariant Checking

Define properties that should always hold:

The output should be syntactically valid for the requested format
The output should not contradict the input
The output should be internally consistent

Looking Ahead

As LLMs become more capable, testing them becomes both more important and more difficult. The approaches I’ve outlined here are a starting point — we need better tooling, better metrics, and better theoretical frameworks.

If you’re working on these problems, I’d love to hear from you. The field of LLM testing is young, and the best ideas are probably still ahead of us.