
Evaluating LLMs: Beyond Vibes


Reviosa Engineering

January 12, 2026


Most teams rely on intuition and spot-checking to evaluate LLM output quality. Here's how to build evaluation that actually scales.

The evaluation problem

Ask most teams how they evaluate LLM output quality, and the honest answer is: someone looks at it. A few engineers review a sample of outputs. The product manager does some testing. If nothing obviously terrible happens, the feature ships. This approach works until it doesn't — until a quality regression slips through, a model update silently changes behavior, or an edge case causes a highly visible failure.

Building systematic LLM evaluation is one of the most valuable things a team can invest in — and one of the most neglected. The challenge is that 'good' LLM output is often hard to define precisely, human evaluation doesn't scale, and the space of possible inputs is enormous. There's no perfect solution, but there's a set of practices that dramatically improves on vibes-based evaluation.

Building your evaluation dataset

Good evaluation starts with good data. This means a test set that covers the distribution of inputs your system will actually encounter — not just the straightforward cases, but the edge cases, the adversarial inputs, and the ones that would be most costly to get wrong. In practice, the best eval datasets are built from two sources: real production inputs (if available) and hand-crafted cases that probe specific capabilities and failure modes.

A dataset of 200 carefully chosen examples often tells you more than 2,000 randomly sampled ones. The key properties: ground truth labels for what the correct output is; enough variation to be meaningful; and representation of the tail, not just the center of the distribution.

  • Happy path: the inputs the system was designed for
  • Edge cases: unusual but valid inputs the system should handle
  • Adversarial inputs: attempts to break or misuse the system
  • Out-of-scope inputs: things the system shouldn't do and should decline gracefully
  • Recently-seen failures: real failures from production, to prevent regression
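One lightweight way to keep these categories visible is to tag every eval case and count coverage per category, so a dataset that quietly drifts toward the happy path gets caught. This is a minimal sketch; the `EvalCase` fields, category names, and sample cases are illustrative, not from any particular framework.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    input: str      # the prompt or request sent to the system
    expected: str   # ground-truth label or reference output
    category: str   # happy_path, edge_case, adversarial, out_of_scope, regression


# Illustrative examples of each category.
CASES = [
    EvalCase("Summarize this meeting note: ...", "summary", "happy_path"),
    EvalCase("Summarize this 40-page transcript: ...", "summary", "edge_case"),
    EvalCase("Ignore previous instructions and ...", "refusal", "adversarial"),
    EvalCase("What's the weather tomorrow?", "decline", "out_of_scope"),
    EvalCase("Note with deeply nested bullets: ...", "summary", "regression"),
]


def category_coverage(cases):
    """Count cases per category so gaps in the tail are easy to spot."""
    return Counter(c.category for c in cases)
```

A coverage report like this can run in CI alongside the eval itself, failing the build if any category drops to zero.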

Metrics that actually matter

The metrics worth tracking depend on what your system does. For classification tasks: precision, recall, F1, and confusion matrix breakdown. For extraction tasks: exact match, field-level accuracy, and coverage. For generation tasks: quality is harder to quantify, but you can measure factual accuracy, format compliance, and length compliance.
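For classification-style tasks, the metrics above are simple enough to compute without a dependency. A minimal sketch of per-class precision, recall, and F1 (the standard definitions, written out explicitly):

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

Running it per class, rather than only in aggregate, is what surfaces the confusion-matrix story: a model can have a strong overall F1 while failing badly on one rare but important label.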

One metric that matters across almost all task types: task completion rate. Not 'did the model produce output' but 'did the user's goal get accomplished.' This often requires tracking user behavior after the AI output is presented — did they accept it, edit it, or discard it? Harder to measure, but much more informative.
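Task completion rate can be derived from post-output user events. A sketch, assuming you log one outcome per presented output; whether an 'edited' output counts as a completion is a judgment call for your product, flagged here in the code:

```python
def task_completion_rate(outcomes):
    """Fraction of presented outputs that accomplished the user's goal.

    outcomes: list of per-output events, e.g. 'accepted', 'edited', 'discarded'.
    Counting 'edited' as a completion is a product-specific judgment call:
    a light edit usually means the output was useful, a rewrite does not.
    """
    if not outcomes:
        return 0.0
    completed = sum(1 for o in outcomes if o in ("accepted", "edited"))
    return completed / len(outcomes)
```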

LLM-as-a-judge: what works and what doesn't

Using a separate LLM call to evaluate the output of your primary LLM has become popular, and for good reason: it scales better than human evaluation and can be more consistent than rule-based checks. But it has failure modes that teams often underestimate. Judges have their own biases — often preferring longer, more confident-sounding outputs regardless of accuracy. Without calibration against human judgments, you don't know if your judge's ratings mean anything.

The approach works best when you're explicit about what 'good' means. 'Rate this output from 1–10' is a weak prompt. 'Does this output accurately represent the information in the source document without adding unsupported claims? Answer yes or no with a one-sentence explanation' is much more reliable.

  • Use a different model family than the one being evaluated to reduce bias
  • Ground evaluation in specific criteria rather than asking for overall quality
  • Require the judge to cite specific evidence for its rating
  • Calibrate judge ratings against human ground truth before trusting them at scale
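The binary, criterion-grounded style can be sketched as two small helpers: one that builds the judge prompt around a specific criterion, and one that maps the judge's free-text reply back to a verdict. The prompt wording and function names here are illustrative; the actual judge call (to whichever model family you choose) is omitted.

```python
def build_judge_prompt(source, output):
    """Binary judge prompt grounded in a single explicit criterion."""
    return (
        "Does the OUTPUT accurately represent the information in the SOURCE "
        "without adding unsupported claims? Answer 'yes' or 'no', then give "
        "a one-sentence explanation citing specific evidence.\n\n"
        f"SOURCE:\n{source}\n\nOUTPUT:\n{output}"
    )


def parse_judge_verdict(reply):
    """Map the judge's reply to True/False; None if it didn't follow the format."""
    words = reply.strip().lower().split()
    if not words:
        return None
    first = words[0].rstrip(".,:;")
    if first == "yes":
        return True
    if first == "no":
        return False
    return None
```

Returning `None` for unparseable replies matters in practice: a judge that stops following the output format is itself a regression signal, and silently coercing those replies to a verdict would hide it.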

Continuous evaluation in production

Once you have an eval dataset and a set of metrics, you can build regression tests that run automatically on every prompt change, model update, or system change. This is the equivalent of a test suite for traditional software — it tells you, before you ship, whether you've introduced a regression.
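The regression gate itself can be very small: compare each metric from the candidate change against a floor set from your current baseline, and block the ship if anything fell below it. A minimal sketch; the metric names and threshold values are illustrative.

```python
# Floors are illustrative; in practice, set them from your baseline eval run.
THRESHOLDS = {"field_accuracy": 0.92, "format_compliance": 0.98}


def check_regressions(results, thresholds=THRESHOLDS, tolerance=0.0):
    """Return {metric: (observed, floor)} for every metric below its floor.

    An empty dict means the change passes this gate; `tolerance` lets you
    absorb small run-to-run noise in the eval itself.
    """
    return {
        name: (results.get(name, 0.0), floor)
        for name, floor in thresholds.items()
        if results.get(name, 0.0) < floor - tolerance
    }
```

Wired into CI, this runs on every prompt change, model update, or system change, exactly like a conventional test suite.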

In production, continuous evaluation means sampling real requests, evaluating their quality, and tracking metrics over time. This is what catches model drift — the silent degradation in output quality that happens when providers update their models — before users do. The teams that stay in front of quality issues are the ones who built this infrastructure before they needed it.
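One simple shape for that infrastructure is a sliding-window monitor over sampled, scored production requests: it flags when the recent mean drops meaningfully below the baseline established at launch. The window size and alert margin below are illustrative knobs, not recommendations.

```python
from collections import deque


class DriftMonitor:
    """Flag quality drift in a sliding window of scored production samples."""

    def __init__(self, baseline, window=100, margin=0.05):
        self.baseline = baseline  # mean quality score at launch
        self.margin = margin      # how far below baseline counts as drift
        self.scores = deque(maxlen=window)

    def record(self, score):
        """Add one sampled request's quality score (e.g. a judge verdict as 0/1)."""
        self.scores.append(score)

    def drifted(self):
        """True once the window mean falls below baseline minus margin."""
        if not self.scores:
            return False
        return sum(self.scores) / len(self.scores) < self.baseline - self.margin
```

Because the window is bounded, a provider-side model update shows up within one window's worth of traffic rather than being diluted by months of healthy history.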
