Reduce your inference costs by up to 70% without sacrificing quality. Techniques from caching to intelligent model routing.
The bill you didn't plan for
LLM costs have a way of sneaking up on you. The per-token pricing looks manageable in development — a few cents here, a few cents there — until you're in production and you realize that a few cents per call times tens of thousands of calls per day is a five-figure monthly invoice. And that's before accounting for the latency penalties of running everything through the largest, most expensive model.
The good news: inference costs are highly compressible. Teams that approach cost optimization systematically — rather than just swapping to a cheaper model and hoping for the best — routinely reduce spend by 40–70% with no meaningful quality degradation. The key is understanding where your tokens are actually going.
Prompt caching: the biggest lever you're not using
If your prompts have a stable system prompt or context block — and most do — you're probably paying to re-encode that text on every single request. Prompt caching, now available in Anthropic's Claude API and elsewhere, lets you cache the KV activations for a static prefix. You pay to compute it once; subsequent requests that share the prefix are dramatically cheaper and faster.
The savings here can be enormous. A system prompt that's 2,000 tokens long, sent with every request, costs the same to process as a user message — but unlike user messages, it never changes. Teams running high-volume applications can reduce costs by 50% or more just from caching their static context. The usual candidates for caching:
- Long system prompts and behavioral instructions
- Reference documents and knowledge bases appended as context
- Few-shot examples that stay consistent across requests
- Tool definitions when using function calling
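In the Claude API, caching is opt-in per content block: you mark the end of your static prefix with a `cache_control` breakpoint, and subsequent requests sharing that prefix hit the cache. A minimal sketch — `cache_control` is the real API field, but the prompt text and model id below are placeholders, not recommendations:

```python
# Static context you pay to encode once, then reuse across requests.
# In production this would be your full ~2,000-token system prompt.
STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme Corp. ..."


def build_request(user_message: str) -> dict:
    """Build messages.create kwargs with the static prefix marked cacheable."""
    return {
        "model": "claude-model-id",  # placeholder; use a real model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM_PROMPT,
                # Everything up to and including this block is cached;
                # later requests with the same prefix read from the cache.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }


# Usage (requires the anthropic SDK and an API key):
# client = anthropic.Anthropic()
# response = client.messages.create(**build_request("Where is my order?"))
```

Only the dynamic user message varies per request; the cached prefix is billed at a steep discount after the first request that writes it.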
Model routing: right model for the right task
Not every task needs a frontier model. A surprisingly large fraction of enterprise AI workloads — classification, extraction, summarization, reformatting — can be handled by smaller, faster, cheaper models without users noticing a quality difference. The economics are striking: the difference between a frontier model and a capable mid-tier model can be 10x in price.
Building a routing layer means classifying each incoming request by complexity and routing to the appropriate model tier. Simple requests go to a smaller model. Complex requests requiring nuanced reasoning or multi-step generation go to a frontier model. With good calibration, you can maintain perceived quality while cutting costs significantly.
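A routing layer doesn't have to be sophisticated to pay off. A minimal sketch, assuming a coarse task-type label and a prompt-length cutoff as the only signals — the model names, task categories, and threshold here are illustrative assumptions:

```python
# Hypothetical model tiers; substitute your actual provider model ids.
SMALL_MODEL = "small-fast-model"
FRONTIER_MODEL = "frontier-model"

# Task types that smaller models typically handle well (see above).
SIMPLE_TASKS = {"classify", "extract", "summarize", "reformat"}

# Illustrative cutoff: long prompts often signal multi-step work.
MAX_SIMPLE_PROMPT_CHARS = 4000


def route(task_type: str, prompt: str) -> str:
    """Pick a model tier from coarse request features."""
    if task_type in SIMPLE_TASKS and len(prompt) < MAX_SIMPLE_PROMPT_CHARS:
        return SMALL_MODEL
    # Nuanced reasoning or multi-step generation goes to the frontier tier.
    return FRONTIER_MODEL
```

In practice, teams often replace the hand-written rules with a cheap classifier model, and log routing decisions so miscalibrated routes can be audited against quality metrics.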
Context window management
Every token in your context window is a token you're paying for. The most common waste we see is teams passing full conversation histories, entire documents, or raw database dumps into every request — when often only a fraction of that context is relevant to the task at hand.
Techniques worth adopting: conversation history truncation (keep the last N turns and a compressed summary of earlier context); selective document chunking (retrieve and include only relevant passages rather than full documents); and output length constraints. If you need a one-sentence summary, constrain the model to produce one sentence rather than hoping it's naturally concise.
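The history-truncation technique above can be sketched in a few lines. The summary-injection format here is one possible convention, not a standard, and in a real system the summary would be generated by a cheap model rather than passed in by hand:

```python
def truncate_history(turns, keep_last=6, summary=None):
    """Keep the last N turns; stand a compressed summary in for the rest.

    turns: list of {"role": ..., "content": ...} message dicts.
    summary: optional pre-computed summary of the dropped turns.
    """
    if len(turns) <= keep_last:
        return list(turns)
    head = summary or "[Summary of earlier conversation omitted]"
    # Prepend the summary as a single message, then the recent turns.
    return [{"role": "user", "content": head}] + list(turns[-keep_last:])
```

Each truncated request then pays for `keep_last` turns plus one short summary, instead of the full transcript.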
Batching and async patterns
Many LLM workloads don't require synchronous responses. Report generation, data enrichment, document processing — these can be queued and processed in batches with significantly better economics than real-time inference. Batch APIs from major providers offer up to 50% discounts for workloads that can tolerate latency.
Designing your architecture to separate 'needs to be instant' from 'can wait a few seconds or minutes' is often the most impactful architectural change a team can make. Real-time requests get optimized for latency; batch jobs get optimized for throughput and cost.
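A minimal sketch of that separation, assuming each request carries an `interactive` flag (the flag name and the `custom_id`/`params` entry shape are illustrative of how batch APIs typically tag jobs, not a specific provider's schema):

```python
def partition(requests):
    """Split incoming work into latency-sensitive and batchable queues."""
    realtime, batch = [], []
    for req in requests:
        # Hypothetical flag: set by the caller for user-facing requests.
        (realtime if req.get("interactive") else batch).append(req)
    return realtime, batch


def to_batch_entries(batch):
    """Shape deferred work as id-tagged entries for a batch submission."""
    return [
        {"custom_id": f"job-{i}", "params": req["params"]}
        for i, req in enumerate(batch)
    ]
```

The realtime queue goes straight to synchronous inference; the batch entries get submitted to the provider's batch endpoint and collected when results land, at the discounted rate.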
What the numbers actually look like
To make this concrete: we've worked with teams whose initial production cost was $0.40 per user session. After implementing prompt caching for their static system prompt, routing low-complexity requests to a smaller model, and switching background processing to batch mode, they reached $0.09 per session. Same product, same quality perception, 78% lower cost.
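Those figures are easy to sanity-check:

```python
# Back-of-envelope check of the case study numbers above.
before, after = 0.40, 0.09  # cost per user session, USD
reduction = (before - after) / before
print(f"{reduction:.0%}")  # prints: 78%
```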
Cost optimization and quality are not in opposition. The discipline of understanding where your tokens go forces you to think more carefully about what context actually matters — which tends to improve prompt quality as a side effect. The teams that do this work end up with better products, not just cheaper ones.