The conversation about AI has shifted from "can we build it?" to "can we afford to run it?" Enterprise LLM deployments routinely generate monthly bills that shock finance teams: $50K, $100K, or more for inference alone. Unlike traditional cloud costs, AI costs are hard to predict because they scale non-linearly with usage.
This is FinOps for the AI era. Here's how to bring your AI infrastructure costs under control.
The Cost Anatomy of LLM Deployment
API-Based (Azure OpenAI, Anthropic, etc.)
| Cost Component | Driver | Typical Range |
|---|---|---|
| Input tokens | Prompt length × volume | $1-$15 per million tokens |
| Output tokens | Response length × volume | $2-$60 per million tokens |
| Fine-tuning | Training data size × epochs | $8-$25 per million training tokens |
| Embeddings | Document volume | $0.01-$0.13 per million tokens |
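To make the table concrete, here is a rough sketch of how input and output token prices combine into a monthly bill. The function name, the workload figures, and the $3 / $15 per-million prices are illustrative assumptions chosen from inside the ranges above, not any provider's actual price list.

```python
def monthly_api_cost(requests_per_month: int,
                     input_tokens_per_request: int,
                     output_tokens_per_request: int,
                     input_price_per_million: float,
                     output_price_per_million: float) -> float:
    """Estimate monthly spend in dollars for one API-based workload."""
    input_tokens = requests_per_month * input_tokens_per_request
    output_tokens = requests_per_month * output_tokens_per_request
    input_cost = input_tokens * input_price_per_million / 1_000_000
    output_cost = output_tokens * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Hypothetical workload: 1M requests per month, 1,500 input and 300 output
# tokens per request, at assumed prices of $3 and $15 per million tokens.
estimate = monthly_api_cost(1_000_000, 1_500, 300, 3.0, 15.0)
print(f"${estimate:,.2f}")  # $9,000.00
```

Note that in this example output tokens are one fifth of the volume but half of the bill, which is why response length deserves as much attention as prompt length.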
Self-Hosted (Open Source Models)
| Cost Component | Driver | Typical Range |
|---|---|---|
| GPU compute | Model size × concurrency | $2-$30/hour per GPU |
| Storage | Model weights + data | $0.02-$0.08/GB/month |
| Networking | Data transfer | Variable |
| Engineering | Ops and maintenance | 1-2 FTEs |
The hidden cost in self-hosting is engineering time. Running LLM infrastructure requires MLOps expertise that's expensive and scarce.
Optimization Strategies
Strategy 1: Model Selection (Biggest Impact)
Not every task needs GPT-4 or Claude Opus. The single highest-impact optimisation is using the right model for each task.
| Task Type | Recommended Tier | Cost Reduction |
|---|---|---|
| Simple extraction, classification | Small model (Haiku, GPT-4o-mini) | 80-95% vs flagship |
| Summarisation, Q&A | Mid-tier (Sonnet, GPT-4o) | 50-70% vs flagship |
| Complex reasoning, creative writing | Flagship (Opus, GPT-4) | Baseline |
| Embeddings | Embedding model | 95%+ vs using LLM |
Implementation: Build a model router that selects the appropriate model based on task complexity. Start simple: route by task type, not by dynamic complexity assessment.
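A minimal sketch of routing by task type might look like the following. The task-type keys, the `Tier` enum, and the decision to send unknown task types to the flagship tier are all assumptions for illustration; a real router would map each tier to concrete model identifiers from your provider.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    MID = "mid"
    FLAGSHIP = "flagship"
    EMBEDDING = "embedding"


# Static routing table mirroring the task types in the table above.
# The keys are hypothetical labels that the calling code would supply.
ROUTES = {
    "extraction": Tier.SMALL,
    "classification": Tier.SMALL,
    "summarisation": Tier.MID,
    "qa": Tier.MID,
    "reasoning": Tier.FLAGSHIP,
    "creative_writing": Tier.FLAGSHIP,
    "embedding": Tier.EMBEDDING,
}


def route(task_type: str) -> Tier:
    """Return the tier for a task type; unknown types fall back to flagship."""
    return ROUTES.get(task_type, Tier.FLAGSHIP)


print(route("classification").value)  # small
print(route("legal_review").value)    # flagship (not in the table)
```

Falling back to the flagship tier favours quality over cost for unmapped tasks. Falling back to a cheaper tier is equally defensible if you would rather find quality gaps than cost surprises.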
Strategy 2: Prompt Optimization
Shorter prompts cost less, but the goal is more efficient prompts, not just fewer words.
- Remove redundant instructions. If the model consistently follows an instruction without being told, remove it.
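- Use structured output. A tight JSON schema with short keys can return the same information in fewer tokens than free-form prose. Measure it, because verbose keys and deep nesting can have the opposite effect.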
- Compress examples. In few-shot prompting, use the minimum number of examples needed for quality.
- Cache system prompts. Many providers offer prompt caching that cuts the cost of repeated input tokens, such as a long system prompt, by 50-90%.
Typical saving: 20-40% reduction in input token costs.
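The effect of system prompt caching can be estimated with some simple arithmetic. The sketch below assumes a simplified billing model in which the first request pays full price and every later request pays a discounted rate on the system prompt only. It ignores cache-write surcharges and cache expiry, which vary by provider, and all names and figures are illustrative.

```python
def input_cost_with_cache(system_tokens: int,
                          user_tokens: int,
                          requests: int,
                          price_per_million: float,
                          cached_discount: float) -> float:
    """Input-token cost when the system prompt is discounted after request one.

    cached_discount is the fraction taken off cached tokens (0.9 means 90% off).
    """
    if requests <= 0:
        return 0.0
    per_token = price_per_million / 1_000_000
    first_request = (system_tokens + user_tokens) * per_token
    later_request = (system_tokens * per_token * (1 - cached_discount)
                     + user_tokens * per_token)
    return first_request + (requests - 1) * later_request


# Hypothetical workload: 2,000-token system prompt, 200-token user message,
# 1,000 requests, $3 per million input tokens.
baseline = input_cost_with_cache(2_000, 200, 1_000, 3.0, cached_discount=0.0)
cached = input_cost_with_cache(2_000, 200, 1_000, 3.0, cached_discount=0.9)
print(f"without caching: ${baseline:.2f}, with caching: ${cached:.2f}")
```

The saving grows with the ratio of system prompt length to user message length, so caching pays off most for long, stable instructions paired with short queries.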
Strategy 3: Semantic Caching
Cache LLM responses for semantically similar queries. If a user asks "What's our refund policy?" and another asks "How do I get a refund?", the second query can be served from cache.
Implementation:
- Embed incoming queries using a fast embedding model
- Search the cache using vector similarity (for example, accept a match when cosine similarity exceeds 0.95, then tune the threshold against real traffic)
- Return cached response if found; call LLM if not
- Store new responses in cache with TTL
Typical saving: 30-60% reduction in API calls for customer-facing applications with repetitive queries.
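The four implementation steps above can be sketched as a small in-memory cache. This is a sketch under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model (it will not match paraphrases such as the refund example), the linear scan stands in for a vector index, and `call_llm` is a hypothetical callable supplied by the caller.

```python
import math
import time
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed=toy_embed, threshold=0.95, ttl_seconds=3600.0):
        self.embed = embed
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.entries = []  # list of (embedding, response, expires_at)

    def get(self, query, now=None):
        now = time.time() if now is None else now
        # Drop expired entries before searching.
        self.entries = [e for e in self.entries if e[2] > now]
        vec = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, response, _ in self.entries:  # linear scan; fine for a sketch
            score = cosine(vec, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, query, response, now=None):
        now = time.time() if now is None else now
        self.entries.append((self.embed(query), response, now + self.ttl_seconds))


def answer(query, cache, call_llm):
    """Serve from cache when a similar query exists; otherwise call the LLM."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    response = call_llm(query)
    cache.put(query, response)
    return response
```

The threshold is the main tuning knob: set it too low and users receive answers to questions they did not ask, set it too high and the hit rate collapses. Responses that depend on the user or on fresh data should bypass the cache entirely.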
Strategy 4: Batching and Queuing
For non-real-time workloads (document processing, report generation, data analysis), batch requests to take advantage of:
- Lower batch pricing (some providers discount batch API requests by around 50%)
- More efficient GPU utilisation (self-hosted)
- Predictable cost scheduling
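As a rough illustration, queued work can be grouped into fixed-size batches before submission, and the pricing benefit estimated up front. The helper names and the default 50% discount are assumptions. How batches are actually submitted depends on your provider's batch API and is not shown here.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group queued requests into batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final partial batch
        yield batch


def batch_cost(realtime_cost: float, batch_discount: float = 0.5) -> float:
    """Cost of the same workload at an assumed batch discount."""
    return realtime_cost * (1 - batch_discount)


documents = [f"doc-{i}" for i in range(5)]
print(list(chunked(documents, 2)))
# [['doc-0', 'doc-1'], ['doc-2', 'doc-3'], ['doc-4']]
```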
Strategy 5: Quantization (Self-Hosted)
Running quantised models (8-bit or 4-bit) reduces the GPU memory needed for weights by roughly 50-75% relative to 16-bit precision, enabling larger models on smaller GPUs or more concurrent requests on the same hardware.
Trade-off: Small quality degradation (typically 1-3% on benchmarks). For most enterprise use cases, the quality difference is imperceptible.
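The memory figures follow from simple arithmetic on bits per weight. The sketch below counts weights only, taking 16-bit precision as the baseline; it ignores the KV cache, activations, and the small overhead of quantisation metadata, so real requirements will be higher. The 70B parameter count is just an example.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: {weight_memory_gb(70, bits):.0f} GB")
# 70B model at 16-bit: 140 GB
# 70B model at 8-bit: 70 GB
# 70B model at 4-bit: 35 GB
```

Going from 16-bit to 8-bit halves the weight footprint, and 4-bit cuts it by three quarters, which is where a 50-75% range comes from.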
Strategy 6: Model Distillation
Train a smaller, cheaper model to replicate the behaviour of a larger model for specific tasks. The smaller model is fine-tuned on the larger model's outputs.
When it works: High-volume, well-defined tasks where you can generate thousands of high-quality examples from the large model.
Typical saving: 80-90% cost reduction for the distilled task, with 5-10% quality degradation.
Strategy 7: Request Routing and Load Balancing
When using multiple model providers, route requests based on:
- Cost: Route to the cheapest provider that meets the quality threshold
- Latency: Route to the fastest provider for real-time applications
- Availability: Fail over to alternative providers during outages
- Rate limits: Distribute across providers to avoid throttling
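One way to combine these four criteria is to filter out providers that are down, throttled, too slow, or below the quality bar, then pick the cheapest survivor. The `Provider` fields below are hypothetical: in practice the quality score would come from your own offline evaluations, and availability and rate-limit flags from health checks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Provider:
    name: str
    cost_per_million: float   # blended token price, assumed known
    p95_latency_ms: float     # measured by your own monitoring
    quality_score: float      # 0-1, from your own evaluation set
    available: bool = True
    rate_limited: bool = False


def pick_provider(providers: List[Provider],
                  min_quality: float,
                  max_latency_ms: Optional[float] = None) -> Optional[Provider]:
    """Cheapest healthy provider that meets the quality and latency limits."""
    candidates = [
        p for p in providers
        if p.available
        and not p.rate_limited
        and p.quality_score >= min_quality
        and (max_latency_ms is None or p.p95_latency_ms <= max_latency_ms)
    ]
    return min(candidates, key=lambda p: p.cost_per_million, default=None)


providers = [
    Provider("provider-a", 3.0, 800, 0.90),
    Provider("provider-b", 1.0, 1500, 0.85),
    Provider("provider-c", 0.5, 400, 0.60),
]
print(pick_provider(providers, min_quality=0.8).name)                       # provider-b
print(pick_provider(providers, min_quality=0.8, max_latency_ms=1000).name)  # provider-a
```

Failover falls out of the same function: marking a provider unavailable removes it from the candidate list, and the next cheapest qualifying provider is chosen on the following call.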
Cost Monitoring Framework
Metrics to Track
| Metric | Why It Matters | Target |
|---|---|---|
| Cost per query | Unit economics | Trending down |
| Cost per user per month | Business sustainability | Below revenue per user |
| Token efficiency | Prompt optimization | Improving over time |
| Cache hit rate | Caching effectiveness | Above 30% |
| Model tier distribution | Right-sizing | 70%+ on cheaper models |
| Cost by feature | Product decisions | Informed pricing |
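Most of these metrics can be derived from a per-query log. The sketch below assumes a hypothetical `QueryLog` record with a feature name, model tier, cost, and cache-hit flag, and treats any tier other than "flagship" as a cheaper model. Adapt the fields to whatever your gateway or tracing layer actually records.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueryLog:
    feature: str      # product feature that issued the query
    tier: str         # e.g. "small", "mid", "flagship"
    cost: float       # dollars billed for this query (0 on a cache hit)
    cache_hit: bool


def summarise(logs: List[QueryLog]) -> Dict[str, object]:
    """Compute headline cost metrics from per-query logs."""
    n = len(logs)
    if n == 0:
        return {}
    cost_by_feature: Dict[str, float] = {}
    for log in logs:
        cost_by_feature[log.feature] = cost_by_feature.get(log.feature, 0.0) + log.cost
    return {
        "cost_per_query": sum(log.cost for log in logs) / n,
        "cache_hit_rate": sum(log.cache_hit for log in logs) / n,
        "cheaper_tier_share": sum(log.tier != "flagship" for log in logs) / n,
        "cost_by_feature": cost_by_feature,
    }


logs = [
    QueryLog("support", "small", 0.01, False),
    QueryLog("support", "small", 0.00, True),
    QueryLog("search", "mid", 0.02, False),
    QueryLog("reports", "flagship", 0.07, False),
]
summary = summarise(logs)
print(summary["cache_hit_rate"], summary["cheaper_tier_share"])  # 0.25 0.75
```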
Alerting
Set hard alerts for:
- Daily spend exceeding 2x the trailing 7-day average
- Single query cost exceeding a defined threshold (e.g., $1)
- Monthly budget utilisation exceeding 80%
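The three alert conditions translate directly into a small check that can run on a schedule. The function name, argument names, and alert labels are illustrative assumptions. Wiring the result into paging or chat notifications is left out.

```python
from typing import List


def spend_alerts(daily_spend_history: List[float],
                 today_spend: float,
                 max_query_cost_today: float,
                 month_to_date_spend: float,
                 monthly_budget: float,
                 query_cost_threshold: float = 1.0) -> List[str]:
    """Return the names of the alert conditions that are currently firing."""
    alerts = []

    # Daily spend exceeding 2x the trailing 7-day average.
    trailing = daily_spend_history[-7:]
    if trailing:
        average = sum(trailing) / len(trailing)
        if today_spend > 2 * average:
            alerts.append("daily_spend_spike")

    # Single query cost exceeding the defined threshold.
    if max_query_cost_today > query_cost_threshold:
        alerts.append("expensive_query")

    # Monthly budget utilisation exceeding 80%.
    if monthly_budget > 0 and month_to_date_spend / monthly_budget > 0.8:
        alerts.append("budget_80_percent")

    return alerts


print(spend_alerts([100.0] * 7, today_spend=250.0, max_query_cost_today=0.5,
                   month_to_date_spend=8_500.0, monthly_budget=10_000.0))
# ['daily_spend_spike', 'budget_80_percent']
```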
Budgeting for AI Workloads
The Unit Economics Model
Calculate cost per business transaction, not cost per API call:
Cost per customer support resolution =
(Avg tokens per conversation × token price) +
(Tool calls × tool cost) +
(Embedding queries × embedding cost)
Compare this to the cost of human handling. If an AI resolution costs $0.50 and a human resolution costs $15, the ROI is clear, even at high volume.
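The formula above can be written as a small function and fed with your own numbers. Every input value in the example is a hypothetical placeholder, and the single blended token price mirrors the formula as stated. Splitting it into input and output prices would be more precise.

```python
def cost_per_resolution(avg_tokens_per_conversation: float,
                        token_price_per_million: float,
                        tool_calls: float,
                        tool_cost: float,
                        embedding_queries: float,
                        embedding_cost: float) -> float:
    """Cost of one customer support resolution, following the formula above."""
    token_cost = avg_tokens_per_conversation * token_price_per_million / 1_000_000
    return token_cost + tool_calls * tool_cost + embedding_queries * embedding_cost


# Hypothetical conversation: 20,000 tokens at a blended $10 per million,
# 3 tool calls at $0.01 each, 5 embedding queries at $0.0001 each.
ai_cost = cost_per_resolution(20_000, 10.0, 3, 0.01, 5, 0.0001)
human_cost = 15.0  # the human handling figure used in the text
print(f"AI: ${ai_cost:.4f} per resolution, human: ${human_cost:.2f}")
```

Tracking this number per feature over time shows whether optimisation work is actually improving unit economics, which a total monthly bill cannot tell you.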
Budget Allocation
| Phase | % of AI Budget on Inference | Notes |
|---|---|---|
| Experimentation | 10-20% | Most spend is on engineering |
| Pilot | 20-30% | Growing usage, optimisation beginning |
| Production | 40-60% | Scale drives cost; optimisation is critical |
| Optimised | 30-40% | Caching, routing, and distillation reduce spend |
AI cost optimisation isn't about spending less; it's about getting more value per dollar spent. The companies that master AI FinOps will scale their AI capabilities profitably while competitors struggle with unsustainable costs. If you need help optimising your AI infrastructure spend, let's talk.