Summary: Most AI companies route every request through expensive frontier models regardless of task complexity. A weather lookup doesn't need Claude Sonnet at $15/million output tokens when Haiku handles it at $5/million with identical quality. Intelligent model routing—matching query complexity to the cheapest capable model—delivers 40-60% cost reduction while maintaining output quality. This approach is foundational for sustainable AI economics.
The Pattern: Single-Model Architecture and Margin Compression
A common pattern: an AI startup launches with a single model powering everything. GPT-4o for customer support tickets. Claude Sonnet for intent classification. Frontier models processing "what's my account balance?"
The reasoning is understandable. One model means simpler architecture, fewer edge cases, and consistent behavior. But it also means paying $3-15 per million tokens for tasks that $0.10-$1 models handle with equivalent quality.
This compounds the usage variance we've analyzed in Understanding Per-Customer Cost Distribution. High-usage customers—those sending 50+ messages per session—already show compressed margins. Routing all requests through expensive models adds 10-30x unnecessary cost per request.
The math is clear: if 70% of your traffic is simple classification, extraction, or FAQ responses, every one of those requests sent to a frontier model represents avoidable cost.
The Math: Why Model Selection Matters
Let's look at current API pricing (November 2025) across the major providers:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|
| Gemini 2.0 Flash | $0.10 | $0.40 | High-volume, simple tasks |
| GPT-4o-mini | $0.15 | $0.60 | Cost-efficient classification |
| Claude Haiku 3.5 | $1.00 | $5.00 | Fast coding, structured output |
| GPT-4o | $2.50 | $10.00 | General multimodal tasks |
| Claude Sonnet 3.5 | $3.00 | $15.00 | Complex coding, agents |
| Claude Opus 4 | $15.00 | $75.00 | Frontier reasoning, architecture |
The spread is staggering. Gemini 2.0 Flash costs $0.40/million output tokens. Claude Opus 4 costs $75/million—a 187x difference. Even within the Anthropic family, Haiku is 3x cheaper than Sonnet and 15x cheaper than Opus.
Dr. Alexander Wissner-Gross, the Harvard fellow and AI researcher, captures the trend well:
"The cost of superintelligence is crashing... Genius is transitioning from a biological anomaly to a scalable commodity."
He's right. What was frontier capability six months ago is now available in lightweight models at a fraction of the cost. Claude Haiku 4.5 achieves 73.3% on SWE-bench Verified—performance that would have been state-of-the-art in early 2024—at one-third the cost of Sonnet.
Example: 1 Million Requests Monthly
Consider an AI assistant handling 1 million requests monthly with an average of 500 input tokens and 200 output tokens per request:
| Strategy | Monthly Cost | Savings vs. All-Sonnet |
|---|---|---|
| All Claude Sonnet 3.5 | $4,500 | — |
| All Claude Haiku 3.5 | $1,500 | 67% |
| 70% Haiku / 30% Sonnet | $2,400 | 47% |
Routing just 70% of traffic to the appropriate cheaper model saves $2,100/month—a 47% reduction. Scale that to 10 million requests and you're looking at $21,000 in monthly savings.
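A quick way to sanity-check these numbers is to compute the blended cost directly. Here's a minimal sketch using the Haiku and Sonnet prices from the table above; the request volume, token counts, and routing mix are the assumptions from this example, not universal constants:

```typescript
// Per-million-token prices from the table above (USD).
const PRICING = {
  haiku: { input: 1.0, output: 5.0 },
  sonnet: { input: 3.0, output: 15.0 },
} as const;

type ModelName = keyof typeof PRICING;

// Cost of a single request given average token counts.
function requestCost(model: ModelName, inputTokens: number, outputTokens: number): number {
  const p = PRICING[model];
  return (inputTokens / 1_000_000) * p.input + (outputTokens / 1_000_000) * p.output;
}

// Monthly cost for a routing mix, e.g. { haiku: 0.7, sonnet: 0.3 }.
function monthlyCost(
  requests: number,
  inputTokens: number,
  outputTokens: number,
  mix: Partial<Record<ModelName, number>>
): number {
  return Object.entries(mix).reduce(
    (total, [model, share]) =>
      total + requests * (share ?? 0) * requestCost(model as ModelName, inputTokens, outputTokens),
    0
  );
}

// 1M requests/month, 500 input + 200 output tokens per request.
console.log(monthlyCost(1_000_000, 500, 200, { sonnet: 1 }));               // 4500
console.log(monthlyCost(1_000_000, 500, 200, { haiku: 1 }));                // 1500
console.log(monthlyCost(1_000_000, 500, 200, { haiku: 0.7, sonnet: 0.3 })); // 2400
```

Running the same function against your own traffic distribution is the fastest way to estimate whether routing is worth building at all.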
But here's the catch: how do you know if your routing is actually working?
Without granular cost tracking per model and per customer, you're guessing. You might implement routing and assume it's saving money, only to discover that your classifier is miscategorizing complex queries and forcing expensive fallbacks.
When Cheap Models Are Good Enough
The key insight is that model capability exists on a spectrum, and most tasks don't require frontier intelligence. Here's where lightweight models genuinely match or approach expensive ones:
Task Classification & Intent Detection. Determining whether a user wants to check their balance, file a complaint, or ask a product question doesn't require deep reasoning. A classifier prompt on Haiku or GPT-4o-mini handles this with 95%+ accuracy.
Structured Data Extraction. Pulling names, dates, amounts, and entities from text is pattern matching, not reasoning. Cheap models excel here.
Format Transformation. Converting between JSON schemas, reformatting text, or standardizing output doesn't need frontier capabilities.
Routing Decisions. Deciding which downstream workflow to trigger based on input characteristics is itself a classification task.
Simple Q&A. FAQ responses, policy lookups, and straightforward knowledge retrieval don't require complex reasoning chains.
Summarization. For standard summarization tasks without nuanced requirements, smaller models perform comparably to larger ones.
Benchmark data supports this. Haiku 3.5 achieves 90% of Sonnet 3.5's performance on Augment's agentic coding evaluation. For many production workloads, that 10% gap is irrelevant—and the 3x cost difference is significant.
When You Need the Big Guns
Not every task belongs on a cheap model. Route to expensive models when you need:
Complex Reasoning Chains. Multi-step logical inference, mathematical proofs, or tasks requiring synthesis across disparate information sources.
Nuanced Writing. Content requiring specific tone, style, persuasion, or creative voice. Marketing copy, legal language, and emotional intelligence tasks.
Architectural Code Decisions. Simple code completion works on Haiku. Designing system architecture, debugging complex interactions, or refactoring large codebases demands Sonnet or Opus.
Ambiguous or Novel Situations. When the task doesn't fit established patterns and requires genuine judgment.
Safety-Critical Outputs. Medical, legal, or financial advice where errors carry significant consequences.
Multi-Step Planning. Agent workflows that require maintaining coherent plans across many tool calls and decision points.
The goal isn't minimizing model spend at all costs—it's matching capability to requirement. Underspending on complex tasks creates errors that cost more than the savings.
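One way to make these guidelines actionable is a static lookup from task category to model tier that a router can consult before doing any more expensive analysis. The category names and tier assignments below are illustrative—they summarize the two lists above, not a fixed taxonomy:

```typescript
type Tier = 'cheap' | 'standard' | 'premium';

// Illustrative mapping from task category to the cheapest tier that usually suffices.
const TASK_TIER: Record<string, Tier> = {
  'intent-classification': 'cheap',
  'entity-extraction': 'cheap',
  'format-transformation': 'cheap',
  'faq-answer': 'cheap',
  'summarization': 'cheap',
  'nuanced-writing': 'premium',
  'architecture-review': 'premium',
  'multi-step-planning': 'premium',
  'safety-critical-advice': 'premium',
};

// Unknown categories fall back to a middle tier rather than guessing cheap.
function tierForTask(category: string): Tier {
  return TASK_TIER[category] ?? 'standard';
}
```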
Three Routing Approaches
Approach 1: Rule-Based Routing
The simplest implementation uses deterministic rules based on observable request characteristics.
```typescript
type ModelTier = 'cheap' | 'standard' | 'premium';
interface RoutingConfig {
inputTokenThreshold: number;
premiumKeywords: string[];
premiumFeatures: string[];
}
function routeRequest(
request: {
prompt: string;
userId: string;
feature: string;
estimatedTokens: number;
},
config: RoutingConfig
): ModelTier {
// Premium features always get the best model
if (config.premiumFeatures.includes(request.feature)) {
return 'premium';
}
// Long prompts suggest complexity
if (request.estimatedTokens > config.inputTokenThreshold) {
return 'standard';
}
// Check for complexity indicators
const promptLower = request.prompt.toLowerCase();
const hasPremiumKeyword = config.premiumKeywords.some((kw) => promptLower.includes(kw));
if (hasPremiumKeyword) return 'standard';
// Default to cheap
return 'cheap';
}
// Usage
const tier = routeRequest(
{
prompt: "What's my account balance?",
userId: 'user_123',
feature: 'support-chat',
estimatedTokens: 50,
},
{
inputTokenThreshold: 2000,
premiumKeywords: ['analyze', 'compare', 'architecture', 'strategy'],
premiumFeatures: ['code-review', 'architecture', 'legal-analysis'],
}
);
```
Pros: Fast (no API call for routing), predictable, easy to debug.
Cons: Doesn't adapt to query content, misses edge cases where simple-looking queries are actually complex.
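The router returns an abstract tier, so you still need to map each tier to a concrete model before calling a provider. A minimal sketch—the model IDs are examples and should be checked against current provider documentation:

```typescript
// Map each tier to a concrete model ID (verify current IDs against provider docs).
const TIER_MODELS: Record<ModelTier, string> = {
  cheap: 'claude-3-5-haiku-20241022',
  standard: 'claude-3-5-sonnet-20241022',
  premium: 'claude-opus-4-20250514',
};

// `tier` comes from the routeRequest call above.
const model = TIER_MODELS[tier];
```

Keeping this mapping in configuration rather than code also lets you swap models as pricing changes without touching routing logic.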
Approach 2: Classifier-Based Routing
Use a tiny model to classify query complexity, then route accordingly. The classifier cost is negligible compared to the savings from correct routing.
```typescript
import Anthropic from '@anthropic-ai/sdk';
interface ClassificationResult {
complexity: 'simple' | 'moderate' | 'complex';
confidence: number;
reasoning: string;
}
async function classifyComplexity(client: Anthropic, query: string): Promise<ClassificationResult> {
const response = await client.messages.create({
model: 'claude-3-5-haiku-20241022',
max_tokens: 150,
messages: [
{
role: 'user',
content: `Classify this query's complexity for AI processing.
Query: "${query}"
Respond in JSON:
{
"complexity": "simple" | "moderate" | "complex",
"confidence": 0.0-1.0,
"reasoning": "brief explanation"
}
Simple: factual lookups, classification, extraction, simple Q&A
Moderate: summarization, standard writing, multi-step but clear tasks
Complex: reasoning chains, nuanced writing, architectural decisions, ambiguous`,
},
],
});
const text = response.content[0].type === 'text' ? response.content[0].text : '';
return JSON.parse(text);
}
async function routeWithClassifier(client: Anthropic, query: string): Promise<string> {
const classification = await classifyComplexity(client, query);
if (classification.complexity === 'simple' && classification.confidence > 0.8) {
return 'claude-3-5-haiku-20241022';
} else if (classification.complexity === 'complex' || classification.confidence < 0.6) {
return 'claude-3-5-sonnet-20241022';
} else {
return 'claude-3-5-haiku-20241022';
}
}
```
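One caveat with the sketch above: `JSON.parse` will throw if the classifier wraps its answer in prose or a code fence. A small guard you could drop in place of the bare parse—the regex and the null fallback are assumptions, not part of the SDK:

```typescript
// Extract the first {...} block from the reply; return null rather than throwing.
function parseClassification(text: string): ClassificationResult | null {
  const match = text.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    return JSON.parse(match[0]) as ClassificationResult;
  } catch {
    return null;
  }
}
```

In routeWithClassifier, a null result can simply route to the expensive model: when you can't classify a query, paying for quality is safer than guessing cheap.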
Pros: Adapts to query content, catches complex queries that look simple.
Cons: Adds latency for an extra model call, the classifier can be wrong, and the classifier prompt needs ongoing maintenance.
Cost of classification: At Haiku pricing ($1/$5 per million), classifying 1 million requests with ~100 input tokens and ~50 output tokens costs approximately $350. If correct routing saves $2,000+/month, the ROI is obvious.
Approach 3: Embedding Similarity Routing
Compare query embeddings to known examples of simple vs. complex queries. No real-time classification call required after initial setup.
```typescript
interface EmbeddingExample {
text: string;
embedding: number[];
complexity: 'simple' | 'complex';
}
class EmbeddingRouter {
private examples: EmbeddingExample[] = [];
constructor(
private embedModel: (text: string) => Promise<number[]>,
private similarityThreshold: number = 0.85
) {}
async addExample(text: string, complexity: 'simple' | 'complex') {
const embedding = await this.embedModel(text);
this.examples.push({ text, embedding, complexity });
}
private cosineSimilarity(a: number[], b: number[]): number {
const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
return dot / (magA * magB);
}
async route(query: string): Promise<'cheap' | 'expensive'> {
const queryEmbedding = await this.embedModel(query);
let simpleScore = 0;
let complexScore = 0;
for (const example of this.examples) {
const similarity = this.cosineSimilarity(queryEmbedding, example.embedding);
if (similarity > this.similarityThreshold) {
if (example.complexity === 'simple') simpleScore += similarity;
else complexScore += similarity;
}
}
return simpleScore >= complexScore ? 'cheap' : 'expensive';
}
}
// Setup with examples
const router = new EmbeddingRouter(yourEmbeddingFunction);
await router.addExample("What's my account balance?", 'simple');
await router.addExample('List my recent transactions', 'simple');
await router.addExample('Analyze my spending patterns and suggest a budget', 'complex');
await router.addExample('Compare these two investment strategies', 'complex');
```
Pros: No per-request classification cost, easy to update by adding examples, fast inference.
Cons: Requires embedding infrastructure, needs curated example set, may miss novel query types.
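The router takes the embedding function as a dependency and leaves it abstract (yourEmbeddingFunction above). One way to supply it—a minimal sketch assuming the openai npm package and the text-embedding-3-small model; any embedding provider works the same way:

```typescript
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Returns a dense vector for the given text; plug this in as the router's embedModel.
async function yourEmbeddingFunction(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return response.data[0].embedding;
}
```

Since the example set is fixed, embed it once at startup and cache the vectors; only the incoming query needs an embedding call per request.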
Fallback Strategies: When Cheap Models Fail
Routing isn't fire-and-forget. You need mechanisms to catch and recover from routing failures.
```typescript
interface FallbackConfig {
  escalationChain: string[];
}
async function executeWithFallback(
client: Anthropic,
prompt: string,
config: FallbackConfig
): Promise<{ content: string; model: string }> {
for (let i = 0; i < config.escalationChain.length; i++) {
const model = config.escalationChain[i];
try {
const response = await client.messages.create({
model,
max_tokens: 1000,
messages: [{ role: 'user', content: prompt }],
});
const content = response.content[0].type === 'text' ? response.content[0].text : '';
// Check for low-confidence indicators
const lowConfidenceSignals = [
"I'm not sure",
"I don't have enough information",
'This is unclear',
'I cannot determine',
];
const seemsUncertain = lowConfidenceSignals.some((signal) =>
content.toLowerCase().includes(signal.toLowerCase())
);
if (seemsUncertain && i < config.escalationChain.length - 1) {
console.log(`Model ${model} expressed uncertainty, escalating...`);
continue;
}
return { content, model };
} catch (error) {
if (i === config.escalationChain.length - 1) throw error;
console.log(`Model ${model} failed, trying next...`);
}
}
throw new Error('All models in escalation chain failed');
}
// Usage
const result = await executeWithFallback(client, userQuery, {
escalationChain: [
'claude-3-5-haiku-20241022',
'claude-3-5-sonnet-20241022',
'claude-3-opus-20240229',
],
});
```
Key fallback patterns:
- Confidence thresholds: Escalate when the model expresses uncertainty
- Output validation: Check response quality and retry with a better model if needed (see the sketch after this list)
- User feedback loops: Learn from corrections to improve routing over time
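Confidence phrases are a blunt instrument. When a feature expects structured output, validating the shape of the response is a more reliable escalation signal. Here's a minimal validator that could replace the seemsUncertain check above—the length floor and the JSON requirement are assumptions you'd tune per feature:

```typescript
// True when the response is long enough and parses as JSON; false triggers escalation.
function isAcceptable(content: string, minLength = 20): boolean {
  if (content.trim().length < minLength) return false;
  try {
    JSON.parse(content);
    return true;
  } catch {
    return false;
  }
}
```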
The Missing Piece: You Can't Optimize What You Can't Measure
Here's where most routing implementations fail: they don't measure whether routing is actually working.
You implement sophisticated routing logic, deploy it to production, and... hope for the best. Maybe you check your total OpenAI bill next month and see it went down. Or maybe it went up because your classifier is miscategorizing queries and triggering expensive fallbacks.
Without granular cost tracking, routing decisions remain unvalidated.
What You Need to Track
| Metric | What It Tells You |
|---|---|
| Cost per request by model | Whether routing decisions are actually saving money |
| Fallback rate | How often your router misjudges complexity |
| Quality scores by route | Whether cheap models are degrading user experience |
| Cost per customer | Which customers show negative margin even with routing |
| Routing accuracy over time | Whether your classifier is drifting |
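None of this requires exotic tooling to get started: emit one record per request with enough fields to answer the questions in the table. A minimal shape and cost calculation—the field names are illustrative, not any particular vendor's schema, and the price table must be kept in sync with your providers:

```typescript
interface RequestCostEvent {
  requestId: string;
  customerId: string;
  feature: string;
  routedTier: 'cheap' | 'standard' | 'premium';
  model: string;
  inputTokens: number;
  outputTokens: number;
  fellBack: boolean; // did a cheaper model escalate to a pricier one?
  costUsd: number;
  timestamp: string;
}

// Per-million-token prices; keep in sync with current provider pricing.
const PRICE_PER_MTOK: Record<string, { input: number; output: number }> = {
  'claude-3-5-haiku-20241022': { input: 1.0, output: 5.0 },
  'claude-3-5-sonnet-20241022': { input: 3.0, output: 15.0 },
};

function costUsd(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICE_PER_MTOK[model];
  if (!p) throw new Error(`No pricing configured for ${model}`);
  return (inputTokens / 1_000_000) * p.input + (outputTokens / 1_000_000) * p.output;
}
```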
A/B testing routing strategies: Run 10% of traffic through a new routing algorithm before full rollout. Compare cost and quality metrics. Be especially vigilant for edge cases where aggressive routing to cheap models creates user-facing quality degradation. A sketch of one way to split that traffic follows.
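A deterministic way to carve out that 10% is to bucket by a stable key such as customer ID, so the same customer always sees the same routing strategy and cost comparisons stay clean. A minimal sketch—the rolling hash is a simple illustrative choice, not a recommendation:

```typescript
// Stable bucket in [0, 100) derived from the customer ID.
function bucket(customerId: string): number {
  let hash = 0;
  for (const ch of customerId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // unsigned 32-bit rolling hash
  }
  return hash % 100;
}

// Send 10% of customers through the experimental router; everyone else keeps the current one.
function chooseRouter(customerId: string): 'current' | 'experimental' {
  return bucket(customerId) < 10 ? 'experimental' : 'current';
}
```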
When routing isn't worth it: If your traffic is uniformly complex (all architecture reviews, all legal analysis), routing overhead won't pay off. Similarly, at low volume (under 10,000 requests/month), the engineering investment in sophisticated routing may exceed the savings.
This Is What Bear Billing Does
We built Bear Billing because spreadsheets don't cut it for AI cost tracking—and that's doubly true when you're implementing routing strategies.
What we track automatically:
- Cost per request by model - See exactly which routes are saving money and which aren't
- Per-customer cost attribution - Identify which customers show negative margin even after routing optimization
- Model usage distribution - Understand your actual routing split vs. intended split
- Fallback tracking - Know when cheap models are failing and triggering escalation
- Margin impact analysis - Connect routing decisions to bottom-line profitability
Without this visibility, routing is guesswork. With it, you can continuously optimize your routing logic based on real data.
Real-World Savings: What to Expect
Based on typical traffic distributions:
| Routing Strategy | Expected Savings | Requirements |
|---|---|---|
| Conservative (50% to cheaper) | 25-35% | Rule-based routing, minimal tuning |
| Moderate (70% to cheaper) | 40-50% | Classifier-based, fallback mechanisms |
| Aggressive (85% to cheaper) | 55-65% | Embedding routing, quality monitoring |
The aggressive numbers require excellent routing accuracy. Each misrouted complex query creates either quality degradation or a fallback that partially erases savings.
Dr. Wissner-Gross poses an interesting question about where this all leads: will humanity need "to tile the Earth's surface with computronium AI data centers," or will algorithmic advances make such expansion look "laughable and quaint, in twenty years"?
The rapid cost deflation we're seeing suggests the latter path. But even with falling costs, the spread between model tiers remains significant. Routing will stay valuable even as absolute costs decline—because there will always be a 10-50x spread between the cheapest model that can handle a task and the frontier option.
How This Connects to Margin Management
Model routing is one component of margin management for AI products, but it works in conjunction with other factors.
Usage-based pricing aligns your revenue model with actual costs—if you're routing expensive queries to expensive models, pricing should reflect that.
Spending controls provide visibility into high-usage customers and unexpected traffic patterns.
Cost visibility shows where costs are distributed, enabling informed routing decisions.
Model routing without cost visibility lacks validation. Cost visibility without routing optimization provides data without action. They work together.
See our analysis of the real cost of running an AI product—API costs are only 30-40% of your total bill. Infrastructure, monitoring, and failed requests make up the remaining 60-70%.
Next Steps
Start simple and iterate:
- Identify your highest-volume, lowest-complexity endpoint. This is your first routing target. What percentage of requests are simple classifications, FAQs, or structured extraction?
- Implement rule-based routing first. Get wins on the obvious cases before investing in classifier infrastructure.
- Measure before optimizing further. Track cost per request, fallback rates, and quality metrics. Let data drive the next iteration.
- Add classifier routing for the gray zone. Once rule-based routing handles the obvious cases, use a classifier for ambiguous queries.
- Build feedback loops. Let routing decisions inform future improvements through quality monitoring and user feedback.
Key Takeaways
- The pricing spread between models is 10-50x. Using Sonnet for tasks Haiku handles with equivalent quality represents avoidable cost.
- 70% of typical AI traffic doesn't need frontier models. Classification, extraction, simple Q&A, and format transformation work fine on cheap models.
- Start with rule-based routing. Route by feature, user tier, and input length before investing in classifiers.
- Classifier-based routing pays for itself. The cost of classifying with Haiku (~$350/million requests) is far less than the savings from correct routing ($2,000+/million requests).
- Build fallback mechanisms. Confidence thresholds and escalation chains catch routing failures before they become user-facing quality issues.
- You can't optimize what you can't measure. Without per-model, per-customer cost tracking, routing is guesswork. Use Bear Billing to see exactly where your AI spend is going and whether your routing decisions are paying off.
Track your model costs and routing effectiveness with Bear Billing. We give you the visibility to know exactly where your AI spend is going—and whether your optimization strategies are actually working.
Join the early access program →
Related reading: Usage Variance in AI Products | Real Cost of AI Products in 2025 | AI API Costs 2025