Summary: If you're calculating AI costs as "tokens × price," you're seeing 20-40% of your actual spend. The real unit of accounting is the trace—a complete workflow that includes orchestration overhead, tool calls, retrieval costs, reliability loops, and failure paths. Understanding cost-per-trace at the feature level is essential for accurate margin calculations.
The Token Fallacy
Here's how most AI startups think about costs:
Monthly Cost = Total Tokens × Price per Token
Simple. Clear. But incomplete.
This model worked when AI products were single-call wrappers around GPT-3. Send a prompt, get a response, bill the customer. But modern AI products aren't single calls—they're workflows.
Your "simple" customer support agent actually:
- Embeds the user query (embedding model cost)
- Retrieves relevant context from a vector database (infra cost)
- Calls a planner model to decide what to do (hidden tokens)
- Executes a tool call to your CRM (external API cost)
- Generates a response (visible tokens)
- Runs a safety check (guardrail model cost)
- Logs everything for debugging (observability cost)
That's 7+ cost centers in a single user interaction. Token math captures maybe two of them.
This is why AI startups report "good margins" in spreadsheets while their bank accounts tell a different story. They're only measuring part of the picture.
The Real Unit: The Trace
Here's the mental model shift that separates companies that survive from those that don't:
The unit of accounting is the trace, not the LLM call.
A trace is a complete workflow execution—from user request to final response—containing multiple spans (individual operations). Each span has its own cost profile.
Trace (feature_invocation)
├── Span: Retrieve (embedding lookup, vector DB query)
├── Span: Rerank (relevance scoring model)
├── Span: Tool Call (web search)
├── Span: LLM Generation (planner/router)
├── Span: Tool Call (CRM lookup)
├── Span: LLM Generation (synthesize response)
├── Span: Guardrail Check (safety model)
└── Span: Judge/Eval (quality verification)
This is exactly how observability tools like Langfuse and Arize model AI applications—because it's how costs actually accumulate.
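A minimal sketch of this accounting in Python, with made-up span names and costs (the Trace and Span types below are illustrative, not from Langfuse or Arize):

```python
from dataclasses import dataclass

@dataclass
class Span:
    name: str        # e.g. "retrieve", "planner", "tool_call:crm"
    kind: str        # "llm", "tool", "retrieval", "guardrail"
    cost_usd: float  # fully loaded cost of this operation

@dataclass
class Trace:
    feature_name: str
    spans: list[Span]

    def cost(self) -> float:
        return sum(s.cost_usd for s in self.spans)

# Illustrative support-agent trace: the visible LLM call is a minority of the spend.
trace = Trace("support_agent", [
    Span("embed_query", "retrieval", 0.0001),
    Span("vector_search", "retrieval", 0.0004),
    Span("rerank", "llm", 0.0020),
    Span("planner", "llm", 0.0040),
    Span("crm_lookup", "tool", 0.0100),
    Span("synthesize_response", "llm", 0.0060),  # the only span users "see"
    Span("guardrail_check", "guardrail", 0.0010),
])
print(f"cost per trace: ${trace.cost():.4f}")  # ~$0.0235, vs $0.006 for the visible call
```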
When you think in traces, uncomfortable truths emerge:
- Your "cheap" Haiku-powered feature costs $0.15/trace because of the retrieval pipeline wrapped around it
- Your agent workflow costs $2.40/trace, not the $0.08 your token math suggested
- 40% of your spend is on steps users never see
What Actually Lives Inside a Trace
Let's break down every cost component in a production AI workflow.
1. LLM Generation Costs (What You're Already Tracking)
| Component | Notes |
|---|---|
| Input tokens | The prompt you send |
| Output tokens | The response you receive (usually 3-5x input pricing) |
| Cached tokens | Reused prefixes (50-90% discount, but mechanics matter) |
| Hidden tokens | Planning, routing, self-critique steps you don't see in the final output |
The hidden tokens trap: Multi-step agents often generate 5-20x more internal tokens than visible output. Planning prompts, tool-routing decisions, memory summarization, self-check passes—all billed, none visible to users.
2. Tool Costs (The Forgotten Line Item)
Modern agents are tool farms. Every external call is a cost:
| Tool Type | Cost Model |
|---|---|
| Web search (Serper, Tavily) | Per-query ($0.001-0.01) |
| Browser automation | Per-page or per-minute |
| Database queries | Per-read, varies by provider |
| CRM/ticketing APIs | Per-call or monthly + overage |
| Payment processing | Per-transaction |
| Maps/geocoding | Per-request |
If your agent calls three tools per trace, that's three line items your token math ignores.
Track: tool_name, tool_vendor_cost, tool_latency, tool_retry_count
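One way to capture those fields is a thin wrapper around every external call. The `call_tool_with_tracking` helper below is hypothetical, not from any particular SDK:

```python
import time

def call_tool_with_tracking(tool_name, vendor_cost_usd, fn, *args, max_retries=2):
    # Hypothetical wrapper: runs an external tool call and records the fields above.
    retries = 0
    start = time.monotonic()
    while True:
        try:
            result = fn(*args)
            break
        except Exception:
            retries += 1
            if retries > max_retries:
                raise
    record = {
        "tool_name": tool_name,
        "tool_vendor_cost": vendor_cost_usd * (1 + retries),  # retried calls are billed too
        "tool_latency": time.monotonic() - start,
        "tool_retry_count": retries,
    }
    return result, record
```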
3. Orchestration Overhead (Tokens You Didn't Know You Were Paying For)
Multi-step agents accumulate hidden costs:
| Hidden Cost | What It Is |
|---|---|
| Planner prompts | Agent deciding what to do next |
| Schema/JSON padding | Formatting overhead on every call |
| Memory summarization | Compressing context between steps |
| Rubric/critique steps | Agent validating its own work |
| Fallback passes | When primary model fails, backup runs |
A 5-step agent workflow doesn't cost 5x a single call. It costs 5x plus all the routing, planning, and verification steps in between—often 8-12 actual LLM calls.
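A back-of-envelope estimate of how visible steps turn into billed calls, with assumed planner, critique, and fallback rates (tune these to your own traces):

```python
# Assumed per-step overhead; substitute numbers from your own agent's traces.
visible_steps = 5
planner_calls_per_step = 1   # "what do I do next?" before each step
critique_calls = 1           # self-check pass at the end
fallback_rate = 0.05         # fraction of calls re-run on a backup model

llm_calls = visible_steps * (1 + planner_calls_per_step) + critique_calls
expected_billed_calls = llm_calls * (1 + fallback_rate)
print(f"{visible_steps} visible steps -> ~{expected_billed_calls:.1f} billed LLM calls")
# 5 visible steps -> ~11.6 billed LLM calls
```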
4. Retrieval Costs (RAG Isn't Free)
If you're using RAG, you have a parallel cost structure:
| Component | Cost Type |
|---|---|
| Embedding generation | One-time per document chunk |
| Vector database | Monthly (Pinecone: $70-500+/month) |
| Query embedding | Per-search |
| Retrieval/rerank | Model calls for relevance scoring |
| Context stuffing | Retrieved chunks become input tokens |
Real example: A mid-sized e-commerce firm saw costs jump from $5K/month in prototyping to $50K/month in staging due to unoptimized RAG queries fetching 10x more context than needed.
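A rough per-query estimate under assumed chunk counts and placeholder prices shows why context stuffing dominates:

```python
# Placeholder prices and chunk sizes; substitute your own.
chunks_retrieved = 8
tokens_per_chunk = 400
query_embed_cost = 0.00002      # one query-embedding call
rerank_cost = 0.0005            # relevance-scoring model call
input_price_per_1k = 0.003      # your generation model's input price

context_tokens = chunks_retrieved * tokens_per_chunk
context_cost = context_tokens / 1000 * input_price_per_1k  # retrieved chunks become input tokens
rag_cost_per_query = query_embed_cost + rerank_cost + context_cost
print(f"retrieval adds ~${rag_cost_per_query:.4f} per query before any output tokens")
# Fetching 10x more chunks than needed scales context_cost almost linearly.
```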
5. Caching (Mechanics, Not Just Discounts)
Caching isn't just "cheaper tokens." It has specific behaviors:
| Factor | Impact |
|---|---|
| Cache key strategy | How you structure prompts for reuse |
| Minimum thresholds | OpenAI: 1,024+ tokens to trigger |
| Cache hit rate | % of requests served from cache |
| TTL/eviction | Caches expire; cold starts cost full price |
| Platform variance | Azure OpenAI has different rules than OpenAI direct |
The insight: Stable system prompts with dynamic user content = high cache hits. Fully dynamic prompts = no cache benefit. Prompt architecture directly affects cost.
Track: cache_hit_rate, cached_prefix_tokens, cache_key_strategy
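A sketch of the blended input cost, assuming a 50% cached-token discount, the 1,024-token minimum from the table above, and an assumed 80% hit rate:

```python
def blended_input_cost(prefix_tokens, dynamic_tokens, price_per_1k,
                       cache_hit_rate=0.8, cached_discount=0.5, min_cacheable=1024):
    # Stable prefix (system prompt, tool schemas) first; dynamic user content last.
    # Discount, threshold, and hit rate are assumptions; check your provider's rules.
    cacheable = prefix_tokens if prefix_tokens >= min_cacheable else 0
    uncached = (prefix_tokens - cacheable) + dynamic_tokens
    # Cached prefix is discounted on hits, full price on misses (cold starts).
    cached = cacheable * (cache_hit_rate * (1 - cached_discount) + (1 - cache_hit_rate))
    return (uncached + cached) / 1000 * price_per_1k

# Stable 2,000-token system prompt + 300 tokens of user content:
print(blended_input_cost(2000, 300, price_per_1k=0.003))   # ~0.0045
# Fully dynamic prompt of the same total length (nothing cacheable):
print(blended_input_cost(0, 2300, price_per_1k=0.003))     # 0.0069
```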
6. Reliability Loops (The Silent Budget Killer)
Production AI adds model calls for quality assurance:
| Cost Type | What It Is |
|---|---|
| Offline evals | Regression testing on prompt/model changes |
| Online judges | LLM-as-judge verifying output quality |
| Guardrails | Safety/policy checks (often separate model calls) |
| Replay/debugging | Re-running traces to diagnose issues |
| Incident reprocessing | Fixing outputs after failures |
This is where prototype → production costs explode. A workflow that cost $0.05/trace in development suddenly costs $0.15/trace in production because you added eval, safety, and monitoring layers.
Track: eval_suite_cost, judge_model_cost, safety_model_cost, incident_cost
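One way to allocate these costs back onto production traces, with placeholder volumes and rates:

```python
# Placeholder volumes and rates; amortize reliability spend across the month's traces.
monthly_traces = 200_000
offline_eval_runs = 40            # regression suites run this month
cost_per_eval_run = 25.00         # model spend per full suite
judge_sample_rate = 0.10          # fraction of traces scored by an LLM judge
judge_cost_per_trace = 0.010
guardrail_cost_per_trace = 0.002

eval_cost_allocation = offline_eval_runs * cost_per_eval_run / monthly_traces
reliability_cost_per_trace = (eval_cost_allocation
                              + judge_sample_rate * judge_cost_per_trace
                              + guardrail_cost_per_trace)
print(f"reliability adds ~${reliability_cost_per_trace:.4f} per trace")  # ~$0.0080
```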
7. Failure Paths (You Pay for Mistakes)
Things break. You still pay:
| Failure Type | Cost |
|---|---|
| Failed requests | Tokens consumed before failure |
| Retries | Rate limits, timeouts = paying 2-3x |
| Fallback chains | Primary → secondary → tertiary model |
| Partial completions | Streaming failures mid-response |
| Validation failures | Output rejected, re-run required |
If your retry rate is 5% and your fallback rate is 2%, you're paying 7% more than your happy-path math suggests.
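The expected-cost math, with the rates above as defaults and an optional (assumed) premium for a pricier backup model:

```python
# Rates are placeholders; plug in your own retry and fallback metrics.
def expected_trace_cost(happy_path_cost, retry_rate=0.05, fallback_rate=0.02,
                        fallback_premium=1.0):
    # fallback_premium > 1.0 if the backup model is pricier than the primary.
    return happy_path_cost * (1 + retry_rate + fallback_rate * fallback_premium)

print(expected_trace_cost(0.10))                        # 0.107 -> ~7% more
print(expected_trace_cost(0.10, fallback_premium=1.5))  # ~8% more with a pricier backup
```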
The Complete Formula
Shift from token costs to trace costs:
Cost per Feature Invocation =
Σ(step_cost) // All LLM calls in the trace
+ Σ(tool_cost) // All external API calls
+ retrieval_cost // RAG pipeline overhead
+ eval_cost_allocation // Reliability loops, amortized
+ failure_cost // Retries, fallbacks, incidents
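The same formula as a function, with illustrative inputs for a single agent-workflow trace:

```python
def cost_per_trace(step_costs, tool_costs, retrieval_cost,
                   eval_cost_allocation, failure_cost):
    # All inputs are dollars for a single trace; eval costs already amortized.
    return (sum(step_costs)         # all LLM calls in the trace
            + sum(tool_costs)       # all external API calls
            + retrieval_cost        # RAG pipeline overhead
            + eval_cost_allocation  # reliability loops, amortized
            + failure_cost)         # retries, fallbacks, incidents

# Illustrative agent-workflow trace:
print(cost_per_trace(
    step_costs=[0.004, 0.006, 0.012],  # planner, router, synthesis
    tool_costs=[0.010, 0.005],         # CRM lookup, web search
    retrieval_cost=0.003,
    eval_cost_allocation=0.008,
    failure_cost=0.002,
))  # 0.05
```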
Then calculate margins at the feature level, not the product level:
| Feature | Avg Trace Cost | Revenue/Use | Margin |
|---|---|---|---|
| Quick chat | $0.02 | $0.05 | 60% |
| Doc analysis | $0.18 | $0.25 | 28% |
| Agent workflow | $1.85 | $2.00 | 7.5% |
| Research task | $4.20 | $3.00 | -40% |
That last row illustrates margin compression from feature-level cost variance. A single underwater feature, heavily used by your highest-usage customers, can erase the margin earned on other features, and it stays invisible if you only track aggregate token spend.
We've written about this pattern in Usage Variance in AI Products. The same customer segment with highest engagement often has the highest cost-to-serve.
The Attribution Problem (And How to Solve It)
Even if you track all these costs, attributing them is hard. Finance can't tie token spend to business units. Product can't see which features are underwater. Engineering doesn't know if the last deploy made costs better or worse.
The fix is trace-native attribution. Every trace needs a canonical tag set:
| Tag | Purpose |
|---|---|
| customer_id | Who to bill |
| workspace_id | Multi-tenant isolation |
| feature_name | Which product feature |
| agent_name, agent_version | Which agent, which release |
| workflow_name, workflow_version | Which workflow variant |
| model_provider, model_name, model_version | Which model |
| prompt_id, prompt_version | Which prompt template |
| environment | prod / staging / dev |
| experiment_id | A/B test variant |
| trace_id | Ties everything together |
This isn't over-engineering. It's the minimum viable tagging to answer basic questions:
- "Why did costs spike last Tuesday?" → Filter by
prompt_version, find the regression - "Which customers are underwater?" → Group by
customer_id, compare trace costs to revenue - "Is the new agent version cheaper?" → Compare
agent_versionA vs B
If you're using OpenTelemetry for tracing, these tags flow naturally. Tools like Langfuse have native support for this exact pattern.
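A minimal sketch using the OpenTelemetry Python API (assumes the opentelemetry-api package is installed and an exporter is configured; attribute keys mirror the tag table above, and the values are placeholders):

```python
from opentelemetry import trace  # requires the opentelemetry-api package

tracer = trace.get_tracer("billing-attribution-example")

def run_feature(customer_id: str, workspace_id: str) -> None:
    # Attribute keys follow the canonical tag set above; values here are illustrative.
    with tracer.start_as_current_span("feature_invocation") as span:
        span.set_attribute("customer_id", customer_id)
        span.set_attribute("workspace_id", workspace_id)
        span.set_attribute("feature_name", "doc_analysis")
        span.set_attribute("agent_name", "analyzer")
        span.set_attribute("agent_version", "v12")
        span.set_attribute("model_provider", "openai")
        span.set_attribute("model_name", "gpt-4o-mini")
        span.set_attribute("prompt_id", "doc_analysis")
        span.set_attribute("prompt_version", "v3")
        span.set_attribute("environment", "prod")
        # ...LLM calls, tool calls, and retrieval run as child spans here...
```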
Pricing Readiness Checklist for AI Products
Before you set (or change) prices, you should be able to answer these questions:
Cost Visibility
- Do you know your cost per trace for each major feature?
- Can you break down trace cost by span (LLM calls, tools, retrieval)?
- Do you track orchestration overhead separately from visible output?
- Are tool costs (external APIs) first-class line items?
- Do you measure cache hit rates and their dollar impact?
Reliability Costs
- Do you know what your eval/judge pipeline costs per trace?
- Are guardrail/safety model costs tracked?
- What's your retry rate? Fallback rate? Incident reprocessing cost?
- Can you attribute cost spikes to specific prompt or agent versions?
Attribution
- Is every trace tagged with customer, feature, and version metadata?
- Can finance see cost-per-customer without engineering help?
- Can product see cost-per-feature in near real time?
- Do deploys automatically surface cost deltas?
Margin Safety
- Do you know which features are underwater?
- Do you know which customer segments are unprofitable?
- Do you have alerts when a customer's trace costs exceed their revenue?
- Can you model the margin impact before launching a new feature?
If you answered "no" to more than two of these, there are gaps in your pricing visibility.
Summary
Token-based cost thinking reflects an earlier era of wrapper apps. Modern AI products are workflows—multi-step, tool-heavy, reliability-wrapped workflows—and they need workflow-level cost accounting.
Companies with accurate margin visibility can make informed pricing decisions. Cost-per-trace, measured at the feature level, is the foundation for sustainable AI product economics.
What Bear Billing Does
We built Bear Billing because we lived this problem. Per-customer margin visibility, feature-level cost attribution, and pricing scenario modeling—so you can see which customers are profitable, which features are underwater, and what happens to your margins before you ship pricing changes.
If any of this resonated, talk to us. We're onboarding our founding cohort now.