Summary: If you're calculating AI costs as "tokens × price," you're seeing 20-40% of your actual spend. The real unit of accounting is the trace—a complete workflow that includes orchestration overhead, tool calls, retrieval costs, reliability loops, and failure paths. Understanding cost-per-trace at the feature level is essential for accurate margin calculations.
The Token Fallacy
Here's how most AI startups think about costs:
Monthly Cost = Total Tokens × Price per Token
Simple. Clear. But incomplete.
This model worked when AI products were single-call wrappers around GPT-3. Send a prompt, get a response, bill the customer. But modern AI products aren't single calls—they're workflows.
Your "simple" customer support agent actually:
- Embeds the user query (embedding model cost)
- Retrieves relevant context from a vector database (infra cost)
- Calls a planner model to decide what to do (hidden tokens)
- Executes a tool call to your CRM (external API cost)
- Generates a response (visible tokens)
- Runs a safety check (guardrail model cost)
- Logs everything for debugging (observability cost)
That's 7+ cost centers in a single user interaction. Token math captures maybe two of them.
This is why AI startups report "good margins" in spreadsheets while their bank accounts tell a different story. They're only measuring part of the picture.
The Real Unit: The Trace
Here's the mental model shift that separates companies that survive from those that don't:
The unit of accounting is the trace, not the LLM call.
A trace is a complete workflow execution—from user request to final response—containing multiple spans (individual operations). Each span has its own cost profile.
Trace (feature_invocation)
├── Span: Retrieve (embedding lookup, vector DB query)
├── Span: Rerank (relevance scoring model)
├── Span: Tool Call (web search)
├── Span: LLM Generation (planner/router)
├── Span: Tool Call (CRM lookup)
├── Span: LLM Generation (synthesize response)
├── Span: Guardrail Check (safety model)
└── Span: Judge/Eval (quality verification)
This is exactly how observability tools like Langfuse and Arize model AI applications—because it's how costs actually accumulate.
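A minimal sketch of this accounting in Python, with made-up span names and costs (the Trace and Span types below are illustrative, not from Langfuse or Arize):

```python
from dataclasses import dataclass

@dataclass
class Span:
    name: str        # e.g. "retrieve", "planner", "tool_call:crm"
    kind: str        # "llm", "tool", "retrieval", "guardrail"
    cost_usd: float  # fully loaded cost of this operation

@dataclass
class Trace:
    feature_name: str
    spans: list[Span]

    def cost(self) -> float:
        return sum(s.cost_usd for s in self.spans)

# Illustrative support-agent trace: the visible LLM call is a minority of the spend.
trace = Trace("support_agent", [
    Span("embed_query", "retrieval", 0.0001),
    Span("vector_search", "retrieval", 0.0004),
    Span("rerank", "llm", 0.0020),
    Span("planner", "llm", 0.0040),
    Span("crm_lookup", "tool", 0.0100),
    Span("synthesize_response", "llm", 0.0060),  # the only span users "see"
    Span("guardrail_check", "guardrail", 0.0010),
])
print(f"cost per trace: ${trace.cost():.4f}")  # ~$0.0235, vs $0.006 for the visible call
```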
When you think in traces, uncomfortable truths emerge:
- Your "cheap" Haiku-powered feature costs $0.15/trace because of the retrieval pipeline wrapped around it
- Your agent workflow costs $2.40/trace, not the $0.08 your token math suggested
- 40% of your spend is on steps users never see
What Actually Lives Inside a Trace
Let's break down every cost component in a production AI workflow.
1. LLM Generation Costs (What You're Already Tracking)
| Component | Notes |
|---|---|
| Input tokens | The prompt you send |
| Output tokens | The response you receive (usually 3-5x input pricing) |
| Cached tokens | Reused prefixes (50-90% discount, but mechanics matter) |
| Hidden tokens | Planning, routing, self-critique steps you don't see in the final output |
The hidden tokens trap: Multi-step agents often generate 5-20x more internal tokens than visible output. Planning prompts, tool-routing decisions, memory summarization, self-check passes—all billed, none visible to users.
2. Tool Costs (The Forgotten Line Item)
Modern agents are tool farms. Every external call is a cost:
| Tool Type | Cost Model |
|---|---|
| Web search (Serper, Tavily) | Per-query ($0.001-0.01) |
| Browser automation | Per-page or per-minute |
| Database queries | Per-read, varies by provider |
| CRM/ticketing APIs | Per-call or monthly + overage |
| Payment processing | Per-transaction |
| Maps/geocoding | Per-request |
If your agent calls three tools per trace, that's three line items your token math ignores.
Track: tool_name, tool_vendor_cost, tool_latency, tool_retry_count
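One way to capture those fields is a thin wrapper around every external call. The `call_tool_with_tracking` helper below is hypothetical, not from any particular SDK:

```python
import time

def call_tool_with_tracking(tool_name, vendor_cost_usd, fn, *args, max_retries=2):
    # Hypothetical wrapper: runs an external tool call and records the fields above.
    retries = 0
    start = time.monotonic()
    while True:
        try:
            result = fn(*args)
            break
        except Exception:
            retries += 1
            if retries > max_retries:
                raise
    record = {
        "tool_name": tool_name,
        "tool_vendor_cost": vendor_cost_usd * (1 + retries),  # retried calls are billed too
        "tool_latency": time.monotonic() - start,
        "tool_retry_count": retries,
    }
    return result, record
```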
3. Orchestration Overhead (Tokens You Didn't Know You Were Paying For)
Multi-step agents accumulate hidden costs:
| Hidden Cost | What It Is |
|---|---|
| Planner prompts | Agent deciding what to do next |
| Schema/JSON padding | Formatting overhead on every call |
| Memory summarization | Compressing context between steps |
| Rubric/critique steps | Agent validating its own work |
| Fallback passes | When primary model fails, backup runs |
A 5-step agent workflow doesn't cost 5x a single call. It costs 5x plus all the routing, planning, and verification steps in between—often 8-12 actual LLM calls.
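A back-of-envelope estimate of how visible steps turn into billed calls, with assumed planner, critique, and fallback rates (tune these to your own traces):

```python
# Assumed per-step overhead; substitute numbers from your own agent's traces.
visible_steps = 5
planner_calls_per_step = 1   # "what do I do next?" before each step
critique_calls = 1           # self-check pass at the end
fallback_rate = 0.05         # fraction of calls re-run on a backup model

llm_calls = visible_steps * (1 + planner_calls_per_step) + critique_calls
expected_billed_calls = llm_calls * (1 + fallback_rate)
print(f"{visible_steps} visible steps -> ~{expected_billed_calls:.1f} billed LLM calls")
# 5 visible steps -> ~11.6 billed LLM calls
```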
4. Retrieval Costs (RAG Isn't Free)
If you're using RAG, you have a parallel cost structure:
| Component | Cost Type |
|---|---|
| Embedding generation | One-time per document chunk |
| Vector database | Monthly (Pinecone: $70-500+/month) |
| Query embedding | Per-search |
| Retrieval/rerank | Model calls for relevance scoring |
| Context stuffing | Retrieved chunks become input tokens |
Real example: A mid-sized e-commerce firm saw costs jump from $5K/month in prototyping to $50K/month in staging due to unoptimized RAG queries fetching 10x more context than needed.
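A rough per-query estimate under assumed chunk counts and placeholder prices shows why context stuffing dominates:

```python
# Placeholder prices and chunk sizes; substitute your own.
chunks_retrieved = 8
tokens_per_chunk = 400
query_embed_cost = 0.00002      # one query-embedding call
rerank_cost = 0.0005            # relevance-scoring model call
input_price_per_1k = 0.003      # your generation model's input price

context_tokens = chunks_retrieved * tokens_per_chunk
context_cost = context_tokens / 1000 * input_price_per_1k  # retrieved chunks become input tokens
rag_cost_per_query = query_embed_cost + rerank_cost + context_cost
print(f"retrieval adds ~${rag_cost_per_query:.4f} per query before any output tokens")
# Fetching 10x more chunks than needed scales context_cost almost linearly.
```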
5. Caching (Mechanics, Not Just Discounts)
Caching isn't just "cheaper tokens." It has specific behaviors:
| Factor | Impact |
|---|---|
| Cache key strategy | How you structure prompts for reuse |
| Minimum thresholds | OpenAI: 1,024+ tokens to trigger |
| Cache hit rate | % of requests served from cache |
| TTL/eviction | Caches expire; cold starts cost full price |
| Platform variance | Azure OpenAI has different rules than OpenAI direct |
The insight: Stable system prompts with dynamic user content = high cache hits. Fully dynamic prompts = no cache benefit. Prompt architecture directly affects cost.
Track: cache_hit_rate, cached_prefix_tokens, cache_key_strategy
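A sketch of the blended input cost, assuming a 50% cached-token discount, the 1,024-token minimum from the table above, and an assumed 80% hit rate:

```python
def blended_input_cost(prefix_tokens, dynamic_tokens, price_per_1k,
                       cache_hit_rate=0.8, cached_discount=0.5, min_cacheable=1024):
    # Stable prefix (system prompt, tool schemas) first; dynamic user content last.
    # Discount, threshold, and hit rate are assumptions; check your provider's rules.
    cacheable = prefix_tokens if prefix_tokens >= min_cacheable else 0
    uncached = (prefix_tokens - cacheable) + dynamic_tokens
    # Cached prefix is discounted on hits, full price on misses (cold starts).
    cached = cacheable * (cache_hit_rate * (1 - cached_discount) + (1 - cache_hit_rate))
    return (uncached + cached) / 1000 * price_per_1k

# Stable 2,000-token system prompt + 300 tokens of user content:
print(blended_input_cost(2000, 300, price_per_1k=0.003))   # ~0.0045
# Fully dynamic prompt of the same total length (nothing cacheable):
print(blended_input_cost(0, 2300, price_per_1k=0.003))     # 0.0069
```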
6. Reliability Loops (The Silent Budget Killer)
Production AI adds model calls for quality assurance:
| Cost Type | What It Is |
|---|---|
| Offline evals | Regression testing on prompt/model changes |
| Online judges | LLM-as-judge verifying output quality |
| Guardrails | Safety/policy checks (often separate model calls) |
| Replay/debugging | Re-running traces to diagnose issues |
| Incident reprocessing | Fixing outputs after failures |
This is where prototype → production costs explode. A workflow that cost $0.05/trace in development suddenly costs $0.15/trace in production because you added eval, safety, and monitoring layers.
Track: eval_suite_cost, judge_model_cost, safety_model_cost, incident_cost
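One way to allocate these costs back onto production traces, with placeholder volumes and rates:

```python
# Placeholder volumes and rates; amortize reliability spend across the month's traces.
monthly_traces = 200_000
offline_eval_runs = 40            # regression suites run this month
cost_per_eval_run = 25.00         # model spend per full suite
judge_sample_rate = 0.10          # fraction of traces scored by an LLM judge
judge_cost_per_trace = 0.010
guardrail_cost_per_trace = 0.002

eval_cost_allocation = offline_eval_runs * cost_per_eval_run / monthly_traces
reliability_cost_per_trace = (eval_cost_allocation
                              + judge_sample_rate * judge_cost_per_trace
                              + guardrail_cost_per_trace)
print(f"reliability adds ~${reliability_cost_per_trace:.4f} per trace")  # ~$0.0080
```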
7. Failure Paths (You Pay for Mistakes)
Things break. You still pay:
| Failure Type | Cost |
|---|---|
| Failed requests | Tokens consumed before failure |
| Retries | Rate limits, timeouts = paying 2-3x |
| Fallback chains | Primary → secondary → tertiary model |
| Partial completions | Streaming failures mid-response |
| Validation failures | Output rejected, re-run required |
If your retry rate is 5% and your fallback rate is 2%, you're paying 7% more than your happy-path math suggests.
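The expected-cost math, with the rates above as defaults and an optional (assumed) premium for a pricier backup model:

```python
# Rates are placeholders; plug in your own retry and fallback metrics.
def expected_trace_cost(happy_path_cost, retry_rate=0.05, fallback_rate=0.02,
                        fallback_premium=1.0):
    # fallback_premium > 1.0 if the backup model is pricier than the primary.
    return happy_path_cost * (1 + retry_rate + fallback_rate * fallback_premium)

print(expected_trace_cost(0.10))                        # 0.107 -> ~7% more
print(expected_trace_cost(0.10, fallback_premium=1.5))  # ~8% more with a pricier backup
```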
The Complete Formula
Shift from token costs to trace costs:
Cost per Feature Invocation =
Σ(step_cost) // All LLM calls in the trace
+ Σ(tool_cost) // All external API calls
+ retrieval_cost // RAG pipeline overhead
+ eval_cost_allocation // Reliability loops, amortized
+ failure_cost // Retries, fallbacks, incidents
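The same formula as a function, with illustrative inputs for a single agent-workflow trace:

```python
def cost_per_trace(step_costs, tool_costs, retrieval_cost,
                   eval_cost_allocation, failure_cost):
    # All inputs are dollars for a single trace; eval costs already amortized.
    return (sum(step_costs)         # all LLM calls in the trace
            + sum(tool_costs)       # all external API calls
            + retrieval_cost        # RAG pipeline overhead
            + eval_cost_allocation  # reliability loops, amortized
            + failure_cost)         # retries, fallbacks, incidents

# Illustrative agent-workflow trace:
print(cost_per_trace(
    step_costs=[0.004, 0.006, 0.012],  # planner, router, synthesis
    tool_costs=[0.010, 0.005],         # CRM lookup, web search
    retrieval_cost=0.003,
    eval_cost_allocation=0.008,
    failure_cost=0.002,
))  # 0.05
```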
Then calculate margins at the feature level, not the product level:
| Feature | Avg Trace Cost | Revenue/Use | Margin |
|---|---|---|---|
| Quick chat | $0.02 | $0.05 | 60% |
| Doc analysis | $0.18 | $0.25 | 28% |
| Agent workflow | $1.85 | $2.00 | 7.5% |
| Research task | $4.20 | $3.00 | -40% |
That last row illustrates margin compression from feature-level cost variance. A single underwater feature, heavily used by your highest-usage customers, can erase the margin earned on other features, and it stays invisible if you only track aggregate token spend.
We've written about this pattern in Usage Variance in AI Products. The same customer segment with highest engagement often has the highest cost-to-serve.
The Attribution Problem (And How to Solve It)
Even if you track all these costs, attributing them is hard. Finance can't tie token spend to business units. Product can't see which features are underwater. Engineering doesn't know if the last deploy made costs better or worse.
The fix is trace-native attribution. Every trace needs a canonical tag set:
| Tag | Purpose |
|---|---|
| customer_id | Who to bill |
| workspace_id | Multi-tenant isolation |
| feature_name | Which product feature |
| agent_name, agent_version | Which agent, which release |
| workflow_name, workflow_version | Which workflow variant |
| model_provider, model_name, model_version | Which model |
| prompt_id, prompt_version | Which prompt template |
| environment | prod / staging / dev |
| experiment_id | A/B test variant |
| trace_id | Ties everything together |
This isn't over-engineering. It's the minimum viable tagging to answer basic questions:
- "Why did costs spike last Tuesday?" → Filter by
prompt_version, find the regression - "Which customers are underwater?" → Group by
customer_id, compare trace costs to revenue - "Is the new agent version cheaper?" → Compare
agent_versionA vs B
If you're using OpenTelemetry for tracing, these tags flow naturally. Tools like Langfuse have native support for this exact pattern.
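A minimal sketch using the OpenTelemetry Python API (assumes the opentelemetry-api package is installed and an exporter is configured; attribute keys mirror the tag table above, and the values are placeholders):

```python
from opentelemetry import trace  # requires the opentelemetry-api package

tracer = trace.get_tracer("billing-attribution-example")

def run_feature(customer_id: str, workspace_id: str) -> None:
    # Attribute keys follow the canonical tag set above; values here are illustrative.
    with tracer.start_as_current_span("feature_invocation") as span:
        span.set_attribute("customer_id", customer_id)
        span.set_attribute("workspace_id", workspace_id)
        span.set_attribute("feature_name", "doc_analysis")
        span.set_attribute("agent_name", "analyzer")
        span.set_attribute("agent_version", "v12")
        span.set_attribute("model_provider", "openai")
        span.set_attribute("model_name", "gpt-4o-mini")
        span.set_attribute("prompt_id", "doc_analysis")
        span.set_attribute("prompt_version", "v3")
        span.set_attribute("environment", "prod")
        # ...LLM calls, tool calls, and retrieval run as child spans here...
```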
Pricing Readiness Checklist for AI Products
Before you set (or change) prices, you should be able to answer these questions:
Cost Visibility
- Do you know your cost per trace for each major feature?
- Can you break down trace cost by span (LLM calls, tools, retrieval)?
- Do you track orchestration overhead separately from visible output?
- Are tool costs (external APIs) first-class line items?
- Do you measure cache hit rates and their dollar impact?
Reliability Costs
- Do you know what your eval/judge pipeline costs per trace?
- Are guardrail/safety model costs tracked?
- What's your retry rate? Fallback rate? Incident reprocessing cost?
- Can you attribute cost spikes to specific prompt or agent versions?
Attribution
- Is every trace tagged with customer, feature, and version metadata?
- Can finance see cost-per-customer without engineering help?
- Can product see cost-per-feature in near real time?
- Do deploys automatically surface cost deltas?
Margin Safety
- Do you know which features are underwater?
- Do you know which customer segments are unprofitable?
- Do you have alerts when a customer's trace costs exceed their revenue?
- Can you model the margin impact before launching a new feature?
If you answered "no" to more than two of these, there are gaps in your pricing visibility.
Summary
Token-based cost thinking reflects an earlier era of wrapper apps. Modern AI products are workflows—multi-step, tool-heavy, reliability-wrapped workflows—and they need workflow-level cost accounting.
Companies with accurate margin visibility can make informed pricing decisions. Cost-per-trace, measured at the feature level, is the foundation for sustainable AI product economics.
What Bear Billing Does
We built Bear Billing because we lived this problem. Per-customer margin visibility, feature-level cost attribution, and pricing scenario modeling—so you can see which customers are profitable, which features are underwater, and what happens to your margins before you ship pricing changes.
If any of this resonated, talk to us. We're onboarding our founding cohort now.