
Unit Economics for AI Products: A Complete Cost Framework Beyond Tokens

Most AI startups track costs incompletely. Tokens are not your unit—traces are. Learn the complete cost model for AI products, from orchestration overhead to reliability loops, and how to calculate per-customer margins that reflect reality.


Blaise Albuquerque

Founder, Bear Billing

#unit-economics #ai-margins #cost-tracking #observability #pricing-strategy

Summary: If you're calculating AI costs as "tokens × price," you're seeing 20-40% of your actual spend. The real unit of accounting is the trace—a complete workflow that includes orchestration overhead, tool calls, retrieval costs, reliability loops, and failure paths. Understanding cost-per-trace at the feature level is essential for accurate margin calculations.


The Token Fallacy

Here's how most AI startups think about costs:

Monthly Cost = Total Tokens × Price per Token

Simple. Clear. But incomplete.

This model worked when AI products were single-call wrappers around GPT-3. Send a prompt, get a response, bill the customer. But modern AI products aren't single calls—they're workflows.

Your "simple" customer support agent actually:

  • Embeds the user query (embedding model cost)
  • Retrieves relevant context from a vector database (infra cost)
  • Calls a planner model to decide what to do (hidden tokens)
  • Executes a tool call to your CRM (external API cost)
  • Generates a response (visible tokens)
  • Runs a safety check (guardrail model cost)
  • Logs everything for debugging (observability cost)

That's 7+ cost centers in a single user interaction. Token math captures maybe two of them.

This is why AI startups report "good margins" in spreadsheets while their bank accounts tell a different story. They're only measuring part of the picture.


The Real Unit: The Trace

Here's the mental model shift that separates companies that survive from those that don't:

The unit of accounting is the trace, not the LLM call.

A trace is a complete workflow execution—from user request to final response—containing multiple spans (individual operations). Each span has its own cost profile.

Trace (feature_invocation)
├── Span: Retrieve (embedding lookup, vector DB query)
├── Span: Rerank (relevance scoring model)
├── Span: Tool Call (web search)
├── Span: LLM Generation (planner/router)
├── Span: Tool Call (CRM lookup)
├── Span: LLM Generation (synthesize response)
├── Span: Guardrail Check (safety model)
└── Span: Judge/Eval (quality verification)

This is exactly how observability tools like Langfuse and Arize model AI applications—because it's how costs actually accumulate.
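
To make the model concrete, here is a minimal sketch of a trace as a cost container. The Trace and Span classes, field names, and dollar figures are illustrative, not the schema of Langfuse, Arize, or any other tool:

from dataclasses import dataclass, field

@dataclass
class Span:
    """One operation inside a trace: an LLM call, tool call, retrieval step, eval, etc."""
    name: str        # e.g. "retrieve", "planner", "guardrail_check"
    kind: str        # "llm" | "tool" | "retrieval" | "eval"
    cost_usd: float  # fully loaded cost of this single operation

@dataclass
class Trace:
    """A complete workflow execution, from user request to final response."""
    feature_name: str
    customer_id: str
    spans: list[Span] = field(default_factory=list)

    @property
    def cost_usd(self) -> float:
        # Cost-per-trace is the sum of every span, visible to the user or not.
        return sum(s.cost_usd for s in self.spans)

support_trace = Trace("support_agent", "cust_42", [
    Span("retrieve", "retrieval", 0.004),
    Span("planner", "llm", 0.011),
    Span("crm_lookup", "tool", 0.002),
    Span("generate_response", "llm", 0.019),
    Span("guardrail_check", "eval", 0.003),
])
print(f"${support_trace.cost_usd:.3f}/trace")  # $0.039, roughly double the one visible LLM call

Every cost category in the next section shows up as one or more spans in a structure like this; the per-trace number is just the rollup.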

When you think in traces, uncomfortable truths emerge:

  • Your "cheap" Haiku-powered feature costs $0.15/trace because of the retrieval pipeline wrapped around it
  • Your agent workflow costs $2.40/trace, not the $0.08 your token math suggested
  • 40% of your spend is on steps users never see

What Actually Lives Inside a Trace

Let's break down every cost component in a production AI workflow.

1. LLM Generation Costs (What You're Already Tracking)

Component      | Notes
Input tokens   | The prompt you send
Output tokens  | The response you receive (usually 3-5x input pricing)
Cached tokens  | Reused prefixes (50-90% discount, but mechanics matter)
Hidden tokens  | Planning, routing, self-critique steps you don't see in the final output

The hidden tokens trap: Multi-step agents often generate 5-20x more internal tokens than visible output. Planning prompts, tool-routing decisions, memory summarization, self-check passes—all billed, none visible to users.
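
A small cost helper makes the trap visible. The prices and token counts below are placeholders, not any provider's actual rates:

def llm_call_cost(
    input_tokens: int,
    output_tokens: int,
    cached_tokens: int = 0,
    *,
    input_price: float = 3.00,    # $ per 1M uncached input tokens (placeholder)
    cached_price: float = 0.30,   # $ per 1M cached input tokens (placeholder)
    output_price: float = 15.00,  # $ per 1M output tokens (placeholder)
) -> float:
    """Cost of one LLM call, splitting cached from uncached input."""
    uncached = input_tokens - cached_tokens
    return (uncached * input_price
            + cached_tokens * cached_price
            + output_tokens * output_price) / 1_000_000

# One visible generation vs. three hidden planning/routing/critique calls:
visible = llm_call_cost(1_200, 400)
hidden = sum(llm_call_cost(i, o) for i, o in [(2_000, 300), (1_500, 250), (1_800, 200)])
print(f"visible ${visible:.4f}, hidden ${hidden:.4f} ({hidden / visible:.1f}x the visible cost)")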

2. Tool Costs (The Forgotten Line Item)

Modern agents are tool farms. Every external call is a cost:

Tool Type                   | Cost Model
Web search (Serper, Tavily) | Per-query ($0.001-0.01)
Browser automation          | Per-page or per-minute
Database queries            | Per-read, varies by provider
CRM/ticketing APIs          | Per-call or monthly + overage
Payment processing          | Per-transaction
Maps/geocoding              | Per-request

If your agent calls three tools per trace, that's three line items your token math ignores.

Track: tool_name, tool_vendor_cost, tool_latency, tool_retry_count
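
One way to make tool spend a first-class line item is to log exactly those fields per call. The record below is a sketch of that shape, not a required schema:

from dataclasses import dataclass

@dataclass
class ToolCallRecord:
    trace_id: str
    tool_name: str           # e.g. "web_search", "crm_lookup"
    tool_vendor_cost: float  # what the vendor charges per call, in USD
    tool_latency_ms: int
    tool_retry_count: int = 0

def tool_cost_for_trace(calls: list[ToolCallRecord]) -> float:
    # Retries hit the vendor's meter too, so count every attempt.
    return sum(c.tool_vendor_cost * (1 + c.tool_retry_count) for c in calls)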

3. Orchestration Overhead (Tokens You Didn't Know You Were Paying For)

Multi-step agents accumulate hidden costs:

Hidden Cost           | What It Is
Planner prompts       | Agent deciding what to do next
Schema/JSON padding   | Formatting overhead on every call
Memory summarization  | Compressing context between steps
Rubric/critique steps | Agent validating its own work
Fallback passes       | When primary model fails, backup runs

A 5-step agent workflow doesn't cost 5x a single call. It costs 5x plus all the routing, planning, and verification steps in between—often 8-12 actual LLM calls.
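
As a rough illustration (the call counts are hypothetical, not measured), a "5-step" agent trace might decompose like this:

# Hypothetical LLM-call breakdown for a "5-step" agent workflow.
llm_calls = {
    "planner / router":       3,  # deciding what to do before major steps
    "step execution":         5,  # the five steps you actually designed
    "memory summarization":   1,  # compressing context partway through
    "self-critique / verify": 2,  # the agent checking its own work
}
print(sum(llm_calls.values()))  # 11 LLM calls, not 5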

4. Retrieval Costs (RAG Isn't Free)

If you're using RAG, you have a parallel cost structure:

Component            | Cost Type
Embedding generation | One-time per document chunk
Vector database      | Monthly (Pinecone: $70-500+/month)
Query embedding      | Per-search
Retrieval/rerank     | Model calls for relevance scoring
Context stuffing     | Retrieved chunks become input tokens

Real example: A mid-sized e-commerce firm saw costs jump from $5K/month in prototyping to $50K/month in staging due to unoptimized RAG queries fetching 10x more context than needed.
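
A per-trace retrieval cost is roughly a per-query piece plus an amortized infrastructure piece. A sketch, with placeholder numbers:

def retrieval_cost_per_trace(
    query_embedding_cost: float = 0.00002,  # one embedding call per search (placeholder)
    rerank_cost: float = 0.002,             # relevance-scoring model call (placeholder)
    retrieved_tokens: int = 6_000,          # chunks stuffed into the prompt
    input_price_per_m: float = 3.00,        # $ per 1M input tokens (placeholder)
    vector_db_monthly: float = 300.0,       # vector database subscription
    traces_per_month: int = 100_000,
) -> float:
    context_stuffing = retrieved_tokens * input_price_per_m / 1_000_000
    infra_amortized = vector_db_monthly / traces_per_month
    return query_embedding_cost + rerank_cost + context_stuffing + infra_amortized

print(f"${retrieval_cost_per_trace():.4f}/trace")  # most of it is context stuffing, not the database

Note how the retrieved chunks, which re-enter the prompt as input tokens, usually dominate; that is exactly the failure mode in the example above.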

5. Caching (Mechanics, Not Just Discounts)

Caching isn't just "cheaper tokens." It has specific behaviors:

Factor             | Impact
Cache key strategy | How you structure prompts for reuse
Minimum thresholds | OpenAI: 1,024+ tokens to trigger
Cache hit rate     | % of requests served from cache
TTL/eviction       | Caches expire; cold starts cost full price
Platform variance  | Azure OpenAI has different rules than OpenAI direct

The insight: Stable system prompts with dynamic user content = high cache hits. Fully dynamic prompts = no cache benefit. Prompt architecture directly affects cost.

Track: cache_hit_rate, cached_prefix_tokens, cache_key_strategy
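
The dollar impact reduces to a blended input price driven by hit rate. A sketch with placeholder prices:

def effective_input_price(
    hit_rate: float,             # fraction of input tokens served from cache
    full_price: float = 3.00,    # $ per 1M uncached input tokens (placeholder)
    cached_price: float = 0.30,  # $ per 1M cached input tokens (placeholder, ~90% discount)
) -> float:
    return hit_rate * cached_price + (1.0 - hit_rate) * full_price

print(f"{effective_input_price(0.8):.2f}")  # 0.84 -> stable system prompt, ~72% cheaper input
print(f"{effective_input_price(0.0):.2f}")  # 3.00 -> fully dynamic prompts, no cache benefit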

6. Reliability Loops (The Silent Budget Killer)

Production AI adds model calls for quality assurance:

Cost Type             | What It Is
Offline evals         | Regression testing on prompt/model changes
Online judges         | LLM-as-judge verifying output quality
Guardrails            | Safety/policy checks (often separate model calls)
Replay/debugging      | Re-running traces to diagnose issues
Incident reprocessing | Fixing outputs after failures

This is where prototype → production costs explode. A workflow that cost $0.05/trace in development suddenly costs $0.15/trace in production because you added eval, safety, and monitoring layers.

Track: eval_suite_cost, judge_model_cost, safety_model_cost, incident_cost
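
Some of these are per-trace model calls (judges, guardrails) and some are fixed pipelines that have to be amortized over traffic. A sketch with hypothetical numbers:

def reliability_cost_per_trace(
    judge_model_cost: float = 0.004,        # LLM-as-judge per trace (hypothetical)
    safety_model_cost: float = 0.002,       # guardrail check per trace (hypothetical)
    offline_eval_monthly: float = 2_000.0,  # regression suite runs (hypothetical)
    incident_monthly: float = 500.0,        # replays and reprocessing (hypothetical)
    traces_per_month: int = 100_000,
) -> float:
    amortized = (offline_eval_monthly + incident_monthly) / traces_per_month
    return judge_model_cost + safety_model_cost + amortized

print(f"${reliability_cost_per_trace():.4f}/trace")  # the prototype-to-production delta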

7. Failure Paths (You Pay for Mistakes)

Things break. You still pay:

Failure Type        | Cost
Failed requests     | Tokens consumed before failure
Retries             | Rate limits, timeouts = paying 2-3x
Fallback chains     | Primary → secondary → tertiary model
Partial completions | Streaming failures mid-response
Validation failures | Output rejected, re-run required

If your retry rate is 5% and your fallback rate is 2%, you're paying 7% more than your happy-path math suggests.
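
That 7% figure is just a first-order expected-cost adjustment:

def failure_overhead_multiplier(retry_rate: float, fallback_rate: float) -> float:
    # First-order approximation: each retry or fallback re-pays roughly one full request.
    return 1.0 + retry_rate + fallback_rate

happy_path_cost = 1.00
print(f"{happy_path_cost * failure_overhead_multiplier(0.05, 0.02):.2f}")  # 1.07 -> 7% over the happy-path estimate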


The Complete Formula

Shift from token costs to trace costs:

Cost per Feature Invocation =
    Σ(step_cost)           // All LLM calls in the trace
  + Σ(tool_cost)           // All external API calls
  + retrieval_cost         // RAG pipeline overhead
  + eval_cost_allocation   // Reliability loops, amortized
  + failure_cost           // Retries, fallbacks, incidents
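
In code, the same rollup might look like the function below; the argument names mirror the formula and are otherwise illustrative:

def cost_per_invocation(
    step_costs: list[float],      # every LLM call in the trace, visible or hidden
    tool_costs: list[float],      # every external API call
    retrieval_cost: float,        # RAG pipeline overhead for this trace
    eval_cost_allocation: float,  # judges, guardrails, amortized eval suites
    failure_cost: float,          # retries, fallbacks, incident reprocessing
) -> float:
    return (sum(step_costs) + sum(tool_costs)
            + retrieval_cost + eval_cost_allocation + failure_cost)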

Then calculate margins at the feature level, not the product level:

Feature        | Avg Trace Cost | Revenue/Use | Margin
Quick chat     | $0.02          | $0.05       | 60%
Doc analysis   | $0.18          | $0.25       | 28%
Agent workflow | $1.85          | $2.00       | 7.5%
Research task  | $4.20          | $3.00       | -40%

That last row illustrates margin compression from feature-level cost variance: a single underwater feature, favored by your heaviest users, can erase the margin earned everywhere else, and none of this shows up when you only track aggregate token spend.
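
To see the blending effect, weight each feature's margin by its share of revenue. The usage mix below is hypothetical; the costs and prices come from the table above:

# (avg trace cost, revenue per use, monthly uses) -- usage mix is hypothetical
features = {
    "quick_chat":     (0.02, 0.05, 500_000),
    "doc_analysis":   (0.18, 0.25, 100_000),
    "agent_workflow": (1.85, 2.00, 40_000),
    "research_task":  (4.20, 3.00, 15_000),
}

revenue = sum(price * uses for _, price, uses in features.values())
cost = sum(c * uses for c, _, uses in features.values())
print(f"blended margin: {(revenue - cost) / revenue:.1%}")  # ~5.7%

Three of the four features are profitable on their own, yet the blend lands in the single digits, and aggregate token spend would never tell you why.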

We've written about this pattern in Usage Variance in AI Products. The same customer segment with highest engagement often has the highest cost-to-serve.


The Attribution Problem (And How to Solve It)

Even if you track all these costs, attributing them is hard. Finance can't tie token spend to business units. Product can't see which features are underwater. Engineering doesn't know if the last deploy made costs better or worse.

The fix is trace-native attribution. Every trace needs a canonical tag set:

Tag                                       | Purpose
customer_id                               | Who to bill
workspace_id                              | Multi-tenant isolation
feature_name                              | Which product feature
agent_name, agent_version                 | Which agent, which release
workflow_name, workflow_version           | Which workflow variant
model_provider, model_name, model_version | Which model
prompt_id, prompt_version                 | Which prompt template
environment                               | prod / staging / dev
experiment_id                             | A/B test variant
trace_id                                  | Ties everything together

This isn't over-engineering. It's the minimum viable tagging to answer basic questions:

  • "Why did costs spike last Tuesday?" → Filter by prompt_version, find the regression
  • "Which customers are underwater?" → Group by customer_id, compare trace costs to revenue
  • "Is the new agent version cheaper?" → Compare agent_version A vs B

If you're using OpenTelemetry for tracing, these tags flow naturally. Tools like Langfuse have native support for this exact pattern.
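
As a sketch of what that looks like with the OpenTelemetry Python SDK (the attribute values here are placeholders, and attribute naming conventions vary by backend):

from opentelemetry import trace

tracer = trace.get_tracer("billing-attribution-demo")

def handle_request(customer_id: str, user_query: str) -> str:
    # The root span represents the trace; child spans (LLM calls, tool calls)
    # created inside this block inherit its context automatically.
    with tracer.start_as_current_span("feature_invocation") as span:
        span.set_attribute("customer_id", customer_id)
        span.set_attribute("feature_name", "support_agent")
        span.set_attribute("agent_version", "v12")
        span.set_attribute("prompt_version", "v14")
        span.set_attribute("environment", "prod")
        # ... run the workflow, creating one child span per step ...
        return "response"

With attributes like these on every trace, the three questions above become group-bys instead of engineering tickets.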


Pricing Readiness Checklist for AI Products

Before you set (or change) prices, you should be able to answer these questions:

Cost Visibility

  • Do you know your cost per trace for each major feature?
  • Can you break down trace cost by span (LLM calls, tools, retrieval)?
  • Do you track orchestration overhead separately from visible output?
  • Are tool costs (external APIs) first-class line items?
  • Do you measure cache hit rates and their dollar impact?

Reliability Costs

  • Do you know what your eval/judge pipeline costs per trace?
  • Are guardrail/safety model costs tracked?
  • What's your retry rate? Fallback rate? Incident reprocessing cost?
  • Can you attribute cost spikes to specific prompt or agent versions?

Attribution

  • Is every trace tagged with customer, feature, and version metadata?
  • Can finance see cost-per-customer without engineering help?
  • Can product see cost-per-feature in near real time?
  • Do deploys automatically surface cost deltas?

Margin Safety

  • Do you know which features are underwater?
  • Do you know which customer segments are unprofitable?
  • Do you have alerts when a customer's trace costs exceed their revenue?
  • Can you model the margin impact before launching a new feature?

If you answered "no" to more than two of these, there are gaps in your pricing visibility.


Summary

Token-based cost thinking reflects an earlier era of wrapper apps. Modern AI products are workflows—multi-step, tool-heavy, reliability-wrapped workflows—and they need workflow-level cost accounting.

Companies with accurate margin visibility can make informed pricing decisions. Understanding true cost-per-trace at the feature level is the foundation of sustainable AI product economics.


What Bear Billing Does

We built Bear Billing because we lived this problem. Per-customer margin visibility, feature-level cost attribution, and pricing scenario modeling—so you can see which customers are profitable, which features are underwater, and what happens to your margins before you ship pricing changes.

If any of this resonated, talk to us. We're onboarding our founding cohort now.
