Why token logging breaks AI SaaS billing

You added AI features. You started logging tokens. Everything seemed fine — until it wasn’t.

The token log line is not wrong, exactly. It captures something real. For the first few weeks, maybe months, it tells you roughly what’s happening. You can query it, build a dashboard, do rough cost attribution.

Then the problems arrive. Not all at once. Usually they come one at a time, each as a subtle discrepancy, a confused support ticket, or a finance question you can’t cleanly answer. By the time you realize the log line was never a billing system, you’ve built product features on top of it that assume it is. Unwinding that is painful.

This article explains specifically why that happens, and the mental model you need before building AI billing that holds up.

The Five Ways It Breaks

What a Log Line Actually Captures

What You Actually Need to Model

Where the Architecture Lives

The Immutable Event at the Center

Rating Is Not Enrichment

What This Article Argues

Frequently Asked Questions

The Five Ways It Breaks

The model that ran is not the model you logged

Routing is a fact of life in multi-provider AI systems. You request gpt-4o. The provider is slow, the feature routes to a cheaper model for cost reasons, or a fallback fires on an error. The call succeeds — but it ran on claude-3-haiku, which has a meaningfully different cost structure.

Your log still says model: "gpt-4o". Because that’s what you passed in, not what ran.

If billing is derived from that log, you’re billing based on the requested model. The provider is invoicing you for the resolved model. These can diverge significantly — not by a rounding error, but by an order of magnitude if the fallback goes in the wrong direction.

The fix sounds simple: log the resolved model, not the requested one. But it requires your execution layer to capture what actually happened — provider response metadata, the resolved routing decision, and which attempt in a retry chain succeeded. A logger.info call at the top of the function, before any of that is known, cannot give you this. The log needs to happen at the end, not the beginning.

The failure mode is easy to miss because the log looks complete. You have a model name, you have token counts — what else could you need? What you’re missing is the distinction between what was requested and what actually executed. Those two facts need to exist as separate fields: requestedAlias and resolvedModel. They are not the same thing.
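A minimal TypeScript sketch of that distinction follows. The `ProviderResult` and `ExecutionRecord` shapes and the `toExecutionRecord` helper are hypothetical names chosen for illustration, not part of any provider SDK. The only point is that the record is built after the call completes, from what the provider reports.

```typescript
// Hypothetical shapes for illustration; none of these names come from a real SDK.
interface ProviderResult {
  model: string; // model the provider reports it actually ran
  inputTokens: number;
  outputTokens: number;
}

interface ExecutionRecord {
  requestedAlias: string; // what the caller asked for
  resolvedModel: string; // what actually executed
  attempt: number; // which attempt in the retry chain succeeded
  inputTokens: number;
  outputTokens: number;
}

// Build the record only after the call has finished, from the provider's response.
function toExecutionRecord(
  requestedAlias: string,
  attempt: number,
  result: ProviderResult,
): ExecutionRecord {
  return {
    requestedAlias,
    resolvedModel: result.model,
    attempt,
    inputTokens: result.inputTokens,
    outputTokens: result.outputTokens,
  };
}

// Example: gpt-4o was requested, but a fallback on the second attempt ran claude-3-haiku.
const fallbackRecord = toExecutionRecord("gpt-4o", 2, {
  model: "claude-3-haiku",
  inputTokens: 420,
  outputTokens: 380,
});
console.log(fallbackRecord.requestedAlias, "->", fallbackRecord.resolvedModel);
```

A record like this keeps both facts, so billing can use `resolvedModel` while the original request stays visible.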

One user action is not one provider call

The assumption baked into most early-stage AI billing is that the billing unit maps cleanly to the user’s action: user asks a question → one LLM call → one log entry → one billing event. It’s a clean mental model. It breaks as soon as the product has any complexity.

A “summarize this document” feature might call an extraction model, a summarization model, and a classification model. A research agent might call a model once to plan, then again for each step, then once more to synthesize. A customer support bot might run intent classification, retrieval, generation, and a safety check as four separate provider calls before returning a single response.

A log entry per call helps — but only if you know which calls belong together, what the aggregate cost is, what the user-facing billing unit is, and how to handle partial failure (three of four calls succeeded; what do you bill?). The log line models none of this. In production, it usually shows up as a billing dashboard that looks plausible until you reconcile it against a provider invoice and find your totals off by whatever your average fan-out ratio is — a number you didn’t know, because nothing measured it.
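As a rough sketch of what measuring this could look like, the TypeScript below groups per-call records by a shared operation identifier and computes the average fan-out. The `operationId` field, the `CallRecord` shape, and the choice to count tokens only from successful calls are all assumptions made for the example. How to bill a partially failed operation remains a policy decision that this code only makes visible.

```typescript
// Hypothetical per-call record; operationId ties provider calls to one user action.
interface CallRecord {
  operationId: string;
  step: string;
  succeeded: boolean;
  totalTokens: number;
}

interface OperationSummary {
  calls: number;
  failedCalls: number;
  totalTokens: number; // tokens from successful calls only (an assumed policy)
}

function summarizeByOperation(calls: CallRecord[]): Map<string, OperationSummary> {
  const summaries = new Map<string, OperationSummary>();
  for (const call of calls) {
    const s = summaries.get(call.operationId) ?? { calls: 0, failedCalls: 0, totalTokens: 0 };
    s.calls += 1;
    if (call.succeeded) {
      s.totalTokens += call.totalTokens;
    } else {
      s.failedCalls += 1;
    }
    summaries.set(call.operationId, s);
  }
  return summaries;
}

// Average number of provider calls per user-facing operation.
function averageFanOut(calls: CallRecord[]): number {
  const operations = summarizeByOperation(calls).size;
  return operations === 0 ? 0 : calls.length / operations;
}

const supportBotCalls: CallRecord[] = [
  { operationId: "op-1", step: "intent", succeeded: true, totalTokens: 120 },
  { operationId: "op-1", step: "retrieval", succeeded: true, totalTokens: 300 },
  { operationId: "op-1", step: "generation", succeeded: true, totalTokens: 900 },
  { operationId: "op-1", step: "safety", succeeded: false, totalTokens: 0 },
  { operationId: "op-2", step: "intent", succeeded: true, totalTokens: 100 },
  { operationId: "op-2", step: "generation", succeeded: true, totalTokens: 700 },
];
const opSummaries = summarizeByOperation(supportBotCalls);
console.log(opSummaries.get("op-1"), averageFanOut(supportBotCalls));
```

Here `op-1` is the three-of-four case from the text: four calls, one failed, and an aggregate token count that no single log entry contains.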

Included credits and overage are not binary

Most AI SaaS products have a plan structure: included usage, then overage at a per-unit rate. A Pro plan might include 100,000 tokens per month. Usage above that is billed at $0.002 per 1,000 tokens.

The interesting case is at the boundary. A tenant has used 99,700 of their 100,000 included tokens. Their next request consumes 800 tokens. 300 of those are included. 500 are overage.

The naive approach — classify the whole request as included or overage based on where the tenant stood at the start — produces wrong billing in either direction. Over-include, and you’re subsidizing usage beyond the plan limit. Over-bill for overage, and you’re charging for usage the customer already paid for.

The correct behavior splits the request: 300 tokens counted against the remaining allowance, 500 tokens rated as overage. That split has to happen at rating time, against the current allowance state, for each request.

A log line cannot do this. It records what happened at the provider. It has no concept of the tenant’s plan, their current allowance balance, or how to split a single provider call across two financial categories. The split is a rating decision, and rating is a separate operation from logging.
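The boundary case above can be worked through in a few lines of TypeScript. This is a sketch under stated assumptions: the function and field names are invented for the example, and money is held as integer micro-dollars (so $0.002 per 1,000 tokens becomes 2,000 micro-dollars) to keep the arithmetic exact.

```typescript
// Assumed representation: integer token counts, money in micro-dollars.
interface AllowanceSplit {
  includedTokens: number;
  overageTokens: number;
  overageMicros: number;
}

function splitAgainstAllowance(
  usedBefore: number,
  includedPerPeriod: number,
  requestTokens: number,
  overageMicrosPer1k: number,
): AllowanceSplit {
  // Whatever is left of the allowance absorbs the first part of the request.
  const remaining = Math.max(0, includedPerPeriod - usedBefore);
  const includedTokens = Math.min(requestTokens, remaining);
  // The rest of the request is rated as overage.
  const overageTokens = requestTokens - includedTokens;
  const overageMicros = Math.round((overageTokens * overageMicrosPer1k) / 1000);
  return { includedTokens, overageTokens, overageMicros };
}

// 99,700 of 100,000 tokens used; the next request consumes 800 tokens.
const boundarySplit = splitAgainstAllowance(99700, 100000, 800, 2000);
console.log(boundarySplit);
```

For this input the split is 300 included tokens and 500 overage tokens, and the overage rates to 1,000 micro-dollars, or $0.001. The allowance balance is an input to the function, which is why the split has to happen at rating time against current state.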

BYOK does not mean free

BYOK (Bring Your Own Key) is a common feature in AI SaaS. The customer provides their own API key for the underlying LLM provider, pays the provider directly, and the platform’s provider cost for those calls is zero. The instinct is to treat BYOK usage as invisible: no cost, nothing to track.

The problem is that BYOK only eliminates the provider cost line. It does not eliminate the platform’s infrastructure cost. Every BYOK call still consumes compute time on the platform’s servers, incurs retrieval costs if the feature uses RAG or vector search, adds moderation overhead if the platform runs safety checks, and passes through the platform’s gateway and logging infrastructure.

If the platform charges a service fee on top of BYOK usage, that fee needs to be tracked and billed even when provider cost is zero. If there’s no service fee, the platform is absorbing real overhead with no revenue — a deliberate business-model choice, but one that should be visible, not an accidental consequence of not recording BYOK calls.

keySource: 'customer' is a first-class field on the execution record, not a special case to skip.
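To make the cost picture concrete, here is a small TypeScript sketch. Only `keySource` comes from the text; the `CallCosts` shape, the per-call overhead estimate, and the micro-dollar amounts are assumptions for illustration.

```typescript
type KeySource = "platform" | "customer";

interface CallCosts {
  keySource: KeySource;
  providerCostMicros: number; // provider price of the call, whoever pays it
  platformOverheadMicros: number; // assumed estimate: compute, retrieval, moderation, gateway
  serviceFeeMicros: number; // fee billed to the customer for this call; may be 0
}

// Cost the platform itself bears. With a customer key the provider line is zero,
// but the overhead is not.
function platformCostMicros(c: CallCosts): number {
  const providerShare = c.keySource === "customer" ? 0 : c.providerCostMicros;
  return providerShare + c.platformOverheadMicros;
}

// Margin on the service fee alone (ignores any plan revenue).
function serviceFeeMarginMicros(c: CallCosts): number {
  return c.serviceFeeMicros - platformCostMicros(c);
}

// A BYOK call with no service fee: the platform still absorbs the overhead.
const byokCall: CallCosts = {
  keySource: "customer",
  providerCostMicros: 5000,
  platformOverheadMicros: 300,
  serviceFeeMicros: 0,
};
console.log(platformCostMicros(byokCall), serviceFeeMarginMicros(byokCall));
```

Skipping the record for BYOK calls would hide exactly the negative margin this example produces.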

Your billing provider is not your source of truth

When you integrate with Stripe, OpenMeter, or any external metering system, there’s a natural temptation to treat that system as the billing record. You sent them the data. They process invoices. They have dashboards.

But external billing systems are designed to be eventually consistent, not to serve as the authoritative record of what happened inside your product.

Stripe’s Billing Meters documentation is explicit about this: meter events are aggregated asynchronously, and meter event summaries may not yet reflect recently received events. A preview invoice can change between when you generate it and when it finalises. These are reasonable design choices for a billing platform — not the right properties for a system you’re relying on to answer “what actually happened?”

When a billing sync fails and retries, the external system may receive the event twice. When a provider invoice arrives three days later with a different token count than you recorded, which one is right? When a customer disputes a charge, can you reconstruct the exact billing logic that produced it — including the pricing version in effect at the time, the allowance balance at the moment of the request, and the fallback model that actually ran?

If your source of truth is the external billing system, the answer is “probably, with some manual reconstruction.” If your source of truth is your own ledger, the answer is “yes, and here’s the audit trail.”

The external billing system is a settlement partner. It receives confirmed billing data and generates invoices. It is not the record of what your product did.

What a Log Line Actually Captures

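A token log line captures one thing: that at some point in the execution of some request, a provider was called, and these were the token counts returned. It does not capture which model actually ran after routing or fallback, which retry attempt succeeded, which user operation the call belongs to, the tenant’s plan and allowance state, or whether the call was platform-paid or customer-key-paid.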

None of these are data-engineering problems. They are modeling problems. The log line is modeling the wrong thing — it’s an observability artifact, not a financial record.
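For illustration, a token log line of the kind discussed here might be produced like this. The field names and the `formatTokenLog` helper are assumptions, not taken from any particular logging library.

```typescript
// A hypothetical token log line: the requested model name and the returned token counts.
interface TokenLogLine {
  model: string; // the model name that was passed in
  promptTokens: number;
  completionTokens: number;
}

function formatTokenLog(line: TokenLogLine): string {
  return JSON.stringify({ msg: "llm_call", ...line });
}

const logLine = formatTokenLog({ model: "gpt-4o", promptTokens: 420, completionTokens: 380 });
console.log(logLine);
```

Nothing in that structure identifies the operation, the attempt, the tenant, or who paid the provider, which is what makes it an observability artifact.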

What You Actually Need to Model

AI billing involves three distinct kinds of facts that need to be stored and managed separately.

What happened. The execution facts: which provider call was made, which model actually ran, what token dimensions were returned, which attempt succeeded, what the provider call ID was. This is operational data. It should be immutable once written — it records reality, and reality doesn’t get revised.

What it means financially. The rated interpretation: given the current pricing catalog, this tenant’s plan, their current allowance balance, and whether this is platform-paid or customer-key-paid — how much did this call cost the platform, and how much does it add to the customer’s invoice? This changes when pricing changes, when plans change, when entitlements reset. It should be versioned, replayable, and separate from the execution record.

What external systems confirm. The settlement facts: what the provider invoice says, what the billing provider has recorded, what credits have been applied. This arrives asynchronously — sometimes hours later, sometimes days. It should be compared against your internal records and reconciled, not used as the primary source of truth.

These three categories have a name in the architecture this series describes: Execution Truth, Rated Truth, and Settlement Truth.

The diagram below shows how the three truths relate to the hot path, the cold path, and the settlement layer:

The three truths of AI billing. Execution records what happened. Rating interprets financial meaning. Settlement confirms against external systems. Collapsing them into one record makes any one of them unreliable.

Without separate Execution Truth, you cannot replay billing history after a pricing correction. You cannot reconstruct what actually ran when a fallback occurred. You cannot compare your records against a provider invoice that arrived late.

Without separate Rated Truth, you cannot change pricing without retroactively altering billing history. You cannot split a single provider call across an included-quota and overage boundary at rating time. You cannot tell finance what the platform’s cost was versus what the customer was billed.

Without separate Settlement Truth, you cannot detect drift between your records and what external systems show. You cannot make corrections that are auditable and append-only. You cannot answer “why is the provider invoice different from what we expected?”

Where the Architecture Lives

Once you accept that three truths need to exist separately, a natural structure emerges.

The hot path is the request lifecycle. It does the minimum necessary to return a response and preserve execution truth: create a budget reservation, route to the provider, write one immutable record — the UsageEvent — with what actually happened. Then return. The hot path does not rate the usage, does not aggregate billing, and does not sync with an external billing provider. It records what happened and defers everything else.

The cold path is where rated truth is produced. Asynchronously, a rating worker reads immutable UsageEvent records, applies versioned pricing, splits usage against allowances, and produces RatedUsageLine records. These flow into a ledger, then into billing aggregation, then into sync with the external billing provider. The cold path can fall behind under load. Billing aggregates may be minutes behind. That’s acceptable. The important invariant is that no execution fact gets dropped.

The settlement layer runs on a longer clock. Provider invoices arrive. Billing-provider summaries are exported. The reconciliation engine compares them against internal rated truth, classifies any drift, and creates append-only correction records. Closed billing periods are never modified retroactively.
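A reconciliation step along these lines could be sketched as follows. The drift classes, the tolerance threshold, the `Correction` shape, and the period label are all assumptions for the example. The one property taken from the text is that corrections are appended and existing records are not edited.

```typescript
type DriftClass = "match" | "within_tolerance" | "drift";

interface Correction {
  readonly period: string;
  readonly deltaMicros: number; // external total minus internal total
  readonly reason: string;
}

function classifyDrift(
  internalMicros: number,
  externalMicros: number,
  toleranceMicros: number,
): DriftClass {
  const delta = Math.abs(externalMicros - internalMicros);
  if (delta === 0) return "match";
  return delta <= toleranceMicros ? "within_tolerance" : "drift";
}

// Returns a new list with a correction appended; the input list is never mutated.
function reconcile(
  period: string,
  internalMicros: number,
  externalMicros: number,
  toleranceMicros: number,
  corrections: readonly Correction[],
): readonly Correction[] {
  if (classifyDrift(internalMicros, externalMicros, toleranceMicros) !== "drift") {
    return corrections;
  }
  const correction: Correction = {
    period,
    deltaMicros: externalMicros - internalMicros,
    reason: "provider_invoice_drift",
  };
  return [...corrections, correction];
}

// Internal rated total of $1.00 versus a provider invoice of $1.04, tolerance $0.005.
const priorCorrections: readonly Correction[] = [];
const correctionsAfter = reconcile("2025-01", 1000000, 1040000, 5000, priorCorrections);
console.log(correctionsAfter);
```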

The diagram below shows the separation between request path, async cold path, and settlement:

The request path records execution truth and returns. Rating, aggregation, and sync happen asynchronously. Settlement compares internal and external truth and corrects forward — never backward.

The Immutable Event at the Center

The concrete anchor for all of this is the UsageEvent. The key things it must contain:

resolvedModel and requestedAlias are separate fields. The event knows both what was asked for and what actually ran.

keySource is explicit. BYOK is a first-class property of the execution fact, not something inferred later.

idempotencyKey is derived from three components — operation, provider call, attempt number — so replaying a failed request doesn’t create duplicate billing records, but two separate successful provider calls on different retries are correctly recorded as two separate events.

There are no cost fields. No billability classification. No plan information. Those are rating decisions that depend on context that changes over time. Baking them into the execution record is the mistake that makes pricing changes painful.

pricingVersion records which pricing catalog was active at the moment of the call, allowing re-rating to produce correct historical numbers even after pricing changes.
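Put together, the fields described above might look like the following TypeScript sketch. The field names mentioned in the text are kept as-is. Everything else is an assumption for illustration: the remaining fields, the colon-joined key format, and the in-memory `UsageEventStore`. A real key format would also need to handle identifiers that contain the separator.

```typescript
interface UsageEvent {
  readonly idempotencyKey: string;
  readonly operationId: string;
  readonly providerCallId: string;
  readonly attempt: number;
  readonly requestedAlias: string;
  readonly resolvedModel: string;
  readonly keySource: "platform" | "customer";
  readonly inputTokens: number;
  readonly outputTokens: number;
  readonly pricingVersion: string;
  // Deliberately no cost, billability, or plan fields.
}

// Key derived from operation, provider call, and attempt number.
function idempotencyKeyFor(operationId: string, providerCallId: string, attempt: number): string {
  return `${operationId}:${providerCallId}:${attempt}`;
}

// In-memory stand-in for an append-only store that ignores duplicate keys.
class UsageEventStore {
  private readonly events = new Map<string, UsageEvent>();

  append(e: UsageEvent): boolean {
    if (this.events.has(e.idempotencyKey)) return false; // replay, nothing written
    this.events.set(e.idempotencyKey, Object.freeze({ ...e }));
    return true;
  }

  get size(): number {
    return this.events.size;
  }
}

function makeUsageEvent(operationId: string, providerCallId: string, attempt: number): UsageEvent {
  return {
    idempotencyKey: idempotencyKeyFor(operationId, providerCallId, attempt),
    operationId,
    providerCallId,
    attempt,
    requestedAlias: "gpt-4o",
    resolvedModel: "claude-3-haiku",
    keySource: "platform",
    inputTokens: 420,
    outputTokens: 380,
    pricingVersion: "2025-01",
  };
}

const usageStore = new UsageEventStore();
const firstWrite = usageStore.append(makeUsageEvent("op-1", "call-a", 1));
const replayWrite = usageStore.append(makeUsageEvent("op-1", "call-a", 1)); // same key
const retryWrite = usageStore.append(makeUsageEvent("op-1", "call-b", 2)); // new provider call
console.log(firstWrite, replayWrite, retryWrite, usageStore.size);
```

The replayed write is dropped, while the second provider call on a later attempt is stored as its own event.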

Rating Is Not Enrichment

Rating turns an immutable UsageEvent into financial meaning, and it’s worth being precise about why this is different from enrichment.

Enrichment would be: take the event, look up the current price, add a cost field, store it back. This seems efficient. It’s wrong because it makes the event mutable in response to external state. When pricing changes, you either rewrite historical events (dangerous) or live with mixed pricing in the same table with no clean boundary.

Rating keeps the event immutable and produces a separate RatedUsageLine that has its own version, references the source event by ID, and can be superseded when pricing changes.

Because rating is a pure function, you can run it in shadow mode before cutting over to new pricing, run it counterfactually to answer “what would last month have cost on the Enterprise tier?”, and replay it exactly when a worker crashes and restarts. None of these capabilities are possible if rating is baked into the execution record.
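A minimal sketch of rating as a pure function is shown below. The catalog shape, the prices, and the version labels are invented for the example, and the allowance split is left out to keep it short. The output depends only on the event and the catalog passed in, so the same call works for live rating, shadow runs, and replays.

```typescript
interface RatableEvent {
  readonly id: string;
  readonly resolvedModel: string;
  readonly totalTokens: number;
}

interface PricingCatalog {
  readonly version: string;
  readonly microsPer1kTokens: Readonly<Record<string, number>>;
}

interface RatedUsageLine {
  readonly sourceEventId: string; // reference back to the immutable event
  readonly pricingVersion: string;
  readonly amountMicros: number;
}

// Pure: no clock, no I/O, no mutation of the event.
function rate(ev: RatableEvent, catalog: PricingCatalog): RatedUsageLine {
  const price = catalog.microsPer1kTokens[ev.resolvedModel];
  if (price === undefined) {
    throw new Error(`no price for ${ev.resolvedModel} in catalog ${catalog.version}`);
  }
  return {
    sourceEventId: ev.id,
    pricingVersion: catalog.version,
    amountMicros: Math.round((ev.totalTokens * price) / 1000),
  };
}

const sampleUsage: RatableEvent = Object.freeze({
  id: "evt-1",
  resolvedModel: "claude-3-haiku",
  totalTokens: 800,
});

// Made-up prices: the active catalog and a candidate being evaluated in shadow mode.
const activeCatalog: PricingCatalog = { version: "v1", microsPer1kTokens: { "claude-3-haiku": 250 } };
const candidateCatalog: PricingCatalog = { version: "v2", microsPer1kTokens: { "claude-3-haiku": 500 } };

const ratedNow = rate(sampleUsage, activeCatalog);
const ratedShadow = rate(sampleUsage, candidateCatalog);
console.log(ratedNow, ratedShadow);
```

Both lines point at the same `sourceEventId`, and the event itself is untouched, so a new pricing version produces a new line rather than a rewritten event.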

What This Article Argues

Token logging is not a billing system, and treating it as one creates structural problems that don’t show up until the product is at a stage where fixing them is expensive.

The model that ran is not the model you logged. One user action is not one provider call. Included credits and overage are not binary. BYOK does not mean free. Your billing provider is not your source of truth.

All of the answers rest on the same foundation: keeping three kinds of facts — execution truth, rated truth, settlement truth — in separate, appropriately-mutable records, rather than collapsing them into one log line.

Part 2 of this series covers the architecture of that ledger in detail: BudgetReservation as a state machine that prevents concurrent overspend, the Rating Engine as a versioned pure function, the Cost Risk Engine that watches platform margin independently of customer caps, reconciliation as a first-class settlement layer, and how to trace any customer-facing charge backward to the exact provider call that created it, without ad-hoc queries or reconstructed logic.

Read Part 2: Building the Reservation-Aware AI Usage Ledger

Frequently Asked Questions

Can token logs be used as a billing system?

No. A token log is an observability artifact, not a financial record. It captures the token counts a provider returned, but not which model actually ran after routing or fallback, which retry attempt succeeded, which user operation a call belongs to, the tenant’s plan and allowance state, or whether the call was platform-paid or customer-key-paid. Billing derived directly from logs breaks once the product grows beyond a one-call-per-request model.

Why should AI billing use the resolved model instead of the requested model?

Because routing and fallback can change which model actually runs. You may request gpt-4o and have the call resolve to a cheaper or more expensive model. The provider invoices you for the model that ran, not the one you requested, so billing must be based on resolvedModel. The requested alias is kept only for debugging.

Does BYOK usage still need to be tracked?

Yes. BYOK only zeroes out the provider cost line. The platform still pays for compute, retrieval, moderation, gateway, and logging on every BYOK call. Recording the execution with keySource: 'customer' is the only way to know whether BYOK customers are profitable, break-even, or losing money for the platform.

What are the three truths of AI billing?

Execution Truth (what happened: the immutable record of which provider call ran), Rated Truth (what it means financially: the versioned interpretation against pricing and plan), and Settlement Truth (what external systems confirm: provider invoices and billing-provider summaries). Keeping them in separate, appropriately-mutable records is what makes the system replayable, auditable, and reconcilable.

Should an external billing or metering system be the source of truth?

No. External metering systems are designed to be eventually consistent: meter events aggregate asynchronously and summaries may lag recently received events. They are settlement partners that receive confirmed data and generate invoices. The authoritative record of what your product did should be your own internal ledger.
