Building the reservation-aware AI usage ledger

Part 1 made the case that AI billing needs a ledger, not a log. This part shows what that ledger looks like — and why each design decision is harder than it first appears.

We’ll go component by component: how budget reservations prevent concurrent overspend, how the Rating Engine separates financial interpretation from execution facts, how the Cost Risk Engine defends platform margin, how billing sync stays correct under failure, how reconciliation closes the loop with external systems, and how any customer-facing charge stays traceable backward to the raw provider call.

New here? Start with Part 1: Why Token Logging Breaks AI SaaS Billing

The Invariant Everything Else Depends On

The Data Model

A Full Ledger Flow Example

BudgetReservation: Why a Cap Check Isn’t Enough

Rating Engine: The Boundary Between Fact and Meaning

Cost Risk Engine: The Margin Defender

Billing Sync and Operational Correctness

Reconciliation: Closing the Settlement Loop

Explainability: The Backward Trace

The Gotchas

What to Build First vs What to Defer

What Production Looks Like

Practical Takeaways

Frequently Asked Questions

The Invariant Everything Else Depends On

Before individual components, there’s one mechanical property the whole architecture depends on. Every financially meaningful movement in the ledger — holding budget before a call, capturing actual usage after, releasing unused budget, recording overage, applying a correction — should appear as a paired entry that nets to zero. This is double-entry accounting applied to a usage ledger, and it’s the property that makes the system auditable by construction rather than by convention.

Define the accounts:

Tenant.Allowance           // granted credits for this billing period
Tenant.HeldByReservation   // held before execution, not yet captured
Tenant.Available           // Allowance minus Held minus Spent
Tenant.OverageBilled       // overage charges accumulating on the active invoice
AdjustmentEntry.signed     // net of all open corrections

The invariant, for any tenant at any point:

Tenant.Allowance
  = Tenant.HeldByReservation
  + Tenant.Available
  + Tenant.OverageBilled
  + AdjustmentEntry.signed

No credit appears or disappears without a paired entry. A reservation hold debits Available and credits HeldByReservation. A capture debits HeldByReservation and credits AccruedRevenue. A release debits HeldByReservation and credits Available back. A correction creates entries that return the system to balance.

The practical use: run this as a scheduled probe.

If money_movement_residual != 0 for any tenant, something is wrong — a duplicate capture, a lost release, a missing adjustment. Catching it continuously is far better than discovering the discrepancy at a quarterly close when the trail is cold.
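
A minimal sketch of such a probe follows. It assumes ledger rows shaped like the LedgerEntry table in the data model and checks the simplest double-entry form of the invariant: debits equal credits per tenant. The names ProbeEntry, moneyMovementResidual, and unbalancedTenants, and the use of integer micro-dollars, are illustrative assumptions rather than part of the original design.

```typescript
// Hypothetical in-memory shape mirroring the LedgerEntry columns the probe needs.
interface ProbeEntry {
  tenantId: string;
  direction: "debit" | "credit";
  amountMicros: number; // integer micro-dollars (assumption, avoids float drift)
}

// Debits minus credits per tenant. In a balanced double-entry ledger
// every tenant's residual is exactly 0.
function moneyMovementResidual(entries: ProbeEntry[]): { [tenantId: string]: number } {
  const residual: { [tenantId: string]: number } = {};
  for (const e of entries) {
    const signed = e.direction === "debit" ? e.amountMicros : -e.amountMicros;
    residual[e.tenantId] = (residual[e.tenantId] || 0) + signed;
  }
  return residual;
}

// Tenants whose entries no longer net to zero; these are what the probe alerts on.
function unbalancedTenants(entries: ProbeEntry[]): string[] {
  const residual = moneyMovementResidual(entries);
  return Object.keys(residual).filter((tenantId) => residual[tenantId] !== 0);
}

const probeSample: ProbeEntry[] = [
  // Reservation hold for t1: paired, nets to zero.
  { tenantId: "t1", direction: "debit", amountMicros: 2000 },
  { tenantId: "t1", direction: "credit", amountMicros: 2000 },
  // Lost release for t2: a debit with no matching credit.
  { tenantId: "t2", direction: "debit", amountMicros: 500 },
];

const flaggedTenants = unbalancedTenants(probeSample);
console.log(flaggedTenants);
```

In production the same aggregation would more likely be a scheduled GROUP BY query over the ledger table than an in-memory loop.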

The Data Model

The five core entities, their key fields, and the constraints that enforce correctness. This is a minimal production-oriented model — not exhaustive, but containing the constraints that matter.

UsageEvent — append-only execution record

Column	Type	Constraint / Note
id	UUID	PK
idempotency_key	VARCHAR	UNIQUE — hash(operationId, providerCallId, attemptNumber)
tenant_id	UUID	NOT NULL, FK → Tenant
operation_id	UUID	NOT NULL, FK → UsageOperation
provider_call_id	VARCHAR	NOT NULL — provider’s own request ID
requested_alias	VARCHAR	NOT NULL — for debugging only, never billed
resolved_provider	VARCHAR	NOT NULL — what actually ran
resolved_model	VARCHAR	NOT NULL — what actually ran
key_source	ENUM	NOT NULL — ‘platform’ | ‘customer’
input_tokens	INTEGER	NOT NULL
output_tokens	INTEGER	NOT NULL
cached_input_tokens	INTEGER	NOT NULL DEFAULT 0
tool_call_count	INTEGER	NOT NULL DEFAULT 0
pricing_version	VARCHAR	NOT NULL — catalog version active at record time
recorded_at	TIMESTAMPTZ	NOT NULL

Immutability constraint: no UPDATE or DELETE permitted. Append-only. Enforced at the application layer and, where the database supports it, via row-level security or a trigger.

RatedUsageLine — financial interpretation of one UsageEvent

Column	Type	Constraint / Note
id	UUID	PK
usage_event_id	UUID	NOT NULL, FK → UsageEvent — backward traceability anchor
rating_version	VARCHAR	NOT NULL — which pricing ruleset produced this line
line_type	ENUM	NOT NULL — ‘platform_cost’ | ‘included’ | ‘overage’ | ‘customer_billable’
unit_count	DECIMAL	NOT NULL
unit_price	DECIMAL	NOT NULL
amount	DECIMAL	NOT NULL
currency	CHAR(3)	NOT NULL
is_counterfactual	BOOLEAN	NOT NULL DEFAULT FALSE — shadow / what-if lines excluded from billing
superseded_at	TIMESTAMPTZ	NULL — set when re-rating produces a replacement
created_at	TIMESTAMPTZ	NOT NULL

Unique constraint: (usage_event_id, rating_version, line_type) — prevents duplicate lines on replay. The first write succeeds; replays are no-ops.

LedgerEntry — double-entry record of every financial movement

Column	Type	Constraint / Note
id	UUID	PK
tenant_id	UUID	NOT NULL, FK → Tenant
account	ENUM	NOT NULL — ‘allowance’ | ‘held’ | ‘available’ | ‘overage_billed’ | ‘adjustment’
direction	ENUM	NOT NULL — ‘debit’ | ‘credit’
amount	DECIMAL	NOT NULL
source_type	ENUM	NOT NULL — ‘reservation’ | ‘capture’ | ‘release’ | ‘rating’ | ‘adjustment’
source_id	UUID	NOT NULL — FK to BudgetReservation, RatedUsageLine, or AdjustmentEntry
created_at	TIMESTAMPTZ	NOT NULL

Immutability constraint: append-only. No UPDATE or DELETE. The invariant probe queries this table.

BudgetReservation — state machine for pre-execution budget holds

Column	Type	Constraint / Note
id	UUID	PK
idempotency_key	VARCHAR	UNIQUE — prevents duplicate reservations on retry
tenant_id	UUID	NOT NULL, FK → Tenant
operation_id	UUID	NOT NULL, FK → UsageOperation
estimated_amount	DECIMAL	NOT NULL
captured_amount	DECIMAL	NULL — set on capture
released_amount	DECIMAL	NULL — set on release
state	ENUM	NOT NULL — ‘requested’ | ‘reserved’ | ‘partially_captured’ | ‘captured’ | ‘released’ | ‘overrun’ | ‘expired’
expires_at	TIMESTAMPTZ	NOT NULL — TTL for long-running operations
created_at	TIMESTAMPTZ	NOT NULL
updated_at	TIMESTAMPTZ	NOT NULL

AdjustmentEntry — forward-only correction record

Column	Type	Constraint / Note
id	UUID	PK
tenant_id	UUID	NOT NULL, FK → Tenant
drift_entry_id	UUID	NULL, FK → DriftEntry — links to reconciliation source
reason_code	ENUM	NOT NULL — ‘late_event_after_period_close’ | ‘provider_invoice_delta’ | ‘pricing_correction’ | ‘classification_correction’ | ‘manual_override’
signed_amount	DECIMAL	NOT NULL — positive = credit to tenant, negative = charge
period_reference	VARCHAR	NOT NULL — original period corrected, e.g. “2025-03”
applied_to_period	VARCHAR	NOT NULL — period entry appears on, e.g. “2025-04”
approver_id	UUID	NULL — required for manual_override
created_at	TIMESTAMPTZ	NOT NULL

Forward-only constraint: applied_to_period must be ≥ the current open period. Corrections never enter closed periods.
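
The forward-only rule and the approver requirement can be checked before an AdjustmentEntry is written. The sketch below assumes periods are zero-padded "YYYY-MM" strings, as in the examples in the table; AdjustmentDraft and validateAdjustment are hypothetical names used only for illustration.

```typescript
// Hypothetical draft shape; field names follow the AdjustmentEntry table.
interface AdjustmentDraft {
  reasonCode: string;
  periodReference: string; // period being corrected, "YYYY-MM"
  appliedToPeriod: string; // period the entry lands on, "YYYY-MM"
  approverId: string | null;
}

const PERIOD_FORMAT = /^\d{4}-\d{2}$/;

// Returns a list of violations; an empty list means the draft may be written.
function validateAdjustment(draft: AdjustmentDraft, currentOpenPeriod: string): string[] {
  const errors: string[] = [];
  if (!PERIOD_FORMAT.test(draft.appliedToPeriod) || !PERIOD_FORMAT.test(currentOpenPeriod)) {
    errors.push("periods must be formatted as YYYY-MM");
    return errors;
  }
  // Zero-padded "YYYY-MM" strings sort lexicographically in calendar order.
  if (draft.appliedToPeriod < currentOpenPeriod) {
    errors.push("applied_to_period is in a closed period");
  }
  if (draft.reasonCode === "manual_override" && draft.approverId === null) {
    errors.push("manual_override requires approver_id");
  }
  return errors;
}

const lateEventDraft: AdjustmentDraft = {
  reasonCode: "late_event_after_period_close",
  periodReference: "2025-03",
  appliedToPeriod: "2025-04",
  approverId: null,
};
console.log(validateAdjustment(lateEventDraft, "2025-04"));
```

A database CHECK constraint or trigger could enforce the same rule closer to the data; the application-level check is shown only because it is easier to read.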

A Full Ledger Flow Example

A concrete trace of what the system does for a single operation, from reservation to billing aggregate. Numbers are illustrative.

Scenario: a tenant on a Pro plan (100,000 tokens included, $0.002/1K overage) sends a request. They have already used 99,700 tokens, so 300 remain in the allowance. The request uses a research agent that makes two provider calls totalling 800 tokens.

Step 1 — Reservation hold (hot path)

Before any provider call, the system creates a BudgetReservation with estimated_amount = $0.002. State: requested → reserved. Paired ledger entries:

DEBIT Tenant.Available $0.002
CREDIT Tenant.HeldByReservation $0.002

Step 2 — Provider call 1 (hot path)

First provider call: 350 input + 150 output tokens on gpt-4o. Provider response metadata: resolvedModel: "gpt-4o", providerCallId: "prov_abc123". UsageEvent written (immutable, append-only):

id:               evt_001
idempotency_key:  hash(op_xyz, prov_abc123, attempt_1)
operation_id:     op_xyz
resolved_model:   "gpt-4o"
requested_alias:  "gpt-4o"
key_source:       "platform"
input_tokens:     350
output_tokens:    150
pricing_version:  "v2025-04"

Step 3 — Provider call 2 (hot path)

Synthesis step: 200 input + 100 output tokens. New UsageEvent written with id: evt_002, same operation_id: op_xyz.

Step 4 — Reservation capture and release (hot path end)

Actual usage: 800 tokens total. captured_amount = $0.0016, released_amount = $0.0004 (unused estimate returned). State: reserved → captured. Paired ledger entries:

DEBIT Tenant.HeldByReservation $0.002 (full hold released)
CREDIT Tenant.Available $0.0004 (unused returned)
CREDIT AccruedRevenue $0.0016 (captured)

Step 5 — Rating (cold path, async)

The rating worker processes both events. Allowance state at rating time: 300 tokens remaining (99,700 of 100,000 used). 800 total tokens: 300 included (consuming the last of the allowance), 500 overage. Four RatedUsageLine records produced:

line_type	unit_count	unit_price	amount	source
platform_cost	800	$0.000002	$0.0016	evt_001 + evt_002
included	300	$0.00	$0.00	allowance draw-down
overage	500	$0.002/1K	$0.001	evt_001 + evt_002
customer_billable	500	$0.002/1K	$0.001	overage line
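
The included and overage rows in the table come from a simple split against the remaining allowance. The sketch below reproduces that arithmetic with integer micro-dollars so the result is exact; splitTokens and TokenSplit are hypothetical names, and the micro-dollar unit is an assumption.

```typescript
interface TokenSplit {
  includedTokens: number;
  overageTokens: number;
  overageMicros: number; // integer micro-dollars (assumption)
}

// Draw down the remaining allowance first; whatever is left over is overage.
function splitTokens(
  totalTokens: number,
  remainingAllowanceTokens: number,
  overageMicrosPer1K: number,
): TokenSplit {
  const includedTokens = Math.min(totalTokens, Math.max(remainingAllowanceTokens, 0));
  const overageTokens = totalTokens - includedTokens;
  const overageMicros = Math.round((overageTokens * overageMicrosPer1K) / 1000);
  return { includedTokens, overageTokens, overageMicros };
}

// $0.002 per 1K tokens is 2000 micro-dollars per 1K tokens.
const exampleSplit = splitTokens(800, 300, 2000);
console.log(exampleSplit);
```

With 800 tokens used and 300 left in the allowance, this gives 300 included tokens, 500 overage tokens, and 1000 micro-dollars, which is the $0.001 shown in the table.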

Step 6 — Ledger entries for overage

DEBIT Tenant.Available $0.001 (effective overage)
CREDIT Tenant.OverageBilled $0.001

Step 7 — Billing aggregate and sync (cold path)

The billing aggregate is synced to the billing provider (e.g., Stripe Meters) with a stable idempotency key:

const idempotencyKey = hash(
  aggregate.id, aggregate.tenantId,
  aggregate.period, "overage_tokens"
);
await billingProvider.sendMeterEvent({
  customerId: aggregate.tenantId,
  meterKey:   "overage_tokens",
  value:      500,
  idempotencyKey,
});

Stripe meter event summaries are eventually consistent because meter events are aggregated asynchronously. The internal ledger is the source of truth; the external summary is a projection.

Step 8 — Later adjustment (if needed)

If the provider invoice arrives showing 810 tokens (not 800), the reconciliation engine creates a DriftEntry of type TOKEN_COUNT_DRIFT. If above the configured threshold, a human approves an AdjustmentEntry:

AdjustmentEntry:
  reason_code:       "provider_invoice_delta"
  signed_amount:     -$0.00002
  period_reference:  "2025-04"
  applied_to_period: "2025-04"
  drift_entry_id:    drift_xyz

The UsageEvent and RatedUsageLine records are not modified. The correction is its own document with full lineage.

BudgetReservation: Why a Cap Check Isn’t Enough

The simplest approach to budget control is a preflight check: read the tenant’s current spend, compare to their limit, allow or deny. This works until you have concurrency. Two requests arrive simultaneously at $9.80 of a $10.00 cap. Both read $9.80. Both pass the check. Both execute. Spend lands at $10.60. The check was useless.

The correct primitive is a reservation: before the provider call, hold the estimated budget so the available balance drops immediately. Other concurrent requests see the held balance and cannot also claim it.

[Request A arrives]
  Available: $10.00  →  Hold $0.50  →  Available: $9.50, Held: $0.50
[Request B arrives, concurrent]
  Available: $9.50   ← sees the hold
  →  Hold $0.80  →  Available: $8.70, Held: $1.30
[Request A completes, actual $0.43]
  →  Capture $0.43, Release $0.07 unused
  →  Available: $8.77, Held: $0.80, Spent: $0.43
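
The same trace can be written as a small hold and settle sketch. It is single-threaded and in-memory, so the hold is trivially atomic; in a real store the check and the decrement would need to be one atomic conditional update. TenantBudget, holdBudget, and settleReservation are hypothetical names, and amounts are integer cents.

```typescript
interface TenantBudget {
  availableCents: number;
  heldCents: number;
  spentCents: number;
}

// Hold the estimate up front. Returns false when the balance cannot cover it.
function holdBudget(b: TenantBudget, estimateCents: number): boolean {
  if (estimateCents > b.availableCents) return false;
  b.availableCents -= estimateCents;
  b.heldCents += estimateCents;
  return true;
}

// Capture actual usage and release the unused part of the hold.
// Returns the overrun (actual beyond the hold), which this sketch only reports.
function settleReservation(b: TenantBudget, reservedCents: number, actualCents: number): number {
  const captured = Math.min(actualCents, reservedCents);
  b.heldCents -= reservedCents;
  b.spentCents += captured;
  b.availableCents += reservedCents - captured;
  return actualCents - captured;
}

const budget: TenantBudget = { availableCents: 1000, heldCents: 0, spentCents: 0 };
const heldA = holdBudget(budget, 50); // Request A holds $0.50
const heldB = holdBudget(budget, 80); // Request B sees the reduced balance
const overrunA = settleReservation(budget, 50, 43); // A completes at $0.43
console.log(budget, heldA, heldB, overrunA);
```

After these calls the budget reads $8.77 available, $0.80 held, and $0.43 spent, matching the trace, and the three balances still sum to the original $10.00.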

The diagram below shows the BudgetReservation state machine and how each transition creates paired ledger entries:

BudgetReservation is a state machine. Hold → Reserved → Captured / Released / Overrun. Each transition is a concrete financial event that participates in the money movement invariant.

States and transitions:

requested → reserved: estimate held, available reduced immediately
reserved → partially_captured: interim capture in a multi-step workflow
partially_captured → captured: final capture, unused released
reserved → released: operation aborted, full hold returned
reserved/captured → overrun: actual usage exceeded reserved (explicit, typed)
reserved → expired: TTL elapsed
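
One way to keep these transitions honest is an explicit table that rejects anything not listed. This is a sketch, not the article's implementation: ALLOWED_TRANSITIONS and transitionReservation are hypothetical names, and the direct reserved → captured edge is taken from the ledger flow example earlier rather than from the list of transitions.

```typescript
type ReservationState =
  | "requested"
  | "reserved"
  | "partially_captured"
  | "captured"
  | "released"
  | "overrun"
  | "expired";

const ALLOWED_TRANSITIONS: Record<ReservationState, ReservationState[]> = {
  requested: ["reserved"],
  // reserved -> captured is the single-capture case used in the flow example.
  reserved: ["partially_captured", "captured", "released", "overrun", "expired"],
  partially_captured: ["captured"],
  captured: ["overrun"],
  released: [],
  overrun: [],
  expired: [],
};

// Returns the new state, or throws when the move is not in the table.
function transitionReservation(from: ReservationState, to: ReservationState): ReservationState {
  if (ALLOWED_TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`illegal reservation transition: ${from} -> ${to}`);
  }
  return to;
}

let reservationState: ReservationState = "requested";
reservationState = transitionReservation(reservationState, "reserved");
reservationState = transitionReservation(reservationState, "captured");
console.log(reservationState);
```

In a full system each accepted transition would also write its paired ledger entries in the same transaction; that part is omitted here.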

The overrun state makes this useful for agentic workflows. When a workflow exceeds its initial cost estimate, the reservation records that fact explicitly — not a silent discrepancy discovered at invoice time, but a typed financial event in the reservation lifecycle.

The harder case: multi-step workflows. A single user action spawning thirty provider calls needs a reservation at the operation level, not the individual-call level. Each provider call captures partial usage against the operation’s reservation. Per-call cap checks cannot enforce operation-level budgets.

Fail-open vs fail-closed. When the reservation-check system is unavailable, fail-closed for high-cost model tiers (margin protection) and fail-open for low-cost requests (availability). The threshold for “high-cost” should be product-specific configuration, not a hardcoded constant.
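
As a sketch of that policy, assuming the only input is an estimated cost and a per-product threshold (OutagePolicy and decideDuringOutage are hypothetical names):

```typescript
type OutageDecision = "allow" | "deny";

interface OutagePolicy {
  highCostThresholdCents: number; // product-specific configuration, not a constant
}

// Used only when the reservation check itself cannot be reached.
function decideDuringOutage(estimatedCostCents: number, policy: OutagePolicy): OutageDecision {
  // Fail closed for expensive requests (margin protection),
  // fail open for cheap ones (availability).
  return estimatedCostCents >= policy.highCostThresholdCents ? "deny" : "allow";
}

const outagePolicy: OutagePolicy = { highCostThresholdCents: 25 };
console.log(decideDuringOutage(3, outagePolicy), decideDuringOutage(40, outagePolicy));
```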

Rating Engine: The Boundary Between Fact and Meaning

The Rating Engine turns immutable execution facts into financial meaning. It’s also the component most likely to be designed wrong — because the wrong design feels almost right.

The wrong design: enrich the event. Take the UsageEvent, look up the current price, compute costs, write them back as fields on the event. This feels efficient. When pricing changes, you either rewrite historical events or live with mixed pricing in the same table with no clean boundary.

The correct design: leave the event immutable. Produce a separate RatedUsageLine that has its own version, references the source event by ID, and can be superseded when pricing changes.

// Rating is a pure function.
// Same inputs → same outputs. Always.
function rateUsageEvent(
  event:   UsageEvent,
  catalog: PricingCatalog,
  plan:    TenantPlan,
): RatedUsageLine[] {
  const entry = catalog.lookup(
    event.resolvedProvider,
    event.resolvedModel,
    event.pricingVersion,
  );
  const platformCostLine = computePlatformCost(event, entry);
  const { includedLine, overageLine } = splitAgainstAllowance(event, plan);
  const billableLine = computeCustomerBillable(overageLine, entry, plan);
  return [platformCostLine, includedLine, overageLine, billableLine]
    .filter(Boolean);
}

One UsageEvent produces up to four RatedUsageLine records: platform cost, included consumption, overage, and customer-billable amount. These are four distinct financial facts that happen to come from the same execution event. Keeping them separate is what allows the ledger to answer “what did this cost the platform?” and “what is the customer billed for?” independently.

The diagram below shows how one event fans out into multiple rated lines:

One immutable UsageEvent expands into multiple financial lines through the Rating Engine. Platform cost, included allowance consumption, overage, and customer-billable are separate financial artifacts.

Because rating is a pure function, three things become possible:

Shadow rating. Before cutting over to new pricing, run new rating rules in shadow mode against live events. Shadow-mode RatedUsageLine records are flagged (is_counterfactual = true) and excluded from billing. Diff the shadow output against production output for hours or days before cutover. If the diff is clean, cut over with confidence.

Re-rating. When pricing changes, produce new RatedUsageLine versions from the same old events. Old versions are marked superseded with a timestamp. Historical billing history stays intact alongside the new interpretation, clearly versioned.

Counterfactual invoices. Because rating is pure, you can answer “what would last month have cost on the Enterprise tier?” Mark the result counterfactual, don’t persist it as financial truth, and use it for pricing conversations.

Trade-off — aggregate sync vs raw-event sync. Aggregating by (customerId, meterKey, period) before syncing gives a stable idempotency unit and lower API volume. The cost: external billing records are less granular. OpenMeter is designed around this pattern and handles deduplication at the event level before aggregation; m3ter takes a similar approach with ingest-time deduplication and batch aggregation.

Cost Risk Engine: The Margin Defender

Customer spending caps protect customers from unexpected charges. They do not protect platform margin. These are different problems, and conflating them is a design error that doesn’t become visible until a bad quarter.

When a fallback routes a request from a cheap model to an expensive one, the customer is still within their monthly cap. The platform’s margin just went negative on that call. The customer cap didn’t fire, because it isn’t designed to fire here.

When a customer uses their own API key, the platform’s provider cost is zero. But the platform is still paying for compute, retrieval, moderation, and gateway overhead. If that overhead exceeds service-fee revenue, the platform loses money on every BYOK call.

When an agentic workflow spawns 200 provider calls instead of the expected 5, the customer may still be within their budget cap. The platform’s cost structure for that tenant has changed dramatically.

The Cost Risk Engine answers a different question than the customer cap: not “is the customer overspending?” but “is the platform losing money on this tenant, and why?”

The diagram below shows the engine’s margin-risk signals and the circuit-breaker states it manages:

The Cost Risk Engine reads margin-risk signals — fallback cost delta, workflow fan-out, BYOK overhead, systematic drift, hot-tenant pressure — and operates a circuit breaker with four states: HEALTHY, ELEVATED, DEGRADED, CRITICAL.

The engine runs in three modes:

Preflight (hot path, before the provider call): checks estimated cost against the reservation, plus current tenant risk signals. Can block, degrade routing to cheaper models, or require manual review. A practical target is under 50ms end-to-end for this check, though the right threshold is specific to your product’s latency budget. If the engine is unavailable, fail-closed for premium model tiers and fail-open for low-cost requests.

Midflight (cold path, after rating): analyses rated lines for margin per operation, fan-out ratio, and BYOK overhead. Emits a typed CostRiskVerdict with a closed-enum reason code. Does not retroactively block completed operations — it shapes future routing decisions.

Postflight (settlement layer, after reconciliation): detects systematic patterns across billing periods — a pricing or contract risk, not an operational error.

// R-NEG-MARGIN-FALLBACK
// Fires when the resolved model costs more than what the customer is billed.
if (marginPerCall < 0 && fallbackCostDelta > 0) {
  return {
    decision: 'WARN',
    reasonCode: 'NEGATIVE_MARGIN_FALLBACK',
    severity: 'warning',
  };
}
// R-RUNAWAY-AGENT
// fanOutRatio measured against p95 baseline for the feature;
// overrunRatio > 1.5 is an example threshold; calibrate to your variance.
if (fanOutRatio > p95ForFeature && reservationOverrunRatio > 1.5) {
  return {
    decision: 'BLOCK_CONTINUATION',
    reasonCode: 'RUNAWAY_AGENT_FAN_OUT',
    severity: 'critical',
  };
}

No ML. No probabilistic scoring. A rule fires or it doesn’t. The reason code is always from a closed enum — so support, finance, and on-call engineers can all reason about it without ambiguity.

The Margin Circuit Breaker is the stateful version, per (tenantId, featureKey): HEALTHY → ELEVATED → DEGRADED → CRITICAL, driven by rolling margin averages. At DEGRADED, routing restricts to safe model tiers. At CRITICAL, premium calls require manual finance approval. Recovery uses hysteresis to prevent oscillation from borderline signals.
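
The hysteresis can be sketched as a pure state-stepping function. The thresholds below (10%, 0%, and -10% rolling margin, with a 5-point recovery gap) are invented for illustration and would need calibration against real data; LEVELS, ENTER_BELOW, RECOVERY_GAP, and nextBreakerState are hypothetical names.

```typescript
type BreakerState = "HEALTHY" | "ELEVATED" | "DEGRADED" | "CRITICAL";
const LEVELS: BreakerState[] = ["HEALTHY", "ELEVATED", "DEGRADED", "CRITICAL"];

// Level i is entered when the rolling margin (fraction of revenue)
// falls below ENTER_BELOW[i]. Example values only.
const ENTER_BELOW = [Number.POSITIVE_INFINITY, 0.10, 0.0, -0.10];
// The margin must clear the entry threshold by this much before stepping down.
const RECOVERY_GAP = 0.05;

function levelFor(rollingMargin: number): number {
  let level = 0;
  for (let i = 1; i < ENTER_BELOW.length; i++) {
    if (rollingMargin < ENTER_BELOW[i]) level = i;
  }
  return level;
}

function nextBreakerState(current: BreakerState, rollingMargin: number): BreakerState {
  const cur = LEVELS.indexOf(current);
  const target = levelFor(rollingMargin);
  if (target >= cur) return LEVELS[target]; // escalate immediately, or stay put
  // Recover one level at a time, and only once the margin is clearly above
  // the threshold that put us here. This gap is the hysteresis.
  if (rollingMargin >= ENTER_BELOW[cur] + RECOVERY_GAP) return LEVELS[cur - 1];
  return current;
}

console.log(nextBreakerState("DEGRADED", 0.02), nextBreakerState("DEGRADED", 0.06));
```

A margin of 2% would normally map to ELEVATED, but a tenant already in DEGRADED stays there until the margin clears 5%. That gap is what stops a borderline signal from flapping the breaker.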

The architectural difference this creates: “we noticed negative margin in the quarterly review” versus “the system stopped routing expensive models before the quarter ended.”

Billing Sync and Operational Correctness

The external billing provider is a downstream projection of your internal ledger. It receives confirmed billing data and generates invoices. It is not the authoritative record of what your product did.

This distinction has one critical operational consequence: billing sync must never be on the hot path.

If sync is on the request path and the billing provider is slow or unavailable, your product is slow or unavailable. The transactional outbox pattern breaks this coupling: write the execution fact to your own store and an outbox entry atomically, in a single transaction. Return the response. A background worker drains the outbox to the billing provider asynchronously. If the billing provider is down, your outbox accumulates. When it recovers, the outbox drains. No execution facts are lost.

Idempotency in sync is not optional. At-least-once delivery means the billing provider may receive the same meter event twice. The idempotency key for each sync operation must be stable across all retries:

// Computed from content, not generated at send time
const idempotencyKey = hash(
  aggregate.id, aggregate.tenantId,
  aggregate.period, aggregate.metricKey,
);
await billingProvider.sendMeterEvent({
  customerId:     aggregate.tenantId,
  meterKey:       aggregate.metricKey,
  value:          aggregate.totalBillableUnits,
  idempotencyKey, // same on every retry
});

When retries exhaust, the aggregate enters a dead-letter state. An alert fires. A human reviews and replays. The dead-letter state is an explicit signal — not a silent failure.
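
The retry and dead-letter behaviour can be sketched with an in-memory outbox. Everything here is a simplification: the provider is a synchronous fake that deduplicates on the idempotency key, and FakeBillingProvider, OutboxItem, and drainOnce are hypothetical names rather than any real billing SDK.

```typescript
type SyncStatus = "pending" | "sent" | "dead_letter";

interface OutboxItem {
  idempotencyKey: string;
  value: number;
  attempts: number;
  status: SyncStatus;
}

// Hypothetical stand-in for a billing provider that dedupes on idempotency key.
class FakeBillingProvider {
  failuresToSimulate = 0;
  private accepted: { [key: string]: number } = {};

  send(key: string, value: number): void {
    if (this.failuresToSimulate > 0) {
      this.failuresToSimulate -= 1;
      throw new Error("provider unavailable");
    }
    if (!Object.prototype.hasOwnProperty.call(this.accepted, key)) {
      this.accepted[key] = value; // first delivery wins; replays are ignored
    }
  }

  total(): number {
    let sum = 0;
    for (const key of Object.keys(this.accepted)) sum += this.accepted[key] || 0;
    return sum;
  }
}

// One pass of the background worker over the outbox.
function drainOnce(outbox: OutboxItem[], provider: FakeBillingProvider, maxAttempts: number): void {
  for (const item of outbox) {
    if (item.status !== "pending") continue;
    try {
      provider.send(item.idempotencyKey, item.value);
      item.status = "sent";
    } catch (err) {
      item.attempts += 1;
      // Exhausted retries become an explicit state that alerts a human.
      if (item.attempts >= maxAttempts) item.status = "dead_letter";
    }
  }
}

const billing = new FakeBillingProvider();
billing.failuresToSimulate = 2;
const firstItem: OutboxItem = {
  idempotencyKey: "agg_1|tenant_1|2025-04|overage_tokens",
  value: 500,
  attempts: 0,
  status: "pending",
};
const outbox: OutboxItem[] = [firstItem];
for (let pass = 0; pass < 3; pass++) drainOnce(outbox, billing, 3);
billing.send(firstItem.idempotencyKey, firstItem.value); // duplicate delivery, ignored
console.log(firstItem.status, billing.total());
```

The item is delivered on the third pass, and the duplicate send does not change the provider-side total because the key is identical on every attempt.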

Correctness under high load

Under load, the cold path will fall behind. Rating workers accumulate a queue. Billing aggregates are minutes stale. This is acceptable.

What is not acceptable: dropping a UsageEvent, skipping a LedgerEntry, or silently losing a sync event from the outbox. The invariant is not “no lag.” The invariant is “no dropped events.” Freshness degrades. Correctness is preserved.

Events flow from the Outbox through a Partition Router into per-tenant shards. Hot tenants get isolated shards. Rating Workers consume into Read Models. No dropped events is the invariant. Lag is observable but tolerable.

Partitioning by tenantId keeps one high-volume tenant from corrupting another’s event ordering. But it creates hot-partition risk: a tenant running many concurrent agentic workflows may monopolize a partition.

Hot-tenant detection triggers shadow shard assignment for tenants above a configurable threshold. The sustained lag itself becomes a Cost Risk signal: a tenant whose usage pattern consistently overwhelms their shard may be underpriced relative to the infrastructure they consume.

Replay safety. When a rating worker crashes and restarts, it replays events from the last committed offset. Every rating operation must be idempotent. The mechanism: a unique constraint on (usage_event_id, rating_version, line_type). The first write succeeds; replays are no-ops.
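
The effect of that unique constraint can be shown with an in-memory store keyed on the same three columns. In Postgres the equivalent would typically be an INSERT with ON CONFLICT DO NOTHING; RatedLineStore and insertIfAbsent below are hypothetical names for illustration.

```typescript
interface RatedLine {
  usageEventId: string;
  ratingVersion: string;
  lineType: string;
  amountMicros: number;
}

class RatedLineStore {
  private byKey: { [key: string]: RatedLine } = {};

  // Returns true when the line was written, false when it already existed.
  insertIfAbsent(line: RatedLine): boolean {
    const key = [line.usageEventId, line.ratingVersion, line.lineType].join("|");
    if (Object.prototype.hasOwnProperty.call(this.byKey, key)) return false;
    this.byKey[key] = line;
    return true;
  }

  count(): number {
    return Object.keys(this.byKey).length;
  }
}

const ratedStore = new RatedLineStore();
const overageLine: RatedLine = {
  usageEventId: "evt_001",
  ratingVersion: "v2025-04",
  lineType: "overage",
  amountMicros: 1000,
};
const firstWrite = ratedStore.insertIfAbsent(overageLine);
const replayWrite = ratedStore.insertIfAbsent(overageLine); // worker restarted and replayed
console.log(firstWrite, replayWrite, ratedStore.count());
```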

Replay type	What it does	Safe?
Re-aggregation	Recomputes aggregates from rated lines	Yes — replaces aggregate, lineage reference updates
Re-rating	Runs the Rating Engine with a new ruleset over the same event	Yes — new RatedUsageLine version, old marked superseded
Re-execution	Repeats the actual provider call	Never — this is new spending, not replay

Re-execution is not a billing replay operation. Retrying a failed provider call may be necessary operationally. From the billing system’s perspective, it’s a new event that needs to be handled through the idempotency mechanism, not treated as a safe replay.

Reconciliation: Closing the Settlement Loop

Settlement truth arrives asynchronously and is never complete at arrival. Provider invoices come days later. Billing-provider summaries are eventually consistent. When they arrive, the Reconciliation Engine compares three sources and produces typed DriftEntry records.

Drift is not a binary condition. It has a taxonomy:

Drift type	What it means	Likely cause
TOKEN_COUNT_DRIFT	Provider billed a different token count	Rounding, provider reclassification
PRICING_VERSION_MISMATCH	Same call, different rate applied	Mid-period catalog update
TIMING_DRIFT	Event crossed a billing-period boundary	Late event after period close
CLASSIFICATION_DRIFT	We marked it customer-key; provider billed us	Misrouted credential
MISSING_EVENT	Provider invoice has a line we don’t	Silent fallback, recorder failure
ORPHAN_EVENT	We have an event the provider doesn’t	Rejected retry, provider-side issue
PROVIDER_COST_DRIFT	Our cost estimate diverges from invoice	Cached-token reclassification, batch pricing
MISSING_BILLING_SYNC	Our aggregate never reached the billing provider	DLQ not replayed

The diagram below shows how the three truth layers feed into the reconciliation engine and the append-only adjustment ledger:

Reconciliation compares Execution Truth, Rated Truth, and Settlement Truth. Corrections flow into an append-only Adjustment Ledger. Raw history is never modified.

When a correction is approved, an AdjustmentEntry is created. It never modifies the original UsageEvent or RatedUsageLine. It creates new ledger entries — with a reason code, an approver reference, and a link to the drift entry — that bring the ledger back to balance.

This mirrors the model Stripe uses for credit notes: a credit note reduces the amount owed on an invoice without voiding the invoice itself. The original invoice stands. The correction is its own document. Past financial truth stays past.

Gotcha — late events. When a UsageEvent arrives after the billing period it belongs to has closed, resist the instinct to retroactively update the closed aggregate.

Create an AdjustmentEntry in the current open period with reason code late_event_after_period_close. Retroactive mutation of closed aggregates breaks explainability, replay determinism, and the reconciliation process for the closed period.

Reconciliation as an operational SLO. Per (provider, meterKey, tenantTier), track a rolling currentDriftBurned against a configured acceptedDriftBudget. A team might configure an alert at 80% of drift budget consumed — the exact threshold should be calibrated to the provider’s billing-accuracy characteristics. This turns reconciliation from a post-mortem into an operational signal.
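
A drift-budget check can be as small as a ratio against a configured fraction. The sketch uses the field names currentDriftBurned and acceptedDriftBudget from the text; the 0.8 default mirrors the 80% example, and driftBudgetStatus and the three status labels are assumptions.

```typescript
interface DriftBudget {
  acceptedDriftBudget: number; // e.g. micro-dollars per (provider, meterKey, tenantTier)
  currentDriftBurned: number; // rolling absolute drift observed so far
}

type DriftStatus = "ok" | "alert" | "exhausted";

function driftBudgetStatus(budget: DriftBudget, alertFraction: number = 0.8): DriftStatus {
  // A missing or zero budget is treated as already exhausted.
  if (budget.acceptedDriftBudget <= 0) return "exhausted";
  const burnedFraction = budget.currentDriftBurned / budget.acceptedDriftBudget;
  if (burnedFraction >= 1) return "exhausted";
  if (burnedFraction >= alertFraction) return "alert";
  return "ok";
}

console.log(driftBudgetStatus({ acceptedDriftBudget: 1000, currentDriftBurned: 850 }));
```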

Explainability: The Backward Trace

Support tickets about billing are inevitable. The question is whether answering them takes thirty seconds or three engineering-hours.

Thirty seconds: support queries the Explainability API with an invoice-line identifier. The API traverses the lineage backward:

Invoice Line
  → BillingAggregate
    → RatedUsageLines
      → LedgerEntries
        → UsageEvent
          → ProviderCall
            → UsageOperation

Each arrow is a foreign key in the data model. The traversal requires no recomputation, no reconstructed logic, no raw table queries. The lineage is the schema.

The diagram below illustrates the full backward trace from a customer-visible invoice line to the root provider call:

Explainability traverses backward from a customer-visible charge through aggregation, rating, ledger entries, usage events, and provider calls to the root operation. The traversal is a data model property, not a special feature.

This backward traversal is an emergent property of the architecture. Because every RatedUsageLine references its source UsageEvent, and every UsageEvent references its ProviderCall, the lineage is always intact — through re-rating, through reconciliation adjustments, through pricing changes.

Time-travel is also possible. Because the ledger is append-only and every record has created_at and superseded_at timestamps, the API can answer “what did this invoice line look like before the adjustment?” with a date filter — not a special feature, just a range condition on an append-only store.
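
That range condition can be sketched as a filter over versioned lines. Timestamps are epoch milliseconds for simplicity; VersionedLine and linesAsOf are hypothetical names, and the amounts are invented.

```typescript
interface VersionedLine {
  id: string;
  usageEventId: string;
  amountMicros: number;
  createdAt: number; // epoch milliseconds
  supersededAt: number | null; // null while the line is current
}

// Lines that were the current interpretation at the given instant.
function linesAsOf(lines: VersionedLine[], asOfMs: number): VersionedLine[] {
  return lines.filter(
    (l) => l.createdAt <= asOfMs && (l.supersededAt === null || l.supersededAt > asOfMs),
  );
}

const reRatedAt = Date.parse("2025-04-15T00:00:00Z");
const versions: VersionedLine[] = [
  {
    id: "line_v1",
    usageEventId: "evt_001",
    amountMicros: 1000,
    createdAt: Date.parse("2025-04-01T00:00:00Z"),
    supersededAt: reRatedAt,
  },
  { id: "line_v2", usageEventId: "evt_001", amountMicros: 1200, createdAt: reRatedAt, supersededAt: null },
];

const beforeReRating = linesAsOf(versions, Date.parse("2025-04-10T00:00:00Z"));
const afterReRating = linesAsOf(versions, Date.parse("2025-04-20T00:00:00Z"));
console.log(beforeReRating, afterReRating);
```

Because the superseded version stops being visible at the same instant its replacement becomes visible, every point in time resolves to exactly one interpretation of the event.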

The Gotchas

Bill by resolved model, not requested alias. If routing or fallback changes the model that actually ran, pricing must be based on resolved_model. requested_alias is for debugging only.

BYOK usage is tracked, not invisible. The UsageEvent still gets written with key_source: 'customer' and platform overhead rated explicitly. If you skip recording BYOK events, you have no way to know whether BYOK customers are profitable, break-even, or costing you money.

A retry after provider execution ≠ duplicate usage recording. The idempotency key prevents recording the same provider execution twice. It does not prevent the provider from executing again on a full end-to-end retry. These are different failure modes: one is a billing-integrity problem (duplicate recording), the other is an operational problem (double spend) requiring provider-level caching or explicit retry policies.

Closed periods are corrected forward, never modified. Retroactive modification breaks replay determinism, reconciliation accuracy, and the audit trail for the closed period.

Two fields for two costs. Provider cost and customer-billable amount are always stored separately. They can be equal — that’s coincidence, not design. One field for both makes it impossible to track platform margin, apply plan subsidies, or audit separately what the platform pays versus what customers owe.

Reconciliation drift is not a bug. Provider-side rounding, mid-period pricing changes, timing differences, and batch-discount reclassification all produce drift within normal ranges. The taxonomy of drift types is more useful than an aggregate drift percentage.

What to Build First vs What to Defer

This architecture can look over-engineered if you’re in the early stages. Not all of it needs to exist on day one.

MVP: build these first

Immutable UsageEvent store. The append-only execution record is the foundation. Once you’ve accumulated mutable records, correcting them is painful.
Resolved-model tracking. resolved_model and requested_alias as separate fields, from provider response metadata. Cheap early, expensive late.
Separate provider cost and customer-billable. Two fields from the start. Even if equal today, the separation is what allows margin analysis later.
Async billing sync with idempotency keys. Transactional outbox from day one. Billing-provider downtime should not affect request handling.
Basic BudgetReservation state machine. Hold / capture / release for cap enforcement. Preflight cap checks without reservations break under concurrency.

Defer to a mature system

Full Cost Risk Engine with circuit-breaker states. Initially, a simple alert on negative-margin calls is enough. The circuit breaker comes later, when you have data to calibrate thresholds.
Drift SLOs and drift-budget tracking. Useful once you have historical drift data to calibrate against.
Shadow rating. Essential before major pricing changes, but not needed until you have pricing changes to shadow-test.
Counterfactual invoices. A sales tool for enterprise pricing conversations. Build when you’re having those conversations.
Hot-tenant shard isolation. Needed when you have tenants measurably disrupting partition throughput.
Advanced reconciliation workflow (approval queues, escalation paths, finance integration). A manual review process with a shared spreadsheet is fine until volume justifies automation.

The invariant — that execution facts, rating, and settlement are separate concerns — holds from day one even in an MVP. The MVP is simpler, not structurally different.

What Production Looks Like

Component	Production version
Execution fact store	Append-only store (Postgres with insert-only policy, event store, or similar)
Cold-path event bus	Partitioned message queue (Kafka, Kinesis, Pub/Sub) with consumer groups
Ledger store	Append-only relational table — no UPDATE or DELETE operations
Billing sync	Stripe Meters API, OpenMeter, m3ter, or a custom metering backend
Read model for dashboards	Materialized view or OLAP projection, refreshed async from the ledger
Money-movement invariant probe	Scheduled query against the live ledger, alerting on non-zero residual
Reconciliation engine	Scheduled job with access to provider invoice exports and billing-provider summaries

What doesn’t change across those substitutions: the structural separation between execution truth, rated truth, and settlement truth; the money-movement invariant; the append-only correction model; the rating-as-pure-function property; and the reservation state machine. These are design decisions, not technology choices.
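
The money-movement invariant probe from the table can be sketched as a single grouped query. This is an illustration under assumptions: the `ledger_entry` table, its columns, and the `Tenant.Reserved` account name are hypothetical (only `Tenant.Allowance` comes from the account list earlier in this article), and SQLite stands in for whatever store holds the ledger.

```python
import sqlite3

# Any transaction whose legs do not net to zero violates the invariant.
PROBE_SQL = """
SELECT txn_id, SUM(amount_micros) AS residual
FROM ledger_entry
GROUP BY txn_id
HAVING SUM(amount_micros) != 0
ORDER BY txn_id
"""

def init_ledger(conn):
    conn.execute(
        "CREATE TABLE ledger_entry ("
        "txn_id TEXT NOT NULL, account TEXT NOT NULL, "
        "amount_micros INTEGER NOT NULL)"
    )

def post(conn, txn_id, legs):
    """Append all legs of one movement in a single transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO ledger_entry VALUES (?, ?, ?)",
            [(txn_id, account, amount) for account, amount in legs],
        )

def find_unbalanced(conn):
    """Return (txn_id, residual) rows; an empty list means the ledger balances."""
    return conn.execute(PROBE_SQL).fetchall()
```

A scheduled job would run `find_unbalanced` and alert on any non-empty result. Amounts are stored as integer micro-units here so that sums are exact.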

Practical Takeaways

The decisions that matter, in order of “most likely to be skipped and later regretted”:

Store resolved model, not requested alias. Fallback changes what ran. Billing must reflect what ran.
Keep `UsageEvent` records immutable. Once written, the execution fact is final. Corrections go into `AdjustmentEntry` records.
Provider cost and customer-billable amount are always separate fields. They can be equal. Equality is coincidental, not structural.
Rating is a separate step, not enrichment. Rate after recording. Keep the pure function. Make it versioned.
BYOK usage is tracked usage. Platform overhead on BYOK calls is real cost. Track it, even when provider cost is zero.
Budget control needs a reservation state machine. Preflight cap checks are not safe under concurrency or multi-step workflows.
Billing sync is off the hot path. Transactional outbox. Async drain. The billing provider’s availability is not your product’s availability.
Closed periods are corrected forward. Late events and reconciliation corrections go into `AdjustmentEntry` records in the current period.
Every customer-facing charge is traceable backward. If support can’t answer “why $X?” in 30 seconds, you have a forensics problem.
Margin risk and customer cap are different concerns. Monitor them separately. One Cost Risk Engine failure mode is noise on the customer-cap dashboard. A very different failure mode is systematic negative margin that only appears in the quarterly close.
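
Several of the takeaways above are really statements about the shape of one record. A rough sketch of that shape follows; the field names beyond `resolved_model` and `requested_alias` are assumptions, as are the example model names and the `try_mutate` helper.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)  # frozen: attribute assignment after creation raises
class UsageEvent:
    event_id: str
    requested_alias: str           # what the caller asked for
    resolved_model: str            # what actually ran, after any fallback
    byok: bool                     # customer supplied their own provider key
    provider_cost_micros: int      # what the provider charges the platform
    platform_overhead_micros: int  # platform cost, non-zero even on BYOK calls
    billable_micros: int           # what the customer is charged

def try_mutate(event):
    """Hypothetical helper: returns False because the record is immutable."""
    try:
        event.billable_micros = 0
    except FrozenInstanceError:
        return False
    return True

# A call that asked for a cheap alias but fell back to a larger model.
fallback_call = UsageEvent(
    event_id="evt-1", requested_alias="fast", resolved_model="model-large",
    byok=False, provider_cost_micros=9_000, platform_overhead_micros=400,
    billable_micros=13_500,
)

# A BYOK call: provider cost is zero for the platform, overhead is not.
byok_call = UsageEvent(
    event_id="evt-2", requested_alias="fast", resolved_model="model-small",
    byok=True, provider_cost_micros=0, platform_overhead_micros=400,
    billable_micros=600,
)
```

Keeping cost and billable amount as two fields costs nothing on day one, and it is what later makes per-call margin a simple subtraction instead of a data migration.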

AI billing done right is not a token logger with a billing API bolted on. It is a reservation-aware ledger where execution, rating, and settlement are first-class concerns with different SLAs, different stores, and different consumers — and where correctness is a runnable invariant, not an assumption.

Frequently Asked Questions

What is a reservation-aware AI usage ledger?

It’s a billing architecture where every AI provider call is recorded as an immutable execution event, interpreted financially by a separate versioned rating step, and settled against external invoices through append-only corrections. “Reservation-aware” means budget is held before each call (and captured or released after), so concurrent requests can’t overspend a cap.

Why isn’t a preflight cap check enough for budget control?

Because it breaks under concurrency. If two requests both read the same remaining balance before either completes, both pass the check and both execute, pushing spend past the cap. A reservation holds the estimated budget immediately, so other concurrent requests see the reduced available balance and cannot also claim it.
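
As a rough illustration, here is an in-process hold / capture / release sketch. It assumes a single process and uses a `threading.Lock` to make the check-and-hold step atomic; a real deployment would more likely rely on a database row lock or a conditional update. The class and method names are hypothetical, and overage handling (actual usage above the estimate) is left out.

```python
import threading

class BudgetLedger:
    """Tracks a cap, outstanding holds, and captured spend, all in micro-units."""

    def __init__(self, cap_micros):
        self._lock = threading.Lock()
        self._cap = cap_micros
        self._held = {}      # reservation_id -> held estimate
        self._captured = 0   # actual usage already captured

    def _available(self):
        # Caller must hold self._lock.
        return self._cap - self._captured - sum(self._held.values())

    def available(self):
        with self._lock:
            return self._available()

    def hold(self, reservation_id, estimate_micros):
        """Check and reserve in one atomic step; False means the cap would be exceeded."""
        with self._lock:
            if estimate_micros > self._available():
                return False
            self._held[reservation_id] = estimate_micros
            return True

    def capture(self, reservation_id, actual_micros):
        """Replace the held estimate with actual usage once the call completes."""
        with self._lock:
            self._held.pop(reservation_id)  # KeyError if there is no such hold
            self._captured += actual_micros

    def release(self, reservation_id):
        """Return the full hold, for example when the provider call failed."""
        with self._lock:
            self._held.pop(reservation_id)
```

With a cap of 1,000 and a hold of 600 in flight, a second concurrent request for 600 is refused immediately, because it sees 400 available rather than the stale 1,000 a plain balance read would have returned.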

How does separating rating from execution help when pricing changes?

Because the execution record stays immutable and rating is a pure function, you can produce new RatedUsageLine versions from the same old events when pricing changes — marking old versions superseded rather than rewriting history. The same property enables shadow rating before cutover and counterfactual “what would this have cost on another tier?” calculations.
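
A minimal sketch of that property, under assumptions: the `PriceBook` type, the per-1,000-token rate layout, the version labels, and the floor-division rounding are all invented for illustration. What matters is that `rate` reads only its two arguments, so the same event and the same price book version always produce the same line.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple

@dataclass(frozen=True)
class UsageEvent:
    event_id: str
    resolved_model: str
    input_tokens: int
    output_tokens: int

@dataclass(frozen=True)
class PriceBook:
    version: str
    # model -> (input rate, output rate), in micro-dollars per 1,000 tokens
    rates: Mapping[str, Tuple[int, int]]

@dataclass(frozen=True)
class RatedUsageLine:
    event_id: str
    price_book_version: str
    billable_micros: int

def rate(event: UsageEvent, book: PriceBook) -> RatedUsageLine:
    """Pure function: no clock, no database, no global state."""
    in_rate, out_rate = book.rates[event.resolved_model]
    micros = (event.input_tokens * in_rate + event.output_tokens * out_rate) // 1000
    return RatedUsageLine(event.event_id, book.version, micros)

event = UsageEvent("evt-1", "model-large", input_tokens=2000, output_tokens=500)
v1 = PriceBook("2025-01", {"model-large": (3000, 15000)})
v2 = PriceBook("2025-06", {"model-large": (2500, 10000)})

line_v1 = rate(event, v1)  # 13,500 micro-dollars under the old prices
line_v2 = rate(event, v2)  # 10,000 micro-dollars under the new prices
```

Re-rating after a pricing change is then a matter of running `rate` over the old events with the new book and storing the results as new versions; shadow rating is the same loop with the output kept out of billing.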

What’s the difference between a customer spending cap and a Cost Risk Engine?

A customer cap answers “is the customer overspending?” A Cost Risk Engine answers “is the platform losing money on this tenant, and why?” A fallback to an expensive model, BYOK overhead, or runaway agent fan-out can all destroy platform margin while the customer stays well within their cap. They are separate concerns and must be monitored separately.
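
A small worked example makes the gap visible. The numbers and function names below are made up; the scenario assumes a tenant whose requests fell back to a pricier model while the customer-facing price stayed the same.

```python
def within_cap(billed_micros, cap_micros):
    """Customer-side question: has the customer spent more than they allowed?"""
    return billed_micros <= cap_micros

def platform_margin_micros(billed_micros, provider_cost_micros, overhead_micros=0):
    """Platform-side question: what is left after paying for the calls?"""
    return billed_micros - provider_cost_micros - overhead_micros

# Month to date for one tenant, in micro-dollars (illustrative values).
billed = 40_000_000         # $40 charged to the customer
cap = 100_000_000           # $100 customer spending cap
provider_cost = 52_000_000  # $52 owed to the provider after fallbacks
overhead = 3_000_000        # $3 of platform overhead

cap_ok = within_cap(billed, cap)                                  # True
margin = platform_margin_micros(billed, provider_cost, overhead)  # -15_000_000
```

The cap check passes with room to spare while the platform is $15 underwater on the same tenant, which is why a cap dashboard alone would never surface the problem.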

How do you correct an AI billing error after a period has closed?

Forward, never backward. Create an AdjustmentEntry in the current open period — with a reason code, approver reference, and link to the drift entry — rather than mutating the closed aggregate. The original UsageEvent and RatedUsageLine are never modified. This mirrors how Stripe credit notes correct an invoice without voiding it.
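
Sketched in code, the rule looks roughly like this. The `PeriodBook` class, its method names, and the field names on `AdjustmentEntry` are assumptions for illustration; the one behavior it demonstrates is that a closed total is never rewritten and a correction can only be booked in a period that is still open.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AdjustmentEntry:
    booked_period: str    # open period the correction lands in
    corrects_period: str  # closed period the correction refers to
    amount_micros: int    # signed: negative credits the customer
    reason_code: str
    approver_ref: str
    drift_entry_id: str

class PeriodBook:
    def __init__(self):
        self._closed_totals = {}  # period -> total frozen at close
        self._adjustments = []    # append-only list of corrections

    def close(self, period, total_micros):
        if period in self._closed_totals:
            raise ValueError("period is already closed")
        self._closed_totals[period] = total_micros

    def closed_total(self, period):
        return self._closed_totals[period]

    def adjust(self, entry):
        if entry.booked_period in self._closed_totals:
            raise ValueError("adjustments must be booked in an open period")
        if entry.corrects_period not in self._closed_totals:
            raise ValueError("only closed periods are corrected forward")
        self._adjustments.append(entry)

    def adjustments_total(self, period):
        return sum(
            a.amount_micros for a in self._adjustments if a.booked_period == period
        )
```

The closed total and the adjustments against it stay separately queryable, so finance can always see both the original figure and every later correction with its reason and approver.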

What should an AI billing MVP build first?

An immutable UsageEvent store; resolved-model tracking (resolved_model and requested_alias as separate fields); separate provider-cost and customer-billable fields; async billing sync with stable idempotency keys via a transactional outbox; and a basic BudgetReservation state machine. The separation of execution, rating, and settlement holds even in an MVP — the MVP is simpler, not structurally different.
