Dmytro Ivasikiv / NodeJS Engineer
19 min read
Part 1 made the case that AI billing needs a ledger, not a log. This part shows what that ledger looks like — and why each design decision is harder than it first appears.
We’ll go component by component: how budget reservations prevent concurrent overspend, how the Rating Engine separates financial interpretation from execution facts, how the Cost Risk Engine defends platform margin, how billing sync stays correct under failure, how reconciliation closes the loop with external systems, and how any customer-facing charge stays traceable backward to the raw provider call.
New here? Start with Part 1: Why Token Logging Breaks AI SaaS Billing
The Invariant Everything Else Depends On
BudgetReservation: Why a Cap Check Isn’t Enough
Rating Engine: The Boundary Between Fact and Meaning
Cost Risk Engine: The Margin Defender
Billing Sync and Operational Correctness
Reconciliation: Closing the Settlement Loop
Explainability: The Backward Trace
Before individual components, there’s one mechanical property the whole architecture depends on. Every financially meaningful movement in the ledger — holding budget before a call, capturing actual usage after, releasing unused budget, recording overage, applying a correction — should appear as a paired entry that nets to zero. This is double-entry accounting applied to a usage ledger, and it’s the property that makes the system auditable by construction rather than by convention.
Define the accounts:
Tenant.Allowance          // granted credits for this billing period
Tenant.HeldByReservation  // held before execution, not yet captured
Tenant.Available          // Allowance minus Held minus Spent
Tenant.OverageBilled      // overage charges accumulating on the active invoice
AdjustmentEntry.signed    // net of all open corrections
The invariant, for any tenant at any point:
Tenant.Allowance
  = Tenant.HeldByReservation
  + Tenant.Available
  + Tenant.OverageBilled
  + AdjustmentEntry.signed
No credit appears or disappears without a paired entry. A reservation hold debits Available and credits HeldByReservation. A capture debits HeldByReservation and credits AccruedRevenue. A release debits HeldByReservation and credits Available back. A correction creates entries that return the system to balance.
The practical use: run this as a scheduled probe.
If money_movement_residual != 0 for any tenant, something is wrong — a duplicate capture, a lost release, a missing adjustment. Catching it continuously is far better than discovering the discrepancy at a quarterly close when the trail is cold.
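The probe itself is not reproduced in this text, so the following is only a minimal sketch of one way to compute a per-tenant residual. It assumes the residual is total debits minus total credits over ledger rows, with amounts stored as integer micro-dollars. The `LedgerRow` shape and both function names are hypothetical.

```typescript
// Hypothetical row shape, loosely following the LedgerEntry table in this article.
type Direction = "debit" | "credit";

interface LedgerRow {
  tenantId: string;
  direction: Direction;
  amountMicros: number; // integer micro-dollars, to avoid floating-point drift
}

// Sums debits minus credits per tenant. A balanced ledger yields 0 for every tenant.
function moneyMovementResidual(rows: LedgerRow[]): Record<string, number> {
  const residual: Record<string, number> = {};
  for (const row of rows) {
    const signed = row.direction === "debit" ? row.amountMicros : -row.amountMicros;
    residual[row.tenantId] = (residual[row.tenantId] ?? 0) + signed;
  }
  return residual;
}

// Tenants whose residual is non-zero and should raise an alert.
function unbalancedTenants(rows: LedgerRow[]): string[] {
  const residual = moneyMovementResidual(rows);
  return Object.keys(residual).filter((tenantId) => residual[tenantId] !== 0);
}
```

In a real system this would more likely be a grouped SQL query run on a schedule; the in-memory version only shows the shape of the check.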
The five core entities, their key fields, and the constraints that enforce correctness. This is a minimal production-oriented model — not exhaustive, but containing the constraints that matter.
UsageEvent — append-only execution record
| Column | Type | Constraint / Note |
|---|---|---|
| id | UUID | PK |
| idempotency_key | VARCHAR | UNIQUE; hash(operationId, providerCallId, attemptNumber) |
| tenant_id | UUID | NOT NULL, FK → Tenant |
| operation_id | UUID | NOT NULL, FK → UsageOperation |
| provider_call_id | VARCHAR | NOT NULL; provider’s own request ID |
| requested_alias | VARCHAR | NOT NULL; for debugging only, never billed |
| resolved_provider | VARCHAR | NOT NULL; what actually ran |
| resolved_model | VARCHAR | NOT NULL; what actually ran |
| key_source | ENUM | NOT NULL; ‘platform’ or ‘customer’ |
| input_tokens | INTEGER | NOT NULL |
| output_tokens | INTEGER | NOT NULL |
| cached_input_tokens | INTEGER | NOT NULL DEFAULT 0 |
| tool_call_count | INTEGER | NOT NULL DEFAULT 0 |
| pricing_version | VARCHAR | NOT NULL; catalog version active at record time |
| recorded_at | TIMESTAMPTZ | NOT NULL |
Immutability constraint: no UPDATE or DELETE permitted. Append-only. Enforced at the application layer and, where the database supports it, via row-level security or a trigger.
RatedUsageLine — financial interpretation of one UsageEvent
| Column | Type | Constraint / Note |
|---|---|---|
| id | UUID | PK |
| usage_event_id | UUID | NOT NULL, FK → UsageEvent; backward traceability anchor |
| rating_version | VARCHAR | NOT NULL; which pricing ruleset produced this line |
| line_type | ENUM | NOT NULL; one of ‘platform_cost’, ‘included’, ‘overage’, ‘customer_billable’ |
| unit_count | DECIMAL | NOT NULL |
| unit_price | DECIMAL | NOT NULL |
| amount | DECIMAL | NOT NULL |
| currency | CHAR(3) | NOT NULL |
| is_counterfactual | BOOLEAN | NOT NULL DEFAULT FALSE; shadow / what-if lines excluded from billing |
| superseded_at | TIMESTAMPTZ | NULL; set when re-rating produces a replacement |
| created_at | TIMESTAMPTZ | NOT NULL |
Unique constraint: (usage_event_id, rating_version, line_type) — prevents duplicate lines on replay. The first write succeeds; replays are no-ops.
LedgerEntry — double-entry record of every financial movement
| Column | Type | Constraint / Note |
|---|---|---|
| id | UUID | PK |
| tenant_id | UUID | NOT NULL, FK → Tenant |
| account | ENUM | NOT NULL; one of ‘allowance’, ‘held’, ‘available’, ‘overage_billed’, ‘adjustment’ |
| direction | ENUM | NOT NULL; ‘debit’ or ‘credit’ |
| amount | DECIMAL | NOT NULL |
| source_type | ENUM | NOT NULL; one of ‘reservation’, ‘capture’, ‘release’, ‘rating’, ‘adjustment’ |
| source_id | UUID | NOT NULL; FK to BudgetReservation, RatedUsageLine, or AdjustmentEntry |
| created_at | TIMESTAMPTZ | NOT NULL |
Immutability constraint: append-only. No UPDATE or DELETE. The invariant probe queries this table.
BudgetReservation — state machine for pre-execution budget holds
| Column | Type | Constraint / Note |
|---|---|---|
| id | UUID | PK |
| idempotency_key | VARCHAR | UNIQUE; prevents duplicate reservations on retry |
| tenant_id | UUID | NOT NULL, FK → Tenant |
| operation_id | UUID | NOT NULL, FK → UsageOperation |
| estimated_amount | DECIMAL | NOT NULL |
| captured_amount | DECIMAL | NULL; set on capture |
| released_amount | DECIMAL | NULL; set on release |
| state | ENUM | NOT NULL; one of ‘requested’, ‘reserved’, ‘partially_captured’, ‘captured’, ‘released’, ‘overrun’, ‘expired’ |
| expires_at | TIMESTAMPTZ | NOT NULL; TTL for long-running operations |
| created_at | TIMESTAMPTZ | NOT NULL |
| updated_at | TIMESTAMPTZ | NOT NULL |
AdjustmentEntry — forward-only correction record
| Column | Type | Constraint / Note |
|---|---|---|
| id | UUID | PK |
| tenant_id | UUID | NOT NULL, FK → Tenant |
| drift_entry_id | UUID | NULL, FK → DriftEntry; links to reconciliation source |
| reason_code | ENUM | NOT NULL; one of ‘late_event_after_period_close’, ‘provider_invoice_delta’, ‘pricing_correction’, ‘classification_correction’, ‘manual_override’ |
| signed_amount | DECIMAL | NOT NULL; positive = credit to tenant, negative = charge |
| period_reference | VARCHAR | NOT NULL; original period corrected, e.g. “2025-03” |
| applied_to_period | VARCHAR | NOT NULL; period entry appears on, e.g. “2025-04” |
| approver_id | UUID | NULL; required for manual_override |
| created_at | TIMESTAMPTZ | NOT NULL |
Forward-only constraint: applied_to_period must be ≥ the current open period. Corrections never enter closed periods.
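As a rough illustration of the forward-only rule and the approver requirement from the table above, the check could be sketched as a validation step that runs before an AdjustmentEntry is inserted. The `AdjustmentDraft` type and `validateAdjustment` function are hypothetical names, and the sketch assumes periods are zero-padded "YYYY-MM" strings as in the examples.

```typescript
type ReasonCode =
  | "late_event_after_period_close"
  | "provider_invoice_delta"
  | "pricing_correction"
  | "classification_correction"
  | "manual_override";

interface AdjustmentDraft {
  reasonCode: ReasonCode;
  periodReference: string; // "YYYY-MM", the period being corrected
  appliedToPeriod: string; // "YYYY-MM", the period the entry lands in
  approverId: string | null;
}

// Returns a list of violations; an empty list means the draft may be inserted.
function validateAdjustment(draft: AdjustmentDraft, currentOpenPeriod: string): string[] {
  const errors: string[] = [];
  // Zero-padded "YYYY-MM" strings sort chronologically, so string comparison is enough.
  if (draft.appliedToPeriod < currentOpenPeriod) {
    errors.push("applied_to_period is a closed period");
  }
  if (draft.reasonCode === "manual_override" && draft.approverId === null) {
    errors.push("manual_override requires approver_id");
  }
  return errors;
}
```

A database CHECK constraint or trigger could enforce the same rules closer to the data; which layer owns the check is a design choice the article leaves open.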
A concrete trace of what the system does for a single operation, from reservation to billing aggregate. Numbers are illustrative.
Scenario: a tenant on a Pro plan (100,000 tokens included, $0.002/1K overage) sends a request. They have already used 99,700 tokens, so 300 remain in the allowance. The request uses a research agent that makes two provider calls totalling 800 tokens.
Step 1 — Reservation hold (hot path)
Before any provider call, the system creates a BudgetReservation with estimated_amount = $0.002. State: requested → reserved. Paired ledger entries: a debit to Available and a matching credit to HeldByReservation, each for $0.002.
Step 2 — Provider call 1 (hot path)
First provider call: 350 input + 150 output tokens on gpt-4o. Provider response metadata: resolvedModel: "gpt-4o", providerCallId: "prov_abc123". UsageEvent written (immutable, append-only):
id: evt_001
idempotency_key: hash(op_xyz, prov_abc123, attempt_1)
operation_id: op_xyz
resolved_model: "gpt-4o"
requested_alias: "gpt-4o"
key_source: "platform"
input_tokens: 350
output_tokens: 150
pricing_version: "v2025-04"
Step 3 — Provider call 2 (hot path)
Synthesis step: 200 input + 100 output tokens. New UsageEvent written with id: evt_002, same operation_id: op_xyz.
Step 4 — Reservation capture and release (hot path end)
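Actual usage: 800 tokens total. captured_amount = $0.0016, released_amount = $0.0004 (unused estimate returned). State: reserved → captured. Paired ledger entries: the capture debits HeldByReservation and credits AccruedRevenue for $0.0016; the release debits HeldByReservation and credits Available for $0.0004.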
Step 5 — Rating (cold path, async)
The rating worker processes both events. Allowance state at rating time: 300 tokens remaining (99,700 of 100,000 already used). 800 total tokens: 300 included (consuming the last of the allowance), 500 overage. Four RatedUsageLine records produced:
| line_type | unit_count | unit_price | amount | source |
|---|---|---|---|---|
| platform_cost | 800 | $0.000002 | $0.0016 | evt_001 + evt_002 |
| included | 300 | $0.00 | $0.00 | allowance draw-down |
| overage | 500 | $0.002/1K | $0.001 | evt_001 + evt_002 |
| customer_billable | 500 | $0.002/1K | $0.001 | overage line |
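The arithmetic behind the included and overage lines can be checked with a small sketch. It assumes 300 tokens of allowance are left when the operation is rated, which is what the 300/500 split in the table implies. The function name `splitAgainstRemaining` is hypothetical and is not the article's splitAgainstAllowance.

```typescript
interface AllowanceSplit {
  includedTokens: number;
  overageTokens: number;
  overageAmountUsd: number;
}

// Splits an operation's tokens against the remaining allowance.
// overageRatePer1k is the plan's overage price per 1,000 tokens.
function splitAgainstRemaining(
  totalTokens: number,
  remainingAllowance: number,
  overageRatePer1k: number,
): AllowanceSplit {
  const includedTokens = Math.min(totalTokens, Math.max(remainingAllowance, 0));
  const overageTokens = totalTokens - includedTokens;
  // Round to micro-dollars so tiny amounts do not pick up floating-point noise.
  const overageAmountUsd =
    Math.round((overageTokens / 1000) * overageRatePer1k * 1e6) / 1e6;
  return { includedTokens, overageTokens, overageAmountUsd };
}

// The worked example: 800 tokens, 300 left in the allowance, $0.002 per 1K overage.
const split = splitAgainstRemaining(800, 300, 0.002);
// split.includedTokens is 300, split.overageTokens is 500, split.overageAmountUsd is 0.001
```

A production ledger would more likely keep amounts in integer minor units or a decimal type end to end; the rounding step here only keeps the illustration stable.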
Step 6 — Ledger entries for overage
Step 7 — Billing aggregate and sync (cold path)
The billing aggregate is synced to the billing provider (e.g., Stripe Meters) with a stable idempotency key:
const idempotencyKey = hash(
  aggregate.id, aggregate.tenantId,
  aggregate.period, "overage_tokens"
);
await billingProvider.sendMeterEvent({
  customerId: aggregate.tenantId,
  meterKey: "overage_tokens",
  value: 500,
  idempotencyKey,
});
Stripe meter event summaries are eventually consistent because meter events are aggregated asynchronously. The internal ledger is the source of truth; the external summary is a projection.
Step 8 — Later adjustment (if needed)
If the provider invoice arrives showing 810 tokens (not 800), the reconciliation engine creates a DriftEntry of type TOKEN_COUNT_DRIFT. If above the configured threshold, a human approves an AdjustmentEntry:
AdjustmentEntry:
reason_code: "provider_invoice_delta"
signed_amount: -$0.00002
period_reference: "2025-04"
applied_to_period: "2025-04"
drift_entry_id: drift_xyz
The UsageEvent and RatedUsageLine records are not modified. The correction is its own document with full lineage.
The simplest approach to budget control is a preflight check: read the tenant’s current spend, compare to their limit, allow or deny. This works until you have concurrency. Two requests arrive simultaneously at $9.80 of a $10.00 cap. Both read $9.80. Both pass the check. Both execute, adding $0.80 between them. Spend lands at $10.60, which is $0.60 over the cap. The check was useless.
The correct primitive is a reservation: before the provider call, hold the estimated budget so the available balance drops immediately. Other concurrent requests see the held balance and cannot also claim it.
[Request A arrives]
Available: $10.00 → Hold $0.50 → Available: $9.50, Held: $0.50
[Request B arrives, concurrent]
Available: $9.50 ← sees the hold
→ Hold $0.80 → Available: $8.70, Held: $1.30
[Request A completes, actual $0.43]
→ Capture $0.43, Release $0.07 unused
→ Available: $8.77, Held: $0.80, Spent: $0.43
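The trace above can be replayed with a minimal in-memory sketch. It assumes amounts in integer micro-dollars and treats the check-and-decrement inside `tryHold` as one atomic step, standing in for what would be a single conditional update or a row lock in a real store. All names here are hypothetical.

```typescript
interface TenantBudget {
  availableMicros: number;
  heldMicros: number;
  spentMicros: number;
}

// Attempts to hold the estimate. The check and the decrement happen together,
// so a second concurrent request always sees the reduced available balance.
function tryHold(budget: TenantBudget, estimateMicros: number): boolean {
  if (estimateMicros > budget.availableMicros) return false;
  budget.availableMicros -= estimateMicros;
  budget.heldMicros += estimateMicros;
  return true;
}

// Captures actual usage and releases the unused part of the hold.
// Usage above the hold is clamped here; the overrun case is handled separately.
function captureAndRelease(
  budget: TenantBudget,
  heldMicros: number,
  actualMicros: number,
): void {
  const captured = Math.min(actualMicros, heldMicros);
  budget.heldMicros -= heldMicros;
  budget.spentMicros += captured;
  budget.availableMicros += heldMicros - captured;
}
```

With a $10.00 budget, holds of $0.50 and $0.80 followed by a $0.43 capture leave $8.77 available, $0.80 held, and $0.43 spent, matching the trace.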
The diagram below shows the BudgetReservation state machine and how each transition creates paired ledger entries:
BudgetReservation is a state machine. Hold → Reserved → Captured / Released / Overrun. Each transition is a concrete financial event that participates in the money movement invariant.
States and transitions:
- requested → reserved: estimate held, available reduced immediately.
- reserved → partially_captured: interim capture in a multi-step workflow.
- partially_captured → captured: final capture, unused released.
- reserved → released: operation aborted, full hold returned.
- reserved/captured → overrun: actual usage exceeded the reserved amount; explicit and typed.
- reserved → expired: TTL elapsed.
The overrun state makes this useful for agentic workflows. When a workflow exceeds its initial cost estimate, the reservation records that fact explicitly — not a silent discrepancy discovered at invoice time, but a typed financial event in the reservation lifecycle.
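One way to keep the reservation lifecycle honest is a transition table that rejects anything not listed. The sketch below transcribes the transitions named in this section and also allows reserved → captured directly, since the worked example uses that path. The `transition` helper is a hypothetical name.

```typescript
type ReservationState =
  | "requested"
  | "reserved"
  | "partially_captured"
  | "captured"
  | "released"
  | "overrun"
  | "expired";

// Allowed next states for each state. Terminal states map to an empty list.
const allowedTransitions: Record<ReservationState, ReservationState[]> = {
  requested: ["reserved"],
  reserved: ["partially_captured", "captured", "released", "overrun", "expired"],
  partially_captured: ["captured"],
  captured: ["overrun"],
  released: [],
  overrun: [],
  expired: [],
};

// Returns the new state, or throws if the move is not in the table.
function transition(from: ReservationState, to: ReservationState): ReservationState {
  if (allowedTransitions[from].indexOf(to) === -1) {
    throw new Error(`illegal reservation transition: ${from} -> ${to}`);
  }
  return to;
}
```

In practice each accepted transition would also write its paired ledger entries in the same transaction; that part is omitted here.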
The harder case: multi-step workflows. A single user action spawning thirty provider calls needs a reservation at the operation level, not the individual-call level. Each provider call captures partial usage against the operation’s reservation. Per-call cap checks cannot enforce operation-level budgets.
Fail-open vs fail-closed. When the reservation-check system is unavailable, fail-closed for high-cost model tiers (margin protection) and fail-open for low-cost requests (availability). The threshold for “high-cost” should be product-specific configuration, not a hardcoded constant.
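A minimal sketch of that policy, assuming the "high-cost" cutoff is expressed as an estimated cost in micro-dollars read from configuration. The article speaks of model tiers, so keying on estimated cost is an assumption made here for brevity, and all names are hypothetical.

```typescript
type FailureMode = "fail_closed" | "fail_open";

interface FailPolicyConfig {
  // Estimated cost (micro-dollars) at or above which a request counts as high-cost.
  highCostThresholdMicros: number;
}

// Decides what to do with a request while the reservation check is unavailable.
function decideWhenReservationUnavailable(
  estimatedCostMicros: number,
  config: FailPolicyConfig,
): FailureMode {
  return estimatedCostMicros >= config.highCostThresholdMicros
    ? "fail_closed" // protect margin on expensive calls
    : "fail_open"; // keep cheap calls available
}
```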
The Rating Engine turns immutable execution facts into financial meaning. It’s also the component most likely to be designed wrong — because the wrong design feels almost right.
The wrong design: enrich the event. Take the UsageEvent, look up the current price, compute costs, write them back as fields on the event. This feels efficient. When pricing changes, you either rewrite historical events or live with mixed pricing in the same table with no clean boundary.
The correct design: leave the event immutable. Produce a separate RatedUsageLine that has its own version, references the source event by ID, and can be superseded when pricing changes.
// Rating is a pure function.
// Same inputs → same outputs. Always.
function rateUsageEvent(
  event: UsageEvent,
  catalog: PricingCatalog,
  plan: TenantPlan,
): RatedUsageLine[] {
  const entry = catalog.lookup(
    event.resolvedProvider,
    event.resolvedModel,
    event.pricingVersion,
  );
  const platformCostLine = computePlatformCost(event, entry);
  const { includedLine, overageLine } =
    splitAgainstAllowance(event, plan);
  const billableLine =
    computeCustomerBillable(overageLine, entry, plan);
  return [platformCostLine, includedLine, overageLine, billableLine]
    .filter(Boolean);
}
One UsageEvent produces up to four RatedUsageLine records: platform cost, included consumption, overage, and customer-billable amount. These are four distinct financial facts that happen to come from the same execution event. Keeping them separate is what allows the ledger to answer “what did this cost the platform?” and “what is the customer billed for?” independently.
The diagram below shows how one event fans out into multiple rated lines:
One immutable UsageEvent expands into multiple financial lines through the Rating Engine. Platform cost, included allowance consumption, overage, and customer-billable are separate financial artifacts.
Because rating is a pure function, three things become possible:
Shadow rating. Before cutting over to new pricing, run new rating rules in shadow mode against live events. Shadow-mode RatedUsageLine records are flagged (is_counterfactual = true) and excluded from billing. Diff the shadow output against production output for hours or days before cutover. If the diff is clean, cut over with confidence.
Re-rating. When pricing changes, produce new RatedUsageLine versions from the same old events. Old versions are marked superseded with a timestamp. Historical billing history stays intact alongside the new interpretation, clearly versioned.
Counterfactual invoices. Because rating is pure, you can answer “what would last month have cost on the Enterprise tier?” Mark the result counterfactual, don’t persist it as financial truth, and use it for pricing conversations.
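For the shadow-rating case, the diff between production and shadow output could look roughly like this. The sketch assumes both sets of lines live in one collection, are distinguished by the is_counterfactual flag, and are matched on (usage event, line type). The `RatedLine` shape and `diffShadowRating` are hypothetical.

```typescript
type LineType = "platform_cost" | "included" | "overage" | "customer_billable";

interface RatedLine {
  usageEventId: string;
  lineType: LineType;
  amountMicros: number; // integer micro-dollars
  isCounterfactual: boolean; // true for shadow-mode lines
}

interface LineDiff {
  key: string;
  productionMicros: number | null;
  shadowMicros: number | null;
}

// Reports every (event, line type) pair where shadow and production disagree,
// including lines that exist on only one side.
function diffShadowRating(lines: RatedLine[]): LineDiff[] {
  const production: Record<string, number> = {};
  const shadow: Record<string, number> = {};
  for (const line of lines) {
    const key = `${line.usageEventId}:${line.lineType}`;
    (line.isCounterfactual ? shadow : production)[key] = line.amountMicros;
  }
  const keys = Object.keys({ ...production, ...shadow }).sort();
  const diffs: LineDiff[] = [];
  for (const key of keys) {
    const productionMicros = production[key] ?? null;
    const shadowMicros = shadow[key] ?? null;
    if (productionMicros !== shadowMicros) {
      diffs.push({ key, productionMicros, shadowMicros });
    }
  }
  return diffs;
}
```

An empty result over a representative window is the "clean diff" signal; any entry points at a specific event and line type to investigate before cutover.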
Trade-off — aggregate sync vs raw-event sync. Aggregating by (customerId, meterKey, period) before syncing gives a stable idempotency unit and lower API volume. The cost: external billing records are less granular. OpenMeter is designed around this pattern and handles deduplication at the event level before aggregation; m3ter takes a similar approach with ingest-time deduplication and batch aggregation.
Customer spending caps protect customers from unexpected charges. They do not protect platform margin. These are different problems, and conflating them is a design error that doesn’t become visible until a bad quarter.
When a fallback routes a request from a cheap model to an expensive one, the customer is still within their monthly cap. The platform’s margin just went negative on that call. The customer cap didn’t fire; it’s not designed to fire here.
When a customer uses their own API key, the platform’s provider cost is zero. But the platform is still paying for compute, retrieval, moderation, and gateway overhead. If that overhead exceeds service-fee revenue, the platform loses money on every BYOK call.
When an agentic workflow spawns 200 provider calls instead of the expected 5, the customer may still be within their budget cap. The platform’s cost structure for that tenant has changed dramatically.
The Cost Risk Engine answers a different question than the customer cap: not “is the customer overspending?” but “is the platform losing money on this tenant, and why?”
The diagram below shows the engine’s margin-risk signals and the circuit-breaker states it manages:
The Cost Risk Engine reads margin-risk signals — fallback cost delta, workflow fan-out, BYOK overhead, systematic drift, hot-tenant pressure — and operates a circuit breaker with four states: HEALTHY, ELEVATED, DEGRADED, CRITICAL.
The engine runs in three modes:
Preflight (hot path, before the provider call): checks estimated cost against the reservation, plus current tenant risk signals. Can block, degrade routing to cheaper models, or require manual review. A practical target is under 50ms end-to-end for this check, though the right threshold is specific to your product’s latency budget. If the engine is unavailable, fail-closed for premium model tiers and fail-open for low-cost requests.
Midflight (cold path, after rating): analyses rated lines for margin per operation, fan-out ratio, and BYOK overhead. Emits a typed CostRiskVerdict with a closed-enum reason code. Does not retroactively block completed operations — it shapes future routing decisions.
Postflight (settlement layer, after reconciliation): detects systematic patterns across billing periods — a pricing or contract risk, not an operational error.
// Rule fragments from inside a rule-evaluation function; inputs are computed upstream.
// R-NEG-MARGIN-FALLBACK
// Fires when the resolved model costs more than what
// the customer is billed.
if (marginPerCall < 0 && fallbackCostDelta > 0) {
  return {
    decision: 'WARN',
    reasonCode: 'NEGATIVE_MARGIN_FALLBACK',
    severity: 'warning',
  };
}
// R-RUNAWAY-AGENT
// fanOutRatio is measured against the p95 baseline for the feature;
// overrunRatio > 1.5 is an example threshold:
// calibrate to your variance.
if (fanOutRatio > p95ForFeature && reservationOverrunRatio > 1.5) {
  return {
    decision: 'BLOCK_CONTINUATION',
    reasonCode: 'RUNAWAY_AGENT_FAN_OUT',
    severity: 'critical',
  };
}
No ML. No probabilistic scoring. A rule fires or it doesn’t. The reason code is always from a closed enum — so support, finance, and on-call engineers can all reason about it without ambiguity.
The Margin Circuit Breaker is the stateful version, per (tenantId, featureKey): HEALTHY → ELEVATED → DEGRADED → CRITICAL, driven by rolling margin averages. At DEGRADED, routing restricts to safe model tiers. At CRITICAL, premium calls require manual finance approval. Recovery uses hysteresis to prevent oscillation from borderline signals.
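The hysteresis idea can be sketched as a small transition function: escalation is immediate when the rolling margin crosses a threshold, while recovery steps down one level at a time and only after the margin clears the threshold by an extra gap. The threshold values, the `recoveryGap` parameter, and the function name are assumptions for illustration, not figures from the article.

```typescript
type BreakerState = "HEALTHY" | "ELEVATED" | "DEGRADED" | "CRITICAL";
const ORDER: BreakerState[] = ["HEALTHY", "ELEVATED", "DEGRADED", "CRITICAL"];

interface BreakerThresholds {
  // Descending margins: below escalate[i] the breaker is at least ORDER[i + 1].
  escalate: [number, number, number];
  // Extra margin required above a threshold before stepping back down.
  recoveryGap: number;
}

function nextBreakerState(
  current: BreakerState,
  rollingMargin: number,
  t: BreakerThresholds,
): BreakerState {
  // Severity implied by the raw thresholds alone.
  let target = 0;
  for (let i = 0; i < t.escalate.length; i++) {
    if (rollingMargin < t.escalate[i]) target = i + 1;
  }
  const currentIdx = ORDER.indexOf(current);
  // Escalate immediately, possibly by more than one level.
  if (target > currentIdx) return ORDER[target];
  // Recover one level only when the margin clears the threshold plus the gap.
  if (currentIdx > 0 && rollingMargin >= t.escalate[currentIdx - 1] + t.recoveryGap) {
    return ORDER[currentIdx - 1];
  }
  return current;
}
```

With thresholds of 0.20, 0.10, and 0 and a gap of 0.05, a tenant at ELEVATED with a rolling margin of 0.22 stays at ELEVATED, which is the oscillation the gap is meant to prevent.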
The architectural difference this creates: “we noticed negative margin in the quarterly review” versus “the system stopped routing expensive models before the quarter ended.”
The external billing provider is a downstream projection of your internal ledger. It receives confirmed billing data and generates invoices. It is not the authoritative record of what your product did.
This distinction has one critical operational consequence: billing sync must never be on the hot path.
If sync is on the request path and the billing provider is slow or unavailable, your product is slow or unavailable. The transactional outbox pattern breaks this coupling: write the execution fact to your own store and an outbox entry atomically, in a single transaction. Return the response. A background worker drains the outbox to the billing provider asynchronously. If the billing provider is down, your outbox accumulates. When it recovers, the outbox drains. No execution facts are lost.
Idempotency in sync is not optional. At-least-once delivery means the billing provider may receive the same meter event twice. The idempotency key for each sync operation must be stable across all retries:
// Computed from content, not generated at send time
const idempotencyKey = hash(
  aggregate.id, aggregate.tenantId,
  aggregate.period, aggregate.metricKey,
);
await billingProvider.sendMeterEvent({
  customerId: aggregate.tenantId,
  meterKey: aggregate.metricKey,
  value: aggregate.totalBillableUnits,
  idempotencyKey, // same on every retry
});
When retries exhaust, the aggregate enters a dead-letter state. An alert fires. A human reviews and replays. The dead-letter state is an explicit signal — not a silent failure.
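The drain-retry-dead-letter flow can be sketched as follows. To keep the example short and testable it is synchronous and has no backoff; a real worker would be asynchronous and would wait between attempts. `OutboxEntry`, `Sender`, and `drainOutbox` are hypothetical names, and `Sender` stands in for the billing-provider client.

```typescript
type OutboxStatus = "pending" | "sent" | "dead_letter";

interface OutboxEntry {
  idempotencyKey: string; // stable across retries, derived from aggregate content
  value: number;
  attempts: number;
  status: OutboxStatus;
}

// Hypothetical stand-in for the billing-provider client; it throws on failure.
type Sender = (idempotencyKey: string, value: number) => void;

// Sends every pending entry, retrying up to maxAttempts.
// Returns the entries that ended in the dead-letter state.
function drainOutbox(
  entries: OutboxEntry[],
  send: Sender,
  maxAttempts: number,
): OutboxEntry[] {
  const deadLettered: OutboxEntry[] = [];
  for (const entry of entries) {
    while (entry.status === "pending") {
      entry.attempts += 1;
      try {
        send(entry.idempotencyKey, entry.value); // same key on every attempt
        entry.status = "sent";
      } catch (err) {
        if (entry.attempts >= maxAttempts) {
          entry.status = "dead_letter"; // explicit, alertable state
          deadLettered.push(entry);
        }
      }
    }
  }
  return deadLettered;
}
```

The returned list is what an alert would be raised on; replaying a dead-lettered entry reuses the same idempotency key, so a late success on the provider side cannot double-count.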
Correctness under high load
Under load, the cold path will fall behind. Rating workers accumulate a queue. Billing aggregates are minutes stale. This is acceptable.
What is not acceptable: dropping a UsageEvent, skipping a LedgerEntry, or silently losing a sync event from the outbox. The invariant is not “no lag.” The invariant is “no dropped events.” Freshness degrades. Correctness is preserved.
Events flow from the Outbox through a Partition Router into per-tenant shards. Hot tenants get isolated shards. Rating Workers consume into Read Models. No dropped events is the invariant. Lag is observable but tolerable.
Partitioning by tenantId keeps one high-volume tenant from corrupting another’s event ordering. But it creates hot-partition risk: a tenant running many concurrent agentic workflows may monopolize a partition.
Hot-tenant detection triggers shadow shard assignment for tenants above a configurable threshold. The sustained lag itself becomes a Cost Risk signal: a tenant whose usage pattern consistently overwhelms their shard may be underpriced relative to the infrastructure they consume.
Replay safety. When a rating worker crashes and restarts, it replays events from the last committed offset. Every rating operation must be idempotent. The mechanism: a unique constraint on (usage_event_id, rating_version, line_type). The first write succeeds; replays are no-ops.
| Replay type | What it does | Safe? |
|---|---|---|
| Re-aggregation | Recomputes aggregates from rated lines | Yes: replaces aggregate, lineage reference updates |
| Re-rating | Runs the Rating Engine with a new ruleset over the same event | Yes: new RatedUsageLine version, old marked superseded |
| Re-execution | Repeats the actual provider call | Never: this is new spending, not replay |
Re-execution is not a billing replay operation. Retrying a failed provider call may be necessary operationally. From the billing system’s perspective, it’s a new event that needs to be handled through the idempotency mechanism, not treated as a safe replay.
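The replay-safety mechanism described above, a unique key on (usage_event_id, rating_version, line_type), can be illustrated with an in-memory stand-in for the table. In a relational store the same effect would typically come from a unique constraint plus an insert that ignores conflicts. `RatedLineStore` is a hypothetical name.

```typescript
interface RatedLineKey {
  usageEventId: string;
  ratingVersion: string;
  lineType: string;
}

class RatedLineStore {
  private lines: Record<string, number> = {};

  // Returns true if the line was written, false if the key already existed (a replay).
  insertIfAbsent(key: RatedLineKey, amountMicros: number): boolean {
    const k = [key.usageEventId, key.ratingVersion, key.lineType].join("|");
    if (k in this.lines) return false;
    this.lines[k] = amountMicros;
    return true;
  }

  // Number of distinct rated lines currently stored.
  count(): number {
    return Object.keys(this.lines).length;
  }
}
```

A replayed write with the same key is a no-op, while a re-rating under a new rating_version produces a new line alongside the old one.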
Settlement truth arrives asynchronously and is never complete at arrival. Provider invoices come days later. Billing-provider summaries are eventually consistent. When they arrive, the Reconciliation Engine compares three sources and produces typed DriftEntry records.
Drift is not a binary condition. It has a taxonomy:
| Drift type | What it means | Likely cause |
|---|---|---|
| TOKEN_COUNT_DRIFT | Provider billed a different token count | Rounding, provider reclassification |
| PRICING_VERSION_MISMATCH | Same call, different rate applied | Mid-period catalog update |
| TIMING_DRIFT | Event crossed a billing-period boundary | Late event after period close |
| CLASSIFICATION_DRIFT | We marked it customer-key; provider billed us | Misrouted credential |
| MISSING_EVENT | Provider invoice has a line we don’t | Silent fallback, recorder failure |
| ORPHAN_EVENT | We have an event the provider doesn’t | Rejected retry, provider-side issue |
| PROVIDER_COST_DRIFT | Our cost estimate diverges from invoice | Cached-token reclassification, batch pricing |
| MISSING_BILLING_SYNC | Our aggregate never reached the billing provider | DLQ not replayed |
The diagram below shows how the three truth layers feed into the reconciliation engine and the append-only adjustment ledger:
Reconciliation compares Execution Truth, Rated Truth, and Settlement Truth. Corrections flow into an append-only Adjustment Ledger. Raw history is never modified.
When a correction is approved, an AdjustmentEntry is created. It never modifies the original UsageEvent or RatedUsageLine. It creates new ledger entries — with a reason code, an approver reference, and a link to the drift entry — that bring the ledger back to balance.
This mirrors the model Stripe uses for credit notes: a credit note reduces the amount owed on an invoice without voiding the invoice itself. The original invoice stands. The correction is its own document. Past financial truth stays past.
Gotcha — late events. When a UsageEvent arrives after the billing period it belongs to has closed, resist the instinct to retroactively update the closed aggregate.
Create an AdjustmentEntry in the current open period with reason code late_event_after_period_close. Retroactive mutation of closed aggregates breaks explainability, replay determinism, and the reconciliation process for the closed period.
Reconciliation as an operational SLO. Per (provider, meterKey, tenantTier), track a rolling currentDriftBurned against a configured acceptedDriftBudget. A team might configure an alert at 80% of drift budget consumed — the exact threshold should be calibrated to the provider’s billing-accuracy characteristics. This turns reconciliation from a post-mortem into an operational signal.
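A minimal sketch of the drift-budget check, assuming both values are tracked in the same unit (for example micro-dollars of absolute drift per rolling window). The field names follow the paragraph above; the three alert levels and the function name are hypothetical.

```typescript
interface DriftBudget {
  acceptedDriftBudget: number; // tolerated drift for the window
  currentDriftBurned: number; // absolute drift observed so far in the window
}

type DriftAlert = "ok" | "warn" | "exhausted";

// warnAtFraction is the configured alert point, e.g. 0.8 for 80% of budget consumed.
function evaluateDriftBudget(budget: DriftBudget, warnAtFraction: number): DriftAlert {
  if (budget.acceptedDriftBudget <= 0) return "exhausted";
  const burnedFraction = budget.currentDriftBurned / budget.acceptedDriftBudget;
  if (burnedFraction >= 1) return "exhausted";
  return burnedFraction >= warnAtFraction ? "warn" : "ok";
}
```

One such budget would be kept per (provider, meterKey, tenantTier), so a noisy provider does not hide drift on a quiet one.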
Support tickets about billing are inevitable. The question is whether answering them takes thirty seconds or three engineering-hours.
Thirty seconds: support queries the Explainability API with an invoice-line identifier. The API traverses the lineage backward:
Invoice Line
→ BillingAggregate
→ RatedUsageLines
→ LedgerEntries
→ UsageEvent
→ ProviderCall
→ UsageOperation
Each arrow is a foreign key in the data model. The traversal requires no recomputation, no reconstructed logic, no raw table queries. The lineage is the schema.
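To show what "the lineage is the schema" means in practice, here is a rough sketch of the backward walk over stored references. It uses in-memory lookup tables in place of foreign-key joins and skips the LedgerEntries hop for brevity. `LineageStore`, `Trace`, and `explainInvoiceLine` are hypothetical names, and the sketch assumes every referenced ID is present.

```typescript
interface LineageStore {
  invoiceLineToAggregate: Record<string, string>;
  aggregateToRatedLines: Record<string, string[]>;
  ratedLineToUsageEvent: Record<string, string>;
  usageEventToProviderCall: Record<string, string>;
  usageEventToOperation: Record<string, string>;
}

interface Trace {
  aggregateId: string;
  ratedLineIds: string[];
  usageEventIds: string[];
  providerCallIds: string[];
  operationIds: string[];
}

// Drops duplicates while keeping first-seen order.
function unique(ids: string[]): string[] {
  return ids.filter((id, i) => ids.indexOf(id) === i);
}

// Follows stored references only; nothing is recomputed along the way.
function explainInvoiceLine(invoiceLineId: string, s: LineageStore): Trace {
  const aggregateId = s.invoiceLineToAggregate[invoiceLineId];
  if (aggregateId === undefined) {
    throw new Error(`unknown invoice line: ${invoiceLineId}`);
  }
  const ratedLineIds = s.aggregateToRatedLines[aggregateId] ?? [];
  const usageEventIds = unique(ratedLineIds.map((id) => s.ratedLineToUsageEvent[id]));
  return {
    aggregateId,
    ratedLineIds,
    usageEventIds,
    providerCallIds: unique(usageEventIds.map((id) => s.usageEventToProviderCall[id])),
    operationIds: unique(usageEventIds.map((id) => s.usageEventToOperation[id])),
  };
}
```

For the worked example earlier in the article, an invoice line for the 500 overage tokens would resolve to evt_001 and evt_002 and then to the single operation op_xyz.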
The diagram below illustrates the full backward trace from a customer-visible invoice line to the root provider call:
Explainability traverses backward from a customer-visible charge through aggregation, rating, ledger entries, usage events, and provider calls to the root operation. The traversal is a data model property, not a special feature.
This backward traversal is an emergent property of the architecture. Because every RatedUsageLine references its source UsageEvent, and every UsageEvent references its ProviderCall, the lineage is always intact — through re-rating, through reconciliation adjustments, through pricing changes.
Time-travel is also possible. Because the ledger is append-only and every record has created_at and superseded_at timestamps, the API can answer “what did this invoice line look like before the adjustment?” with a date filter — not a special feature, just a range condition on an append-only store.
Bill by resolved model, not requested alias. If routing or fallback changes the model that actually ran, pricing must be based on resolved_model. requested_alias is for debugging only.
BYOK usage is tracked, not invisible. The UsageEvent still gets written with key_source: 'customer' and platform overhead rated explicitly. If you skip recording BYOK events, you have no way to know whether BYOK customers are profitable, break-even, or costing you money.
A retry after provider execution ≠ duplicate usage recording. The idempotency key prevents recording the same provider execution twice. It does not prevent the provider from executing again on a full end-to-end retry. These are different failure modes: one is a billing-integrity problem (duplicate recording), the other is an operational problem (double spend) requiring provider-level caching or explicit retry policies.
Closed periods are corrected forward, never modified. Retroactive modification breaks replay determinism, reconciliation accuracy, and the audit trail for the closed period.
Two fields for two costs. Provider cost and customer-billable amount are always stored separately. They can be equal — that’s coincidence, not design. One field for both makes it impossible to track platform margin, apply plan subsidies, or audit separately what the platform pays versus what customers owe.
Reconciliation drift is not a bug. Provider-side rounding, mid-period pricing changes, timing differences, and batch-discount reclassification all produce drift within normal ranges. The taxonomy of drift types is more useful than an aggregate drift percentage.
This architecture can look over-engineered if you’re in the early stages. Not all of it needs to exist on day one.
MVP: build these first
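- An immutable UsageEvent store.
- resolved_model and requested_alias as separate fields, from provider response metadata. Cheap early, expensive late.
- Separate provider-cost and customer-billable fields.
- Async billing sync with stable idempotency keys via a transactional outbox.
- A basic BudgetReservation state machine.
Defer to a mature system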
The invariant — that execution facts, rating, and settlement are separate concerns — holds from day one even in an MVP. The MVP is simpler, not structurally different.
| Component | Production version |
|---|---|
| Execution fact store | Append-only store (Postgres with insert-only policy, event store, or similar) |
| Cold-path event bus | Partitioned message queue (Kafka, Kinesis, Pub/Sub) with consumer groups |
| Ledger store | Append-only relational table with no UPDATE or DELETE operations |
| Billing sync | Stripe Meters API, OpenMeter, m3ter, or a custom metering backend |
| Read model for dashboards | Materialized view or OLAP projection, refreshed async from the ledger |
| Money-movement invariant probe | Scheduled query against the live ledger, alerting on non-zero residual |
| Reconciliation engine | Scheduled job with access to provider invoice exports and billing-provider summaries |
What doesn’t change across those substitutions: the structural separation between execution truth, rated truth, and settlement truth; the money-movement invariant; the append-only correction model; the rating-as-pure-function property; and the reservation state machine. These are design decisions, not technology choices.
The decisions that matter, in order of “most likely to be skipped and later regretted”:
- […] AdjustmentEntry records.
- […] AdjustmentEntry records in the current period.
AI billing done right is not a token logger with a billing API bolted on. It is a reservation-aware ledger where execution, rating, and settlement are first-class concerns with different SLAs, different stores, and different consumers, and where correctness is a runnable invariant, not an assumption.
What is a reservation-aware AI usage ledger?
It’s a billing architecture where every AI provider call is recorded as an immutable execution event, interpreted financially by a separate versioned rating step, and settled against external invoices through append-only corrections. “Reservation-aware” means budget is held before each call (and captured or released after), so concurrent requests can’t overspend a cap.
Why isn’t a preflight cap check enough for budget control?
Because it breaks under concurrency. If two requests both read the same remaining balance before either completes, both pass the check and both execute, pushing spend past the cap. A reservation holds the estimated budget immediately, so other concurrent requests see the reduced available balance and cannot also claim it.
How does separating rating from execution help when pricing changes?
Because the execution record stays immutable and rating is a pure function, you can produce new RatedUsageLine versions from the same old events when pricing changes — marking old versions superseded rather than rewriting history. The same property enables shadow rating before cutover and counterfactual “what would this have cost on another tier?” calculations.
What’s the difference between a customer spending cap and a Cost Risk Engine?
A customer cap answers “is the customer overspending?” A Cost Risk Engine answers “is the platform losing money on this tenant, and why?” A fallback to an expensive model, BYOK overhead, or runaway agent fan-out can all destroy platform margin while the customer stays well within their cap. They are separate concerns and must be monitored separately.
How do you correct an AI billing error after a period has closed?
Forward, never backward. Create an AdjustmentEntry in the current open period — with a reason code, approver reference, and link to the drift entry — rather than mutating the closed aggregate. The original UsageEvent and RatedUsageLine are never modified. This mirrors how Stripe credit notes correct an invoice without voiding it.
What should an AI billing MVP build first?
An immutable UsageEvent store; resolved-model tracking (resolved_model and requested_alias as separate fields); separate provider-cost and customer-billable fields; async billing sync with stable idempotency keys via a transactional outbox; and a basic BudgetReservation state machine. The separation of execution, rating, and settlement holds even in an MVP — the MVP is simpler, not structurally different.