
adr: 0002
title: Redis Lua for atomic trade execution (vs Service Bus FIFO)
status: accepted
date: 2026-05-22
deciders: [@vaishnav-s-01, @amalkrsihna]
affects-specs: [amm-pricing]
affects-code:
  - ftl-backend/internal/redis/lua/trade_execute.lua
  - ftl-backend/internal/trade/service.go
supersedes: null
superseded-by: null


ADR-0002: Redis Lua for atomic trade execution

Context

Every trade in FTL mutates four pieces of state simultaneously: the user's wallet, the user's position in the instrument, the instrument's net_shares_sold (which moves the price), and the leaderboard's per-instrument PnL hash. With up to 50K concurrent users trading the same player during a goal, we need:

  • Strict ordering per instrument. Two BUYs at the same instant must observe and update net_shares_sold in a defined sequence; otherwise the second races the first and the resulting curve walks are priced incorrectly.
  • All-or-nothing semantics. If the wallet debit succeeds but the position write fails, the user has lost money for nothing. The four updates must commit together or not at all.
  • Sub-millisecond latency. The hot path can't afford a round-trip to a queue + a worker + an ack.
  • Idempotency. Network retries are common; the same clientRequestId arriving twice must not double-execute.

The retired serverless design solved this with Azure Service Bus FIFO sessions per instrument (every trade for instrument X lands in session X, a single worker per session processes them serially) plus a Cosmos DB transactional batch for atomicity. That design works but costs a round-trip to Service Bus + worker spin-up latency per trade, and the Cosmos transactional batch limits the update to one logical partition.

Decision

We will execute the entire trade — wallet debit/credit, position mutation, net_shares_sold update, leaderboard delta, idempotency guard — inside a single Redis Lua script (trade_execute.lua) that runs single-threaded against the Redis instance.

Concrete changes encoded:

  • One Lua script holds the full trade flow (~385 lines). It is registered with SCRIPT LOAD on api-server startup and invoked via EVALSHA thereafter.
  • Redis's single-threaded execution model gives us the ordering guarantee for free — no distributed locks, no FIFO queue, no per-instrument worker pool.
  • Idempotency: the script opens with a SET with the NX option on idem:<clientRequestId>. On a replay the key already exists, so the script skips execution and returns the cached response stored at idem_result:<id>.
  • Durable persistence is asynchronous. The Lua emits a trade event to a Redis stream; the flusher service drains the stream into Postgres on a 100ms loop. If the PG insert fails, the trade goes to the trade_outbox table for retry — the Lua's authoritative state is already correct.
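
The flow in the bullets above can be modelled without Redis. The following is a minimal in-memory sketch in Go, not the contents of trade_execute.lua: the mutex stands in for Redis's single-threaded script execution, and every name here (Engine, Buy, TradeResult, the flat pricePerShare, the PnL delta) is a hypothetical chosen for illustration. It assumes only successful trades are cached for replay, which the real script may handle differently.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// TradeResult is the response cached per clientRequestId (hypothetical shape).
type TradeResult struct {
	Cost           int64 // total debit, in minor currency units
	NetSharesAfter int64 // net_shares_sold after this trade
}

// Engine is an in-memory stand-in for the Redis state the script touches.
// The mutex plays the role of Redis's single-threaded script execution.
type Engine struct {
	mu            sync.Mutex
	wallet        map[string]int64       // userID -> balance
	position      map[string]int64       // userID -> shares held
	netSharesSold int64                  // moves the price
	pnl           map[string]int64       // leaderboard delta per user
	idemResult    map[string]TradeResult // clientRequestId -> cached response
	stream        []string               // trade events awaiting the flusher
}

var ErrInsufficientFunds = errors.New("insufficient funds")

func NewEngine() *Engine {
	return &Engine{
		wallet:     map[string]int64{},
		position:   map[string]int64{},
		pnl:        map[string]int64{},
		idemResult: map[string]TradeResult{},
	}
}

// Buy validates first and mutates second, so a rejected trade leaves no
// partial state behind. A flat pricePerShare replaces the real curve walk.
func (e *Engine) Buy(reqID, user string, shares, pricePerShare int64) (TradeResult, error) {
	e.mu.Lock()
	defer e.mu.Unlock()

	// Idempotency guard: a replay returns the cached response untouched.
	if res, seen := e.idemResult[reqID]; seen {
		return res, nil
	}

	cost := shares * pricePerShare
	if e.wallet[user] < cost {
		return TradeResult{}, ErrInsufficientFunds // nothing has been written yet
	}

	// All four state updates happen together while the lock is held.
	e.wallet[user] -= cost
	e.position[user] += shares
	e.netSharesSold += shares
	e.pnl[user] -= cost // placeholder delta; the real PnL formula is not given here

	res := TradeResult{Cost: cost, NetSharesAfter: e.netSharesSold}
	e.idemResult[reqID] = res
	e.stream = append(e.stream, fmt.Sprintf("BUY %s %d", user, shares))
	return res, nil
}

func main() {
	e := NewEngine()
	e.wallet["u1"] = 1000
	first, _ := e.Buy("req-1", "u1", 3, 100)
	replay, _ := e.Buy("req-1", "u1", 3, 100) // same clientRequestId
	fmt.Println(first == replay, e.wallet["u1"], e.netSharesSold)
}
```

Running the sketch prints `true 700 3`: the replay returns the cached result and the wallet is debited once. In the real design the lock is implicit, because Redis runs the whole script before serving any other command.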

Consequences

  • Positive. Per-trade latency drops to a single Redis round-trip (~1-2 ms locally, ~5 ms cross-AZ). No queue, no worker pool, no transactional batch.
  • Positive. Cost. The retired stack budgeted ~$200/mo just for Service Bus standard tier with the FIFO sessions feature. Redis is already in the stack for caching; the Lua adds zero new cost.
  • Positive. Ordering and atomicity are guaranteed by Redis itself — no application code to get wrong.
  • Negative. A Redis outage stops trading. We mitigate with a 6-replica Redis Cluster in production; the Lua is loaded on every replica via SCRIPT LOAD at api-server startup.
  • Negative. The Lua is hard to test. We mirror the math in internal/amm/amm.go (pure Go, fully unit-tested) and assert equivalence in integration tests, but the full script behavior (including idempotency and stream emission) needs a real Redis to verify.
  • Negative. Script execution is single-threaded across the whole Redis instance, not just per script. A long-running script blocks every other client. We bound this by keeping the script O(shares) per trade and rejecting orders larger than MaxSharesPerOrder upstream.
  • Maintenance burden. Anyone changing the trade flow must be comfortable reading + writing Lua. We document the script with section comments and a top-of-file table-of-contents.
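
The O(shares) bound and the upstream MaxSharesPerOrder check can be sketched in Go as follows. This is an illustration under stated assumptions, not the code in internal/amm/amm.go: the linear price function, the basePrice and slope constants, the limit of 1000, and the names ValidateOrder and WalkCost are all hypothetical. Only the MaxSharesPerOrder name and the per-share cost of the walk come from this ADR.

```go
package main

import (
	"errors"
	"fmt"
)

// MaxSharesPerOrder is named in the ADR; the value is an illustrative assumption.
const MaxSharesPerOrder = 1000

// Hypothetical linear curve parameters, in minor currency units.
const (
	basePrice = 100
	slope     = 2
)

var (
	ErrInvalidShares = errors.New("shares must be positive")
	ErrOrderTooLarge = errors.New("order exceeds MaxSharesPerOrder")
)

// ValidateOrder runs upstream of the script, so an oversized order never
// reaches Redis and cannot hold the instance for a long curve walk.
func ValidateOrder(shares int64) error {
	if shares <= 0 {
		return ErrInvalidShares
	}
	if shares > MaxSharesPerOrder {
		return ErrOrderTooLarge
	}
	return nil
}

// price is a stand-in curve: each share already sold raises the price by slope.
func price(netSharesSold int64) int64 {
	return basePrice + slope*netSharesSold
}

// WalkCost prices a BUY one share at a time, starting from the current
// net_shares_sold. The loop runs once per share, hence O(shares).
func WalkCost(netSharesSold, shares int64) int64 {
	var total int64
	for i := int64(0); i < shares; i++ {
		total += price(netSharesSold + i)
	}
	return total
}

func main() {
	if err := ValidateOrder(3); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println(WalkCost(0, 3)) // 100 + 102 + 104
}
```

With these constants, a 3-share BUY from an empty curve costs 306, and a following 2-share BUY costs 214 because it starts from net_shares_sold = 3. That dependence on the starting point is why the two BUYs must be serialized.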

Alternatives considered

Alternative A: Service Bus FIFO sessions + Cosmos transactional batch (the retired plan)

Per-instrument FIFO queue, single worker per session, Cosmos transactional batch for atomicity.

Rejected because: extra hop costs ~10-30 ms per trade, the Cosmos transactional batch is limited to one logical partition, Service Bus tier with FIFO sessions adds material monthly cost, and the worker pool needs autoscaling rules we'd otherwise not need. The Redis Lua path gives us the same guarantees in less code at lower latency and cost.

Alternative B: PostgreSQL with SERIALIZABLE isolation

Do the whole trade in a single PG transaction with SERIALIZABLE.

Rejected because: PG SERIALIZABLE aborts under contention, forcing the client to retry — fine for low concurrency, hostile under a 50K-user goal spike. PG would also become the single bottleneck for the entire site under load, where Redis comfortably handles ~100K ops/s per replica.
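
To make the retry cost concrete, here is a hedged Go sketch of the loop a caller would need around a SERIALIZABLE transaction. It uses no database: errSerialization is a stand-in for PostgreSQL's serialization_failure error (SQLSTATE 40001), and runWithRetry and the simulated transaction are hypothetical names for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errSerialization stands in for PostgreSQL SQLSTATE 40001 (serialization_failure).
var errSerialization = errors.New("serialization failure (SQLSTATE 40001)")

// runWithRetry re-runs tx for as long as it aborts with a serialization
// failure, up to maxAttempts. It returns the number of attempts used.
func runWithRetry(maxAttempts int, tx func(attempt int) error) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := tx(attempt)
		if err == nil {
			return attempt, nil // committed
		}
		if !errors.Is(err, errSerialization) {
			return attempt, err // not retryable, surface immediately
		}
		// Serialization failure: the whole transaction must be replayed.
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, errSerialization)
}

func main() {
	// Simulated transaction: aborts twice under contention, then commits.
	attempts, err := runWithRetry(5, func(attempt int) error {
		if attempt < 3 {
			return errSerialization
		}
		return nil
	})
	fmt.Println(attempts, err)
}
```

Every aborted attempt repeats the full trade, so under a spike on one instrument the retries add load exactly when contention is highest. The Lua path has no equivalent loop because trades never run concurrently.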

Alternative C: Application-level distributed lock (Redlock)

Take a Redis-backed lock per instrument, do the work in Go, release the lock.

Rejected because: Redlock has well-documented correctness debates under clock skew and partitions. The Lua approach doesn't need a lock at all — Redis's single-threaded execution IS the lock, with no expiry-window edge cases.