TL;DR
Exactly-once job processing without a coordinator is achievable with Redis Lua lease acquisition, PostgreSQL as source of truth, and idempotency keys. The key insight: make every operation a no-op if the job was already processed.
Using Redis Lua atomics, lease expiry, and idempotency keys to achieve exactly-once semantics in a distributed job queue, without a central coordinator.

In this post, I'll walk through the key concepts, with code examples drawn from real production implementations.
The Problem with "At Least Once"
Most distributed job queues guarantee at-least-once delivery: if a worker crashes mid-job, the job will be retried. This is usually fine — but for jobs that send emails, charge payment cards, or mutate shared state, processing the same job twice causes real problems.
The naive solution is a distributed lock — a coordinator that prevents two workers from running the same job simultaneously. But coordinators are a single point of failure, hard to scale, and introduce latency on every job acquisition.
Lease-Based Exactly-Once
The approach I used in KaryaFlow: instead of a lock, use a lease with expiry. A worker atomically acquires a job by setting a Redis key with a TTL. If the worker completes, it releases the lease and marks the job done in PostgreSQL. If the worker crashes, the TTL expires and another worker can acquire the same job.
```lua
-- Atomically acquire a job if it has no active lease
local job_id = KEYS[1]
local lease_key = "lease:" .. job_id
local worker_id = ARGV[1]
local lease_ttl = tonumber(ARGV[2]) -- seconds

if redis.call("EXISTS", lease_key) == 1 then
  return 0 -- another worker owns this job
end

redis.call("SET", lease_key, worker_id, "EX", lease_ttl)
return 1 -- lease acquired
```

Idempotency Keys: The Second Line of Defense
Leases prevent concurrent execution but don't prevent re-execution after a crash: if a worker completes a job but crashes before marking it done in PostgreSQL, the reconciliation loop will re-enqueue it. The second line of defense is an idempotency key — a unique token per job stored in a Redis set with a 24h TTL.
Before executing any job, the worker checks if its idempotency key is already in the set. If yes, it skips execution and marks the job complete. This makes re-execution a no-op — the "exactly-once" guarantee comes from idempotent execution, not from preventing re-execution.
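One common way to derive such a token is to hash the job's identity together with its payload, so retries of the same logical job always map to the same Redis key. A minimal sketch (the function name and scheme are illustrative, not necessarily what KaryaFlow uses):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// DeriveIdempotencyKey builds a deterministic token from the job's
// identity and payload, so every retry of the same logical job
// checks the same Redis key.
func DeriveIdempotencyKey(jobID string, payload []byte) string {
	h := sha256.New()
	h.Write([]byte(jobID))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") differ
	h.Write(payload)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	k1 := DeriveIdempotencyKey("job-42", []byte(`{"to":"user@example.com"}`))
	k2 := DeriveIdempotencyKey("job-42", []byte(`{"to":"user@example.com"}`))
	fmt.Println(k1 == k2) // deterministic: same job, same key
}
```

The determinism is the point: an upstream producer can safely re-submit a job after a timeout, because the duplicate carries the same key and collapses into a no-op at the worker.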
```go
func (w *Worker) ProcessJob(ctx context.Context, job *Job) error {
	idempKey := "idem:" + job.IdempotencyKey

	// SetNX returns true only if the key was absent, i.e. this is
	// the first time we are seeing this job.
	firstRun, err := w.redis.SetNX(ctx, idempKey, "1", 24*time.Hour).Result()
	if err != nil {
		return fmt.Errorf("idempotency check: %w", err)
	}
	if !firstRun {
		// Already processed: mark complete and return
		return w.markComplete(ctx, job.ID)
	}

	// Execute the job
	if err := w.execute(ctx, job); err != nil {
		// On failure, delete the idempotency key so a retry can proceed
		w.redis.Del(ctx, idempKey)
		return err
	}
	return w.markComplete(ctx, job.ID)
}
```

PostgreSQL as Source of Truth
Redis is ephemeral — a cluster restart or data loss would cause jobs to disappear. PostgreSQL stores the canonical job state: pending, running, completed, failed, dead-letter. Redis stores only the scheduling layer (leases, queues). If Redis data is lost, a reconciliation query re-enqueues all jobs in "running" state whose lease has expired.
| Component | Responsibility | Failure mode |
|---|---|---|
| PostgreSQL | Job state, history, audit trail | Jobs re-enqueued from reconciliation |
| Redis | Lease acquisition, scheduling queue | Leases expire, reconciliation recovers |
| Idempotency key | Re-execution prevention | 24h TTL; re-execution possible after expiry |
The 30-Second Failure Window
The reconciliation loop runs every 30 seconds and re-enqueues jobs whose leases expired without a completion record. This means in the worst case — a worker crashes immediately after completing a job but before writing to PostgreSQL — the job may be re-executed up to 30 seconds later. Idempotency keys handle this correctly, but the 30-second window is a known latency spike.
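To see why the idempotency key makes this window safe, here is a toy simulation of the worst case, with in-memory stand-ins for Redis and PostgreSQL (all names illustrative): worker 1 executes the job and records the key, then crashes before the PostgreSQL write; worker 2 picks up the re-enqueued job, finds the key, and marks it complete without executing it again.

```go
package main

import "fmt"

func main() {
	idemKeys := map[string]bool{}  // stand-in for the Redis idempotency set
	completed := map[string]bool{} // stand-in for PostgreSQL job state
	executions := 0

	process := func(jobID string) {
		crashBeforeDBWrite := executions == 0 // only the first worker crashes
		if idemKeys[jobID] {
			completed[jobID] = true // already executed: no-op, just record completion
			return
		}
		idemKeys[jobID] = true
		executions++ // the real side effect (send email, charge card, ...)
		if crashBeforeDBWrite {
			return // worker dies before marking the job complete
		}
		completed[jobID] = true
	}

	process("job-42") // worker 1: executes, crashes before the DB write
	process("job-42") // worker 2: re-enqueued up to 30s later, sees the key
	fmt.Println(executions, completed["job-42"]) // 1 true
}
```

The side effect runs exactly once even though the job was processed twice, which is the whole argument of this post: the guarantee comes from idempotent execution, not from preventing re-execution.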
Improvement
Replacing the polling reconciliation loop with PostgreSQL LISTEN/NOTIFY would reduce this window from 30 seconds to under 1 second: a trigger on job state transitions sends a NOTIFY, and the reconciler reacts to the notification instead of waiting for the next poll. (Triggers fire on writes, not on the clock, so pure time-based lease expiry still needs a timer somewhere, but the timer no longer sits on the common crash-recovery path.)