TL;DR
Exactly-once job processing without a coordinator is achievable with Redis Lua lease acquisition, PostgreSQL as source of truth, and idempotency keys. The key insight: make every operation a no-op if the job was already processed.
Using Redis Lua atomics, lease expiry, and idempotency keys to achieve exactly-once semantics in a distributed job queue, without a central coordinator.

In this post, I'll walk through the key concepts, with code examples drawn from real production implementations.
The Problem with "At Least Once"
Most distributed job queues guarantee at-least-once delivery: if a worker crashes mid-job, the job will be retried. This is usually fine — but for jobs that send emails, charge payment cards, or mutate shared state, processing the same job twice causes real problems.
The naive solution is a distributed lock — a coordinator that prevents two workers from running the same job simultaneously. But coordinators are a single point of failure, hard to scale, and introduce latency on every job acquisition.
Lease-Based Exactly-Once
The approach I used in KaryaFlow: instead of a lock, use a lease with expiry. A worker atomically acquires a job by setting a Redis key with a TTL. If the worker completes, it releases the lease and marks the job done in PostgreSQL. If the worker crashes, the TTL expires and another worker can acquire the same job.
```lua
-- Atomically acquire a job if it has no active lease
local job_id = KEYS[1]
local lease_key = "lease:" .. job_id
local worker_id = ARGV[1]
local lease_ttl = tonumber(ARGV[2]) -- seconds

if redis.call("EXISTS", lease_key) == 1 then
  return 0 -- another worker owns this job
end

redis.call("SET", lease_key, worker_id, "EX", lease_ttl)
return 1 -- lease acquired
```

Idempotency Keys: The Second Line of Defense
Leases prevent concurrent execution but don't prevent re-execution after a crash: if a worker completes a job but crashes before marking it done in PostgreSQL, the reconciliation loop will re-enqueue it. The second line of defense is an idempotency key — a unique token per job stored in a Redis set with a 24h TTL.
Before executing any job, the worker checks if its idempotency key is already in the set. If yes, it skips execution and marks the job complete. This makes re-execution a no-op — the "exactly-once" guarantee comes from idempotent execution, not from preventing re-execution.
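One common way to derive such a token is to hash the job's identity together with its payload, so retries of the same logical job always map to the same Redis key. A minimal sketch (the function name and scheme are illustrative, not necessarily what KaryaFlow uses):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// DeriveIdempotencyKey builds a deterministic token from the job's
// identity and payload, so every retry of the same logical job
// checks the same Redis key.
func DeriveIdempotencyKey(jobID string, payload []byte) string {
	h := sha256.New()
	h.Write([]byte(jobID))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") differ
	h.Write(payload)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	k1 := DeriveIdempotencyKey("job-42", []byte(`{"to":"user@example.com"}`))
	k2 := DeriveIdempotencyKey("job-42", []byte(`{"to":"user@example.com"}`))
	fmt.Println(k1 == k2) // deterministic: same job, same key
}
```

The determinism is the point: an upstream producer can safely re-submit a job after a timeout, because the duplicate carries the same key and collapses into a no-op at the worker.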
```go
func (w *Worker) ProcessJob(ctx context.Context, job *Job) error {
	idempKey := "idem:" + job.IdempotencyKey

	// SetNX returns true only if the key was absent, i.e. this is
	// the first time we are seeing this job.
	firstRun, err := w.redis.SetNX(ctx, idempKey, "1", 24*time.Hour).Result()
	if err != nil {
		return fmt.Errorf("idempotency check: %w", err)
	}
	if !firstRun {
		// Already processed: mark complete and return
		return w.markComplete(ctx, job.ID)
	}

	// Execute the job
	if err := w.execute(ctx, job); err != nil {
		// On failure, delete the idempotency key so a retry can proceed
		w.redis.Del(ctx, idempKey)
		return err
	}
	return w.markComplete(ctx, job.ID)
}
```

PostgreSQL as Source of Truth
Redis is ephemeral — a cluster restart or data loss would cause jobs to disappear. PostgreSQL stores the canonical job state: pending, running, completed, failed, dead-letter. Redis stores only the scheduling layer (leases, queues). If Redis data is lost, a reconciliation query re-enqueues all jobs in "running" state whose lease has expired.
| Component | Responsibility | Failure mode |
|---|---|---|
| PostgreSQL | Job state, history, audit trail | Jobs re-enqueued from reconciliation |
| Redis | Lease acquisition, scheduling queue | Leases expire, reconciliation recovers |
| Idempotency key | Re-execution prevention | 24h TTL; re-execution possible after expiry |
The 30-Second Failure Window
The reconciliation loop runs every 30 seconds and re-enqueues jobs whose leases expired without a completion record. This means in the worst case — a worker crashes immediately after completing a job but before writing to PostgreSQL — the job may be re-executed up to 30 seconds later. Idempotency keys handle this correctly, but the 30-second window is a known latency spike.
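To see why the idempotency key makes this window safe, here is a toy simulation of the worst case, with in-memory stand-ins for Redis and PostgreSQL (all names illustrative): worker 1 executes the job and records the key, then crashes before the PostgreSQL write; worker 2 picks up the re-enqueued job, finds the key, and marks it complete without executing it again.

```go
package main

import "fmt"

func main() {
	idemKeys := map[string]bool{}  // stand-in for the Redis idempotency set
	completed := map[string]bool{} // stand-in for PostgreSQL job state
	executions := 0

	process := func(jobID string) {
		crashBeforeDBWrite := executions == 0 // only the first worker crashes
		if idemKeys[jobID] {
			completed[jobID] = true // already executed: no-op, just record completion
			return
		}
		idemKeys[jobID] = true
		executions++ // the real side effect (send email, charge card, ...)
		if crashBeforeDBWrite {
			return // worker dies before marking the job complete
		}
		completed[jobID] = true
	}

	process("job-42") // worker 1: executes, crashes before the DB write
	process("job-42") // worker 2: re-enqueued up to 30s later, sees the key
	fmt.Println(executions, completed["job-42"]) // 1 true
}
```

The side effect runs exactly once even though the job was processed twice, which is the whole argument of this post: the guarantee comes from idempotent execution, not from preventing re-execution.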
Improvement
Replacing the polling reconciliation loop with PostgreSQL LISTEN/NOTIFY would reduce this window from 30 seconds to under 1 second: a trigger on job state transitions sends a NOTIFY, and the reconciler reacts to the notification instead of waiting for the next poll. (Triggers fire on writes, not on the clock, so pure time-based lease expiry still needs a timer somewhere, but the timer no longer sits on the common crash-recovery path.)