qler: The Honest Benchmark

Do you need Redis for background jobs?

qler is a background job queue built on SQLite. No broker, no daemon, no infrastructure; pip install qler and you have a job queue backed by a file on disk. The question is what that simplicity costs you in throughput, and whether the answer changes if you’re honest about the measurement.


Environment

All measurements on the same machine, same run, same Python process:

  • Python: 3.13.7
  • qler: 0.5.0 (SQLite 3.50.4, WAL mode)
  • Celery: 5.6.2 + redis-py 6.4.0
  • Redis: 7.0, localhost (loopback; no real network latency)
  • Platform: Linux x86_64, 8 cores
  • Celery pool: --pool solo (single-threaded, fair c=1 comparison)

Both systems get equivalent configuration within their architecture. qler uses WAL mode; Celery uses Redis’s default in-memory persistence. Neither gets special tuning the other doesn’t.
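For reference, enabling WAL mode takes one pragma. The snippet below is an illustrative sketch using the stdlib sqlite3 module, not qler's actual connection setup; the `synchronous=NORMAL` pairing is a common choice with WAL, and we assume qler does something similar:

```python
import sqlite3

conn = sqlite3.connect("queue.db")
# WAL lets readers proceed while a writer commits, which matters for
# a queue where enqueues, claims, and status polls interleave.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
# NORMAL is the usual pairing with WAL: fsync on checkpoint,
# not on every commit.
conn.execute("PRAGMA synchronous=NORMAL")
print(mode)  # → wal
```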


Caveats (Read These First)

Before the numbers, understand what this comparison is and isn’t.

  1. Architecture: qler runs in-process with SQLite — zero network hops. Celery talks to a separate Redis process over TCP loopback. These are different tradeoffs, not just different speeds.
  2. Single machine only: qler cannot distribute work across machines. Celery can. That capability isn’t tested here.
  3. SQLite write ceiling: SQLite handles ~1–5K writes/sec. Redis handles 100K+. At extreme throughput, this is not a contest.
  4. Solo pool: Celery uses --pool solo for a fair single-worker comparison. Real deployments use prefork with multiple processes; the throughput gap would widen.
  5. Localhost Redis: No real network latency. Production Redis is often remote (+0.1–1ms per round-trip). This understates the gap Celery would face in production.
  6. Cold state: Fresh DB and fresh Redis on each iteration. No warm caches, no accumulated data.

The framing: “Should I add Redis to my stack just for background jobs, or is SQLite enough?” If you’re on one machine processing fewer than 1K jobs/minute, this shows what you gain (simplicity) and what it costs (throughput ceiling).


The Write Path (C1–C2)

C1: Enqueue Latency

How fast can you submit a single job? qler calls INSERT INTO on a local SQLite file; Celery serializes and publishes to Redis over TCP.

| Jobs | qler (ms) | Celery (ms) | Gap |
|---|---|---|---|
| 100 | 72 | 57 | 1.3x slower |
| 500 | 280 | 275 | parity |
| 1,000 | 525 | 554 | 1.1x faster |
| 5,000 | 2,609 | 2,794 | 1.1x faster |

Rough parity. At small scales Celery’s pipelined Redis connection wins; at larger scales the overhead of 5,000 TCP round-trips catches up and qler’s single-transaction SQLite writes pull ahead slightly.

C2: Batch Enqueue

enqueue_many() wraps an entire batch in one SQLite transaction. Celery’s group().apply_async() pipelines individual Redis publishes.

| Jobs | qler (ms) | Celery (ms) | Gap |
|---|---|---|---|
| 100 | 34 | 56 | 1.6x faster |
| 500 | 75 | 287 | 3.8x faster |
| 1,000 | 121 | 567 | 4.7x faster |
| 5,000 | 501 | 3,174 | 6.3x faster |

qler dominates batch writes. One SQLite transaction for 5,000 rows is fundamentally cheaper than 5,000 individual Redis publishes, and the advantage scales linearly. This is the scenario where SQLite’s “everything is a file” model wins outright.
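The effect is easy to reproduce with the stdlib sqlite3 module alone. The sketch below illustrates the transaction-batching mechanics, not qler's actual schema or code: one transaction around `executemany()` versus one commit per row, which is the disk-side shape of 5,000 individual publishes:

```python
import sqlite3
import time

conn = sqlite3.connect("bench.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
conn.execute("CREATE TABLE IF NOT EXISTS jobs (id INTEGER PRIMARY KEY, payload TEXT)")

rows = [("job-%d" % i,) for i in range(5000)]

# One transaction, one commit: the enqueue_many() shape.
t0 = time.perf_counter()
with conn:  # implicit BEGIN ... COMMIT
    conn.executemany("INSERT INTO jobs (payload) VALUES (?)", rows)
batched = time.perf_counter() - t0

# One commit per row: 5,000 separate transactions.
t0 = time.perf_counter()
for row in rows:
    with conn:
        conn.execute("INSERT INTO jobs (payload) VALUES (?)", row)
per_row = time.perf_counter() - t0

print(f"batched: {batched:.3f}s  per-row: {per_row:.3f}s")
```

The batched path amortizes transaction overhead across the whole set; the per-row path pays it 5,000 times, which is why the gap in C2 grows with batch size.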


The Unfair Shortcuts (C3–C4)

These scenarios were part of the original benchmark suite and we keep them for completeness, but they measure something misleading.

C3: Raw API Round-Trip

Enqueue → claim → complete using qler’s raw API (no Worker dispatch) vs Celery’s delay().get() (which uses a real Worker).

| Jobs | qler (ms) | Celery (ms) | Gap |
|---|---|---|---|
| 100 | 336 | 168 | 2.0x slower |
| 500 | 1,616 | 853 | 1.9x slower |
| 1,000 | 3,112 | 1,681 | 1.9x slower |

C4: Raw API Throughput

Same raw API, sustained sequential cycles.

| Jobs | qler (ms) | Celery (ms) | Gap |
|---|---|---|---|
| 100 | 315 | 167 | 1.9x slower |
| 500 | 1,563 | 830 | 1.9x slower |
| 1,000 | 3,031 | 1,680 | 1.8x slower |
| 5,000 | 15,805 | 8,197 | 1.9x slower |

The gap is a consistent ~1.9x. Redis’s in-memory operations outpace SQLite’s 3-write cycle (enqueue + claim + complete); that’s physics, not a bug. But note the asymmetry: qler bypasses its Worker entirely, calling the raw Queue API. Celery uses real Workers. The 1.9x gap is artificially narrow because qler isn’t doing equivalent work.
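The three-write cycle can be sketched directly in SQL. This is an illustration of the pattern, not qler's real schema or status values; the compare-and-swap guard in the claim UPDATE is the standard way to make an SQLite-backed claim race-safe:

```python
import sqlite3

conn = sqlite3.connect("cycle.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY, payload TEXT, status TEXT DEFAULT 'queued')""")

# Write 1: enqueue — one INSERT.
with conn:
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", ("send-email",))

# Write 2: claim — a compare-and-swap UPDATE. The status guard in the
# WHERE clause means two workers racing for the same row can't both win.
job = conn.execute(
    "SELECT id, payload FROM jobs WHERE status = 'queued' LIMIT 1").fetchone()
with conn:
    claimed = conn.execute(
        "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'queued'",
        (job[0],)).rowcount == 1

# Write 3: complete — a final UPDATE. Three disk writes per job,
# versus Redis doing the equivalent steps entirely in memory.
if claimed:
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job[0],))
```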

We found this bias during Round 1.1’s adversarial audit. The fix was to add C5 and C6: real Workers on both sides.


The Honest Numbers (C5–C6)

C5: Worker Round-Trip

Real end-to-end: enqueue → Worker picks up the job → executes → job.wait() returns. Both sides use actual Workers.

| Jobs | qler (ms) | Celery (ms) | Gap |
|---|---|---|---|
| 100 | 442 | 132 | 3.4x slower |
| 500 | 2,230 | 704 | 3.2x slower |

Celery is ~3.3x faster on real worker round-trips. The gap comes from two places: Redis’s in-memory message dispatch is faster than SQLite’s claim-via-UPDATE, and Celery’s Worker is a battle-hardened event loop optimized for message throughput.

C6: Worker Throughput

Sequential enqueue + job.wait() through real Workers, sustained. This is the throughput number that matters for production workloads.

| Jobs | qler (ops/s) | Celery (ops/s) | Gap |
|---|---|---|---|
| 100 | 74.6 | 600 | 8.0x slower |
| 500 | 77.3 | 596 | 7.7x slower |

Celery sustains ~8x higher throughput through real Workers. With prefork pool (not tested), the gap would widen further.

This number was 12.8x in Round 1.2. What changed is covered in the next section.


Round 1.3: The Event Fix

Round 1.2 showed a 12.8x throughput gap. That was honest — but the question was whether it was necessary.

The bottleneck was job.wait(). In Round 1.2, waiting for a job to complete meant polling the database: query the job’s status, sleep 50ms, query again, repeat. Every wait paid 0–50ms of unnecessary latency per job, and the poll queries added load that contended with the Workers’ writes on the SQLite file.

The fix was an asyncio.Event notification system. When a Worker completes a job, it fires an in-process Event. job.wait() registers for that Event before its first DB check; if the job completes while we’re waiting, the Event wakes us up instantly — no polling, no sleep, no wasted queries.

# _notify.py — the entire module is 47 lines
import asyncio

_registry: dict[str, asyncio.Event] = {}

def register(ulid: str) -> asyncio.Event:
    """Called by wait() BEFORE the first DB check."""
    if ulid not in _registry:
        _registry[ulid] = asyncio.Event()
    return _registry[ulid]

def fire(ulid: str) -> None:
    """Called by complete_job/fail_job/cancel_job."""
    ev = _registry.get(ulid)
    if ev is not None:
        ev.set()

Cross-process callers (or jobs completed before wait() starts) fall back to the existing DB poll transparently. The Event is a fast path, not a requirement.

Before and After

| Metric | Round 1.2 | Round 1.3 | Change |
|---|---|---|---|
| C6 throughput (qler, 100 jobs) | 50.2 ops/s | 74.6 ops/s | +49% |
| C6 throughput (qler, 500 jobs) | 51.4 ops/s | 77.3 ops/s | +50% |
| C6 gap vs Celery | 12.8x | 8.0x | 37% narrower |

A 47-line module cut the throughput gap by a third. The remaining 8x is largely architectural: Redis dispatches messages in memory; qler writes to disk. That gap is real and will not close without fundamental changes to the storage model.


Cold Start (C7)

Time from zero to first completed job, including all initialization.

| Jobs | qler (ms) | Celery (ms) | Gap |
|---|---|---|---|
| 1 | 34 | 3,478 | 102x faster |
| 10 | 1,146 | 3,497 | 3.1x faster |

qler initializes in 34ms: create an SQLite file, run schema migrations, start an asyncio task. Celery needs ~3.5 seconds: fork a subprocess, connect to Redis, register tasks, start the event loop.

The 102x gap matters for three use cases: CLI tools that run a job and exit, test suites that spin up workers per test, and serverless functions where cold start is billed time.
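The qler side of that 34ms is easy to approximate with the stdlib alone, since SQLite initialization is just file creation plus DDL. A rough timing sketch, not the actual qler schema or startup path:

```python
import os
import sqlite3
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "queue.db")

t0 = time.perf_counter()
conn = sqlite3.connect(path)             # creates the file
conn.execute("PRAGMA journal_mode=WAL")  # one-time mode switch
conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY, payload TEXT, status TEXT DEFAULT 'queued')""")
conn.commit()
elapsed_ms = (time.perf_counter() - t0) * 1000

# Typically single-digit to low-double-digit milliseconds on local disk.
print(f"cold start: {elapsed_ms:.1f} ms")
```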


Full Results Table

All 7 scenarios, Round 1.3, median values. Scale: small (100–5,000 jobs), 3 iterations, 1 warmup.

| Scenario | Measure | qler | Celery | Gap | Winner |
|---|---|---|---|---|---|
| C1: Enqueue latency | ops/s @ 5K | 1,916 | 1,789 | 1.1x | qler |
| C2: Batch enqueue | ops/s @ 5K | 9,973 | 1,576 | 6.3x | qler |
| C3: Raw API round-trip | ops/s @ 1K | 321 | 595 | 1.9x | Celery |
| C4: Raw API throughput | ops/s @ 5K | 316 | 610 | 1.9x | Celery |
| C5: Worker round-trip | ops/s @ 500 | 224 | 711 | 3.2x | Celery |
| C6: Worker throughput | ops/s @ 500 | 77 | 596 | 7.7x | Celery |
| C7: Cold start (1 job) | ms | 34 | 3,478 | 102x | qler |

When to Use Which

| Scenario | Recommendation |
|---|---|
| Single machine, < 1K jobs/min | qler — no infrastructure, same-process, debuggable |
| Single machine, > 5K jobs/min | Celery+Redis — SQLite write ceiling becomes the bottleneck |
| Multi-machine / distributed | Celery+Redis — qler doesn’t distribute |
| Development / prototyping | qler — zero setup, pip install qler and go |
| Already running Redis | Celery — marginal cost of adding a queue is near zero |
| Debugging / observability | qler — full SQL access to every job, attempt, and failure |
| CLI tools / test suites | qler — 34ms cold start vs 3.5 seconds |

What’s Missing

These benchmarks run on one machine with localhost Redis. We have not tested:

  • Remote Redis: Adding real network latency would slow Celery’s round-trips. The gap on C1/C5 would narrow; C2 (batch) would widen further in qler’s favor.
  • Prefork pool: Celery with --pool prefork -c 4 would multiply its throughput. The C6 gap would widen.
  • Production data: Warm caches, thousands of existing jobs, concurrent writers. SQLite’s single-writer lock becomes a real constraint under contention.
  • Multi-machine: qler’s architecture cannot distribute. This is a feature boundary, not a performance one.
  • Large payloads: All jobs use minimal payloads. Serialization costs matter more with large JSON blobs.

The honest summary: qler wins on simplicity, cold start, and batch writes. Celery wins on sustained throughput and will always win on horizontal scaling. The 8x throughput gap is the price of not running Redis; for many applications, that price is worth paying.


Running It Yourself

git clone https://github.com/gabubelern/qler.git
cd qler
uv sync --all-groups

# Requires Redis running on localhost:6379
uv run --group bench python -m benchmarks run --suite comparison --scale small --warmup 1 --iterations 3 -v
uv run --group bench python -m benchmarks compare

Results land in benchmarks/COMPARISON.md. The suite enforces matched configuration, GC isolation, and measurement parity automatically.