qler: The Honest Benchmark

Do you need Redis for background jobs?

qler is a background job queue built on SQLite. No broker, no daemon, no infrastructure; pip install qler and you have a job queue backed by a file on disk. The question is what that simplicity costs you in throughput, and whether the answer changes if you’re honest about the measurement.


Environment

All measurements on the same machine, same run, same Python process:

  • Python: 3.13.7
  • qler: 0.5.0 (SQLite 3.50.4, WAL mode)
  • Celery: 5.6.2 + redis-py 6.4.0
  • Redis: 7.0, localhost (loopback; no real network latency)
  • Platform: Linux x86_64, 8 cores
  • Celery pool: --pool solo (single-threaded, fair c=1 comparison)

Both systems get equivalent configuration within their architecture. qler uses WAL mode; Celery uses Redis’s default in-memory persistence. Neither gets special tuning the other doesn’t.
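For reference, enabling WAL mode takes one pragma. The snippet below is an illustrative sketch using the stdlib sqlite3 module, not qler's actual connection setup; the `synchronous=NORMAL` pairing is a common choice with WAL, and we assume qler does something similar:

```python
import sqlite3

conn = sqlite3.connect("queue.db")
# WAL lets readers proceed while a writer commits, which matters for
# a queue where enqueues, claims, and status polls interleave.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
# NORMAL is the usual pairing with WAL: fsync on checkpoint,
# not on every commit.
conn.execute("PRAGMA synchronous=NORMAL")
print(mode)  # → wal
```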


Caveats (Read These First)

Before the numbers, understand what this comparison is and isn’t.

  1. Architecture: qler runs in-process with SQLite — zero network hops. Celery talks to a separate Redis process over TCP loopback. These are different tradeoffs, not just different speeds.
  2. Single machine only: qler cannot distribute work across machines. Celery can. That capability isn’t tested here.
  3. SQLite write ceiling: SQLite handles ~1–5K writes/sec. Redis handles 100K+. At extreme throughput, this is not a contest.
  4. Solo pool: Celery uses --pool solo for a fair single-worker comparison. Real deployments use prefork with multiple processes; the throughput gap would widen.
  5. Localhost Redis: No real network latency. Production Redis is often remote (+0.1–1ms per round-trip). This understates the gap Celery would face in production.
  6. Cold state: Fresh DB and fresh Redis on each iteration. No warm caches, no accumulated data.

The framing: “Should I add Redis to my stack just for background jobs, or is SQLite enough?” If you’re on one machine processing fewer than 1K jobs/minute, this shows what you gain (simplicity) and what it costs (throughput ceiling).


The Write Path (C1–C2)

C1: Enqueue Latency

How fast can you submit a single job? qler calls INSERT INTO on a local SQLite file; Celery serializes and publishes to Redis over TCP.

| Jobs | qler (ms) | Celery (ms) | Gap |
|---|---|---|---|
| 100 | 72 | 57 | 1.3x slower |
| 500 | 280 | 275 | parity |
| 1,000 | 525 | 554 | 1.1x faster |
| 5,000 | 2,609 | 2,794 | 1.1x faster |

Rough parity. At small scales Celery’s pipelined Redis connection wins; at larger scales the overhead of 5,000 TCP round-trips catches up and qler’s single-transaction SQLite writes pull ahead slightly.

C2: Batch Enqueue

enqueue_many() wraps an entire batch in one SQLite transaction. Celery’s group().apply_async() pipelines individual Redis publishes.

| Jobs | qler (ms) | Celery (ms) | Gap |
|---|---|---|---|
| 100 | 34 | 56 | 1.6x faster |
| 500 | 75 | 287 | 3.8x faster |
| 1,000 | 121 | 567 | 4.7x faster |
| 5,000 | 501 | 3,174 | 6.3x faster |

qler dominates batch writes. One SQLite transaction for 5,000 rows is fundamentally cheaper than 5,000 individual Redis publishes, and the advantage scales linearly. This is the scenario where SQLite’s “everything is a file” model wins outright.
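The effect is easy to reproduce with the stdlib sqlite3 module alone. The sketch below illustrates the transaction-batching mechanics, not qler's actual schema or code: one transaction around `executemany()` versus one commit per row, which is the disk-side shape of 5,000 individual publishes:

```python
import sqlite3
import time

conn = sqlite3.connect("bench.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
conn.execute("CREATE TABLE IF NOT EXISTS jobs (id INTEGER PRIMARY KEY, payload TEXT)")

rows = [("job-%d" % i,) for i in range(5000)]

# One transaction, one commit: the enqueue_many() shape.
t0 = time.perf_counter()
with conn:  # implicit BEGIN ... COMMIT
    conn.executemany("INSERT INTO jobs (payload) VALUES (?)", rows)
batched = time.perf_counter() - t0

# One commit per row: 5,000 separate transactions.
t0 = time.perf_counter()
for row in rows:
    with conn:
        conn.execute("INSERT INTO jobs (payload) VALUES (?)", row)
per_row = time.perf_counter() - t0

print(f"batched: {batched:.3f}s  per-row: {per_row:.3f}s")
```

The batched path amortizes transaction overhead across the whole set; the per-row path pays it 5,000 times, which is why the gap in C2 grows with batch size.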


The Unfair Shortcuts (C3–C4)

These scenarios were part of the original benchmark suite and we keep them for completeness, but they measure something misleading.

C3: Raw API Round-Trip

Enqueue → claim → complete using qler’s raw API (no Worker dispatch) vs Celery’s delay().get() (which uses a real Worker).

| Jobs | qler (ms) | Celery (ms) | Gap |
|---|---|---|---|
| 100 | 336 | 168 | 2.0x slower |
| 500 | 1,616 | 853 | 1.9x slower |
| 1,000 | 3,112 | 1,681 | 1.9x slower |

C4: Raw API Throughput

Same raw API, sustained sequential cycles.

| Jobs | qler (ms) | Celery (ms) | Gap |
|---|---|---|---|
| 100 | 315 | 167 | 1.9x slower |
| 500 | 1,563 | 830 | 1.9x slower |
| 1,000 | 3,031 | 1,680 | 1.8x slower |
| 5,000 | 15,805 | 8,197 | 1.9x slower |

The gap is a consistent ~1.9x. Redis’s in-memory operations outpace SQLite’s 3-write cycle (enqueue + claim + complete); that’s physics, not a bug. But note the asymmetry: qler bypasses its Worker entirely, calling the raw Queue API. Celery uses real Workers. The 1.9x gap is artificially narrow because qler isn’t doing equivalent work.
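The three-write cycle can be sketched directly in SQL. This is an illustration of the pattern, not qler's real schema or status values; the compare-and-swap guard in the claim UPDATE is the standard way to make an SQLite-backed claim race-safe:

```python
import sqlite3

conn = sqlite3.connect("cycle.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY, payload TEXT, status TEXT DEFAULT 'queued')""")

# Write 1: enqueue — one INSERT.
with conn:
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", ("send-email",))

# Write 2: claim — a compare-and-swap UPDATE. The status guard in the
# WHERE clause means two workers racing for the same row can't both win.
job = conn.execute(
    "SELECT id, payload FROM jobs WHERE status = 'queued' LIMIT 1").fetchone()
with conn:
    claimed = conn.execute(
        "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'queued'",
        (job[0],)).rowcount == 1

# Write 3: complete — a final UPDATE. Three disk writes per job,
# versus Redis doing the equivalent steps entirely in memory.
if claimed:
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job[0],))
```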

We found this bias during Round 1.1’s adversarial audit. The fix was to add C5 and C6: real Workers on both sides.


The Honest Numbers (C5–C6)

C5: Worker Round-Trip

Real end-to-end: enqueue → Worker picks up the job → executes → job.wait() returns. Both sides use actual Workers.

| Jobs | qler (ms) | Celery (ms) | Gap |
|---|---|---|---|
| 100 | 442 | 132 | 3.4x slower |
| 500 | 2,230 | 704 | 3.2x slower |

Celery is ~3.3x faster on real worker round-trips. The gap comes from two places: Redis’s in-memory message dispatch is faster than SQLite’s claim-via-UPDATE, and Celery’s Worker is a battle-hardened event loop optimized for message throughput.

C6: Worker Throughput

Sequential enqueue + job.wait() through real Workers, sustained. This is the throughput number that matters for production workloads.

| Jobs | qler (ops/s) | Celery (ops/s) | Gap |
|---|---|---|---|
| 100 | 74.6 | 600 | 8.0x slower |
| 500 | 77.3 | 596 | 7.7x slower |

Celery sustains ~8x higher throughput through real Workers. With prefork pool (not tested), the gap would widen further.

This number was 12.8x in Round 1.2. What changed is covered in the next section.


Round 1.3: The Event Fix

Round 1.2 showed a 12.8x throughput gap. That was honest — but the question was whether it was necessary.

The bottleneck was job.wait(). In Round 1.2, waiting for a job to complete meant polling the database: query the job’s status, sleep 50ms, query again, repeat. Every wait paid 0–50ms of unnecessary latency per job, and the poll queries added load that contended with the Workers’ writes on the SQLite file.

The fix was an asyncio.Event notification system. When a Worker completes a job, it fires an in-process Event. job.wait() registers for that Event before its first DB check; if the job completes while we’re waiting, the Event wakes us up instantly — no polling, no sleep, no wasted queries.

# _notify.py — the entire module is 47 lines
import asyncio

_registry: dict[str, asyncio.Event] = {}

def register(ulid: str) -> asyncio.Event:
    """Called by wait() BEFORE the first DB check."""
    if ulid not in _registry:
        _registry[ulid] = asyncio.Event()
    return _registry[ulid]

def fire(ulid: str) -> None:
    """Called by complete_job/fail_job/cancel_job."""
    ev = _registry.get(ulid)
    if ev is not None:
        ev.set()

Cross-process callers (or jobs completed before wait() starts) fall back to the existing DB poll transparently. The Event is a fast path, not a requirement.

Before and After

| Metric | Round 1.2 | Round 1.3 | Change |
|---|---|---|---|
| C6 throughput (qler, 100 jobs) | 50.2 ops/s | 74.6 ops/s | +49% |
| C6 throughput (qler, 500 jobs) | 51.4 ops/s | 77.3 ops/s | +50% |
| C6 gap vs Celery | 12.8x | 8.0x | 37% narrower |

A 47-line module cut the throughput gap by a third. The remaining 8x is largely architectural: Redis dispatches messages in memory; qler writes to disk. That gap is real and will not close without fundamental changes to the storage model.


Cold Start (C7)

Time from zero to first completed job, including all initialization.

| Jobs | qler (ms) | Celery (ms) | Gap |
|---|---|---|---|
| 1 | 34 | 3,478 | 102x faster |
| 10 | 1,146 | 3,497 | 3.1x faster |

qler initializes in 34ms: create an SQLite file, run schema migrations, start an asyncio task. Celery needs ~3.5 seconds: fork a subprocess, connect to Redis, register tasks, start the event loop.

The 102x gap matters for three use cases: CLI tools that run a job and exit, test suites that spin up workers per test, and serverless functions where cold start is billed time.
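The qler side of that 34ms is easy to approximate with the stdlib alone, since SQLite initialization is just file creation plus DDL. A rough timing sketch, not the actual qler schema or startup path:

```python
import os
import sqlite3
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "queue.db")

t0 = time.perf_counter()
conn = sqlite3.connect(path)             # creates the file
conn.execute("PRAGMA journal_mode=WAL")  # one-time mode switch
conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY, payload TEXT, status TEXT DEFAULT 'queued')""")
conn.commit()
elapsed_ms = (time.perf_counter() - t0) * 1000

# Typically single-digit to low-double-digit milliseconds on local disk.
print(f"cold start: {elapsed_ms:.1f} ms")
```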


Full Results Table

All 7 scenarios, Round 1.3, median values. Scale: small (100–5,000 jobs), 3 iterations, 1 warmup.

| Scenario | Measure | qler | Celery | Gap | Winner |
|---|---|---|---|---|---|
| C1: Enqueue latency | ops/s @ 5K | 1,916 | 1,789 | 1.1x | qler |
| C2: Batch enqueue | ops/s @ 5K | 9,973 | 1,576 | 6.3x | qler |
| C3: Raw API round-trip | ops/s @ 1K | 321 | 595 | 1.9x | Celery |
| C4: Raw API throughput | ops/s @ 5K | 316 | 610 | 1.9x | Celery |
| C5: Worker round-trip | ops/s @ 500 | 224 | 711 | 3.2x | Celery |
| C6: Worker throughput | ops/s @ 500 | 77 | 596 | 7.7x | Celery |
| C7: Cold start (1 job) | ms | 34 | 3,478 | 102x | qler |

When to Use Which

| Scenario | Recommendation |
|---|---|
| Single machine, < 1K jobs/min | qler — no infrastructure, same-process, debuggable |
| Single machine, > 5K jobs/min | Celery+Redis — SQLite write ceiling becomes the bottleneck |
| Multi-machine / distributed | Celery+Redis — qler doesn’t distribute |
| Development / prototyping | qler — zero setup, pip install qler and go |
| Already running Redis | Celery — marginal cost of adding a queue is near zero |
| Debugging / observability | qler — full SQL access to every job, attempt, and failure |
| CLI tools / test suites | qler — 34ms cold start vs 3.5 seconds |

What’s Missing

These benchmarks run on one machine with localhost Redis. We have not tested:

  • Remote Redis: Adding real network latency would slow Celery’s round-trips. The gap on C1/C5 would narrow; C2 (batch) would widen further in qler’s favor.
  • Prefork pool: Celery with --pool prefork -c 4 would multiply its throughput. The C6 gap would widen.
  • Production data: Warm caches, thousands of existing jobs, concurrent writers. SQLite’s single-writer lock becomes a real constraint under contention.
  • Multi-machine: qler’s architecture cannot distribute. This is a feature boundary, not a performance one.
  • Large payloads: All jobs use minimal payloads. Serialization costs matter more with large JSON blobs.

The honest summary: qler wins on simplicity, cold start, and batch writes. Celery wins on sustained throughput and will always win on horizontal scaling. The 8x throughput gap is the price of not running Redis; for many applications, that price is worth paying.


Running It Yourself

git clone https://github.com/gabubelern/qler.git
cd qler
uv sync --all-groups

# Requires Redis running on localhost:6379
uv run --group bench python -m benchmarks run --suite comparison --scale small --warmup 1 --iterations 3 -v
uv run --group bench python -m benchmarks compare

Results land in benchmarks/COMPARISON.md. The suite enforces matched configuration, GC isolation, and measurement parity automatically.