sqler is a document-oriented JSON store built on SQLite. Pydantic models go in; json_extract queries come out. The question every ORM benchmark needs to answer: what does the abstraction actually cost you?
The first step is making sure the question is fair.
This is Part 1 of a five-part series on sqler’s benchmark journey. Part 2 covers where the overhead actually lives. Part 3 covers five targeted fixes. Part 4 is the final scorecard. Part 5 measures the architectural cost of document storage itself.
The v1.1 Suite: 22 Scenarios, 22 Problems
The first benchmark suite measured 22 scenarios across 5 categories: inserts, queries, JSON operations, full-text search, and operational tasks. The numbers looked good; sqler was within 1–2x of raw sqlite3 on most things, sometimes faster.
That should have been a red flag. An ORM that wraps every call in Python cannot genuinely be faster than the C code underneath it. When your results seem too good, either you’ve built something remarkable or you’ve built a biased benchmark; the odds favor the second explanation.
We ran an adversarial audit: the kind where you pretend a hostile reviewer is about to tear your methodology apart in public, and you ask, for every measurement, “if I wanted to make the other arm win, what would I change?”
The audit found 18 fairness issues (later rounds raised the total to 23; see Part 3 for the full count). Eight were HIGH severity: biases that made sqler look dramatically better than it actually was. Ten were MEDIUM: subtler asymmetries that nudged results in sqler’s favor. Not a single one biased against sqler; the whole suite was a flattery machine.
The 18 Fairness Issues
HIGH Severity (8 issues)
These are the ones that invalidated categories outright.
| # | Issue | What was unfair |
|---|---|---|
| H-1 | PRAGMA mismatch | sqler got 32MB cache, WAL mode, synchronous=OFF. Baseline got 2MB cache, rollback journal, synchronous=FULL. |
| H-2 | SQL mismatch | sqler used json_each(data, '$.path') (direct). Baseline used json_each(json_extract(data, '$.path')) (redundant wrap). |
| H-3 | Different insert APIs | sqler used bulk_upsert() with per-row execute(). Baseline used executemany() (C-level batch). |
| H-4 | Commit semantics | sqler’s insert_document() auto-committed per row (N commits). Baseline committed once. |
| H-5 | Missing deserialization | sqler’s .all() returned parsed dicts. Baseline’s fetchall() returned raw sqlite3.Row tuples; no json.loads(). |
| H-6 | Cold cache bypass | Cold-cache measurement had no warmup, no GC isolation, redefined the decorated function inside the loop. |
| H-7 | Export skip | Baseline wrote raw JSON strings from the data column. sqler did json.loads() → dict → json.dumps() per row. |
| H-8 | Input mutation | bulk_upsert() mutated docs in-place (set doc["_id"]), turning INSERT into UPDATE on subsequent iterations. |
H-1 alone could account for a 2–3x difference. Giving one arm WAL mode with a 16x larger page cache and then claiming “similar performance” is measuring configuration, not code; this is the most common bias in ORM benchmarks, and the easiest to fix.
H-5 is the sneakiest. If the baseline doesn’t call json.loads(), it’s not doing equivalent work; it’s returning a different data type entirely. Any query benchmark that returns deserialized dicts on one side and raw tuples on the other is comparing apples to the concept of fruit.
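A minimal sketch of the fix (table and column names are illustrative): a fair baseline has to pay the same deserialization cost, parsing every row with json.loads() instead of handing back raw strings.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (data TEXT)")
conn.execute("INSERT INTO docs VALUES (?)", ('{"name": "a", "tags": ["x"]}',))

# Unfair baseline (H-5): raw JSON strings come back -- a different result
# type entirely, and none of the parsing work the ORM does.
raw_rows = conn.execute("SELECT data FROM docs").fetchall()

# Fair baseline: deserialize every row, matching the ORM's work.
docs = [json.loads(row[0]) for row in conn.execute("SELECT data FROM docs")]
```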
MEDIUM Severity (10 issues)
| # | Issue | What was unfair |
|---|---|---|
| M-1 | Order bias | sqler always measured first in all 22 scenarios. CPU turbo boost, allocator state, cache warming all favored arm 1. |
| M-2 | Row factory on sqler only | sqler paid sqlite3.Row construction overhead; baseline returned cheap tuples. |
| M-3 | Per-query logger overhead | sqler called time.perf_counter() twice + query_logger.log() on every SQL execution. |
| M-4 | No GC reset between arms | Sequential measurement with accumulated heap state. |
| M-5 | WAL vs rollback | sqler used WAL mode (readers don’t block writers). Baseline used default rollback journal. |
| M-6 | Asymmetric connection handling | sqler pre-opened connections; baseline paid sqlite3.connect() + close() inside the timed section. |
| M-7 | DDL in timing window | First db.query("bench") ran CREATE TABLE IF NOT EXISTS inside the timed window. |
| M-8 | Restore PRAGMA overhead | sqler’s SQLerDB.on_disk(rst) ran 8 PRAGMA roundtrips. Baseline had zero. |
| M-9 | No baseline for highlights | sqler measured search_with_highlights with no sqlite3 counterpart. |
| M-10 | bool(first()) vs SELECT 1 | sqler’s .first() fetched a full document + json.loads(). Baseline selected constant 1. |
Any one of these is a small thumb on the scale; together they add up to a benchmark that systematically flatters one arm. The bias isn’t intentional; it’s what happens when you write benchmarks and never ask “what would make the other arm win?”
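Several of these issues (M-6, M-7) reduce to one rule: setup work belongs outside the timed window, for both arms. A hypothetical sketch, with illustrative names:

```python
import sqlite3
import time

def timed(fn):
    """Time a single call; everything else stays outside the window."""
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

# Setup, untimed for BOTH arms: connection open (M-6) and DDL (M-7)
# happen before the clock starts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS bench (data TEXT)")

elapsed = timed(lambda: conn.execute("SELECT count(*) FROM bench").fetchone())
```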
The v1.2 Rewrite: Making Every Number Worse
The fix was straightforward but painful: match everything. Every asymmetry the audit found was resolved in the same direction: give the baseline whatever advantage sqler had been getting.
Matched PRAGMAs
Both arms now get identical configuration per storage mode:
Memory mode (both arms):
```sql
PRAGMA foreign_keys = ON;
PRAGMA synchronous = OFF;
PRAGMA journal_mode = MEMORY;
PRAGMA temp_store = MEMORY;
PRAGMA cache_size = -32000;
PRAGMA locking_mode = EXCLUSIVE;
```
Disk mode (both arms):
```sql
PRAGMA foreign_keys = ON;
PRAGMA busy_timeout = 5000;
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA cache_size = -64000;
PRAGMA wal_autocheckpoint = 1000;
PRAGMA mmap_size = 268435456;
PRAGMA temp_store = MEMORY;
```
If sqler sets PRAGMAs internally, the baseline gets the same ones. No exceptions.
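One way to enforce this (a sketch, not sqler's actual internals): keep a single pragma list and apply it to every connection, whichever arm opens it, so configuration can never drift between arms.

```python
import sqlite3

# Shared configuration: both arms draw from the SAME list.
MEMORY_PRAGMAS = [
    "PRAGMA foreign_keys = ON",
    "PRAGMA synchronous = OFF",
    "PRAGMA journal_mode = MEMORY",
    "PRAGMA temp_store = MEMORY",
    "PRAGMA cache_size = -32000",
    "PRAGMA locking_mode = EXCLUSIVE",
]

def apply_pragmas(conn, pragmas):
    """Apply a shared pragma list to any connection, sqler or baseline."""
    for pragma in pragmas:
        conn.execute(pragma)

conn = sqlite3.connect(":memory:")
apply_pragmas(conn, MEMORY_PRAGMAS)
```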
Matched SQL
Both arms execute the same SQL patterns. Where sqler generates json_each(data, '$.tags'), the baseline uses json_each(data, '$.tags'), not json_each(json_extract(data, '$.tags')). The redundant json_extract wrapper in v1.1 added measurable overhead to the baseline that sqler didn’t pay.
Matched Serialization
Both arms return deserialized data. The baseline calls json.loads() on every row, just like sqler does; no more comparing parsed dicts against raw tuples.
Arm Alternation
Execution order flips deterministically per scenario via a hash function; no “sqler always runs first” bias from CPU turbo boost, cache warming, or allocator state.
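A minimal sketch of deterministic alternation (the suite's actual hash function may differ): derive the starting arm from the scenario name, so order is reproducible across runs but roughly half the scenarios start with each arm.

```python
import hashlib

def first_arm(scenario):
    # Deterministic per scenario name: the same scenario always starts
    # with the same arm, but no arm always runs first across the suite.
    digest = hashlib.sha256(scenario.encode()).digest()
    return "sqler" if digest[0] % 2 == 0 else "sqlite"
```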
GC Isolation
gc.collect() runs between arms; no cross-contamination from accumulated heap state.
Both Storage Modes
Every scenario runs in memory and on disk. The v1.2 suite uses a 4-arm matrix: sqler_mem, sqler_disk, sqlite_mem, sqlite_disk. No cherry-picking the mode that looks better.
What the Fair Numbers Showed
Methodology: 3 warmup runs, then 20 timed iterations per scenario, measured with time.perf_counter() and GC disabled during timing. Cross-scale ratios at 1M rows, memory mode:
| Category | v1.2 (fair) ratio | Trend |
|---|---|---|
| Queries (filter, range, complex) | 0.95–0.97x | Rock-solid parity |
| JSON ops (contains, isin) | 0.99x | Parity |
| Aggregates (sum, avg, min, max) | 0.94–0.95x | Parity* |
| Backup/restore | 1.00–1.02x | Transparent |
| Bulk insert | 1.87–1.92x | Stable ~1.9x |
| any_where (array subqueries) | 1.47–1.51x | Stable ~1.5x |
| Export (CSV/JSONL) | 2.73–2.85x | Stable ~2.8x |
| FTS rebuild | 3.78–4.65x | Scale-dependent |
| FTS ranked search | 0.70–1.52x | Worsening at scale |
*Asterisk on aggregates and queries: the baseline uses sqlite3.Row with string key access (row["data"]), while sqler uses integer index access (row[0]). Both execute identical SQL; the 3–6% gap is the row factory artifact, consistent across all query and aggregate scenarios. We document it rather than pretend sqler’s query layer is genuinely faster than raw sqlite3.
The query layer is effectively free in memory mode; Part 4 reports corrected disk-mode numbers at 1.03–1.15x after accounting for the row factory artifact. The real costs are bulk insert (1.9x), array subqueries (1.5x), exports (2.8x), and FTS operations (3.78–4.65x). These are fair numbers that point at real bottlenecks instead of flattering the tool that generated them.
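The measurement protocol behind these numbers (3 warmup runs, 20 timed iterations, GC disabled during timing, collection between arms) can be sketched as a small harness. This is a simplified illustration, not the suite's exact code:

```python
import gc
import statistics
import time

def measure(fn, warmup=3, iters=20):
    for _ in range(warmup):          # warmup runs, untimed
        fn()
    gc.collect()                     # reset heap state between arms (M-4)
    gc.disable()                     # no GC pauses inside the timed window
    try:
        samples = []
        for _ in range(iters):
            t0 = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - t0)
    finally:
        gc.enable()
    return statistics.median(samples)
```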

The Methodology Checklist
For anyone writing ORM benchmarks (or any paired A/B comparison), here’s the fairness checklist that caught 18 issues in ours:
Configuration parity:
- Both arms use the same database settings (PRAGMAs, cache sizes, journal modes)
- Both arms get the same warmup treatment
Operation parity:
- Both arms execute equivalent SQL
- Both arms do the same serialization work
- Both arms use the same commit semantics
- Both arms return the same result types
Measurement parity:
- Both arms use the same timer
- GC is disabled during measurement for both arms
- Arm execution order alternates
- Connection setup is either both inside or both outside the timed window
The single most useful question: “If I wanted to make the other arm win, what would I change?” If the answer reveals an asymmetry, fix it before running a single measurement.
Running It Yourself
```shell
uv sync --all-groups
uv run python -m benchmarks run --scale medium --storage both
uv run --group benchmarks python -m benchmarks plot
```
Environment: Python 3.12 | SQLite 3.50.4 | Linux x86_64
The suite enforces matched PRAGMAs, arm alternation, and GC isolation automatically.
What Comes Next
The fair numbers pointed at four real bottlenecks: bulk insert at 1.9x, exports at 2.8x, FTS rebuild at 4.65x, and FTS ranked worsening at scale. Part 2 traces those costs to their root cause, and the answer, for three of the four, turned out to be the same thing.
