sqler is a document-oriented JSON store built on SQLite. Pydantic models go in; json_extract queries come out. The question every ORM benchmark needs to answer: what does the abstraction actually cost you?
The first step is making sure the question is fair.
This is Part 1 of a five-part series on sqler’s benchmark journey. Part 2 covers where the overhead actually lives. Part 3 covers five targeted fixes. Part 4 is the final scorecard. Part 5 measures the architectural cost of document storage itself.
The v1.1 Suite: 22 Scenarios, 22 Problems
The first benchmark suite measured 22 scenarios across 5 categories: inserts, queries, JSON operations, full-text search, and operational tasks. The numbers looked good; sqler was within 1–2x of raw sqlite3 on most things, sometimes faster.
That should have been a red flag. An ORM that wraps every call in Python cannot genuinely be faster than the C code underneath it. When your results seem too good, either you’ve built something remarkable or you’ve built a biased benchmark; the odds favor the second explanation.
We ran an adversarial audit: the kind where you pretend a hostile reviewer is about to tear your methodology apart in public, and you ask, for every measurement, “if I wanted to make the other arm win, what would I change?”
The audit found 18 fairness issues (later rounds raised the total to 23; see Part 3 for the full count). Eight were HIGH severity: biases that made sqler look dramatically better than it actually was. Ten were MEDIUM: subtler asymmetries that nudged results in sqler’s favor. Not a single one biased against sqler; the whole suite was a flattery machine.
The 18 Fairness Issues
HIGH Severity (8 issues)
These are the ones that invalidated categories outright.
| # | Issue | What was unfair |
|---|---|---|
| H-1 | PRAGMA mismatch | sqler got 32MB cache, WAL mode, synchronous=OFF. Baseline got 2MB cache, rollback journal, synchronous=FULL. |
| H-2 | SQL mismatch | sqler used json_each(data, '$.path') (direct). Baseline used json_each(json_extract(data, '$.path')) (redundant wrap). |
| H-3 | Different insert APIs | sqler used bulk_upsert() with per-row execute(). Baseline used executemany() (C-level batch). |
| H-4 | Commit semantics | sqler’s insert_document() auto-committed per row (N commits). Baseline committed once. |
| H-5 | Missing deserialization | sqler’s .all() returned parsed dicts. Baseline’s fetchall() returned raw sqlite3.Row tuples; no json.loads(). |
| H-6 | Cold cache bypass | Cold-cache measurement had no warmup, no GC isolation, redefined the decorated function inside the loop. |
| H-7 | Export skip | Baseline wrote raw JSON strings from the data column. sqler did json.loads() → dict → json.dumps() per row. |
| H-8 | Input mutation | bulk_upsert() mutated docs in-place (set doc["_id"]), turning INSERT into UPDATE on subsequent iterations. |
H-1 alone could account for a 2–3x difference. Giving one arm WAL mode with a 16x larger page cache and then claiming “similar performance” is measuring configuration, not code; this is the most common bias in ORM benchmarks, and the easiest to fix.
H-5 is the sneakiest. If the baseline doesn’t call json.loads(), it’s not doing equivalent work; it’s returning a different data type entirely. Any query benchmark that returns deserialized dicts on one side and raw tuples on the other is comparing apples to the concept of fruit.
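A minimal sketch of the fix (table and column names are illustrative): a fair baseline has to pay the same deserialization cost, parsing every row with json.loads() instead of handing back raw strings.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (data TEXT)")
conn.execute("INSERT INTO docs VALUES (?)", ('{"name": "a", "tags": ["x"]}',))

# Unfair baseline (H-5): raw JSON strings come back -- a different result
# type entirely, and none of the parsing work the ORM does.
raw_rows = conn.execute("SELECT data FROM docs").fetchall()

# Fair baseline: deserialize every row, matching the ORM's work.
docs = [json.loads(row[0]) for row in conn.execute("SELECT data FROM docs")]
```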
MEDIUM Severity (10 issues)
| # | Issue | What was unfair |
|---|---|---|
| M-1 | Order bias | sqler always measured first in all 22 scenarios. CPU turbo boost, allocator state, cache warming all favored arm 1. |
| M-2 | Row factory on sqler only | sqler paid sqlite3.Row construction overhead; baseline returned cheap tuples. |
| M-3 | Per-query logger overhead | sqler called time.perf_counter() twice + query_logger.log() on every SQL execution. |
| M-4 | No GC reset between arms | Sequential measurement with accumulated heap state. |
| M-5 | WAL vs rollback | sqler used WAL mode (readers don’t block writers). Baseline used default rollback journal. |
| M-6 | Asymmetric connection handling | sqler pre-opened connections; baseline paid sqlite3.connect() + close() inside the timed section. |
| M-7 | DDL in timing window | First db.query("bench") ran CREATE TABLE IF NOT EXISTS inside the timed window. |
| M-8 | Restore PRAGMA overhead | sqler’s SQLerDB.on_disk(rst) ran 8 PRAGMA roundtrips. Baseline had zero. |
| M-9 | No baseline for highlights | sqler measured search_with_highlights with no sqlite3 counterpart. |
| M-10 | bool(first()) vs SELECT 1 | sqler’s .first() fetched a full document + json.loads(). Baseline selected constant 1. |
Any one of these is a small thumb on the scale; together they add up to a benchmark that systematically flatters one arm. The bias isn’t intentional; it’s what happens when you write benchmarks and never ask “what would make the other arm win?”
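Several of these issues (M-6, M-7) reduce to one rule: setup work belongs outside the timed window, for both arms. A hypothetical sketch, with illustrative names:

```python
import sqlite3
import time

def timed(fn):
    """Time a single call; everything else stays outside the window."""
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

# Setup, untimed for BOTH arms: connection open (M-6) and DDL (M-7)
# happen before the clock starts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS bench (data TEXT)")

elapsed = timed(lambda: conn.execute("SELECT count(*) FROM bench").fetchone())
```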
The v1.2 Rewrite: Making Every Number Worse
The fix was straightforward but painful: match everything. Every asymmetry the audit found was resolved in the same direction: give the baseline whatever advantage sqler had been getting.
Matched PRAGMAs
Both arms now get identical configuration per storage mode:
Memory mode (both arms):
```sql
PRAGMA foreign_keys = ON;
PRAGMA synchronous = OFF;
PRAGMA journal_mode = MEMORY;
PRAGMA temp_store = MEMORY;
PRAGMA cache_size = -32000;
PRAGMA locking_mode = EXCLUSIVE;
```
Disk mode (both arms):
```sql
PRAGMA foreign_keys = ON;
PRAGMA busy_timeout = 5000;
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA cache_size = -64000;
PRAGMA wal_autocheckpoint = 1000;
PRAGMA mmap_size = 268435456;
PRAGMA temp_store = MEMORY;
```
If sqler sets PRAGMAs internally, the baseline gets the same ones. No exceptions.
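One way to enforce this (a sketch, not sqler's actual internals): keep a single pragma list and apply it to every connection, whichever arm opens it, so configuration can never drift between arms.

```python
import sqlite3

# Shared configuration: both arms draw from the SAME list.
MEMORY_PRAGMAS = [
    "PRAGMA foreign_keys = ON",
    "PRAGMA synchronous = OFF",
    "PRAGMA journal_mode = MEMORY",
    "PRAGMA temp_store = MEMORY",
    "PRAGMA cache_size = -32000",
    "PRAGMA locking_mode = EXCLUSIVE",
]

def apply_pragmas(conn, pragmas):
    """Apply a shared pragma list to any connection, sqler or baseline."""
    for pragma in pragmas:
        conn.execute(pragma)

conn = sqlite3.connect(":memory:")
apply_pragmas(conn, MEMORY_PRAGMAS)
```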
Matched SQL
Both arms execute the same SQL patterns. Where sqler generates json_each(data, '$.tags'), the baseline uses json_each(data, '$.tags'), not json_each(json_extract(data, '$.tags')). The redundant json_extract wrapper in v1.1 added measurable overhead to the baseline that sqler didn’t pay.
Matched Serialization
Both arms return deserialized data. The baseline calls json.loads() on every row, just like sqler does; no more comparing parsed dicts against raw tuples.
Arm Alternation
Execution order flips deterministically per scenario via a hash function; no “sqler always runs first” bias from CPU turbo boost, cache warming, or allocator state.
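A minimal sketch of deterministic alternation (the suite's actual hash function may differ): derive the starting arm from the scenario name, so order is reproducible across runs but roughly half the scenarios start with each arm.

```python
import hashlib

def first_arm(scenario):
    # Deterministic per scenario name: the same scenario always starts
    # with the same arm, but no arm always runs first across the suite.
    digest = hashlib.sha256(scenario.encode()).digest()
    return "sqler" if digest[0] % 2 == 0 else "sqlite"
```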
GC Isolation
gc.collect() runs between arms; no cross-contamination from accumulated heap state.
Both Storage Modes
Every scenario runs in memory and on disk. The v1.2 suite uses a 4-arm matrix: sqler_mem, sqler_disk, sqlite_mem, sqlite_disk. No cherry-picking the mode that looks better.
What the Fair Numbers Showed
Methodology: 3 warmup runs, then 20 timed iterations per scenario, measured with time.perf_counter() and GC disabled during timing. Cross-scale ratios at 1M rows, memory mode:
| Category | v1.2 (fair) ratio | Trend |
|---|---|---|
| Queries (filter, range, complex) | 0.95–0.97x | Rock-solid parity |
| JSON ops (contains, isin) | 0.99x | Parity |
| Aggregates (sum, avg, min, max) | 0.94–0.95x | Parity* |
| Backup/restore | 1.00–1.02x | Transparent |
| Bulk insert | 1.87–1.92x | Stable ~1.9x |
| any_where (array subqueries) | 1.47–1.51x | Stable ~1.5x |
| Export (CSV/JSONL) | 2.73–2.85x | Stable ~2.8x |
| FTS rebuild | 3.78–4.65x | Scale-dependent |
| FTS ranked search | 0.70–1.52x | Worsening at scale |
*Asterisk on aggregates and queries: the baseline uses sqlite3.Row with string key access (row["data"]), while sqler uses integer index access (row[0]). Both execute identical SQL; the 3–6% gap is the row factory artifact, consistent across all query and aggregate scenarios. We document it rather than pretend sqler’s query layer is genuinely faster than raw sqlite3.
The query layer is effectively free in memory mode; Part 4 reports corrected disk-mode numbers at 1.03–1.15x after accounting for the row factory artifact. The real costs are bulk insert (1.9x), array subqueries (1.5x), exports (2.8x), and FTS operations (3.78–4.65x). These are fair numbers that point at real bottlenecks instead of flattering the tool that generated them.
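The measurement protocol behind these numbers (3 warmup runs, 20 timed iterations, GC disabled during timing, collection between arms) can be sketched as a small harness. This is a simplified illustration, not the suite's exact code:

```python
import gc
import statistics
import time

def measure(fn, warmup=3, iters=20):
    for _ in range(warmup):          # warmup runs, untimed
        fn()
    gc.collect()                     # reset heap state between arms (M-4)
    gc.disable()                     # no GC pauses inside the timed window
    try:
        samples = []
        for _ in range(iters):
            t0 = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - t0)
    finally:
        gc.enable()
    return statistics.median(samples)
```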

The Methodology Checklist
For anyone writing ORM benchmarks (or any paired A/B comparison), here’s the fairness checklist that caught 18 issues in ours:
Configuration parity:
- Both arms use the same database settings (PRAGMAs, cache sizes, journal modes)
- Both arms get the same warmup treatment
Operation parity:
- Both arms execute equivalent SQL
- Both arms do the same serialization work
- Both arms use the same commit semantics
- Both arms return the same result types
Measurement parity:
- Both arms use the same timer
- GC is disabled during measurement for both arms
- Arm execution order alternates
- Connection setup is either both inside or both outside the timed window
The single most useful question: “If I wanted to make the other arm win, what would I change?” If the answer reveals an asymmetry, fix it before running a single measurement.
Running It Yourself
```shell
uv sync --all-groups
uv run python -m benchmarks run --scale medium --storage both
uv run --group benchmarks python -m benchmarks plot
```
Environment: Python 3.12 | SQLite 3.50.4 | Linux x86_64
The suite enforces matched PRAGMAs, arm alternation, and GC isolation automatically.
What Comes Next
The fair numbers pointed at four real bottlenecks: bulk insert at 1.9x, exports at 2.8x, FTS rebuild at 4.65x, and FTS ranked worsening at scale. Part 2 traces those costs to their root cause, and the answer, for three of the four, turned out to be the same thing.
