Part 1 established the fair methodology. Part 2 traced the overhead to Pydantic and fixed exports (2.8x → parity). This post covers five targeted fixes for the remaining gaps.
Part 3 of 5. Part 1: The Methodology | Part 2: The Overhead | Part 4: The Scorecard | Part 5: The Columnar Baseline
Each fix follows the same loop: measure → find the asymmetry → fix → re-measure. Sometimes the asymmetry was in the code; sometimes it was in the benchmark.

M-1: FTS Rebuild (4.65x → 1.03x)
The bug
The 4.65x was never a code problem; it was a benchmark bug. sqler’s fts.rebuild() does a full repopulation: DELETE all FTS rows, then INSERT...SELECT from the source table’s JSON data. The sqlite3 baseline was calling FTS5’s built-in INSERT INTO fts(fts) VALUES('rebuild'), which is a segment merge operation; it reorganizes internal index structures but never re-reads source data.
These are fundamentally different operations. One scales with row count (O(n)); the other scales with segment count (near-constant). Comparing them is like benchmarking “rewrite the book” against “reorganize the table of contents.”
The fix
Made the baseline do the same work: DELETE + INSERT…SELECT from JSON, matching sqler’s actual rebuild implementation.
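The two operations are easy to confuse because both are called "rebuild." A minimal sketch of the difference, using hypothetical `docs`/`docs_fts` table names (not sqler's actual code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (_id INTEGER PRIMARY KEY, data TEXT)")
conn.execute("CREATE VIRTUAL TABLE docs_fts USING fts5(title)")
conn.executemany(
    "INSERT INTO docs (data) VALUES (json(?))",
    [('{"title": "alpha"}',), ('{"title": "beta"}',)],
)

# Full repopulation: re-reads every source row, O(n) in row count.
# This is what sqler's fts.rebuild() does.
conn.execute("DELETE FROM docs_fts")
conn.execute(
    "INSERT INTO docs_fts (rowid, title) "
    "SELECT _id, json_extract(data, '$.title') FROM docs"
)

# Segment merge: FTS5's built-in command. Reorganizes the index's
# internal segments from its own shadow tables; never touches docs.
conn.execute("INSERT INTO docs_fts(docs_fts) VALUES('rebuild')")

rows = conn.execute(
    "SELECT rowid FROM docs_fts WHERE docs_fts MATCH 'alpha'"
).fetchall()
assert rows == [(1,)]
```

Both leave the index searchable, which is why the benchmark bug went unnoticed: correctness checks pass either way, only the work performed differs.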
Results across scales (disk mode)
| Scale | sqler | sqlite | Ratio |
|---|---|---|---|
| 50K | 1,236ms | 1,185ms | 1.04x |
| 100K | 2,376ms | 2,320ms | 1.02x |
| 500K | 14,242ms | 13,313ms | 1.07x |
| 1M | 32,266ms | 31,410ms | 1.03x |
The 4.65x was entirely the benchmark comparing different operations. With matched operations, the overhead is 3–7%: the cost of sqler’s adapter layer over raw SQL execution. If your benchmark ratio drifts with dataset size, check whether the two arms scale the same way; operations that scale differently aren’t comparable at any size.
Bonus discovery: 3 more fairness issues
While fixing M-1, three additional asymmetries surfaced in the FTS benchmarks:
- Tokenizer mismatch: sqler used `porter unicode61`; the baseline used `unicode61` only. Different tokenizers produce different index structures and match different document sets.
- Search work asymmetry: the baseline returned raw FTS tuples; sqler did FTS query → second SQL query → `json.loads()` → `model_validate()` → SearchResult construction.
- No result count verification: the arms could be matching different numbers of documents without detection.
All three were fixed in v1.3. Total fairness issues found across the project: 23.
M-2: any_where Array Subqueries (1.50x → 1.04x)
The root cause
sqler was generating redundant SQL for array subqueries. The any_where path extracted the column value first, then passed it to json_each:
```sql
-- What sqler generated (redundant)
json_each(json_extract(data, '$.events'))

-- What it should generate (direct)
json_each(data, '$.events')
```
The json_extract call is unnecessary; json_each accepts a path argument directly. At 1M rows, the extra function call per row added ~3 seconds of pure waste.
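The two forms are interchangeable, which is what makes the extra call pure waste. A small sketch demonstrating the equivalence, with a hypothetical `events` array:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (_id INTEGER PRIMARY KEY, data TEXT)")
conn.execute(
    "INSERT INTO docs (data) VALUES (json(?))",
    ('{"events": ["login", "purchase"]}',),
)

# Redundant form: json_extract materializes the array, then
# json_each iterates the copy.
redundant = conn.execute(
    "SELECT je.value FROM docs, "
    "json_each(json_extract(docs.data, '$.events')) je"
).fetchall()

# Direct form: json_each accepts the path argument itself.
direct = conn.execute(
    "SELECT je.value FROM docs, json_each(docs.data, '$.events') je"
).fetchall()

assert redundant == direct == [("login",), ("purchase",)]
```

Identical results, one fewer JSON function call per row.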
Results across scales (disk mode)
| Scale | sqler | sqlite | Ratio |
|---|---|---|---|
| 50K | 294ms | 289ms | 1.02x |
| 100K | 567ms | 550ms | 1.03x |
| 500K | 2,968ms | 2,824ms | 1.05x |
| 1M | 6,476ms | 6,219ms | 1.04x |
From 1.50x to 1.04x. The remaining 4% is sqler’s adapter overhead; the redundant SQL was the entire gap.
M-3: Bulk Insert (1.92x → 0.89x)
The overhead breakdown
| Source | % of gap | Fixable? |
|---|---|---|
| Per-row `cursor()` + `execute()` | ~45% | Yes, batch into chunks |
| Per-row dict comprehension (filter `_id`) | ~25% | Yes, batch in list comp |
| Per-row `int(lastrowid)` + `append()` | ~15% | Yes, compute ID range |
| `json.dumps()` per row | ~15% | No, both arms pay this |
sqler was sending one SQL statement per row. The baseline used executemany(), which loops in C. The fix: chunked multi-row INSERT.
The fix
Split documents into inserts (no `_id`) and updates (has `_id`), then build multi-row SQL for each batch: one `INSERT INTO t (data) VALUES (json(?)), (json(?)), ...` per 999-row chunk, and one `INSERT ... ON CONFLICT(_id) DO UPDATE` per 499-row chunk (2 parameters per row, staying under SQLite's default 999-parameter limit).
At 50K rows, that’s ~50 SQL calls instead of 50,000. The parser overhead per statement is small (~1–2µs), but at 50K rows it compounds.
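The insert half of the strategy can be sketched in a few lines. This is an illustration under assumed table names, not sqler's implementation:

```python
import sqlite3

def bulk_insert(conn, docs, chunk=999):
    """Insert JSON documents in multi-row batches.

    Builds one INSERT ... VALUES (json(?)), (json(?)), ... statement
    per chunk of up to 999 rows, staying under SQLite's default
    999-parameter limit (one parameter per row here).
    """
    for start in range(0, len(docs), chunk):
        batch = docs[start:start + chunk]
        placeholders = ", ".join("(json(?))" for _ in batch)
        conn.execute(f"INSERT INTO docs (data) VALUES {placeholders}", batch)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (_id INTEGER PRIMARY KEY, data TEXT)")
bulk_insert(conn, ['{"n": %d}' % i for i in range(2500)])
```

2,500 documents become 3 SQL statements (999 + 999 + 502 rows) instead of 2,500.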
The crossover

| Rows | sqler (mem) | sqlite (mem) | Ratio |
|---|---|---|---|
| 1K | 7.4ms | 5.9ms | 1.25x |
| 5K | 30.4ms | 30.3ms | 1.00x |
| 10K | 59.6ms | 61.6ms | 0.97x |
| 25K | 144.8ms | 155.8ms | 0.93x |
| 50K | 299.2ms | 315.5ms | 0.95x |
At 5K rows sqler reaches parity; at 10K it’s faster than executemany().
How can an ORM beat the C-level executemany()? The answer is what SQLite’s parser does with each approach. executemany() sends N separate INSERT INTO t (data) VALUES (json(?)) statements: N parse, N compile, N execute steps. Multi-row INSERT sends one statement with up to 999 rows: 1 parse, 1 compile, 1 execute step. The parser savings compound at scale; at 1M rows, 50 SQL calls versus 1,000,000 is not a close race.
Cross-scale confirmation (disk mode)
| Scale | sqler | sqlite | Ratio |
|---|---|---|---|
| 50K | 351ms | 367ms | 0.96x |
| 100K | 637ms | 699ms | 0.91x |
| 500K | 3,212ms | 3,535ms | 0.91x |
| 1M | 7,006ms | 7,858ms | 0.89x |
At 1M rows, sqler is 11% faster than raw sqlite3’s executemany(). The lesson generalizes: any ORM that batches into multi-row INSERT will beat executemany() at sufficient scale, because the cost shifts from N statement parses to 1.
Fairness note
The comparison is fair but the strategies differ: sqler uses chunked multi-row INSERT; the baseline uses executemany(). Both are legitimate bulk insert approaches. executemany is what most developers use with raw sqlite3, making it the natural baseline. A truly equivalent test would have both arms use the same strategy, but that would measure nothing interesting.
M-4: FTS Ranked Search (1.50x → 1.00x)
The problem
FTS ranked search was the only scenario that worsened at scale:
| Scale | Ratio |
|---|---|
| 50K | 0.95x |
| 100K | 0.70x (noise; sub-50ms values) |
| 500K | 1.52x |
| 1M | 1.50x |
A ratio that grows with dataset size is the signature of an algorithmic difference, not constant overhead. Something in sqler’s implementation was scaling worse than the baseline.
Root cause: two-query pattern
search_ranked() used two separate SQL queries:
```sql
-- Query 1: FTS search for rowids + scores
SELECT rowid, bm25(fts_table) AS score
FROM fts_table WHERE fts_table MATCH ? ORDER BY score LIMIT ?;

-- Query 2: fetch documents by ID list
SELECT _id, data FROM source_table WHERE _id IN (?, ?, ...);
```
Plus Python-side overhead: from_ids() → find_documents() → _batch_resolve() → model_validate() per row → dict lookup + sort to match scores back to documents.
The fix: single JOIN
```sql
SELECT t._id, t.data, bm25(fts_table) AS score
FROM fts_table f
JOIN source_table t ON t._id = f.rowid
WHERE fts_table MATCH ?
ORDER BY score
LIMIT ? OFFSET ?;
```
This eliminates the second query, the from_ids() path, the _batch_resolve() call, and the Python-side sort. SQL handles the ordering; the JOIN handles the document fetch. The baseline was updated to use the same single-JOIN approach for parity.
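End to end, the single-JOIN pattern looks like this in plain sqlite3. A sketch with hypothetical `docs`/`docs_fts` tables, not sqler's code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (_id INTEGER PRIMARY KEY, data TEXT)")
conn.execute("CREATE VIRTUAL TABLE docs_fts USING fts5(body)")
for _id, body in [(1, "fast red fox"), (2, "slow red panda")]:
    conn.execute(
        "INSERT INTO docs (_id, data) VALUES (?, json_object('body', ?))",
        (_id, body),
    )
    conn.execute("INSERT INTO docs_fts (rowid, body) VALUES (?, ?)", (_id, body))

# One statement: match, rank with bm25, and fetch documents via the JOIN.
rows = conn.execute(
    """
    SELECT d._id, d.data, bm25(docs_fts) AS score
    FROM docs_fts
    JOIN docs d ON d._id = docs_fts.rowid
    WHERE docs_fts MATCH ?
    ORDER BY score
    LIMIT ? OFFSET ?
    """,
    ("panda", 10, 0),
).fetchall()
assert len(rows) == 1 and rows[0][0] == 2
```

SQLite returns documents already in rank order; Python only has to hydrate them.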
Results across scales (disk mode)
| Scale | sqler | sqlite | Ratio |
|---|---|---|---|
| 50K | 30.5ms | 31.1ms | 0.98x |
| 100K | 67.8ms | 67.5ms | 1.01x |
| 500K | 479.1ms | 479.8ms | 1.00x |
| 1M | 1,036.8ms | 1,033.3ms | 1.00x |
Perfect parity at 500K and 1M. The only scenario that previously worsened at scale is now flat.
One more fairness fix (issue #23)
The initial results after M-4 showed 0.73x at 25K; sqler apparently “faster” than raw sqlite3 for FTS ranked search. That’s suspicious; sqler still does model_validate() per row, so it can’t genuinely be faster.
Root cause: the baseline’s create() method populated the FTS table before the rebuild timer started, leaving extra tombstones in FTS5’s shadow tables. After 23 rebuild cycles, these accumulated and made the baseline ~15% slower. Fixed by making create() only create the virtual table (matching sqler’s behavior).
This is a genuinely subtle trap. FTS5 tombstones accumulate across DELETE + INSERT cycles; a fresh FTS5 index is ~2x faster to search than one that’s been through 23 rebuild iterations. Any benchmark that measures post-rebuild search performance must ensure both arms have the same tombstone history.
M-5: The msgspec Prototype (5.1x hydration)
Part 2 identified msgspec as the most promising Pydantic alternative. M-5 built the prototype: SQLerMsgspecModel, a parallel model base backed by msgspec.Struct instead of dataclasses.
Design decisions
- `_id` and `_snapshot` are declared Struct fields with defaults (`Optional[int] = None`), not private attributes. This means `model_validate()` can pass the raw dict straight to `msgspec.convert()`: no dict filtering, no post-processing.
- `kw_only=True` on the base Struct.
- No `from __future__ import annotations` (it breaks msgspec's annotation evaluation).
- No type annotations on class-level attributes (msgspec treats them as Struct fields).
- Full API compatibility with `SQLerLiteModel`.
Two optimization rounds
Round 1: The initial prototype showed 2.1x pure hydration, decent but well below the 8x theoretical ceiling. Profiling revealed that dict filtering (`{k: v for k, v in d.items() if not k.startswith("_")}`) before `msgspec.convert()` cost 59% of the hydration time. It was completely unnecessary: `_id` is a declared field and `strict=False` handles unknown keys. Removing it jumped the speedup to 4.3x.
Round 2: After round 1, model_dump() was 16x slower than the dataclass version. Root cause: msgspec.structs.fields() does uncached reflection at ~70µs/call. At 50K calls, field introspection ate 89% of model_dump(). Fix: per-class module-level cache. model_dump() went from 4,062ms to 189ms, making it 1.4x faster than lite.
Final results (50K rows)
| Scenario | Lite (dataclass) | Msgspec (Struct) | Speedup |
|---|---|---|---|
| Pure `model_validate()` | 151.6ms | 29.7ms | 5.1x |
| `model_dump()` | 256ms | 189ms | 1.4x |
| `queryset.all()` (memory) | 473.7ms | 324.4ms | 1.46x |
| `queryset.all()` (disk) | 454.8ms | 311.1ms | 1.46x |
Isolating hydration from I/O
The end-to-end queryset time includes both SQL I/O and Python hydration. Using as_dicts() (no hydration) as a baseline:
| | Lite | Msgspec |
|---|---|---|
| `queryset.all()` (mem) | 473.7ms | 324.4ms |
| `as_dicts()` (mem, no hydration) | 170.9ms | 170.9ms |
| Hydration only (diff) | ~303ms | ~153ms |
SQL + JSON parse takes ~171ms regardless of model backend. The hydration cost dropped from ~303ms to ~153ms; a 2x reduction in the ORM’s contribution to wall time.
Ratios hold at scale
| Scale | Pure validate | End-to-end all() | Hydration only |
|---|---|---|---|
| 50K | 4.5x | 1.41x | 1.83x |
| 100K | 4.2x | 1.45x | 1.95x |
| 250K | 4.3x | 1.36x | 1.74x |
| 500K | 4.2x | 1.47x | 2.13x |
Pure hydration: stable 4.2–4.5x with no degradation at size. End-to-end: stable 1.36–1.47x, bounded by SQL I/O.
The lesson
Cache everything that reflects. msgspec’s structs.fields() looks cheap at the call site but does real work per invocation. At 50K calls, uncached field reflection dominated model_dump(), making the “fast” backend 16x slower than dataclasses. The fix was trivial (module-level dict cache); the bug was invisible without profiling.
Summary: Before and After
| Milestone | Target | Before | After | Root cause |
|---|---|---|---|---|
| M-1 | FTS rebuild | 4.65x | 1.03x | Benchmark comparing different operations |
| M-2 | any_where | 1.50x | 1.04x | Redundant json_extract() in json_each() |
| M-3 | Bulk insert | 1.92x | 0.89x | Per-row SQL → chunked multi-row INSERT |
| M-4 | FTS ranked | 1.50x | 1.00x | Two-query pattern → single JOIN |
| M-5 | Hydration | 1.0x (lite) | 5.1x vs lite | msgspec Struct backend (opt-in) |
Two of the five “code bugs” were actually benchmark bugs (M-1, M-4’s tombstone issue). The fixes made the numbers worse in the short term but produced data you can trust. That’s the pattern from this whole series: the right answer is rarely the flattering one.
Part 4 runs the final cross-scale validation: 1,725 measurements confirming everything holds at 1M rows.
