
sqler: Five Fixes, Five Days

Part 1 established the fair methodology. Part 2 traced the overhead to Pydantic and fixed exports (2.8x → parity). This post covers five targeted fixes for the remaining gaps.

Part 3 of 5. Part 1: The Methodology | Part 2: The Overhead | Part 4: The Scorecard | Part 5: The Columnar Baseline

Each fix follows the same loop: measure → find the asymmetry → fix → re-measure. Sometimes the asymmetry was in the code; sometimes it was in the benchmark.

Milestone improvements


M-1: FTS Rebuild (4.65x → 1.03x)

The bug

The 4.65x was never a code problem; it was a benchmark bug. sqler’s fts.rebuild() does a full repopulation: DELETE all FTS rows, then INSERT...SELECT from the source table’s JSON data. The sqlite3 baseline was calling FTS5’s built-in INSERT INTO fts(fts) VALUES('rebuild'), which is a segment merge operation; it reorganizes internal index structures but never re-reads source data.

These are fundamentally different operations. One scales with row count (O(n)); the other scales with segment count (near-constant). Comparing them is like benchmarking “rewrite the book” against “reorganize the table of contents.”

The fix

Made the baseline do the same work: DELETE + INSERT…SELECT from JSON, matching sqler’s actual rebuild implementation.
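A minimal sketch of what the matched baseline work looks like, using the Python stdlib sqlite3 module. The `docs`/`docs_fts` names and the `title` field are illustrative, not the benchmark's actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE docs (_id INTEGER PRIMARY KEY, data TEXT);
    CREATE VIRTUAL TABLE docs_fts USING fts5(title);
""")
con.executemany(
    "INSERT INTO docs (data) VALUES (json(?))",
    [('{"title": "alpha beta"}',), ('{"title": "gamma delta"}',)],
)

def rebuild_full(con):
    # Full repopulation: re-reads every source row, so it scales O(n).
    # This matches the DELETE + INSERT...SELECT rebuild described above,
    # unlike INSERT INTO docs_fts(docs_fts) VALUES('rebuild'), which only
    # merges index segments and never re-reads source data.
    con.execute("DELETE FROM docs_fts")
    con.execute("""
        INSERT INTO docs_fts (rowid, title)
        SELECT _id, json_extract(data, '$.title') FROM docs
    """)

rebuild_full(con)
hits = con.execute(
    "SELECT rowid FROM docs_fts WHERE docs_fts MATCH 'alpha'"
).fetchall()
```

With both arms doing this same O(n) repopulation, the comparison measures adapter overhead rather than two unrelated operations.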

Results across scales (disk mode)

Scale   sqler      sqlite     Ratio
50K     1,236ms    1,185ms    1.04x
100K    2,376ms    2,320ms    1.02x
500K    14,242ms   13,313ms   1.07x
1M      32,266ms   31,410ms   1.03x

The 4.65x was entirely the benchmark comparing different operations. With matched operations, the overhead is 3–7%: the cost of sqler’s adapter layer over raw SQL execution. If your benchmark ratio drifts with dataset size, check whether the two arms scale the same way; operations that scale differently aren’t comparable at any size.

Bonus discovery: 3 more fairness issues

While fixing M-1, three additional asymmetries surfaced in the FTS benchmarks:

  • Tokenizer mismatch: sqler used porter unicode61; the baseline used unicode61 only. Different tokenizers produce different index structures and match different document sets.
  • Search work asymmetry: the baseline returned raw FTS tuples; sqler did FTS query → second SQL query → json.loads() → model_validate() → SearchResult construction.
  • No result count verification: the arms could be matching different numbers of documents without detection.

All three were fixed in v1.3. Total fairness issues found across the project: 23.


M-2: any_where Array Subqueries (1.50x → 1.04x)

The root cause

sqler was generating redundant SQL for array subqueries. The any_where path extracted the column value first, then passed it to json_each:

-- What sqler generated (redundant)
json_each(json_extract(data, '$.events'))

-- What it should generate (direct)
json_each(data, '$.events')

The json_extract call is unnecessary; json_each accepts a path argument directly. At 1M rows, the extra function call per row added ~3 seconds of pure waste.
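Both forms can be verified against each other with the stdlib sqlite3 module; the table name and the `events` field here are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (_id INTEGER PRIMARY KEY, data TEXT)")
con.execute(
    "INSERT INTO t (data) VALUES (json(?))",
    ('{"events": ["signup", "login", "purchase"]}',),
)

# Redundant form: extract the array first, then iterate it.
redundant = con.execute("""
    SELECT je.value FROM t, json_each(json_extract(t.data, '$.events')) je
""").fetchall()

# Direct form: json_each accepts the path argument itself,
# skipping one function call per row.
direct = con.execute("""
    SELECT je.value FROM t, json_each(t.data, '$.events') je
""").fetchall()

assert redundant == direct == [("signup",), ("login",), ("purchase",)]
```

Identical results, one fewer per-row function call: exactly the kind of waste that is invisible at 1K rows and costs seconds at 1M.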

Results across scales (disk mode)

Scale   sqler     sqlite    Ratio
50K     294ms     289ms     1.02x
100K    567ms     550ms     1.03x
500K    2,968ms   2,824ms   1.05x
1M      6,476ms   6,219ms   1.04x

From 1.50x to 1.04x. The remaining 4% is sqler’s adapter overhead; the redundant SQL was the entire gap.


M-3: Bulk Insert (1.92x → 0.89x)

The overhead breakdown

Source                                    % of gap   Fixable?
Per-row cursor() + execute()              ~45%       Yes, batch into chunks
Per-row dict comprehension (filter _id)   ~25%       Yes, batch in list comp
Per-row int(lastrowid) + append()         ~15%       Yes, compute ID range
json.dumps() per row                      ~15%       No, both arms pay this

sqler was sending one SQL statement per row. The baseline used executemany(), which loops in C. The fix: chunked multi-row INSERT.

The fix

Split documents into inserts (no _id) and updates (has _id), then build multi-row SQL for each batch. One INSERT INTO t (data) VALUES (json(?)), (json(?)), ... per 999-row chunk. One INSERT ... ON CONFLICT(_id) DO UPDATE per 499-row chunk (2 params per row, staying under sqlite’s 999 parameter limit).

At 50K rows, that’s ~50 SQL calls instead of 50,000. The parser overhead per statement is small (~1–2µs), but at 50K rows it compounds.
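A minimal sketch of the chunked multi-row strategy for the insert path, under stated assumptions: the `bulk_insert` name and `t`/`data` schema are illustrative, and the ID-range trick assumes a plain INTEGER PRIMARY KEY with no concurrent writers (so each chunk occupies a contiguous rowid range):

```python
import sqlite3

SQLITE_MAX_PARAMS = 999  # historical default parameter limit

def bulk_insert(con, table, json_docs, chunk_size=SQLITE_MAX_PARAMS):
    """Insert JSON documents via chunked multi-row INSERT.

    One statement per chunk instead of one per row. Returns the
    assigned row ids without a per-row lastrowid lookup.
    """
    ids = []
    for start in range(0, len(json_docs), chunk_size):
        chunk = json_docs[start:start + chunk_size]
        placeholders = ", ".join("(json(?))" for _ in chunk)
        cur = con.execute(
            f"INSERT INTO {table} (data) VALUES {placeholders}", chunk
        )
        # lastrowid is the id of the LAST row inserted by the statement;
        # with contiguous assignment the chunk's ids are a simple range.
        last = cur.lastrowid
        ids.extend(range(last - len(chunk) + 1, last + 1))
    return ids

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (_id INTEGER PRIMARY KEY, data TEXT)")
docs = ['{"n": %d}' % i for i in range(2500)]
ids = bulk_insert(con, "t", docs)  # 3 statements instead of 2,500
```

The range computation also eliminates the per-row `int(lastrowid)` + `append()` cost from the overhead table above.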

The crossover

Bulk insert crossover

Rows   sqler (mem)   sqlite (mem)   Ratio
1K     7.4ms         5.9ms          1.25x
5K     30.4ms        30.3ms         1.00x
10K    59.6ms        61.6ms         0.97x
25K    144.8ms       155.8ms        0.93x
50K    299.2ms       315.5ms        0.95x

At 5K rows sqler reaches parity; at 10K it’s faster than executemany().

How can an ORM beat the C-level executemany()? The answer is per-statement overhead. executemany() executes INSERT INTO t (data) VALUES (json(?)) once per row: even with the statement prepared once, that is N bind-and-step cycles. Multi-row INSERT covers up to 999 rows per statement: one bind-and-step cycle per chunk. The savings compound at scale; at 1M rows, roughly 1,000 statement executions versus 1,000,000 is not a close race.

Cross-scale confirmation (disk mode)

Scale   sqler     sqlite    Ratio
50K     351ms     367ms     0.96x
100K    637ms     699ms     0.91x
500K    3,212ms   3,535ms   0.91x
1M      7,006ms   7,858ms   0.89x

At 1M rows, sqler is 11% faster than raw sqlite3’s executemany(). The lesson generalizes: any ORM that batches into multi-row INSERT will beat executemany() at sufficient scale, because the per-statement cost is paid once per chunk instead of once per row.

Fairness note

The comparison is fair but the strategies differ: sqler uses chunked multi-row INSERT; the baseline uses executemany(). Both are legitimate bulk insert approaches. executemany is what most developers use with raw sqlite3, making it the natural baseline. A truly equivalent test would have both arms use the same strategy, but that would measure nothing interesting.


M-4: FTS Ranked Search (1.50x → 1.00x)

The problem

FTS ranked search was the only scenario that worsened at scale:

Scale   Ratio
50K     0.95x
100K    0.70x (noise; sub-50ms values)
500K    1.52x
1M      1.50x

A ratio that grows with dataset size is the signature of an algorithmic difference, not constant overhead. Something in sqler’s implementation was scaling worse than the baseline.

Root cause: two-query pattern

search_ranked() used two separate SQL queries:

-- Query 1: FTS search for rowids + scores
SELECT rowid, bm25(fts_table) as score
FROM fts_table WHERE fts_table MATCH ? ORDER BY score LIMIT ?;

-- Query 2: Fetch documents by ID list
SELECT _id, data FROM source_table WHERE _id IN (?, ?, ...);

Plus Python-side overhead: from_ids() → find_documents() → _batch_resolve() → model_validate() per row → dict lookup + sort to match scores back to documents.

The fix: single JOIN

SELECT source_table._id, source_table.data, bm25(fts_table) AS score
FROM fts_table
JOIN source_table ON source_table._id = fts_table.rowid
WHERE fts_table MATCH ?
ORDER BY score
LIMIT ? OFFSET ?;

This eliminates the second query, the from_ids() path, the _batch_resolve() call, and the Python-side sort. SQL handles the ordering; the JOIN handles the document fetch. The baseline was updated to use the same single-JOIN approach for parity.
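The single-JOIN pattern can be sketched end to end with the stdlib sqlite3 module. The `docs`/`docs_fts` names are illustrative; note that bm25() returns more-negative scores for better matches, so ORDER BY score ascending puts the best match first:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE docs (_id INTEGER PRIMARY KEY, data TEXT);
    CREATE VIRTUAL TABLE docs_fts USING fts5(body);
""")
rows = [
    (1, '{"body": "quick brown fox"}', "quick brown fox"),
    (2, '{"body": "lazy dog"}', "lazy dog"),
    (3, '{"body": "quick quick quick"}', "quick quick quick"),
]
for _id, data, body in rows:
    con.execute("INSERT INTO docs (_id, data) VALUES (?, json(?))", (_id, data))
    con.execute("INSERT INTO docs_fts (rowid, body) VALUES (?, ?)", (_id, body))

# One round trip: FTS match, bm25 ranking, and document fetch in one JOIN.
# No second query, no Python-side re-sort to marry scores to documents.
results = con.execute("""
    SELECT docs._id, docs.data, bm25(docs_fts) AS score
    FROM docs_fts
    JOIN docs ON docs._id = docs_fts.rowid
    WHERE docs_fts MATCH ?
    ORDER BY score
    LIMIT ? OFFSET ?
""", ("quick", 10, 0)).fetchall()
```

Each row comes back already ordered by relevance with its document payload attached, which is exactly what the two-query pattern reassembled in Python.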

Results across scales (disk mode)

Scale   sqler       sqlite      Ratio
50K     30.5ms      31.1ms      0.98x
100K    67.8ms      67.5ms      1.01x
500K    479.1ms     479.8ms     1.00x
1M      1,036.8ms   1,033.3ms   1.00x

Perfect parity at 500K and 1M. The only scenario that previously worsened at scale is now flat.

One more fairness fix (issue #23)

The initial results after M-4 showed 0.73x at 25K; sqler apparently “faster” than raw sqlite3 for FTS ranked search. That’s suspicious; sqler still does model_validate() per row, so it can’t genuinely be faster.

Root cause: the baseline’s create() method populated the FTS table before the rebuild timer started, leaving extra tombstones in FTS5’s shadow tables. After 23 rebuild cycles, these accumulated and made the baseline ~15% slower. Fixed by making create() only create the virtual table (matching sqler’s behavior).

This is a genuinely subtle trap. FTS5 tombstones accumulate across DELETE + INSERT cycles; a fresh FTS5 index is ~2x faster to search than one that’s been through 23 rebuild iterations. Any benchmark that measures post-rebuild search performance must ensure both arms have the same tombstone history.


M-5: The msgspec Prototype (5.1x hydration)

Part 2 identified msgspec as the most promising Pydantic alternative. M-5 built the prototype: SQLerMsgspecModel, a parallel model base backed by msgspec.Struct instead of dataclasses.

Design decisions

  • _id and _snapshot are declared Struct fields with defaults (Optional[int] = None), not private attributes, so model_validate() can pass the raw dict straight to msgspec.convert(): no dict filtering, no post-processing.
  • kw_only=True on the base Struct.
  • No from __future__ import annotations (it breaks msgspec’s annotation evaluation).
  • No type annotations on class-level attributes (msgspec treats them as Struct fields).
  • Full API compatibility with SQLerLiteModel.

Two optimization rounds

Round 1: The initial prototype showed 2.1x pure hydration, decent but well below the 8x theoretical ceiling. Profiling revealed that filtering the dict ({k: v for k, v in d.items() if not k.startswith("_")}) before msgspec.convert() cost 59% of the hydration time. It was completely unnecessary: _id is a declared field, and strict=False handles unknown keys. Removing it jumped the speedup to 4.3x.

Round 2: After round 1, model_dump() was 16x slower than the dataclass version. Root cause: msgspec.structs.fields() does uncached reflection at ~70µs/call. At 50K calls, field introspection ate 89% of model_dump(). Fix: per-class module-level cache. model_dump() went from 4,062ms to 189ms, making it 1.4x faster than lite.

Final results (50K rows)

Scenario                  Lite (dataclass)   Msgspec (Struct)   Speedup
Pure model_validate()     151.6ms            29.7ms             5.1x
model_dump()              256ms              189ms              1.4x
queryset.all() (memory)   473.7ms            324.4ms            1.46x
queryset.all() (disk)     454.8ms            311.1ms            1.46x

Isolating hydration from I/O

The end-to-end queryset time includes both SQL I/O and Python hydration. Using as_dicts() (no hydration) as a baseline:

                                 Lite      Msgspec
queryset.all() (mem)             473.7ms   324.4ms
as_dicts() (mem, no hydration)   170.9ms   170.9ms
Hydration only (diff)            ~303ms    ~153ms

SQL + JSON parse takes ~171ms regardless of model backend. The hydration cost dropped from ~303ms to ~153ms; a 2x reduction in the ORM’s contribution to wall time.

Ratios hold at scale

Scale   Pure validate   End-to-end all()   Hydration only
50K     4.5x            1.41x              1.83x
100K    4.2x            1.45x              1.95x
250K    4.3x            1.36x              1.74x
500K    4.2x            1.47x              2.13x

Pure hydration: stable 4.2–4.5x with no degradation at size. End-to-end: stable 1.36–1.47x, bounded by SQL I/O.

The lesson

Cache everything that reflects. msgspec’s structs.fields() looks cheap at the call site but does real work per invocation. At 50K calls, uncached field reflection dominated model_dump(), making the “fast” backend 16x slower than dataclasses. The fix was trivial (module-level dict cache); the bug was invisible without profiling.


Summary: Before and After

Milestone   Target        Before        After          Root cause
M-1         FTS rebuild   4.65x         1.03x          Benchmark comparing different operations
M-2         any_where     1.50x         1.04x          Redundant json_extract() in json_each()
M-3         Bulk insert   1.92x         0.89x          Per-row SQL → chunked multi-row INSERT
M-4         FTS ranked    1.50x         1.00x          Two-query pattern → single JOIN
M-5         Hydration     1.0x (lite)   5.1x vs lite   msgspec Struct backend (opt-in)

Two of the five “code bugs” were actually benchmark bugs (M-1, M-4’s tombstone issue). The fixes made the numbers worse in the short term but produced data you can trust. That’s the pattern from this whole series: the right answer is rarely the flattering one.

Part 4 runs the final cross-scale validation: 1,725 measurements confirming everything holds at 1M rows.