Part 1 established the fair methodology. Part 2 traced the overhead to Pydantic and fixed exports (2.8x → parity). This post covers five targeted fixes for the remaining gaps.
Part 3 of 5. Part 1: The Methodology | Part 2: The Overhead | Part 4: The Scorecard | Part 5: The Columnar Baseline
Each fix follows the same loop: measure → find the asymmetry → fix → re-measure. Sometimes the asymmetry was in the code; sometimes it was in the benchmark.

M-1: FTS Rebuild (4.65x → 1.03x)
The bug
The 4.65x was never a code problem; it was a benchmark bug. sqler’s fts.rebuild() does a full repopulation: DELETE all FTS rows, then INSERT...SELECT from the source table’s JSON data. The sqlite3 baseline was calling FTS5’s built-in INSERT INTO fts(fts) VALUES('rebuild'), which is a segment merge operation; it reorganizes internal index structures but never re-reads source data.
These are fundamentally different operations. One scales with row count (O(n)); the other scales with segment count (near-constant). Comparing them is like benchmarking “rewrite the book” against “reorganize the table of contents.”
The fix
Made the baseline do the same work: DELETE + INSERT…SELECT from JSON, matching sqler’s actual rebuild implementation.
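The two operations are easy to confuse because both are called "rebuild." A minimal sketch of the difference, using hypothetical `docs`/`docs_fts` table names (not sqler's actual code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (_id INTEGER PRIMARY KEY, data TEXT)")
conn.execute("CREATE VIRTUAL TABLE docs_fts USING fts5(title)")
conn.executemany(
    "INSERT INTO docs (data) VALUES (json(?))",
    [('{"title": "alpha"}',), ('{"title": "beta"}',)],
)

# Full repopulation: re-reads every source row, O(n) in row count.
# This is what sqler's fts.rebuild() does.
conn.execute("DELETE FROM docs_fts")
conn.execute(
    "INSERT INTO docs_fts (rowid, title) "
    "SELECT _id, json_extract(data, '$.title') FROM docs"
)

# Segment merge: FTS5's built-in command. Reorganizes the index's
# internal segments from its own shadow tables; never touches docs.
conn.execute("INSERT INTO docs_fts(docs_fts) VALUES('rebuild')")

rows = conn.execute(
    "SELECT rowid FROM docs_fts WHERE docs_fts MATCH 'alpha'"
).fetchall()
assert rows == [(1,)]
```

Both leave the index searchable, which is why the benchmark bug went unnoticed: correctness checks pass either way, only the work performed differs.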
Results across scales (disk mode)
| Scale | sqler | sqlite | Ratio |
|---|---|---|---|
| 50K | 1,236ms | 1,185ms | 1.04x |
| 100K | 2,376ms | 2,320ms | 1.02x |
| 500K | 14,242ms | 13,313ms | 1.07x |
| 1M | 32,266ms | 31,410ms | 1.03x |
The 4.65x was entirely the benchmark comparing different operations. With matched operations, the overhead is 3–7%: the cost of sqler’s adapter layer over raw SQL execution. If your benchmark ratio drifts with dataset size, check whether the two arms scale the same way; operations that scale differently aren’t comparable at any size.
Bonus discovery: 3 more fairness issues
While fixing M-1, three additional asymmetries surfaced in the FTS benchmarks:
- Tokenizer mismatch: sqler used `porter unicode61`; the baseline used `unicode61` only. Different tokenizers produce different index structures and match different document sets.
- Search work asymmetry: the baseline returned raw FTS tuples; sqler did FTS query → second SQL query → `json.loads()` → `model_validate()` → SearchResult construction.
- No result count verification: the arms could be matching different numbers of documents without detection.
All three were fixed in v1.3. Total fairness issues found across the project: 23.
M-2: any_where Array Subqueries (1.50x → 1.04x)
The root cause
sqler was generating redundant SQL for array subqueries. The any_where path extracted the column value first, then passed it to json_each:
```sql
-- What sqler generated (redundant)
json_each(json_extract(data, '$.events'))

-- What it should generate (direct)
json_each(data, '$.events')
```
The json_extract call is unnecessary; json_each accepts a path argument directly. At 1M rows, the extra function call per row added ~3 seconds of pure waste.
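The two forms are interchangeable, which is what makes the extra call pure waste. A small sketch demonstrating the equivalence, with a hypothetical `events` array:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (_id INTEGER PRIMARY KEY, data TEXT)")
conn.execute(
    "INSERT INTO docs (data) VALUES (json(?))",
    ('{"events": ["login", "purchase"]}',),
)

# Redundant form: json_extract materializes the array, then
# json_each iterates the copy.
redundant = conn.execute(
    "SELECT je.value FROM docs, "
    "json_each(json_extract(docs.data, '$.events')) je"
).fetchall()

# Direct form: json_each accepts the path argument itself.
direct = conn.execute(
    "SELECT je.value FROM docs, json_each(docs.data, '$.events') je"
).fetchall()

assert redundant == direct == [("login",), ("purchase",)]
```

Identical results, one fewer JSON function call per row.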
Results across scales (disk mode)
| Scale | sqler | sqlite | Ratio |
|---|---|---|---|
| 50K | 294ms | 289ms | 1.02x |
| 100K | 567ms | 550ms | 1.03x |
| 500K | 2,968ms | 2,824ms | 1.05x |
| 1M | 6,476ms | 6,219ms | 1.04x |
From 1.50x to 1.04x. The remaining 4% is sqler’s adapter overhead; the redundant SQL was the entire gap.
M-3: Bulk Insert (1.92x → 0.89x)
The overhead breakdown
| Source | % of gap | Fixable? |
|---|---|---|
| Per-row `cursor()` + `execute()` | ~45% | Yes, batch into chunks |
| Per-row dict comprehension (filter `_id`) | ~25% | Yes, batch in list comp |
| Per-row `int(lastrowid)` + `append()` | ~15% | Yes, compute ID range |
| `json.dumps()` per row | ~15% | No, both arms pay this |
sqler was sending one SQL statement per row. The baseline used executemany(), which loops in C. The fix: chunked multi-row INSERT.
The fix
Split documents into inserts (no `_id`) and updates (has `_id`), then build multi-row SQL for each batch: one `INSERT INTO t (data) VALUES (json(?)), (json(?)), ...` per 999-row chunk, and one `INSERT ... ON CONFLICT(_id) DO UPDATE` per 499-row chunk (2 parameters per row, staying under SQLite's default 999-parameter limit).
At 50K rows, that’s ~50 SQL calls instead of 50,000. The parser overhead per statement is small (~1–2µs), but at 50K rows it compounds.
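The insert half of the strategy can be sketched in a few lines. This is an illustration under assumed table names, not sqler's implementation:

```python
import sqlite3

def bulk_insert(conn, docs, chunk=999):
    """Insert JSON documents in multi-row batches.

    Builds one INSERT ... VALUES (json(?)), (json(?)), ... statement
    per chunk of up to 999 rows, staying under SQLite's default
    999-parameter limit (one parameter per row here).
    """
    for start in range(0, len(docs), chunk):
        batch = docs[start:start + chunk]
        placeholders = ", ".join("(json(?))" for _ in batch)
        conn.execute(f"INSERT INTO docs (data) VALUES {placeholders}", batch)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (_id INTEGER PRIMARY KEY, data TEXT)")
bulk_insert(conn, ['{"n": %d}' % i for i in range(2500)])
```

2,500 documents become 3 SQL statements (999 + 999 + 502 rows) instead of 2,500.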
The crossover

| Rows | sqler (mem) | sqlite (mem) | Ratio |
|---|---|---|---|
| 1K | 7.4ms | 5.9ms | 1.25x |
| 5K | 30.4ms | 30.3ms | 1.00x |
| 10K | 59.6ms | 61.6ms | 0.97x |
| 25K | 144.8ms | 155.8ms | 0.93x |
| 50K | 299.2ms | 315.5ms | 0.95x |
At 5K rows sqler reaches parity; at 10K it’s faster than executemany().
How can an ORM beat the C-level executemany()? The answer is what SQLite’s parser does with each approach. executemany() sends N separate INSERT INTO t (data) VALUES (json(?)) statements: N parse, N compile, N execute steps. Multi-row INSERT sends one statement with up to 999 rows: 1 parse, 1 compile, 1 execute step. The parser savings compound at scale; at 1M rows, 50 SQL calls versus 1,000,000 is not a close race.
Cross-scale confirmation (disk mode)
| Scale | sqler | sqlite | Ratio |
|---|---|---|---|
| 50K | 351ms | 367ms | 0.96x |
| 100K | 637ms | 699ms | 0.91x |
| 500K | 3,212ms | 3,535ms | 0.91x |
| 1M | 7,006ms | 7,858ms | 0.89x |
At 1M rows, sqler is 11% faster than raw sqlite3’s executemany(). The lesson generalizes: any ORM that batches into multi-row INSERT will beat executemany() at sufficient scale, because the cost shifts from N statement parses to 1.
Fairness note
The comparison is fair but the strategies differ: sqler uses chunked multi-row INSERT; the baseline uses executemany(). Both are legitimate bulk insert approaches. executemany is what most developers use with raw sqlite3, making it the natural baseline. A truly equivalent test would have both arms use the same strategy, but that would measure nothing interesting.
M-4: FTS Ranked Search (1.50x → 1.00x)
The problem
FTS ranked search was the only scenario that worsened at scale:
| Scale | Ratio |
|---|---|
| 50K | 0.95x |
| 100K | 0.70x (noise; sub-50ms values) |
| 500K | 1.52x |
| 1M | 1.50x |
A ratio that grows with dataset size is the signature of an algorithmic difference, not constant overhead. Something in sqler’s implementation was scaling worse than the baseline.
Root cause: two-query pattern
search_ranked() used two separate SQL queries:
```sql
-- Query 1: FTS search for rowids + scores
SELECT rowid, bm25(fts_table) AS score
FROM fts_table WHERE fts_table MATCH ? ORDER BY score LIMIT ?;

-- Query 2: fetch documents by ID list
SELECT _id, data FROM source_table WHERE _id IN (?, ?, ...);
```
Plus Python-side overhead: from_ids() → find_documents() → _batch_resolve() → model_validate() per row → dict lookup + sort to match scores back to documents.
The fix: single JOIN
```sql
SELECT t._id, t.data, bm25(fts_table) AS score
FROM fts_table f
JOIN source_table t ON t._id = f.rowid
WHERE fts_table MATCH ?
ORDER BY score
LIMIT ? OFFSET ?;
```
This eliminates the second query, the from_ids() path, the _batch_resolve() call, and the Python-side sort. SQL handles the ordering; the JOIN handles the document fetch. The baseline was updated to use the same single-JOIN approach for parity.
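End to end, the single-JOIN pattern looks like this in plain sqlite3. A sketch with hypothetical `docs`/`docs_fts` tables, not sqler's code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (_id INTEGER PRIMARY KEY, data TEXT)")
conn.execute("CREATE VIRTUAL TABLE docs_fts USING fts5(body)")
for _id, body in [(1, "fast red fox"), (2, "slow red panda")]:
    conn.execute(
        "INSERT INTO docs (_id, data) VALUES (?, json_object('body', ?))",
        (_id, body),
    )
    conn.execute("INSERT INTO docs_fts (rowid, body) VALUES (?, ?)", (_id, body))

# One statement: match, rank with bm25, and fetch documents via the JOIN.
rows = conn.execute(
    """
    SELECT d._id, d.data, bm25(docs_fts) AS score
    FROM docs_fts
    JOIN docs d ON d._id = docs_fts.rowid
    WHERE docs_fts MATCH ?
    ORDER BY score
    LIMIT ? OFFSET ?
    """,
    ("panda", 10, 0),
).fetchall()
assert len(rows) == 1 and rows[0][0] == 2
```

SQLite returns documents already in rank order; Python only has to hydrate them.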
Results across scales (disk mode)
| Scale | sqler | sqlite | Ratio |
|---|---|---|---|
| 50K | 30.5ms | 31.1ms | 0.98x |
| 100K | 67.8ms | 67.5ms | 1.01x |
| 500K | 479.1ms | 479.8ms | 1.00x |
| 1M | 1,036.8ms | 1,033.3ms | 1.00x |
Perfect parity at 500K and 1M. The only scenario that previously worsened at scale is now flat.
One more fairness fix (issue #23)
The initial results after M-4 showed 0.73x at 25K; sqler apparently “faster” than raw sqlite3 for FTS ranked search. That’s suspicious; sqler still does model_validate() per row, so it can’t genuinely be faster.
Root cause: the baseline’s create() method populated the FTS table before the rebuild timer started, leaving extra tombstones in FTS5’s shadow tables. After 23 rebuild cycles, these accumulated and made the baseline ~15% slower. Fixed by making create() only create the virtual table (matching sqler’s behavior).
This is a genuinely subtle trap. FTS5 tombstones accumulate across DELETE + INSERT cycles; a fresh FTS5 index is ~2x faster to search than one that’s been through 23 rebuild iterations. Any benchmark that measures post-rebuild search performance must ensure both arms have the same tombstone history.
M-5: The msgspec Prototype (5.1x hydration)
Part 2 identified msgspec as the most promising Pydantic alternative. M-5 built the prototype: SQLerMsgspecModel, a parallel model base backed by msgspec.Struct instead of dataclasses.
Design decisions
- `_id` and `_snapshot` are declared Struct fields with defaults (`Optional[int] = None`), not private attributes. This means `model_validate()` can pass the raw dict straight to `msgspec.convert()`: no dict filtering, no post-processing.
- `kw_only=True` on the base Struct.
- No `from __future__ import annotations` (it breaks msgspec's annotation evaluation).
- No type annotations on class-level attributes (msgspec treats them as Struct fields).
- Full API compatibility with `SQLerLiteModel`.
Two optimization rounds
Round 1: The initial prototype showed 2.1x pure hydration, decent but well below the 8x theoretical ceiling. Profiling revealed that dict filtering (`{k: v for k, v in d.items() if not k.startswith("_")}`) before `msgspec.convert()` cost 59% of the hydration time. It was completely unnecessary: `_id` is a declared field and `strict=False` handles unknown keys. Removing it jumped the speedup to 4.3x.
Round 2: After round 1, model_dump() was 16x slower than the dataclass version. Root cause: msgspec.structs.fields() does uncached reflection at ~70µs/call. At 50K calls, field introspection ate 89% of model_dump(). Fix: per-class module-level cache. model_dump() went from 4,062ms to 189ms, making it 1.4x faster than lite.
Final results (50K rows)
| Scenario | Lite (dataclass) | Msgspec (Struct) | Speedup |
|---|---|---|---|
| Pure `model_validate()` | 151.6ms | 29.7ms | 5.1x |
| `model_dump()` | 256ms | 189ms | 1.4x |
| `queryset.all()` (memory) | 473.7ms | 324.4ms | 1.46x |
| `queryset.all()` (disk) | 454.8ms | 311.1ms | 1.46x |
Isolating hydration from I/O
The end-to-end queryset time includes both SQL I/O and Python hydration. Using as_dicts() (no hydration) as a baseline:
| | Lite | Msgspec |
|---|---|---|
| `queryset.all()` (mem) | 473.7ms | 324.4ms |
| `as_dicts()` (mem, no hydration) | 170.9ms | 170.9ms |
| Hydration only (diff) | ~303ms | ~153ms |
SQL + JSON parse takes ~171ms regardless of model backend. The hydration cost dropped from ~303ms to ~153ms; a 2x reduction in the ORM’s contribution to wall time.
Ratios hold at scale
| Scale | Pure validate | End-to-end all() | Hydration only |
|---|---|---|---|
| 50K | 4.5x | 1.41x | 1.83x |
| 100K | 4.2x | 1.45x | 1.95x |
| 250K | 4.3x | 1.36x | 1.74x |
| 500K | 4.2x | 1.47x | 2.13x |
Pure hydration: stable 4.2–4.5x with no degradation at size. End-to-end: stable 1.36–1.47x, bounded by SQL I/O.
The lesson
Cache everything that reflects. msgspec’s structs.fields() looks cheap at the call site but does real work per invocation. At 50K calls, uncached field reflection dominated model_dump(), making the “fast” backend 16x slower than dataclasses. The fix was trivial (module-level dict cache); the bug was invisible without profiling.
Summary: Before and After
| Milestone | Target | Before | After | Root cause |
|---|---|---|---|---|
| M-1 | FTS rebuild | 4.65x | 1.03x | Benchmark comparing different operations |
| M-2 | any_where | 1.50x | 1.04x | Redundant json_extract() in json_each() |
| M-3 | Bulk insert | 1.92x | 0.89x | Per-row SQL → chunked multi-row INSERT |
| M-4 | FTS ranked | 1.50x | 1.00x | Two-query pattern → single JOIN |
| M-5 | Hydration | 1.0x (lite) | 5.1x vs lite | msgspec Struct backend (opt-in) |
Two of the five “code bugs” were actually benchmark bugs (M-1, M-4’s tombstone issue). The fixes made the numbers worse in the short term but produced data you can trust. That’s the pattern from this whole series: the right answer is rarely the flattering one.
Part 4 runs the final cross-scale validation: 1,725 measurements confirming everything holds at 1M rows.
