
sqler: Where the Overhead Actually Lives

Part 1 established the fair methodology and identified the real bottlenecks: bulk insert at 1.9x, exports at 2.8x, FTS rebuild at 4.65x, and FTS ranked worsening at scale. This post traces those costs to their source.

The answer, for three of the four, was the same: Pydantic.

Part 2 of 5. Part 1: The Methodology | Part 3: The Fixes | Part 4: The Scorecard | Part 5: The Columnar Baseline


Cross-Scale Validation: The Numbers Hold

Before optimizing anything, we ran the v1.2 fair suite at four scales (50K, 100K, 500K, and 1M rows) to confirm the ratios were stable rather than artifacts of a particular data size: 1,563 measurements in total across both storage modes.

The ratios held. Bulk insert was 1.87–1.92x from 50K to 1M; it didn’t spike at scale and didn’t improve either. Exports were 2.73–2.86x at every tested size. The only category that moved was FTS rebuild, improving from 4.65x at 50K to 3.78x at 1M as SQLite’s absolute work grew relative to the fixed Python overhead.

This matters because it means the costs are proportional: per-row overhead, not algorithmic blowup. Whatever was causing the 2.8x export gap was doing a constant amount of extra work per row, independent of table size. That pattern points at object construction.


The Pydantic Hydration Pipeline

Every row that sqler reads from SQLite passes through this pipeline:

Per-row cost breakdown

Three steps per row:

  1. json.loads(data) → Python dict (~200ns). Both arms pay this; it’s the cost of reading JSON from SQLite.
  2. model_validate(dict) → Pydantic model (~1,100ns). Type coercion, default injection, nested model construction, validator execution, field alias resolution. This is where sqler turns raw data into typed objects.
  3. _model_to_dict(model) → back to dict (~500ns). getattr(model, field) per field, plus _serialize_value() to undo Pydantic’s type coercion (datetime objects back to ISO strings).

Steps 2 and 3 cancel each other out on read-only paths. Pydantic parses ISO strings into datetime objects; _serialize_value converts them right back to ISO strings. The round-trip produces identical output to just passing the raw dict through, which means 1,600ns per row of object construction and deconstruction that produces nothing the raw dict didn’t already have. For any path where the output format is JSON (or derived from JSON), and the input is the data column (which IS JSON), the full hydration pipeline is pure overhead.
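The cancellation is easy to demonstrate with the standard library alone. This is a sketch, not sqler’s actual code: datetime.fromisoformat stands in for Pydantic’s coercion, and isoformat() stands in for _serialize_value():

```python
import json
from datetime import datetime

# A raw `data` column value, as it would be read from SQLite.
raw = '{"name": "alice", "created": "2024-01-15T09:30:00"}'

# Step 1: json.loads -- both arms pay this.
d = json.loads(raw)

# Step 2 (stand-in for model_validate): coerce the ISO string to a datetime.
coerced = dict(d, created=datetime.fromisoformat(d["created"]))

# Step 3 (stand-in for _model_to_dict / _serialize_value): undo the coercion.
roundtripped = dict(coerced, created=coerced["created"].isoformat())

# Steps 2 and 3 cancel: the result is identical to the step-1 dict.
assert roundtripped == d
```

The assertion holds for any row whose output format is JSON: the construction and deconstruction produce nothing the raw dict didn’t already contain.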

That accounts for the 2.8x export gap. The export path iterated through querysets, running the full ORM pipeline per row: query → fetch → deserialize JSON into a model → re-serialize for output. The baseline read raw JSON strings and wrote them directly.


The Export Fix: Bypass Pydantic Entirely

The fix was direct: if the data is written as JSON and will be read as JSON, skip the Python objects in between. Three tiers depending on what the format needs:

JSONL fast path (include_id=False, all fields): Zero-parse. Raw data column strings written directly to file. The data column is already valid JSON; there’s no reason to parse it.

JSON/JSONL with ID or field filter: One json.loads() to inject _id or filter fields, then json.dumps(). Skips Pydantic entirely.

CSV: json.loads() to extract fields, then CSV writer. Still needs the parse for per-field extraction, but skips model construction.
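The three tiers can be sketched against a plain sqlite3 table. The function names and the docs schema below are illustrative, not sqler’s real API:

```python
import csv
import io
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (_id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany(
    "INSERT INTO docs (data) VALUES (?)",
    [('{"name": "alice", "age": 30}',), ('{"name": "bob", "age": 25}',)],
)

def export_jsonl_fast(conn, out):
    # Tier 1: zero-parse. The data column is already valid JSON.
    for (data,) in conn.execute("SELECT data FROM docs"):
        out.write(data + "\n")

def export_jsonl_with_id(conn, out):
    # Tier 2: one loads/dumps round-trip to inject _id; no model construction.
    for _id, data in conn.execute("SELECT _id, data FROM docs"):
        doc = json.loads(data)
        doc["_id"] = _id
        out.write(json.dumps(doc) + "\n")

def export_csv(conn, out, fields):
    # Tier 3: parse for per-field extraction, then write tabular rows.
    writer = csv.writer(out)
    writer.writerow(fields)
    for (data,) in conn.execute("SELECT data FROM docs"):
        doc = json.loads(data)
        writer.writerow([doc.get(f, "") for f in fields])

buf = io.StringIO()
export_jsonl_fast(conn, buf)
assert buf.getvalue().splitlines()[0] == '{"name": "alice", "age": 30}'
```

Tier 1 never calls json.loads at all, which is why the JSONL fast path can reach parity with the raw baseline.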

Export ratios before and after

The results at 1M rows (disk mode):

Format          Before   After   What changed
CSV             2.86x    1.34x   Skip model hydration; json.loads() + field extract only
JSON            2.85x    0.97x   Skip model hydration; raw dict passthrough
JSONL           2.85x    1.06x   Skip model hydration; minimal parse
JSONL (no ID)   2.85x    0.98x   Zero-parse fast path; raw strings

JSON export is now at parity with raw sqlite3, and JSONL converges to parity at scale. CSV retains a 1.34x gap that’s irreducible: per-field extraction from dicts has a cost that both arms pay differently (sqler does json.loads() plus a dict comprehension; the baseline reads column values directly).

The cross-scale data confirms the ratios are stable:

Format          50K      100K     500K     1M
CSV             1.37x    1.34x    1.40x    1.34x
JSON            0.88x    0.98x    0.97x    0.97x
JSONL           1.03x    1.10x    1.06x    1.06x
JSONL (no ID)   1.21x    1.09x    0.99x    0.96x

JSONL-no-id reaches 0.96x at 1M on disk. The overhead is genuinely gone.

When is skipping Pydantic safe?

Bypassing validation is safe when all of these hold: the write path is trusted (data written by sqler through model.save() or bulk_upsert()), there’s no external mutation, no schema drift between writes and reads, the output is JSON or derived from JSON, and no validators have side effects.

It’s dangerous when multiple writers touch the database, when schema evolves (new fields with defaults, old rows missing them), or when downstream code assumes types that only Pydantic enforcement guarantees.

The export functions are terminal operations; they produce files, not objects that other code operates on. That makes them safe candidates for the bypass.


The Remaining Overhead Sources

After the export fix, the remaining gaps are:

Source                                     Cost at 50K   Formats affected
query._build_query() SQL construction      ~2–5ms        All
Cursor wrapping / adapter layer            ~2–5ms        All
_serialize_value(for_csv=True) per field   ~100ms        CSV only
json.loads() + _id injection               ~50ms         CSV, JSON, JSONL
Per-field dict comprehension               ~50ms         CSV only

The CSV overhead is structural. Converting JSON documents to tabular rows requires field-by-field extraction; there’s no shortcut that doesn’t change what CSV export means. The 1.34x is the cost of the format conversion itself, not the ORM.
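A minimal sketch of the two extraction styles makes the structural cost visible. The rows, field list, and the simulated baseline tuples here are made up for illustration:

```python
import json

rows = ['{"name": "alice", "age": 30}', '{"name": "bob", "age": 25}']
fields = ["name", "age"]

# sqler arm: one json.loads() per row, then a per-field dict comprehension.
sqler_rows = []
for data in rows:
    doc = json.loads(data)
    sqler_rows.append([doc.get(f, "") for f in fields])

# Baseline arm: sqlite3 hands back typed column values directly; simulated
# here with pre-split tuples. Same output, but no parse and no dict lookups.
baseline_rows = [["alice", 30], ["bob", 25]]

assert sqler_rows == baseline_rows
```

Both arms end up with the same tabular rows; the difference is that one must parse a document to get there, and that parse is what CSV export means for a JSON store.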


Alternatives to Pydantic

With hydration identified as the dominant cost on read paths, we evaluated alternatives:

Alternative                Hydration speedup        Tradeoff
msgspec Structs            ~8x (raw convert)        Different API; migration cost
model_construct()          ~3x (skips validation)   No validators, no coercion
cattrs / attrs             ~3–5x                    Different modeling paradigm
TypedDict + manual         ~2x                      Lose Pydantic ecosystem
Raw dicts (no hydration)   ~5.5x                    No type safety at all

The model_construct() dead end

Pydantic’s model_construct() skips validation, which sounds like the obvious shortcut for trusted data. In practice it ended up slower than model_validate() for sqler’s use case: model_construct() also skips type coercion, so downstream code that expects datetime objects gets strings, and working around that requires enough Python-side fixups to erase the savings. It’s also less safe: no validation means no protection against schema drift.
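The coercion gap is directly observable in Pydantic v2. The Event model here is a hypothetical example, not part of sqler:

```python
from datetime import datetime
from pydantic import BaseModel

class Event(BaseModel):
    name: str
    created: datetime

payload = {"name": "deploy", "created": "2024-01-15T09:30:00"}

validated = Event.model_validate(payload)       # coerces the ISO string
constructed = Event.model_construct(**payload)  # skips validation AND coercion

assert isinstance(validated.created, datetime)
assert isinstance(constructed.created, str)  # downstream code gets a string
```

Any code that calls constructed.created.isoformat() now raises AttributeError, and the fixups needed to prevent that are exactly the work model_validate was doing.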

The msgspec question

msgspec was the most promising alternative: 8x faster validation, full type safety, memory-efficient Structs. But four blockers stood in the way:

  1. PrivateAttr (hard blocker): sqler stores _id and _snapshot as Pydantic PrivateAttr. msgspec Structs have no equivalent private attribute mechanism.
  2. Field validators (medium): msgspec supports __post_init__ only; no per-field validators, no mode='before' preprocessing.
  3. model_fields introspection (medium): Used in ~20 call sites throughout sqler. All need rewriting from cls.model_fields to msgspec.structs.fields(cls).
  4. Computed fields: msgspec has no @computed_field.

A full migration would break every user’s models. But a parallel model base (SQLerMsgspecModel as an opt-in alternative to SQLerLiteModel) could sidestep all four blockers by making _id a declared Struct field with a default.

That prototype is covered in Part 3.


The Picture So Far

After the export fix and cross-scale validation:

Category         Ratio            Status
Queries          0.95–0.98x       No action needed*
JSON ops         0.98–0.99x       No action needed*
Aggregates       0.94–0.99x       No action needed*
Backup/restore   1.00–1.02x       No action needed
Export CSV       1.34x            Fixed (was 2.86x); irreducible remainder
Export JSON      0.97x            Fixed (was 2.85x); at parity
Export JSONL     1.06x            Fixed (was 2.85x); near parity
Bulk insert      1.87–1.92x       Needs work
any_where        1.47–1.51x       Needs work
FTS rebuild      3.78–4.65x       Needs investigation
FTS ranked       1.50x at scale   Needs investigation

*Ratios below 1.0 reflect the row factory artifact and memory-mode measurement from Part 1. Part 4 reports corrected disk-mode numbers at 1.03–1.15x.

Four items remain above 1.15x. Part 3 fixes all four across five milestones, including a new opt-in model backend that sidesteps Pydantic entirely.