
sqler: Where the Overhead Actually Lives

Part 1 established the fair methodology and identified the real bottlenecks: bulk insert at 1.9x, exports at 2.8x, FTS rebuild at 4.65x, and FTS ranked worsening at scale. This post traces those costs to their source.

The answer, for three of the four, was the same: Pydantic.

Part 2 of 5. Part 1: The Methodology | Part 3: The Fixes | Part 4: The Scorecard | Part 5: The Columnar Baseline


Cross-Scale Validation: The Numbers Hold

Before optimizing anything, we ran the v1.2 fair suite at four scales (50K, 100K, 500K, and 1M rows) to confirm the ratios were stable rather than artifacts of a particular data size: 1,563 measurements in total across both storage modes.

The ratios held. Bulk insert was 1.87–1.92x from 50K to 1M; it didn’t spike at scale and didn’t improve either. Exports were 2.73–2.86x at every tested size. The only category that moved was FTS rebuild, improving from 4.65x at 50K to 3.78x at 1M as SQLite’s absolute work grew relative to the fixed Python overhead.

This matters because it means the costs are proportional: per-row overhead, not algorithmic blowup. Whatever was causing the 2.8x export gap was doing a constant amount of extra work per row, independent of table size. That pattern points at object construction.


The Pydantic Hydration Pipeline

Every row that sqler reads from SQLite passes through this pipeline:

Per-row cost breakdown

Three steps per row:

  1. json.loads(data) → Python dict (~200ns). Both arms pay this; it’s the cost of reading JSON from SQLite.
  2. model_validate(dict) → Pydantic model (~1,100ns). Type coercion, default injection, nested model construction, validator execution, field alias resolution. This is where sqler turns raw data into typed objects.
  3. _model_to_dict(model) → back to dict (~500ns). getattr(model, field) per field, plus _serialize_value() to undo Pydantic’s type coercion (datetime objects back to ISO strings).

Steps 2 and 3 cancel each other out on read-only paths. Pydantic parses ISO strings into datetime objects; _serialize_value converts them right back to ISO strings. The round-trip produces identical output to just passing the raw dict through, which means 1,600ns per row of object construction and deconstruction that produces nothing the raw dict didn’t already have. For any path where the output format is JSON (or derived from JSON), and the input is the data column (which IS JSON), the full hydration pipeline is pure overhead.
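The cancellation is easy to demonstrate with the standard library alone. This is a sketch, not sqler’s actual code: datetime.fromisoformat stands in for Pydantic’s coercion, and isoformat() stands in for _serialize_value():

```python
import json
from datetime import datetime

# A raw `data` column value, as it would be read from SQLite.
raw = '{"name": "alice", "created": "2024-01-15T09:30:00"}'

# Step 1: json.loads -- both arms pay this.
d = json.loads(raw)

# Step 2 (stand-in for model_validate): coerce the ISO string to a datetime.
coerced = dict(d, created=datetime.fromisoformat(d["created"]))

# Step 3 (stand-in for _model_to_dict / _serialize_value): undo the coercion.
roundtripped = dict(coerced, created=coerced["created"].isoformat())

# Steps 2 and 3 cancel: the result is identical to the step-1 dict.
assert roundtripped == d
```

The assertion holds for any row whose output format is JSON: the construction and deconstruction produce nothing the raw dict didn’t already contain.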

That accounts for the 2.8x export gap. The export path iterated through querysets, running the full ORM pipeline per row: query → fetch → deserialize JSON into a model → re-serialize for output. The baseline read raw JSON strings and wrote them directly.


The Export Fix: Bypass Pydantic Entirely

The fix was direct: if the data is written as JSON and will be read as JSON, skip the Python objects in between. Three tiers depending on what the format needs:

JSONL fast path (include_id=False, all fields): Zero-parse. Raw data column strings written directly to file. The data column is already valid JSON; there’s no reason to parse it.

JSON/JSONL with ID or field filter: One json.loads() to inject _id or filter fields, then json.dumps(). Skips Pydantic entirely.

CSV: json.loads() to extract fields, then CSV writer. Still needs the parse for per-field extraction, but skips model construction.
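The three tiers can be sketched against a plain sqlite3 table. The function names and the docs schema below are illustrative, not sqler’s real API:

```python
import csv
import io
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (_id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany(
    "INSERT INTO docs (data) VALUES (?)",
    [('{"name": "alice", "age": 30}',), ('{"name": "bob", "age": 25}',)],
)

def export_jsonl_fast(conn, out):
    # Tier 1: zero-parse. The data column is already valid JSON.
    for (data,) in conn.execute("SELECT data FROM docs"):
        out.write(data + "\n")

def export_jsonl_with_id(conn, out):
    # Tier 2: one loads/dumps round-trip to inject _id; no model construction.
    for _id, data in conn.execute("SELECT _id, data FROM docs"):
        doc = json.loads(data)
        doc["_id"] = _id
        out.write(json.dumps(doc) + "\n")

def export_csv(conn, out, fields):
    # Tier 3: parse for per-field extraction, then write tabular rows.
    writer = csv.writer(out)
    writer.writerow(fields)
    for (data,) in conn.execute("SELECT data FROM docs"):
        doc = json.loads(data)
        writer.writerow([doc.get(f, "") for f in fields])

buf = io.StringIO()
export_jsonl_fast(conn, buf)
assert buf.getvalue().splitlines()[0] == '{"name": "alice", "age": 30}'
```

Tier 1 never calls json.loads at all, which is why the JSONL fast path can reach parity with the raw baseline.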

Export ratios before and after

The results at 1M rows (disk mode):

Format          Before   After   What changed
CSV             2.86x    1.34x   Skip model hydration; json.loads() + field extract only
JSON            2.85x    0.97x   Skip model hydration; raw dict passthrough
JSONL           2.85x    1.06x   Skip model hydration; minimal parse
JSONL (no ID)   2.85x    0.98x   Zero-parse fast path; raw strings

JSON export is now at parity with raw sqlite3, and JSONL converges to parity at scale. CSV retains a 1.34x gap that’s irreducible: per-field extraction from dicts has a cost that both arms pay differently (sqler does json.loads() plus a dict comprehension; the baseline reads column values directly).

The cross-scale data confirms the ratios are stable:

Format          50K      100K     500K     1M
CSV             1.37x    1.34x    1.40x    1.34x
JSON            0.88x    0.98x    0.97x    0.97x
JSONL           1.03x    1.10x    1.06x    1.06x
JSONL (no ID)   1.21x    1.09x    0.99x    0.96x

JSONL-no-id reaches 0.96x at 1M on disk. The overhead is genuinely gone.

When is skipping Pydantic safe?

Bypassing validation is safe when all of these hold: the write path is trusted (data written by sqler through model.save() or bulk_upsert()), there’s no external mutation, no schema drift between writes and reads, the output is JSON or derived from JSON, and no validators have side effects.

It’s dangerous when multiple writers touch the database, when schema evolves (new fields with defaults, old rows missing them), or when downstream code assumes types that only Pydantic enforcement guarantees.

The export functions are terminal operations; they produce files, not objects that other code operates on. That makes them safe candidates for the bypass.


The Remaining Overhead Sources

After the export fix, the remaining gaps are:

Source                                     Cost at 50K   Formats affected
query._build_query() SQL construction      ~2–5ms        All
Cursor wrapping / adapter layer            ~2–5ms        All
_serialize_value(for_csv=True) per field   ~100ms        CSV only
json.loads() + _id injection               ~50ms         CSV, JSON, JSONL
Per-field dict comprehension               ~50ms         CSV only

The CSV overhead is structural. Converting JSON documents to tabular rows requires field-by-field extraction; there’s no shortcut that doesn’t change what CSV export means. The 1.34x is the cost of the format conversion itself, not the ORM.
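A minimal sketch of the two extraction styles makes the structural cost visible. The rows, field list, and the simulated baseline tuples here are made up for illustration:

```python
import json

rows = ['{"name": "alice", "age": 30}', '{"name": "bob", "age": 25}']
fields = ["name", "age"]

# sqler arm: one json.loads() per row, then a per-field dict comprehension.
sqler_rows = []
for data in rows:
    doc = json.loads(data)
    sqler_rows.append([doc.get(f, "") for f in fields])

# Baseline arm: sqlite3 hands back typed column values directly; simulated
# here with pre-split tuples. Same output, but no parse and no dict lookups.
baseline_rows = [["alice", 30], ["bob", 25]]

assert sqler_rows == baseline_rows
```

Both arms end up with the same tabular rows; the difference is that one must parse a document to get there, and that parse is what CSV export means for a JSON store.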


Alternatives to Pydantic

With hydration identified as the dominant cost on read paths, we evaluated alternatives:

Alternative                Hydration speedup        Tradeoff
msgspec Structs            ~8x (raw convert)        Different API; migration cost
model_construct()          ~3x (skips validation)   No validators, no coercion
cattrs / attrs             ~3–5x                    Different modeling paradigm
TypedDict + manual         ~2x                      Lose Pydantic ecosystem
Raw dicts (no hydration)   ~5.5x                    No type safety at all

The model_construct() dead end

Pydantic’s model_construct() skips validation, which sounds like the obvious shortcut for trusted data. In practice it ended up slower than model_validate() for sqler’s use case: model_construct() also skips type coercion, so downstream code that expects datetime objects gets strings, and working around that requires enough Python-side fixups to erase the savings. It’s also less safe: no validation means no protection against schema drift.
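The coercion gap is directly observable in Pydantic v2. The Event model here is a hypothetical example, not part of sqler:

```python
from datetime import datetime
from pydantic import BaseModel

class Event(BaseModel):
    name: str
    created: datetime

payload = {"name": "deploy", "created": "2024-01-15T09:30:00"}

validated = Event.model_validate(payload)       # coerces the ISO string
constructed = Event.model_construct(**payload)  # skips validation AND coercion

assert isinstance(validated.created, datetime)
assert isinstance(constructed.created, str)  # downstream code gets a string
```

Any code that calls constructed.created.isoformat() now raises AttributeError, and the fixups needed to prevent that are exactly the work model_validate was doing.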

The msgspec question

msgspec was the most promising alternative: 8x faster validation, full type safety, memory-efficient Structs. But four blockers stood in the way:

  1. PrivateAttr (hard blocker): sqler stores _id and _snapshot as Pydantic PrivateAttr. msgspec Structs have no equivalent private attribute mechanism.
  2. Field validators (medium): msgspec supports __post_init__ only; no per-field validators, no mode='before' preprocessing.
  3. model_fields introspection (medium): Used in ~20 call sites throughout sqler. All need rewriting from cls.model_fields to msgspec.structs.fields(cls).
  4. Computed fields: msgspec has no @computed_field.

A full migration would break every user’s models. But a parallel model base (SQLerMsgspecModel as an opt-in alternative to SQLerLiteModel) could sidestep all four blockers by making _id a declared Struct field with a default.

That prototype is covered in Part 3.


The Picture So Far

After the export fix and cross-scale validation:

Category         Ratio            Status
Queries          0.95–0.98x       No action needed*
JSON ops         0.98–0.99x       No action needed*
Aggregates       0.94–0.99x       No action needed*
Backup/restore   1.00–1.02x       No action needed
Export CSV       1.34x            Fixed (was 2.86x); irreducible remainder
Export JSON      0.97x            Fixed (was 2.85x); at parity
Export JSONL     1.06x            Fixed (was 2.85x); near parity
Bulk insert      1.87–1.92x       Needs work
any_where        1.47–1.51x       Needs work
FTS rebuild      3.78–4.65x       Needs investigation
FTS ranked       1.50x at scale   Needs investigation

*Ratios below 1.0 reflect the row factory artifact and memory-mode measurement from Part 1. Part 4 reports corrected disk-mode numbers at 1.03–1.15x.

Four items remain above 1.15x. Part 3 fixes all four across five milestones, including a new opt-in model backend that sidesteps Pydantic entirely.