Parts 1–4 measured ORM overhead: sqler versus raw `sqlite3`, both storing data as `_id INTEGER, data JSON`. Both arms used `json_extract()` for queries; both parsed JSON on reads. The gap was 5–15%, and that was the cost of the abstraction layer — query compilation, adapter wrapping, Pydantic hydration.
This post asks a different question. Not “what does the ORM cost?” but “what does the document-oriented architecture itself cost?”
Part 5 of the series. Part 1: The Methodology | Part 2: The Overhead | Part 3: The Fixes | Part 4: The Scorecard
## The Comparison
sqler arm: Full ORM with Pydantic models, `json_extract()` queries, JSON blob storage. The developer experience as shipped.
Columnar arm: Traditional typed schema — `name TEXT, value INTEGER, category TEXT, score REAL`. Direct column access; no `json_extract()`; no `json.loads()` on scalar fields. Standard `executemany()` inserts with per-column binding.
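As a concrete sketch of the two storage layouts (table and column names here are illustrative, not necessarily the harness's actual identifiers):

```python
import json
import sqlite3

doc = {"name": "widget", "value": 7, "category": "tech",
       "score": 0.9, "tags": ["a", "b"]}  # tags stay JSON in both arms

conn = sqlite3.connect(":memory:")

# Document arm: one JSON blob per row
conn.execute("CREATE TABLE doc_arm (_id INTEGER PRIMARY KEY, data JSON)")
conn.execute("INSERT INTO doc_arm (data) VALUES (?)", (json.dumps(doc),))

# Columnar arm: typed columns for scalar fields, JSON only for the array
conn.execute("""CREATE TABLE col_arm (
    _id INTEGER PRIMARY KEY,
    name TEXT, value INTEGER, category TEXT, score REAL, tags JSON)""")
conn.execute(
    "INSERT INTO col_arm (name, value, category, score, tags) "
    "VALUES (?, ?, ?, ?, ?)",
    (doc["name"], doc["value"], doc["category"], doc["score"],
     json.dumps(doc["tags"])),
)
```

Same document, two representations: the scalar fields live inside a single `data` blob on one side and in typed columns on the other.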
Both arms share:
- Identical connection setup (same `create_conn()` function)
- Identical PRAGMAs (WAL, 64MB cache, normal sync)
- Same data (`DocumentGenerator`, seed=42, “small” profile)
- Same measurement methodology (`PrecisionTimer`, 20 iterations, 3 warmup, GC isolation, arm alternation)
- Dual storage modes (memory and disk)
The `tags` field stays JSON in both arms — arrays have no good columnar representation without junction tables. The comparison focuses on the scalar fields: `name`, `value`, `category`, `score`.
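The shared setup might look something like this — a minimal sketch of a `create_conn()`-style factory applying the PRAGMAs listed above (the actual function in the benchmark repo may differ):

```python
import sqlite3

def create_conn(path=":memory:"):
    # Sketch of a shared connection factory; both arms go through the
    # same function so the PRAGMAs cannot differ between them.
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")   # WAL on disk; no-op in memory
    conn.execute("PRAGMA cache_size=-65536")  # negative value = KiB, i.e. 64MB
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn
```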
## The Numbers
Eight scenarios, five scales (10K through 1M rows), `--storage both`. Disk-mode numbers are reported; memory-mode ratios are within 10% of disk in all cases.
### Queries and Aggregates
| Scenario | 10K | 50K | 100K | 500K | 1M |
|---|---|---|---|---|---|
| Equality filter | 9.4x | 11.3x | 11.2x | 10.3x | 11.2x |
| Range filter | 3.2x | 3.2x | 3.1x | 3.2x | 3.2x |
| Complex filter | 5.7x | 6.0x | 6.0x | 5.8x | 6.0x |
| Top N | 5.0x | 5.7x | 5.9x | 6.8x | 7.3x |
| Aggregates | 9.2x | 9.5x | 9.4x | 9.5x | 9.5x |
The equality filter is the clearest measure of `json_extract()` cost: both arms check one field against one value, but sqler runs `WHERE json_extract(data, '$.value') = ?` while the columnar arm runs `WHERE value = ?`. The 11x gap is almost entirely the extraction function; Pydantic hydration adds another layer on the result set.
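The two access paths can be sketched at toy scale (hypothetical table names; the real scenario runs much larger). Both predicates select the same rows, but one parses JSON on every row scanned:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_arm (_id INTEGER PRIMARY KEY, data JSON)")
conn.execute("CREATE TABLE col_arm (_id INTEGER PRIMARY KEY, value INTEGER)")

values = [i % 100 for i in range(1000)]
conn.executemany("INSERT INTO doc_arm (data) VALUES (?)",
                 [(json.dumps({"value": v}),) for v in values])
conn.executemany("INSERT INTO col_arm (value) VALUES (?)",
                 [(v,) for v in values])

# Document arm: the extraction function runs for every row scanned
doc_hits = conn.execute(
    "SELECT COUNT(*) FROM doc_arm WHERE json_extract(data, '$.value') = ?",
    (42,)).fetchone()[0]

# Columnar arm: a plain typed-column comparison
col_hits = conn.execute(
    "SELECT COUNT(*) FROM col_arm WHERE value = ?", (42,)).fetchone()[0]
```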
Range filter at 3.2x is rock-solid across every scale. The lower ratio reflects the larger result set — more rows pass the filter, so the per-row overhead of JSON extraction is amortized against result processing that both arms must do.
Top N worsens at scale: 5.0x at 10K, 7.3x at 1M. `ORDER BY json_extract(data, '$.score')` must evaluate the extraction function for every candidate row before sorting; `ORDER BY score` reads typed values directly. The gap grows because SQLite’s sort buffer management scales differently when extraction is in the hot path.
Aggregates are flat at ~9.5x. `SUM(json_extract(data, '$.value'))` versus `SUM(value)` — the extraction function runs on every row, with no amortization possible.
### Inserts
| Scenario | 10K | 50K | 100K | 500K | 1M |
|---|---|---|---|---|---|
| Bulk insert | 6.8x | 3.1x | 2.5x | 1.7x | 1.3x |
sqler uses chunked multi-row INSERT (`INSERT INTO t VALUES (...), (...), ...`); the columnar arm uses `executemany()` with per-column binding. At small scales, sqler’s fixed overhead dominates. At 1M rows, the chunked approach nearly catches up: JSON serialization plus multi-row batching versus per-row binding. The convergence is real, but the starting point matters: at typical application scales (10K–100K), the gap is 2.5–6.8x.
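The two insert strategies, sketched at toy scale (illustrative table names; sqler’s real chunk size is larger than shown here):

```python
import json
import sqlite3

docs = [{"value": i} for i in range(10)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_arm (_id INTEGER PRIMARY KEY, data JSON)")
conn.execute("CREATE TABLE col_arm (_id INTEGER PRIMARY KEY, value INTEGER)")

# sqler-style: one multi-row INSERT statement per chunk of documents
placeholders = ", ".join(["(?)"] * len(docs))
conn.execute(f"INSERT INTO doc_arm (data) VALUES {placeholders}",
             [json.dumps(d) for d in docs])

# Columnar arm: executemany() with per-column binding, one row per tuple
conn.executemany("INSERT INTO col_arm (value) VALUES (?)",
                 [(d["value"],) for d in docs])
```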
### Exports
| Scenario | 10K | 50K | 100K | 500K | 1M |
|---|---|---|---|---|---|
| CSV export | 1.7x | 1.9x | 1.9x | 1.9x | 1.9x |
| JSONL export | 1.1x | 1.1x | 1.0x | 1.0x | 1.0x |
JSONL export reaches parity at scale. This is the one scenario where the document architecture has a structural advantage: the data is already JSON. sqler reads the `data` column and writes it out; the columnar arm must build JSON from individual columns. At 1M rows, they converge to 1.0x.
CSV export at 1.9x is stable. sqler pays for per-field extraction from JSON dicts; the columnar arm reads typed columns directly.
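The structural difference behind the export numbers can be sketched like this (hypothetical minimal schemas): the document arm writes its stored blob essentially verbatim, while the columnar arm must assemble each JSON line.

```python
import io
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_arm (_id INTEGER PRIMARY KEY, data JSON)")
conn.execute("CREATE TABLE col_arm (_id INTEGER PRIMARY KEY, name TEXT, value INTEGER)")
conn.execute("INSERT INTO doc_arm (data) VALUES (?)",
             (json.dumps({"name": "a", "value": 1}),))
conn.execute("INSERT INTO col_arm (name, value) VALUES (?, ?)", ("a", 1))

# Document arm: the stored blob IS the output line
doc_out = io.StringIO()
for (blob,) in conn.execute("SELECT data FROM doc_arm"):
    doc_out.write(blob + "\n")

# Columnar arm: must serialize a dict built from typed columns
col_out = io.StringIO()
for name, value in conn.execute("SELECT name, value FROM col_arm"):
    col_out.write(json.dumps({"name": name, "value": value}) + "\n")
```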
## What the Numbers Mean
The document-oriented architecture costs 3–11x on queries and aggregates compared to typed columns. That is the price of `json_extract()`. It is not a bug; it is the fundamental tradeoff of storing structured data as JSON blobs.
For context: the ORM overhead measured in Parts 1–4 was 1.03–1.15x on top of the same JSON storage. The document architecture itself is the expensive part; the abstraction layer is noise by comparison.
### Where the cost comes from
- `json_extract()` per row — Every query, filter, sort, and aggregate must call `json_extract(data, '$.field')` instead of reading a typed column directly. SQLite’s JSON parser is fast, but it’s doing string parsing on every row.
- Pydantic hydration — sqler returns model instances; the columnar arm returns dicts built from `sqlite3.Row`. The hydration cost (documented in Part 2) compounds on top of the extraction cost.
- JSON serialization on insert — sqler runs `json.dumps()` on the entire document; the columnar arm binds typed values directly. At scale, sqler’s multi-row INSERT strategy offsets some of this cost.
### Where the cost doesn’t come from
- PRAGMAs: identical. Same function, same settings.
- Connection setup: identical. Same `create_conn()` from the shared baseline module.
- Data: identical. Same generator, same seed, same documents.
## The Tradeoffs
A 10x gap on aggregates sounds damning. Whether it matters depends on what you’re building.
When typed columns win clearly: Analytics, reporting, bulk aggregation, any workload that scans most rows and computes over scalar fields. If your queries look like `SELECT AVG(score) FROM t WHERE category = 'tech'`, a columnar schema will always be faster. That query touches every row; `json_extract()` on every row is pure overhead.
When the gap narrows or disappears: Document export (JSONL is at parity), bulk insert at scale (1.3x at 1M), any workload where the data naturally lives as nested/variable-schema documents and you’d otherwise need junction tables or EAV patterns to represent it relationally.
When the architecture choice isn’t about speed: Schema flexibility. sqler models are Python classes; adding a field is changing the class, not running `ALTER TABLE`. Documents with optional fields, nested objects, and variable-length arrays map naturally to JSON; flattening them into columns requires design decisions upfront that JSON storage defers.
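A small sketch of that deferral, using plain `sqlite3` rather than sqler’s model layer (table names illustrative): the document arm absorbs a new optional field with no migration, while the columnar arm needs DDL.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_arm (_id INTEGER PRIMARY KEY, data JSON)")
conn.execute("CREATE TABLE col_arm (_id INTEGER PRIMARY KEY, name TEXT)")

# Document arm: a new optional field is just more JSON; old and new
# documents coexist in the same table with no migration step
conn.execute("INSERT INTO doc_arm (data) VALUES (?)",
             (json.dumps({"name": "a"}),))
conn.execute("INSERT INTO doc_arm (data) VALUES (?)",
             (json.dumps({"name": "b", "rating": 5}),))

# Columnar arm: the same change is an explicit schema migration
conn.execute("ALTER TABLE col_arm ADD COLUMN rating INTEGER")

# Old documents simply yield NULL for the missing field
ratings = [r[0] for r in conn.execute(
    "SELECT json_extract(data, '$.rating') FROM doc_arm ORDER BY _id")]
```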
The cost is real, predictable, and proportional. It doesn’t grow faster than the data. Whether 10x on aggregates is acceptable depends on whether the aggregate takes 1ms or 1 second in absolute terms — and for most application-scale datasets, it’s the former.
## Known Caveats
- Insert strategy asymmetry: sqler’s chunked multi-row INSERT versus `executemany()` is not an apples-to-apples SQL comparison. It reflects real usage — this is how each tool actually inserts data — but the convergence at scale is partly an artifact of the different strategies, not purely a storage-model effect.
- Return type asymmetry: sqler returns Pydantic model instances (~1,600ns/row hydration cost); the columnar arm returns plain dicts. This is intentional — it measures the full developer experience — but isolating `json_extract()` cost alone would require both arms to return the same type.
- No indexes on the columnar arm: Neither arm uses indexes (except the primary key). Adding a `CREATE INDEX ON bench(value)` to the columnar arm would widen the gap on equality and range filters; adding a functional index on `json_extract(data, '$.value')` to the sqler arm would narrow it. Both were omitted to measure the raw storage-model cost.
- Single machine: Same caveat as Parts 1–4. Linux x86_64, 8 cores, Python 3.12, SQLite 3.50.
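On the index caveat: SQLite does support indexes on deterministic expressions, `json_extract` included, so the document arm is not unindexable. A sketch (illustrative table name):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bench (_id INTEGER PRIMARY KEY, data JSON)")
conn.executemany("INSERT INTO bench (data) VALUES (?)",
                 [(json.dumps({"value": i}),) for i in range(100)])

# A functional index over the extracted field
conn.execute("CREATE INDEX idx_value ON bench (json_extract(data, '$.value'))")

# The planner can now SEARCH via the index instead of scanning every blob
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT _id FROM bench WHERE json_extract(data, '$.value') = 42"
).fetchone()
```

Note the query’s expression must match the indexed expression essentially verbatim for the planner to use the index.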
## Running It Yourself

```shell
# Install with benchmark dependencies
uv sync --all-groups

# List tabular scenarios
uv run python -m benchmarks list | grep tabular

# Run at medium scale (~20 minutes)
uv run python -m benchmarks run --suite tabular --scale medium --storage both

# Run at large scale
uv run python -m benchmarks run --suite tabular --scale large --storage both
```
## Series Summary
Five posts, one question asked five different ways:
| Post | Question | Answer |
|---|---|---|
| Part 1 | Are these benchmarks fair? | No. 18 issues found; full rewrite. |
| Part 2 | Where does the overhead come from? | Pydantic hydration and export paths. |
| Part 3 | Can it be fixed? | Five fixes; bulk insert now 0.89x. |
| Part 4 | What’s the final cost of the ORM? | 1.03–1.15x on queries; 1.34x on CSV. |
| Part 5 | What’s the cost of the architecture? | 3–11x on queries; 1.0x on JSONL export. |
The ORM is cheap. The document model is expensive. Whether the document model is too expensive depends on your workload; for most applications, the flexibility is worth the cost.
Start from the beginning: Part 1: The Methodology.
