
Logler: 246x Faster

The initial benchmarks told us exactly where logler was slow. Hierarchy building took 86 seconds at 50K entries. Smart sampling took 9 seconds. Five of eight core functions were re-parsing files on every call.

Three changes fixed it. Each one independently testable and revertable. Here’s what we found, what we changed, and what the numbers say.


The Bottlenecks

The v1 benchmark suite (14 scenarios, 43 measurements) exposed three categories of waste:

| Problem | Symptom | Root Cause |
|---|---|---|
| Hierarchy building | 86s at 50K entries | O(n^2) scan in naming inference |
| Smart sampling | 9s at 50K entries | Deserializing 50K entries to sample 50 |
| Repeated parsing | 260ms per follow_thread() call | Each function creates a new Rust investigator |

Change 1: Cached Investigator for All Functions

Problem: Five of eight core functions called standalone Rust functions that create a new Investigator instance — re-parsing all files every time. Only search(), extract_ids(), and get_context() used the cached path.

| Function | Before | After |
|---|---|---|
| search() | cached | no change |
| extract_ids() | cached | no change |
| get_context() | cached | no change |
| follow_thread() | re-parse every call | cached |
| find_patterns() | re-parse every call | cached |
| get_metadata() | re-parse every call | cached |
| follow_thread_hierarchy() | re-parse every call | cached |
| detect_correlation_chains() | broken (wrong arg count) | fixed + cached |

The cache is an LRU keyed on (sorted_file_paths, file_mtimes). Same files, same modification times → same investigator. Files change → new investigator.
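The keying scheme can be sketched in Python with `functools.lru_cache`. The `Investigator` class below is a toy stand-in for the Rust investigator (its real constructor is not shown in the post); only the key derivation mirrors the `(sorted_file_paths, file_mtimes)` scheme described above:

```python
import functools
import os

class Investigator:
    """Toy stand-in for the Rust investigator: 'parses' files once."""
    instances = 0

    def __init__(self, paths):
        Investigator.instances += 1
        self.paths = paths

def _cache_key(paths):
    # Same files in any order hash the same; a changed mtime busts the key.
    sorted_paths = tuple(sorted(paths))
    mtimes = tuple(os.path.getmtime(p) for p in sorted_paths)
    return (sorted_paths, mtimes)

@functools.lru_cache(maxsize=8)
def _investigator_for(key):
    sorted_paths, _mtimes = key
    return Investigator(list(sorted_paths))

def get_investigator(paths):
    return _investigator_for(_cache_key(paths))
```

With this shape, every function that previously built its own investigator routes through one `get_investigator()` call, so repeated calls against unchanged files hit the cache.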

The fix for detect_correlation_chains() deserves mention: it was calling logler_rs.search() with 10 arguments, but the standalone function only accepts 3. It would crash at runtime if ever called. Wiring it through the proper search() function fixed both the bug and the performance.

Impact: follow_thread() at 1K entries dropped from 2.6ms to 0.38ms (6.8x faster). At 50K entries, from 259ms to 148ms (1.75x faster).


Change 2: BTreeSet Prefix Index for Hierarchy

Problem: infer_children_from_naming() in the Rust hierarchy builder scanned ALL thread and span IDs for each parent node to find children by naming patterns. For every parent like worker-1, it checked every ID to see if it started with worker-1., worker-1:, or worker-1-. With P parents and N unique IDs, that’s O(P x N) — effectively O(n^2).

At 50K entries with thousands of unique IDs, this took 86 seconds.

Fix: Build a BTreeSet<String> of all IDs during add_entry(). Use range() for O(log n + k) prefix lookups, where k is the number of actual matches.

```rust
// Before: O(n) scan per parent
for candidate_id in all_ids.iter() {
    if candidate_id.starts_with(&prefix) { ... }
}

// After: O(log n + k) range query
let end = prefix_successor(&prefix);
for candidate_id in self.all_ids_sorted.range(prefix.clone()..end) { ... }
```

The prefix_successor() helper computes the exclusive upper bound for a prefix range query by incrementing the last byte. "worker-1." becomes "worker-1/" — the BTreeSet’s range() returns exactly the IDs that start with the prefix.
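The same range trick can be sketched in Python, with the stdlib `bisect` module standing in for `BTreeSet::range`. The helper name matches the post; the carry-free successor is a simplifying assumption (the real Rust helper works on bytes and must handle 0xFF carry-over):

```python
import bisect

def prefix_successor(prefix: str) -> str:
    # Exclusive upper bound for a prefix range: bump the last character.
    # Simplifying assumption: the last character is not at its maximum value.
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)

def ids_with_prefix(sorted_ids, prefix):
    # O(log n + k): two binary searches bound the matching slice.
    lo = bisect.bisect_left(sorted_ids, prefix)
    hi = bisect.bisect_left(sorted_ids, prefix_successor(prefix))
    return sorted_ids[lo:hi]
```

Note that a query for `worker-1.` correctly excludes `worker-10.a`: `/` sorts between `.` and `0`, so the range ends before any `worker-10` IDs.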

Impact: Hierarchy building at 50K entries dropped from 85,932ms to 349ms. That’s 246x faster.

Hierarchy Building: Before vs After

This is a definitive improvement — v2’s worst case (360ms) is faster than v1’s best case (85,700ms). Zero overlap in timing distributions.


Change 3: Capped Smart Sample Fetch

Problem: smart_sample() called search(files, level=level, limit=None) — fetching ALL entries into Python as JSON dictionaries, then sampling 50. At 50K entries, Python was deserializing 50,000 JSON objects into dicts, then throwing away 49,950 of them.

Fix: Cap the fetch to max(sample_size * 10, 500) entries. For the errors_focused strategy, use two targeted fetches (errors + context) instead of one huge fetch. The Rust search result already includes total_matches — the true pre-truncation count — so we get the exact population size without a separate count query.
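A minimal sketch of the capped fetch. The `search_stub` below is a stand-in for the Rust-backed search; its result shape is assumed from the `total_matches` description above:

```python
import random

def search_stub(entries, limit=None):
    """Stand-in for the Rust search: truncates to `limit` but reports
    the true pre-truncation count as total_matches."""
    matches = entries if limit is None else entries[:limit]
    return {"entries": matches, "total_matches": len(entries)}

def smart_sample(entries, sample_size=50):
    cap = max(sample_size * 10, 500)          # fetch at most 10x the sample
    result = search_stub(entries, limit=cap)  # only `cap` dicts cross into Python
    population = result["total_matches"]      # exact count, no second query
    pool = result["entries"]
    return random.sample(pool, min(sample_size, len(pool))), population
```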

What didn’t work: The first attempt added a separate search(output_format="count") call to get the population count “cheaply.” This caused a regression because output_format="count" still does a full Rust search internally — it’s not a lightweight count query. For errors_focused, three searches instead of one. The fix was removing the count query entirely and reading total_matches from the capped search result.

Impact: Sampling at 10K entries dropped from 778ms to 196ms (4x faster). At 50K entries, from 9,075ms to 6,333ms (1.4x faster).

Smart Sample Scaling: Before vs After

The modest improvement at 50K reflects reality: the Rust search cost dominates. It still collects all matching results internally before truncating to limit. The savings come from not deserializing 50K JSON objects in Python. At 10K, the savings are proportionally larger because deserialization overhead was a bigger fraction of total time.


Full Results

All 43 measurements, ranked by speedup:

Speedup Summary

Top Improvements (2x+ speedup)

| Scenario | Scale | v1 | v2 | Speedup | Confidence |
|---|---|---|---|---|---|
| hierarchy_building | 50K | 85.9s | 349ms | 246x | definitive |
| hierarchy_building | 10K | 3.01s | 67ms | 45x | definitive |
| follow_thread | 1K | 2.6ms | 0.38ms | 6.8x | definitive |
| hierarchy_building | 1K | 42ms | 6.7ms | 6.3x | definitive |
| search_with_filters | 10K | 25ms | 4.5ms | 5.6x | definitive |
| sampling_scaling | 10K | 778ms | 196ms | 4.0x | definitive |
| compare_threads | 1K | 6.2ms | 1.6ms | 3.9x | definitive |
| compare_threads | 10K | 64ms | 18ms | 3.5x | definitive |
| search_with_filters | 1K | 3.6ms | 1.0ms | 3.5x | definitive |
| follow_thread | 10K | 28ms | 12ms | 2.3x | definitive |
| search_with_filters | 50K | 173ms | 78ms | 2.2x | definitive |

Moderate Improvements (1.1-2x speedup)

| Scenario | Scale | v1 | v2 | Speedup |
|---|---|---|---|---|
| search_scaling | 10K | 39ms | 20ms | 2.0x |
| follow_thread | 50K | 259ms | 148ms | 1.75x |
| sampling_scaling | 1K | 64ms | 40ms | 1.6x |
| compare_threads | 50K | 534ms | 353ms | 1.5x |
| sampling_scaling | 50K | 9.07s | 6.33s | 1.43x |
| search_scaling | 1K | 6.9ms | 5.1ms | 1.3x |
| search_scaling | 50K | 694ms | 543ms | 1.28x |
| sampling (chronological) | 50K | 9.13s | 7.24s | 1.26x |

Within Noise / Unchanged

| Scenario | Scale | v1 | v2 | Change |
|---|---|---|---|---|
| error_flow (large) | | 1.73ms | 1.68ms | ~same |
| tree_formatting (summary) | | 0.011ms | 0.011ms | ~same |
| search_by_level (INFO) | 50K | 3.71s | 3.77s | ~same |

Regressions (Honest)

| Scenario | Scale | v1 | v2 | Slowdown | Note |
|---|---|---|---|---|---|
| max_bytes_truncation | 16KB | 1.14s | 1.51s | 1.32x | Not in optimized path |
| max_bytes_truncation | 4KB | 1.12s | 1.33s | 1.18x | Not in optimized path |
| cross_service_timeline | 2 svcs | 5.2ms | 7.6ms | 1.47x | Likely system noise |
| max_bytes_truncation | 1KB | 1.11s | 1.25s | 1.13x | Not in optimized path |

The max_bytes_truncation regressions are consistent across all three parameters, which would normally point to a real (if small) slowdown. None of the three optimizations touches that code path, however, so the most likely explanation is variance from background system load during the benchmark run.


Methodology

Test Conditions

| Condition | v1 | v2 |
|---|---|---|
| Scale | small | small |
| Python | 3.12.11 | 3.12.11 |
| Rust backend | yes | yes |
| Platform | Linux x86_64 (8 cores) | same |
| logler version | 1.2.1 | 1.2.1 |
| Data generation | LogGenerator(seed=42) | same |

Confidence Classification

| Level | Criteria |
|---|---|
| Definitive | v2 worst case (max) < v1 best case (min). Zero overlap. |
| High | v2 p95 < v1 median. 95% of v2 runs beat typical v1. |
| Moderate | >10% median change, some overlap. |
| Marginal | 3-10% change. Could be noise. |
| Within noise | <3% change. Not a real difference. |

All measurements marked “definitive” have zero overlap in timing distributions. This is a conservative bar — even “high confidence” means 95% of v2 runs outperform the typical v1 run.
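The classification rules above can be sketched as a small function. The p95 computation here is a simplifying nearest-rank approximation, and the thresholds come straight from the table:

```python
import statistics

def classify(v1_times, v2_times):
    if max(v2_times) < min(v1_times):
        return "definitive"                 # zero overlap in distributions
    v1_med = statistics.median(v1_times)
    v2_p95 = sorted(v2_times)[int(0.95 * (len(v2_times) - 1))]
    if v2_p95 < v1_med:
        return "high"                       # 95% of v2 runs beat typical v1
    change = abs(v1_med - statistics.median(v2_times)) / v1_med
    if change > 0.10:
        return "moderate"
    if change >= 0.03:
        return "marginal"
    return "within noise"
```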

Statistical Integrity

  • No cherry-picking. Every scenario from v1 is re-run in v2. All results reported.
  • Same data generation seed. Identical synthetic logs across both runs.
  • Same machine. Both runs on identical hardware.
  • Warm caches. Warmup iterations discarded before measurement.
  • Coefficient of variation reported. High CV (>20%) means noisy measurement.
  • Regressions reported honestly. 4 regressions listed with notes.

What’s Left

The three optimizations addressed the worst bottlenecks. What remains:

  1. Rust search is the floor — The engine collects all matching results before applying limit. A true early-termination search would help sampling and large-result queries.
  2. format_tree() is slow — 1.7 seconds to render a hierarchy. The tree traversal + formatting needs optimization.
  3. Python JSON deserialization — Every result crosses the Rust→Python boundary as a JSON string that gets json.loads()’d. A direct PyO3 dict return would skip this.
  4. Persistent indexing — Currently files are re-parsed on first access. A disk-based index would make cold starts instant.

These are real next targets, not hypothetical wish-list items.


Reproducing

```shell
# Run both benchmark sets
uv run python -m benchmarks run --scale small --tag v1
uv run python -m benchmarks run --scale small --tag v2

# Generate comparison
uv run python -m benchmarks compare \
  -b benchmarks/results/v1/baseline.json \
  -c benchmarks/results/latest.json \
  -o benchmarks/results/v2 \
  --changes benchmarks/results/v2/CHANGES.md
```

Every chart and number in this post is reproducible from the benchmark suite.