perf: two-lifetime arenas + streaming XML/JSON output by omerbenamram · Pull Request #268 · omerbenamram/evtx

omerbenamram · 2025-12-14T11:46:05Z

Summary

Implement the plan’s two-lifetime model (chunk vs record arena) across BinXML tokens/values and template expansion.
Stream-expand templates from token iterators and write XML/JSON without building intermediate trees; reuse per-record scratch/output buffers.
Update evtx_dump to use the byte-lending streaming APIs for XML/JSON output.
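As a rough illustration of what a "byte-lending" streaming API with a reused per-record buffer can look like, here is a minimal std-only sketch. The `Record` type and `for_each_record_json` function are hypothetical names made up for this example; they are not the crate's actual API, and the sketch assumes the message text needs no JSON escaping.

```rust
use std::io::Write;

/// Hypothetical stand-in for a parsed record (not the crate's type).
struct Record {
    id: u64,
    message: &'static str,
}

/// Serializes each record into one reused buffer and lends the bytes to `sink`.
/// The slice passed to `sink` is only valid for the duration of that call.
fn for_each_record_json<F>(records: &[Record], mut sink: F)
where
    F: FnMut(&[u8]),
{
    let mut out: Vec<u8> = Vec::with_capacity(256); // per-record output buffer
    for rec in records {
        out.clear(); // drops the contents but keeps the capacity for the next record
        out.extend_from_slice(b"{\"id\":");
        write!(out, "{}", rec.id).expect("writing to a Vec<u8> cannot fail");
        out.extend_from_slice(b",\"message\":\"");
        out.extend_from_slice(rec.message.as_bytes()); // assumes no escaping is needed
        out.extend_from_slice(b"\"}");
        sink(&out); // lend the bytes; the caller copies or writes them out
    }
}
```

A caller such as a dump tool would pass a closure that writes each lent slice to stdout followed by a newline, so no per-record `String` has to be returned.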

Performance

Plan reference: two-lifetime_allocator_refactor_(rust_vs_zig)_9016de49.plan.md (local path: /Users/omerba/.cursor/plans/two-lifetime_allocator_refactor_(rust_vs_zig)_9016de49.plan.md).

Benchmark scenario (following the plan’s measure-iterate guidance):

sample: /Users/omerba/Workspace/evtx/samples/security_big_sample.evtx
threads: -t 1
format: -o jsonl
sink: stdout redirected to /dev/null
rust build: cargo build --release --features fast-alloc --bin evtx_dump
zig build: zig build -Doptimize=ReleaseFast
runner: hyperfine --warmup 2 --runs 10

Exact results (mean ± σ, 10 runs):

Case	Mean (ms)	±σ (ms)	Median (ms)	Min..Max (ms)
rust master `aa25de0`	702.5	165.0	630.4	610.9..1,144.6
rust current `1debf92`	396.4	27.1	393.5	365.4..445.5
rust perf `fa0a164`	331.1	37.2	327.4	297.3..423.4
zig	165.1	8.7	161.5	157.8..182.8

Derived speedups (from the same run):

perf vs current: 1.197x faster (16.5% less time)
perf vs master: 2.122x faster (52.9% less time)
zig vs perf: 2.005x faster (50.1% less time)
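For reference, the derived figures follow from the mean times in the table: the speedup factor is slow/fast and the time saved is 1 - fast/slow. A small sketch of that arithmetic (the `speedup` helper is a hypothetical name used only here):

```rust
/// Returns (speedup factor, percent less time) for a faster run versus a slower one.
fn speedup(slow_ms: f64, fast_ms: f64) -> (f64, f64) {
    let factor = slow_ms / fast_ms;
    let pct_less = (1.0 - fast_ms / slow_ms) * 100.0;
    (factor, pct_less)
}
```

For example, `speedup(396.4, 331.1)` gives roughly (1.197, 16.5), matching the "perf vs current" line.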

(Artifacts: benchmarks/perf_pr_20251214_134402.{json,md} on my machine.)

Test plan

cargo test -q

Note

Significantly reduces allocator churn and cloning by introducing arena-backed data and streaming expansion/serialization.

Core: Thread a per-chunk bumpalo arena through deserializer/template cache/value types; BinXmlValue and arrays now arena-owned; APIs updated to accept arena
Template handling: Stream-expand templates (no pre-expanded token Vec), move-on-last-use for substitutions, add borrowed-token path to avoid cloning
JSON output: Streamed writer avoids serde_json::Value on hot paths; manual string escaping; numbers via itoa/ryu; faster datetime formatting; duplicate-key handling tuned
Plumbing/structs: Switch many model types to PartialEq (remove PartialOrd); minor UTF-16 and SID read cleanups
Tooling/docs: Add PERF.md, profile_comparison.sh, scripts/ensure_quiet.sh, and saved benchmark JSON; update README and .gitignore
Deps: Add bumpalo, itoa, ryu
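To make the "manual string escaping" item concrete, here is a minimal std-only sketch of writing a JSON string directly into a byte buffer. The function name `write_json_string` is an assumption for illustration and is not claimed to match the PR's implementation; it copies unescaped runs in bulk and only emits escapes where JSON requires them.

```rust
/// Appends `s` to `out` as a quoted JSON string, escaping only what JSON requires.
fn write_json_string(out: &mut Vec<u8>, s: &str) {
    const HEX: &[u8; 16] = b"0123456789abcdef";
    out.push(b'"');
    let bytes = s.as_bytes();
    let mut start = 0; // start of the pending run of bytes that need no escaping
    for (i, &b) in bytes.iter().enumerate() {
        let esc: &[u8] = match b {
            b'"' => b"\\\"",
            b'\\' => b"\\\\",
            b'\n' => b"\\n",
            b'\r' => b"\\r",
            b'\t' => b"\\t",
            0x00..=0x1F => {
                // Remaining control characters use the \u00XX form.
                out.extend_from_slice(&bytes[start..i]);
                out.extend_from_slice(b"\\u00");
                out.push(HEX[(b >> 4) as usize]);
                out.push(HEX[(b & 0x0F) as usize]);
                start = i + 1;
                continue;
            }
            _ => continue, // includes UTF-8 continuation bytes, copied as part of a run
        };
        out.extend_from_slice(&bytes[start..i]);
        out.extend_from_slice(esc);
        start = i + 1;
    }
    out.extend_from_slice(&bytes[start..]);
    out.push(b'"');
}
```

Because the input is a `&str`, bytes at or above 0x80 are already valid UTF-8 and can be copied through unchanged.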

Written by Cursor Bugbot for commit 687c0dc.

src/json_stream_output.rs

Performance optimizations inspired by the Zig EVTX parser:

1. ASCII fast path for UTF-16 to UTF-8 conversion (binxml_utils.rs)
   - Bypass the decode_utf16 iterator for pure ASCII strings (~95% of EVTX strings)
   - Direct conversion when all code units are <= 0x7F
2. Use hashbrown HashMap for caches (string_cache.rs, template_cache.rs)
   - Faster lookups with inline optimization
3. Direct JSON string writing (json_stream_output.rs)
   - Add write_json_string_ncname() for XML NCName strings (no escaping needed)
   - Replace serde_json::to_writer() with direct byte writes for keys
   - XML element/attribute names follow NCName rules, so they are safe to write directly

Benchmark results:
- Master: 194.9 ms
- Optimized: 132.4 ms
- Improvement: 1.47x faster (about 32% reduction in execution time)
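The ASCII fast path described above can be sketched as follows. This is a std-only illustration under the assumption that the input is a slice of UTF-16 code units; the function name `utf16_to_string` is hypothetical and the code is not taken from binxml_utils.rs.

```rust
/// Decodes UTF-16 code units into a String, with a fast path for pure ASCII.
fn utf16_to_string(units: &[u16]) -> Result<String, std::char::DecodeUtf16Error> {
    if units.iter().all(|&u| u <= 0x7F) {
        // Every unit fits in one ASCII byte, so narrowing to u8 is lossless.
        let bytes: Vec<u8> = units.iter().map(|&u| u as u8).collect();
        // ASCII is valid UTF-8; from_utf8 re-checks to keep this sketch free of `unsafe`.
        return Ok(String::from_utf8(bytes).expect("ASCII is valid UTF-8"));
    }
    // General path: handles surrogate pairs and reports unpaired surrogates.
    char::decode_utf16(units.iter().copied()).collect()
}
```

Whether the extra scan pays off depends on the machine and data: a later commit in this PR drops the UTF-16 ASCII fast path because it slightly regressed on omer-pc.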

Detailed markdown document covering:
- ASCII fast path for UTF-16 to UTF-8 (~5% improvement)
- Hashbrown HashMap for caches (~1% improvement)
- Direct JSON string writing (~4% improvement)
- Total 1.47x speedup vs master
- Remaining opportunities to close gap with Zig

- Reverted hashbrown to std::collections::HashMap (std uses hashbrown internally)
- Added detailed profiling analysis showing bottlenecks:
  - Memory allocation: ~29% of CPU time (170+ samples)
  - Memory copying: ~6% (cloning during template expansion)
  - HashMap hashing: ~6%
  - Template expansion: ~9%
- Updated remaining opportunities with architectural solutions needed
- Current gap: Zig is 3.46x faster (574 ms vs 166 ms single-threaded)

cursor · 2025-12-27T10:44:55Z

src/json_output.rs

+                        s.push_str(escaped.as_ref());
+                    }
+                    _ => {}
+                }


Entity references dropped in JSON output for attributed elements

The refactored visit_entity_reference in JsonOutput no longer handles the case when separate_json_attributes is false (the default) and the current value is a non-empty object. Previously, calling visit_characters would add entity references to the #text field of objects with attributes. The new inline logic only handles Value::Null, empty objects with separate_json_attributes=true, and Value::String, and falls through to a silent no-op for all other cases. This causes entity references like &amp; to be silently dropped when they appear in elements with attributes, resulting in data loss in JSON output. The streaming output in handle_entity_string replicates this bug, with a comment incorrectly stating it matches legacy behavior.

Additional Locations (1)

src/json_stream_output.rs#L703-L709
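To illustrate the missing branch, here is a toy sketch, assuming a simplified `Node` enum in place of serde_json::Value. The `Node` type and `append_entity` function are hypothetical and do not mirror the crate's JsonOutput; the point is only that an element with attributes should receive the entity text in its text field instead of hitting a catch-all no-op.

```rust
/// Toy stand-in for the JSON value being built (NOT serde_json::Value).
#[derive(Debug, PartialEq)]
enum Node {
    Null,
    Text(String),
    /// Element with attributes; `text` models the "#text" field.
    Object {
        attributes: Vec<(String, String)>,
        text: Option<String>,
    },
}

/// Appends resolved entity text to the current node without dropping it.
fn append_entity(current: &mut Node, resolved: &str) {
    match current {
        Node::Null => *current = Node::Text(resolved.to_string()),
        Node::Text(s) => s.push_str(resolved),
        // The case the review says was lost: an object that already has attributes.
        Node::Object { text, .. } => text.get_or_insert_with(String::new).push_str(resolved),
    }
}
```

Matching exhaustively on the enum, with no `_ => {}` arm, means a newly unhandled case becomes a compile error instead of silent data loss.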

cursor · 2025-12-27T10:44:55Z

src/json_stream_output.rs

+            BinXmlValue::Real64Type(n) => {
+                let mut buf = ryu::Buffer::new();
+                self.write_bytes(buf.format(*n).as_bytes())
+            }


NaN and Infinity floats produce invalid JSON output

The streaming JSON output uses ryu::Buffer::format() for Real32Type and Real64Type values, which outputs "NaN", "inf", or "-inf" for special IEEE 754 float values. These are not valid JSON tokens. The legacy path using serde_json::json!() macro converts NaN and Infinity to null, producing valid JSON. If an EVTX file contains float fields with special values, the streaming output produces syntactically invalid JSON that cannot be parsed, while the legacy output would produce null.

Additional Locations (1)

src/json_stream_output.rs#L310-L332
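One possible shape for a fix, sketched with std only: check `is_finite()` before formatting and emit `null` otherwise, which matches the legacy behavior described above. The helper name `write_json_f64` is hypothetical. In the real writer the finite branch could presumably keep using ryu (for example via `ryu::Buffer::format_finite`), but that is an assumption about how the author would wire it in.

```rust
use std::io::Write;

/// Writes `n` as a JSON number, or `null` when `n` is NaN or infinite.
fn write_json_f64(out: &mut Vec<u8>, n: f64) {
    if n.is_finite() {
        // std's Display prints finite f64 values in plain decimal form.
        write!(out, "{}", n).expect("writing to a Vec<u8> cannot fail");
    } else {
        // NaN, inf and -inf are not valid JSON tokens.
        out.extend_from_slice(b"null");
    }
}
```

The same guard would apply to Real32Type values after widening to f64 or with an equivalent f32 helper.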

cursor bot reviewed Dec 14, 2025

src/json_stream_output.rs (resolved)

cursor bot reviewed Dec 21, 2025

src/json_stream_output.rs (resolved)

omerbenamram added 10 commits December 26, 2025 19:02

chore: add FlameGraph scripts to gitignore

35217c1

arena that kinda compiles

38c664d

wip bumpalo

40cdf9d

profile script

6bba066

profile script

44a8583

perf: speed up streaming expansion and JSON output

348d879

Avoid cloning cached template tokens during streaming expansion, and reduce JSON duplicate-key bookkeeping / timestamp formatting overhead to cut CPU time.

fix: post-rebase build + streaming parity

097dc0e

omerbenamram force-pushed the perf/streaming-output-noalloc branch from e45ba0c to 097dc0e on December 26, 2025 17:51

chore: fmt

542e382

omerbenamram changed the base branch from feature/performance-optimizations to master December 26, 2025 18:01

cursor bot reviewed Dec 26, 2025

src/binxml/assemble.rs (outdated, resolved)

omerbenamram added 7 commits December 26, 2025 22:13

binxml: preserve substitution values when reused

4340b55

docs: add perf thesis playbook

8c5f19d

- Add PERF.md (mft-style hypothesis workflow + artifacts)
- Link from README + profile_comparison.sh
- Ignore local tmp/ scratch dir
- Include opt-in perf ablation feature flags (perf_ablate_*)

perf: add quiet-benchmark guard

f46aad0

- Add scripts/ensure_quiet.sh (macOS CPU idle + load1 gate)
- Integrate with profile_comparison.sh via QUIET_CHECK=1 (hyperfine --prepare + pre-profile wait)
- Document usage and baseline environment in PERF.md

perf: make ensure_quiet work on Linux

5246f89

Support Linux by sampling idle% from /proc/stat deltas and load1 from /proc/loadavg.
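The script itself is shell, but the idle% calculation from two /proc/stat samples can be sketched in Rust as below. This is an illustration only: the function names are hypothetical, and counting iowait as idle time is an assumption that may differ from what scripts/ensure_quiet.sh actually does.

```rust
/// Parses the aggregate "cpu" line of /proc/stat into (idle_ticks, total_ticks).
fn parse_cpu_line(line: &str) -> Option<(u64, u64)> {
    let mut parts = line.split_whitespace();
    if parts.next()? != "cpu" {
        return None;
    }
    // Fields: user nice system idle iowait irq softirq steal. The guest fields
    // are skipped because guest time is already counted in user/nice.
    let fields: Vec<u64> = parts.take(8).map(|f| f.parse().ok()).collect::<Option<_>>()?;
    if fields.len() < 5 {
        return None;
    }
    let idle = fields[3] + fields[4]; // idle + iowait (assumption, see lead-in)
    Some((idle, fields.iter().sum()))
}

/// Idle percentage between two samples of the "cpu" line.
fn idle_percent(before: &str, after: &str) -> Option<f64> {
    let (idle0, total0) = parse_cpu_line(before)?;
    let (idle1, total1) = parse_cpu_line(after)?;
    let total = total1.checked_sub(total0)?;
    let idle = idle1.checked_sub(idle0)?;
    if total == 0 {
        return None; // no ticks elapsed between the samples
    }
    Some(idle as f64 * 100.0 / total as f64)
}
```

The counters in /proc/stat are cumulative since boot, which is why a delta between two samples is needed to get a current idle percentage.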

docs: record omer-pc baseline

1baeb46

- Add omer-pc environment + master-vs-branch baseline numbers (t=1 and t=8)
- Document quiet-check thresholds used on that machine

perf: make 8T bench optional

28ff308

- Add BENCH_MT (default on) so multi-thread benchmark can be skipped
- Keep QUIET_CHECK support via hyperfine --prepare

perf: make reference baselines single-thread

2b8b485

cursor bot reviewed Dec 27, 2025

omerbenamram added 3 commits December 27, 2025 13:00

perf: log attribution; remove ablations

7731ff9

- Add per-optimization writeups + attribution table (omer-pc, -t 1)
- Remove perf_ablate_* feature flags + code branches
- Drop UTF-16 ASCII fast-path (slightly regressed on omer-pc)
- Keep a curated hyperfine JSON artifact

perf: add H1 hypothesis (json output churn)

5aa4b29

- Add Samply-backed hotspot evidence and a Zig-informed plan to remove remaining JSON output allocations
- Define success metrics + guardrails for the next big experiment

perf: refine H1 (reuse scratch + bump arena)

687c0dc

- Remove smallvec suggestion
- Incorporate Zig-style memory reuse and a per-record scratch bump (separate from chunk arena)

omerbenamram mentioned this pull request Dec 30, 2025

PERF: New intermediate representation #278

Merged

omerbenamram closed this Dec 30, 2025

omerbenamram wants to merge 21 commits into master from perf/streaming-output-noalloc
