perf: two-lifetime arenas + streaming XML/JSON output #268

Closed
omerbenamram wants to merge 21 commits into master from perf/streaming-output-noalloc

@omerbenamram (Owner) commented Dec 14, 2025

Summary

  • Implement the plan’s two-lifetime model (chunk vs record arena) across BinXML tokens/values and template expansion.
  • Stream-expand templates from token iterators and write XML/JSON without building intermediate trees; reuse per-record scratch/output buffers.
  • Update evtx_dump to use the byte-lending streaming APIs for XML/JSON output.
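
The two-lifetime idea and the byte-lending output can be sketched with plain standard-library types. This is only an illustration under stated assumptions: `ChunkData`, `RecordScratch`, and `stream_records` are hypothetical names, a reused `Vec<u8>` stands in for the per-record arena, and the "rendering" is a trivial concatenation rather than BinXML expansion.

```rust
/// Chunk-lifetime data: built once per chunk and shared by all of its records.
/// (Hypothetical stand-in for the template and string caches.)
struct ChunkData {
    template_prefix: String,
}

/// Record-lifetime scratch: cleared, not freed, between records so that its
/// capacity is reused instead of being reallocated.
#[derive(Default)]
struct RecordScratch {
    out: Vec<u8>,
}

/// Renders each record into the reused scratch buffer and lends the bytes to
/// `sink`. The borrow handed to `sink` ends before the next record overwrites
/// the buffer, so no per-record output allocation is needed.
fn stream_records(
    chunk: &ChunkData,
    records: &[&str],
    scratch: &mut RecordScratch,
    mut sink: impl FnMut(&[u8]),
) {
    for record in records {
        scratch.out.clear(); // drops the contents, keeps the capacity
        scratch.out.extend_from_slice(chunk.template_prefix.as_bytes());
        scratch.out.extend_from_slice(record.as_bytes());
        sink(&scratch.out);
    }
}
```

A caller that only writes to stdout can pass a closure that forwards the lent slice to a writer. A caller that needs to keep a record must copy it before the closure returns.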

Performance

Plan reference: two-lifetime_allocator_refactor_(rust_vs_zig)_9016de49.plan.md (local path: /Users/omerba/.cursor/plans/two-lifetime_allocator_refactor_(rust_vs_zig)_9016de49.plan.md).

Benchmark scenario (following the plan’s measure-iterate guidance):

  • sample: /Users/omerba/Workspace/evtx/samples/security_big_sample.evtx
  • threads: -t 1
  • format: -o jsonl
  • sink: stdout redirected to /dev/null
  • rust build: cargo build --release --features fast-alloc --bin evtx_dump
  • zig build: zig build -Doptimize=ReleaseFast
  • runner: hyperfine --warmup 2 --runs 10

Exact results (mean ± σ, 10 runs):

| Case | Mean (ms) | ±σ (ms) | Median (ms) | Min..Max (ms) |
|---|---|---|---|---|
| rust master aa25de0 | 702.5 | 165.0 | 630.4 | 610.9..1,144.6 |
| rust current 1debf92 | 396.4 | 27.1 | 393.5 | 365.4..445.5 |
| rust perf fa0a164 | 331.1 | 37.2 | 327.4 | 297.3..423.4 |
| zig | 165.1 | 8.7 | 161.5 | 157.8..182.8 |

Derived speedups (from the same run):

  • perf vs current: 1.197x faster (16.5% less time)
  • perf vs master: 2.122x faster (52.9% less time)
  • zig vs perf: 2.005x faster (50.1% less time)
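
For reference, both derived figures come from the mean times alone. A minimal sketch of the arithmetic follows; the function name `speedup` is illustrative and not part of the repository.

```rust
/// Returns (speedup factor, percent less time) for a baseline mean and a
/// candidate mean, both in the same unit.
fn speedup(baseline_ms: f64, candidate_ms: f64) -> (f64, f64) {
    let factor = baseline_ms / candidate_ms;
    let percent_less = (1.0 - candidate_ms / baseline_ms) * 100.0;
    (factor, percent_less)
}
```

For example, `speedup(396.4, 331.1)` gives a factor of about 1.197 and about 16.5% less time, matching the "perf vs current" line.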

(Artifacts: benchmarks/perf_pr_20251214_134402.{json,md} on my machine.)

Test plan

  • cargo test -q

Note

Significantly reduces allocator churn and cloning by introducing arena-backed data and streaming expansion/serialization.

  • Core: Thread a per-chunk bumpalo arena through deserializer/template cache/value types; BinXmlValue and arrays now arena-owned; APIs updated to accept arena
  • Template handling: Stream-expand templates (no pre-expanded token Vec), move-on-last-use for substitutions, add borrowed-token path to avoid cloning
  • JSON output: Streamed writer avoids serde_json::Value on hot paths; manual string escaping; numbers via itoa/ryu; faster datetime formatting; duplicate-key handling tuned
  • Plumbing/structs: Switch many model types to PartialEq (remove PartialOrd); minor UTF-16 and SID read cleanups
  • Tooling/docs: Add PERF.md, profile_comparison.sh, scripts/ensure_quiet.sh, and saved benchmark JSON; update README and .gitignore
  • Deps: Add bumpalo, itoa, ryu
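
To illustrate what "manual string escaping" can look like, here is a minimal sketch that escapes a string straight into a byte buffer, copying unescaped runs in bulk. It is not the PR's code: `write_json_string` is a hypothetical name, and the escape set is simply the one JSON (RFC 8259) requires.

```rust
/// Appends `s` to `out` as a JSON string literal.
/// Escapes quotes, backslashes, and control characters below 0x20;
/// all other bytes (including multi-byte UTF-8) are copied unchanged.
fn write_json_string(out: &mut Vec<u8>, s: &str) {
    const HEX: &[u8; 16] = b"0123456789abcdef";
    let bytes = s.as_bytes();
    let mut start = 0; // start of the pending run that needs no escaping
    out.push(b'"');
    for (i, &b) in bytes.iter().enumerate() {
        let short_escape: Option<&str> = match b {
            b'"' => Some("\\\""),
            b'\\' => Some("\\\\"),
            b'\n' => Some("\\n"),
            b'\r' => Some("\\r"),
            b'\t' => Some("\\t"),
            0x00..=0x1F => None, // written below as \u00XX
            _ => continue,       // no escaping needed, extend the run
        };
        // Flush the unescaped run, then emit the escape sequence.
        out.extend_from_slice(&bytes[start..i]);
        match short_escape {
            Some(e) => out.extend_from_slice(e.as_bytes()),
            None => {
                out.extend_from_slice(b"\\u00");
                out.push(HEX[(b >> 4) as usize]);
                out.push(HEX[(b & 0x0F) as usize]);
            }
        }
        start = i + 1;
    }
    out.extend_from_slice(&bytes[start..]);
    out.push(b'"');
}
```

Strings that are known to contain none of these bytes can skip the loop entirely and be written between two quote bytes.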

Written by Cursor Bugbot for commit 687c0dc. This will update automatically on new commits. Configure here.

Performance optimizations inspired by Zig EVTX parser:

1. ASCII fast path for UTF-16 to UTF-8 conversion (binxml_utils.rs)
   - Bypass decode_utf16 iterator for pure ASCII strings (~95% of EVTX strings)
   - Direct conversion when all code units are <= 0x7F

2. Use hashbrown HashMap for caches (string_cache.rs, template_cache.rs)
   - Faster lookups with inline optimization

3. Direct JSON string writing (json_stream_output.rs)
   - Add write_json_string_ncname() for XML NCName strings (no escaping needed)
   - Replace serde_json::to_writer() with direct byte writes for keys
   - XML element/attribute names follow NCName rules, safe to write directly
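
The ASCII fast path in item 1 can be sketched as follows. This is an illustration, not the code in binxml_utils.rs: `utf16_to_string` is a hypothetical name, and the slow path uses the standard library's `decode_utf16`. Note that a later commit in this PR drops the fast path after it slightly regressed on one machine, so it is worth measuring before adopting.

```rust
use std::char::{decode_utf16, DecodeUtf16Error};

/// Decodes UTF-16 code units into `out`, taking a fast path when every
/// code unit is ASCII (<= 0x7F).
fn utf16_to_string(units: &[u16], out: &mut String) -> Result<(), DecodeUtf16Error> {
    if units.iter().all(|&u| u <= 0x7F) {
        // Each unit maps to exactly one ASCII byte; no surrogate handling needed.
        out.extend(units.iter().map(|&u| u as u8 as char));
        return Ok(());
    }
    // General path: handles surrogate pairs and reports unpaired surrogates.
    for ch in decode_utf16(units.iter().copied()) {
        out.push(ch?);
    }
    Ok(())
}
```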

Benchmark results:
- Master: 194.9 ms
- Optimized: 132.4 ms
- Improvement: 1.47x faster (32% reduction in execution time)
Detailed markdown document covering:
- ASCII fast path for UTF-16 to UTF-8 (~5% improvement)
- Hashbrown HashMap for caches (~1% improvement)
- Direct JSON string writing (~4% improvement)
- Total 1.47x speedup vs master
- Remaining opportunities to close gap with Zig
- Reverted hashbrown to std::collections::HashMap (std uses hashbrown internally)
- Added detailed profiling analysis showing bottlenecks:
  - Memory allocation: ~29% of CPU time (170+ samples)
  - Memory copying: ~6% (cloning during template expansion)
  - HashMap hashing: ~6%
  - Template expansion: ~9%
- Updated remaining opportunities with architectural solutions needed
- Current gap: Zig is 3.46x faster (574ms vs 166ms single-threaded)
Avoid cloning cached template tokens during streaming expansion, and reduce JSON
duplicate-key bookkeeping / timestamp formatting overhead to cut CPU time.
@omerbenamram force-pushed the perf/streaming-output-noalloc branch from e45ba0c to 097dc0e on December 26, 2025 17:51
@omerbenamram changed the base branch from feature/performance-optimizations to master on December 26, 2025 18:01
- Add PERF.md (mft-style hypothesis workflow + artifacts)
- Link from README + profile_comparison.sh
- Ignore local tmp/ scratch dir
- Include opt-in perf ablation feature flags (perf_ablate_*)
- Add scripts/ensure_quiet.sh (macOS CPU idle + load1 gate)
- Integrate with profile_comparison.sh via QUIET_CHECK=1 (hyperfine --prepare + pre-profile wait)
- Document usage and baseline environment in PERF.md
Support Linux by sampling idle% from /proc/stat deltas and load1 from /proc/loadavg.
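
The Linux sampling described here can be sketched as pure functions over the text of two `/proc/stat` snapshots and one `/proc/loadavg` read. The actual check lives in a shell script, so this Rust version is only an illustration of the arithmetic; the function names are hypothetical, and treating only the `idle` column as idle time (not `iowait`) is an assumption.

```rust
/// Parses the aggregate "cpu" line of /proc/stat into (idle, total) jiffies.
/// Only the first eight columns (user..steal) are summed, because the guest
/// columns are already included in user and nice.
fn parse_cpu_line(line: &str) -> Option<(u64, u64)> {
    let mut fields = line.split_whitespace();
    if fields.next()? != "cpu" {
        return None;
    }
    let values = fields
        .map(|f| f.parse::<u64>().ok())
        .collect::<Option<Vec<u64>>>()?;
    if values.len() < 4 {
        return None;
    }
    let idle = values[3]; // assumption: iowait is not counted as idle
    let total: u64 = values.iter().take(8).sum();
    Some((idle, total))
}

/// Idle percentage between two snapshots of the aggregate "cpu" line.
fn idle_percent(before: &str, after: &str) -> Option<f64> {
    let (idle0, total0) = parse_cpu_line(before)?;
    let (idle1, total1) = parse_cpu_line(after)?;
    let total = total1.checked_sub(total0)?;
    if total == 0 {
        return None;
    }
    let idle = idle1.checked_sub(idle0)?;
    Some(idle as f64 * 100.0 / total as f64)
}

/// The 1-minute load average is the first field of /proc/loadavg.
fn load1(loadavg: &str) -> Option<f64> {
    loadavg.split_whitespace().next()?.parse().ok()
}
```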
- Add omer-pc environment + master-vs-branch baseline numbers (t=1 and t=8)
- Document quiet-check thresholds used on that machine
- Add BENCH_MT (default on) so multi-thread benchmark can be skipped
- Keep QUIET_CHECK support via hyperfine --prepare
s.push_str(escaped.as_ref());
}
_ => {}
}

Entity references dropped in JSON output for attributed elements

The refactored visit_entity_reference in JsonOutput no longer handles the case where separate_json_attributes is false (the default) and the current value is a non-empty object. Previously, calling visit_characters added entity references to the #text field of objects with attributes. The new inline logic handles only Value::Null, empty objects with separate_json_attributes=true, and Value::String, and falls through to a silent no-op in every other case. As a result, entity references such as &amp; are silently dropped when they appear in elements with attributes, which loses data in the JSON output. The streaming output in handle_entity_string replicates this bug, with a comment incorrectly stating that it matches legacy behavior.
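
The missing case can be illustrated with a small self-contained model. This is not the crate's code: `Json` is a hypothetical stand-in for serde_json::Value, `append_text` is an illustrative name, and concatenating into an existing "#text" string is an assumption about the intended legacy behavior.

```rust
/// Minimal stand-in for a JSON value (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
enum Json {
    Null,
    String(String),
    Object(Vec<(String, Json)>),
}

/// Appends resolved entity text to the current element value, including the
/// object-with-attributes case that the review comment says is dropped.
fn append_text(value: &mut Json, text: &str) {
    match value {
        Json::Null => *value = Json::String(text.to_string()),
        Json::String(s) => s.push_str(text),
        Json::Object(fields) => {
            // Find the "#text" member, or create it if the element has none yet.
            match fields.iter_mut().find(|(k, _)| k == "#text") {
                Some((_, Json::String(s))) => s.push_str(text),
                Some((_, other)) => *other = Json::String(text.to_string()),
                None => fields.push(("#text".to_string(), Json::String(text.to_string()))),
            }
        }
    }
}
```

The point of the sketch is that the match has no arm that silently discards `text`.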

BinXmlValue::Real64Type(n) => {
    let mut buf = ryu::Buffer::new();
    self.write_bytes(buf.format(*n).as_bytes())
}

NaN and Infinity floats produce invalid JSON output

The streaming JSON output uses ryu::Buffer::format() for Real32Type and Real64Type values, which outputs "NaN", "inf", or "-inf" for special IEEE 754 float values. These are not valid JSON tokens. The legacy path using serde_json::json!() macro converts NaN and Infinity to null, producing valid JSON. If an EVTX file contains float fields with special values, the streaming output produces syntactically invalid JSON that cannot be parsed, while the legacy output would produce null.
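
One way to keep the streamed output valid is to check for non-finite values before formatting. The sketch below is an assumption about how such a guard could look, not the fix that was applied: `write_json_f64` is a hypothetical name, and it uses the standard library's `Display` formatting in place of ryu to stay dependency-free.

```rust
use std::io::{self, Write};

/// Writes `n` as a JSON number, or `null` when it is NaN or infinite,
/// mirroring what the legacy serde_json path is described as doing.
fn write_json_f64<W: Write>(out: &mut W, n: f64) -> io::Result<()> {
    if n.is_finite() {
        write!(out, "{}", n)
    } else {
        out.write_all(b"null")
    }
}
```

The same `is_finite` check applies unchanged to `f32` values and could sit in front of a `ryu::Buffer::format` call.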

- Add per-optimization writeups + attribution table (omer-pc, -t 1)
- Remove perf_ablate_* feature flags + code branches
- Drop UTF-16 ASCII fast-path (slightly regressed on omer-pc)
- Keep a curated hyperfine JSON artifact
- Add Samply-backed hotspot evidence and a Zig-informed plan to remove remaining JSON output allocations
- Define success metrics + guardrails for the next big experiment
- Remove smallvec suggestion
- Incorporate Zig-style memory reuse and a per-record scratch bump (separate from chunk arena)