perf: 4-digit SWAR follow-up in ffc_loop_parse_if_eight_digits #23

Merged
kolemannix merged 1 commit into kolemannix:main from
redis-performance:perf/4digit-swar-followup on May 27, 2026

@fcostaoliveira

Problem

ffc_loop_parse_if_eight_digits only fires when pend - p >= 8. Numbers
with 5–7 fractional digits (geographic coordinates, mesh data, most
real-world floats) never reach that threshold, so the 8-digit SWAR loop
never triggers for them and all digit scanning falls back to byte-by-byte
iteration. ffc_parse_four_digits_unrolled and
ffc_is_made_of_four_digits_fast were dead code on this common input shape.

A secondary issue: the original loop called ffc_read8_to_u64(*p) twice
per iteration — once to check, once to parse — doing the same 8-byte load
twice.

Fix

Two changes in ffc_loop_parse_if_eight_digits:

  1. Read once: load into a local val, pass to both
    ffc_is_made_of_eight_digits_fast and
    ffc_parse_eight_digits_unrolled_swar.

  2. 4-digit SWAR follow-up: after the 8-digit loop exits, if
    pend - p >= 4 and the next 4 bytes are all ASCII digits, consume them
    with ffc_parse_four_digits_unrolled. For a 7-digit fraction this
    converts 7 byte-by-byte iterations into 1×SWAR-4 + 3 byte-by-byte —
    roughly 43% fewer digit-scanning steps on the most common input length.

while (pend - *p >= 8) {
  uint64_t val = ffc_read8_to_u64(*p);
  if (!ffc_is_made_of_eight_digits_fast(val)) { break; }
  *i = (*i * 100000000) + ffc_parse_eight_digits_unrolled_swar(val);
  *p += 8;
}
if (pend - *p >= 4) {
  uint32_t val4 = ffc_read4_to_u32(*p);
  if (ffc_is_made_of_four_digits_fast(val4)) {
    *i = (*i * 10000) + ffc_parse_four_digits_unrolled(val4);
    *p += 4;
  }
}
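
The snippet above depends on ffc's internal helpers. As a rough, self-contained sketch of the same control flow, the block below swaps in hypothetical `demo_*` stand-ins for them. These are assumptions written for illustration, not the library's actual implementations, and the scalar tail loop at the end stands in for whatever byte-by-byte fallback the caller performs. Bytes are assembled in little-endian order with shifts, so the sketch does not depend on host endianness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the ffc helpers; NOT the library's code. */

/* Assemble 4 bytes in little-endian order, independent of the host. */
static uint32_t demo_read4_le(const char *p) {
  const unsigned char *u = (const unsigned char *)p;
  return (uint32_t)u[0] | ((uint32_t)u[1] << 8) |
         ((uint32_t)u[2] << 16) | ((uint32_t)u[3] << 24);
}

static uint64_t demo_read8_le(const char *p) {
  return (uint64_t)demo_read4_le(p) | ((uint64_t)demo_read4_le(p + 4) << 32);
}

/* True when every byte is '0'..'9': the add flags bytes above '9',
   the subtract flags bytes below '0' (both via the per-byte high bit). */
static bool demo_is_four_digits(uint32_t v) {
  return (((v + 0x46464646u) | (v - 0x30303030u)) & 0x80808080u) == 0;
}

static bool demo_is_eight_digits(uint64_t v) {
  return (((v + 0x4646464646464646ull) | (v - 0x3030303030303030ull)) &
          0x8080808080808080ull) == 0;
}

/* Caller must have checked demo_is_four_digits(v) first. */
static uint32_t demo_parse_four_digits(uint32_t v) {
  v -= 0x30303030u;        /* ASCII -> digit value, one per byte */
  v = (v * 10) + (v >> 8); /* byte 0 = d0*10+d1, byte 2 = d2*10+d3 */
  return (v & 0xFFu) * 100u + ((v >> 16) & 0xFFu);
}

static uint32_t demo_parse_eight_digits(uint64_t v) {
  uint32_t first = demo_parse_four_digits((uint32_t)v);         /* chars 0..3 */
  uint32_t second = demo_parse_four_digits((uint32_t)(v >> 32)); /* chars 4..7 */
  return first * 10000u + second;
}

/* Consume a run of ASCII digits from [*p, pend), accumulating into *i. */
static void demo_parse_digit_run(const char **p, const char *pend, uint64_t *i) {
  assert(pend >= *p);
  while (pend - *p >= 8) {
    uint64_t val = demo_read8_le(*p); /* one load serves check and parse */
    if (!demo_is_eight_digits(val)) { break; }
    *i = (*i * 100000000) + demo_parse_eight_digits(val);
    *p += 8;
  }
  if (pend - *p >= 4) { /* 4-digit follow-up for the remainder */
    uint32_t val4 = demo_read4_le(*p);
    if (demo_is_four_digits(val4)) {
      *i = (*i * 10000) + demo_parse_four_digits(val4);
      *p += 4;
    }
  }
  /* Assumed scalar tail: one digit per iteration. */
  while (*p != pend && (unsigned char)(**p - '0') <= 9) {
    *i = (*i * 10) + (uint64_t)(**p - '0');
    ++*p;
  }
}

/* Convenience wrapper: parse the leading digit run of s[0..len). */
static uint64_t demo_parse(const char *s, size_t len) {
  const char *p = s;
  uint64_t i = 0;
  demo_parse_digit_run(&p, s + len, &i);
  return i;
}
```

Called as `demo_parse("1234567", 7)`, the 8-digit loop is skipped, the 4-digit block consumes "1234", and the scalar tail handles the remaining "567".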

Benchmark

Results were measured on dedicated bare-metal instances using
simple_fastfloat_benchmark,
as 3-run averages. Baseline = same machine, same binary, pre-patch.

x86 — Intel Xeon Platinum 8488C (AWS m7i.metal-24xl)

| Dataset | Before | After | Δ | vs fastfloat |
|---|---|---|---|---|
| random [0,1] | 1772 MB/s | 1745 MB/s | −1.5% (noise) | −15% |
| canada.txt | 1299 MB/s | 1451 MB/s | +11.7% | +0.6% (ffc leads) |
| mesh.txt | 1048 MB/s | 1081 MB/s | +3.1% | −11% |

ARM — Graviton4 (AWS m8g.metal-24xl)

| Dataset | Before | After | Δ | vs fastfloat |
|---|---|---|---|---|
| random [0,1] | 1616 MB/s | 1566 MB/s | −3.1% (noise) | +43% |
| canada.txt | 1216 MB/s | 1331 MB/s | +9.5% | +45% |
| mesh.txt | 956 MB/s | 1002 MB/s | +4.8% | +101% |
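
The Δ column is the relative change in throughput from Before to After. A minimal sketch of that arithmetic, using a hypothetical helper name that is not part of the benchmark tooling:

```c
#include <assert.h>

/* Hypothetical helper: percentage change from `before` to `after`,
   both given in the same unit (MB/s in the tables above). */
static double relative_change_pct(double before, double after) {
  assert(before > 0.0); /* a zero baseline has no meaningful relative change */
  return (after - before) / before * 100.0;
}
```

For canada.txt on x86, `relative_change_pct(1299, 1451)` comes to about 11.7, which matches the table.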

random [0,1] numbers have 14–17 fractional digits, so the 8-digit loop
already consumes most of them and the 4-digit follow-up has little or
nothing left to handle (it only applies when at least 4 digits remain
after the loop). The slight dip is within run-to-run noise (the check
costs one branch).

canada.txt and mesh.txt both reflect real-world inputs where 7
fractional digits dominate. The improvement is consistent across
architectures and stable across runs (< 0.5% spread).
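
How digit count maps onto the two code paths can be sketched with a simple step-count model. It assumes every SWAR block and every single-byte iteration costs one "step" and that the whole run consists of digits. The function names are hypothetical, and the model ignores real costs such as branch prediction and load latency.

```c
#include <assert.h>
#include <stddef.h>

/* Pre-patch model: 8-digit SWAR blocks, then one byte per step. */
static size_t steps_before(size_t ndigits) {
  return ndigits / 8 + ndigits % 8;
}

/* Post-patch model: a remainder of 4..7 digits spends one SWAR-4 step
   on its first four digits; the rest is still one byte per step. */
static size_t steps_after(size_t ndigits) {
  size_t rem = ndigits % 8;
  size_t tail = (rem >= 4) ? 1 + (rem - 4) : rem;
  return ndigits / 8 + tail;
}

/* Steps saved by the follow-up for a run of `ndigits` digits. */
static size_t steps_saved(size_t ndigits) {
  assert(steps_after(ndigits) <= steps_before(ndigits));
  return steps_before(ndigits) - steps_after(ndigits);
}
```

Under this model a 7-digit run drops from 7 steps to 4, the roughly 43% reduction quoted in the Fix section, while a run whose length is a multiple of 8 is unchanged.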

Methodology

This change was identified and validated using
ffc-agent-workspace,
a structured optimization workspace for ffc.h inspired by
AutoKernel (Jaber & Jaber, 2026).

The workflow:

  1. Profile first: perf record on the benchmark binary identified
    ffc_loop_parse_if_eight_digits as a hot symbol on canada/mesh inputs.
  2. Hypothesis before code: the dead-SWAR-path root cause was stated
    as a falsifiable claim before any editing.
  3. Correctness gating: unit tests + fastfloat reference corpus +
    exhaustive tests all pass before benchmarking.
  4. Two-step validation: a benchmark result alone is not enough; the
    profile must also show the target symbol's CPU % decreasing or IPC
    increasing.
  5. Same-machine baselines: baseline and post-patch numbers are captured
    on the same bare-metal instance in the same session to eliminate
    environment noise.
  6. All experiments logged: the reasoning, both baseline and post
    numbers, and the decision rationale are recorded in
    experiments/EXPERIMENTS.md.

Full experiment log: EXP-001 in ffc-agent-workspace

Also fix double-read of ffc_read8_to_u64(*p) in the 8-digit loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kolemannix merged commit b1894aa into kolemannix:main on May 27, 2026
6 checks passed
fcostaoliveira added a commit to redis-performance/ffc.h that referenced this pull request Jun 3, 2026
…/force-inline-ffc-impl

Resolves the ffc_loop_parse_if_eight_digits conflict by keeping both changes:
- our Clang/AArch64 manual 2x (16-digit) unroll of the SWAR loop, and
- upstream's new 4-digit follow-up block for sub-8-digit remainders.
The follow-up sits after the #if/#else digit loop, so it benefits both the
Clang/AArch64 unrolled path and the GCC/portable while-loop path. ffc.h
regenerated; unit + supplemental tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fcostaoliveira added a commit to redis-performance/ffc-agent-workspace that referenced this pull request Jun 9, 2026
kolemannix/ffc.h#23 (4-digit SWAR follow-up) merged 2026-05-27 is OUR PR but was
omitted (README listed ffc as "lands directly, no PRs"). Add a kolemannix/ffc.h
row (1 merged, 3 open), add fast_float #387, fix total 4 -> 6. IMPACT.md and the
upstream-prs memory brought in sync with a cross-repo merged ledger.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
nickva added a commit to davisp/jiffy that referenced this pull request Jun 13, 2026
A quick microbench showed 10% speedup on numbers.json.

Upstream PR: kolemannix/ffc.h#23