perf: 4-digit SWAR follow-up in ffc_loop_parse_if_eight_digits by fcostaoliveira · Pull Request #23 · kolemannix/ffc.h

fcostaoliveira · 2026-05-26T16:48:18Z

Problem

ffc_loop_parse_if_eight_digits only fires when pend - p >= 8. Numbers
with 5–7 fractional digits (geographic coordinates, mesh data, most
real-world floats) fall short of that threshold, so the 8-digit SWAR loop
never triggers for them and all digit scanning falls back to byte-by-byte
iteration. ffc_parse_four_digits_unrolled and
ffc_is_made_of_four_digits_fast are effectively dead code on this common
input shape.

A secondary issue: the original loop called ffc_read8_to_u64(*p) twice
per iteration — once to check, once to parse — doing the same 8-byte load
twice.

Fix

Two changes in ffc_loop_parse_if_eight_digits:

1. Read once: load into a local val and pass it to both
   ffc_is_made_of_eight_digits_fast and
   ffc_parse_eight_digits_unrolled_swar.
2. 4-digit SWAR follow-up: after the 8-digit loop exits, if
   pend - p >= 4 and the next 4 bytes are all ASCII digits, consume them
   with ffc_parse_four_digits_unrolled. For a 7-digit fraction this
   converts 7 byte-by-byte iterations into 1×SWAR-4 + 3 byte-by-byte,
   i.e. 4 steps instead of 7, or roughly 43% fewer digit-scanning steps
   on the most common input length.

while (pend - *p >= 8) {
  /* Load the 8 bytes once and reuse them for the check and the parse. */
  uint64_t val = ffc_read8_to_u64(*p);
  if (!ffc_is_made_of_eight_digits_fast(val)) { break; }
  *i = (*i * 100000000) + ffc_parse_eight_digits_unrolled_swar(val);
  *p += 8;
}
/* 4-digit follow-up: one more SWAR step if at least 4 bytes remain
   and the next 4 are all digits. */
if (pend - *p >= 4) {
  uint32_t val4 = ffc_read4_to_u32(*p);
  if (ffc_is_made_of_four_digits_fast(val4)) {
    *i = (*i * 10000) + ffc_parse_four_digits_unrolled(val4);
    *p += 4;
  }
}
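
To make the 4-digit step concrete, here is a minimal, self-contained sketch of how a 4-byte SWAR digit check and parse can work. The sketch_* names are hypothetical and are not the ffc.h implementations; the bit tricks are a 4-byte analogue of the usual 8-digit SWAR technique, the code assumes a little-endian host, and the 8-digit loop is left out to keep it short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 4 bytes without alignment assumptions.
   Assumes a little-endian host, so the first character is the low byte. */
static uint32_t sketch_read4(const char *p) {
  uint32_t v;
  memcpy(&v, p, sizeof v);
  return v;
}

/* Nonzero iff all four bytes are in '0'..'9'.
   For ASCII input, adding 0x46 sets the high bit of any byte above '9',
   and subtracting 0x30 sets it for any byte below '0'. The same pair of
   tests also rejects bytes at or above 0x80. */
static int sketch_is_four_digits(uint32_t v) {
  return (((v + 0x46464646u) | (v - 0x30303030u)) & 0x80808080u) == 0;
}

/* Convert four already-validated ASCII digits to their value, 0..9999. */
static uint32_t sketch_parse_four_digits(uint32_t v) {
  v -= 0x30303030u;                 /* bytes are now d0, d1, d2, d3 */
  v = (v * 10u) + (v >> 8);         /* byte 0 = 10*d0+d1, byte 2 = 10*d2+d3 */
  uint64_t pairs = v & 0x00FF00FFu; /* keep the two 2-digit values */
  /* Multiplying by 1 + 100 * 2^16 puts 100*(10*d0+d1) + (10*d2+d3)
     into bits 16..31 of the product. */
  return (uint32_t)((pairs * 0x640001u) >> 16) & 0xFFFFu;
}

/* Hypothetical driver mirroring the follow-up block: at most one
   4-digit SWAR step, then a byte-by-byte loop for whatever is left.
   The two counters record how many steps of each kind were taken. */
static uint64_t sketch_parse_digits(const char *p, const char *pend,
                                    int *swar4_steps, int *byte_steps) {
  assert(p <= pend);
  uint64_t i = 0;
  *swar4_steps = 0;
  *byte_steps = 0;
  if (pend - p >= 4) {
    uint32_t val4 = sketch_read4(p);
    if (sketch_is_four_digits(val4)) {
      i = i * 10000u + sketch_parse_four_digits(val4);
      p += 4;
      ++*swar4_steps;
    }
  }
  while (p != pend && *p >= '0' && *p <= '9') {
    i = i * 10u + (uint64_t)(*p - '0');
    ++p;
    ++*byte_steps;
  }
  return i;
}
```

For a 7-digit run such as 4567891, this sketch takes one 4-digit step and three single-byte steps, which matches the 1×SWAR-4 + 3 byte-by-byte count given in the description.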

Benchmark

Results were measured on dedicated metal VMs using
simple_fastfloat_benchmark, as 3-run averages. Baseline = same machine,
same binary, pre-patch.

x86 — Intel Xeon Platinum 8488C (AWS m7i.metal-24xl)

Dataset	Before	After	Δ	vs fastfloat
random [0,1]	1772 MB/s	1745 MB/s	−1.5% (noise)	−15%
canada.txt	1299 MB/s	1451 MB/s	+11.7%	+0.6% (ffc leads)
mesh.txt	1048 MB/s	1081 MB/s	+3.1%	−11%

ARM — Graviton4 (AWS m8g.metal-24xl)

Dataset	Before	After	Δ	vs fastfloat
random [0,1]	1616 MB/s	1566 MB/s	−3.1% (noise)	+43%
canada.txt	1216 MB/s	1331 MB/s	+9.5%	+45%
mesh.txt	956 MB/s	1002 MB/s	+4.8%	+101%

random [0,1] numbers have 14–17 fractional digits, so the 8-digit loop
already handles most of their digits and the 4-digit follow-up has little
to contribute. The slight dip is attributed to run-to-run noise (the
check costs one branch).

canada.txt and mesh.txt both reflect real-world inputs where 7
fractional digits dominate. The improvement is consistent across
architectures and stable across runs (< 0.5% spread).
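
The Δ column can be reproduced from the Before and After throughputs as a relative change. A minimal sketch follows; the pct_change helper is hypothetical and is not part of simple_fastfloat_benchmark.

```c
#include <assert.h>

/* Relative change, in percent, between two throughput figures (MB/s). */
static double pct_change(double before, double after) {
  assert(before > 0.0);
  return (after - before) / before * 100.0;
}
```

For canada.txt on x86, pct_change(1299, 1451) comes out at roughly +11.7%, and for random [0,1] on ARM, pct_change(1616, 1566) at roughly −3.1%, in line with the tables above.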

Methodology

This change was identified and validated using ffc-agent-workspace,
a structured optimization workspace for ffc.h inspired by
AutoKernel (Jaber & Jaber, 2026).

The workflow:

1. Profile first: perf record on the benchmark binary identified
   ffc_loop_parse_if_eight_digits as a hot symbol on canada/mesh inputs.
2. Hypothesis before code: the dead-SWAR-path root cause was stated
   as a falsifiable claim before any editing.
3. Correctness gating: unit tests + fastfloat reference corpus +
   exhaustive tests all pass before benchmarking.
4. Two-step validation: benchmark result alone is not enough; the
   profile must also show the target symbol's CPU % decreasing or IPC
   increasing.
5. Same-machine baselines: baseline and post-patch numbers captured
   on the same metal VM in the same session to eliminate environment noise.
6. All experiments logged: including the reasoning, both baseline and
   post numbers, and the decision rationale, in
   experiments/EXPERIMENTS.md.

Full experiment log: EXP-001 in ffc-agent-workspace


…/force-inline-ffc-impl
Resolves the ffc_loop_parse_if_eight_digits conflict by keeping both changes:
- our Clang/AArch64 manual 2x (16-digit) unroll of the SWAR loop, and
- upstream's new 4-digit follow-up block for sub-8-digit remainders.
The follow-up sits after the #if/#else digit loop, so it benefits both the Clang/AArch64 unrolled path and the GCC/portable while-loop path. ffc.h regenerated; unit + supplemental tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

kolemannix/ffc.h#23 (4-digit SWAR follow-up) merged 2026-05-27 is OUR PR but was omitted (README listed ffc as "lands directly, no PRs"). Add a kolemannix/ffc.h row (1 merged, 3 open), add fast_float #387, fix total 4 -> 6. IMPACT.md and the upstream-prs memory brought in sync with a cross-repo merged ledger. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


perf: 4-digit SWAR follow-up in ffc_loop_parse_if_eight_digits

19fe522

Also fix double-read of ffc_read8_to_u64(*p) in the 8-digit loop. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kolemannix merged commit b1894aa into kolemannix:main May 27, 2026
6 checks passed

fcostaoliveira mentioned this pull request Jun 2, 2026

Use ffc (pure-C99) as the RESP3 double parser instead of strtod redis/hiredis#1328

Merged

nickva mentioned this pull request Jun 13, 2026

Update ffc.h : faster numbers parsing davisp/jiffy#299

Merged

nickva added a commit to davisp/jiffy that referenced this pull request Jun 13, 2026

Update ffc.h : faster numbers parsing

e2411f5

A quick microbench showed 10% speedup on numbers.json. Upstream PR: kolemannix/ffc.h#23

kolemannix merged 1 commit into kolemannix:main from redis-performance:perf/4digit-swar-followup
