Skip to content

Use ffc (pure-C99) as the RESP3 double parser instead of strtod#1328

Merged
michael-grunder merged 3 commits into
redis:masterfrom
fcostaoliveira:ffc-double-parser
Jun 2, 2026
Merged

Use ffc (pure-C99) as the RESP3 double parser instead of strtod#1328
michael-grunder merged 3 commits into
redis:masterfrom
fcostaoliveira:ffc-double-parser

Conversation

@fcostaoliveira

@fcostaoliveira fcostaoliveira commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

RESP3 double replies (the , type) are parsed in read.c with strtod(). This PR replaces that with ffc, a pure-C99 single-header correctly-rounded float parser (a C port of Daniel Lemire's fast_float), vendored as ffc.h. strtod remains available as a fallback via -DHIREDIS_FLOAT_STRTOD.

Two reasons, in order of importance:

1. Locale correctness (a latent bug)

strtod() honours the process LC_NUMERIC locale, but RESP3 doubles are always .-separated. A hiredis client embedded in a process that has called e.g. setlocale(LC_NUMERIC, "de_DE.UTF-8") (decimal comma) misparses the valid reply ,3.14\r\n: strtod("3.14") returns 3.0, and the existing eptr != &buf[len] check then rejects it as a "Bad double value" PROTOCOL error. So under a comma-locale, hiredis errors on essentially every double reply today.

ffc takes the decimal point as an explicit option (default '.'), so it is locale-independent by construction.

2. Speed

ffc is several times faster than glibc strtod. Parse-only microbenchmark (best of 200 iterations, pinned core). Both paths copy + NUL-terminate identically (see note below), so this is a pure parser comparison:

Platform dataset strtod (hiredis) ffc speedup
x86 (Intel) random [0,1] 108 MB/s 468 MB/s +335%
x86 (Intel) canada (geo coords) 100 MB/s 407 MB/s +306%
x86 (Intel) mesh (short floats) 85 MB/s 378 MB/s +344%
ARM (Graviton4) random [0,1] 194 MB/s 992 MB/s +411%
ARM (Graviton4) canada 170 MB/s 849 MB/s +400%
ARM (Graviton4) mesh 166 MB/s 584 MB/s +251%

Double-heavy reply streams (TS.RANGE/TS.MRANGE, FT.SEARCH ... WITHSCORES, vector-search distances, ZRANGE ... WITHSCORES, GEODIST) are parse-bound on strtod.

Correctness

RESP3 semantics are preserved exactly:

  • The strict inf/-inf/nan/-nan tokens are matched as before (strcasecmp on the copied buffer).
  • ffc runs with NO_INFNAN, so the numeric path only ever yields finite values.
  • The full-consume (res.ptr == buf+len) + isfinite checks mirror the strtod path's eptr/isfinite checks.
  • ffc is in fact stricter than strtod: it rejects the leading-whitespace and C99 hex-float inputs (" 3.14", "0x1p4") that strtod silently accepts -- neither is valid RESP3.

Testing:

  • Existing double reader tests pass (incl. the embedded-NUL invalid case and the array case).
  • Added in-tree tests: bit-exact edge magnitudes (0, -0, 5e-324 subnormal, DBL_MIN, DBL_MAX, ...) and malformed rejections -- all green under both the ffc default and the -DHIREDIS_FLOAT_STRTOD fallback.
  • An out-of-tree sweep of 3,000,000 values (random bit patterns / [0,1) / mixed magnitudes / short floats, multiple printf formats) is bit-identical to strtod with zero accept/reject disagreements.

ffc.h compiles clean under hiredis's existing -std=c99 -pedantic -Werror -Wall -Wextra -Wstrict-prototypes -Wwrite-strings (a clang -Wstatic-in-inline push/ignore/pop is scoped around the vendored header; see the commit).

Notes

  • The buffer copy is retained. read.c still copies the value into a local buffer and NUL-terminates it before handing it to the createDouble callback, exactly as today -- some custom createDouble implementations may rely on str being NUL-terminated. ffc parses that buffer's [start, end) range (it needs no terminator). A copy-free variant (parsing the reader buffer directly) is ~40% faster again on the ffc path, but changes the createDouble string to a counted, non-terminated buffer; happy to add that separately if you're open to that contract.
  • ffc.h is a single vendored header (no build-system change; only read.c includes it). It is tri-licensed Apache-2.0 / MIT / Boost-1.0 and is included here under MIT (compatible with hiredis's BSD-3), marked with an SPDX header.
  • Default build now uses ffc; define HIREDIS_FLOAT_STRTOD to revert to strtod. Happy to keep it opt-in instead if you'd prefer to land the capability before flipping the default.

Note

Medium Risk
Changes core protocol parsing for all RESP3 double replies; behavior is intended to match strtod but relies on a large new vendored parser, with only test coverage and an opt-out macro for rollback.

Overview
RESP3 double parsing in read.c no longer uses strtod() by default. The reader now vendors ffc.h (MIT, C port of fast_float) and parses finite numeric payloads with ffc_from_chars_double_options using a fixed '.' decimal separator and NO_INFNAN, while still handling inf / -inf / nan / -nan via the existing strcasecmp branches. Validation stays aligned with the old path: the parser must consume the whole token and the result must be finite.

HIREDIS_FLOAT_STRTOD restores the previous strtod() implementation for builds that need it. Clang gets a scoped -Wstatic-in-inline suppression around the single-TU FFC_IMPL include.

Tests in test.c add bit-exact checks for edge magnitudes (including -0) and protocol errors for malformed doubles.

Reviewed by Cursor Bugbot for commit f7b670d. Bugbot is set up for automated code reviews on this repo. Configure here.

RESP3 double replies (the ',' type) are parsed in read.c with strtod(). This
replaces that with ffc, a pure-C99 single-header correctly-rounded float parser
(a C port of Daniel Lemire's fast_float), vendored as ffc.h. strtod remains
available as a fallback via -DHIREDIS_FLOAT_STRTOD.

Why:

1. Speed. ffc is several times faster than glibc strtod. Parse-only microbench
   (best of 200, pinned core):
     x86 Intel : strtod 84-108 MB/s  -> ffc 735-799 MB/s  (~7-9x)
     ARM (Graviton4): strtod 164-195 -> ffc 1665-1832     (~9-10x)
   glibc strtod is the bottleneck; eliminating the per-reply NUL-terminated copy
   is only ~4-10% of it. Double-heavy replies (TS.RANGE, FT.SEARCH WITHSCORES,
   vector scores, ZRANGE WITHSCORES, GEODIST) are parse-bound on strtod today.

2. Locale correctness. strtod honours the process LC_NUMERIC locale, but RESP3
   doubles are always '.'-separated. A hiredis client embedded in a process under
   e.g. setlocale(LC_NUMERIC, "de_DE.UTF-8") (decimal comma) misparses the valid
   reply ',3.14' as 3.0 -- which the existing eptr==end check then rejects as a
   "Bad double value" PROTOCOL error. ffc takes the decimal point as an explicit
   option ('.'), so it is locale-independent. This is a latent bug fix.

3. No per-reply copy. ffc parses the reader buffer range [p, p+len) directly;
   the dedicated 326-byte buffer + memcpy + NUL-terminate are gone. createDouble
   copies the textual form from p as before.

RESP3 semantics are preserved exactly: the strict inf/-inf/nan/-nan tokens are
matched in place (case-insensitive, no copy), ffc runs with NO_INFNAN so the
numeric path only yields finite values, and the full-consume + isfinite checks
mirror the strtod path's eptr/isfinite checks. ffc is in fact stricter than
strtod (it rejects the leading-whitespace and C99 hex-float inputs strtod
silently accepts, neither of which is valid RESP3).

Correctness: existing double reader tests pass; added bit-exact edge-magnitude
tests (0, -0, 5e-324 subnormal, DBL_MIN, DBL_MAX, ...) and malformed-rejection
tests, all passing under both the ffc default and the strtod fallback. A
3,000,000-value sweep (random bit patterns / [0,1) / mixed magnitudes / short
floats, multiple formats) is bit-identical to strtod with zero accept/reject
disagreements.

ffc.h is tri-licensed Apache-2.0 / MIT / Boost-1.0; it is included here under MIT
(compatible with hiredis's BSD-3), marked with an SPDX header.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread read.c Outdated
Comment thread ffc.h Outdated
- macOS/Apple-Clang: ffc.h's extern-inline entry points call file-local static
  helpers, which trips -Wstatic-in-inline (default-on in Clang) and fails under
  hiredis's -Werror. Scope a clang diagnostic push/ignore/pop around the vendored
  ffc.h include (GCC has no such warning). Fixes the failing macOS CI job.

- createDouble contract: keep the buf copy + NUL-terminate and hand createDouble
  the terminated buffer (as before), rather than the raw reader pointer. Some
  custom createDouble callbacks may rely on str being NUL-terminated; ffc itself
  parses a [start,end) range and needs no terminator. (Addresses review note.)

- ffc.h: fix a vendored-amalgam bug in ffc_negative_digit_comp where
  ffc_am_to_float() was called with a hardcoded DOUBLE value-kind instead of the
  caller's vk, corrupting the float digit-comparison fallback (8-byte write /
  4-byte read of the value union). Not reachable from hiredis (doubles only);
  fixed upstream in ffc and re-vendored. (Addresses review note.)

Faithful parse microbench, both paths now copy+NUL-terminate identically
(best of 200, pinned core): ffc vs strtod about +300-410% (x86 and ARM).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit a8ca884. Configure here.

Comment thread test.c Outdated
The -0 case used 'dval == -0.0', but -0.0 == 0.0 in IEEE, so it would pass even
if the parser returned +0.0. Also compare signbit() so '-0' must parse to a true
-0.0. (Addresses review note on PR redis#1328.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@michael-grunder michael-grunder left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I didn't attempt to review ffc.h but I know it's been extensively tested.

Great performance win for ZSET heavy workloads!

@fcostaoliveira

Copy link
Copy Markdown
Contributor Author

@kolemannix

Copy link
Copy Markdown

A copy-free variant (parsing the reader buffer directly) is ~40% faster again on the ffc path, but changes the createDouble string to a counted, non-terminated buffer; happy to add that separately if you're open to that contract.

We should certainly try! A huge benefit of this switch is that the parser doesn't require null-termination, we've been able to remove copies from a number of places we've integrated ffc.h, usually the copy is entirely scoped to and only passed to strtod, so there's no blast radius at all.

In curious what's going on in hiredis that the NT string form of the float is being handed to other code, does it support custom parsing routines?

@michael-grunder

Copy link
Copy Markdown
Collaborator

I'm curious what's going on in hiredis that the NT string form of the float is being handed to other code, does it support custom parsing routines?

Yep, a lot of Redis clients are just thin wrappers around hiredis. They set their own callbacks to create domain specific values (e.g. hiredis-rb, etc).

@michael-grunder michael-grunder merged commit 86f5cc0 into redis:master Jun 2, 2026
17 checks passed
@kolemannix

Copy link
Copy Markdown

Thanks, interesting; could you avoid the extra buffer copy in the case that there are no callbacks registered?

@michael-grunder

Copy link
Copy Markdown
Collaborator

Hiredis uses its own callbacks as kind of the "default callbacks" (createDoubleObject in this case), but in principle i'm sure we could engineer around it to avoid the copy.

fcostaoliveira added a commit to redis-performance/ffc.h that referenced this pull request Jun 3, 2026
ffc_negative_digit_comp called ffc_am_to_float(false, am_b, &b,
FFC_VALUE_KIND_DOUBLE) but read the result back with ffc_to_extended_halfway(b,
vk) on the next line. For vk == FLOAT this writes a double (8 bytes) into the
ffc_value union and reads back the float member (4 bytes), producing a garbage
'theor' in the float negative-exponent digit-comparison (slow) path.

Pass the caller's vk consistently, matching the adjacent ffc_to_extended_halfway
call and upstream fast_float (which templates the float type throughout).

Safe by construction: when vk == DOUBLE this is byte-identical to before (vk IS
FFC_VALUE_KIND_DOUBLE), so the double path and the double-parsing fast/slow
paths are unchanged; only vk == FLOAT changes, and only toward correctness.
Validated: unit + 4M-value supplemental tests pass; a 20M-value ffc-vs-strtof
parity sweep over high-digit-count (digit_comp-forcing) inputs is bit-identical.
Found via review of redis/hiredis#1328 (Cursor Bugbot).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fcostaoliveira added a commit to redis-performance/ffc-agent-workspace that referenced this pull request Jun 9, 2026
Default-on ffc + MIT vendoring; submodule -> ffc-double-parser @ f19c4b5
(squashed, pushed to fcostaoliveira/hiredis). PR redis/hiredis#1328 opened with
bench + correctness evidence. Archive PR body; update tracker + memory.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kolemannix pushed a commit to kolemannix/ffc.h that referenced this pull request Jun 30, 2026
…7% ARM random vs baseline (#24)

Generated by an agent, run by of filipe@redis.com using Claude
https://github.com/redis-performance/ffc-agent-workspace
Reviewed by filipe and kolemannix


-----------

* EXP-001: 4-digit SWAR follow-up in ffc_loop_parse_if_eight_digits

Numbers with 5–7 digit fractions (canada.txt, mesh.txt) never triggered
the 8-digit SWAR loop, falling back to 7 byte-by-byte iterations. A
4-digit SWAR follow-up converts those to 1×SWAR-4 + ≤3 byte-by-byte.

canada: +29% (gap vs fastfloat: -29% → -4%)
mesh:   +18% (gap vs fastfloat: -10% → ±0%)
random: no regression (high variance, within noise)

* EXP-006: use local vars in too_many_digits path, allow GCC DSE of struct stores

In ffc_parse_number_string, int_part_start/len and fraction_part_start/len
were stored to the ffc_parsed answer struct unconditionally. These fields were
then read back in the rare too_many_digits path — preventing GCC's DSE from
eliminating the stores in non-JSON callers where ISRA removes those fields
from the return ABI.

Restructure: hoist 'before' and 'frac_end_local' out of the has_decimal_point
block. In the too_many_digits path, use start_digits, end_of_integer_part,
before, and frac_end_local directly instead of round-tripping through the
answer struct. The stores to answer.int_part_*/fraction_part_* remain (needed
by ffc_parse_json_number callers), but are no longer read within the function,
so GCC ISRA + DSE eliminates them on non-JSON call sites.

Result: x86 canada +2.2% (1414 → 1445 MB/s, 5-run stable). ARM unchanged
(GCC's ARM backend already handled this via ISRA).

* perf: FFC_IMPL_INLINE — force inline ffc_from_chars_double at call sites

Add conditional `__attribute__((always_inline)) inline` to the forward
declarations of `ffc_from_chars_double` and `ffc_from_chars_double_options`
via a new `FFC_IMPL_INLINE` macro.

When a translation unit defines `FFC_IMPL` before including `ffc.h` (the
documented pattern for the implementation TU), GCC's `ipa_early_inline`
pass sees the `always_inline` attribute on those declarations and inlines
both functions at every call site *before* IPA-CP runs. This prevents
GCC from creating a separate out-of-line constprop clone and eliminates
the per-call function-call overhead on every `ffc_from_chars_double` call.

The macro is defined as empty in non-FFC_IMPL TUs, so external linkage
and ABI compatibility for other users of the header are preserved (the
always_inline path does not emit an out-of-line symbol, but the existing
`ffc_parse_double` / `ffc_parse_float` wrappers remain as exported symbols).

* docs: update benchmark results with 2026-05-26 metal VM numbers

Replace Apple Silicon numbers (2026-03-03) with 5-run averages from
dedicated bare-metal servers: Intel Xeon Platinum 8488C (m7i.metal-24xl)
and Graviton4 (m8g.metal-24xl), both GCC -O3.

Post-EXP-009 results: ffc leads or matches fastfloat on all datasets
on both architectures.

* perf: combined exponent range check in ffc_clinger_fast_path_impl (EXP-012)

Replace two signed comparisons (MIN <= e && e <= MAX) with a single unsigned
range check: (uint64_t)(e - MIN) <= (MAX - MIN). This collapses the two-branch
scattered layout into one branch with compact sequential code, matching
fast_float's approach.

ARM Graviton4: +4.6% mesh, +1.5% random, +1.0% canada.

* Unroll fraction byte-by-byte tail: 3 nested ifs replace while loop

After ffc_loop_parse_if_eight_digits returns, at most 3 consecutive digit bytes
remain. Replacing the while loop with straight-line nested ifs eliminates the
back-branch, yielding better IPC on out-of-order cores (+3.9% mesh, +2.1% canada
on ARM Graviton4).

* perf: straight-line integer scan — nested-ifs replace while loop for 1–4 digits

Eliminates back-branches for the common 1–4 digit integer case.
Falls back to while loop for 5+ digits.

ARM Graviton4: random +4.2% (1823→1900 MB/s), canada +7.4% (1562→1677 MB/s),
mesh +2.0% (1366→1394 MB/s). EXP-026.

* perf: extend integer nested-ifs to 5 levels (EXP-028)

Extends the straight-line integer scan from 4 levels to 5 levels,
eliminating the while-loop back-branch for 5-digit integer parts
common in mesh.txt 3D vertex coordinates (e.g. "12345.678").

Results on ARM Graviton4 (m8g.metal-24xl, 2.80 GHz):
- mesh:   +7.9% (1394→1505 MB/s, 14.75→13.66 c/f)
- canada: +0.7% (1677→1688 MB/s, 29.02→28.84 c/f)
- random: +0.4% (1900→1908 MB/s, 30.87→30.75 c/f)

* perf: FFC_ROUNDS_TO_NEAREST macro eliminates FCMP chain (EXP-030)

Add compile-time macro FFC_ROUNDS_TO_NEAREST that makes
ffc_rounds_to_nearest() return a constant true, eliminating the 7-instruction
volatile-float FCMP chain entirely. Guard the rounding-mode test that
asserts ffc_rounds_to_nearest() == false under non-nearest modes.

Benchmark results on ARM Graviton4 (Neoverse V2):
  mesh.txt  +2.4% (1505 → 1541 MB/s, 100.86 → 93.86 i/f)
  canada    +1.8% (1688 → 1718 MB/s, 196.02 → 189.93 i/f)
  random    +0.8% (1908 → 1924 MB/s, 227.04 → 221.04 i/f)

* EXP-033: early exit for exponent==0 in ffc_from_chars_advanced

Adds a fast return path before the Clinger call for pure integers
(exponent==0, mantissa<=2^53). Skips the range check, mantissa check,
pow10 table load, and fmul with 1.0. Safe after EXP-030 eliminated
the volatile FCMP chain.

+12.9% mesh (83.92 i/f from 93.86), +1.1% canada, +0.4% random.
All unit and supplemental tests pass.

* fix: guard zero mantissa under FE_DOWNWARD in exponent==0 fast path

Clang may convert (double)(uint64_t)0 to -0.0 when fegetround() ==
FE_DOWNWARD. The fast path for exponent==0 did not have the same guard
that already existed in the non-nearest Clinger branch. Mirror the
existing #if __clang__ || FFC_32BIT guard.

Also revert readme.md: benchmark table update was not appropriate for
the upstream repository.

* EXP-039/042/044: checkpoint accepted shift-add acc10 + 2x SWAR unroll

These accepted experiments (FFC_DIGIT_ACC10 shift-add asm for Clang/AArch64,
acc10 on the exponent accumulator, and the 2x unroll of
ffc_loop_parse_if_eight_digits) were applied to the working tree but never
committed to the submodule. Checkpoint them so the submodule tip reflects the
best accepted state and revert-on-reject is safe during the race.

* fix: use caller's vk (not hardcoded DOUBLE) in ffc_negative_digit_comp

ffc_negative_digit_comp called ffc_am_to_float(false, am_b, &b,
FFC_VALUE_KIND_DOUBLE) but read the result back with ffc_to_extended_halfway(b,
vk) on the next line. For vk == FLOAT this writes a double (8 bytes) into the
ffc_value union and reads back the float member (4 bytes), producing a garbage
'theor' in the float negative-exponent digit-comparison (slow) path.

Pass the caller's vk consistently, matching the adjacent ffc_to_extended_halfway
call and upstream fast_float (which templates the float type throughout).

Safe by construction: when vk == DOUBLE this is byte-identical to before (vk IS
FFC_VALUE_KIND_DOUBLE), so the double path and the double-parsing fast/slow
paths are unchanged; only vk == FLOAT changes, and only toward correctness.
Validated: unit + 4M-value supplemental tests pass; a 20M-value ffc-vs-strtof
parity sweep over high-digit-count (digit_comp-forcing) inputs is bit-identical.
Found via review of redis/hiredis#1328 (Cursor Bugbot).

* review: address PR feedback — dedup always_inline macro, docs, comment cleanups

- Hoist the ffc_inline three-way ladder into api.h above FFC_IMPL_INLINE and
  collapse FFC_IMPL_INLINE to `#define FFC_IMPL_INLINE ffc_inline`; guard the
  duplicate definition in common.h with `#ifndef ffc_inline`. The two macros
  were byte-identical.
- readme: add a Configuration Macros section documenting FFC_IMPL and
  FFC_ROUNDS_TO_NEAREST (with the don't-define-if-you-change-rounding caveat).
- common.h: apply the suggested wording for the ffc_rounds_to_nearest() comment.
- parse.h: rewrite the integer-scan, fraction-unroll, hoisted-locals and
  too_many_digits comments to describe the current code rather than the diff
  that introduced them; tighten the AArch64/Clang acc10 inline-asm comment so
  the single-axis gating is self-explanatory.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants