Use ffc (pure-C99) as the RESP3 double parser instead of strtod#1328
Conversation
RESP3 double replies (the ',' type) are parsed in read.c with strtod(). This
replaces that with ffc, a pure-C99 single-header correctly-rounded float parser
(a C port of Daniel Lemire's fast_float), vendored as ffc.h. strtod remains
available as a fallback via -DHIREDIS_FLOAT_STRTOD.
Why:
1. Speed. ffc is several times faster than glibc strtod. Parse-only microbench
(best of 200, pinned core):
x86 Intel : strtod 84-108 MB/s -> ffc 735-799 MB/s (~7-9x)
ARM (Graviton4): strtod 164-195 -> ffc 1665-1832 (~9-10x)
glibc strtod is the bottleneck; eliminating the per-reply NUL-terminated copy
is only ~4-10% of it. Double-heavy replies (TS.RANGE, FT.SEARCH WITHSCORES,
vector scores, ZRANGE WITHSCORES, GEODIST) are parse-bound on strtod today.
2. Locale correctness. strtod honours the process LC_NUMERIC locale, but RESP3
doubles are always '.'-separated. A hiredis client embedded in a process under
e.g. setlocale(LC_NUMERIC, "de_DE.UTF-8") (decimal comma) misparses the valid
reply ',3.14' as 3.0 -- which the existing eptr==end check then rejects as a
"Bad double value" PROTOCOL error. ffc takes the decimal point as an explicit
option ('.'), so it is locale-independent. This is a latent bug fix.
3. No per-reply copy. ffc parses the reader buffer range [p, p+len) directly;
the dedicated 326-byte buffer + memcpy + NUL-terminate are gone. createDouble
copies the textual form from p as before.
RESP3 semantics are preserved exactly: the strict inf/-inf/nan/-nan tokens are
matched in place (case-insensitive, no copy), ffc runs with NO_INFNAN so the
numeric path only yields finite values, and the full-consume + isfinite checks
mirror the strtod path's eptr/isfinite checks. ffc is in fact stricter than
strtod (it rejects the leading-whitespace and C99 hex-float inputs strtod
silently accepts, neither of which is valid RESP3).
Correctness: existing double reader tests pass; added bit-exact edge-magnitude
tests (0, -0, 5e-324 subnormal, DBL_MIN, DBL_MAX, ...) and malformed-rejection
tests, all passing under both the ffc default and the strtod fallback. A
3,000,000-value sweep (random bit patterns / [0,1) / mixed magnitudes / short
floats, multiple formats) is bit-identical to strtod with zero accept/reject
disagreements.
ffc.h is tri-licensed Apache-2.0 / MIT / Boost-1.0; it is included here under MIT
(compatible with hiredis's BSD-3), marked with an SPDX header.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- macOS/Apple-Clang: ffc.h's extern-inline entry points call file-local static helpers, which trips -Wstatic-in-inline (default-on in Clang) and fails under hiredis's -Werror. Scope a clang diagnostic push/ignore/pop around the vendored ffc.h include (GCC has no such warning). Fixes the failing macOS CI job. - createDouble contract: keep the buf copy + NUL-terminate and hand createDouble the terminated buffer (as before), rather than the raw reader pointer. Some custom createDouble callbacks may rely on str being NUL-terminated; ffc itself parses a [start,end) range and needs no terminator. (Addresses review note.) - ffc.h: fix a vendored-amalgam bug in ffc_negative_digit_comp where ffc_am_to_float() was called with a hardcoded DOUBLE value-kind instead of the caller's vk, corrupting the float digit-comparison fallback (8-byte write / 4-byte read of the value union). Not reachable from hiredis (doubles only); fixed upstream in ffc and re-vendored. (Addresses review note.) Faithful parse microbench, both paths now copy+NUL-terminate identically (best of 200, pinned core): ffc vs strtod about +300-410% (x86 and ARM). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit a8ca884. Configure here.
The -0 case used 'dval == -0.0', but -0.0 == 0.0 in IEEE, so it would pass even if the parser returned +0.0. Also compare signbit() so '-0' must parse to a true -0.0. (Addresses review note on PR redis#1328.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
michael-grunder
left a comment
There was a problem hiding this comment.
LGTM
I didn't attempt to review ffc.h but I know it's been extensively tested.
Great performance win for ZSET heavy workloads!
nice. Notice we're also working on helping improve ffc and fast_float :D so expect a a further boost.
and |
We should certainly try! A huge benefit of this switch is that the parser doesn't require null-termination, we've been able to remove copies from a number of places we've integrated ffc.h, usually the copy is entirely scoped to and only passed to strtod, so there's no blast radius at all. In curious what's going on in hiredis that the NT string form of the float is being handed to other code, does it support custom parsing routines? |
Yep, a lot of Redis clients are just thin wrappers around hiredis. They set their own callbacks to create domain specific values (e.g. |
|
Thanks, interesting; could you avoid the extra buffer copy in the case that there are no callbacks registered? |
|
Hiredis uses its own callbacks as kind of the "default callbacks" ( |
ffc_negative_digit_comp called ffc_am_to_float(false, am_b, &b, FFC_VALUE_KIND_DOUBLE) but read the result back with ffc_to_extended_halfway(b, vk) on the next line. For vk == FLOAT this writes a double (8 bytes) into the ffc_value union and reads back the float member (4 bytes), producing a garbage 'theor' in the float negative-exponent digit-comparison (slow) path. Pass the caller's vk consistently, matching the adjacent ffc_to_extended_halfway call and upstream fast_float (which templates the float type throughout). Safe by construction: when vk == DOUBLE this is byte-identical to before (vk IS FFC_VALUE_KIND_DOUBLE), so the double path and the double-parsing fast/slow paths are unchanged; only vk == FLOAT changes, and only toward correctness. Validated: unit + 4M-value supplemental tests pass; a 20M-value ffc-vs-strtof parity sweep over high-digit-count (digit_comp-forcing) inputs is bit-identical. Found via review of redis/hiredis#1328 (Cursor Bugbot). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Default-on ffc + MIT vendoring; submodule -> ffc-double-parser @ f19c4b5 (squashed, pushed to fcostaoliveira/hiredis). PR redis/hiredis#1328 opened with bench + correctness evidence. Archive PR body; update tracker + memory. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…7% ARM random vs baseline (#24) Generated by an agent, run by of filipe@redis.com using Claude https://github.com/redis-performance/ffc-agent-workspace Reviewed by filipe and kolemannix ----------- * EXP-001: 4-digit SWAR follow-up in ffc_loop_parse_if_eight_digits Numbers with 5–7 digit fractions (canada.txt, mesh.txt) never triggered the 8-digit SWAR loop, falling back to 7 byte-by-byte iterations. A 4-digit SWAR follow-up converts those to 1×SWAR-4 + ≤3 byte-by-byte. canada: +29% (gap vs fastfloat: -29% → -4%) mesh: +18% (gap vs fastfloat: -10% → ±0%) random: no regression (high variance, within noise) * EXP-006: use local vars in too_many_digits path, allow GCC DSE of struct stores In ffc_parse_number_string, int_part_start/len and fraction_part_start/len were stored to the ffc_parsed answer struct unconditionally. These fields were then read back in the rare too_many_digits path — preventing GCC's DSE from eliminating the stores in non-JSON callers where ISRA removes those fields from the return ABI. Restructure: hoist 'before' and 'frac_end_local' out of the has_decimal_point block. In the too_many_digits path, use start_digits, end_of_integer_part, before, and frac_end_local directly instead of round-tripping through the answer struct. The stores to answer.int_part_*/fraction_part_* remain (needed by ffc_parse_json_number callers), but are no longer read within the function, so GCC ISRA + DSE eliminates them on non-JSON call sites. Result: x86 canada +2.2% (1414 → 1445 MB/s, 5-run stable). ARM unchanged (GCC's ARM backend already handled this via ISRA). * perf: FFC_IMPL_INLINE — force inline ffc_from_chars_double at call sites Add conditional `__attribute__((always_inline)) inline` to the forward declarations of `ffc_from_chars_double` and `ffc_from_chars_double_options` via a new `FFC_IMPL_INLINE` macro. When a translation unit defines `FFC_IMPL` before including `ffc.h` (the documented pattern for the implementation TU), GCC's `ipa_early_inline` pass sees the `always_inline` attribute on those declarations and inlines both functions at every call site *before* IPA-CP runs. This prevents GCC from creating a separate out-of-line constprop clone and eliminates the per-call function-call overhead on every `ffc_from_chars_double` call. The macro is defined as empty in non-FFC_IMPL TUs, so external linkage and ABI compatibility for other users of the header are preserved (the always_inline path does not emit an out-of-line symbol, but the existing `ffc_parse_double` / `ffc_parse_float` wrappers remain as exported symbols). * docs: update benchmark results with 2026-05-26 metal VM numbers Replace Apple Silicon numbers (2026-03-03) with 5-run averages from dedicated bare-metal servers: Intel Xeon Platinum 8488C (m7i.metal-24xl) and Graviton4 (m8g.metal-24xl), both GCC -O3. Post-EXP-009 results: ffc leads or matches fastfloat on all datasets on both architectures. * perf: combined exponent range check in ffc_clinger_fast_path_impl (EXP-012) Replace two signed comparisons (MIN <= e && e <= MAX) with a single unsigned range check: (uint64_t)(e - MIN) <= (MAX - MIN). This collapses the two-branch scattered layout into one branch with compact sequential code, matching fast_float's approach. ARM Graviton4: +4.6% mesh, +1.5% random, +1.0% canada. * Unroll fraction byte-by-byte tail: 3 nested ifs replace while loop After ffc_loop_parse_if_eight_digits returns, at most 3 consecutive digit bytes remain. Replacing the while loop with straight-line nested ifs eliminates the back-branch, yielding better IPC on out-of-order cores (+3.9% mesh, +2.1% canada on ARM Graviton4). * perf: straight-line integer scan — nested-ifs replace while loop for 1–4 digits Eliminates back-branches for the common 1–4 digit integer case. Falls back to while loop for 5+ digits. ARM Graviton4: random +4.2% (1823→1900 MB/s), canada +7.4% (1562→1677 MB/s), mesh +2.0% (1366→1394 MB/s). EXP-026. * perf: extend integer nested-ifs to 5 levels (EXP-028) Extends the straight-line integer scan from 4 levels to 5 levels, eliminating the while-loop back-branch for 5-digit integer parts common in mesh.txt 3D vertex coordinates (e.g. "12345.678"). Results on ARM Graviton4 (m8g.metal-24xl, 2.80 GHz): - mesh: +7.9% (1394→1505 MB/s, 14.75→13.66 c/f) - canada: +0.7% (1677→1688 MB/s, 29.02→28.84 c/f) - random: +0.4% (1900→1908 MB/s, 30.87→30.75 c/f) * perf: FFC_ROUNDS_TO_NEAREST macro eliminates FCMP chain (EXP-030) Add compile-time macro FFC_ROUNDS_TO_NEAREST that makes ffc_rounds_to_nearest() return a constant true, eliminating the 7-instruction volatile-float FCMP chain entirely. Guard the rounding-mode test that asserts ffc_rounds_to_nearest() == false under non-nearest modes. Benchmark results on ARM Graviton4 (Neoverse V2): mesh.txt +2.4% (1505 → 1541 MB/s, 100.86 → 93.86 i/f) canada +1.8% (1688 → 1718 MB/s, 196.02 → 189.93 i/f) random +0.8% (1908 → 1924 MB/s, 227.04 → 221.04 i/f) * EXP-033: early exit for exponent==0 in ffc_from_chars_advanced Adds a fast return path before the Clinger call for pure integers (exponent==0, mantissa<=2^53). Skips the range check, mantissa check, pow10 table load, and fmul with 1.0. Safe after EXP-030 eliminated the volatile FCMP chain. +12.9% mesh (83.92 i/f from 93.86), +1.1% canada, +0.4% random. All unit and supplemental tests pass. * fix: guard zero mantissa under FE_DOWNWARD in exponent==0 fast path Clang may convert (double)(uint64_t)0 to -0.0 when fegetround() == FE_DOWNWARD. The fast path for exponent==0 did not have the same guard that already existed in the non-nearest Clinger branch. Mirror the existing #if __clang__ || FFC_32BIT guard. Also revert readme.md: benchmark table update was not appropriate for the upstream repository. * EXP-039/042/044: checkpoint accepted shift-add acc10 + 2x SWAR unroll These accepted experiments (FFC_DIGIT_ACC10 shift-add asm for Clang/AArch64, acc10 on the exponent accumulator, and the 2x unroll of ffc_loop_parse_if_eight_digits) were applied to the working tree but never committed to the submodule. Checkpoint them so the submodule tip reflects the best accepted state and revert-on-reject is safe during the race. * fix: use caller's vk (not hardcoded DOUBLE) in ffc_negative_digit_comp ffc_negative_digit_comp called ffc_am_to_float(false, am_b, &b, FFC_VALUE_KIND_DOUBLE) but read the result back with ffc_to_extended_halfway(b, vk) on the next line. For vk == FLOAT this writes a double (8 bytes) into the ffc_value union and reads back the float member (4 bytes), producing a garbage 'theor' in the float negative-exponent digit-comparison (slow) path. Pass the caller's vk consistently, matching the adjacent ffc_to_extended_halfway call and upstream fast_float (which templates the float type throughout). Safe by construction: when vk == DOUBLE this is byte-identical to before (vk IS FFC_VALUE_KIND_DOUBLE), so the double path and the double-parsing fast/slow paths are unchanged; only vk == FLOAT changes, and only toward correctness. Validated: unit + 4M-value supplemental tests pass; a 20M-value ffc-vs-strtof parity sweep over high-digit-count (digit_comp-forcing) inputs is bit-identical. Found via review of redis/hiredis#1328 (Cursor Bugbot). * review: address PR feedback — dedup always_inline macro, docs, comment cleanups - Hoist the ffc_inline three-way ladder into api.h above FFC_IMPL_INLINE and collapse FFC_IMPL_INLINE to `#define FFC_IMPL_INLINE ffc_inline`; guard the duplicate definition in common.h with `#ifndef ffc_inline`. The two macros were byte-identical. - readme: add a Configuration Macros section documenting FFC_IMPL and FFC_ROUNDS_TO_NEAREST (with the don't-define-if-you-change-rounding caveat). - common.h: apply the suggested wording for the ffc_rounds_to_nearest() comment. - parse.h: rewrite the integer-scan, fraction-unroll, hoisted-locals and too_many_digits comments to describe the current code rather than the diff that introduced them; tighten the AArch64/Clang acc10 inline-asm comment so the single-axis gating is self-explanatory.

Summary
RESP3 double replies (the
,type) are parsed inread.cwithstrtod(). This PR replaces that with ffc, a pure-C99 single-header correctly-rounded float parser (a C port of Daniel Lemire's fast_float), vendored asffc.h.strtodremains available as a fallback via-DHIREDIS_FLOAT_STRTOD.Two reasons, in order of importance:
1. Locale correctness (a latent bug)
strtod()honours the processLC_NUMERIClocale, but RESP3 doubles are always.-separated. A hiredis client embedded in a process that has called e.g.setlocale(LC_NUMERIC, "de_DE.UTF-8")(decimal comma) misparses the valid reply,3.14\r\n:strtod("3.14")returns3.0, and the existingeptr != &buf[len]check then rejects it as a"Bad double value"PROTOCOL error. So under a comma-locale, hiredis errors on essentially every double reply today.ffc takes the decimal point as an explicit option (default
'.'), so it is locale-independent by construction.2. Speed
ffc is several times faster than glibc
strtod. Parse-only microbenchmark (best of 200 iterations, pinned core). Both paths copy + NUL-terminate identically (see note below), so this is a pure parser comparison:[0,1][0,1]Double-heavy reply streams (
TS.RANGE/TS.MRANGE,FT.SEARCH ... WITHSCORES, vector-search distances,ZRANGE ... WITHSCORES,GEODIST) are parse-bound onstrtod.Correctness
RESP3 semantics are preserved exactly:
inf/-inf/nan/-nantokens are matched as before (strcasecmpon the copied buffer).NO_INFNAN, so the numeric path only ever yields finite values.res.ptr == buf+len) +isfinitechecks mirror thestrtodpath'septr/isfinitechecks.strtod: it rejects the leading-whitespace and C99 hex-float inputs (" 3.14","0x1p4") thatstrtodsilently accepts -- neither is valid RESP3.Testing:
0,-0,5e-324subnormal,DBL_MIN,DBL_MAX, ...) and malformed rejections -- all green under both the ffc default and the-DHIREDIS_FLOAT_STRTODfallback.[0,1)/ mixed magnitudes / short floats, multiple printf formats) is bit-identical tostrtodwith zero accept/reject disagreements.ffc.hcompiles clean under hiredis's existing-std=c99 -pedantic -Werror -Wall -Wextra -Wstrict-prototypes -Wwrite-strings(a clang-Wstatic-in-inlinepush/ignore/pop is scoped around the vendored header; see the commit).Notes
read.cstill copies the value into a local buffer and NUL-terminates it before handing it to thecreateDoublecallback, exactly as today -- some customcreateDoubleimplementations may rely onstrbeing NUL-terminated. ffc parses that buffer's[start, end)range (it needs no terminator). A copy-free variant (parsing the reader buffer directly) is ~40% faster again on the ffc path, but changes thecreateDoublestring to a counted, non-terminated buffer; happy to add that separately if you're open to that contract.ffc.his a single vendored header (no build-system change; onlyread.cincludes it). It is tri-licensed Apache-2.0 / MIT / Boost-1.0 and is included here under MIT (compatible with hiredis's BSD-3), marked with an SPDX header.HIREDIS_FLOAT_STRTODto revert tostrtod. Happy to keep it opt-in instead if you'd prefer to land the capability before flipping the default.Note
Medium Risk
Changes core protocol parsing for all RESP3 double replies; behavior is intended to match strtod but relies on a large new vendored parser, with only test coverage and an opt-out macro for rollback.
Overview
RESP3 double parsing in
read.cno longer usesstrtod()by default. The reader now vendorsffc.h(MIT, C port of fast_float) and parses finite numeric payloads withffc_from_chars_double_optionsusing a fixed'.'decimal separator andNO_INFNAN, while still handlinginf/-inf/nan/-nanvia the existingstrcasecmpbranches. Validation stays aligned with the old path: the parser must consume the whole token and the result must be finite.HIREDIS_FLOAT_STRTODrestores the previousstrtod()implementation for builds that need it. Clang gets a scoped-Wstatic-in-inlinesuppression around the single-TUFFC_IMPLinclude.Tests in
test.cadd bit-exact checks for edge magnitudes (including-0) and protocol errors for malformed doubles.Reviewed by Cursor Bugbot for commit f7b670d. Bugbot is set up for automated code reviews on this repo. Configure here.