perf(sdk): add optional foldhash for ValueMap HashMaps in metrics hot path #3388

Open
bryantbiggs wants to merge 1 commit into open-telemetry:main from bryantbiggs:worktree-sorted-only-valuemap

@bryantbiggs
Contributor

@bryantbiggs bryantbiggs commented Feb 24, 2026

Summary

Adds an opt-in metrics-use-foldhash feature flag that replaces the default SipHash-1-3 hasher with foldhash for the HashMap used in ValueMap::trackers — the metrics hot path.

SipHash's HashDoS resistance is unnecessary here since ValueMap is pub(crate) and keys (Vec<KeyValue>) are not attacker-controlled. When the feature is not enabled, the standard library HashMap (SipHash) is used — no mandatory dependency is added.

Why foldhash?

I benchmarked four hashers — std SipHash, ahash, foldhash, and rapidhash — on the actual Vec<KeyValue> key type used by ValueMap, with 1600 time series and 2/4/8 attributes per entry matching real-world metrics cardinality.
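As a rough illustration of what a hash_only style measurement over this key shape could look like, here is a minimal sketch. It is not the benchmark code from this PR: `AttributeSet` is a simplified stand-in for `Vec<KeyValue>`, and `make_keys` and `hash_only` are hypothetical helper names. Any `BuildHasher` can be passed in, so the same loop can be timed with std's `RandomState` or a third-party hasher.

```rust
use std::hash::BuildHasher;
use std::time::{Duration, Instant};

/// Simplified stand-in for the SDK's `Vec<KeyValue>` key (assumption, not the real type).
type AttributeSet = Vec<(String, String)>;

/// Builds `series` distinct attribute sets with `attrs_per_entry` attributes each.
fn make_keys(series: usize, attrs_per_entry: usize) -> Vec<AttributeSet> {
    (0..series)
        .map(|s| {
            (0..attrs_per_entry)
                .map(|a| (format!("attr{a}"), format!("value{s}_{a}")))
                .collect()
        })
        .collect()
}

/// Hashes every key once with the given `BuildHasher`.
/// Returns the XOR of all hashes (so the work is observable) and the elapsed time.
fn hash_only<S: BuildHasher>(state: &S, keys: &[AttributeSet]) -> (u64, Duration) {
    let start = Instant::now();
    let mut acc = 0u64;
    for key in keys {
        acc ^= state.hash_one(key);
    }
    (acc, start.elapsed())
}
```

Calling `hash_only(&std::collections::hash_map::RandomState::new(), &make_keys(1600, 4))` would time the SipHash baseline for the 4-attribute case; a proper benchmark harness would repeat the loop many times and discard warm-up runs.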

Benchmark results (Apple Silicon M4 Max, aarch64-apple-darwin)

2 attributes per entry:

| Benchmark | std SipHash | ahash | foldhash | rapidhash |
|---|---|---|---|---|
| hash_only | 30.34 µs | 14.36 µs | 14.24 µs | 14.14 µs |
| lookup_hit | 56.67 µs | 40.59 µs | 37.60 µs | 35.84 µs |
| lookup_miss | 2.34 µs | 1.25 µs | 0.955 µs | 1.04 µs |
| insert | 133.36 µs | 115.79 µs | 111.70 µs | 113.77 µs |
| mixed_rw (ValueMap hot path) | 119.29 µs | 88.65 µs | 86.01 µs | 83.68 µs |

4 attributes per entry (most common OTel workload):

| Benchmark | std SipHash | ahash | foldhash | rapidhash |
|---|---|---|---|---|
| hash_only | 58.75 µs | 31.70 µs | 26.38 µs | 30.47 µs |
| lookup_hit | 105.94 µs | 83.71 µs | 71.30 µs | 71.81 µs |
| lookup_miss | 4.14 µs | 2.35 µs | 1.79 µs | 2.15 µs |
| insert | 239.58 µs | 209.55 µs | 196.81 µs | 195.33 µs |
| mixed_rw (ValueMap hot path) | 220.08 µs | 173.29 µs | 155.30 µs | 153.91 µs |

8 attributes per entry:

| Benchmark | std SipHash | ahash | foldhash | rapidhash |
|---|---|---|---|---|
| hash_only | 121.79 µs | 76.55 µs | 49.74 µs | 66.36 µs |
| lookup_hit | 211.04 µs | 185.81 µs | 138.23 µs | 147.51 µs |
| lookup_miss | 7.93 µs | 5.01 µs | 3.29 µs | 4.37 µs |
| insert | 447.65 µs | 422.34 µs | 368.78 µs | 393.28 µs |
| mixed_rw (ValueMap hot path) | 432.21 µs | 380.27 µs | 300.27 µs | 317.48 µs |

Summary: mixed_rw (models the ValueMap hot path)

| Attributes | std SipHash | ahash | foldhash | rapidhash |
|---|---|---|---|---|
| 2 | 119.29 µs | 88.65 µs (-26%) | 86.01 µs (-28%) | 83.68 µs (-30%) |
| 4 | 220.08 µs | 173.29 µs (-21%) | 155.30 µs (-29%) | 153.91 µs (-30%) |
| 8 | 432.21 µs | 380.27 µs (-12%) | 300.27 µs (-31%) | 317.48 µs (-27%) |
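The percentages in the summary are the relative change of each hasher against the std SipHash baseline, i.e. (candidate - baseline) / baseline. The sketch below recomputes them from the mixed_rw timings in the table; the function and constant names are illustrative only.

```rust
/// Relative change of `candidate_us` versus `baseline_us`, in percent.
/// Negative values mean the candidate is faster than the baseline.
fn percent_change(baseline_us: f64, candidate_us: f64) -> f64 {
    (candidate_us - baseline_us) / baseline_us * 100.0
}

/// mixed_rw timings in µs, copied from the summary table:
/// (attributes, std SipHash, ahash, foldhash, rapidhash).
const MIXED_RW: [(u32, f64, f64, f64, f64); 3] = [
    (2, 119.29, 88.65, 86.01, 83.68),
    (4, 220.08, 173.29, 155.30, 153.91),
    (8, 432.21, 380.27, 300.27, 317.48),
];

/// Rounded percent changes (attributes, ahash, foldhash, rapidhash) for each row.
fn summary_rows() -> Vec<(u32, i64, i64, i64)> {
    MIXED_RW
        .iter()
        .map(|&(attrs, sip, ahash, fold, rapid)| {
            (
                attrs,
                percent_change(sip, ahash).round() as i64,
                percent_change(sip, fold).round() as i64,
                percent_change(sip, rapid).round() as i64,
            )
        })
        .collect()
}
```

For example, foldhash at 8 attributes works out to (300.27 - 432.21) / 432.21 ≈ -30.5%, which rounds to the -31% shown.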

foldhash was chosen over rapidhash because:

  • Faster raw hashing at 4+ attributes: 13% faster at 4 attrs, 25% faster at 8 attrs
  • Wins decisively at 8 attributes across all benchmarks (5-25% faster than rapidhash)
  • Essentially tied at 2-4 attrs on the mixed_rw hot path (within 1-3%)
  • Better lookup_miss performance across all sizes (8-25% faster than rapidhash, per the tables above)
  • hashbrown's default hasher: foldhash replaced ahash as the default hasher in hashbrown, the crate that backs Rust's std HashMap (rust-lang/hashbrown#563), so it has been heavily vetted for mixed read/write workloads. (std's own HashMap still defaults to SipHash.)
  • Scales better with key size — the advantage grows as attribute count increases
  • Zero dependencies, pure Rust

Ecosystem context: ahash is no longer best-in-class

The non-cryptographic hashing landscape has shifted significantly:

  • foldhash replaced ahash as the default hasher in hashbrown (rust-lang/hashbrown#563)
  • ahash has known unresolved performance regressions: ~40% on Apple M1 (#194) and 73-151% on AMD Zen with target-cpu=native (#190)
  • ahash maintenance has slowed: no throughput-focused optimization merged in 2024-2025; three VAES PRs (#144, #186, #187) stalled for 2+ years despite the Rust AVX-512 intrinsics blocker being resolved in Rust 1.89

Design decisions

  • Feature-gated (metrics-use-foldhash): Avoids adding a mandatory dependency. Users who want maximum metrics throughput can opt in.
  • Default is std SipHash: Safe, zero-dependency default. The performance difference only matters in high-cardinality metrics workloads.

Changes

  • opentelemetry-sdk/Cargo.toml: Add foldhash as optional dependency, add metrics-use-foldhash feature flag
  • opentelemetry-sdk/src/metrics/internal/mod.rs: Conditionally use foldhash::fast::RandomState when feature is enabled
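A minimal sketch of how the conditional hasher selection described above might look, assuming a `cfg`-gated type alias. Only the feature name and `foldhash::fast::RandomState` come from this PR description; `TrackerHasher`, `Trackers`, `new_trackers`, and the simplified `AttributeSet` key are hypothetical names for illustration, and the actual diff may be structured differently.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// With the opt-in feature enabled, use foldhash's randomly seeded fast hasher.
#[cfg(feature = "metrics-use-foldhash")]
type TrackerHasher = foldhash::fast::RandomState;

// Otherwise keep the standard library default hasher (SipHash).
#[cfg(not(feature = "metrics-use-foldhash"))]
type TrackerHasher = std::collections::hash_map::RandomState;

/// Simplified stand-in for the SDK's `Vec<KeyValue>` key (assumption).
type AttributeSet = Vec<(String, String)>;

/// Hypothetical alias for the map behind `ValueMap::trackers`.
type Trackers<A> = HashMap<AttributeSet, Arc<A>, TrackerHasher>;

/// Creates an empty tracker map using whichever hasher was selected at compile time.
fn new_trackers<A>() -> Trackers<A> {
    HashMap::with_hasher(TrackerHasher::default())
}
```

Because both hasher types implement `BuildHasher` and `Default`, the rest of the code can stay identical on either path; only the alias changes.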

Test plan

  • cargo check --features metrics (default path, std HashMap)
  • cargo check --features metrics-use-foldhash (foldhash path)
  • CI passes on all platforms

Refs: #3371

@bryantbiggs bryantbiggs requested a review from a team as a code owner February 24, 2026 20:01
@bryantbiggs bryantbiggs force-pushed the worktree-sorted-only-valuemap branch from 159dab5 to f80ee5d Compare February 24, 2026 20:03
autobenches = false

[dependencies]
ahash = "0.8"
Member

We try to keep dependencies to an absolute minimum, so if we choose to do this, it must be via an opt-in feature flag, so users can knowingly opt in.

Contributor Author

@bryantbiggs bryantbiggs Feb 24, 2026

perfectly reasonable! if we do go forward, any thoughts on feature name? alt_hasher? ahash? (naming is hard 😅 )

Member

Tagging @utpilla for thoughts as he might have explored this before.

Any prior art in similar crates to steal feature names from? Perhaps metrics-use-ahash (to indicate this is specifically for metrics).

@cijothomas
Member

> ahash is already a transitive dependency via indexmap

We don't use indexmap now. We used to, but removed long ago.

> unnecessary here since ValueMap is pub(crate) and keys are not attacker-controlled.

A lot of users provide keys and values from incoming requests, etc., so we should be safe by default. If a user explicitly opts in to ahash, then we can make this change.

Member

@cijothomas cijothomas left a comment

Thanks for striving to improve metrics perf. As noted in my comments, I am okay with this change if we feature gate it, so users are explicitly opting into this.

@utpilla might have considered/tried this before, but we didn't quite end up adding it - Would want to get his thoughts too.

@bryantbiggs bryantbiggs force-pushed the worktree-sorted-only-valuemap branch from f80ee5d to 5e4684b Compare February 24, 2026 21:01
@bryantbiggs bryantbiggs changed the title from "perf(sdk): use ahash for ValueMap HashMaps in metrics hot path" to "perf(sdk): add optional rapidhash for ValueMap HashMaps in metrics hot path" Feb 24, 2026
@bryantbiggs bryantbiggs force-pushed the worktree-sorted-only-valuemap branch 2 times, most recently from 92730a0 to 916519b Compare February 24, 2026 21:12
@bryantbiggs
Contributor Author

bryantbiggs commented Feb 24, 2026

Taking a deeper look: ahash is not as well maintained anymore, and there are better, more modern alternatives that are actively maintained. It's somewhat of a toss-up between foldhash and rapidhash, with foldhash performing better with more attributes. I am less familiar with how many attributes are commonly used, so I will defer to you all on which one we should add behind a feature flag.

@bryantbiggs bryantbiggs force-pushed the worktree-sorted-only-valuemap branch from 916519b to d58b454 Compare February 24, 2026 21:21
@bryantbiggs bryantbiggs changed the title from "perf(sdk): add optional rapidhash for ValueMap HashMaps in metrics hot path" to "perf(sdk): add optional foldhash for ValueMap HashMaps in metrics hot path" Feb 24, 2026
@codecov

codecov bot commented Feb 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.3%. Comparing base (dba1820) to head (d58b454).

Additional details and impacted files
@@          Coverage Diff          @@
##            main   #3388   +/-   ##
=====================================
  Coverage   82.3%   82.3%           
=====================================
  Files        128     128           
  Lines      24612   24617    +5     
=====================================
+ Hits       20257   20262    +5     
  Misses      4355    4355           

@utpilla
Contributor

utpilla commented Feb 24, 2026

> SipHash's HashDoS resistance is unnecessary here since ValueMap is pub(crate) and keys (Vec<KeyValue>) are not attacker-controlled.

That's not entirely true. While ValueMap is pub(crate), the measurement recording APIs are public, and the SDK simply stores whatever dimensions the calling application provides. If an application passes end-user input directly as metric attributes, those values flow straight into our hash map. Whether that's a realistic threat depends entirely on how the application is instrumented, and as an SDK we can't know that. Given the SDK is consumed across a wide range of unknown environments and use cases, I think we should be conservative here and not dismiss the HashDoS risk outright.

That said, I do acknowledge that most deployments are probably low-risk, dimension values typically come from controlled sources, and our cardinality capping provides an additional layer of protection. I'm supportive of giving users a performance escape hatch, I just want us to be thoughtful about how we expose it.

On the implementation approach: my concern with a crate-specific feature flag is maintenance. If we add foldhash today, we'll get requests for some other hasher tomorrow, and whatever comes after that. Over time, we could end up with a growing list of feature flags to maintain and test, each one a new dependency to vet.

A cleaner approach would be to make the internal storage generic over BuildHasher, similar to how HashMap itself works. The default stays as RandomState (SipHash), keeping the safe behavior for users who don't opt in, but users who want to bring their own hasher (foldhash, ahash, or anything else) can do so without us needing to take on additional dependencies or feature flags. This does mean the generic parameter would need to thread through to MeterProvider and related types, which is a non-trivial change, but I think it's worth exploring. @bryantbiggs would you be interested in checking the feasibility of this approach?
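To make the shape of that proposal concrete, here is a simplified sketch of storage that is generic over `BuildHasher` with `RandomState` as the default, mirroring `std::collections::HashMap`. This is an assumption about how it could look, not the SDK's code: this `ValueMap` is a reduced stand-in, `AttributeSet` replaces `Vec<KeyValue>`, and `Fnv1a` is a toy caller-supplied hasher (not DoS resistant) included only to show the bring-your-own-hasher path.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, BuildHasherDefault, Hasher};
use std::sync::{Arc, RwLock};

/// Simplified stand-in for the SDK's `Vec<KeyValue>` key (assumption).
type AttributeSet = Vec<(String, String)>;

/// Reduced, hypothetical ValueMap: generic over the hasher `S`, defaulting to SipHash.
struct ValueMap<A, S = RandomState> {
    trackers: RwLock<HashMap<AttributeSet, Arc<A>, S>>,
}

impl<A: Default> ValueMap<A, RandomState> {
    /// Safe-by-default constructor: uses std's randomly seeded SipHash.
    fn new() -> Self {
        Self::with_hasher(RandomState::new())
    }
}

impl<A: Default, S: BuildHasher> ValueMap<A, S> {
    /// Opt-in constructor: the caller supplies the hasher.
    fn with_hasher(hasher: S) -> Self {
        Self { trackers: RwLock::new(HashMap::with_hasher(hasher)) }
    }

    /// Returns the tracker for `attrs`, creating it on first use.
    fn tracker(&self, attrs: &[(String, String)]) -> Arc<A> {
        // Fast path: shared read lock, no allocation.
        if let Some(existing) = self.trackers.read().unwrap().get(attrs) {
            return Arc::clone(existing);
        }
        // Slow path: exclusive lock; `entry` re-checks in case another thread inserted.
        let mut map = self.trackers.write().unwrap();
        let tracker = map.entry(attrs.to_vec()).or_default().clone();
        tracker
    }
}

/// Toy FNV-1a hasher standing in for a user-provided fast hasher.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

/// A ValueMap whose hasher was chosen by the caller rather than by the SDK.
type FnvValueMap<A> = ValueMap<A, BuildHasherDefault<Fnv1a>>;
```

A user would write `ValueMap::new()` for the default behavior, or `FnvValueMap::with_hasher(BuildHasherDefault::default())` to opt in. The cost noted in the comment above still applies: in a real SDK the `S` parameter would have to be threaded through every type that owns such a map.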

If we do decide to go with a feature flag approach in the meantime, I'd strongly prefer the feature name include an unsafe prefix, something like unsafe-metrics-foldhash and that the docs clearly explain what the unsafe part is (no HashDoS protection), so users understand they're making a deliberate security tradeoff and not just flipping an easy performance switch.

@bryantbiggs
Contributor Author

bryantbiggs commented Feb 24, 2026

> A cleaner approach would be to make the internal storage generic over BuildHasher, similar to how HashMap itself works. The default stays as RandomState (SipHash), keeping the safe behavior for users who don't opt in, but users who want to bring their own hasher (foldhash, ahash, or anything else) can do so without us needing to take on additional dependencies or feature flags. This does mean the generic parameter would need to thread through to MeterProvider and related types, which is a non-trivial change, but I think it's worth exploring. @bryantbiggs would you be interested in checking the feasibility of this approach?

Yes, I can take a look and propose something.

However, it does feel like a piece of functionality that most users will never discover. If most users don't know they can improve performance by supplying a different hash algorithm, then I would argue the change delivers little value.

@bryantbiggs
Contributor Author

I dug into how every other major OTel SDK handles hashing for this same data structure, the per-instrument map from attribute sets to aggregation buckets that is consulted on every counter.add() / histogram.record() call:

| Language | Data Structure | Hash Algorithm | HashDoS Resistant? | Seed |
|---|---|---|---|---|
| Go | sync.Map keyed by Distinct{uint64} in limitedSyncMap | xxHash64 (cespare/xxhash/v2) | No | Fixed zero |
| Java | ConcurrentHashMap<Attributes, AggregatorHandle> in DefaultSynchronousMetricStorage | Arrays.hashCode() (31×x polynomial) | No | Deterministic |
| C++ | unordered_map<MetricAttributes, unique_ptr<Aggregation>> in AttributesHashMap | Boost hash_combine over std::hash | No | Fixed zero |
| Python | dict with frozenset keys in _ViewInstrumentMatch | Python's built-in SipHash | Yes | Per-process |
| .NET 6+ | ConcurrentDictionary<Tags, int> in AggregatorStore | System.HashCode (xxHash32) + Marvin32 | Yes | Per-process |
| Rust (current) | RwLock<HashMap<Vec<KeyValue>, Arc<A>>> in ValueMap | SipHash-1-3 | Yes | Per-instance |

Key takeaways:

  • 3 of 5 other SDKs use non-DoS-resistant hashing as their only option. Python and .NET only get protection because their runtimes provide it, not from deliberate OTel decisions.
  • Go actively chose xxHash64 with zero seed for this path 2 months ago (PR #7497, v1.39.0), with Prometheus maintainers in the review. They even rejected collision detection because it doubled the cost of measure operations.
  • Go goes further than this PR — it uses only the hash as the map key (Distinct{uint64}), accepting silent data loss on collision. With foldhash, Rust's HashMap still does full Vec<KeyValue> equality comparison, so correctness is always preserved.
  • Zero HashDoS issues exist across the entire open-telemetry org. No spec or SIG Security guidance on hash algorithm selection. No documented real-world HashDoS attacks on observability systems.
  • All SDKs rely on cardinality limits (default 2000) as the primary defense.
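The difference between keying a map by the full attribute set and keying it only by the hash can be shown with a deliberately worst-case hasher. In this sketch (all names are hypothetical, and `AttributeSet` is a simplified stand-in for `Vec<KeyValue>`), every key hashes to 0. A `HashMap` keyed by the full attribute set still keeps distinct sets apart because it compares keys for equality after hashing; collisions only cost time. A map keyed by the hash value alone merges them.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasher, BuildHasherDefault, Hasher};

/// Worst-case hasher for illustration: every key hashes to 0, so all keys collide.
#[derive(Default)]
struct AlwaysCollide;

impl Hasher for AlwaysCollide {
    fn write(&mut self, _bytes: &[u8]) {}
    fn finish(&self) -> u64 {
        0
    }
}

/// Simplified stand-in for the SDK's `Vec<KeyValue>` key (assumption).
type AttributeSet = Vec<(String, String)>;

/// Counts measurements per attribute set. Keys are full attribute sets,
/// so colliding hashes slow lookups down but never mix up two sets.
fn count_with_full_keys(
    measurements: &[AttributeSet],
) -> HashMap<AttributeSet, u64, BuildHasherDefault<AlwaysCollide>> {
    let mut counts = HashMap::default();
    for attrs in measurements {
        *counts.entry(attrs.clone()).or_insert(0) += 1;
    }
    counts
}

/// Counts measurements in a map keyed only by the hash value.
/// Two attribute sets with the same hash silently share one counter.
fn count_by_hash_only(measurements: &[AttributeSet]) -> HashMap<u64, u64> {
    let build = BuildHasherDefault::<AlwaysCollide>::default();
    let mut counts = HashMap::new();
    for attrs in measurements {
        *counts.entry(build.hash_one(attrs)).or_insert(0) += 1;
    }
    counts
}
```

With three measurements over two distinct attribute sets, the first function reports two entries and the second reports one. This is also the mechanism behind HashDoS: with full-key maps an attacker who can force collisions degrades lookup time rather than correctness.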

On the BuildHasher generic approach — happy to explore as a follow-up, but I'd rather not block this on a larger API change that threads a type parameter through MeterProvider. No other OTel SDK offers configurable hashing, and the complexity may not be justified for a risk that hasn't materialized anywhere.

On naming — I'd push back on the unsafe- prefix. Go/Java/C++ ship non-resistant hashing as their only option without such labeling. metrics-foldhash with clear docs explaining the tradeoff feels more appropriate than implying a known exploitable vulnerability.

@cijothomas
Member

@bryantbiggs @utpilla The "unsafe" prefix feels a bit aggressive and can incorrectly imply unsafe blocks. I'm okay without the unsafe prefix, as long as the docs cover the risks.

(Btw, I don't think we can just follow what Go or C++ is doing, or go easy just because no documented exploit has occurred. As a foundation library, we need to be safe by default; if a particular user knows their scenario won't result in exploitation, then they can opt in.)
