feat: Support defining custom MetricValues in PhysicalPlans #16195

sfluor · 2025-05-27T11:43:09Z

Which issue does this PR close?

Closes #16044 (Support distribution as a MetricValue in ExecutionPlan).

Rationale for this change

See this issue: #16044

The MetricValue enum currently exposes only single-value statistics: counts, gauges, timers, timestamps, and a few hard-coded variants such as SpillCount or OutputRows.

However, custom PhysicalPlans often need custom metrics. At Datadog, for instance, we needed to track the distribution of latencies of the sub-queries issued by a given physical plan in order to pinpoint outliers.

Similarly, tracking the top-N slowest sub-queries has been quite useful for debugging slow queries.

This PR allows each user to define their own MetricValue types as long as they are aggregatable. A very basic example is included in the PR using a custom counter.
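
To make "aggregatable" concrete, here is a minimal sketch of a top-N latency metric that can be merged across partitions. It is only an illustration: `TopNLatencies`, `record`, `snapshot`, and `merged_top_two` are hypothetical names, the latencies are made-up values, and the type does not implement the DataFusion trait.

```rust
use std::sync::Mutex;

/// Hypothetical metric that keeps the `n` largest latencies (in milliseconds).
#[derive(Debug)]
pub struct TopNLatencies {
    n: usize,
    // Kept sorted in descending order and never longer than `n`.
    slowest: Mutex<Vec<u64>>,
}

impl TopNLatencies {
    pub fn new(n: usize) -> Self {
        Self { n, slowest: Mutex::new(Vec::new()) }
    }

    /// Records one observed latency, keeping only the `n` largest values.
    pub fn record(&self, latency_ms: u64) {
        let mut slowest = self.slowest.lock().unwrap();
        slowest.push(latency_ms);
        slowest.sort_unstable_by(|a, b| b.cmp(a));
        slowest.truncate(self.n);
    }

    /// Merges the values seen by another instance into this one.
    pub fn aggregate(&self, other: &TopNLatencies) {
        // Copy first so that only one lock is held at a time.
        for latency_ms in other.snapshot() {
            self.record(latency_ms);
        }
    }

    /// Returns a copy of the current top-N values, slowest first.
    pub fn snapshot(&self) -> Vec<u64> {
        self.slowest.lock().unwrap().clone()
    }
}

/// Two partitions record latencies independently and are then merged.
pub fn merged_top_two() -> Vec<u64> {
    let a = TopNLatencies::new(2);
    let b = TopNLatencies::new(2);
    for ms in [12, 250, 40] {
        a.record(ms);
    }
    for ms in [900, 7] {
        b.record(ms);
    }
    a.aggregate(&b);
    a.snapshot()
}
```

Merging is order-independent here, which is the property that makes a value safe to aggregate across partitions.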

What changes are included in this PR?

Adds a new variant to the MetricValue enum for defining custom aggregatable metric values.

Are these changes tested?

I added two basic tests to show the usage and ensure it is working as expected for a basic custom counter type.
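
For readers without the diff at hand, a custom counter along these lines could look roughly like the sketch below. The trait is a local stand-in reduced to two methods, not the DataFusion definition, and `CustomCounter::new`, `aggregated_count`, and the inputs 11 and 20 are assumptions made for the example.

```rust
use std::any::Any;
use std::fmt::{self, Debug, Display};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Local stand-in for the trait under review (NOT the DataFusion definition).
pub trait CustomMetricValue: Display + Debug + Send + Sync {
    /// Folds `other` into `self`.
    fn aggregate(&self, other: Arc<dyn CustomMetricValue>);
    /// Allows downcasting to the concrete type.
    fn as_any(&self) -> &dyn Any;
}

#[derive(Debug)]
pub struct CustomCounter {
    count: AtomicU64,
}

impl CustomCounter {
    pub fn new(initial: u64) -> Self {
        Self { count: AtomicU64::new(initial) }
    }
}

impl Display for CustomCounter {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "count: {}", self.count.load(Ordering::Relaxed))
    }
}

impl CustomMetricValue for CustomCounter {
    fn aggregate(&self, other: Arc<dyn CustomMetricValue>) {
        // Only merge values of the same concrete type; anything else is ignored.
        if let Some(other) = other.as_any().downcast_ref::<CustomCounter>() {
            self.count
                .fetch_add(other.count.load(Ordering::Relaxed), Ordering::Relaxed);
        }
    }

    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Aggregates two counters and returns the combined count.
pub fn aggregated_count() -> u64 {
    let left = CustomCounter::new(11);
    let right: Arc<dyn CustomMetricValue> = Arc::new(CustomCounter::new(20));
    left.aggregate(right);
    left.count.load(Ordering::Relaxed)
}
```

Silently ignoring a mismatched type in `aggregate` is a choice made for this sketch; an implementation could equally log or debug-assert in that branch.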

Are there any user-facing changes?

None

LiaCastaneda

Was passing by to take a look, this lgtm 👍

LiaCastaneda · 2025-05-27T13:01:57Z

datafusion/physical-plan/src/metrics/value.rs

                write!(f, "{timestamp}")
            }
+            Self::Custom { name, value } => {
+                write!(f, "name:{name} {value:?}")


I think we can also require a Display for Custom?

LiaCastaneda · 2025-05-27T13:14:49Z

datafusion/physical-plan/src/metrics/value.rs

+
+        custom_val.aggregate(&other_custom_val);
+
+        if let MetricValue::Custom { value, .. } = custom_val {


nit: I think this can also be written as

    if let MetricValue::Custom { value, .. } = custom_val {
        let counter = value
            .as_any()
            .downcast_ref::<CustomCounter>()
            .expect("Expected CustomCounter");
        assert_eq!(counter.count.load(Ordering::Relaxed), 31);
    }

since custom_val will always be MetricValue::Custom.

gabotechs

Nice initiative! I imagine that this would be generally very useful. It would be nice to have a clear API that is as small and well documented as possible. I think the proposal here goes in the right direction (I'll try a couple of things, but for now I have no better idea than this).

gabotechs · 2025-05-27T16:03:42Z

datafusion/physical-plan/src/metrics/value.rs

+            Self::Custom { name, value } => {
+                panic!("MetricValue::as_usize isn't supported for custom metric values. ({name}: {value:?})")
+            }


Maybe some CustomMetricValue implementations would like to implement as_usize(&self)?

Added the option to implement it, with a default that still panics 👍

alamb · 2025-06-01T15:18:39Z

Thank you @sfluor for this PR

@gabotechs / @LiaCastaneda please ping me when you think this PR is ready for a review / merge

Thank you for the help getting it ready

gabotechs

Any opinions about https://github.com/apache/datafusion/pull/16195/files#r2109550168?

gabotechs · 2025-06-02T09:49:09Z

datafusion/physical-plan/src/metrics/value.rs

+/// ```
+///
+/// [`MetricValue::Custom`]: super::MetricValue::Custom
+pub trait CustomMetricValue: Display + Debug + Send + Sync {


As this new type of metric is more meaty than the other ones, and we can expect people to come here looking at the docs, what do you think about factoring it out to its own file?

metrics/
  mod.rs
  baseline.rs
  builder.rs
+ custom.rs
  value.rs

This file is also dangerously approaching the 1000 LOC mark, so it will play in maintainability's favor

gabotechs · 2025-06-02T09:56:14Z

datafusion/physical-plan/src/metrics/value.rs

                .map(|nanos| nanos as usize)
                .unwrap_or(0),
+            Self::Custom { name, value } => {
+                value.as_usize().unwrap_or_else(|| panic!("MetricValue::as_usize isn't supported for custom metric values. ({name}: {value:?})"))


Usually panicking is avoided as much as possible, but this does not seem to be the only place where it can happen, and I have no better alternative that doesn't change several function signatures. I think it's acceptable, but I'd weigh other contributors' opinions on this, if any.

why not use .unwrap_or(0) ?
edit: I guess it's unconventional given that StartTimestamp & EndTimestamp default to 0

I agree the panic should be removed, as it leaves an API that is tricky to use correctly: if users don't implement CustomMetricValue::as_usize then their query may panic.

One way to avoid the panic would be to change the signature of as_usize to return usize rather than Option<usize> and provide a default value -- maybe 0

In addition to returning usize, we could also avoid providing a default implementation of as_usize and force the implementers to decide explicitly

sfluor · 2025-06-03T09:16:53Z

I have addressed the remaining comments cc @LiaCastaneda / @gabotechs / @alamb

gabotechs

Nice! I think this one is ready for a review cc @alamb

gabotechs · 2025-06-03T10:44:38Z

datafusion/physical-plan/src/metrics/value.rs

+                if name != other_name {
+                    panic!(
+                        "Unsupported metric aggregation between {name} and {other_name}"
+                    )
+                }
+


Other metrics do not seem to apply this restriction. To be consistent with them, maybe we can relax it here?

alamb

Thank you @sfluor @gabotechs and @LiaCastaneda -- I think this is quite close

The only thing I think we should fix prior to merge is to remove the panic.

Another thing that would be great is a more integration-style test that shows a CustomMetricValue being used in an overall SQL query plan somehow (maybe we could wrap an existing ExecutionPlan with one that captures more metrics or something). This could be done as a follow-on PR when someone has the time (we can file a follow-on ticket).

Also, the PR description still says

Are there any user-facing changes?

There's one breaking change which is that MetricValue isn't PartialEq anymore (to avoid having this constraint on CustomMetric values).

Which I don't think is accurate anymore so perhaps we can update that

alamb · 2025-06-04T10:43:58Z

datafusion/physical-plan/src/metrics/value.rs

                .map(|nanos| nanos as usize)
                .unwrap_or(0),
+            Self::Custom { name, value } => {
+                value.as_usize().unwrap_or_else(|| panic!("MetricValue::as_usize isn't supported for custom metric values. ({name}: {value:?})"))


I agree the panic should be removed, as it leaves an API that is tricky to use correctly: if users don't implement CustomMetricValue::as_usize then their query may panic.

One way to avoid the panic would be to change the signature of as_usize to return usize rather than Option<usize> and provide a default value -- maybe 0

In addition to returning usize, we could also avoid providing a default implementation of as_usize and force the implementers to decide explicitly

sfluor · 2025-06-04T13:28:45Z

Thanks for the review @alamb ! I removed the panic and went with a default value of 0
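
A minimal sketch of that shape, assuming a stand-in trait reduced to the one method under discussion (the real trait has more methods and bounds). `LatencyHistogram`, `RetryCount`, and `as_usize_values` are hypothetical names used only for illustration.

```rust
use std::fmt::Debug;

/// Stand-in trait reduced to the method under discussion.
pub trait CustomMetricValue: Debug {
    /// Defaults to 0 so that generic callers never panic.
    fn as_usize(&self) -> usize {
        0
    }
}

/// Hypothetical metric with no meaningful single-number summary:
/// it relies on the default.
#[derive(Debug)]
pub struct LatencyHistogram;

impl CustomMetricValue for LatencyHistogram {}

/// Hypothetical metric that does have a natural `usize` value:
/// it overrides the default.
#[derive(Debug)]
pub struct RetryCount(pub usize);

impl CustomMetricValue for RetryCount {
    fn as_usize(&self) -> usize {
        self.0
    }
}

/// Returns the `as_usize` results for the default and the overriding type.
pub fn as_usize_values() -> (usize, usize) {
    (LatencyHistogram.as_usize(), RetryCount(3).as_usize())
}
```

The trade-off raised in the review still applies: with a default, a metric that forgets to override `as_usize` reports 0 rather than failing loudly.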

alamb

Thanks everyone -- this looks great to me

sfluor · 2025-06-05T07:57:31Z

There was one remaining test failing, I fixed it and updated the PR to the latest branch. Should it target main or the 47 branch ?

gabotechs · 2025-06-05T08:10:13Z

Should it target main or the 47 branch ?

The main branch is the good one (I don't think the branch-47 is the most recent release branch anyway)

alamb · 2025-06-05T14:11:24Z

Should it target main or the 47 branch ?

The main branch is the good one (I don't think the branch-47 is the most recent release branch anyway)

yeah, let's target the main branch

alamb · 2025-06-05T18:18:49Z

Let's wait to merge this PR until we ship DataFusion 48 to limit the breaking changes

Release DataFusion 48.0.0 (June 2025) #15771

I think we'll be able to merge this in the next few days

alamb · 2025-06-06T20:18:17Z

We have now made the release-48 branch so what is merged into main will be released as part of DataFusion 49.0.0

alamb · 2025-06-06T20:18:27Z

🚀

NGA-TRAN · 2025-06-09T13:31:59Z

datafusion/physical-plan/src/metrics/custom.rs

+
+    /// Compares this value with another custom value.
+    fn is_eq(&self, other: &Arc<dyn CustomMetricValue>) -> bool;
+}


This is a very simple trait with great example in the comment. I also like the test. Thanks @sfluor
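
For context, one plausible way to implement `is_eq` is to downcast the other value and compare fields. The sketch below uses a stand-in trait that keeps only `as_any` and the `is_eq` signature quoted above. `Gauge`, `Label`, and `compare` are hypothetical names, and this is not necessarily how the PR's own example does it.

```rust
use std::any::Any;
use std::fmt::Debug;
use std::sync::Arc;

/// Stand-in trait keeping only the two methods needed for the comparison.
pub trait CustomMetricValue: Debug + Send + Sync {
    fn as_any(&self) -> &dyn Any;
    /// Compares this value with another custom value.
    fn is_eq(&self, other: &Arc<dyn CustomMetricValue>) -> bool;
}

/// Hypothetical metric types used only for this illustration.
#[derive(Debug)]
pub struct Gauge(pub i64);

#[derive(Debug)]
pub struct Label(pub &'static str);

impl CustomMetricValue for Gauge {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn is_eq(&self, other: &Arc<dyn CustomMetricValue>) -> bool {
        // Equal only if `other` is also a `Gauge` holding the same value.
        other
            .as_any()
            .downcast_ref::<Gauge>()
            .map_or(false, |o| o.0 == self.0)
    }
}

impl CustomMetricValue for Label {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn is_eq(&self, other: &Arc<dyn CustomMetricValue>) -> bool {
        other
            .as_any()
            .downcast_ref::<Label>()
            .map_or(false, |o| o.0 == self.0)
    }
}

/// Compares a gauge against an equal gauge, a different gauge, and another type.
pub fn compare() -> (bool, bool, bool) {
    let gauge = Gauge(5);
    let same: Arc<dyn CustomMetricValue> = Arc::new(Gauge(5));
    let different: Arc<dyn CustomMetricValue> = Arc::new(Gauge(6));
    let other_type: Arc<dyn CustomMetricValue> = Arc::new(Label("5"));
    (gauge.is_eq(&same), gauge.is_eq(&different), gauge.is_eq(&other_type))
}
```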

github-actions bot added the physical-plan Changes to the physical-plan crate label May 27, 2025

sfluor force-pushed the sami/expose-custom-metric-value branch from 201ce87 to 7d35c02 Compare May 27, 2025 12:30

sfluor marked this pull request as ready for review May 27, 2025 12:31

sfluor mentioned this pull request May 27, 2025

Support distribution as a MetricValue in ExecutionPlan #16044

Closed

LiaCastaneda reviewed May 27, 2025

sfluor force-pushed the sami/expose-custom-metric-value branch 2 times, most recently from b2f0c90 to 0942254 Compare May 27, 2025 15:48

gabotechs reviewed May 27, 2025

sfluor force-pushed the sami/expose-custom-metric-value branch 2 times, most recently from 3cbdc68 to 72bdb36 Compare May 28, 2025 07:48

gabotechs reviewed Jun 2, 2025

sfluor force-pushed the sami/expose-custom-metric-value branch 3 times, most recently from bd5e3f6 to 676cda0 Compare June 3, 2025 09:15

gabotechs approved these changes Jun 3, 2025

sfluor force-pushed the sami/expose-custom-metric-value branch from 676cda0 to 74e51f1 Compare June 3, 2025 11:21

LiaCastaneda approved these changes Jun 4, 2025

alamb reviewed Jun 4, 2025

sfluor force-pushed the sami/expose-custom-metric-value branch from 74e51f1 to e14e66f Compare June 4, 2025 13:28

alamb approved these changes Jun 5, 2025

sfluor force-pushed the sami/expose-custom-metric-value branch from e14e66f to b3e396b Compare June 5, 2025 07:47

Merge branch 'main' into sami/expose-custom-metric-value

522092b

alamb merged commit fbafea4 into apache:main Jun 6, 2025
27 checks passed

NGA-TRAN reviewed Jun 9, 2025

gabotechs mentioned this pull request Jun 12, 2025

[branch-48] Custom Metrics DataDog/datafusion#30

Merged

gabotechs added a commit to DataDog/datafusion that referenced this pull request Jun 12, 2025

feat: Support defining custom MetricValues in PhysicalPlans (apache#1…

d2745b8

…6195) (#30)

