Skip to content

Conversation

@sfluor
Copy link
Contributor

@sfluor sfluor commented May 27, 2025

Which issue does this PR close?

Rationale for this change

See this issue: #16044

The MetricValue enum currently exposes only single-value statistics: counts, gauges, timers, timestamps, and a few hard-coded variants such as SpillCount or OutputRows.

However there's often the need for custom metrics when using custom PhysicalPlans. At Datadog for instance we had the need for tracking the distribution of latencies of the sub-queries issued by a given phyiscal plan to be able to pin-point outliers.

Similarly tracking the topN slowest sub-query is something that has been quite useful to help us debug slow queries.

This PR allows each user to define their own MetricValue types as long as they are aggregatable. A very basic example is included in the PR using a custom counter.

What changes are included in this PR?

Includes a new enum-type in MetricValue to define custom aggregatable metric values.

Are these changes tested?

I added two basic tests to show the usage and ensure it is working as expected for a basic custom counter type.

Are there any user-facing changes?

None

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label May 27, 2025
@sfluor sfluor force-pushed the sami/expose-custom-metric-value branch from 201ce87 to 7d35c02 Compare May 27, 2025 12:30
@sfluor sfluor marked this pull request as ready for review May 27, 2025 12:31
Copy link
Contributor

@LiaCastaneda LiaCastaneda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was passing by to take a look, this lgtm 👍

write!(f, "{timestamp}")
}
Self::Custom { name, value } => {
write!(f, "name:{name} {value:?}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can also require a Display for Custom?


custom_val.aggregate(&other_custom_val);

if let MetricValue::Custom { value, .. } = custom_val {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this can also be written as
if let MetricValue::Custom { value, .. } = custom_val {
let counter = value
.as_any()
.downcast_ref::()
.expect("Expected CustomCounter");
assert_eq!(counter.count.load(Ordering::Relaxed), 31);
}. since custom_val will always be MetricValue::Custom

@sfluor sfluor force-pushed the sami/expose-custom-metric-value branch 2 times, most recently from b2f0c90 to 0942254 Compare May 27, 2025 15:48
Copy link
Contributor

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice initiative! I imagine that this would be generally very useful. It would be nice to have a clear API as small an well documented as possible. I think the proposal here goes in the good direction (I'll try a couple of things, but for now I have no other idea better than this)

Comment on lines 463 to 521
Self::Custom { name, value } => {
panic!("MetricValue::as_usize isn't supported for custom metric values. ({name}: {value:?})")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some CustomMetricValue implementations would like to implement as_usize(&self)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the possibility to implement it with a default that still throws 👍

@sfluor sfluor force-pushed the sami/expose-custom-metric-value branch 2 times, most recently from 3cbdc68 to 72bdb36 Compare May 28, 2025 07:48
@alamb
Copy link
Contributor

alamb commented Jun 1, 2025

Thank you @sfluor for this PR

@gabotechs / @LiaCastaneda please ping me when you think this PR is ready for a review / merge

Thank you for the help getting it ready

Copy link
Contributor

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// ```
///
/// [`MetricValue::Custom`]: super::MetricValue::Custom
pub trait CustomMetricValue: Display + Debug + Send + Sync {
Copy link
Contributor

@gabotechs gabotechs Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this new type of metric is more meaty than the other ones, and we can expect people to come here looking at the docs, what do you think about factoring it out to its own file?

metrics
    mod.rs
    baseline.rs
    builder.rs
+   custom.rs
    value.rs

This file is also dangerously approaching the 1000 LOC mark, so it will play in maintainability's favor

.map(|nanos| nanos as usize)
.unwrap_or(0),
Self::Custom { name, value } => {
value.as_usize().unwrap_or_else(|| panic!("MetricValue::as_usize isn't supported for custom metric values. ({name}: {value:?})"))
Copy link
Contributor

@gabotechs gabotechs Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually panicking is avoided as much as possible, but it seems like it's not the only place where this can happen, and I have no better alternative without changing several function signatures. I think it's acceptable, but I'd weight other contributors opinion on this if any.

Copy link
Contributor

@LiaCastaneda LiaCastaneda Jun 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use .unwrap_or(0) ?
edit: I guess its unconventional given that StartTimestamp & EndTimestamp defaults to 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the panic should be removed good as it leaves a tricky to use correctly API: if users don't implement CustomMetric::as_usize then their query may panic.

One way to avoid the panic would be to cange the signature of as_usize to return usize rather than Option<usize> and provide a default value -- maybe 0

In addition to returning usize we could also avoid providing a custom implementation of as_usizeand force the implementers to decide explicitly

@sfluor sfluor force-pushed the sami/expose-custom-metric-value branch 3 times, most recently from bd5e3f6 to 676cda0 Compare June 3, 2025 09:15
@sfluor
Copy link
Contributor Author

sfluor commented Jun 3, 2025

I have addressed the remaining comments cc @LiaCastaneda / @gabotechs / @alamb

Copy link
Contributor

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I think this one is ready for a review cc @alamb

Comment on lines 606 to 611
if name != other_name {
panic!(
"Unsupported metric aggregation between {name} and {other_name}"
)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other metrics do not seem to be applying this restriction, for being consistent with other metrics, maybe we can relax this restriction here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@sfluor sfluor force-pushed the sami/expose-custom-metric-value branch from 676cda0 to 74e51f1 Compare June 3, 2025 11:21
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sfluor @gabotechs and @LiaCastaneda -- I think this is quite close

The only thing I think we should fix prior to merge is remove the panic.

Another thing that would be great would be a more integration style test that showed using a CustomMetric in a overall SQL query plan somehow (maybe we could wrap an existing ExecutionPlan with one that captured more metrics or something. This could be done as a follow on PR when someone has the time (we can file a follow on ticket)

Also, The PR description still says

Are there any user-facing changes?

There's one breaking change which is that MetricValue isn't PartialEq anymore (to avoid having this constraint on CustomMetric values).

Which I don't think is accurate anymore so perhaps we can update that

.map(|nanos| nanos as usize)
.unwrap_or(0),
Self::Custom { name, value } => {
value.as_usize().unwrap_or_else(|| panic!("MetricValue::as_usize isn't supported for custom metric values. ({name}: {value:?})"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the panic should be removed good as it leaves a tricky to use correctly API: if users don't implement CustomMetric::as_usize then their query may panic.

One way to avoid the panic would be to cange the signature of as_usize to return usize rather than Option<usize> and provide a default value -- maybe 0

In addition to returning usize we could also avoid providing a custom implementation of as_usizeand force the implementers to decide explicitly

@sfluor sfluor force-pushed the sami/expose-custom-metric-value branch from 74e51f1 to e14e66f Compare June 4, 2025 13:28
@sfluor
Copy link
Contributor Author

sfluor commented Jun 4, 2025

Thanks for the review @alamb ! I removed the panic and went with a default value of 0

See this issue: apache#16044

The MetricValue enum currently exposes only single-value statistics:
counts, gauges, timers, timestamps, and a few hard-coded variants
such as SpillCount or OutputRows.

However there's often the need for custom metrics when using custom
PhysicalPlans. At Datadog for instance we had the need for tracking the
distribution of latencies of the sub-queries issued by a given phyiscal
plan to be able to pin-point outliers.

Similarly tracking the topN slowest sub-query is something that has been
quite useful to help us debug slow queries.

This PR allows each user to define their own MetricValue types as long
as they are aggregatable. A very basic example is included in the PR
using a custom counter.
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks everyone -- this looks great to me

@sfluor sfluor force-pushed the sami/expose-custom-metric-value branch from e14e66f to b3e396b Compare June 5, 2025 07:47
@sfluor
Copy link
Contributor Author

sfluor commented Jun 5, 2025

There was one remaining test failing, I fixed it and updated the PR to the latest branch. Should it target main or the 47 branch ?

@gabotechs
Copy link
Contributor

gabotechs commented Jun 5, 2025

Should it target main or the 47 branch ?

The main branch is the good one (I don't think the branch-47 is the most recent release branch anyway)

@alamb
Copy link
Contributor

alamb commented Jun 5, 2025

Should it target main or the 47 branch ?

The main branch is the good one (I don't think the branch-47 is the most recent release branch anyway)

yeah, let's target the main branch

@alamb
Copy link
Contributor

alamb commented Jun 5, 2025

Let's wait to merge this PR until we ship DataFusion 48 to limit the breaking changes

I think we'll be able to merge this in the next few days

@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

We have now made the release-48 branch so what is merged into main will be released as part of DataFusion 49.0.0

@alamb alamb merged commit fbafea4 into apache:main Jun 6, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

🚀

kosiew pushed a commit to kosiew/datafusion that referenced this pull request Jun 9, 2025
…6195)

See this issue: apache#16044

The MetricValue enum currently exposes only single-value statistics:
counts, gauges, timers, timestamps, and a few hard-coded variants
such as SpillCount or OutputRows.

However there's often the need for custom metrics when using custom
PhysicalPlans. At Datadog for instance we had the need for tracking the
distribution of latencies of the sub-queries issued by a given phyiscal
plan to be able to pin-point outliers.

Similarly tracking the topN slowest sub-query is something that has been
quite useful to help us debug slow queries.

This PR allows each user to define their own MetricValue types as long
as they are aggregatable. A very basic example is included in the PR
using a custom counter.

/// Compares this value with another custom value.
fn is_eq(&self, other: &Arc<dyn CustomMetricValue>) -> bool;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very simple trait with great example in the comment. I also like the test. Thanks @sfluor

gabotechs pushed a commit to DataDog/datafusion that referenced this pull request Jun 12, 2025
…6195)

See this issue: apache#16044

The MetricValue enum currently exposes only single-value statistics:
counts, gauges, timers, timestamps, and a few hard-coded variants
such as SpillCount or OutputRows.

However there's often the need for custom metrics when using custom
PhysicalPlans. At Datadog for instance we had the need for tracking the
distribution of latencies of the sub-queries issued by a given phyiscal
plan to be able to pin-point outliers.

Similarly tracking the topN slowest sub-query is something that has been
quite useful to help us debug slow queries.

This PR allows each user to define their own MetricValue types as long
as they are aggregatable. A very basic example is included in the PR
using a custom counter.

(cherry picked from commit fbafea4)
gabotechs added a commit to DataDog/datafusion that referenced this pull request Jun 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support distribution as a MetricValue in ExecutionPlan

5 participants