feat: added parallelization on key partitioned data #18919

gene-bordegaray · 2025-11-24T21:44:47Z

Full Report

Issue 18777 Parallelize Key Partitioned Data.pdf

Which issue does this PR close?

Closes Enable Parallel Aggregation for Non-Overlapping Partitioned Data #18777.

Rationale for this change

Optimize aggregations on Hive-partitioned tables by eliminating unnecessary repartitioning/coalescing when grouping by partition columns. This enables parallel computation of complete results without a merge bottleneck.

What changes are included in this PR?

Introduce new partitioning type KeyPartitioned
Save and propagate file partition metadata through query plan
Change aggregation mode selection in physical planner
Update enforce distribution rules to eliminate unnecessary repartitioning

Are these changes tested?

Unit and integration tests added for all new logic

Benchmarking

For tpch it was unaffected as expected (not partitioned):

I create my own benchmark and saw these results:

Benchmarking hive_partitioned_agg/with_key_partitioned: Collecting 100 samples in estimated 6
hive_partitioned_agg/with_key_partitioned
                        time:   [12.356 ms 12.428 ms 12.505 ms]
                        change: [−1.6022% −0.8538% −0.0780%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking hive_partitioned_agg/without_key_partitioned: Collecting 100 samples in estimate
hive_partitioned_agg/without_key_partitioned
                        time:   [13.179 ms 13.278 ms 13.382 ms]
                        change: [−0.8465% +0.2090% +1.2419%] (p = 0.70 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)

These are not huge improvements as in memory hashing is pretty efficient but these are consistent gain (ran many times).

These improvements will be crucial for distributed datafusion as network shuffles are much less efficient than in memory repartitioning.

Are there any user-facing changes?

Yes, new configuration option: listing_table_preserve_partition_values
Changes query plans when activated

…= partition groups

gene-bordegaray · 2025-11-26T05:54:37Z

datafusion/datasource/src/file_scan_config.rs

+    pub preserve_partition_values: bool,
+    /// Cached result of key_partition_exprs computation to avoid repeated work
+    #[allow(clippy::type_complexity)]
+    key_partition_exprs_cache: OnceLock<Option<Vec<Arc<dyn PhysicalExpr>>>>,


Caches results of compute_key_partition_exprs() which is expensive:

loops through file groups and does hash set operations

called multiple times (output_partitioning() and eq_properties())

gene-bordegaray · 2025-11-26T05:58:13Z

datafusion/physical-optimizer/src/enforce_distribution.rs

                }
+                Distribution::KeyPartitioned(_) => {
+                    // Nothing to do: treated as satisfied upstream
+                }


No-op because we can guarantee that our data is correctly distributed

gene-bordegaray · 2025-11-26T06:04:34Z

datafusion/sqllogictest/test_files/agg_func_substitute.slt

-02)--AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[nth_value(multiple_ordered_table.c,Int64(1)) ORDER BY [multiple_ordered_table.c ASC NULLS LAST]], ordering_mode=Sorted
-03)----SortExec: expr=[a@0 ASC NULLS LAST], preserve_partitioning=[true]
-04)------CoalesceBatchesExec: target_batch_size=8192
-05)--------RepartitionExec: partitioning=Hash([a@0], 4), input_partitions=4


Eliminates this hash because it would break ordering guarantees

gene-bordegaray · 2025-11-26T06:10:34Z

cc: @NGA-TRAN @alamb this is updated solution with report on why I chose the approach I did

fmonjalet · 2025-11-27T16:47:35Z

Thanks a lot for the description and companion doc, they are super useful.

This work is super nice and is even crucial for distributed DataFusion. Reusing partitioning and avoiding repartitions can make a huge difference when the repartition is done on the network. The plans you posted as examples are exactly what we should be aiming for.

I think I am still missing part of the point of KeyPartitioned vs reusing Hash. I'll explain what I understand and you can correct me:

Anything KeyPartitioned is Hash partitioned (but the opposite is not true) ==> is this correct?
KeyPartitioned means each key is in a distinct partition ==> is this correct?
If the above is correct (if it's not, my reasoning does not hold and you can ignore the rest of this comment), I am not sure how this applies to high cardinality keys, for example date_bin(timestamp, 15m) or id hash ranges (say you have a million files, each one having a distinct range). I imagine we'd want to be able to group multiple "keys" into the same processing partition, to avoid having thousands of partitions. My understanding is that DataFusion partitions will add overhead if there are too many (subsequent repartitions, coalesce, merge sort), but I may be mistaken.
Once we group KeyPartitioned partitions together, they become Hash partitions. ==> is this correct?
So in practice, it appears to me that we'll almost always need to resort to Hash partitions.
What we'd loose compared to KeyPartition is the SortExec elision when aggregating then sorting by the partition key, but I'd argue that if you had one group per partition, then probably the sorting is cheap enough. ==> Do we lose something else?
(This point is not challenging the PR as a whole but just an implementation choice.)

So my current understanding is: KeyPartitioned is indeed different from Hash (a specific case carrying more information) but the ratio complexity / added value is not obvious. The reason we'd not take full advantage of KeyPartitioned may be that DF partitions are actually bound to processing units (~threads), and maybe there would be value in separating the notion of processing thread and the notion of data partition, where you could have N processing unit per partitions (with partial repartitions), or N partitions per thread. But this sounds like a completely different topic and I don't know how much it makes sense.

Sorry for the wall of text, I am mostly trying to wrap my head around this, please correct anything I missed in here.

gabotechs · 2025-11-28T08:10:59Z

datafusion/physical-expr/src/partitioning.rs

    /// Allocate rows based on a hash of one of more expressions and the specified number of
    /// partitions
    Hash(Vec<Arc<dyn PhysicalExpr>>, usize),
+    /// Partitions that are already organized by disjoint key values for the provided expressions.
+    /// Rows that have the same values for these expressions are guaranteed to be in the same partition.
+    KeyPartitioned(Vec<Arc<dyn PhysicalExpr>>, usize),


My impression over all is that KeyPartitioned should not be adding anything that is not already representable with Hash. I was planning on doing a longer reasoning on this, but @fmonjalet is right on point in his comment here #18919 (comment), so I'd just +1 his comment, grab some 🍿, and see what comes out of it.

gene-bordegaray · 2025-11-28T17:27:11Z

Anything KeyPartitioned is Hash partitioned (but the opposite is not true) ==> is this correct?

Yes, key partitioning guarantees that each distinct value of the key is fully contained within a single partition which is pretty much a stronger hash partitioning. Another thing to note is that key partitioning can only root from the file scan level as of now compared to hash which of course has a repartitioning operator.

KeyPartitioned means each key is in a distinct partition ==> is this correct?

Yes, I believe you have the right idea but to be sure, KeyPartitioned, in theory allows, multiple keys in a single partition as long as they are fully contained.

If the above is correct (if it's not, my reasoning does not hold and you can ignore the rest of this comment), I am not sure how this applies to high cardinality keys, for example date_bin(timestamp, 15m) or id hash ranges (say you have a million files, each one having a distinct range). I imagine we'd want to be able to group multiple "keys" into the same processing partition, to avoid having thousands of partitions. My understanding is that DataFusion partitions will add overhead if there are too many (subsequent repartitions, coalesce, merge sort), but I may be mistaken.

Yes, this is a noted limit to the original design. I added the comment: "best with moderate partition counts (10-100 partitions)." in the config. This is rooting from splitting distinct keys into their own partitions as of now. I did this to keep the first iteration relatively simple as the PR is large. In a follow up issue, some gerat work would be to merge group to target_partitions when size allows. This would still have keys fully contained within each partition but allow for higher cardinality.

Once we group KeyPartitioned partitions together, they become Hash partitions. ==> is this correct?

It depends what "group" means. If we simply merge key partitioned data into a single partition, no, this is still key partitioned as each key is still fully in one place. If we are repartitioning or shuffling data, we lose key partitioning and fallback to hash

So in practice, it appears to me that we'll almost always need to resort to Hash partitions.

For this first PR, yes, but I think this isn't a one PR fix all scenario. I think this comes down to how intentional the user is. Yes, key partitioned data is rarer than say hash, but it is powerful enough for people to consider it. The use cases will also increse as follow up issues are resolved: higher cardinality, propagation through joins, etc.

What we'd loose compared to KeyPartition is the SortExec elision when aggregating then sorting by the partition key, but I'd argue that if you had one group per partition, then probably the sorting is cheap enough. ==> Do we lose something else?
(This point is not challenging the PR as a whole but just an implementation choice.)
We would also lose rpartitioning between partial and final aggregations which is the main overhead we are trying to avoid:

BEFORE: DataSourceExec -> Aggregate Partial (gby: a) -> Repartition Hash(a) -> Aggregate Final (gby: a)
AFTER: DataSourceExec -> Aggregate FinalPartitioned (gby: a)

In some cases we also eliminate bottlenecks due to SPMs between aggregations:
BEFORE: DataSourceExec -> Aggregate Partial (gby: a) -> SPM -> Aggregate Final (gby: a) - single-threaded!
AFTER: DataSourceExec -> Aggregate FinalPartitioned (gby: a) -> SPM

So my current understanding is: KeyPartitioned is indeed different from Hash (a specific case carrying more information) but the ratio complexity / added value is not obvious. The reason we'd not take full advantage of KeyPartitioned may be that DF partitions are actually bound to processing units (~threads), and maybe there would be value in separating the notion of processing thread and the notion of data partition, where you could have N processing unit per partitions (with partial repartitions), or N partitions per thread. But this sounds like a completely different topic and I don't know how much it makes sense.

I am in favor of keeping Hash and KeyPartitioned separate as I see them as two distinct methods of partitioning. I also don't knowif adding more information into Hash partitioning will eliminate cimplexity and raher just cause more indirection. I do like the idea of merging file groups for higher cardinality as this was my main concern with this v1 (as noted in the comments) but chose to refrain due to complexity.

Sorry for the wall of text, I am mostly trying to wrap my head around this, please correct anything I missed in here.

Do not apologize, this is a lot of the internal debates I was / am having and am glad to talk about the trade offs. Let me know what you think 😄

CC: @gabotechs

gene-bordegaray · 2025-11-29T17:24:36Z

I suggest, that this PR remain the limited scope, not meant for high cardinality queries. This was my motivation for having this option set to false by default. Then submit follow up issues to address grouping files into partitions to help with higher cardinality. I just do not want to introduce to many things in this PR and adding this seems like another substantial PR in itself.

gabotechs · 2025-12-01T08:35:20Z

Then submit follow up issues to address grouping files into partitions to help with higher cardinality. I just do not want to introduce to many things in this PR and adding this seems like another substantial PR in itself.

I think @fmonjalet's suggestion (and mine) is to not introduce Partitioning::KeyPartitioned at all and just enhance the existing Partitioning::Hash with smarter capabilities, so it's more a matter of reducing the amount of things this PR already has rather than adding on top.

I think avoiding extra repartitions by reusing an existing hash partitioning in an operator that requires some other hash partitioning that just partially matches the incoming one should be achievable without introducing new partitioning methods.

Also note that ideally the amount of partitions in a data stream should not be given by the nature of the data, but by the amount of CPUs a machine has, that's what allows us to optimize for resource usage regardless of how the data happened to be laid out.

fmonjalet · 2025-12-01T12:59:28Z

Thanks a lot for the explanations @gene-bordegaray, I think I actually start to understand 💡

When the partitioning is by one field, KeyPartitioned and Hash are the same (correct me if wrong)
But what I understand is that the difference starts when you have two fields: KeyPartitioned is hierarchical.
Taking a "practical" distribution: say you have an "order" data set that tracks orders of customers to providers. Files are partitioned by (customer_id, provider_id), conceptually organized as follows:

customer_id_hash_0/
    provider_id_hash_1.parquet
    provider_id_hash_2.parquet
    provider_id_hash_3.parquet
    provider_id_hash_4.parquet
customer_id_hash_1/
    provider_id_hash_1.parquet
    provider_id_hash_2.parquet
    provider_id_hash_3.parquet
    provider_id_hash_4.parquet
[... etc ...]

Technically this layout satisfies all the following partitioning:

KeyPartitioned(hash(customer_id), hash(provider_id))
KeyPartitioned(customer_id, provider_id) (simpler expression of the above)
Hash(customer_id, provider_id)
Hash(customer_id)

Now you want to compute:

SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id

If the data partitioning is declared as Hash(customer_id, provider_id), then you'd have to insert a repartition, because according to the partitioning, (customer0, provider0) and (customer0, provider1) may be in different partitions.
If the data partitioning is KeyPartitioned(customer_id, provider_id), you can reuse the existing partitioning: (customer0, provider0) and (customer0, provider1) are in the same partition, this is what KeyPartitioned guarantees.
Hash(customer_id) also works, but we may lack the mechanism to ask the data source to say it satisfies this partitioning, and we lose information that can be useful.

The following query (a bit artificial ,sorry):

WITH max_order_per_provider AS (
  SELECT customer_id, provider_id, MAX(amount) AS max_amount FROM orders GROUP BY customer_id, provider_id
)
SELECT customer_id, MIN(max_amount) as min_max FROM max_order_per_provider GROUP BY customer_id

Can work with 0 repartitions with KeyPartitioned(customer_id, provider_id) (if the partitioning is propagated properly through the plan)
Hash(customer_id) could also work. Knowing the data source is KeyPartitioned mostly gives us the information that allows to say it satisfies Hash(customer_id).

From there, I see KeyPartitioned as a device to avoid propagating partitionings from the top.
In the latest example, partitioning by customer_id along the entire subplan would be ideal, but I don't think we have a mechanism to propagate this information in the plan. We currently cannot ask the leaf node "could you provide Hash(customer_id) instead of Hash(customer_id, provider_id)?".

I am now wondering about whether we should have KeyPartitioned, or a mechanism to propagate ideal partitioning down the stack (e.g. Hash(a, b) can be changed to Hash(a) to avoid a repartition later on).
Gene, does this capture your thoughts? Do you see cases where KeyPartitioned adds value in between execution nodes that is not captured here?

NGA-TRAN · 2025-12-01T13:19:49Z

I will review this PR this week, too

gene-bordegaray · 2025-12-01T18:28:57Z

hey @fmonjalet thank you for the thoughtful response. here are some of my thoughts following this. Let me know what you think.

When the partitioning is by one field, KeyPartitioned and Hash are the same (correct me if wrong)

This is correct but with a caveat. Say data is partitioned on column "a", this is not necessarily the same for hash vs key partitioned. Imagine we hash partition with our hash function: "Hash(a) = a % 3" this would put the values 1, 4, 7 in the same partition. Now say we partition the same data but use key partitioning. Now, 1, 4, 7 will initially be put into separate groups and thus different partitions, then with the follow up work to merge groups to improve high cardinality (as described in above comments) these could merge into the same or different partitions.

With the key partitioned approach it is important to note that we can initially separate by the key values then merge them based on size to improve cardinality without breaking the key partitioning. If we tried to merge hash or reshuffle based on size we would not truly be hash partitioned by our hash function. I do not know what the implications of this is throughout Datafusion but nevertheless this is not true hash partitioning.

Another important difference between hash and key partition is there place of origin. Key partitioning can only originate from the data scan itself, making it explicit and a guarantee. Hash partitioning has an associated repartitioning operator which can be introduced any where in the plan, this makes hash an implicit guarantee. This is just another difference I wanted to point out about the partition styles.

But what I understand is that the difference starts when you have two fields: KeyPartitioned is hierarchical. Taking a "practical" distribution: say you have an "order" data set that tracks orders of customers to providers. Files are partitioned by (customer_id, provider_id), conceptually organized as follows:
customer_id_hash_0/
    provider_id_hash_1.parquet
    provider_id_hash_2.parquet
    provider_id_hash_3.parquet
    provider_id_hash_4.parquet
customer_id_hash_1/
    provider_id_hash_1.parquet
    provider_id_hash_2.parquet
    provider_id_hash_3.parquet
    provider_id_hash_4.parquet
[... etc ...]
Technically this layout satisfies all the following partitioning:

KeyPartitioned(hash(customer_id), hash(provider_id))

KeyPartitioned(customer_id, provider_id) (simpler expression of the above)

Hash(customer_id, provider_id)

Hash(customer_id)

Just a comment to ensure we are on the same page. This layout has the capability to achieve all of these partitioning types. If you key or hash partition on customer_id then you are guaranteeing properties about the location of your data "a single partitions will contains all rows for a distinct customer_id". The difference comes in how this is actually achieved and what this allows us to do. Key partitioning explicitly states how our data is partitioned and doe snot use a hash function, allowing us to do things like merge partitions without breaking guarantees. Hash is of course different as described above.

Now you want to compute:
SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id
If the data partitioning is declared as Hash(customer_id, provider_id), then you'd have to insert a repartition, because according to the partitioning, (customer0, provider0) and (customer0, provider1) may be in different partitions.

Yes, spot on.

If the data partitioning is KeyPartitioned(customer_id, provider_id), you can reuse the existing partitioning: (customer0, provider0) and (customer0, provider1) are in the same partition, this is what KeyPartitioned guarantees.

This is not true for the current implementation, but can be true as an option in follow up work. I noted that yes if our data is set up in a hierarchical format we could implicitly key partition by a superset of the data if it benefitted some parallelization, but decided to not implement this in the first PR as it would require some heuristic or additional user option (I don't know if this is too many knobs for the user to be turning). Due to the complexity and ambiguity in the implementation I decided against it.

This is a good thing to point out though, that yes this would be another differentiating factor between hash and key partitioned.

Hash(customer_id) also works, but we may lack the mechanism to ask the data source to say it satisfies this partitioning, and we lose information that can be useful.

This goes hand-in-hand with the last statement I made. Say your data is organized to actually be partitioned hierarchically by customer_id, provider_id, with the same query of your example

Using hash partitioning say you declare the data partitioned by Hash(customer_id, provider_id). The optimizer will think when it needs to do an aggregation on the group by clause customer_id, we are hash partitioned on customer_id, provider_id with no knowledge about the fact that the underlying data that is actually also partitioned by just customer_id, thus inserting a repartition.

Now, using key partitioning on the same columns: KeyPartitioned(customer_id, provider_id), and using some heuristic or options that determines if it is worth it to repartition by a superset of the passed in partitioning, the optimizer can recognize that with hierarchical organization, this means that all rows with the same customer_id must also be in the same partition. Thus a repartition is avoided.

The following query (a bit artificial ,sorry):
WITH max_order_per_provider AS (
  SELECT customer_id, provider_id, MAX(amount) AS max_amount FROM orders GROUP BY customer_id, provider_id
)
SELECT customer_id, MIN(max_amount) as min_max FROM max_order_per_provider GROUP BY customer_id
Can work with 0 repartitions with KeyPartitioned(customer_id, provider_id) (if the partitioning is propagated properly through the plan)

Hash(customer_id) could also worl. Knowing the data source is KeyPartitioned mostly gives us the information that allows to say it satisfies Hash(customer_id).

Yes, this is correct but taking into account my past statements about implementing the heuristic or option to partition by key partition super set when beneficial.

From there, I see KeyPartitioned as a device to avoid propagating partitionings from the top. In the latest example, partitioning by customer_id along the entire subplan would be ideal, but I don't think we have a mechanism to propagate this information in the plan. We currently cannot ask the leaf node "could you provide Hash(customer_id) instead of Hash(customer_id, provider_id).

Yes this is the main idea, I am just seeing KeyPartitioned as the mechanism to do this behavior. The reason so is because although it holds similar properties to hash partitioning, they are based on fundamentally different concepts. My two biggest strifes with this are:

KeyPartitioned originates solely from the file scan while hash partitioning can be introduced anywhere in the plan. This holds a deeper meaning in the plan that data was explicitly partitioned in some hierarchical fashion and I could see this coming into use for future optimizations.
Although key and hash partitioning guarantee similar things in regards to data location, they achieve this in different ways, hash function vs file level partitions.

I am now wondering about whether we should have KeyPartitioned, or a mechanism to propagate ideal partitioning down the stack (e.g. Hash(a, b) can be changed to Hash(a) to avoid a repartition later on). Gene, does this capture your thoughts? Do you see cases where KeyPartitioned adds value in between execution nodes that is not captured here?

This approach could work but seems like we may be trying to stretch the functionality of hash partitioning too far, turning into something it was not designed to do. I do think that think that an implementation of checking to see if a plan would benefit from hash repartitioning on a superset would be able to achieve similar results but I think that having a different type of partitioning would be a clearer implementation. A rule like this would get pretty tricky as you would have to take into account all partitioning requirements throughout the plan when determining if you would rather partition by a superset hash. With the key partitioning approach the fact that you data is partitioned is this way is apparent from the data source and operators can more naturally decide what to do with this property.

With this said I can see an optimization rule like this being beneficial regardless of if we decide to move forward with the key partitioned of hash partitioned approach. In another scenario say that earlier in the plan we are forces to repartition our data by (customer_id, provider_id) then further down the line we have an operator that is requiring us to be partitioned by just customer_id. Rather than inserting a repartition here we could have this rule that you suggest which changes the original repartition to customer_id which then implicitly satisfies the (customer_id, provider_id) requirement then is maintained through the operators to satisfy the customer_id requirement without the additional shuffle.

also ccing @gabotechs in these to keep everyone in the loop. :)

gene-bordegaray · 2025-12-01T20:50:59Z

I am making a note here that is follow up work to this PR (listed in order of priority):

propagate partitioning through operators: joins and window functions
if we move forward with key partitioned approach, group key partitioned file groups into same partition based on size to improve cardinality
in either case, introduce some rule / heuristic that recognizes when it is valuable to repartiton / partition by a superset of a partition expression to avoid repartitons.

These issues are dependent on this PR so I will hold off on making these for now. Chime in if I missed anything or if some of this is seen as not needed.

NGA-TRAN · 2025-12-02T13:20:12Z

I am actively reviewing this PR

fmonjalet · 2025-12-02T14:04:27Z

Thanks again for the thorough response 🙇

I do think that think that an implementation of checking to see if a plan would benefit from hash repartitioning on a superset would be able to achieve similar results but I think that having a different type of partitioning would be a clearer implementation.

I think this is a good summary, and it will be for maintainers to give a more formal answer on what they prefer seeing maintained. Both solutions have different trade offs, I agree that the alternative to KeyPartitioned is tricky logic that may add its own dose of complexity.

In any case, in the end I think the plan simplifications you are achieving are worth the effort either way! 🚀

fmonjalet · 2025-12-02T14:09:57Z

datafusion/physical-expr/src/partitioning.rs

+                // Here we do not check the partition count for hash partitioning and assumes the partition count
+                // and hash functions in the system are the same. In future if we plan to support storage partition-wise joins,
+                // then we need to have the partition count and hash functions validation.


Looking at this comment I am also thinking KeyPartitioned addresses this concern, in the sense that KeyPartitioned(my_hash_func(key)) expresses that the set is partitioned according to a precise hash function (same for range partitioning). It removes the "unknown hash" from the equation.

So for a join, if the two sides are partitioned by KeyPartitioned(f(key)) with the same f, then we can use a partitioned HashJoin without repartition.

gabotechs · 2025-12-02T17:48:46Z

My main two concerns with this approach are:

It seems to be adding a fair level of complexity that could potentially be expressed by enhancing the existing Partitioning::Hash mechanism. Maybe a reference to how other systems have decided to implement (Trino? Spark?) could serve as a reference for other maintainers to better judge the right approach.
Partitioning::PartitionKey is not going to respect the execution.target_partitions set by DataFusion, creating plans with an unbounded amount of partitions. This means that DataFusion would tokio::spawn as many tasks as number of distinct keys there are, which could be orders of magnitude greater than the intended execution.target_partitions.

I think more experience people in DataFusion should chime in and give their opinion, so in the mean time, one thing that comes to mind, is that we can ship first a benchmark or test that would benefit from this so that we can potentially compare different approaches. WDYT?

In the mean time, some people that come to mind whose opinion could be useful is @crepererum as a core contributor to the repartitioning mechanism, and @adriangb as potential interested in a feature like this.

NGA-TRAN · 2025-12-02T20:26:17Z

Thanks @fmonjalet and @gabotechs for the great comments. Gene and I chatted, and he is currently working on:

Benchmarking with three representative queries. He has already observed several‑times faster performance and linear improvements as the data scales.
Proposing different solutions that outline the pros and cons of introducing the new different kinds of enum.

He’ll be posting the details here soon.

gene-bordegaray · 2025-12-02T20:38:32Z

I will open a new PR with a modified solution after discussing with @gabotechs @fmonjalet and @NGA-TRAN in person.

added parallelization on key partitioned data (opt in only)

5af5522

gene-bordegaray mentioned this pull request Nov 24, 2025

Enable Parallel Aggregation for Non-Overlapping Partitioned Data #18826

Closed

ran clippy, fixed typo, and added unit tests

00bb260

github-actions bot added the documentation Improvements or additions to documentation label Nov 25, 2025

gene-bordegaray added 2 commits November 24, 2025 18:12

ran clippy, fixed typo, and added unit tests

c97404e

simplified logic, enforced repartition rules when target partitions !…

8398852

…= partition groups

gene-bordegaray commented Nov 26, 2025

View reviewed changes

gene-bordegaray marked this pull request as ready for review November 26, 2025 06:09

eliminated redundant name-based checking

25abc29

gabotechs reviewed Nov 28, 2025

View reviewed changes

fmonjalet reviewed Dec 2, 2025

View reviewed changes

gene-bordegaray closed this Dec 2, 2025

gene-bordegaray deleted the gene.bordegaray/2025/11/parallelize_key_partitioned_aggregation branch December 2, 2025 20:34

feat: added parallelization on key partitioned data #18919

feat: added parallelization on key partitioned data #18919

Conversation

gene-bordegaray commented Nov 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Full Report

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Benchmarking

Are there any user-facing changes?

Uh oh!

gene-bordegaray Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gene-bordegaray Nov 26, 2025

Choose a reason for hiding this comment

Uh oh!

gene-bordegaray Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gene-bordegaray commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fmonjalet commented Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gabotechs Nov 28, 2025

Choose a reason for hiding this comment

Uh oh!

gene-bordegaray commented Nov 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gene-bordegaray commented Nov 29, 2025

Uh oh!

gabotechs commented Dec 1, 2025

Uh oh!

fmonjalet commented Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NGA-TRAN commented Dec 1, 2025

Uh oh!

gene-bordegaray commented Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gene-bordegaray commented Dec 1, 2025

Uh oh!

NGA-TRAN commented Dec 2, 2025

Uh oh!

fmonjalet commented Dec 2, 2025

Uh oh!

fmonjalet Dec 2, 2025

Choose a reason for hiding this comment

Uh oh!

gabotechs commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NGA-TRAN commented Dec 2, 2025

Uh oh!

gene-bordegaray commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

gene-bordegaray commented Nov 24, 2025 •

edited

Loading

gene-bordegaray Nov 26, 2025 •

edited

Loading

gene-bordegaray Nov 26, 2025 •

edited

Loading

gene-bordegaray commented Nov 26, 2025 •

edited

Loading

fmonjalet commented Nov 27, 2025 •

edited

Loading

gene-bordegaray commented Nov 28, 2025 •

edited

Loading

fmonjalet commented Dec 1, 2025 •

edited

Loading

gene-bordegaray commented Dec 1, 2025 •

edited

Loading

gabotechs commented Dec 2, 2025 •

edited

Loading

gene-bordegaray commented Dec 2, 2025 •

edited

Loading