Contributor

@Tpt Tpt commented Oct 23, 2025

Rely on aggregate GroupValues abstraction to build a hash table of the emitted rows that is used to deduplicate

We might make things a bit more efficient by rewriting a hash table wrapper just for deduplication, but this implementation should give a fair baseline

Which issue does this PR close?

Rationale for this change

Implements deduplicating recursive CTEs (i.e. UNION rather than UNION ALL inside of WITH RECURSIVE) using a hash table. I reuse the one from aggregates to avoid rebuilding a full wrapper and per-type specializations. Each time the static or the recursive term of the CTE returns a batch, the hash table is used to remove already-seen rows; the remaining rows are then emitted and kept in memory for the next recursion step.
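To illustrate the idea described above, here is a minimal, dependency-free sketch. It is not the code from this PR: it uses a plain `HashSet` over `Vec<i64>` rows instead of the `GroupValues` hash table and Arrow record batches, and the names `SeenRows`, `retain_new`, and `recursive_union_distinct` are hypothetical.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for a row of a record batch.
type Row = Vec<i64>;

// Remembers every row emitted so far (assumption: a HashSet stands in
// for the GroupValues-based hash table used by the PR).
#[derive(Default)]
struct SeenRows {
    seen: HashSet<Row>,
}

impl SeenRows {
    // Keeps only rows that were never seen before, recording them as seen.
    // `HashSet::insert` returns true only for values not already present,
    // so duplicates inside the same batch are dropped as well.
    fn retain_new(&mut self, batch: Vec<Row>) -> Vec<Row> {
        batch
            .into_iter()
            .filter(|row| self.seen.insert(row.clone()))
            .collect()
    }
}

// Evaluates `static_term UNION recursive_term` until no new row appears.
fn recursive_union_distinct(
    static_term: Vec<Row>,
    recursive_term: impl Fn(&[Row]) -> Vec<Row>,
) -> Vec<Row> {
    let mut seen = SeenRows::default();
    let mut output = Vec::new();
    // Deduplicated rows of the static term seed the work table.
    let mut work_table = seen.retain_new(static_term);
    while !work_table.is_empty() {
        output.extend(work_table.iter().cloned());
        // Only rows never emitted before feed the next recursion step.
        let produced = recursive_term(&work_table);
        work_table = seen.retain_new(produced);
    }
    output
}
```

With a recursive term that walks a cyclic graph (for example `1 -> 2 -> 3 -> 1`), the loop stops once the cycle closes, because the row for node `1` has already been seen and the work table becomes empty. Without the filter, the same input would recurse forever.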

What changes are included in this PR?

Reusing GroupValues trait implementations inside of RecursiveQueryExec to get deduplication working.

Are these changes tested?

Yes, some sqllogictests have been added, including ones that would lead to infinite recursion if deduplication were disabled.

Are there any user-facing changes?

No

@github-actions github-actions bot added the labels logical-expr (Logical plan and expressions), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), and physical-plan (Changes to the physical-plan crate) on Oct 23, 2025
@Tpt Tpt changed the title from "Deduplicating recursive CTE implementation" to "feat: Deduplicating recursive CTE implementation" on Oct 23, 2025
Contributor

@tobixdev tobixdev left a comment

From my perspective this is a very nice and concise solution to the problem.

Furthermore, as far as I understand, this should also correctly terminate the recursion: only unique rows are pushed into the WorkTable, so at some point (as can be seen in the closure example) this will reach a fixed point.

What I am also thinking about is test coverage. My gut feeling says there should be some test cases in the SQLite test suite that cover distinct recursion. Would this cause the extended test suite to fail? Ideally, this solution passes all these test cases now! 🥳 However, I am a bit unsure how this is set up currently.

Thank you!

CAVEAT: I am by no means a DataFusion (nor recursive query) expert, so take my comments with a grain of salt.

}

/// Return a mask, each element true if the value is greater than all previous ones and greater or equal than the min_value
fn are_increasing_mask(values: &[usize], mut min_value: usize) -> BooleanArray {
Contributor

I think I understood what this function does, but I had a hard time with min_value. Maybe we can be more explicit here. Just some suggestions:

input parameter: min_value -> highest_group_id

// Always update the min_value to do de-duplication within a record batch.
let mut min_value = highest_group_id;

Maybe integrating the comment into the doc comment for are_increasing_mask is also more than enough.

I think this assumes that the group ids are assigned in-order within the record batch but I think this is a valid assumption. Maybe someone more familiar with the aggregation infrastructure has more information on that.

Contributor Author

> I think this assumes that the group ids are assigned in-order within the record batch

Yes, this is part of the GroupValues trait documentation.

I have rephrased the doc comment. I hope it's clearer now.

I have not renamed min_value to highest_group_id: the function does not depend on any specific semantics beyond creating the mask from its inputs. But I am happy to do the rename if you feel strongly about it.

Contributor

That's perfectly fine. Just a suggestion 👍

Contributor

I also found this confusing. Some suggestions:

  1. Rename the function to new_groups_mask to reflect what it does
  2. Rename min_value to max_seen_group_id or max_emitted or something like that.

Contributor Author

Done! 0af5648
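
For readers following this thread, the mask discussed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: it returns a `Vec<bool>` instead of an Arrow `BooleanArray` to stay dependency-free, it borrows the name `new_groups_mask` from the suggestion above, and it keeps the parameter name `min_value` from the quoted signature. It relies on the property mentioned earlier in the thread that group ids are assigned in order within a batch.

```rust
// Sketch only: marks the rows whose group id was newly created by the
// current batch. `min_value` is assumed to be the first group id that did
// not exist before this batch was interned into the hash table.
fn new_groups_mask(group_ids: &[usize], mut min_value: usize) -> Vec<bool> {
    group_ids
        .iter()
        .map(|&id| {
            if id >= min_value {
                // First occurrence of a new group: keep the row and raise
                // the threshold so later repeats of this id are rejected.
                min_value = id + 1;
                true
            } else {
                // Either a group from an earlier batch or a repeat of a
                // group already met in this batch.
                false
            }
        })
        .collect()
}
```

For example, with group ids `[0, 2, 2, 3, 1, 3]` and `min_value = 2`, only the first occurrences of groups 2 and 3 are kept, so the mask is `[false, true, false, true, false, false]`. Raising the threshold on every accepted row is what deduplicates within a single record batch.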

Contributor

alamb commented Nov 7, 2025

Sorry -- this PR hasn't been on my radar. I will put it on my review list and try to get to it in the next few days.

@Tpt Tpt force-pushed the tpt/distinct-cte-hash branch from 6cc4434 to 48e8e33 on November 24, 2025 20:36
@alamb alamb changed the title from "feat: Deduplicating recursive CTE implementation" to "feat: Support recursive queries with a distinct 'UNION'" on Nov 25, 2025
Contributor

@alamb alamb left a comment

Thank you @Tpt and @tobixdev -- I looked through this carefully and it makes sense to me. I left some small suggestions but I don't think they are required

Contributor Author

Tpt commented Nov 25, 2025

@alamb Thank you! Suggestions applied

@alamb alamb added this pull request to the merge queue Nov 25, 2025
Contributor

alamb commented Nov 25, 2025

Thanks again @Tpt

Merged via the queue into apache:main with commit 3ba7350 Nov 25, 2025
31 checks passed
@Tpt Tpt deleted the tpt/distinct-cte-hash branch November 25, 2025 19:03
Contributor Author

Tpt commented Nov 25, 2025

Thank you!

Contributor

alamb commented Nov 25, 2025

Thank you!

Thank you for your patience

Labels

core (Core DataFusion crate), logical-expr (Logical plan and expressions), physical-plan (Changes to the physical-plan crate), sqllogictest (SQL Logic Tests (.slt))

Development

Successfully merging this pull request may close these issues.

Support deduplicating UNION in recursive CTE

4 participants