fix: Unconditionally wrap UNION BY NAME input nodes w/ `Projection` #15242

rkrishn7 · 2025-03-15T05:11:54Z

Which issue does this PR close?

Closes union by name doesn't seem to be working correctly #15236

Rationale for this change

An assumption made by a predicate while re-writing union inputs is incorrect.

Even if an input node's schema has the same width as the final schema, we still want to wrap the input with a projection as ordering is this doesn't guarantee ordering.

Are these changes tested?

Yes

Are there any user-facing changes?

N/A

… union inputs

rkrishn7 · 2025-03-15T05:13:13Z

Was trying to add all the queries from the issue, but ran into problems with normalize::convert_batches during the SLTs 🤔

Omega359 · 2025-03-15T15:15:03Z

Was trying to add all the queries from the issue, but ran into problems with normalize::convert_batches during the SLTs 🤔

Which of the queries had issues?

Omega359 · 2025-03-15T15:55:33Z

I had issues with github this morning so here is a patch to get tests running:

Index: datafusion/sql/tests/sql_integration.rs
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/datafusion/sql/tests/sql_integration.rs b/datafusion/sql/tests/sql_integration.rs
--- a/datafusion/sql/tests/sql_integration.rs	(revision 5d9cca60d65e5f988c3c383c781de6903d190bf1)
+++ b/datafusion/sql/tests/sql_integration.rs	(revision 27d0f85595804c7163f8adad90af6ff20f96fd4f)
@@ -1901,8 +1901,9 @@
         \n    Projection: NULL AS Int64(1), order_id\
         \n      Projection: orders.order_id\
         \n        TableScan: orders\
-        \n    Projection: orders.order_id, Int64(1)\
-        \n      TableScan: orders";
+        \n    Projection: Int64(1), order_id\
+        \n      Projection: orders.order_id, Int64(1)\
+        \n        TableScan: orders";
     quick_test(sql, expected);
 }
 
@@ -1939,19 +1940,23 @@
         \n  Projection: NULL AS Int64(1), order_id\
         \n    Projection: orders.order_id\
         \n      TableScan: orders\
-        \n  Projection: orders.order_id, Int64(1)\
-        \n    TableScan: orders";
+        \n  Projection: Int64(1), order_id\
+        \n    Projection: orders.order_id, Int64(1)\
+        \n      TableScan: orders";
     quick_test(sql, expected);
 }
 
 #[test]
 fn union_all_by_name_same_column_names() {
     let sql = "SELECT order_id from orders UNION ALL BY NAME SELECT order_id FROM orders";
-    let expected = "Union\
-            \n  Projection: orders.order_id\
-            \n    TableScan: orders\
-            \n  Projection: orders.order_id\
-            \n    TableScan: orders";
+    let expected = "\
+        Union\
+        \n  Projection: order_id\
+        \n    Projection: orders.order_id\
+        \n      TableScan: orders\
+        \n  Projection: order_id\
+        \n    Projection: orders.order_id\
+        \n      TableScan: orders";
     quick_test(sql, expected);
 }
 
Index: datafusion/sqllogictest/test_files/union_by_name.slt
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/datafusion/sqllogictest/test_files/union_by_name.slt b/datafusion/sqllogictest/test_files/union_by_name.slt
--- a/datafusion/sqllogictest/test_files/union_by_name.slt	(revision 5d9cca60d65e5f988c3c383c781de6903d190bf1)
+++ b/datafusion/sqllogictest/test_files/union_by_name.slt	(revision 27d0f85595804c7163f8adad90af6ff20f96fd4f)
@@ -315,3 +315,21 @@
 ----
 a b c NULL
 a b c d
+
+query TTT rowsort
+select x, y, z from t3 union all by name select z, y, x from t3;
+----
+a b c
+a b c
+
+query TTT rowsort
+select x, y, z from t3 union all by name select z, y, x from t4;
+----
+a b c
+a b c
+
+query TTT rowsort
+select x, y, z from t3 union all by name select z, y, x from t4 order by x;
+----
+a b c
+a b c

alamb · 2025-03-17T16:11:31Z

@rkrishn7 will you have time to apply @Omega359 's suggestion?

rkrishn7 · 2025-03-17T17:55:37Z

Thanks @Omega359!

@alamb Yes sorry for the delay, I can fix this up later today.

To expand on my comment here, this was regarding normalize::convert_batches reporting different record batch schemas for UNION ALL BY NAME queries where one side has a literal, e.g.:

select x, y, z from t3 union all by name select z, y, x, 'zz' as zz from t4 order by x;

Though, this was just occurring when collecting the results in the SLT suite. The actual result was correct (verified in the CLI).

But I'll fix up the integration tests after work so all the checks pass and this is mergeable. Then maybe I can look into the above issue separately if that sounds good.

Omega359 · 2025-03-18T13:09:46Z

datafusion/sqllogictest/test_files/union_by_name.slt

+        Field {
+            name: "zz",
+            data_type: Utf8,
+            nullable: true,


Interesting that the nullable changed from false to true here.

Omega359 · 2025-03-18T14:07:52Z

But I'll fix up the integration tests after work so all the checks pass and this is mergeable. Then maybe I can look into the above issue separately if that sounds good.

That sounds good! I took a look at the convert_batches and I cannot see offhand how it is doing anything problematic. I suspect the issue is elsewhere and that function is just making it visible.

~~Why a field would change from not nullable to nullable for one (or more) batches in a query after the first one is highly suspicious to me.~~

I have a thought as to why it's happening. When you fill a column with null when that input is missing the column you are not preserving the other aspects of the column, specifically nullable. Thus, different batches (for the different inputs) can have differing nullable values.

Input 1: 'zz' column missing, is added and set to null (thus nullable = true)
Input 2: 'zz' column exists, is literal and nullable = false

In this case I think we would have to actually change the column nullability for the input that has the column (from non-nullable to nullable) or to wrap another projection that accomplishes the same.

Omega359 · 2025-03-21T11:41:48Z

@rkrishn7 - what do you think of my above comment for the cause of the issue? I personally think it should be fixed before this PR is approved, not after.

alamb · 2025-03-21T19:16:34Z

Once @Omega359 is happy with this PR I will be too -- just let me know

… inputs

rkrishn7 · 2025-03-24T00:00:00Z

@Omega359 Thanks for your thoughts!

And yes, I agree that the problem isn't with normalize::convert_batches. It is just being surfaced there.

I have a thought as to why it's happening. When you fill a column with null when that input is missing the column you are not preserving the other aspects of the column, specifically nullable. Thus, different batches (for the different inputs) can have differing nullable values.

Input 1: 'zz' column missing, is added and set to null (thus nullable = true)
Input 2: 'zz' column exists, is literal and nullable = false

In this case I think we would have to actually change the column nullability for the input that has the column (from non-nullable to nullable) or to wrap another projection that accomplishes the same.

Yup, agreed! If one input contains a non-nullable column and the other does not, then the final schema will have the field's nullability set to false when it really should be true.

I've updated the logic there to account for if the number of occurrences of the field across all inputs is less than the number of inputs (i.e. the field is missing from one or more inputs). In this case, it must be treated as nullable.

Unfortunately I don't think the problem is solved there. Upon further investigation, it seems like another problem exists with there being information loss between the logical and physical schemas. Specifically, when constructing the physical expression for a literal, the nullability is not determined by the already known schema. It is simply based on whether or not the literal is null.

This can be observed by updating the implementation of PhysicalExpr::is_nullable for Literal to return Ok(true). After that, the test suite passes successfully.

I think this should be fixed. Specifically, the nullability of Literal should be derived from its surrounding context. Rather than defaulting to self.value.is_null(). But, I also think it's fair to say this is another issue outside the scope of this PR.

Omega359 · 2025-03-24T13:49:32Z

datafusion/sql/tests/sql_integration.rs

        \n        TableScan: orders\
-        \n    Projection: orders.order_id, Int64(1)\
-        \n      TableScan: orders";
+        \n    Projection: Int64(1), order_id\


I'm trying to figure out where this Projection (Projection: Int64(1), order_id) is coming from and why it's ordered this way. I would expect it to be ordered as order_id, Int64 as that is the order on the rhs.

As well, this test has an expected distinct clause ... the query doesn't. ~~I can't see a reason why that would be introduced.~~ UNION vs UNION ALL. Never mind, it's a Monday :)

The Projection is coming from the change in this PR that unconditionally wraps input nodes with a Projection. The ordering is like this because of the ordering of the schema columns we use to calculate the expression list for the wrapped projection. They are essentially in alphabetical order.

But thanks for catching this. I agree, the ordering seems incorrect here. I think the ordering should be driven by the ordering on the lhs (this seems to also be what duckdb does)

This was actually producing incorrect results as well in cases where one of the inputs is not wrapped. Now it's just the ordering that is the issue.

I'll be able to push up a fix here after work today.

@Omega359 This should be fixed now.

Omega359 · 2025-03-24T15:03:40Z

Unfortunately I don't think the problem is solved there. Upon further investigation, it seems like another problem exists with there being information loss between the logical and physical schemas. Specifically, when constructing the physical expression for a literal, the nullability is not determined by the already known schema. It is simply based on whether or not the literal is null.

This can be observed by updating the implementation of PhysicalExpr::is_nullable for Literal to return Ok(true). After that, the test suite passes successfully.

I think this should be fixed. Specifically, the nullability of Literal should be derived from its surrounding context. Rather than defaulting to self.value.is_null(). But, I also think it's fair to say this is another issue outside the scope of this PR.

Confirmed. If you don't mind could you file a followup issue for this and update the comment in the union_by_name.slt to note that issue. I think this fix is complete otherwise.

rkrishn7 · 2025-03-24T16:32:50Z

Unfortunately I don't think the problem is solved there. Upon further investigation, it seems like another problem exists with there being information loss between the logical and physical schemas. Specifically, when constructing the physical expression for a literal, the nullability is not determined by the already known schema. It is simply based on whether or not the literal is null.
This can be observed by updating the implementation of PhysicalExpr::is_nullable for Literal to return Ok(true). After that, the test suite passes successfully.
I think this should be fixed. Specifically, the nullability of Literal should be derived from its surrounding context. Rather than defaulting to self.value.is_null(). But, I also think it's fair to say this is another issue outside the scope of this PR.

Confirmed. If you don't mind could you file a followup issue for this and update the comment in the union_by_name.slt to note that issue. I think this fix is complete otherwise.

@Omega359 Done! Issue here

rkrishn7 · 2025-03-25T03:16:33Z

datafusion/sqllogictest/test_files/union_by_name.slt


 query III
-SELECT 1 UNION BY NAME SELECT * FROM unnest(range(2, 100)) UNION BY NAME SELECT 999 ORDER BY 3, 1 LIMIT 5;
+SELECT 1 UNION BY NAME SELECT * FROM unnest(range(2, 100)) UNION BY NAME SELECT 999 ORDER BY 3, 1, 2 LIMIT 5;


Previously the unnest column was incorrectly at position 3. Now it is correctly at position 2 so the result here has changed.

The extra ordering is to order the unnest values for a deterministic result.

Also verified result is consistent with duckdb's.

rkrishn7 · 2025-03-25T03:16:59Z

datafusion/sqllogictest/test_files/union_by_name.slt

-NULL NULL 4
-NULL NULL 5
-NULL NULL 6
+NULL NULL 999


Same here - result set is changing due to column reordering. Also verified result is consistent with duckdb's.

Omega359 · 2025-03-25T19:53:07Z

This looks good to me, thanks!

I've started poking at the Literal/nullable issue a bit. No solution yet though.

rkrishn7 · 2025-03-25T22:10:46Z

This looks good to me, thanks!

I've started poking at the Literal/nullable issue a bit. No solution yet though.

Thanks for the thorough review @Omega359. I feel a lot better about this now.

cc @alamb - I think we should be gtg here

alamb

LGTM -- thank you @rkrishn7 and @Omega359

alamb · 2025-03-26T21:06:49Z

datafusion/expr/src/logical_plan/plan.rs

        let schema_width = schema.iter().count();
        let mut wrapped_inputs = Vec::with_capacity(inputs.len());
        for input in inputs {
-            // If the input plan's schema contains the same number of fields


This does indeed seem to be an overzealous optimization

rkrishn7 · 2025-03-27T18:20:06Z

Thanks @alamb - is this ready to get merged?

alamb · 2025-03-27T19:06:14Z

Thanks again @rkrishn7 and @Omega359

…pache#15242) * fix: Remove incorrect predicate to skip input wrapping when rewriting union inputs * chore: Add/update tests * fix: SQL integration tests * test: Add union all by name SLT tests * test: Add problematic union all by name SLT test * chore: styling nits * fix: Correct handling of nullability when field is not present in all inputs * chore: Update fixme comment * fix: handle ordering by order of inputs

rkrishn7 added 2 commits March 14, 2025 22:10

fix: Remove incorrect predicate to skip input wrapping when rewriting…

50e1be9

… union inputs

chore: Add/update tests

5d9cca6

github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Mar 15, 2025

rkrishn7 changed the title ~~Fix/15236~~ fix: Unconditonally wrap UNION BY NAME input nodes w/ Projection Mar 15, 2025

alamb marked this pull request as draft March 17, 2025 16:10

alamb marked this pull request as ready for review March 17, 2025 16:11

rkrishn7 added 3 commits March 17, 2025 15:54

fix: SQL integration tests

3a7e3da

test: Add union all by name SLT tests

9c94a48

test: Add problematic union all by name SLT test

c3cdcc7

github-actions bot added the sql SQL Planner label Mar 17, 2025

rkrishn7 changed the title ~~fix: Unconditonally wrap UNION BY NAME input nodes w/ Projection~~ fix: Unconditionally wrap UNION BY NAME input nodes w/ Projection Mar 17, 2025

chore: styling nits

2f4e796

Omega359 reviewed Mar 18, 2025

View reviewed changes

fix: Correct handling of nullability when field is not present in all…

2eb3c79

… inputs

Omega359 reviewed Mar 24, 2025

View reviewed changes

rkrishn7 mentioned this pull request Mar 24, 2025

Each Union branch should produce fields with nullable: true if any of the branches produces fields with nullable: true on the same position #15394

Open

chore: Update fixme comment

31fae11

rkrishn7 commented Mar 25, 2025

View reviewed changes

fix: handle ordering by order of inputs

0204f35

rkrishn7 force-pushed the fix/15236 branch from 24b3946 to 0204f35 Compare March 25, 2025 05:23

rkrishn7 requested a review from Omega359 March 25, 2025 16:41

alamb approved these changes Mar 26, 2025

View reviewed changes

alamb merged commit dc7e44d into apache:main Mar 27, 2025
27 checks passed

findepi mentioned this pull request Jun 30, 2025

Fix spurious failure in convert_batches test helper #16627

Merged

fix: Unconditionally wrap UNION BY NAME input nodes w/ Projection #15242

fix: Unconditionally wrap UNION BY NAME input nodes w/ Projection #15242

Uh oh!

Conversation

rkrishn7 commented Mar 15, 2025 • edited by alamb Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

Are these changes tested?

Are there any user-facing changes?

Uh oh!

rkrishn7 commented Mar 15, 2025

Uh oh!

Omega359 commented Mar 15, 2025

Uh oh!

Omega359 commented Mar 15, 2025

Uh oh!

alamb commented Mar 17, 2025

Uh oh!

rkrishn7 commented Mar 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Omega359 Mar 18, 2025

Choose a reason for hiding this comment

Uh oh!

Omega359 commented Mar 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Omega359 commented Mar 21, 2025

Uh oh!

alamb commented Mar 21, 2025

Uh oh!

rkrishn7 commented Mar 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Omega359 Mar 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rkrishn7 Mar 24, 2025

Choose a reason for hiding this comment

Uh oh!

rkrishn7 Mar 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rkrishn7 Mar 25, 2025

Choose a reason for hiding this comment

Uh oh!

Omega359 commented Mar 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rkrishn7 commented Mar 24, 2025

Uh oh!

rkrishn7 Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rkrishn7 Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Omega359 commented Mar 25, 2025

Uh oh!

rkrishn7 commented Mar 25, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Mar 26, 2025

Choose a reason for hiding this comment

Uh oh!

rkrishn7 commented Mar 27, 2025

Uh oh!

Uh oh!

alamb commented Mar 27, 2025

Uh oh!

Reviewers

Assignees

Labels

fix: Unconditionally wrap UNION BY NAME input nodes w/ `Projection` #15242

fix: Unconditionally wrap UNION BY NAME input nodes w/ `Projection` #15242

rkrishn7 commented Mar 15, 2025 •

edited by alamb

Loading

rkrishn7 commented Mar 17, 2025 •

edited

Loading

Omega359 commented Mar 18, 2025 •

edited

Loading

rkrishn7 commented Mar 24, 2025 •

edited

Loading

Omega359 Mar 24, 2025 •

edited

Loading

rkrishn7 Mar 24, 2025 •

edited

Loading

Omega359 commented Mar 24, 2025 •

edited

Loading

rkrishn7 Mar 25, 2025 •

edited

Loading

rkrishn7 Mar 25, 2025 •

edited

Loading