Skip to content

Conversation

@rkrishn7
Copy link
Contributor

@rkrishn7 rkrishn7 commented Mar 15, 2025

Which issue does this PR close?

Rationale for this change

An assumption made by a predicate while re-writing union inputs is incorrect.

Even if an input node's schema has the same width as the final schema, we still want to wrap the input with a projection as ordering is this doesn't guarantee ordering.

Are these changes tested?

Yes

Are there any user-facing changes?

N/A

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Mar 15, 2025
@rkrishn7
Copy link
Contributor Author

Was trying to add all the queries from the issue, but ran into problems with normalize::convert_batches during the SLTs 🤔

@rkrishn7 rkrishn7 changed the title Fix/15236 fix: Unconditonally wrap UNION BY NAME input nodes w/ Projection Mar 15, 2025
@Omega359
Copy link
Contributor

Was trying to add all the queries from the issue, but ran into problems with normalize::convert_batches during the SLTs 🤔

Which of the queries had issues?

@Omega359
Copy link
Contributor

I had issues with github this morning so here is a patch to get tests running:

Index: datafusion/sql/tests/sql_integration.rs
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/datafusion/sql/tests/sql_integration.rs b/datafusion/sql/tests/sql_integration.rs
--- a/datafusion/sql/tests/sql_integration.rs	(revision 5d9cca60d65e5f988c3c383c781de6903d190bf1)
+++ b/datafusion/sql/tests/sql_integration.rs	(revision 27d0f85595804c7163f8adad90af6ff20f96fd4f)
@@ -1901,8 +1901,9 @@
         \n    Projection: NULL AS Int64(1), order_id\
         \n      Projection: orders.order_id\
         \n        TableScan: orders\
-        \n    Projection: orders.order_id, Int64(1)\
-        \n      TableScan: orders";
+        \n    Projection: Int64(1), order_id\
+        \n      Projection: orders.order_id, Int64(1)\
+        \n        TableScan: orders";
     quick_test(sql, expected);
 }
 
@@ -1939,19 +1940,23 @@
         \n  Projection: NULL AS Int64(1), order_id\
         \n    Projection: orders.order_id\
         \n      TableScan: orders\
-        \n  Projection: orders.order_id, Int64(1)\
-        \n    TableScan: orders";
+        \n  Projection: Int64(1), order_id\
+        \n    Projection: orders.order_id, Int64(1)\
+        \n      TableScan: orders";
     quick_test(sql, expected);
 }
 
 #[test]
 fn union_all_by_name_same_column_names() {
     let sql = "SELECT order_id from orders UNION ALL BY NAME SELECT order_id FROM orders";
-    let expected = "Union\
-            \n  Projection: orders.order_id\
-            \n    TableScan: orders\
-            \n  Projection: orders.order_id\
-            \n    TableScan: orders";
+    let expected = "\
+        Union\
+        \n  Projection: order_id\
+        \n    Projection: orders.order_id\
+        \n      TableScan: orders\
+        \n  Projection: order_id\
+        \n    Projection: orders.order_id\
+        \n      TableScan: orders";
     quick_test(sql, expected);
 }
 
Index: datafusion/sqllogictest/test_files/union_by_name.slt
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/datafusion/sqllogictest/test_files/union_by_name.slt b/datafusion/sqllogictest/test_files/union_by_name.slt
--- a/datafusion/sqllogictest/test_files/union_by_name.slt	(revision 5d9cca60d65e5f988c3c383c781de6903d190bf1)
+++ b/datafusion/sqllogictest/test_files/union_by_name.slt	(revision 27d0f85595804c7163f8adad90af6ff20f96fd4f)
@@ -315,3 +315,21 @@
 ----
 a b c NULL
 a b c d
+
+query TTT rowsort
+select x, y, z from t3 union all by name select z, y, x from t3;
+----
+a b c
+a b c
+
+query TTT rowsort
+select x, y, z from t3 union all by name select z, y, x from t4;
+----
+a b c
+a b c
+
+query TTT rowsort
+select x, y, z from t3 union all by name select z, y, x from t4 order by x;
+----
+a b c
+a b c

@alamb alamb marked this pull request as draft March 17, 2025 16:10
@alamb alamb marked this pull request as ready for review March 17, 2025 16:11
@alamb
Copy link
Contributor

alamb commented Mar 17, 2025

@rkrishn7 will you have time to apply @Omega359 's suggestion?

@rkrishn7
Copy link
Contributor Author

rkrishn7 commented Mar 17, 2025

Thanks @Omega359!

@alamb Yes sorry for the delay, I can fix this up later today.

To expand on my comment here, this was regarding normalize::convert_batches reporting different record batch schemas for UNION ALL BY NAME queries where one side has a literal, e.g.:

select x, y, z from t3 union all by name select z, y, x, 'zz' as zz from t4 order by x;

Though, this was just occurring when collecting the results in the SLT suite. The actual result was correct (verified in the CLI).

But I'll fix up the integration tests after work so all the checks pass and this is mergeable. Then maybe I can look into the above issue separately if that sounds good.

@github-actions github-actions bot added the sql SQL Planner label Mar 17, 2025
@rkrishn7 rkrishn7 changed the title fix: Unconditonally wrap UNION BY NAME input nodes w/ Projection fix: Unconditionally wrap UNION BY NAME input nodes w/ Projection Mar 17, 2025
Field {
name: "zz",
data_type: Utf8,
nullable: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting that the nullable changed from false to true here.

@Omega359
Copy link
Contributor

Omega359 commented Mar 18, 2025

But I'll fix up the integration tests after work so all the checks pass and this is mergeable. Then maybe I can look into the above issue separately if that sounds good.

That sounds good! I took a look at the convert_batches and I cannot see offhand how it is doing anything problematic. I suspect the issue is elsewhere and that function is just making it visible.

Why a field would change from not nullable to nullable for one (or more) batches in a query after the first one is highly suspicious to me.

I have a thought as to why it's happening. When you fill a column with null when that input is missing the column you are not preserving the other aspects of the column, specifically nullable. Thus, different batches (for the different inputs) can have differing nullable values.

Input 1: 'zz' column missing, is added and set to null (thus nullable = true)
Input 2: 'zz' column exists, is literal and nullable = false

In this case I think we would have to actually change the column nullability for the input that has the column (from non-nullable to nullable) or to wrap another projection that accomplishes the same.

@Omega359
Copy link
Contributor

@rkrishn7 - what do you think of my above comment for the cause of the issue? I personally think it should be fixed before this PR is approved, not after.

@alamb
Copy link
Contributor

alamb commented Mar 21, 2025

Once @Omega359 is happy with this PR I will be too -- just let me know

@rkrishn7
Copy link
Contributor Author

rkrishn7 commented Mar 24, 2025

@Omega359 Thanks for your thoughts!

And yes, I agree that the problem isn't with normalize::convert_batches. It is just being surfaced there.

I have a thought as to why it's happening. When you fill a column with null when that input is missing the column you are not preserving the other aspects of the column, specifically nullable. Thus, different batches (for the different inputs) can have differing nullable values.

Input 1: 'zz' column missing, is added and set to null (thus nullable = true)
Input 2: 'zz' column exists, is literal and nullable = false

In this case I think we would have to actually change the column nullability for the input that has the column (from non-nullable to nullable) or to wrap another projection that accomplishes the same.

Yup, agreed! If one input contains a non-nullable column and the other does not, then the final schema will have the field's nullability set to false when it really should be true.

I've updated the logic there to account for if the number of occurrences of the field across all inputs is less than the number of inputs (i.e. the field is missing from one or more inputs). In this case, it must be treated as nullable.

Unfortunately I don't think the problem is solved there. Upon further investigation, it seems like another problem exists with there being information loss between the logical and physical schemas. Specifically, when constructing the physical expression for a literal, the nullability is not determined by the already known schema. It is simply based on whether or not the literal is null.

This can be observed by updating the implementation of PhysicalExpr::is_nullable for Literal to return Ok(true). After that, the test suite passes successfully.

I think this should be fixed. Specifically, the nullability of Literal should be derived from its surrounding context. Rather than defaulting to self.value.is_null(). But, I also think it's fair to say this is another issue outside the scope of this PR.

\n TableScan: orders\
\n Projection: orders.order_id, Int64(1)\
\n TableScan: orders";
\n Projection: Int64(1), order_id\
Copy link
Contributor

@Omega359 Omega359 Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to figure out where this Projection (Projection: Int64(1), order_id) is coming from and why it's ordered this way. I would expect it to be ordered as order_id, Int64 as that is the order on the rhs.

As well, this test has an expected distinct clause ... the query doesn't. I can't see a reason why that would be introduced. UNION vs UNION ALL. Never mind, it's a Monday :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Projection is coming from the change in this PR that unconditionally wraps input nodes with a Projection. The ordering is like this because of the ordering of the schema columns we use to calculate the expression list for the wrapped projection. They are essentially in alphabetical order.

But thanks for catching this. I agree, the ordering seems incorrect here. I think the ordering should be driven by the ordering on the lhs (this seems to also be what duckdb does)

Copy link
Contributor Author

@rkrishn7 rkrishn7 Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually producing incorrect results as well in cases where one of the inputs is not wrapped. Now it's just the ordering that is the issue.

I'll be able to push up a fix here after work today.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Omega359 This should be fixed now.

@Omega359
Copy link
Contributor

Omega359 commented Mar 24, 2025

Unfortunately I don't think the problem is solved there. Upon further investigation, it seems like another problem exists with there being information loss between the logical and physical schemas. Specifically, when constructing the physical expression for a literal, the nullability is not determined by the already known schema. It is simply based on whether or not the literal is null.

This can be observed by updating the implementation of PhysicalExpr::is_nullable for Literal to return Ok(true). After that, the test suite passes successfully.

I think this should be fixed. Specifically, the nullability of Literal should be derived from its surrounding context. Rather than defaulting to self.value.is_null(). But, I also think it's fair to say this is another issue outside the scope of this PR.

Confirmed. If you don't mind could you file a followup issue for this and update the comment in the union_by_name.slt to note that issue. I think this fix is complete otherwise.

@rkrishn7
Copy link
Contributor Author

Unfortunately I don't think the problem is solved there. Upon further investigation, it seems like another problem exists with there being information loss between the logical and physical schemas. Specifically, when constructing the physical expression for a literal, the nullability is not determined by the already known schema. It is simply based on whether or not the literal is null.
This can be observed by updating the implementation of PhysicalExpr::is_nullable for Literal to return Ok(true). After that, the test suite passes successfully.
I think this should be fixed. Specifically, the nullability of Literal should be derived from its surrounding context. Rather than defaulting to self.value.is_null(). But, I also think it's fair to say this is another issue outside the scope of this PR.

Confirmed. If you don't mind could you file a followup issue for this and update the comment in the union_by_name.slt to note that issue. I think this fix is complete otherwise.

@Omega359 Done! Issue here


query III
SELECT 1 UNION BY NAME SELECT * FROM unnest(range(2, 100)) UNION BY NAME SELECT 999 ORDER BY 3, 1 LIMIT 5;
SELECT 1 UNION BY NAME SELECT * FROM unnest(range(2, 100)) UNION BY NAME SELECT 999 ORDER BY 3, 1, 2 LIMIT 5;
Copy link
Contributor Author

@rkrishn7 rkrishn7 Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously the unnest column was incorrectly at position 3. Now it is correctly at position 2 so the result here has changed.

The extra ordering is to order the unnest values for a deterministic result.

Also verified result is consistent with duckdb's.

NULL NULL 4
NULL NULL 5
NULL NULL 6
NULL NULL 999
Copy link
Contributor Author

@rkrishn7 rkrishn7 Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - result set is changing due to column reordering. Also verified result is consistent with duckdb's.

@Omega359
Copy link
Contributor

This looks good to me, thanks!

I've started poking at the Literal/nullable issue a bit. No solution yet though.

@rkrishn7
Copy link
Contributor Author

This looks good to me, thanks!

I've started poking at the Literal/nullable issue a bit. No solution yet though.

Thanks for the thorough review @Omega359. I feel a lot better about this now.

cc @alamb - I think we should be gtg here

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- thank you @rkrishn7 and @Omega359

let schema_width = schema.iter().count();
let mut wrapped_inputs = Vec::with_capacity(inputs.len());
for input in inputs {
// If the input plan's schema contains the same number of fields
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does indeed seem to be an overzealous optimization

@rkrishn7
Copy link
Contributor Author

Thanks @alamb - is this ready to get merged?

@alamb alamb merged commit dc7e44d into apache:main Mar 27, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 27, 2025

Thanks again @rkrishn7 and @Omega359

qstommyshu pushed a commit to qstommyshu/datafusion that referenced this pull request Mar 28, 2025
…pache#15242)

* fix: Remove incorrect predicate to skip input wrapping when rewriting union inputs

* chore: Add/update tests

* fix: SQL integration tests

* test: Add union all by name SLT tests

* test: Add problematic union all by name SLT test

* chore: styling nits

* fix: Correct handling of nullability when field is not present in all inputs

* chore: Update fixme comment

* fix: handle ordering by order of inputs
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
…pache#15242)

* fix: Remove incorrect predicate to skip input wrapping when rewriting union inputs

* chore: Add/update tests

* fix: SQL integration tests

* test: Add union all by name SLT tests

* test: Add problematic union all by name SLT test

* chore: styling nits

* fix: Correct handling of nullability when field is not present in all inputs

* chore: Update fixme comment

* fix: handle ordering by order of inputs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

union by name doesn't seem to be working correctly

3 participants