fix: Unconditionally wrap UNION BY NAME input nodes w/ Projection #15242
Conversation
Was trying to add all the queries from the issue, but ran into problems with Projection.

Which of the queries had issues?
I had issues with GitHub this morning, so here is a patch to get the tests running:
Thanks @Omega359! @alamb Yes, sorry for the delay, I can fix this up later today. To expand on my comment here: this was just occurring when collecting the results in the SLT suite. The actual result was correct (verified in the CLI). But I'll fix up the integration tests after work so all the checks pass and this is mergeable. Then maybe I can look into the above issue separately, if that sounds good.
Field {
    name: "zz",
    data_type: Utf8,
    nullable: true,
Interesting that the nullable changed from false to true here.
That sounds good! I took a look at convert_batches and I cannot see offhand how it is doing anything problematic. I suspect the issue is elsewhere and that function is just making it visible.
I have a thought as to why it's happening. When you fill a column with null because that input is missing the column, you are not preserving the other aspects of the column, specifically nullable. Thus, different batches (for the different inputs) can have differing nullable values.

Input 1: 'zz' column missing, so it is added and filled with null (thus nullable = true).

In this case I think we would have to actually change the column nullability for the input that has the column (from non-nullable to nullable), or wrap another projection that accomplishes the same.
@rkrishn7 - what do you think of my above comment for the cause of the issue? I personally think it should be fixed before this PR is approved, not after.

Once @Omega359 is happy with this PR I will be too -- just let me know
@Omega359 Thanks for your thoughts! And yes, I agree that the problem isn't with convert_batches.
Yup, agreed! If one input contains a non-nullable column and the other does not, then the final schema will have the field's nullability set to false when it really should be true. I've updated the logic there to account for the case where the number of occurrences of the field across all inputs is less than the number of inputs (i.e. the field is missing from one or more inputs). In that case, it must be treated as nullable.

Unfortunately, I don't think the problem is solved there. Upon further investigation, it seems like another problem exists: there is information loss between the logical and physical schemas. Specifically, when constructing the physical expression for a literal, the nullability is not determined by the already-known schema; it is simply based on whether or not the literal is null. This can be observed by updating the implementation of ... I think this should be fixed. Specifically, the nullability of ...
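To make the nullability rule discussed above concrete, here is a minimal, self-contained Rust sketch. The FieldInfo type and helper are hypothetical illustrations, not DataFusion's schema API, and column ordering is ignored here (that is handled by the wrapping Projection): a field is nullable in the UNION BY NAME output if any input marks it nullable, or if it is missing from at least one input.

```rust
// A minimal sketch of the nullability rule (hypothetical types, not DataFusion's API).
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
struct FieldInfo {
    name: String,
    nullable: bool,
}

fn union_by_name_nullability(inputs: &[Vec<FieldInfo>]) -> Vec<FieldInfo> {
    // Track, per field name, how many inputs contain it and whether any
    // of them already marks it nullable.
    let mut seen: BTreeMap<String, (usize, bool)> = BTreeMap::new();
    for input in inputs {
        for field in input {
            let entry = seen.entry(field.name.clone()).or_insert((0, false));
            entry.0 += 1;
            entry.1 |= field.nullable;
        }
    }
    seen.into_iter()
        .map(|(name, (occurrences, any_nullable))| FieldInfo {
            name,
            // Missing from one or more inputs means the union fills that
            // column with NULLs, so the output field must be nullable.
            nullable: any_nullable || occurrences < inputs.len(),
        })
        .collect()
}

fn main() {
    let input1 = vec![FieldInfo { name: "a".into(), nullable: false }];
    let input2 = vec![
        FieldInfo { name: "a".into(), nullable: false },
        // "zz" exists only in this input and is non-nullable here...
        FieldInfo { name: "zz".into(), nullable: false },
    ];
    let fields = union_by_name_nullability(&[input1, input2]);
    // ...but it must become nullable in the union output.
    assert!(fields.iter().any(|f| f.name == "zz" && f.nullable));
}
```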
\n TableScan: orders\
\n Projection: orders.order_id, Int64(1)\
\n TableScan: orders";
\n Projection: Int64(1), order_id\
I'm trying to figure out where this Projection (Projection: Int64(1), order_id) is coming from and why it's ordered this way. I would expect it to be ordered as order_id, Int64(1), as that is the order on the rhs.
As well, this test has an expected distinct clause ... the query doesn't. I can't see a reason why that would be introduced. UNION vs UNION ALL. Never mind, it's a Monday :)
The Projection is coming from the change in this PR that unconditionally wraps input nodes with a Projection. The ordering is like this because of the ordering of the schema columns we use to calculate the expression list for the wrapped projection. They are essentially in alphabetical order.
But thanks for catching this. I agree, the ordering seems incorrect here. I think the ordering should be driven by the ordering on the lhs (this seems to also be what duckdb does)
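As a rough illustration of that lhs-driven ordering (a sketch only, with a hypothetical helper rather than the PR's actual code): shared columns keep the left input's order, and columns that appear only on the right are appended at the end.

```rust
// Sketch of lhs-driven column ordering for UNION BY NAME (hypothetical helper,
// not the PR's implementation): left-hand columns keep their order, and
// right-hand-only columns are appended afterwards.
fn union_by_name_column_order(lhs: &[&str], rhs: &[&str]) -> Vec<String> {
    let mut order: Vec<String> = lhs.iter().map(|c| c.to_string()).collect();
    for col in rhs {
        if !lhs.contains(col) {
            order.push(col.to_string());
        }
    }
    order
}

fn main() {
    // Columns shared by both sides stay where the lhs put them; the rhs-only
    // column lands at the end instead of being sorted alphabetically.
    assert_eq!(
        union_by_name_column_order(&["order_id", "b"], &["a", "order_id"]),
        vec!["order_id", "b", "a"]
    );
}
```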
This was actually producing incorrect results as well in cases where one of the inputs is not wrapped. Now it's just the ordering that is the issue.
I'll be able to push up a fix here after work today.
@Omega359 This should be fixed now.
Confirmed. If you don't mind, could you file a followup issue for this and update the comment in union_by_name.slt to note that issue? I think this fix is complete otherwise.
query III
SELECT 1 UNION BY NAME SELECT * FROM unnest(range(2, 100)) UNION BY NAME SELECT 999 ORDER BY 3, 1 LIMIT 5;
SELECT 1 UNION BY NAME SELECT * FROM unnest(range(2, 100)) UNION BY NAME SELECT 999 ORDER BY 3, 1, 2 LIMIT 5;
Previously the unnest column was incorrectly at position 3. Now it is correctly at position 2 so the result here has changed.
The extra ordering is to order the unnest values for a deterministic result.
Also verified result is consistent with duckdb's.
NULL NULL 4
NULL NULL 5
NULL NULL 6
NULL NULL 999
Same here - result set is changing due to column reordering. Also verified result is consistent with duckdb's.
This looks good to me, thanks! I've started poking at the Literal/nullable issue a bit. No solution yet though.
alamb left a comment
let schema_width = schema.iter().count();
let mut wrapped_inputs = Vec::with_capacity(inputs.len());
for input in inputs {
    // If the input plan's schema contains the same number of fields
This does indeed seem to be an overzealous optimization
Thanks @alamb - is this ready to get merged?
…pache#15242)

* fix: Remove incorrect predicate to skip input wrapping when rewriting union inputs
* chore: Add/update tests
* fix: SQL integration tests
* test: Add union all by name SLT tests
* test: Add problematic union all by name SLT test
* chore: styling nits
* fix: Correct handling of nullability when field is not present in all inputs
* chore: Update fixme comment
* fix: handle ordering by order of inputs
Which issue does this PR close?
Rationale for this change
An assumption made by a predicate while re-writing union inputs is incorrect.
Even if an input node's schema has the same width as the final schema, we still want to wrap the input with a projection: matching width doesn't guarantee that the columns are in the unified schema's order.
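As a hedged illustration (hypothetical types and column names, not the PR's actual code): the wrapping projection re-expresses each input in the unified schema's column order, mapping each output column to the input's matching column or to NULL when the input lacks it, so the wrap is needed even when the widths match.

```rust
// Sketch of why every input gets wrapped (hypothetical types, not DataFusion's
// API): the projection reorders each input to the unified schema's column
// order, filling missing columns with NULL.
#[derive(Debug, PartialEq)]
enum ProjExpr {
    Column(String),
    NullLiteral,
}

fn projection_for_input(union_cols: &[&str], input_cols: &[&str]) -> Vec<ProjExpr> {
    union_cols
        .iter()
        .map(|col| {
            if input_cols.contains(col) {
                ProjExpr::Column((*col).to_string())
            } else {
                // The input lacks this column, so the projection fills it with NULL.
                ProjExpr::NullLiteral
            }
        })
        .collect()
}

fn main() {
    // Same width as the unified schema, but a different column order: skipping
    // the projection here would pair values positionally and produce wrong rows.
    let exprs = projection_for_input(&["a", "b"], &["b", "a"]);
    assert_eq!(
        exprs,
        vec![ProjExpr::Column("a".into()), ProjExpr::Column("b".into())]
    );
}
```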
Are these changes tested?
Yes
Are there any user-facing changes?
N/A