-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Add serialization/deserialization and round-trip tests for all tpc-h queries #16742
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| // Only 4 queries pass: q3, q5, q10, q12 | ||
| // The rest fails with 2 different reasons: | ||
| // - q16 fails at deserialization step: https://github.com/apache/datafusion/issues/16665 | ||
| // - Other queries fails due to mismatch between the serialized and deserialized physical plans |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb : Let me know if you think a different bug ticket for this is necessary. I am happy to file it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I think @XiangpengHao says that #16744 does not fix them all, I do think we should file a separate bug to track the other failures. Who knows -- maybe even other people will fix them like @XiangpengHao did for #16665 ❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #16772
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm 👍
| use datafusion_common::test_util::datafusion_test_data; | ||
|
|
||
| let ctx = SessionContext::new(); | ||
| let test_data = datafusion_test_data(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is also parquet_test_data() which stores data on datafusion/parquet-testing
| // Create external tables for all TPC-H tables | ||
| for table in &tables { | ||
| let table_sql = format!( | ||
| "CREATE EXTERNAL TABLE {table} STORED AS PARQUET LOCATION '{test_data}/tpch_{table}_small.parquet'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can also use register_parquet
|
Just out of curiosity, do you know if the issue is that an specific node can't serialize? |
Some info and fix: |
|
I checked #16744 with this test, and confirm that most tests still fail. A closer look at this show that it's mostly due to the field "human_display", the deserialized one seems to strip out the human_display. So it's probably a different bug from #16665 I plan to fix the human_display thing over the weekend unless someone beats me to it. Here's a diff: https://www.diffchecker.com/Jy2neEt0/ |
|
A little bit more investigation show that some of the non-determinism is introduced by hashset, so we probably also want to change how we compare plans. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @NGA-TRAN -- I think this could be merged as is.
If you have time it would be really nice to consolidate the tests and file another ticket as suggested in the text
Thank you @LiaCastaneda for the review
| // Only 4 queries pass: q3, q5, q10, q12 | ||
| // The rest fails with 2 different reasons: | ||
| // - q16 fails at deserialization step: https://github.com/apache/datafusion/issues/16665 | ||
| // - Other queries fails due to mismatch between the serialized and deserialized physical plans |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I think @XiangpengHao says that #16744 does not fix them all, I do think we should file a separate bug to track the other failures. Who knows -- maybe even other people will fix them like @XiangpengHao did for #16665 ❤️
|
@alamb and @XiangpengHao : I have created a new ticket for the display & hashset bug; and also merged q16 test into the new test for all tpc-h queries |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @NGA-TRAN and @LiaCastaneda
Which issue does this PR close?
These tests are based on @alamb’s suggestion at #16662 (review):
Rationale for this change
This PR introduces two separate tests:
Each TPC-H Parquet file contains 20 rows, designed to simulate the actual schema.
What changes are included in this PR?
New tests
Are these changes tested?
They are tests. No new functionality
Are there any user-facing changes?
No