Contributor

@NGA-TRAN NGA-TRAN commented Jul 10, 2025

Which issue does this PR close?

These tests are based on @alamb’s suggestion at #16662 (review):

I also wonder if we should add examples / tpch serialization for all the remaining tpch queries in datafusion-proto

Rationale for this change

This PR introduces two separate tests:

  1. test_serialize_deserialize_tpch_queries: Validates serialization and deserialization for 21 of the 22 TPC-H queries. Note: Q16 is excluded due to a known bug (TPC-H Q16 fails during deserialization, #16665).
  2. test_round_trip_tpch_queries: Compares the original and deserialized versions of each query. Currently ignored: only 4 out of 22 queries pass at the moment.
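
The query selection described above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: the names `TPCH_QUERY_COUNT`, `queries_to_test`, and `failing_queries` are hypothetical, and the `check` closure stands in for whatever serialize/deserialize step a real test would run.

```rust
/// Total number of queries in the TPC-H benchmark.
const TPCH_QUERY_COUNT: u32 = 22;

/// Returns the query numbers to exercise, leaving out any in `excluded`
/// (for example, a query with a known open bug).
fn queries_to_test(excluded: &[u32]) -> Vec<u32> {
    (1..=TPCH_QUERY_COUNT)
        .filter(|q| !excluded.contains(q))
        .collect()
}

/// Runs `check` on every selected query and returns the numbers that
/// failed, so that one failing query does not hide the others.
/// `check` is a placeholder for the real per-query test body.
fn failing_queries<F>(queries: &[u32], mut check: F) -> Vec<u32>
where
    F: FnMut(u32) -> Result<(), String>,
{
    queries
        .iter()
        .copied()
        .filter(|&q| check(q).is_err())
        .collect()
}
```

Calling `queries_to_test(&[16])` yields the 21 remaining query numbers. Collecting failures instead of stopping at the first one makes it easier to report which queries pass, as the ignored round-trip test does in its comment.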

Each TPC-H Parquet file contains 20 rows, designed to simulate the actual schema.

What changes are included in this PR?

New tests

Are these changes tested?

They are tests. No new functionality.

Are there any user-facing changes?

No

@github-actions github-actions bot added the labels core (Core DataFusion crate) and proto (Related to proto crate) on Jul 10, 2025
// Only 4 queries pass: q3, q5, q10, q12
// The rest fail for 2 different reasons:
// - q16 fails at the deserialization step: https://github.com/apache/datafusion/issues/16665
// - The other queries fail due to a mismatch between the original and deserialized physical plans
Contributor Author

@NGA-TRAN NGA-TRAN Jul 10, 2025

@alamb: Let me know if you think a separate bug ticket for this is necessary. I am happy to file it.

Contributor

Since I think @XiangpengHao says that #16744 does not fix them all, I do think we should file a separate bug to track the other failures. Who knows -- maybe even other people will fix them like @XiangpengHao did for #16665 ❤️

Contributor Author

Created #16772

Contributor

@LiaCastaneda LiaCastaneda left a comment

lgtm 👍

use datafusion_common::test_util::datafusion_test_data;

let ctx = SessionContext::new();
let test_data = datafusion_test_data();
Contributor

There is also parquet_test_data(), which points to the data stored in datafusion/parquet-testing.

// Create external tables for all TPC-H tables
for table in &tables {
let table_sql = format!(
"CREATE EXTERNAL TABLE {table} STORED AS PARQUET LOCATION '{test_data}/tpch_{table}_small.parquet'"
Contributor

we can also use register_parquet
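
The table-setup loop quoted above can be sketched with the standard library only, to show what each iteration produces. This is an illustration under stated assumptions: the contents of the PR's `tables` variable are not shown in this thread, so the list below uses the eight standard TPC-H table names, and the helper names `small_parquet_path`, `create_table_sql`, and `all_create_table_sql` are hypothetical.

```rust
/// The eight standard TPC-H table names. Assumption: the PR's `tables`
/// variable is not shown here, so this list is illustrative.
const TPCH_TABLES: [&str; 8] = [
    "customer", "lineitem", "nation", "orders",
    "part", "partsupp", "region", "supplier",
];

/// Path of the small Parquet file for `table` under `test_data`,
/// following the `tpch_{table}_small.parquet` pattern from the snippet.
fn small_parquet_path(test_data: &str, table: &str) -> String {
    format!("{test_data}/tpch_{table}_small.parquet")
}

/// Builds the CREATE EXTERNAL TABLE statement for a single table.
fn create_table_sql(test_data: &str, table: &str) -> String {
    let path = small_parquet_path(test_data, table);
    format!("CREATE EXTERNAL TABLE {table} STORED AS PARQUET LOCATION '{path}'")
}

/// Builds one statement per table in `TPCH_TABLES`.
fn all_create_table_sql(test_data: &str) -> Vec<String> {
    TPCH_TABLES
        .iter()
        .map(|table| create_table_sql(test_data, table))
        .collect()
}
```

With the `register_parquet` approach suggested here, the same `small_parquet_path` result would be passed to the session context directly instead of being embedded in a SQL string.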

@LiaCastaneda
Contributor

Just out of curiosity, do you know if the issue is that a specific node can't serialize?

@NGA-TRAN
Contributor Author

Just out of curiosity, do you know if the issue is that a specific node can't serialize?

Some info and fix:

@XiangpengHao
Contributor

XiangpengHao commented Jul 11, 2025

I checked #16744 with this test, and can confirm that most tests still fail.

A closer look shows that it's mostly due to the field "human_display": the deserialized version seems to strip out human_display. So it's probably a different bug from #16665.

I plan to fix the human_display thing over the weekend unless someone beats me to it.

Here's a diff: https://www.diffchecker.com/Jy2neEt0/
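
The kind of mismatch described in this comment can be reproduced with a toy model. The sketch below is not DataFusion's proto code: `ToyExpr`, `encode`, `decode`, and `round_trips` are hypothetical names, and the "encoding" is deliberately lossy to show how a field that is never written makes an otherwise correct round trip compare as unequal.

```rust
/// Toy stand-in for a plan expression. This is NOT a DataFusion type.
#[derive(Debug, Clone, PartialEq)]
struct ToyExpr {
    name: String,
    human_display: String,
}

/// Hypothetical lossy encoding: only `name` is written out,
/// `human_display` is silently dropped.
fn encode(expr: &ToyExpr) -> String {
    expr.name.clone()
}

/// Decoding has no way to recover `human_display`, so it comes back empty.
fn decode(encoded: &str) -> ToyExpr {
    ToyExpr {
        name: encoded.to_string(),
        human_display: String::new(),
    }
}

/// True when encoding and then decoding gives back an equal value.
fn round_trips(expr: &ToyExpr) -> bool {
    decode(&encode(expr)) == *expr
}
```

In this model, a value with a non-empty `human_display` fails the round-trip check even though its `name` survives intact, while a value whose `human_display` is already empty passes.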

@XiangpengHao
Contributor

A little more investigation shows that some of the non-determinism is introduced by HashSet, so we probably also want to change how we compare plans.
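
One possible way to make such a comparison deterministic is to sort set contents before rendering them. The following is a minimal standard-library sketch, not the fix adopted in DataFusion; `render_unordered` and `render_sorted` are hypothetical helpers.

```rust
use std::collections::{BTreeSet, HashSet};

/// Renders a set in iteration order. For `HashSet` that order is
/// unspecified, so two equal sets may render differently.
fn render_unordered(cols: &HashSet<String>) -> String {
    cols.iter().cloned().collect::<Vec<String>>().join(", ")
}

/// Renders a set after sorting its elements, so equal sets always
/// produce the same string regardless of insertion or hash order.
fn render_sorted(cols: &HashSet<String>) -> String {
    let sorted: BTreeSet<&String> = cols.iter().collect();
    sorted
        .into_iter()
        .cloned()
        .collect::<Vec<String>>()
        .join(", ")
}
```

Comparing plans through a sorted rendering (or comparing the sets themselves rather than their printed form) avoids false mismatches that come only from iteration order.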

Contributor

@alamb alamb left a comment

Thanks @NGA-TRAN -- I think this could be merged as is.

If you have time, it would be really nice to consolidate the tests and file another ticket as suggested in the text.

Thank you @LiaCastaneda for the review

@NGA-TRAN
Contributor Author

NGA-TRAN commented Jul 14, 2025

@alamb and @XiangpengHao: I have created a new ticket for the display and HashSet bug, and also merged the Q16 test into the new test for all TPC-H queries.

Contributor

@alamb alamb left a comment

Thanks again @NGA-TRAN and @LiaCastaneda

@alamb alamb merged commit a45a4c4 into apache:main Jul 14, 2025
27 checks passed