TPC-H Q16 fails during deserialization #16665

@NGA-TRAN

Describe the bug

Datadog is building a distributed version of DataFusion, which requires serializing and deserializing queries. While testing with the TPC-H queries, we found that Q16 fails during deserialization, potentially because of an issue in the serialization step. I've minimized the query to a smaller form that still reproduces the problem:

SELECT p_size FROM part WHERE p_size IN (14, 6, 5, 31)

To Reproduce

See this PR for the reproducer

Expected behavior

Deserialization of the query should succeed.

Additional context

  • You can only reproduce this on actual TPC-H data. See the comments in the repro for details.
  • You won't hit the bug if the list has 3 or fewer items, e.g. `(14, 6, 5)`.
  • The same bug still happens if you replace SELECT p_size with SELECT p_brand, but the mismatched data type in the error message is then different. It looks to me like the data type of the list (14, 6, 5, 31) was wrongly read from some schema during serialization/deserialization, and that schema depends on the query and the parquet file.
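
For reference, the variants described in the notes above can be written out as follows. This is only a sketch assembled from those notes: the comments restate the reported behavior, the exact error text is not reproduced here, and the queries are assumed to run against the actual TPC-H `part` table as required by the reproducer.

```sql
-- Minimized query: reported to fail during deserialization (4 items in the list).
SELECT p_size FROM part WHERE p_size IN (14, 6, 5, 31);

-- Same query with 3 items in the list: reported NOT to hit the bug.
SELECT p_size FROM part WHERE p_size IN (14, 6, 5);

-- Projecting p_brand instead of p_size: reported to fail as well,
-- but with a different mismatched data type in the error message.
SELECT p_brand FROM part WHERE p_size IN (14, 6, 5, 31);
```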

Labels: bug
