Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Dec 1, 2025

Which issue does this PR close?

Rationale for this change

tpchgen-cli is 10x faster than dbgen for generating tpch data (see blog here)

Thus let's use that to generate tpch data for our benchmarks, rather than ancient docker / tpchgen

While I was testing this locally I also found a bunch of unecessary code

Also @comphead pointed out on #19034 (review) that the bench.sh data tpch generated both csv and parquet files when it only really needs parquet.

What changes are included in this PR?

  1. Use tpchgen-cli to generate tpch data for our benchmarks
  2. Do not generate tbl anymore (tpchgen-cli can make csv and parquet files directly)
  3. Remove the "convert" code and the tpch binary shim
  4. Update the readme to explain how to use tpchgen-cli to generate data

Are these changes tested?

I tested them manually using

./benchmarks/bench.sh data tpch
./benchmarks/bench.sh run tpch

./benchmarks/bench.sh data tpch_mem
./benchmarks/bench.sh run tpch_mem

./benchmarks/bench.sh data tpch_csv
./benchmarks/bench.sh run tpch_csv

./benchmarks/bench.sh data tpch10
./benchmarks/bench.sh run tpch10

./benchmarks/bench.sh data tpch_mem10
./benchmarks/bench.sh run tpch_mem10

./benchmarks/bench.sh data tpch_csv10
./benchmarks/bench.sh run tpch_csv10

Are there any user-facing changes?

No, this is internal develpment code

@alamb alamb marked this pull request as draft December 1, 2025 20:27
@comphead
Copy link
Contributor

comphead commented Dec 1, 2025

@alamb it still can generate TPCH data only? so for TPCDS currently i would be relying on pregenerated data

@alamb
Copy link
Contributor Author

alamb commented Dec 1, 2025

@alamb it still can generate TPCH data only? so for TPCDS currently i would be relying on pregenerated data

Yes, it only supports tpch data for now

@clflushopt is working on the tpchds support, see this ticket

However, I am not sure how far away we are

cargo run --release --features "mimalloc" --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
```

The benchmark program also supports CSV and Parquet input file formats and a utility is provided to convert from `tbl`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the data is generated directly using tpchgen-cli now, no need to convert data with another pass

@alamb alamb force-pushed the alamb/tpchgen_cli branch from 482901a to 03fd2f2 Compare December 1, 2025 20:50
echo " creating tbl files with tpch_dbgen..."
docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s "${SCALE_FACTOR}"
echo " creating tbl files with tpchgen-cli..."
tpchgen-cli --scale-factor "${SCALE_FACTOR}" --format tbl --output-dir "${TPCH_DIR}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the old docker command takes 10s of minutes on my laptop. Using tpchgen-cli takes 10s of seconds

use parquet::file::properties::WriterProperties;
use structopt::StructOpt;

/// Convert tpch .slt files to .parquet or .csv files
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer needed as the data is created directly in the format of interest (tbl, csv, or parquet) rather than converted from tbl


FORMAT=$2
debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format ${FORMAT} -o "${RESULTS_FILE}" ${QUERY_ARG}
debug_run $CARGO_COMMAND --bin dfbench -- tpch --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format ${FORMAT} -o "${RESULTS_FILE}" ${QUERY_ARG}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this updates bench.sh to use dfbench directly rather than the alternate tpch binary

@comphead
Copy link
Contributor

comphead commented Dec 1, 2025

@clflushopt is working on the tpchds support, see this ticket

However, I am not sure how far away we are

Okie, I'm having a reference to pregenerated files in #18985 but once generator is done, we can implement our own data_tpcds()

@alamb alamb force-pushed the alamb/tpchgen_cli branch from 35ffdb4 to 907bce3 Compare December 1, 2025 21:10
;;
tpch)
data_tpch "1"
data_tpch "1" "parquet"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per @comphead 's suggestion, now only the format required is made, rather than always creating all three formats

@alamb alamb marked this pull request as ready for review December 1, 2025 21:15
@alamb
Copy link
Contributor Author

alamb commented Dec 1, 2025

run benchmark tpch10

@alamb
Copy link
Contributor Author

alamb commented Dec 1, 2025

(I am playing around with my new test script)

@alamb
Copy link
Contributor Author

alamb commented Dec 1, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/tpchgen_cli (907bce3) to da36ad8 diff using: tpch10
Results will be posted here when complete

1 similar comment
@alamb
Copy link
Contributor Author

alamb commented Dec 1, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/tpchgen_cli (907bce3) to da36ad8 diff using: tpch10
Results will be posted here when complete

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb

echo "Internal error: Scale factor not specified"
exit 1
fi
FORMAT=$2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
FORMAT=$2
FORMAT=${2:-parquet}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand, that would default the argument to parquet -- what is the rationale for doing so?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #19035 (comment)
There are two calls of data_tpch there which do not pass the format.

https://github.com/apache/datafusion/pull/19035/files/907bce3e16352148eade3b7cf512091a9aab4232#diff-1769f5787dc11c8b1f1b48288cdf3c89d25a5b5cbc6be4740bfcc70a6313ba99R550 will print Creating tpch <EMPTY> dataset at Scale Factor, where <EMPTY> is an empty string.

And the third reason why I proposed parquet as default is:

Also @comphead pointed out on https://github.com/apache/datafusion/pull/19034#pullrequestreview-3526952491 that the bench.sh data tpch generated both csv and parquet files when it only really needs parquet.

This sounds like parquet is the needed format most of the time.

But data_h2o() uses CSV as a default format:
https://github.com/alamb/datafusion/blob/907bce3e16352148eade3b7cf512091a9aab4232/benchmarks/bench.sh#L853

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you are saying now - there are several other calls to data_tpch that don't pass the format argument here

https://github.com/alamb/datafusion/blob/21a0237a1b96cf42bd96e93e6fc184a8e320138f/benchmarks/bench.sh#L298-L308

eg

                sort_tpch)
                    # same data as for tpch
                    data_tpch "1"
                    ;;
                sort_tpch10)
                    # same data as for tpch10
                    data_tpch "10"
                    ;;
                topk_tpch)
                    # same data as for tpch
                    data_tpch "1"
                    ;;
                nlj)

I will update those calls to explicitly pass in a format as I think that will make it clear what is going on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in 9a5a3b0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻
Just FYI: There is a shorter syntax for this check too: FORMAT=${1:?Internal error: Format not specified}

@Omega359
Copy link
Contributor

Omega359 commented Dec 4, 2025

I could have used this today ... I had to install docker just to generate the tpch data.

@alamb alamb added this pull request to the merge queue Dec 4, 2025
@alamb
Copy link
Contributor Author

alamb commented Dec 4, 2025

Thanks @martin-g @Omega359 and @comphead

Merged via the queue into apache:main with commit f22a3f3 Dec 4, 2025
27 checks passed
@alamb alamb deleted the alamb/tpchgen_cli branch December 4, 2025 15:33
@alamb alamb self-assigned this Dec 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants