Use `tpchgen-cli` to generate tpch data in bench.sh #19035

alamb · 2025-12-01T20:27:43Z

Which issue does this PR close?

Rationale for this change

tpchgen-cli is 10x faster than dbgen for generating tpch data (see blog here)

Thus let's use that to generate tpch data for our benchmarks, rather than the ancient Docker-based dbgen image

While I was testing this locally I also found a bunch of unnecessary code

Also @comphead pointed out on #19034 (review) that `bench.sh data tpch` generated both csv and parquet files when it only really needs parquet.

What changes are included in this PR?

Use tpchgen-cli to generate tpch data for our benchmarks
Do not generate tbl anymore (tpchgen-cli can make csv and parquet files directly)
Remove the "convert" code and the tpch binary shim
Update the readme to explain how to use tpchgen-cli to generate data

Are these changes tested?

I tested them manually using

./benchmarks/bench.sh data tpch
./benchmarks/bench.sh run tpch

./benchmarks/bench.sh data tpch_mem
./benchmarks/bench.sh run tpch_mem

./benchmarks/bench.sh data tpch_csv
./benchmarks/bench.sh run tpch_csv

./benchmarks/bench.sh data tpch10
./benchmarks/bench.sh run tpch10

./benchmarks/bench.sh data tpch_mem10
./benchmarks/bench.sh run tpch_mem10

./benchmarks/bench.sh data tpch_csv10
./benchmarks/bench.sh run tpch_csv10

Are there any user-facing changes?

No, this is internal development code

comphead · 2025-12-01T20:32:19Z

@alamb it still can generate TPCH data only? so for TPCDS currently i would be relying on pregenerated data

alamb · 2025-12-01T20:46:59Z

> @alamb it still can generate TPCH data only? so for TPCDS currently i would be relying on pregenerated data

Yes, it only supports tpch data for now

@clflushopt is working on the TPC-DS support, see this ticket

[FEATURE] Extend to support TPC-DS data generation clflushopt/tpchgen-rs#51

However, I am not sure how far away we are

alamb · 2025-12-01T20:47:36Z

benchmarks/README.md

 cargo run --release --features "mimalloc" --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
 ```

-The benchmark program also supports CSV and Parquet input file formats and a utility is provided to convert from `tbl`


the data is generated directly using tpchgen-cli now, no need to convert data with another pass

alamb · 2025-12-01T20:48:24Z

benchmarks/bench.sh

-        echo " creating tbl files with tpch_dbgen..."
-        docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s "${SCALE_FACTOR}"
+        echo " creating tbl files with tpchgen-cli..."
+        tpchgen-cli --scale-factor "${SCALE_FACTOR}" --format tbl --output-dir "${TPCH_DIR}"


Using the old docker command takes tens of minutes on my laptop. Using tpchgen-cli takes tens of seconds
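
For readers who want to try the new invocation outside of `bench.sh`, here is a minimal sketch of a wrapper around it. The three flags are the ones shown in the diff above; the function name `generate_tpch` and the `DRY_RUN` switch are hypothetical and are not part of `bench.sh`.

```shell
# Hypothetical helper (not part of bench.sh): runs tpchgen-cli with the
# flags shown in the diff above, or just prints the command when DRY_RUN=1.
generate_tpch() {
    scale_factor=$1
    format=$2
    out_dir=$3

    # DRY_RUN is an assumption made for this sketch: print, do not execute.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "tpchgen-cli --scale-factor ${scale_factor} --format ${format} --output-dir ${out_dir}"
        return 0
    fi

    # Fail early with a readable message if the tool is not installed.
    if ! command -v tpchgen-cli >/dev/null 2>&1; then
        echo "tpchgen-cli not found on PATH" >&2
        return 1
    fi

    tpchgen-cli --scale-factor "${scale_factor}" --format "${format}" --output-dir "${out_dir}"
}
```

Setting `DRY_RUN=1` and calling `generate_tpch 1 parquet ./data` prints the command that would be run, which is handy for checking the arguments before generating a large scale factor.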

alamb · 2025-12-01T20:49:05Z

benchmarks/src/tpch/convert.rs

-use parquet::file::properties::WriterProperties;
-use structopt::StructOpt;
-
-/// Convert tpch .slt files to .parquet or .csv files


This is no longer needed as the data is created directly in the format of interest (tbl, csv, or parquet) rather than converted from tbl

alamb · 2025-12-01T20:52:44Z

benchmarks/bench.sh


    FORMAT=$2
-    debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format ${FORMAT} -o "${RESULTS_FILE}" ${QUERY_ARG}
+    debug_run $CARGO_COMMAND --bin dfbench -- tpch --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format ${FORMAT} -o "${RESULTS_FILE}" ${QUERY_ARG}


this updates bench.sh to use dfbench directly rather than the alternate tpch binary

comphead · 2025-12-01T20:56:58Z

> @clflushopt is working on the TPC-DS support, see this ticket
>
> [FEATURE] Extend to support TPC-DS data generation clflushopt/tpchgen-rs#51
>
> However, I am not sure how far away we are

Okie, I'm having a reference to pregenerated files in #18985 but once generator is done, we can implement our own data_tpcds()

alamb · 2025-12-01T21:15:07Z

benchmarks/bench.sh

                    ;;
                tpch)
-                    data_tpch "1"
+                    data_tpch "1" "parquet"


Per @comphead's suggestion, only the required format is now generated, rather than always creating all three formats
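
A rough sketch of how the dispatch might look once each benchmark name passes its own format. Only the `tpch` arm is taken from the diff above; the other arms, the wrapper name `data_for_benchmark`, and the stubbed `data_tpch` body are assumptions made for illustration.

```shell
# Stub standing in for the real data_tpch in bench.sh; it only reports
# what would be generated, using the message quoted in this review.
data_tpch() {
    scale_factor=$1
    format=$2
    echo "Creating tpch ${format} dataset at Scale Factor ${scale_factor}"
}

# Hypothetical wrapper: map a benchmark name to (scale factor, format).
data_for_benchmark() {
    case "$1" in
        tpch)
            data_tpch "1" "parquet"    # from the diff above
            ;;
        tpch10)
            data_tpch "10" "parquet"   # assumption
            ;;
        tpch_csv)
            data_tpch "1" "csv"        # assumption
            ;;
        tpch_csv10)
            data_tpch "10" "csv"       # assumption
            ;;
        *)
            echo "Unknown benchmark: $1" >&2
            return 1
            ;;
    esac
}
```

Passing the format at every call site keeps each benchmark from paying for files it never reads.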

alamb · 2025-12-01T21:18:20Z

run benchmark tpch10

alamb · 2025-12-01T21:18:37Z

(I am playing around with my new test script)

alamb · 2025-12-01T21:20:29Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/tpchgen_cli (907bce3) to da36ad8 diff using: tpch10
Results will be posted here when complete

alamb · 2025-12-01T21:31:16Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/tpchgen_cli (907bce3) to da36ad8 diff using: tpch10
Results will be posted here when complete

comphead

Thanks @alamb

martin-g · 2025-12-02T06:50:47Z

benchmarks/bench.sh

        echo "Internal error: Scale factor not specified"
        exit 1
    fi
+    FORMAT=$2


Suggested change

-    FORMAT=$2
+    FORMAT=${2:-parquet}

As I understand, that would default the argument to parquet -- what is the rationale for doing so?

See #19035 (comment)
There are two calls of data_tpch there which do not pass the format.

https://github.com/apache/datafusion/pull/19035/files/907bce3e16352148eade3b7cf512091a9aab4232#diff-1769f5787dc11c8b1f1b48288cdf3c89d25a5b5cbc6be4740bfcc70a6313ba99R550 will print Creating tpch <EMPTY> dataset at Scale Factor, where <EMPTY> is an empty string.

And the third reason why I proposed parquet as default is:

Also @comphead pointed out on https://github.com/apache/datafusion/pull/19034#pullrequestreview-3526952491 that the bench.sh data tpch generated both csv and parquet files when it only really needs parquet.

This sounds like parquet is the needed format most of the time.

But data_h2o() uses CSV as a default format:
https://github.com/alamb/datafusion/blob/907bce3e16352148eade3b7cf512091a9aab4232/benchmarks/bench.sh#L853

Ah, I see what you are saying now - there are several other calls to data_tpch that don't pass the format argument here

https://github.com/alamb/datafusion/blob/21a0237a1b96cf42bd96e93e6fc184a8e320138f/benchmarks/bench.sh#L298-L308

e.g.

    sort_tpch)
        # same data as for tpch
        data_tpch "1"
        ;;
    sort_tpch10)
        # same data as for tpch10
        data_tpch "10"
        ;;
    topk_tpch)
        # same data as for tpch
        data_tpch "1"
        ;;
    nlj)

I will update those calls to explicitly pass in a format as I think that will make it clear what is going on

in 9a5a3b0

👍🏻
Just FYI: There is a shorter syntax for this check too: FORMAT=${1:?Internal error: Format not specified}
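
The two parameter-expansion forms discussed in this thread behave differently when the argument is missing. The sketch below assumes the format is the second positional argument, as in the `FORMAT=$2` line of the diff; the function names are hypothetical.

```shell
# ${2:-parquet}: fall back to "parquet" when the argument is unset or empty.
format_with_default() {
    FORMAT=${2:-parquet}
    echo "${FORMAT}"
}

# ${2:?message}: write the message to stderr and abort a non-interactive
# shell when the argument is unset or empty.
format_required() {
    FORMAT=${2:?Internal error: Format not specified}
    echo "${FORMAT}"
}

format_with_default 1        # prints "parquet"
format_with_default 1 csv    # prints "csv"
format_required 1 csv        # prints "csv"

# Run the failing case in a subshell so the abort does not end this script.
if ! (format_required 1) 2>/dev/null; then
    echo "missing format rejected"
fi
```

The default form hides a forgotten argument, while the `:?` form turns it into an immediate failure, which matches the choice made in this PR to pass the format explicitly at every call site.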

Omega359 · 2025-12-04T03:34:22Z

I could have used this today ... I had to install docker just to generate the tpch data.

alamb · 2025-12-04T15:15:54Z

Thanks @martin-g @Omega359 and @comphead

alamb marked this pull request as draft December 1, 2025 20:27

alamb force-pushed the alamb/tpchgen_cli branch from 482901a to 03fd2f2 Compare December 1, 2025 20:50

alamb mentioned this pull request Dec 1, 2025

Fix data for tpch_csv and tpch_csv10 #19034

Merged

alamb added 2 commits December 1, 2025 16:09

Use tpchgen-cli to create benchmark data

e0a9157

Remove uneeded benchmarking code

907bce3

alamb force-pushed the alamb/tpchgen_cli branch from 35ffdb4 to 907bce3 Compare December 1, 2025 21:10

alamb marked this pull request as ready for review December 1, 2025 21:15

comphead approved these changes Dec 1, 2025

martin-g reviewed Dec 2, 2025

alamb and others added 3 commits December 2, 2025 14:55

Apply suggestions from code review

21a0237

Co-authored-by: Martin Grigorov <[email protected]>

Merge remote-tracking branch 'apache/main' into alamb/tpchgen_cli

1790863

Specify tpch data for all formats, add error checking

9a5a3b0

alamb added this pull request to the merge queue Dec 4, 2025

Merged via the queue into apache:main with commit f22a3f3 Dec 4, 2025
27 checks passed

alamb deleted the alamb/tpchgen_cli branch December 4, 2025 15:33

alamb self-assigned this Dec 4, 2025
