Use tpchgen-cli to generate tpch data in bench.sh
#19035
@alamb so it can still only generate TPCH data? For TPCDS I would currently be relying on pregenerated data.
Yes, it only supports TPCH data for now. @clflushopt is working on TPCDS support (see this ticket). However, I am not sure how far away we are.
| cargo run --release --features "mimalloc" --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096 | ||
| ``` | ||
| The benchmark program also supports CSV and Parquet input file formats and a utility is provided to convert from `tbl` |
The data is now generated directly using tpchgen-cli, so there is no need to convert it in another pass.
482901a to
03fd2f2
Compare
benchmarks/bench.sh
Outdated
| echo " creating tbl files with tpch_dbgen..." | ||
| docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s "${SCALE_FACTOR}" | ||
| echo " creating tbl files with tpchgen-cli..." | ||
| tpchgen-cli --scale-factor "${SCALE_FACTOR}" --format tbl --output-dir "${TPCH_DIR}" |
Using the old docker command takes tens of minutes on my laptop. Using tpchgen-cli takes tens of seconds.
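For reference, a minimal sketch of how the single-format invocation could be wrapped. The flags are the ones shown in the diff above; the helper name `tpchgen_command`, the format check, and the `./data/tpch_sf1` path are hypothetical and not taken from bench.sh. The function only prints the command, so it can be tried without tpchgen-cli installed.

```bash
#!/usr/bin/env bash
# Sketch only: build the tpchgen-cli command line for one output format.
# tpchgen_command is a hypothetical helper, not a function in bench.sh.
tpchgen_command() {
    local scale_factor=$1
    local format=$2
    local output_dir=$3

    # Accept only the three formats mentioned in this PR.
    case "${format}" in
        tbl|csv|parquet) ;;
        *)
            echo "unsupported format: ${format}" >&2
            return 1
            ;;
    esac

    # Print the command instead of running it.
    printf 'tpchgen-cli --scale-factor %s --format %s --output-dir %s\n' \
        "${scale_factor}" "${format}" "${output_dir}"
}

# Example: the command for scale factor 1 in parquet format.
tpchgen_command 1 parquet ./data/tpch_sf1
```

Piping the printed line to a shell, or replacing `printf` with the real call, would be the next step once the format check is in place.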
| use parquet::file::properties::WriterProperties; | ||
| use structopt::StructOpt; | ||
| /// Convert tpch .slt files to .parquet or .csv files |
This is no longer needed as the data is created directly in the format of interest (tbl, csv, or parquet) rather than converted from tbl
| FORMAT=$2 | ||
| debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format ${FORMAT} -o "${RESULTS_FILE}" ${QUERY_ARG} | ||
| debug_run $CARGO_COMMAND --bin dfbench -- tpch --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format ${FORMAT} -o "${RESULTS_FILE}" ${QUERY_ARG} |
This updates bench.sh to use dfbench directly rather than the alternate tpch binary.
Okie, I have a reference to pregenerated files in #18985, but once the generator is done, we can implement our own.
35ffdb4 to
907bce3
Compare
| ;; | ||
| tpch) | ||
| data_tpch "1" | ||
| data_tpch "1" "parquet" |
Per @comphead's suggestion, only the required format is now generated, rather than always creating all three formats.
|
run benchmark tpch10 |
|
(I am playing around with my new test script) |
|
🤖 |
comphead
left a comment
Thanks @alamb
| echo "Internal error: Scale factor not specified" | ||
| exit 1 | ||
| fi | ||
| FORMAT=$2 |
| FORMAT=$2 | |
| FORMAT=${2:-parquet} |
As I understand, that would default the argument to parquet -- what is the rationale for doing so?
See #19035 (comment)
There are two calls of data_tpch there which do not pass the format.
https://github.com/apache/datafusion/pull/19035/files/907bce3e16352148eade3b7cf512091a9aab4232#diff-1769f5787dc11c8b1f1b48288cdf3c89d25a5b5cbc6be4740bfcc70a6313ba99R550 will print Creating tpch <EMPTY> dataset at Scale Factor, where <EMPTY> is an empty string.
And the third reason why I proposed parquet as default is:
Also @comphead pointed out on https://github.com/apache/datafusion/pull/19034#pullrequestreview-3526952491 that the bench.sh data tpch generated both csv and parquet files when it only really needs parquet.
This sounds like parquet is the needed format most of the time.
But data_h2o() uses CSV as a default format:
https://github.com/alamb/datafusion/blob/907bce3e16352148eade3b7cf512091a9aab4232/benchmarks/bench.sh#L853
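To illustrate the suggestion, here is a small sketch of how `${2:-parquet}` behaves when a caller omits the format. The function `data_tpch_demo` is a hypothetical stand-in for `data_tpch`; only the message text is taken from the log line quoted above.

```bash
#!/usr/bin/env bash
# Sketch only: data_tpch_demo is a hypothetical stand-in for data_tpch.
data_tpch_demo() {
    local scale_factor=$1
    # Use the second argument if it is set and non-empty, else "parquet".
    local format=${2:-parquet}
    echo "Creating tpch ${format} dataset at Scale Factor ${scale_factor}"
}

data_tpch_demo 1        # format omitted, falls back to parquet
data_tpch_demo 10 csv   # format passed explicitly
```

With a plain `FORMAT=$2`, the first call would print an empty string where the format belongs, which is the `<EMPTY>` case described above.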
Ah, I see what you are saying now - there are several other calls to data_tpch that don't pass the format argument here
eg
sort_tpch)
# same data as for tpch
data_tpch "1"
;;
sort_tpch10)
# same data as for tpch10
data_tpch "10"
;;
topk_tpch)
# same data as for tpch
data_tpch "1"
;;
nlj)
I will update those calls to explicitly pass in a format, as I think that will make it clear what is going on.
in 9a5a3b0
👍🏻
Just FYI: There is a shorter syntax for this check too: FORMAT=${2:?Internal error: Format not specified}
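A minimal sketch of the `${parameter:?message}` form, assuming the format is the second positional argument as in the diff. The function name `require_format_demo` is hypothetical. A failed `:?` expansion exits a non-interactive shell, so the body runs in a subshell to keep the failure contained.

```bash
#!/usr/bin/env bash
# Sketch only: require_format_demo is a hypothetical helper.
# The ( ... ) body runs in a subshell, so a failed :? expansion
# ends only the subshell and not the calling script.
require_format_demo() (
    format=${2:?Internal error: Format not specified}
    echo "format=${format}"
)

require_format_demo 1 parquet
require_format_demo 1 || echo "missing format rejected"
```

The second call writes the error message to stderr and returns a non-zero status, which replaces the explicit `if` / `echo` / `exit 1` block.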
|
I could have used this today ... I had to install docker just to generate the tpch data. |
Which issue does this PR close?
Rationale for this change
tpchgen-cli is 10x faster than dbgen for generating tpch data (see blog here)
Thus let's use that to generate tpch data for our benchmarks, rather than ancient docker / tpchgen
While I was testing this locally I also found a bunch of unnecessary code
Also @comphead pointed out on #19034 (review) that the bench.sh data tpch generated both csv and parquet files when it only really needs parquet.
What changes are included in this PR?
... tbl anymore (tpchgen-cli can make csv and parquet files directly)
... tpch binary shim
Are these changes tested?
I tested them manually using
Are there any user-facing changes?
No, this is internal development code