[FEA] Support "dataframe.query-planning" config in dask.dataframe #15027
Description
PSA
To unblock CI failures related to the dask-expr migration, downstream RAPIDS libraries can set the following environment variable in CI (before dask.dataframe/dask_cudf is ever imported):
export DASK_DATAFRAME__QUERY_PLANNING=False
If you do this, please comment on the change and link it to this meta issue, so that I can make the necessary changes/fixes and turn query-planning back on.
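For projects that configure CI from Python (for example, in a top-level conftest.py) rather than from a shell script, the same setting can be sketched as follows. This is only a sketch: the helper names `dask_env_var` and `disable_query_planning` are hypothetical, and the mapping assumes Dask's usual convention for configuration environment variables (a `DASK_` prefix, double underscores for nesting, upper case). It must still run before dask.dataframe or dask_cudf is imported.

```python
import os


def dask_env_var(key: str) -> str:
    """Map a Dask config key to its environment-variable name.

    Hypothetical helper: assumes the "DASK_" prefix, "__" for nested
    keys, and "_" in place of "-".
    """
    return "DASK_" + key.replace(".", "__").replace("-", "_").upper()


def disable_query_planning(environ=os.environ) -> str:
    """Turn query-planning off via the environment; return the variable name.

    Call this before dask.dataframe / dask_cudf is ever imported.
    """
    name = dask_env_var("dataframe.query-planning")
    environ[name] = "False"
    return name


if __name__ == "__main__":
    # Link the change to this meta issue (#15027) wherever it is applied.
    print(disable_query_planning())
```

Passing a plain dict as `environ` makes it possible to check the variable name without touching the real process environment.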
Background
The 2024.2.0 release of Dask has deprecated the "legacy" dask.dataframe API. Given that dask-cudf (and much of RAPIDS) is tightly integrated with dask.dataframe, it is critical that dask_cudf be updated to use the new dask_expr backend smoothly.
Most of the heavy lifting is already being done in #14805. However, there will also be some follow-up work to expand coverage/examples/documentation/benchmarks. We will also need to update dask-cuda/explicit-comms.
Action Items
Basics (to be covered by #14805):
- Add dask-expr DataFrameBackendEntrypoint entrypoint for "cudf"
- Align top-level dask_cudf imports with dask.dataframe for "dataframe.query-planning" support
Expected Follow-up:
- Add read_json support (Enable dask_cudf json and s3 tests with query-planning on #15408)
- Add read_orc support (Support orc and text IO with dask-expr using legacy conversion #15439)
- read_parquet should always return DataFrame (not currently the case in dask-expr if columns=<str>)
- Remove outdated check_file_size functionality from dask_cudf.read_parquet
- Add s3 testing/support (Enable dask_cudf json and s3 tests with query-planning on #15408)
- Add read_text support (Support orc and text IO with dask-expr using legacy conversion #15439)
- Fix unexplained test failures for categorical accessors (Fix categorical-accessor support and testing in dask-cudf #15591)
- Deprecate to_dask_dataframe API in favor of to_backend (Deprecate to/from_dask_dataframe APIs in dask-cudf #15592)
- Deprecate set_index(..., divisions="quantile") (Deprecate divisions='quantile' support in set_index #15804)
- Add describe support (seems to be working now? Just need to remove xfail markers)
- Add groupby "collect" support (Add "collect" aggregation support to dask-cudf #15593)
- (Maybe?) add as_index support to groupby
- Fix get_dummy support (Generalize get_dummies dask/dask-expr#1053)
- Fix sorting by categorical columns (Fix maxima of categorical column #15701; Related Issues: [BUG] Cannot find maxima of a categorical series #15641 & Sorting by a categorical column doesn't always work dask/dask#11090 & Add temporary dask-cudf workaround for categorical sorting #15801)
- Fix sorting with nulls (Enable sorting on column with nulls using query-planning #15639)
- leftanti merge support (Likely an error message in 24.06 and support in 24.08+)
- to_datetime support (Add cudf support to to_datetime and _maybe_from_pandas dask/dask-expr#1035)
- Add melt support (Add support for DataFrame.melt dask/dask-expr#1049 & Add melt support when query-planning is enabled dask/dask#11088)
cuDF / Dask cuDF doc build:
- Revert "Disable dask-expr in docs builds." #15343 (see: Patch dask-expr var logic in dask-cudf #15347)
cuML support:
cuxfilter support:
cugraph support:
Dask CUDA:
- Explicit comms support (Support "dataframe.query-planning" config in dask.dataframe dask-cuda#1311)
Dask SQL:
- Migrate predicate pushdown to dask-expr
NeMo Curator:
- Migrate custom-graph code and test against latest dask/cudf
Merlin:
- Port Merlin/NVTabular (Heavy lift - Aiming for 24.08)