[FEA] Support "dataframe.query-planning" config in dask.dataframe #15027

@rjzamora

PSA

To unblock CI when it fails because of the dask-expr migration, downstream RAPIDS libraries can set the following environment variable in CI (before dask.dataframe/dask_cudf is ever imported):

export DASK_DATAFRAME__QUERY_PLANNING=False

If you do this, please be sure to comment on the change and link it to this meta issue, so that I can make the necessary changes/fixes and turn query-planning back on.
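
For projects that would rather set this from Python (for example at the top of a test `conftest.py`), here is a minimal sketch of the same workaround. The helper name `disable_query_planning` is hypothetical and not part of Dask or RAPIDS. It assumes that Dask reads `DASK_*` environment variables when `dask` is first imported, so the variable only takes effect if it is set before that import happens in the process.

```python
import os
import sys
import warnings


def disable_query_planning() -> bool:
    """Hypothetical helper: opt out of dask-expr query planning via the environment.

    Returns True if the variable was set before ``dask`` was imported,
    and False if ``dask`` was already loaded (in which case the setting
    may be ignored by the running process).
    """
    # Assumption: Dask collects DASK_* variables at first import of ``dask``.
    set_in_time = "dask" not in sys.modules
    if not set_in_time:
        warnings.warn(
            "dask is already imported; DASK_DATAFRAME__QUERY_PLANNING "
            "may not be picked up by this process."
        )
    # Same setting as `export DASK_DATAFRAME__QUERY_PLANNING=False`.
    os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"
    return set_in_time


# Call this before any `import dask.dataframe` or `import dask_cudf`.
disable_query_planning()
```

Setting the variable in the CI environment itself, as shown above, is still the simpler option, since it does not depend on import order inside the test suite.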


Background

The 2024.2.0 release of Dask has deprecated the "legacy" dask.dataframe API. Given that dask-cudf (and much of RAPIDS) is tightly integrated with dask.dataframe, it is critical that dask_cudf be updated to use the new dask_expr backend smoothly.

Most of the heavy lifting is already being done in #14805. However, there will also be some follow-up work to expand coverage/examples/documentation/benchmarks. We will also need to update dask-cuda/explicit-comms.

Action Items

Basics (to be covered by #14805):

  • Add a dask-expr DataFrameBackendEntrypoint for "cudf"
  • Align top-level dask_cudf imports with dask.dataframe for "dataframe.query-planning" support

Expected Follow-up:

cuDF / Dask cuDF doc build:

cuML support:

cuxfilter support:

cugraph support:

Dask CUDA:

Dask SQL:

  • Migrate predicate pushdown to dask-expr

NeMo Curator:

  • Migrate custom-graph code and test against latest dask/cudf

Merlin:

  • Port Merlin/NVTabular (Heavy lift - Aiming for 24.08)
