Move chars column to parent data buffer in strings column #14202
rapids-bot[bot] merged 73 commits into rapidsai:branch-24.02
@galipremsagar failing pytests. Update: these issues are fixed.
wence-
left a comment
Approving python changes with (non-blocking) suggestion to introduce a single type definition for the char type.
    build_column(
        data=as_buffer(
            rmm.DeviceBuffer(
                size=row_count * cudf.dtype("int8").itemsize
In general this condition can also be true for (at least) List and Struct columns. But, those are handled by specific cases above.
From a quick test, I think we don't need an empty chars buffer, so can you try using rmm.DeviceBuffer(0)?
          self.offset + self.size
      )
    - char_dtype_size = self.base_children[1].dtype.itemsize
    + char_dtype_size = cudf.api.types.dtype("int8").itemsize
Ah, ok. Can we introduce (like cudf._lib.types.size_type_dtype) a single source of truth for the type of the string char buffer, perhaps cudf._lib.types.char_type_dtype?
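The suggestion above can be sketched in plain Python without cudf. This is a minimal illustration under stated assumptions: `CHAR_TYPECODE`, `CHAR_ITEMSIZE`, and `chars_nbytes` are hypothetical names, and the byte-count calculation is only a guess at the general shape of the code under review, not the actual cudf implementation.

```python
from array import array

# Hypothetical single source of truth for the string char type.
# "b" is the stdlib typecode for a signed 8-bit integer (int8).
CHAR_TYPECODE = "b"
CHAR_ITEMSIZE = array(CHAR_TYPECODE).itemsize


def chars_nbytes(offsets, start, size):
    """Bytes of character data spanned by rows [start, start + size)."""
    # Every caller uses the shared constant instead of repeating "int8".
    return (offsets[start + size] - offsets[start]) * CHAR_ITEMSIZE


# Offsets for the rows "ab", "", and a 6-byte string.
offsets = [0, 2, 2, 8]
total = chars_nbytes(offsets, 0, 3)
```

With one definition, changing the char type later would touch a single line rather than every call site that spells out the dtype.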
We'll probably want to update the developer's guide once this is merged as well.
Co-authored-by: David Wendt <[email protected]>
mroeschke
left a comment
Optional comment otherwise LGTM
          self.offset + self.size
      )
    - char_dtype_size = self.base_children[1].dtype.itemsize
    + char_dtype_size = cudf.api.types.dtype("int8").itemsize
I think it would be worth adding a # TODO comment noting that int8 is a workaround
/merge
Fixes deprecation warnings introduced when #14202 merged. Most of these are for calls to `cudf::make_strings_column`, whose chars-column function overload was deprecated.
Authors:
- David Wendt (https://github.com/davidwendt)
Approvers:
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)
URL: #14771
Removes the functions deprecated in 24.02 in #14202.
Authors:
- David Wendt (https://github.com/davidwendt)
Approvers:
- Bradley Dice (https://github.com/bdice)
- Yunsong Wang (https://github.com/PointKernel)
- Vyas Ramasubramani (https://github.com/vyasr)
URL: #14848
This PR contains a number of different fixes currently required to get cugraph tests passing:
- There are two main changes for pandas 2 compatibility:
  - [pandas renamed `DataFrame.applymap` to `DataFrame.map`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html) so creating the renumbering map with a column `map` caused problems for attribute-based column access `renumber_map.map`. Those columns are now renamed to `renumber_map`.
  - Empty columns now default to str rather than float, so tests that assumed we could access the values as cupy arrays failed because cudf's string columns cannot be converted to cupy arrays. These columns are now always cast to float in the tests before the cupy conversion.
- cugraph-dgl and cugraph-pyg's wheel builds were not downloading the latest cugraph/pylibcugraph wheels to run tests. As a result, the above pandas 2 fixes did not take effect when running the dgl and pyg tests. I updated the wheel building scripts to account for this discrepancy.
- rapidsai/cudf#14202 made a breaking change to how characters are encoded in strings columns in cudf, which broke cugraph_etl. This PR fixes the code that depended on the old APIs.
This code also includes a small patch to the cugraph_etl CMake so that it exports the correct package name (previously it was using cugraph).
Authors:
- Vyas Ramasubramani (https://github.com/vyasr)
Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Bradley Dice (https://github.com/bdice)
- Chuck Hastings (https://github.com/ChuckHastings)
- Rick Ratzel (https://github.com/rlratzel)
- Jake Awe (https://github.com/AyodeAwe)
URL: rapidsai/cugraph#4144
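The `map` naming conflict described in the commit message above can be reproduced with a toy class. This sketch does not use pandas or cudf; `ToyFrame` is a hypothetical stand-in that only mimics attribute-based column access, to show why a column called `map` becomes unreachable as an attribute once the class has a method of the same name.

```python
class ToyFrame:
    """Toy stand-in for a dataframe with attribute-based column access."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def map(self, func):
        """A real method named `map`, applied elementwise to every column."""
        return ToyFrame(
            {name: [func(v) for v in vals] for name, vals in self._columns.items()}
        )

    def __getattr__(self, name):
        # Python calls this only when normal lookup fails,
        # so a method always wins over a column with the same name.
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name) from None

    def __getitem__(self, name):
        return self._columns[name]


df = ToyFrame({"map": [10, 11], "renumber_map": [10, 11]})
shadowed = df.map            # the bound method, not the column
by_key = df["map"]           # item access still reaches the column
renamed = df.renumber_map    # a non-conflicting name works as an attribute
```

Renaming the column, as the commit does, avoids the conflict without changing how the rest of the code accesses it.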
Description
Eliminates chars column and moves chars data to parent string column's _data buffer.
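The layout described here can be sketched in plain Python. This is an illustration only, assuming a strings column reduces to an offsets array plus one contiguous UTF-8 byte buffer held by the column itself; `build_strings_column` and `get_row` are hypothetical helpers, not libcudf or cudf APIs, and null handling is omitted.

```python
from array import array


def build_strings_column(strings):
    """Pack strings into (offsets, data): row offsets plus one UTF-8 byte buffer."""
    offsets = array("i", [0])  # one more entry than there are rows
    data = bytearray()         # stands in for the parent column's data buffer
    for s in strings:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))
    return offsets, bytes(data)


def get_row(offsets, data, i):
    """Read row i straight from the parent buffer; no child chars column is involved."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


offsets, data = build_strings_column(["ab", "", "héllo"])
last = get_row(offsets, data, 2)
```

Offsets count bytes rather than characters, so a row containing multi-byte UTF-8 characters spans more bytes than its character length.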
Summary of changes
- […] `chars_size()`, `chars_end()` in `strings_column_view` and their invocations
- […] `chars_column_index`, and deprecate `chars()` from `strings_column_view`
- […] `chars_col.begin<char>()` with `static_cast<char*>(parent.head())`
- […] `rmm::device_buffer` instead of chars column
- ([…] `stream` parameter to `chars_size`)

Preparing for #13733