Skip to content

Move chars column to parent data buffer in strings column#14202

Merged
rapids-bot[bot] merged 73 commits intorapidsai:branch-24.02from
karthikeyann:fea-char_limit_experiment
Jan 17, 2024
Merged

Move chars column to parent data buffer in strings column#14202
rapids-bot[bot] merged 73 commits intorapidsai:branch-24.02from
karthikeyann:fea-char_limit_experiment

Conversation

@karthikeyann
Copy link
Copy Markdown
Contributor

@karthikeyann karthikeyann commented Sep 26, 2023

Description

Eliminates chars column and moves chars data to parent string column's _data buffer.

Summary of changes

  • chars child column is removed, chars buffer is added to parent column
  • Adds stream to chars_size(), chars_end() in strings_column_view and their invocations
  • Remove chars_column_index, and deprecate chars() from strings_column_view
  • Replace chars_col.begin<char>() with static_cast<char*>(parent.head())
  • Adds string column factory which accepts rmm::device_buffer instead of chars column
  • Deprecate string column factory which accepts chars column
  • IO changes - contiguous split (From @nvdbaranec ), to_arrow, parquet writer.
  • Fix binary ops, column_view, interleave columns, byte cast, strings APIs, text APIs
  • Fix tests, benchmarks (mostly adding stream parameter to chars_size)
  • Java fixes (From @andygrove)
  • Python changes
    • .data special case for string column
    • get size from offsets column for rmm.DeviceBuffer in column
    • special condition for string slice
    • Pickle file update for string column
    • a few unit tests updates

Preparing for #13733

@karthikeyann karthikeyann added feature request New feature or request 2 - In Progress Currently a work in progress DO NOT MERGE Hold off on merging; see PR for details breaking Breaking change labels Sep 26, 2023
@karthikeyann karthikeyann self-assigned this Sep 26, 2023
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Java Affects Java cuDF API. labels Sep 26, 2023
@github-actions github-actions bot added the Python Affects Python cuDF API. label Sep 29, 2023
@karthikeyann karthikeyann changed the base branch from branch-23.10 to branch-23.12 October 3, 2023 09:07
@karthikeyann
Copy link
Copy Markdown
Contributor Author

karthikeyann commented Nov 2, 2023

@galipremsagar failing pytests
FAILED test_pickling.py::test_pickle_string_column[slices1] - AssertionError: Series are different
FAILED test_pickling.py::test_pickle_string_column[slices3] - AssertionError: Series are different
FAILED test_serialize.py::test_deserialize_cudf_0_16 - RuntimeError: CUDF failure at: /home/knataraj/dev/rapids/cudf/cpp/src/column/column_view.cpp:57: Null data pointer.
FAILED test_testing.py::test_assert_column_memory_basic_same[arrow_arrays1] - ValueError: Buffer size must be divisible by element size
FAILED test_orc.py::test_writer_timestamp_stream_size - AssertionError: numpy array are different
FAILED test_orc.py::test_orc_reader_negative_timestamp[pyarrow] - AssertionError: numpy array are different
FAILED test_orc.py::test_orc_writer_negative_timestamp - AssertionError: numpy array are different

Update: these issues are fixed.

@karthikeyann karthikeyann requested a review from wence- January 9, 2024 22:46
Copy link
Copy Markdown
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving python changes with (non-blocking) suggestion to introduce a single type definition for the char type.

build_column(
data=as_buffer(
rmm.DeviceBuffer(
size=row_count * cudf.dtype("int8").itemsize
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this condition can also be true for (at least) List and Struct columns. But, those are handled by specific cases above.

From a quick test, I think we don't need an empty chars buffer, so can you try using rmm.DeviceBuffer(0)?

self.offset + self.size
)
char_dtype_size = self.base_children[1].dtype.itemsize
char_dtype_size = cudf.api.types.dtype("int8").itemsize
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. Can we introduce (like cudf._lib.types.size_type_dtype) a single source of truth for the type of the string char buffer, perhaps cudf._lib.types.char_type_dtype?

@davidwendt
Copy link
Copy Markdown
Contributor

We'll probably want to update the developer's guide once this is merged as well
https://github.com/rapidsai/cudf/blob/branch-24.02/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#strings-columns

Copy link
Copy Markdown
Contributor

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional comment otherwise LGTM

self.offset + self.size
)
char_dtype_size = self.base_children[1].dtype.itemsize
char_dtype_size = cudf.api.types.dtype("int8").itemsize
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be worth adding a # TODO comment noting that int8 is a workaround

@karthikeyann
Copy link
Copy Markdown
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit c7acdaa into rapidsai:branch-24.02 Jan 17, 2024
rapids-bot bot pushed a commit that referenced this pull request Jan 19, 2024
Fixes deprecation warnings introduced when #14202 merged.
Most of these are for calls to `cudf::make_strings_column` which deprecated the chars-column function overload.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #14771
rapids-bot bot pushed a commit that referenced this pull request Feb 3, 2024
rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this pull request Feb 6, 2024
This PR contains a number of different fixes currently required to get cugraph tests passing:
- There are two main changes for pandas 2 compatibility:
    - [pandas renamed `DataFrame.applymap` to `DataFrame.map`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html) so creating the renumbering map with a column `map` caused problems for attribute-based column access `renumber_map.map`. Those columns are now renamed to `renumber_map`.
    - Empty columns now default to str rather than float, so tests that assumed we could access the values as cupy arrays failed because cudf's string columns cannot be converted to cupy arrays. These columns are now always cast to float in the tests before the cupy conversion.
- cugraph-dgl and cugraph-pyg's wheel builds were not downloading the latest cugraph/pylibcugraph wheels to run tests. As a result, the above pandas 2 fixes didn't take when running the dgl and pyg tests. I updated the wheel building scripts to account for this discrepancy.
- rapidsai/cudf#14202 made a breaking change to how characters are encoded in strings columns in cudf, which broke cugraph_etl. This PR fixes the code that depended on the old APIs.

This code also includes a small patch to the cugraph_etl CMake so that it exports the correct package name (previously it was using cugraph).

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Bradley Dice (https://github.com/bdice)
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Rick Ratzel (https://github.com/rlratzel)
  - Jake Awe (https://github.com/AyodeAwe)

URL: #4144
nv-rliu pushed a commit to nv-rliu/cugraph-gnn that referenced this pull request Jul 15, 2025
This PR contains a number of different fixes currently required to get cugraph tests passing:
- There are two main changes for pandas 2 compatibility:
    - [pandas renamed `DataFrame.applymap` to `DataFrame.map`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html) so creating the renumbering map with a column `map` caused problems for attribute-based column access `renumber_map.map`. Those columns are now renamed to `renumber_map`.
    - Empty columns now default to str rather than float, so tests that assumed we could access the values as cupy arrays failed because cudf's string columns cannot be converted to cupy arrays. These columns are now always cast to float in the tests before the cupy conversion.
- cugraph-dgl and cugraph-pyg's wheel builds were not downloading the latest cugraph/pylibcugraph wheels to run tests. As a result, the above pandas 2 fixes didn't take when running the dgl and pyg tests. I updated the wheel building scripts to account for this discrepancy.
- rapidsai/cudf#14202 made a breaking change to how characters are encoded in strings columns in cudf, which broke cugraph_etl. This PR fixes the code that depended on the old APIs.

This code also includes a small patch to the cugraph_etl CMake so that it exports the correct package name (previously it was using cugraph).

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Bradley Dice (https://github.com/bdice)
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Rick Ratzel (https://github.com/rlratzel)
  - Jake Awe (https://github.com/AyodeAwe)

URL: rapidsai/cugraph#4144
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team breaking Breaking change feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants