@trxcllnt
Contributor

@trxcllnt trxcllnt commented Nov 4, 2025

Description

RAPIDS has deployed an autoscaling cloud build cluster that can be used to accelerate building large RAPIDS projects.

This PR updates the conda and wheel builds to use the build cluster.

This contributes to rapidsai/build-planning#228.

@trxcllnt trxcllnt requested review from a team as code owners November 4, 2025 18:35
@trxcllnt trxcllnt requested a review from msarahan November 4, 2025 18:35
@trxcllnt trxcllnt added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Nov 4, 2025
@trxcllnt trxcllnt added the DO NOT MERGE Hold off on merging; see PR for details label Nov 5, 2025
SCCACHE_S3_USE_SSL: ${{ env.get("SCCACHE_S3_USE_SSL") }}
SCCACHE_S3_NO_CREDENTIALS: ${{ env.get("SCCACHE_S3_NO_CREDENTIALS") }}
SCCACHE_S3_KEY_PREFIX: libucxx/${{ env.get("RAPIDS_CONDA_ARCH") }}/cuda${{ cuda_major }}
NVCC_APPEND_FLAGS: ${{ env.get("NVCC_APPEND_FLAGS", default="") }}
Member

Why are we adding NVCC flags here? UCXX shouldn't need NVCC to compile.

Contributor Author

@trxcllnt trxcllnt Nov 15, 2025

I made the conda recipe envvar updates consistent across all the PRs, mostly because I wasn't sure which recipes use which compilers.

When using the build cluster, the rapids-configure-sccache script from gha-tools sets PARALLEL_LEVEL=<a_very_large_number> and NVCC_APPEND_FLAGS=-t=100 to maximize total parallelism.

Since it's just an envvar, it doesn't hurt to include it here, and if UCXX ever adds nvcc targets, the recipe won't need another update.
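
To illustrate why the passthrough is harmless, here is a minimal Python sketch of the lookup-with-default behavior that `env.get("NVCC_APPEND_FLAGS", default="")` expresses in the recipe. The helper name `recipe_env_get` and the `PARALLEL_LEVEL` value are hypothetical, and this only models the lookup; it is not how the recipe renderer is implemented.

```python
def recipe_env_get(environ, name, default=""):
    """Hypothetical stand-in for the recipe's env.get(name, default=...) lookup."""
    # Return the variable's value if it is set, otherwise the default.
    return environ.get(name, default)


# Assumed environment on the build cluster: the source says
# rapids-configure-sccache sets NVCC_APPEND_FLAGS=-t=100 and a very large
# PARALLEL_LEVEL. The number 1024 is made up for illustration.
cluster_env = {"PARALLEL_LEVEL": "1024", "NVCC_APPEND_FLAGS": "-t=100"}

# Assumed environment for a build that never ran rapids-configure-sccache.
plain_env = {}

# On the cluster the flag is forwarded to the build unchanged.
print(repr(recipe_env_get(cluster_env, "NVCC_APPEND_FLAGS")))

# Elsewhere the variable resolves to an empty string, so nothing is appended.
print(repr(recipe_env_get(plain_env, "NVCC_APPEND_FLAGS")))
```

Only nvcc reads `NVCC_APPEND_FLAGS`, so a recipe with no nvcc targets never consumes the value either way.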

@trxcllnt trxcllnt force-pushed the fea/use-sccache-build-cluster branch from b7d772f to 9107611 Compare November 10, 2025 22:56
@trxcllnt trxcllnt changed the base branch from main to release/0.47 November 15, 2025 00:48
@trxcllnt trxcllnt removed the DO NOT MERGE Hold off on merging; see PR for details label Nov 15, 2025
bdice previously approved these changes Nov 20, 2025
@trxcllnt
Contributor Author

Does anyone know what's going on here? I wasn't seeing this test fail a few days ago, but now it happens consistently and AFAIK I didn't change anything that should cause this. Maybe a UCX update broke something?

@trxcllnt
Contributor Author

I think the issue is that cudf.datasets.timeseries() is producing bad data:

(base) root@831ed948ef31:/# pip install --extra-index-url https://pypi.anaconda.org/rapidsai-wheels-nightly/simple 'cudf-cu13==25.12.*,>=0.0.0a0'
(base) root@831ed948ef31:/# python
Python 3.13.9 | packaged by conda-forge | (main, Oct 22 2025, 23:33:35) [GCC 14.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cudf
>>> cudf.datasets.timeseries()
                       id  name     x     y
timestamp                                  
2000-01-01 00:00:00  <NA>  <NA>  <NA>  <NA>
2000-01-01 00:00:01  <NA>  <NA>  <NA>  <NA>
2000-01-01 00:00:02  <NA>  <NA>  <NA>  <NA>
2000-01-01 00:00:03  <NA>  <NA>  <NA>  <NA>
2000-01-01 00:00:04  <NA>  <NA>  <NA>  <NA>
...                   ...   ...   ...   ...
2000-01-30 23:59:56  <NA>  <NA>  <NA>  <NA>
2000-01-30 23:59:57  <NA>  <NA>  <NA>  <NA>
2000-01-30 23:59:58  <NA>  <NA>  <NA>  <NA>
2000-01-30 23:59:59  <NA>  <NA>  <NA>  <NA>
2000-01-31 00:00:00  <NA>  <NA>  <NA>  <NA>
[2592001 rows x 4 columns]

@trxcllnt
Contributor Author

Should be fixed by rapidsai/cudf@7d54b71

@trxcllnt trxcllnt merged commit 489736c into rapidsai:release/0.47 Nov 22, 2025
298 of 338 checks passed
@trxcllnt
Contributor Author

Merged after discussing with @vyasr, who said cudf is unlikely to have a fix today.

For context, this is the exact backtrace of the failing tests:

Traceback (most recent call last):
  File "/test.py", line 106, in <module>
    asyncio.run(test_send_recv_cudf(lambda cudf: cudf.datasets.timeseries()))
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.13/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "/opt/conda/lib/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/opt/conda/lib/python3.13/asyncio/base_events.py", line 725, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "/test.py", line 97, in test_send_recv_cudf
    assert_eq(res, msg)
    ~~~~~~~~~^^^^^^^^^^
  File "/opt/conda/lib/python3.13/site-packages/cudf/testing/testing.py", line 860, in assert_eq
    tm.assert_frame_equal(left, right, **kwargs)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.13/site-packages/pandas/_testing/asserters.py", line 1303, in assert_frame_equal
    assert_series_equal(
    ~~~~~~~~~~~~~~~~~~~^
        lcol,
        ^^^^^
    ...<12 lines>...
        check_flags=False,
        ^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/opt/conda/lib/python3.13/site-packages/pandas/_testing/asserters.py", line 986, in assert_series_equal
    assert lidx.freq == ridx.freq, (lidx.freq, ridx.freq)
           ^^^^^^^^^^^^^^^^^^^^^^
AssertionError: (None, <Second>)

@trxcllnt trxcllnt deleted the fea/use-sccache-build-cluster branch November 22, 2025 00:21
@trxcllnt
Contributor Author

rapidsai/cudf#20709 should fix the failing UCXX test

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

3 participants