Ensure MG to have the same number of allreduce calls in mean_stddev for sparse matrix to avoid hanging #6141

Merged
rapids-bot[bot] merged 4 commits into rapidsai:branch-24.12 from lijinf2:fix_hanging_mean_stddev
Dec 6, 2024
Contributor

@lijinf2 lijinf2 commented Nov 22, 2024

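The hang occurs when one GPU receives a sparse matrix whose values are all zero while the other GPUs receive non-zero values. In that case the GPUs make different numbers of allreduce calls in mean_stddev, so the collective never completes.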
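
The failure mode can be reproduced outside cuML with a toy model. The following is a minimal sketch, not the code in standardization.cuh: `ToyComm`, `allreduce_sum`, and `run` are hypothetical names, threads stand in for GPUs, and a timeout stands in for the hang. It assumes the buggy path skips the collective on a rank whose local data is empty, which is one way the number of allreduce calls can differ across ranks.

```python
import threading


class ToyComm:
    """Hypothetical stand-in for a collective communicator (not the RAFT/NCCL API)."""

    def __init__(self, n_ranks, timeout=0.5):
        self._barrier = threading.Barrier(n_ranks)
        self._lock = threading.Lock()
        self._acc = 0.0
        self._timeout = timeout  # a real collective would block forever instead

    def allreduce_sum(self, value):
        # Every rank contributes its local value.
        with self._lock:
            self._acc += value
        # Block until all ranks have contributed; raises if a rank never shows up.
        self._barrier.wait(self._timeout)
        total = self._acc
        # Wait until every rank has read the total, then let one rank reset it.
        if self._barrier.wait(self._timeout) == 0:
            self._acc = 0.0
        # Make the reset visible before any rank starts the next collective.
        self._barrier.wait(self._timeout)
        return total


def run(values_per_rank, skip_when_empty):
    """Run one allreduce per rank and report what each rank observed."""
    n_ranks = len(values_per_rank)
    comm = ToyComm(n_ranks)
    results = [None] * n_ranks

    def worker(rank):
        local = values_per_rank[rank]
        try:
            if skip_when_empty and not local:
                # Assumed buggy path: a rank with no non-zero values returns early.
                results[rank] = "skipped"
                return
            # Fixed path: every rank calls the collective, contributing 0 if empty.
            results[rank] = comm.allreduce_sum(sum(local))
        except threading.BrokenBarrierError:
            results[rank] = "hang"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    data = [[1.0, 2.0], []]  # rank 1 holds no non-zero values
    print(run(data, skip_when_empty=True))   # ['hang', 'skipped']
    print(run(data, skip_when_empty=False))  # [3.0, 3.0]
```

When the empty rank still joins the collective, every rank issues the same sequence of calls and the reduction completes with the correct global sum.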

@lijinf2 lijinf2 requested a review from a team as a code owner November 22, 2024 01:23
@lijinf2 lijinf2 requested review from dantegd and teju85 November 22, 2024 01:23
@lijinf2 lijinf2 added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change 2 - In Progress Currenty a work in progress labels Nov 22, 2024
@lijinf2 lijinf2 force-pushed the fix_hanging_mean_stddev branch from 6c56fa1 to 21be3fb Compare November 26, 2024 18:33
@lijinf2 lijinf2 added 3 - Ready for Review Ready for review by team and removed 3 - Ready for Review Ready for review by team labels Nov 26, 2024
@lijinf2 lijinf2 force-pushed the fix_hanging_mean_stddev branch from 21be3fb to 83ca352 Compare December 3, 2024 04:45

copy-pr-bot Bot commented Dec 3, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor Author

lijinf2 commented Dec 3, 2024

build

@lijinf2 lijinf2 force-pushed the fix_hanging_mean_stddev branch from 83ca352 to f87fab6 Compare December 3, 2024 05:07
@lijinf2 lijinf2 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currenty a work in progress labels Dec 3, 2024
@lijinf2 lijinf2 force-pushed the fix_hanging_mean_stddev branch from f87fab6 to dabf335 Compare December 4, 2024 23:00
@lijinf2 lijinf2 requested a review from a team as a code owner December 4, 2024 23:00
@github-actions github-actions Bot added the Cython / Python Cython or Python issue label Dec 4, 2024
Contributor Author

lijinf2 commented Dec 4, 2024

Added test cases to test_dask_logistic_regression.py for better testing. No change to the main code (standardization.cuh). Ready for review.

Member

@dantegd dantegd left a comment

Thanks for the additional testing, @lijinf2! The code itself looks good to me.

add the testcase of one GPU gets all zeroes

revise test_standardization_example to reuse functions

keep revise to reuse code

give a better name
@lijinf2 lijinf2 force-pushed the fix_hanging_mean_stddev branch from 6affb8a to 90d7df5 Compare December 5, 2024 17:51
Contributor

wphicks commented Dec 6, 2024

/merge

@rapids-bot rapids-bot Bot merged commit 4bfe72f into rapidsai:branch-24.12 Dec 6, 2024
Labels

3 - Ready for Review Ready for review by team CUDA/C++ Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change

3 participants