Initialize NVML on demand. #1969
Conversation
accelerators/nvidia.go (Outdated)
```go
if !nm.nvmlInitialized {
	initializeNVML(nm)
	nm.Unlock()
} else {
```
You don't need the else here. Just unlock outside the if statement.
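For illustration, a minimal sketch of that suggestion, assuming a manager type guarded by an embedded sync.Mutex with an nvmlInitialized flag; the type and helper below are simplified stand-ins, not the exact cAdvisor code:

```go
package accelerators

import "sync"

// nvidiaManager is a simplified stand-in for cAdvisor's manager type, with
// only the fields relevant to the locking discussion.
type nvidiaManager struct {
	sync.Mutex
	nvmlInitialized bool
}

// initializeNVML stands in for the real NVML initialization; on success it
// marks the manager as initialized (the real version can fail if the NVIDIA
// driver is not installed, leaving the flag false).
func initializeNVML(nm *nvidiaManager) {
	// ... call into NVML here ...
	nm.nvmlInitialized = true
}

// ensureInitialized reflects the reviewer's suggestion: no else branch, and
// a single Unlock outside the if instead of one per branch.
func (nm *nvidiaManager) ensureInitialized() {
	nm.Lock()
	if !nm.nvmlInitialized {
		initializeNVML(nm)
	}
	nm.Unlock()
}
```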
Done.
Actually, this made me realize that I can simplify taking the lock multiple times down to taking it just once. PTAL.
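A sketch of what taking the lock just once could look like, reusing the stand-in types from the snippet above; deferring the unlock holds the lock across the whole check-and-initialize path and releases it exactly once on return (again an illustration, not the exact final code):

```go
// ensureInitializedOnce acquires the lock a single time for the entire
// check-and-initialize sequence and releases it when the method returns.
func (nm *nvidiaManager) ensureInitializedOnce() {
	nm.Lock()
	defer nm.Unlock()
	if !nm.nvmlInitialized {
		initializeNVML(nm)
	}
}
```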
dashpole left a comment
One nit. Otherwise looks good.
/lgtm
Force-pushed from 4c45a2e to 844caff.
Earlier, if the NVIDIA driver was not installed when cAdvisor was started, we would start a goroutine to try to initialize NVML every minute. This resulted in a race. We can have a situation where:
- the goroutine tries to initialize NVML but fails, so it sleeps for a minute;
- the driver is installed;
- a container that uses NVIDIA devices is started.

This container would not get GPU stats because a minute has not passed since the last failed initialization attempt, and so NVML is not initialized.
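To make the "on demand" part concrete, here is a hedged sketch of the overall shape, reusing the stand-in types above. The method name and return value are hypothetical; the point is that initialization is attempted at the moment a container with NVIDIA devices is seen, instead of by a once-a-minute retry goroutine, so a freshly installed driver is picked up immediately:

```go
// gpuStatsAvailable is a hypothetical hook called when a container that uses
// NVIDIA devices starts. NVML is initialized on demand here rather than in a
// background retry loop, closing the window described above.
func (nm *nvidiaManager) gpuStatsAvailable() bool {
	nm.Lock()
	defer nm.Unlock()
	if !nm.nvmlInitialized {
		initializeNVML(nm)
	}
	// If the driver is still missing, initialization fails and the container
	// simply gets no GPU stats; no timer has to expire first.
	return nm.nvmlInitialized
}
```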
Force-pushed from 844caff to 2ce4161.
dashpole left a comment
LGTM
Automatic merge from submit-queue (batch tested with PRs 65377, 63837, 65370, 65294, 65376). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Remove unneeded sleep from test.

The race condition that required this sleep was fixed in google/cadvisor#1969. That was vendored in #65334.

```release-note
NONE
```

/assign @jiayingz @vishh