
Conversation

@rohitagarwal003
Contributor

Earlier, if the NVIDIA driver was not installed when cAdvisor was started,
we would start a goroutine that tried to initialize NVML every minute.

This resulted in a race. We can have a situation where:

  • goroutine tries to initialize NVML but fails. So, it sleeps for a minute.
  • the driver is installed.
  • a container that uses NVIDIA devices is started.
    This container would not get GPU stats because a minute has not passed
    since the last failed initialization attempt and so NVML is not
    initialized.
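
To make the fix concrete, here is a minimal sketch (not the merged diff) of the on-demand approach: instead of a background goroutine retrying once a minute, NVML initialization is attempted lazily, under the manager's lock, at the moment GPU stats are actually needed. Only `nvmlInitialized` and `initializeNVML` appear in the code excerpted later in this thread; the type name `nvidiaManager`, the `GetGPUStats` entry point, and the stub bodies are assumptions for illustration.

```go
package accelerators

import "sync"

// Sketch only: nvmlInitialized and initializeNVML come from the excerpt
// reviewed below; everything else is a stand-in so the pattern compiles.
type nvidiaManager struct {
	sync.Mutex
	nvmlInitialized bool
}

// initializeNVML would call into NVML and set nm.nvmlInitialized on
// success; stubbed here.
func initializeNVML(nm *nvidiaManager) {}

// GetGPUStats is a hypothetical entry point invoked when stats are
// collected for a container that uses NVIDIA devices. Initializing NVML
// lazily here, under the lock, closes the window where the driver is
// already installed but the old once-a-minute retry goroutine has not
// woken up yet.
func (nm *nvidiaManager) GetGPUStats() (gpuStats map[string]uint64, ok bool) {
	nm.Lock()
	defer nm.Unlock()
	if !nm.nvmlInitialized {
		initializeNVML(nm)
	}
	if !nm.nvmlInitialized {
		// Driver still not installed; report no GPU stats for now.
		return nil, false
	}
	// Query NVML for per-device stats here.
	return map[string]uint64{}, true
}
```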

@rohitagarwal003
Contributor Author

/assign @dashpole @vishh

if !nm.nvmlInitialized {
	initializeNVML(nm)
	nm.Unlock()
} else {

Collaborator

You don't need the else here. Just unlock outside the if statement.

Contributor Author

Done.

Contributor Author

Actually, this made me realize that I can simplify taking the lock multiple times into taking it just once. PTAL.
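
For reference, a hedged sketch of the simplified shape being described, reusing the stand-in `nvidiaManager` type from the sketch in the opening comment; the `ensureNVML` helper name is hypothetical and this is not the merged hunk:

```go
// The mutex is acquired once and released by a single deferred Unlock, so
// neither branch of the initialization check needs its own Unlock call.
func (nm *nvidiaManager) ensureNVML() bool {
	nm.Lock()
	defer nm.Unlock()
	if !nm.nvmlInitialized {
		initializeNVML(nm)
	}
	return nm.nvmlInitialized
}
```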

Collaborator

@dashpole dashpole left a comment

One nit. Otherwise looks good.

@vishh
Contributor

vishh commented Jun 18, 2018

/lgtm

Collaborator

@dashpole dashpole left a comment
LGTM

@dashpole dashpole merged commit fc0bd7a into google:master Jun 18, 2018
dashpole added a commit that referenced this pull request Jun 21, 2018
dashpole added a commit that referenced this pull request Jun 21, 2018
dashpole added a commit that referenced this pull request Jun 21, 2018
rohitagarwal003 added a commit to rohitagarwal003/kubernetes that referenced this pull request Jun 22, 2018
The race condition that required this sleep was fixed in google/cadvisor#1969.
That was vendored in kubernetes#65334.
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Jun 22, 2018
Automatic merge from submit-queue (batch tested with PRs 65377, 63837, 65370, 65294, 65376). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Remove unneeded sleep from test.

The race condition that required this sleep was fixed in google/cadvisor#1969.
That was vendored in #65334.

```release-note
NONE
```

/assign @jiayingz @vishh