Labels
kind/bug: Categorizes issue or PR as related to a bug.
Description
I am conducting stress tests on Volcano. I simulated thousands of nodes using Kwok and created thousands of Pods. However, after I deleted all the Pods, the job-related metrics in the Volcano scheduler did not reset to zero.
Steps to reproduce the issue
- Create a batch of Pods
- Quickly delete this batch of Pods
- Observe that metrics such as unschedule_task_count are not cleared (see the verification sketch below)
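As a rough way to confirm the stale series, one can scrape the scheduler's Prometheus endpoint and count the remaining unschedule_task_count samples. This is only a minimal sketch, assuming the metrics are exposed at http://localhost:8080/metrics; the actual address, port, and metric name prefix depend on your deployment:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumption: the volcano-scheduler metrics endpoint is reachable here.
	resp, err := http.Get("http://localhost:8080/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	count := 0
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Count sample lines (skip "# HELP" / "# TYPE") for the per-job metric
		// that should have been removed together with the deleted Pods/jobs.
		if strings.Contains(line, "unschedule_task_count") && !strings.HasPrefix(line, "#") {
			count++
			fmt.Println(line)
		}
	}
	fmt.Printf("remaining unschedule_task_count series: %d\n", count)
}
```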
Describe the results you received and expected
Although I deleted 4096 pending Pods, there are still over 3,000 Pods remaining in the metrics.
After restarting the scheduler, the metrics returned to normal.
What version of Volcano are you using?
v1.13.0
Any other relevant information
I examined the logic for cleaning up the metrics:
func (sc *SchedulerCache) processCleanupJob() {
	job, shutdown := sc.DeletedJobs.Get()
	if shutdown {
		return
	}
	defer sc.DeletedJobs.Done(job)

	sc.Mutex.Lock()
	defer sc.Mutex.Unlock()

	if schedulingapi.JobTerminated(job) {
		oldJob, found := sc.Jobs[job.UID]
		if !found {
			klog.V(3).Infof("Failed to find Job <%v:%v/%v>, ignore it", job.UID, job.Namespace, job.Name)
			sc.DeletedJobs.Forget(job)
			return
		}
		newPgVersion := oldJob.PgUID
		oldPgVersion := job.PgUID
		klog.V(5).Infof("Just add pguid:%v, try to delete pguid:%v", newPgVersion, oldPgVersion)
		if oldPgVersion == newPgVersion {
			delete(sc.Jobs, job.UID)
			metrics.DeleteJobMetrics(job.Name, string(job.Queue), job.Namespace)
			klog.V(3).Infof("Job <%v:%v/%v> was deleted.", job.UID, job.Namespace, job.Name)
		}
		sc.DeletedJobs.Forget(job)
	} else {
		// Retry
		sc.retryDeleteJob(job)
	}
}
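For reference, per-job series registered with Prometheus client_golang persist at their last reported value until they are explicitly deleted, so any path through the cleanup above that never reaches metrics.DeleteJobMetrics leaves the old values on the /metrics endpoint. The following is a minimal sketch of that mechanism using the standard GaugeVec API; the gauge name and label set are hypothetical and not Volcano's actual metrics code:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical per-job gauge, keyed the same way DeleteJobMetrics is called
// above: job name, queue, namespace.
var unscheduleTaskCount = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "unschedule_task_count",
		Help: "Number of tasks that could not be scheduled, per job.",
	},
	[]string{"job_name", "queue", "namespace"},
)

func main() {
	prometheus.MustRegister(unscheduleTaskCount)

	// A scheduling cycle sets the per-job value.
	unscheduleTaskCount.WithLabelValues("job-1", "default", "stress").Set(4096)

	// Unless the series is deleted when the job goes away, it keeps reporting
	// 4096 indefinitely; removing it is what a DeleteJobMetrics-style helper
	// is expected to do.
	deleted := unscheduleTaskCount.DeleteLabelValues("job-1", "default", "stress")
	fmt.Println("series deleted:", deleted)
}
```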