Volcano metrics related to the job were not properly recycled #4705

@MondayCha

Description

I am conducting stress tests on Volcano. I simulated thousands of nodes using Kwok and created thousands of Pods. However, after I deleted all the Pods, the job-related metrics in the Volcano scheduler did not reset to zero.

Steps to reproduce the issue

  1. Create a batch of Pods
  2. Quickly delete this batch of Pods
  3. Observe that metrics such as unschedule_task_count are not cleared
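One way to check step 3 without reading a dashboard is to total every series of a metric from a scrape of the scheduler's metrics output. The following is a minimal sketch, assuming the metrics are available as Prometheus text exposition format; `sumMetric` is a hypothetical helper, and the sample payload, including the `job_id` label, is made up for illustration rather than copied from a real scrape.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumMetric adds up every sample of the metric called name in a payload
// written in Prometheus text exposition format. Labels are ignored, so the
// result is the total across all label sets (for example, across all jobs).
func sumMetric(payload, name string) (float64, error) {
	var total float64
	scanner := bufio.NewScanner(strings.NewReader(payload))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// Skip blank lines and "# HELP" / "# TYPE" comments.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if !strings.HasPrefix(line, name) {
			continue
		}
		rest := line[len(name):]
		// The name must be followed by '{' or whitespace; otherwise this is
		// a different metric that merely shares the prefix.
		if rest == "" || (rest[0] != '{' && rest[0] != ' ' && rest[0] != '\t') {
			continue
		}
		if rest[0] == '{' {
			end := strings.LastIndex(rest, "}")
			if end < 0 {
				return 0, fmt.Errorf("malformed sample: %q", line)
			}
			rest = rest[end+1:]
		}
		// What remains is "value [timestamp]"; only the value is needed.
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return 0, fmt.Errorf("sample without value: %q", line)
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, fmt.Errorf("bad value in %q: %w", line, err)
		}
		total += value
	}
	return total, scanner.Err()
}

func main() {
	// Hypothetical scrape output; the label names are assumptions.
	sample := `# TYPE unschedule_task_count gauge
unschedule_task_count{job_id="job-a"} 3
unschedule_task_count{job_id="job-b"} 5
unschedule_job_count 2
`
	total, err := sumMetric(sample, "unschedule_task_count")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("unschedule_task_count total = %v\n", total)
}
```

Running the same check before creating the Pods, after creating them, and after deleting them shows whether the total returns to its starting value.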

Describe the results you received and expected

Although I deleted all 4096 pending Pods, the metrics still report more than 3,000 of them. I expected the count to drop to zero.

After I restarted the scheduler, the metrics returned to normal.

What version of Volcano are you using?

v1.13.0

Any other relevant information

I examined the logic for cleaning up the metrics:

func (sc *SchedulerCache) processCleanupJob() {
	job, shutdown := sc.DeletedJobs.Get()
	if shutdown {
		return
	}

	defer sc.DeletedJobs.Done(job)

	sc.Mutex.Lock()
	defer sc.Mutex.Unlock()

	if schedulingapi.JobTerminated(job) {
		oldJob, found := sc.Jobs[job.UID]
		if !found {
			klog.V(3).Infof("Failed to find Job <%v:%v/%v>, ignore it", job.UID, job.Namespace, job.Name)
			sc.DeletedJobs.Forget(job)
			return
		}
		newPgVersion := oldJob.PgUID
		oldPgVersion := job.PgUID
		klog.V(5).Infof("Just add pguid:%v, try to delete pguid:%v", newPgVersion, oldPgVersion)
		if oldPgVersion == newPgVersion {
			delete(sc.Jobs, job.UID)
			metrics.DeleteJobMetrics(job.Name, string(job.Queue), job.Namespace)
			klog.V(3).Infof("Job <%v:%v/%v> was deleted.", job.UID, job.Namespace, job.Name)
		}
		sc.DeletedJobs.Forget(job)
	} else {
		// Retry
		sc.retryDeleteJob(job)
	}
}
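
Reading the function above, `metrics.DeleteJobMetrics` is called on only one path: the job must still be present in `sc.Jobs` and its `PgUID` must match. The "not found" branch and the `PgUID` mismatch branch both return without touching the metrics. The toy model below illustrates that branch structure. It is a sketch, not Volcano code: `jobInfo`, `toyCache`, and `cleanup` are hypothetical names, the metric store is a plain map, and it does not establish that either branch is the actual cause of the stale values reported here.

```go
package main

import "fmt"

// jobInfo is a hypothetical, trimmed-down stand-in for the scheduler's job record.
type jobInfo struct {
	UID   string
	Name  string
	PgUID string
}

// toyCache models only the state the cleanup touches: the job map and
// one metric series per job name (standing in for a labelled gauge).
type toyCache struct {
	jobs    map[string]*jobInfo
	metrics map[string]float64
}

// cleanup follows the same branches as the quoted function for a job that
// is already terminated, and reports which branch was taken.
func (c *toyCache) cleanup(job *jobInfo) string {
	old, found := c.jobs[job.UID]
	if !found {
		return "not-found" // returns early; the metric series is left alone
	}
	if old.PgUID != job.PgUID {
		return "pg-mismatch" // the job was re-added; nothing is deleted
	}
	delete(c.jobs, job.UID)
	delete(c.metrics, job.Name)
	return "deleted"
}

func main() {
	c := &toyCache{
		jobs: map[string]*jobInfo{
			"uid-1": {UID: "uid-1", Name: "job-a", PgUID: "pg-1"},
			"uid-3": {UID: "uid-3", Name: "job-c", PgUID: "pg-new"},
		},
		// job-b has a metric series but no entry in the job map.
		metrics: map[string]float64{"job-a": 4, "job-b": 4, "job-c": 4},
	}

	deleted := []*jobInfo{
		{UID: "uid-1", Name: "job-a", PgUID: "pg-1"},   // normal path
		{UID: "uid-2", Name: "job-b", PgUID: "pg-1"},   // not in the job map
		{UID: "uid-3", Name: "job-c", PgUID: "pg-old"}, // PgUID changed
	}
	for _, job := range deleted {
		outcome := c.cleanup(job)
		_, stale := c.metrics[job.Name]
		fmt.Printf("%s: %s, metric series still present: %v\n", job.Name, outcome, stale)
	}
}
```

In this model only the first job has its series removed; the other two keep theirs until the process restarts, which would be consistent with the restart clearing the values.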

Metadata

Labels: kind/bug (Categorizes issue or PR as related to a bug.)
