Conversation
mschubert
left a comment
There was a problem hiding this comment.
Thanks for the PR! I added some comments, please have a look and explain/amend as appropriate.
R/zzz.r
Outdated
| qname = c("SLURM", "LSF", "SGE", "GCS", "OCS", "LOCAL") | ||
| exec = Sys.which(c("sbatch", "bsub", "qsub", "qsub", "qsub")) | ||
| select = c(which(nchar(exec) > 0), 6)[1] |
There was a problem hiding this comment.
This will not work: We're checking available shell commands to guess the scheduler. If qsub is available, this is SGE (by assumption). We can not distinguish between PBS, Torque, GCS, OCS; so it makes no sense to check these here.
There was a problem hiding this comment.
What is your recommendation here?
There was a problem hiding this comment.
I think it's best to leave the code as it was before. We could also consider parsing the output of e.g. qsub --help to disambiguate between the different schedulers, but we'd need to make sure this is stable across versions.
vignettes/userguide.Rmd
Outdated
| * [GCS](#gcs) - *works without setup* | ||
| * [OCS](#ocs) - *works without setup* |
There was a problem hiding this comment.
This is incorrect, needs clustermq.scheduler to be set (see PBS/Torque)
README.md
Outdated
| * [GCS](https://mschubert.github.io/clustermq/articles/userguide.html#gcs) - *works without setup* | ||
| * [OCS](https://mschubert.github.io/clustermq/articles/userguide.html#ocs) - *works without setup* |
There was a problem hiding this comment.
This is incorrect, needs clustermq.scheduler to be set (see PBS/Torque)
There was a problem hiding this comment.
Ok. Now I understand. From the scheduler end there is no change required. A default OCS/GCS installtion is sufficient.
| the missing options. | ||
|
|
||
|
|
||
| ### GCS |
There was a problem hiding this comment.
The templates and docs of GCS/OCS are a lot more verbose than for the other schedulers. We should try to be consistent here
There was a problem hiding this comment.
Does this mean that I should remove helpfull comments?
There was a problem hiding this comment.
Maybe we can make them a bit more concise. Can you make it more in the style of the SGE docs?
| OCS = R6::R6Class("OCS", | ||
| inherit = QSys, | ||
|
|
||
| public = list( |
There was a problem hiding this comment.
Why is the SGE initializer not reused here? Job names are guaranteed to be unique within clustermq; but if IDs are better, we should use them in SGE as well
There was a problem hiding this comment.
I have no access to SGE and do not know if job names were unique back then with the old Sun Microsystems release.
There was a problem hiding this comment.
I feel fairly strongly that we should duplicate this code only if necessary. The ideal implementation would be shared between SGE, OCS and GCS
R/qsys_sge.r
Outdated
| log_worker=FALSE, log_file=NULL, verbose=TRUE) { | ||
| super$initialize(addr=addr, master=master, template=template) | ||
|
|
||
| # fill the template with options and required fields |
There was a problem hiding this comment.
Please only add comments/whitespace where they add value. If the function is called fill_template, a comment with fill the template is superfluous.
R/qsys_sge.r
Outdated
| private$template_error(class(self)[1], status, filled) | ||
|
|
||
| # try to read the job ID from stdout. On error stop | ||
| private$job_id <- regmatches(private$qsub_stdout, regexpr("^[0-9]+", private$qsub_stdout)) |
There was a problem hiding this comment.
Please be consistent with assignments; we use = by convention in this project
R/qsys_sge.r
Outdated
| # first call finalize to send qdel ... | ||
| private$finalize() |
There was a problem hiding this comment.
qdel should only be called if there are still running jobs, not otherwise. I believe the cleanup implementation in SGE is correct.
There was a problem hiding this comment.
Ok. This did not work for me when i had the same code for OCS/GCS. qdel was not triggered although the worker tasks where processed and there were still pending jobs...
There was a problem hiding this comment.
If it doesn't work for the already implemented schedulers that would be a bug rather than an enhancement for OCS/GCS
I didn't see this in the changes, did I miss it? |
|
I made the suggested changes except for reusing the SGE initializer. The initially copied sections are not required for OCS/GCS. They have been removed now. I still think the SGE cleanup/finalize handling is wrong. In case of successfull execution of the workload there might still be pending array job tasks in the SGE cluster. If they are not deleted they will get unnecessaily scheduled and consume resources. If they where all scheduled before the workload was sucessfully executed then the additional qdel call does no harm because the job is not running anymore. The tests that I have mentioned in my initial comment are GCS tests that we run in our test environment with a real GCS cluster. |
As discussed in #341, this PR adds native backend support for OCS and GCS.
Summary of changes:
Testing:
Many thanks for the helpful feedback and suggestions.
Best regards,
Ernst