Hpc ugent #7

Conversation
@SanderBorgmans it seems like you used version 0.0.1, which was not SLURM compatible. Did you ever try version 0.0.3?
I should have based my code on the most recent GitHub version. The adapter did work after altering the queue status function, since our HPC environment returns a slightly different output. Where is it apparent that I based this code on an older version?
Can you take a look at #9? As far as I understand, those were the files you changed.
That looks correct. I noticed that the latest GitHub version already includes functions to convert memory strings, which makes my get_size method redundant. So that part could be omitted.
Then I guess what we have to work on is #10, as this one currently results in merge conflicts. And the part I do not really understand is the specification of the queues. The idea of pysqa is that most users only have a certain number of default settings, which they always use. Therefore, instead of recreating all possible configurations, our idea was that the user just specifies templates, consisting of queues or parallel environments or whatever, and then only uses these templates. As far as I understand your changes, you introduced the option to name the queue, which should not be necessary from my perspective; instead, my recommendation would be to have multiple templates, one for each queue. Examples can be found in https://github.com/pyiron/pysqa/tree/master/tests/config/sge
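For illustration, a "one template per queue" configuration along those lines could look like the following. This is a sketch modeled on the linked SGE examples; the queue names and limits here are made up, and the exact field names should be checked against the files in tests/config/sge.

```yaml
# queue.yaml (illustrative sketch, not a real configuration)
queue_type: SGE
queue_primary: short          # template used when the user specifies nothing
queues:
  short:                      # one template per queue, as recommended above
    cores_max: 4
    cores_min: 1
    run_time_max: 3600
    script: short.sh          # job-script template for this queue
  long:
    cores_max: 40
    cores_min: 1
    run_time_max: 604800
    script: long.sh
```

The user then only ever refers to a template name such as short or long, and never to scheduler-specific details.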
The changes I introduced allow the user to use pyiron on any cluster and to submit jobs to any cluster, regardless of where the pyiron code is running. Our HPC environment consists of several computation clusters, each with a different purpose (e.g. a cluster for many single-core jobs, a cluster for large multi-node jobs, etc.). The queue variable allows the user to specify which cluster to submit a job to. Finally, an overview of all job statuses can be queried from the environment across all clusters (hence the module swap command). If no queue is specified, the job is simply submitted on the current cluster.
To me the
If you want the status of a job, or want to submit a job on a cluster different from the one the Jupyter notebook is running on, the swap command is necessary. How would we move this part to the environments?
Ok, then it does not work with the current setup. But I could see how it works by extending the existing SLURM class and moving the functionality in there. Meaning I would prefer to only change the
I see what you mean, I will take a look at it next week.
@SanderBorgmans I just saw your commit and I have the feeling we had a misunderstanding. Instead of importing the queue adapter in the slurm module, my idea was to develop a class similar to
@jan-janssen My mistake. But then I do not understand how the
I guess the easiest way would be to just place a shell script behind
@jan-janssen I guess that replacing your queue system, which was bound to different templates on a single cluster, with the different clusters of a multi-cluster HPC was not ideal. It seemed logical to do it this way, since I could specify the core and memory limits per queue, but perhaps it would be better to separate the cluster properties (cluster name, memory/core limits, workload manager (SLURM, Torque, ...)) into a new class within pyiron. That way, the queue adapter remains generic, and single-cluster machines could just have a single cluster object. Is this possible?
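A minimal sketch of such a cluster-properties class, under the assumptions stated in the comment above. The class name and fields here are illustrative only, not an actual pyiron or pysqa API:

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Bundle the per-cluster properties listed above (illustrative sketch)."""

    name: str            # cluster name, e.g. as used in "module swap cluster/<name>"
    cores_max: int       # core limit per job on this cluster
    memory_max_gb: int   # memory limit per job on this cluster
    scheduler: str       # workload manager, e.g. "slurm" or "torque"


# A single-cluster machine would then simply define one Cluster object,
# while a multi-cluster environment would hold a list of them:
local = Cluster(name="local", cores_max=24, memory_max_gb=128, scheduler="slurm")
```

The queue adapter itself would stay scheduler-generic and only consult the Cluster object it is handed.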
The primary idea of the pysqa package is to simplify control of the cluster for the Python user. Therefore I would like to hide the complexity in this module rather than integrating it in pyiron. For this case, I would recommend having queues like
@jan-janssen This seems like a tractable idea as long as the number of clusters does not exceed 10. On what level are the job_ids altered? Is this completely within the queueadapter, or does this happen within pyiron?
I would keep it within the queueadapter.
@jan-janssen Can we access and alter the job_id from within the queueadapter? It seems only the job name ('pi_' + job_id) is handled by the queueadapter, but I am certainly not familiar with all the code.
That's what I tried in #11; the idea is simply to provide another layer of abstraction between the
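One plausible shape for that extra abstraction layer: a table, kept entirely inside the queue adapter, that maps adapter-level job ids to (cluster, cluster-local scheduler id) pairs. This is a sketch of the concept, not the actual implementation in #11:

```python
class JobIdTable:
    """Translate between adapter-level job ids and cluster-local scheduler ids.

    Kept inside the queue adapter, so pyiron only ever sees adapter-level ids
    and never has to know which cluster a job actually runs on.
    """

    def __init__(self):
        self._next_id = 1
        self._table = {}  # adapter id -> (cluster, local scheduler id)

    def register(self, cluster, local_id):
        """Record a freshly submitted job and hand out an adapter-level id."""
        adapter_id = self._next_id
        self._next_id += 1
        self._table[adapter_id] = (cluster, local_id)
        return adapter_id

    def lookup(self, adapter_id):
        """Return the (cluster, local id) pair needed to query the scheduler."""
        return self._table[adapter_id]


table = JobIdTable()
jid = table.register("clusterA", 4711)  # hypothetical cluster name and SLURM id
```

A status query would then resolve the adapter id first and only afterwards talk to the right cluster.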
@jan-janssen We could also introduce a swap cluster command in the SLURM wrapper that is always prepended to any command, and that remains empty if there is only one cluster?
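The "always prepended, possibly empty" prefix could be built along these lines. The module swap syntax is the one used on the UGent clusters in this thread; the function names are made up for illustration:

```python
def swap_prefix(target_cluster, current_cluster, n_clusters):
    """Return the shell prefix that switches to the target cluster.

    The prefix is empty when there is only one cluster, or when we are
    already on the target, so single-cluster setups are unaffected.
    """
    if n_clusters <= 1 or target_cluster == current_cluster:
        return ""
    return "module swap cluster/{}; ".format(target_cluster)


def submit_command(script, target_cluster, current_cluster, n_clusters):
    """Build the full submission command with the (possibly empty) prefix."""
    return swap_prefix(target_cluster, current_cluster, n_clusters) + "sbatch " + script


cmd = submit_command("run.sh", "clusterB", "clusterA", 3)
# -> "module swap cluster/clusterB; sbatch run.sh"
```

Status queries could use the same prefix in front of squeue, so all wrapper commands go through one code path.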
Hi @SanderBorgmans, I updated the queue adapter to work with the module loading. Can you test whether https://github.com/jan-janssen/pysqa/tree/interface works for you? Then I will merge it into the main branch.
@jan-janssen I created a new pull request, using your interface code, against the master: #15
Replaced the queue system with a cluster system and adapted the wrapper for the HPC UGent environment