
Conversation

@SanderBorgmans

Replaced queue system with cluster system and adapted wrapper for the HPC-UGent environment

@jan-janssen
Member

@SanderBorgmans it seems like you used version 0.0.1, which was not SLURM compatible. Did you ever try version 0.0.3?

@SanderBorgmans
Author

I should have based my code on the most recent GitHub version. The adapter did work after altering the queue status function, since our HPC environment returns slightly different output. Where is it apparent that I based this code on an older version?

@jan-janssen
Member

Can you take a look at #9? As far as I understand, those were the files you changed.

@SanderBorgmans
Author

That looks correct. I noticed that the latest GitHub version already includes functions to convert memory strings, which makes my get_size method redundant, so that part could be omitted.

@jan-janssen
Member

Then I guess what we have to work on is #10, as this one currently results in merge conflicts. The part I do not really understand is the specification of the queues. The idea of pysqa is that most users only have a certain number of default settings which they always use. Therefore, instead of recreating all possible configurations, our idea was that the user just specifies templates, consisting of queues or parallel environments or whatever, and then only uses these templates. As far as I understand your changes, you introduced the option to name the queue, which should not be necessary from my perspective; instead, my recommendation would be to have multiple templates, one for each queue. Examples can be found in https://github.com/pyiron/pysqa/tree/master/tests/config/sge
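For illustration, a minimal usage sketch of the template-per-queue idea; the `QueueAdapter` call signature, the configuration directory, and the queue names below are assumptions for this sketch, not necessarily the exact pysqa API of the version discussed here:

```python
# Illustrative sketch: one pysqa template per queue / parallel environment, selected by
# name at submission time. Names and signature are hypothetical.
from pysqa import QueueAdapter

# The configuration directory is assumed to hold queue.yaml plus one job-script template
# per queue, e.g. "queue_small" and "queue_parallel".
qa = QueueAdapter(directory="tests/config/sge")

qa.submit_job(
    queue="queue_parallel",        # pick the template that matches the intended queue
    job_name="example",
    working_directory=".",
    cores=4,
    command="echo hello",
)
```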

@SanderBorgmans
Author

The changes I introduced allow the user to use pyiron on any cluster and submit jobs to any cluster regardless of where the pyiron code is running. Our HPC environment consists of several computation clusters, each with a different purpose (e.g. a cluster for a lot of single-core jobs, a cluster for large multi-node jobs, etc.). The queue variable allows the user to specify which cluster to submit a job to. Finally, an overview of all job statuses can be queried across all clusters (hence the module swap command). If no queue is specified, it just submits the jobs on the current cluster.

@jan-janssen
Member

To me the module part seems very specific to your cluster, which is why I would move this part into the environments. What I do not understand at the moment: do you need to load a specific module to get the status of a job, or only for the submission?

@SanderBorgmans
Author

If you want the status of a job, or want to submit a job on a cluster different from the one the Jupyter notebook is running on, the swap command is necessary. How would we move this part to the environments?

@jan-janssen
Member

Ok, then it does not work with the current setup. But I could see how it works by extending the existing slurm class and moving the functionality in there. Meaning I would prefer to only change the slurm.py part and basically have a derived queue class specific to your cluster, rather than changing the API of the queue adapter in general.

@SanderBorgmans
Author

I see what you mean, I will take a look at it next week.

@jan-janssen
Member

@SanderBorgmans I just saw your commit and I have the feeling we had a misunderstanding. Instead of importing the queue adapter in the slurm module, my idea was to develop a class similar to SlurmCommands, maybe GentCommands derived from SlurmCommands, which implements the same functions as SlurmCommands but is capable of accessing them via modules and so on.
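A rough sketch of what such a derived class could look like; the import path, the property names on SlurmCommands, the constructor argument, and the `bash -lc` wrapping are all assumptions made for illustration, not the actual implementation:

```python
# Hypothetical sketch: a GentCommands class derived from SlurmCommands that exposes the
# same commands, but prepends a "module swap" so they run against the selected cluster.
from pysqa.wrapper.slurm import SlurmCommands


class GentCommands(SlurmCommands):
    def __init__(self, cluster_name=None):
        super().__init__()
        self._cluster_name = cluster_name  # e.g. "cluster_a"; None means the current cluster

    def _wrap(self, command_lst):
        # Prepend the module swap; quoting is left out for brevity in this sketch.
        if self._cluster_name is None:
            return command_lst
        return [
            "bash", "-lc",
            "module swap cluster/{} && {}".format(self._cluster_name, " ".join(command_lst)),
        ]

    @property
    def submit_job_command(self):
        return self._wrap(super().submit_job_command)

    @property
    def delete_job_command(self):
        return self._wrap(super().delete_job_command)

    @property
    def get_queue_status_command(self):
        return self._wrap(super().get_queue_status_command)
```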

@SanderBorgmans
Author

@jan-janssen My mistake. But then I do not understand how the queue variable is passed to these functions without altering QueueAdapter. What is your opinion?

@jan-janssen
Member

I guess the easiest way would be to just place a shell script behind scancel and the other underlying commands and then format the returned queue IDs in a way that lets us identify which queuing system they belong to, for example by just adding another digit. But you are right, maybe it makes sense to split the pysqa API a bit more to make this kind of modification easier.

@SanderBorgmans
Author

@jan-janssen I guess that repurposing your queue system, which was bound to different templates on a single cluster, for the different clusters of a multi-cluster HPC was not ideal. It seemed logical to do it this way since I could specify the core and memory limits per queue, but perhaps it would be better to separate the cluster properties (cluster name, memory/core limits, workload manager (slurm, torque, ...)) into a new class within pyiron. That way the queue adapter remains generic, and single-cluster machines could just have a single cluster object. Is this possible?

@jan-janssen
Member

The primary idea of the pysqa package is to simplify the control of the cluster for the Python user. Therefore I would like to hide the complexity in this module rather than integrating it in pyiron. For this case I would recommend having queues like [cluster1_queue1, cluster1_queue2, ..., cluster2_queue1, ...]. When the user submits a job we just attach another digit, for example cluster1 queue id 1234 would become 12341 and cluster2 queue id 2345 would become 23452. Finally, when the user tries to delete a job we can match it again. I am aware that we lose the connection between the reported queue id and the queue id in the job management system, but that is currently the only way if we want to keep the return values as int. To allow this kind of in-between level of abstraction I introduced another layer in #11.
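A minimal sketch of that ID mapping, purely illustrative (the dictionary of cluster digits and the function names are hypothetical):

```python
# Hypothetical sketch of the proposed scheme: append one digit identifying the cluster,
# so the combined ID stays an int and can be split again when the job is deleted.
CLUSTER_DIGITS = {"cluster1": 1, "cluster2": 2}  # at most 10 clusters fit in one digit
DIGIT_TO_CLUSTER = {v: k for k, v in CLUSTER_DIGITS.items()}


def encode_job_id(queue_id, cluster):
    # e.g. queue id 1234 on cluster1 -> 12341
    return queue_id * 10 + CLUSTER_DIGITS[cluster]


def decode_job_id(combined_id):
    # e.g. 23452 -> (2345, "cluster2")
    return combined_id // 10, DIGIT_TO_CLUSTER[combined_id % 10]
```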

@SanderBorgmans
Author

@jan-janssen This seems to be a tractable idea as long as the number of clusters does not exceed 10. On what level are the job_ids altered? Is this completely within the queue adapter, or does this happen within pyiron?

@jan-janssen
Member

I would keep it within the queueadapter.

@SanderBorgmans
Author

@jan-janssen Can we access and alter the job_id from within the queueadapter? It seems only the job name ('pi_' + job_id) is handled by the queueadapter, but I am certainly not familiar with all the code.

@jan-janssen
Member

That's what I tried in #11. The idea is simply to provide another layer of abstraction between the QueueAdapter, which is accessed from pyiron, and the implementation of the QueueAdapterInterface.
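A rough sketch of what such a layer could look like; the class and method names below are illustrative assumptions, not the actual #11 implementation:

```python
# Illustrative sketch of the extra abstraction layer: pyiron only talks to QueueAdapter,
# which delegates to whichever backend implements the interface. Names are hypothetical.
from abc import ABC, abstractmethod


class QueueAdapterInterface(ABC):
    @abstractmethod
    def submit_job(self, queue, job_name, working_directory, command): ...

    @abstractmethod
    def get_status_of_job(self, process_id): ...

    @abstractmethod
    def delete_job(self, process_id): ...


class QueueAdapter:
    """Public entry point used by pyiron; forwards everything to the selected backend."""

    def __init__(self, backend: QueueAdapterInterface):
        self._backend = backend

    def submit_job(self, queue, job_name, working_directory, command):
        return self._backend.submit_job(queue, job_name, working_directory, command)

    def get_status_of_job(self, process_id):
        return self._backend.get_status_of_job(process_id)

    def delete_job(self, process_id):
        return self._backend.delete_job(process_id)
```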

@SanderBorgmans
Author

@jan-janssen We could also introduce a swap-cluster command in the slurm wrapper that is always prepended to any command, and that remains empty if there is only one cluster?
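Something along these lines, as a tiny hypothetical sketch (the helper name and command string are assumptions):

```python
# Hypothetical helper for the prepended swap command: empty when only one cluster exists.
def cluster_prefix(cluster_name=None):
    if cluster_name is None:          # single-cluster setup: prepend nothing
        return ""
    return "module swap cluster/{} && ".format(cluster_name)


# Usage sketch: the wrapper would build its shell command as, e.g.,
#   cluster_prefix("cluster_a") + "squeue --format '%A|%u|%t|%j'"
```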

@jan-janssen
Member

Hi @SanderBorgmans, I updated the queue adapter to work with the module loading. Can you test whether https://github.com/jan-janssen/pysqa/tree/interface works for you? Then I will merge it into the main branch.

@SanderBorgmans
Author

@jan-janssen I created a new pull request against master using your interface code: #15
I hope this was a good approach.
