
Conversation

@SanderBorgmans

Replaced queue system with cluster system and adapted wrapper for the HPC-UGent environment

@jan-janssen
Member

@SanderBorgmans it seems like you used version 0.0.1, which was not SLURM compatible. Did you ever try version 0.0.3?

@SanderBorgmans
Author

I should have based my code on the most recent GitHub version. The adapter did work after altering the queue status function, since our HPC environment returns slightly different output. Where is it apparent that I based this code on an older version?

@jan-janssen
Member

Can you take a look at #9? As far as I understand, those were the files you changed.

@SanderBorgmans
Author

That looks correct. I noticed that the latest GitHub version already includes functions to convert memory strings, which makes my get_size method redundant, so that part could be omitted.

@jan-janssen
Member

Then I guess what we have to work on is #10, as this one currently results in merge conflicts. The part I do not really understand is the specification of the queues. The idea of pysqa is that most users only have a certain number of default settings which they always use. Therefore, instead of recreating all possible configurations, our idea was that the user just specifies templates, consisting of queues or parallel environments or whatever, and then only uses these templates. As far as I understand your changes, you introduced the option to name the queue, which should not be necessary from my perspective; instead, my recommendation would be to have multiple templates, one for each queue. Examples can be found in https://github.com/pyiron/pysqa/tree/master/tests/config/sge
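For illustration, a minimal usage sketch of the template-per-queue idea; the `QueueAdapter` call signature, the configuration directory, and the queue names below are assumptions for this sketch, not necessarily the exact pysqa API of the version discussed here:

```python
# Illustrative sketch: one pysqa template per queue / parallel environment, selected by
# name at submission time. Names and signature are hypothetical.
from pysqa import QueueAdapter

# The configuration directory is assumed to hold queue.yaml plus one job-script template
# per queue, e.g. "queue_small" and "queue_parallel".
qa = QueueAdapter(directory="tests/config/sge")

qa.submit_job(
    queue="queue_parallel",        # pick the template that matches the intended queue
    job_name="example",
    working_directory=".",
    cores=4,
    command="echo hello",
)
```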

@SanderBorgmans
Author

The changes I introduced allow the user to use pyiron on any cluster and submit jobs to any cluster regardless of where the pyiron code is running. Our HPC environment consists of several computation clusters, each with a different purpose (e.g. a cluster for a lot of single-core jobs, a cluster for large multi-node jobs, etc.). The queue variable allows the user to specify which cluster to submit a job to. Finally, an overview of all job statuses can be queried across all clusters (hence the module swap command). If no queue is specified, it just submits the jobs on the current cluster.

@jan-janssen
Member

To me the module part seems very specific to your cluster, which is why I would move this part into the environments. What I do not understand at the moment: do you need to load a specific module to get the status of a job, or only for the submission?

@SanderBorgmans
Author

If you want the status of a job, or want to submit a job on a cluster different from the one the Jupyter notebook is running on, the swap command is necessary. How would we move this part to the environments?

@jan-janssen
Member

Ok, then it does not work with the current setup. But I could see how it works by extending the existing slurm class and moving the functionality in there. Meaning I would prefer to only change the slurm.py part and basically have a derived queue class specific to your cluster, rather than changing the API of the queue adapter in general.

@SanderBorgmans
Author

I see what you mean, I will take a look at it next week.

@jan-janssen
Member

@SanderBorgmans I just saw your commit and I have the feeling we had a misunderstanding. Instead of importing the queue adapter in the slurm module, my idea was to develop a class similar to SlurmCommands, maybe GentCommands derived from SlurmCommands, which implements the same functions as SlurmCommands but is capable of accessing them via modules and so on.
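A rough sketch of what such a derived class could look like; the import path, the property names on SlurmCommands, the constructor argument, and the `bash -lc` wrapping are all assumptions made for illustration, not the actual implementation:

```python
# Hypothetical sketch: a GentCommands class derived from SlurmCommands that exposes the
# same commands, but prepends a "module swap" so they run against the selected cluster.
from pysqa.wrapper.slurm import SlurmCommands


class GentCommands(SlurmCommands):
    def __init__(self, cluster_name=None):
        super().__init__()
        self._cluster_name = cluster_name  # e.g. "cluster_a"; None means the current cluster

    def _wrap(self, command_lst):
        # Prepend the module swap; quoting is left out for brevity in this sketch.
        if self._cluster_name is None:
            return command_lst
        return [
            "bash", "-lc",
            "module swap cluster/{} && {}".format(self._cluster_name, " ".join(command_lst)),
        ]

    @property
    def submit_job_command(self):
        return self._wrap(super().submit_job_command)

    @property
    def delete_job_command(self):
        return self._wrap(super().delete_job_command)

    @property
    def get_queue_status_command(self):
        return self._wrap(super().get_queue_status_command)
```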

@SanderBorgmans
Author

@jan-janssen My mistake. But then I do not understand how the queue variable is passed to these functions without altering QueueAdapter. What is your opinion?

@jan-janssen
Member

I guess the easiest way would be to just place a shell script behind scancel and the other underlying commands and then format the returned queue IDs in a way that lets us identify which queuing system they belong to, for example by just adding another digit. But you are right, maybe it makes sense to split the pysqa API a bit more to make this kind of modification easier.

@SanderBorgmans
Author

@jan-janssen I guess that repurposing your queue system, which was bound to different templates on a single cluster, for the different clusters of a multi-cluster HPC was not ideal. It seemed logical to do it this way since I could specify the core and memory limits per queue, but perhaps it would be better to separate the cluster properties (cluster name, memory/core limits, workload manager (slurm, torque, ...)) into a new class within pyiron. That way the queue adapter remains generic, and single-cluster machines could just have a single cluster object. Is this possible?

@jan-janssen
Member

The primary idea of the pysqa package is to simplify the control of the cluster for the Python user. Therefore I would like to hide the complexity in this module rather than integrating it in pyiron. For this case I would recommend having queues like [cluster1_queue1, cluster1_queue2, ..., cluster2_queue1, ...]. When the user submits a job we just attach another digit, for example cluster1 queue id 1234 would become 12341 and cluster2 queue id 2345 would become 23452. Finally, when the user tries to delete a job we can match it again. I am aware that we lose the connection between the reported queue id and the queue id in the job management system, but that is currently the only way if we want to keep the return values as int. To allow this kind of in-between level of abstraction I introduced another layer in #11.
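A minimal sketch of that ID mapping, purely illustrative (the dictionary of cluster digits and the function names are hypothetical):

```python
# Hypothetical sketch of the proposed scheme: append one digit identifying the cluster,
# so the combined ID stays an int and can be split again when the job is deleted.
CLUSTER_DIGITS = {"cluster1": 1, "cluster2": 2}  # at most 10 clusters fit in one digit
DIGIT_TO_CLUSTER = {v: k for k, v in CLUSTER_DIGITS.items()}


def encode_job_id(queue_id, cluster):
    # e.g. queue id 1234 on cluster1 -> 12341
    return queue_id * 10 + CLUSTER_DIGITS[cluster]


def decode_job_id(combined_id):
    # e.g. 23452 -> (2345, "cluster2")
    return combined_id // 10, DIGIT_TO_CLUSTER[combined_id % 10]
```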

@SanderBorgmans
Author

@jan-janssen This seems to be a tractable idea as long as the number of clusters does not exceed 10. On what level are the job_ids altered? Is this completely within the queue adapter, or does this happen within pyiron?

@jan-janssen
Member

I would keep it within the queueadapter.

@SanderBorgmans
Author

@jan-janssen Can we access and alter the job_id from within the queueadapter? It seems only the job name ('pi_' + job_id) is handled by the queueadapter, but I am certainly not familiar with all the code.

@jan-janssen
Member

That's what I tried in #11. The idea is simply to provide another layer of abstraction between the QueueAdapter, which is accessed from pyiron, and the implementation of the QueueAdapterInterface.
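A rough sketch of what such a layer could look like; the class and method names below are illustrative assumptions, not the actual #11 implementation:

```python
# Illustrative sketch of the extra abstraction layer: pyiron only talks to QueueAdapter,
# which delegates to whichever backend implements the interface. Names are hypothetical.
from abc import ABC, abstractmethod


class QueueAdapterInterface(ABC):
    @abstractmethod
    def submit_job(self, queue, job_name, working_directory, command): ...

    @abstractmethod
    def get_status_of_job(self, process_id): ...

    @abstractmethod
    def delete_job(self, process_id): ...


class QueueAdapter:
    """Public entry point used by pyiron; forwards everything to the selected backend."""

    def __init__(self, backend: QueueAdapterInterface):
        self._backend = backend

    def submit_job(self, queue, job_name, working_directory, command):
        return self._backend.submit_job(queue, job_name, working_directory, command)

    def get_status_of_job(self, process_id):
        return self._backend.get_status_of_job(process_id)

    def delete_job(self, process_id):
        return self._backend.delete_job(process_id)
```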

@SanderBorgmans
Author

@jan-janssen We could also introduce a swap-cluster command in the slurm wrapper that is always prepended to any command, and that remains empty if there is only one cluster?
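Something along these lines, as a tiny hypothetical sketch (the helper name and command string are assumptions):

```python
# Hypothetical helper for the prepended swap command: empty when only one cluster exists.
def cluster_prefix(cluster_name=None):
    if cluster_name is None:          # single-cluster setup: prepend nothing
        return ""
    return "module swap cluster/{} && ".format(cluster_name)


# Usage sketch: the wrapper would build its shell command as, e.g.,
#   cluster_prefix("cluster_a") + "squeue --format '%A|%u|%t|%j'"
```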

@jan-janssen
Member

Hi @SanderBorgmans, I updated the queue adapter to work with the module loading. Can you test whether https://github.com/jan-janssen/pysqa/tree/interface works for you? Then I will merge it into the main branch.

@SanderBorgmans
Author

@jan-janssen I created a new pull request against master using your interface code: #15
I hope this was a good approach.
