Simple low-level program template to distribute tasks on GPUs using the Sun Grid Engine paradigm. The example use case is a pytorch tensor multiplication, with data copied to the GPU and operations performed on it.
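The example workload can be sketched as follows. This is a minimal illustration of a device-aware tensor multiplication, not the template's actual `expensive_task`; it falls back to CPU when no GPU is visible so it runs anywhere:

```python
import torch

# Pick the GPU if one is available, otherwise fall back to CPU so the
# sketch still runs on a machine without CUDA.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Copy the operands to the chosen device and multiply there.
a = torch.ones(3, 3).to(device)
b = torch.full((3, 3), 2.0).to(device)
c = a @ b  # each entry is 1*2 summed over 3 terms, i.e. 6.0

# Move the result back to host memory before CPU-side processing.
print(c.cpu().sum().item())
```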
- Running `python gpu_session.py gpu_id base_folder` opens a python session, locks GPU `gpu_id` to that session, and listens for jobs in `base_folder/agent_pool/{gpu_id}.POOL`. Do this for as many GPUs as needed (in different terminal or tmux sessions), or
- For convenience, running `start_agent.sh n_gpu base_folder` creates the communication folders in `base_folder` and fires up `n_gpu` python sessions, each locking a GPU. This can be run directly from `main.py` by setting HOT_START = FALSE.
- Running `main.py` will feed the tasks to the Master, which distributes them to the JobPool. The Master periodically checks for available agents, distributes jobs to agents, and collects results from completed jobs.
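The file-based handoff between the Master and an agent can be sketched as below. The function names and the JSON payload are illustrative assumptions, not the template's actual API; only the `base_folder/agent_pool/{agent_id}.POOL` path convention comes from the text above:

```python
import json
import time
from pathlib import Path

def submit_job(base_folder, agent_id, job_spec):
    """Master side: drop a job spec into the agent's pool file."""
    pool = Path(base_folder) / "agent_pool"
    pool.mkdir(parents=True, exist_ok=True)
    (pool / f"{agent_id}.POOL").write_text(json.dumps(job_spec))

def wait_for_job(base_folder, agent_id, poll_s=0.1, timeout_s=5.0):
    """Agent side: poll the pool file until a job spec shows up."""
    pool_file = Path(base_folder) / "agent_pool" / f"{agent_id}.POOL"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if pool_file.exists():
            spec = json.loads(pool_file.read_text())
            pool_file.unlink()  # consume the job so it is not picked up twice
            return spec
        time.sleep(poll_s)
    raise TimeoutError(f"no job appeared in {pool_file}")
```

Polling a shared folder keeps the two processes fully decoupled: the Master and each `gpu_session.py` only need to agree on the pool path, not on any network protocol.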
This template can be used to distribute tasks on GPUs as part of a bigger, computationally intensive program. In a typical use case:
- The first step is to set up `expensive_task.py` and ensure GPU usage in the expensive function (for example, in pytorch this is done via `.to(device)` where `device` is of type cuda).
- Set up the tasks in `main.py`; this is a list of arguments/data passed to `expensive_task`.
- Some minor configuration will be required in the `Job` object in `jobs.py` to set the way the data should be passed.
  - In the current example, it is written to a `.npz` file when `Agent` calls `job.deploy()`, and it is loaded by the `gpu_session.py` to which it is assigned.
- In case all jobs need to access shared data, pass it in the shared data argument. At the moment, even an empty piece of data should be passed, because `Agent` communicates that it has a job pending by populating `base_folder/agent_pool/{agent_id}.POOL` with the location of the shared data (TODO: needs to be fixed/auto-generated).
- In a separate terminal, run `watch -n 1 nvidia-smi` to observe usage of every GPU unit requested.
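The `.npz` handoff in the steps above can be sketched as a simple round trip. The `deploy_job`/`load_job` names are illustrative stand-ins for what `job.deploy()` and `gpu_session.py` would do, assuming the payload is numpy arrays:

```python
import numpy as np

def deploy_job(path, **arrays):
    """Job side: serialize the named arrays to a single .npz file."""
    np.savez(path, **arrays)

def load_job(path):
    """gpu_session side: load every array back out of the .npz file."""
    with np.load(path) as data:
        return {name: data[name] for name in data.files}
```

A `.npz` file is just a zip of `.npy` arrays, so arbitrarily many named inputs for one task can travel in a single file.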
- [!!!] Lock GPU in gpu_session.py to prevent other processes from using it
- [!!!] Need shared data path to populate agents (agent.py)
- [!!] Put new agents directly in available agents (master.py)
- [!] Shared data is needed (even an empty matrix) to let agents properly display status to master (main.py)
- Load data directly on the GPU
- [!] Implement result handling in jobs.py (e.g., pass to JobPool or collect files)
- [!] Consolidate logging to a single file
