Closed
Labels: bug (Something isn't working)
Description
Bug summary
I use dpgen to submit a job that runs the fp step on the SUGON platform. The fp section of machine.json is:
"fp": [
  {
    "command": "OMP_NUM_THREADS=1 mpirun -np 4 $abacus | tee out.log",
    "machine": {
      "batch_type": "Slurm",
      "context_type": "SSHContext",
      "local_root": "./",
      "remote_root": "/public/home/abacus/tmp",
      "remote_profile": {
        "key_filename": "sugon",
        "hostname": "cancon.hpccube.com",
        "username": "abacus",
        "port": 65023
      }
    },
    "resources": {
      "batch_type": "Slurm",
      "number_node": 1,
      "cpu_per_node": 32,
      "group_size": 1,
      "queue_name": "kshdnormal",
      "custom_flags": [
        "#SBATCH --gres=dcu:4"
      ],
      "source_list": [
        "/public/home/abacus/run_dcu.sh"
      ]
    }
  }
]
The fp job is submitted to Sugon and ABACUS runs successfully, but dpgen logs the following error when it downloads the returned results:
2024-01-23 13:53:23,653 - ERROR : Failed to run ['rsync', '-az', '-e', 'ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -p 65023 -q -i sugon', '-q', '[email protected]:/public/home/abacus/tmp/695809f93a5474bde7743bddb46cbd857e2906c6/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz', '/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz']: b'rsync: chown "/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/.695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz.sKchjf" failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n'
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 273, in try_download_result
self.download_jobs()
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 501, in download_jobs
self.machine.context.download(self)
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 675, in download
self._get_files(
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 905, in _get_files
self.ssh_session.get(from_f, to_f)
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 376, in get
return rsync(
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/utils.py", line 136, in rsync
raise RuntimeError(f"Failed to run {cmd}: {err}")
RuntimeError: Failed to run ['rsync', '-az', '-e', 'ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -p 65023 -q -i sugon', '-q', '[email protected]:/public/home/abacus/tmp/695809f93a5474bde7743bddb46cbd857e2906c6/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz', '/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz']: b'rsync: chown "/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/.695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz.sKchjf" failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n'
2024-01-23 13:53:23,655 - INFO : Retrying in 1 minute...
It seems that rsync tries to chown the downloaded file on the local side, and that chown fails with "Operation not permitted".
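For context: rsync's `-a` flag is shorthand for `-rlptgoD`, so it asks the receiver to preserve owner (`-o`) and group (`-g`), which is where the chown comes from. A possible workaround, sketched below, is to switch those two off with `--no-owner --no-group` placed after `-az`. This is only an illustration of the idea, not dpdispatcher's actual code or a confirmed fix; the function name `build_rsync_cmd` and its parameters are hypothetical, and the ssh options are copied from the log above.

```python
import shlex
from typing import List, Optional


def build_rsync_cmd(
    from_file: str,
    to_file: str,
    port: int = 22,
    key_filename: Optional[str] = None,
) -> List[str]:
    """Build an rsync command that does not try to preserve owner/group.

    Hypothetical helper for illustration; not part of dpdispatcher.
    """
    # Same ssh options as in the failing command from the log.
    ssh_cmd = [
        "ssh",
        "-o", "ConnectTimeout=10",
        "-o", "BatchMode=yes",
        "-o", "StrictHostKeyChecking=no",
        "-p", str(port),
        "-q",
    ]
    if key_filename is not None:
        ssh_cmd.extend(["-i", key_filename])

    return [
        "rsync",
        "-az",
        # -a implies -o and -g; these must come after -az to override it.
        "--no-owner",
        "--no-group",
        "-e", shlex.join(ssh_cmd),
        "-q",
        from_file,
        to_file,
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute the real remote and local locations.
    cmd = build_rsync_cmd(
        "user@host:/remote/path/archive.tar.gz",
        "/local/path/archive.tar.gz",
        port=65023,
        key_filename="sugon",
    )
    print(shlex.join(cmd))
```

Running the printed command by hand against the same paths would show whether dropping owner/group preservation is enough to avoid the error on this filesystem.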
DP-GEN Version
0.11.1.dev51+gbea559b
Platform, Python Version, Remote Platform, etc
Platform: bohrium
Python: 3.8.8
Remote Platform: Sugon
Input Files, Running Commands, Error Log, etc.
dpgen.zip
An extra Sugon secret (key) file named "sugon" is required.
command: dpgen init_bulk init.json machine.json
Steps to Reproduce
- Download the Sugon secret file and name it "sugon".
- Modify the fp section in machine.json.
- Submit the job: dpgen init_bulk init.json machine.json
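To narrow the problem down without running a full dpgen job, one could check whether the local work directory accepts ownership changes at all. The sketch below is a hypothetical diagnostic (the name `chown_allowed` is not from dpgen or dpdispatcher): it creates a scratch file in the given directory, attempts `os.chown` on it, and reports whether the call was permitted. Which uid/gid to test with is an assumption; the remote account's numeric ids would be the most relevant, but they are not given in this report.

```python
import os
import tempfile


def chown_allowed(directory: str, uid: int, gid: int) -> bool:
    """Return True if a scratch file in `directory` can be chown'ed to uid:gid."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        os.chown(path, uid, gid)
        return True
    except PermissionError:
        # Same errno (EPERM) that rsync reports as "Operation not permitted".
        return False
    finally:
        # Always clean up the scratch file.
        os.remove(path)


if __name__ == "__main__":
    # Example: test the current directory with the current user's own ids.
    print(chown_allowed(".", os.getuid(), os.getgid()))
```

If this returns False for the directory used as `local_root`, the failure is likely a property of that filesystem rather than of the remote Sugon side.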
Further Information, Files, and Links
No response