[BUG] rsync fails to receive data from remote platform #434

@pxlxingliang

Bug summary

I use dpgen to submit a job that runs the fp step on the Sugon platform. The fp configuration is:

    "fp": [
        {
            "command": "OMP_NUM_THREADS=1 mpirun -np 4 $abacus | tee out.log",
            "machine": {
                "batch_type": "Slurm",
                "context_type": "SSHContext",
                "local_root": "./",
                "remote_root": "/public/home/abacus/tmp",
                "remote_profile": {
                    "key_filename": "sugon",
                    "hostname": "cancon.hpccube.com",
                    "username": "abacus",
                    "port": 65023
                }
            },
            "resources": {
                "batch_type": "Slurm",
                "number_node": 1,
                "cpu_per_node": 32,
                "group_size": 1,
                "queue_name": "kshdnormal",
                "custom_flags": [
                    "#SBATCH --gres=dcu:4"
                ],
                "source_list": [
                    "/public/home/abacus/run_dcu.sh"
                ]
            }
        }
    ]

The fp job is submitted to Sugon and ABACUS runs successfully, but dpgen throws the error below when it downloads the results:

2024-01-23 13:53:23,653 - ERROR : Failed to run ['rsync', '-az', '-e', 'ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -p 65023 -q -i sugon', '-q', 'abacus@cancon.hpccube.com:/public/home/abacus/tmp/695809f93a5474bde7743bddb46cbd857e2906c6/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz', '/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz']: b'rsync: chown "/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/.695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz.sKchjf" failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n'
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 273, in try_download_result
    self.download_jobs()
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 501, in download_jobs
    self.machine.context.download(self)
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 675, in download
    self._get_files(
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 905, in _get_files
    self.ssh_session.get(from_f, to_f)
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 376, in get
    return rsync(
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/utils.py", line 136, in rsync
    raise RuntimeError(f"Failed to run {cmd}: {err}")
RuntimeError: Failed to run ['rsync', '-az', '-e', 'ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -p 65023 -q -i sugon', '-q', 'abacus@cancon.hpccube.com:/public/home/abacus/tmp/695809f93a5474bde7743bddb46cbd857e2906c6/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz', '/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz']: b'rsync: chown "/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/.695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz.sKchjf" failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n'
2024-01-23 13:53:23,655 - INFO : Retrying in 1 minute...

It seems that rsync tries to chown the downloaded file on the local side, and the chown fails.
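A possible explanation, offered as an assumption rather than a confirmed diagnosis: rsync's `-a` flag is shorthand for `-rlptgoD`, so it includes `-o` and `-g` (preserve owner and group). When the receiving process runs as root, rsync tries to chown the received file to the remote owner, and that can be rejected by the local filesystem. The sketch below rebuilds the command from the log with `--no-owner --no-group` appended. The function `build_rsync_cmd` and its parameters are hypothetical and are not dpdispatcher's API.

```python
from typing import List, Optional


def build_rsync_cmd(
    from_file: str,
    to_file: str,
    port: int = 22,
    key_filename: Optional[str] = None,
    preserve_ownership: bool = False,
) -> List[str]:
    """Build an rsync-over-ssh command list (hypothetical helper).

    Mirrors the command shown in the error log; the only change is the
    optional '--no-owner --no-group', which cancels the -o/-g implied by -a.
    """
    ssh_cmd = [
        "ssh",
        "-o", "ConnectTimeout=10",
        "-o", "BatchMode=yes",
        "-o", "StrictHostKeyChecking=no",
        "-p", str(port),
        "-q",
    ]
    if key_filename is not None:
        ssh_cmd.extend(["-i", key_filename])

    cmd = ["rsync", "-az", "-e", " ".join(ssh_cmd), "-q"]
    if not preserve_ownership:
        # Skip chown/chgrp on the receiving side.
        cmd.extend(["--no-owner", "--no-group"])
    cmd.extend([from_file, to_file])
    return cmd


if __name__ == "__main__":
    # Placeholder paths; substitute the real remote and local files.
    print(build_rsync_cmd(
        "user@host:/remote/task.tar.gz",
        "/local/task.tar.gz",
        port=65023,
        key_filename="sugon",
    ))
```

Running the printed command by hand against the failing file would show whether dropping ownership preservation is enough to make the download succeed.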

DP-GEN Version

0.11.1.dev51+gbea559b

Platform, Python Version, Remote Platform, etc

Platform: bohrium

Python: 3.8.8

Remote Platform: Sugon

Input Files, Running Commands, Error Log, etc.

dpgen.zip
An extra Sugon secret (SSH key) file named "sugon" is also needed.
Command: dpgen init_bulk init.json machine.json

Steps to Reproduce

  1. Download the Sugon secret file and name it "sugon".
  2. Modify the fp section in machine.json.
  3. Submit the job: dpgen init_bulk init.json machine.json

Further Information, Files, and Links

No response
