This part of the tutorial shows how to train models with multiple GPUs.
To train with multiple GPUs on a single machine, run:

```bash
python -m torch.distributed.launch --nproc_per_node=${NUMBER_GPUS} --master_port=${MASTER_PORT} scripts/train.py -c ${cfg_file}
```

- nproc_per_node (int): Number of GPUs on the current machine, for example `--nproc_per_node=8`.
- master_port (int): Master port, for example `--master_port=29527`.
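Under the hood, `torch.distributed.launch` spawns one Python process per GPU, passes each process a `--local_rank` argument, and sets environment variables such as `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE`. The training script is expected to bind to its GPU and join the process group before building the model. Below is a minimal sketch of such an entry point, not the repository's actual `scripts/train.py`; the `-c/--config` argument simply mirrors the command above.

```python
# Minimal sketch of a training entry point compatible with
# torch.distributed.launch (not the actual scripts/train.py).
import argparse

import torch
import torch.distributed as dist


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each spawned process.
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument("-c", "--config", type=str, required=True)
    args = parser.parse_args()

    # Bind this process to its GPU, then join the default process group.
    # init_process_group reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE
    # from the environment variables set by the launcher.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl")

    print(f"rank {dist.get_rank()}/{dist.get_world_size()} on GPU {args.local_rank}")
    # ... build the model, wrap it in DistributedDataParallel, and train ...


if __name__ == "__main__":
    main()
```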
As an example of multi-node training, suppose we have 2 nodes.
Node 1:
```bash
python -m torch.distributed.launch --nproc_per_node=${NUMBER_GPUS} --nnodes=2 --node_rank=0 --master_addr=${MASTER_IP_ADDRESS} --master_port=${MASTER_PORT} scripts/train.py -c ${cfg_file}
```

Node 2:

```bash
python -m torch.distributed.launch --nproc_per_node=${NUMBER_GPUS} --nnodes=2 --node_rank=1 --master_addr=${MASTER_IP_ADDRESS} --master_port=${MASTER_PORT} scripts/train.py -c ${cfg_file}
```

- nproc_per_node (int): Number of GPUs on the current machine, for example `--nproc_per_node=8`.
- nnodes (int): Number of nodes.
- node_rank (int): Rank of the current node, starting from 0.
- master_addr (str): IP address of the master node, for example `--master_addr=192.168.1.1`.
- master_port (int): Master port, for example `--master_port=29527`.
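These flags determine each process's global rank: the world size is `nnodes * nproc_per_node`, and a process's rank is `node_rank * nproc_per_node + local_rank`. A quick sanity check of this arithmetic for the 2-node example, assuming 8 GPUs per node:

```python
# Sanity check of rank assignment across nodes (assumed values:
# 2 nodes with 8 GPUs each, matching the example commands above).
nnodes, nproc_per_node = 2, 8
world_size = nnodes * nproc_per_node  # 16 processes in total

for node_rank in range(nnodes):
    for local_rank in range(nproc_per_node):
        rank = node_rank * nproc_per_node + local_rank
        # Node 0 holds ranks 0-7; node 1 holds ranks 8-15.
        assert 0 <= rank < world_size
```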
If a RuntimeError occurs during distributed training (typically because some model parameters are not used when computing the loss), add the following to the end of your configuration file:
```yaml
parallel:
  type: DistributedDataParallel
  find_unused_parameters: true
```
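Setting `find_unused_parameters: true` tells DistributedDataParallel to tolerate parameters that receive no gradient in a given forward pass; without it, DDP's gradient reducer waits for those gradients and raises the RuntimeError. Below is a hedged sketch of how a `parallel` section like this is commonly consumed when wrapping the model; the config path and loading code are assumptions for illustration, not this repository's actual loader, and the process group is assumed to be initialized already (see the sketch earlier in this section).

```python
# Sketch of consuming the `parallel` config above when wrapping the model.
# Assumes dist.init_process_group() has already been called.
import torch
import yaml
from torch.nn.parallel import DistributedDataParallel

with open("config.yml") as f:  # path is an assumption for this sketch
    cfg = yaml.safe_load(f)
parallel_cfg = cfg.get("parallel", {})

model = torch.nn.Linear(10, 10).cuda()  # placeholder for the real model

if parallel_cfg.get("type") == "DistributedDataParallel":
    model = DistributedDataParallel(
        model,
        device_ids=[torch.cuda.current_device()],
        # Tolerate parameters that receive no gradient in some iterations;
        # avoids the "expected to have finished reduction" RuntimeError.
        find_unused_parameters=parallel_cfg.get("find_unused_parameters", False),
    )
```

Note that `find_unused_parameters=True` adds an extra traversal of the autograd graph every iteration, so it is best enabled only when the model genuinely skips some parameters.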