Fixing RuntimeError: Default process group has not been initialized

PyTorch raises this error when code uses a torch.distributed feature, such as a collective like all_reduce() or the DistributedDataParallel wrapper, before the default process group exists. Every process must call init_process_group() once before using any distributed training functions.

To resolve this error, you can add the following code to your script before calling any distributed training functions:

import torch.distributed as dist
dist.init_process_group(backend='nccl', init_method='env://')

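This initializes the default process group with the NCCL backend, which requires CUDA GPUs; on CPU-only machines, use backend='gloo' instead. With init_method='env://', PyTorch reads the rendezvous settings from the environment variables MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE, which launchers such as torchrun set for each process. Adjust the backend and initialization method to match your setup.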
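If the call itself fails or hangs, a common cause is that the environment variables expected by init_method='env://' (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are not set, for example because the script was started with plain python instead of a launcher. The following is a minimal sketch of a guarded initialization, not a definitive implementation. The helper names missing_env_vars and ensure_process_group are hypothetical, and falling back to 'gloo' when no GPU is available is an assumption about what you want.

```python
import os

# Variables that init_method="env://" reads when rank and world size
# are not passed to init_process_group() explicitly.
REQUIRED_ENV_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_env_vars(environ=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


def ensure_process_group(backend=None):
    """Initialize the default process group once (hypothetical helper)."""
    # Imported here so the environment check above works without PyTorch.
    import torch
    import torch.distributed as dist

    if not dist.is_available():
        raise RuntimeError("This PyTorch build has no distributed support")
    if dist.is_initialized():
        return  # already initialized; calling init_process_group() again would fail

    missing = missing_env_vars()
    if missing:
        raise RuntimeError(
            "env:// initialization needs these variables: " + ", ".join(missing)
        )

    if backend is None:
        # Assumption: use NCCL when CUDA is available, otherwise Gloo.
        backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, init_method="env://")
```

Call ensure_process_group() at the top of the training script, before building DistributedDataParallel or issuing any collective. When the script is started with torchrun, the four variables are already set for each process.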

Also check that any checkpoint you load during training is valid and compatible with the current distributed setup.
