
How to Play with Single-Machine Multi-GPU in Paddle? (Non-Fleet Version)

阙博容
2023-12-01

I've always wanted to run Paddle on a single machine with multiple GPUs, but never quite dared to try. Today I'll work through a few demos to get a feel for it.
The environment is AI Studio's four-GPU setup with four V100S cards, and Fleet is not used.

1. Launching with paddle.distributed.launch

Two things are required:

  • launch the script with python -m paddle.distributed.launch
  • call dist.init_parallel_env() to initialize the distributed environment

Here is the first demo, an all-gather of data across the cards (taken from the official Paddle documentation):

The gist of the gather operation is that every process ends up holding the data from all the other processes.

import paddle
import paddle.distributed as dist

dist.init_parallel_env()    # <------ initialize the parallel training environment in dynamic graph mode
object_list = []
if dist.get_rank() == 0: # the variable content on card 0
    obj = {"foo": [1, 2, 3]}
else:                    # the variable content on the other cards
    obj = {"bar": [4, 5, 6]}
dist.all_gather_object(object_list, obj)
print(object_list)

Now let's run it:

python -m paddle.distributed.launch --devices=0,1,2 demo.py

The result looks like this; the data from the three processes is gathered together:

[{'foo': [1, 2, 3]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}]

--devices=0,1,2 selects the first three cards, which is why there are only three entries above. If I instead specify --devices=0,1,2,3

python -m paddle.distributed.launch --devices=0,1,2,3 demo.py

then the result is:

[{'foo': [1, 2, 3]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}]

By default, if --devices is not specified, all visible devices are used.
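
As a quick sanity check (a small sketch of my own, not part of the demo), you can ask Paddle how many GPUs the current process can see; with --devices omitted, launch starts one worker per visible GPU:

import paddle

# number of GPUs visible to this process;
# paddle.distributed.launch spawns one process per visible GPU when --devices is omitted
print(paddle.device.cuda.device_count())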

It's worth pointing out that if I don't initialize with init_parallel_env, or don't use paddle.distributed.launch on the command line, I get this AssertionError:

AssertionError: Call paddle.distributed.init_parallel_env first to initialize the distributed environment.

If only dist.init_parallel_env() is used (without paddle.distributed.launch), this warning appears first:

UserWarning: Currently not a parallel execution environment, `paddle.distributed.init_parallel_env` will not do anything.
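
Putting those two messages together, a defensive variant of the demo could look like the sketch below (my own variant, not from the Paddle docs): it only performs the collective when there really are multiple processes, so the same script also runs as a plain single-process program without hitting the AssertionError.

import paddle
import paddle.distributed as dist

# a no-op (with the warning above) when not started via paddle.distributed.launch
dist.init_parallel_env()

obj = {"foo": [1, 2, 3]} if dist.get_rank() == 0 else {"bar": [4, 5, 6]}

if dist.get_world_size() > 1:
    # real multi-process run: gather every process's object
    object_list = []
    dist.all_gather_object(object_list, obj)
else:
    # single-process fallback: nothing to gather
    object_list = [obj]

print(object_list)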

2. Launching with dist.spawn

Of course, starting the job with -m paddle.distributed.launch on the command line every time is inconvenient, so we can also use dist.spawn to start the multi-process job.

import paddle
import paddle.distributed as dist

def all_gather(_obj):
    object_list = []
    dist.all_gather_object(object_list, _obj)
    return object_list

def train():
    
    dist.init_parallel_env()

    if dist.get_rank() == 0:
        obj = {"foo": [1, 2, 3]}
    else:
        obj = {"bar": [4, 5, 6]}

    object_list = all_gather(obj)
    print(object_list)
    

if __name__ == "__main__":
    dist.spawn(train)

The output of running the file above:

I0127 13:56:40.694171  3098 tcp_utils.cc:107] Retry to connect to 127.0.0.1:60373 while the server is not yet listening.
I0127 13:56:40.708599  3100 tcp_utils.cc:107] Retry to connect to 127.0.0.1:60373 while the server is not yet listening.
I0127 13:56:40.713651  3096 tcp_utils.cc:107] Retry to connect to 127.0.0.1:60373 while the server is not yet listening.
I0127 13:56:40.715276  3094 tcp_utils.cc:181] The server starts to listen on IP_ANY:60373
I0127 13:56:40.715495  3094 tcp_utils.cc:130] Successfully connected to 127.0.0.1:60373
I0127 13:56:43.694442  3098 tcp_utils.cc:130] Successfully connected to 127.0.0.1:60373
I0127 13:56:43.708822  3100 tcp_utils.cc:130] Successfully connected to 127.0.0.1:60373
I0127 13:56:43.713865  3096 tcp_utils.cc:130] Successfully connected to 127.0.0.1:60373
W0127 13:56:45.969120  3094 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 13:56:45.972702  3094 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
W0127 13:56:46.605268  3100 gpu_resources.cc:61] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 13:56:46.607916  3098 gpu_resources.cc:61] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 13:56:46.608922  3096 gpu_resources.cc:61] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 13:56:46.609696  3100 gpu_resources.cc:91] device: 3, cuDNN Version: 8.2.
W0127 13:56:46.611501  3098 gpu_resources.cc:91] device: 2, cuDNN Version: 8.2.
W0127 13:56:46.612416  3096 gpu_resources.cc:91] device: 1, cuDNN Version: 8.2.
[{'foo': [1, 2, 3]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}]
[{'foo': [1, 2, 3]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}]
[{'foo': [1, 2, 3]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}]
[{'foo': [1, 2, 3]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}]

You can see that the process for each GPU has to communicate with the main process, and that print was executed four times, which is different from launching via the launch command line. To print only once, change the print to this:

    if dist.get_rank() == 3:
        print(object_list)

so that only the last card performs the print.
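
One caveat: rank 3 only exists when at least four processes are launched. A more portable pattern (just a sketch of mine, using rank 0 as the logging rank) works for any number of cards:

import paddle.distributed as dist

def print_on_rank0(*args, **kwargs):
    # rank 0 always exists, no matter how many GPUs were requested
    if dist.get_rank() == 0:
        print(*args, **kwargs)

Inside train() you would then call print_on_rank0(object_list) instead of print(object_list).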

By default, if the gpus parameter is not specified, all devices are used; we can pass gpus='1,3' to use cards 1 and 3.

if __name__ == "__main__":
    dist.spawn(train, gpus='1,3')

As with ordinary multiprocessing programs, it must be executed under if __name__ == "__main__" so that child processes do not recursively create more child processes.

For details, see Section 4 of:
https://blog.csdn.net/HaoZiHuang/article/details/127267686

And of course, if dist.init_parallel_env() is not called, you still get:

AssertionError: Call paddle.distributed.init_parallel_env first to initialize the distributed environment.

If the train function passed to dist.spawn needs arguments of its own, use args=(xxx, yyy, yyy):

if __name__ == '__main__':
    dist.spawn(train, args=(True,), gpus='4,5')
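
For reference, here is a minimal sketch of how such an argument arrives in the spawned function (the use_amp flag is a made-up example, and I use gpus='0,1' to match the four-card machine):

import paddle.distributed as dist

def train(use_amp):
    dist.init_parallel_env()
    # every spawned process receives the same positional arguments from args=(...)
    print("rank", dist.get_rank(), "use_amp =", use_amp)

if __name__ == "__main__":
    dist.spawn(train, args=(True,), gpus='0,1')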

3. is_initialized

This API tells you whether the distributed environment has already been initialized; a small demo makes it clear:

import paddle
import paddle.distributed as dist

def train(gpu_id):
    if gpu_id == dist.get_rank():
        print("ID:%s"%gpu_id, paddle.distributed.is_initialized())
    paddle.distributed.init_parallel_env()
    if gpu_id == dist.get_rank():
        print("ID:%s"%gpu_id, paddle.distributed.is_initialized())

if __name__ == "__main__":
    dist.spawn(train, args=(3,))

The output:

ID:3 False
I0127 14:15:23.948392  6527 tcp_utils.cc:181] The server starts to listen on IP_ANY:58409
I0127 14:15:23.948485  6531 tcp_utils.cc:130] Successfully connected to 127.0.0.1:58409
I0127 14:15:23.948487  6529 tcp_utils.cc:130] Successfully connected to 127.0.0.1:58409
I0127 14:15:23.948484  6533 tcp_utils.cc:130] Successfully connected to 127.0.0.1:58409
I0127 14:15:23.948619  6527 tcp_utils.cc:130] Successfully connected to 127.0.0.1:58409
W0127 14:15:26.104260  6527 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 14:15:26.107730  6527 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
W0127 14:15:26.214732  6531 gpu_resources.cc:61] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 14:15:26.218250  6531 gpu_resources.cc:91] device: 2, cuDNN Version: 8.2.
W0127 14:15:26.711598  6533 gpu_resources.cc:61] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 14:15:26.715206  6533 gpu_resources.cc:91] device: 3, cuDNN Version: 8.2.
W0127 14:15:26.733402  6529 gpu_resources.cc:61] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 14:15:26.736800  6529 gpu_resources.cc:91] device: 1, cuDNN Version: 8.2.
ID:3 True
I0127 14:15:27.105729  6596 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop
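
One practical use of this API (a sketch of my own, not from the demo): make initialization idempotent, so a helper can be called from code paths that may or may not have set up the environment already.

import paddle.distributed as dist

def ensure_parallel_env():
    # calling init_parallel_env twice is pointless, so guard it with is_initialized
    if not dist.is_initialized():
        dist.init_parallel_env()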

4. Using DistributedBatchSampler

# required: distributed
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
import paddle.optimizer as opt
import paddle.distributed as dist
from tqdm import tqdm
import numpy as np
from paddle.io import Dataset, DistributedBatchSampler
from paddle.vision.transforms import ToTensor
from paddle.io import DataLoader

class MyNet(paddle.nn.Layer):
    def __init__(self, num_classes=1):
        super(MyNet, self).__init__()

        self.conv1 = paddle.nn.Conv2D(in_channels=3, out_channels=32, kernel_size=(3, 3))
        self.pool1 = paddle.nn.MaxPool2D(kernel_size=2, stride=2)

        self.conv2 = paddle.nn.Conv2D(in_channels=32, out_channels=64, kernel_size=(3,3))
        self.pool2 = paddle.nn.MaxPool2D(kernel_size=2, stride=2)

        self.conv3 = paddle.nn.Conv2D(in_channels=64, out_channels=64, kernel_size=(3,3))

        self.flatten = paddle.nn.Flatten()

        self.linear1 = paddle.nn.Linear(in_features=1024, out_features=64)
        self.linear2 = paddle.nn.Linear(in_features=64, out_features=num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.pool1(x)

        x = self.conv2(x)
        x = F.relu(x)
        x = self.pool2(x)

        x = self.conv3(x)
        x = F.relu(x)

        x = self.flatten(x)
        x = self.linear1(x)
        x = F.relu(x)
        x = self.linear2(x)
        return x

def train(model, opt, train_loader):
 
    epoch_num = 10
    # batch_size = 32

    for epoch in tqdm(range(epoch_num)):
        model.train()
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = paddle.to_tensor(data[1])
            y_data = paddle.unsqueeze(y_data, 1)

            logits = model(x_data)
            loss = F.cross_entropy(logits, y_data)

            # print(dist.get_rank(), loss.item())

            if batch_id % 1000 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, loss.numpy()))
            loss.backward()
            opt.step()
            opt.clear_grad()

if __name__ == '__main__':
    
    # 1. initialize parallel environment
    dist.init_parallel_env()   

    # init with dataset
    transform = ToTensor()
    cifar10_train = paddle.vision.datasets.Cifar10(mode='train', download=True,
                                           transform=transform)
    # cifar10_test = paddle.vision.datasets.Cifar10(mode='test', download=True,
    #                                       transform=transform)

    # Build the dataset pipeline for distributed training
    train_sampler = DistributedBatchSampler(cifar10_train, 32, shuffle=True, drop_last=True)
    train_loader = DataLoader(cifar10_train, batch_sampler=train_sampler, num_workers=4, use_shared_memory=True)

    # valid_sampler = DistributedBatchSampler(cifar10_test, 32, drop_last=True)
    # valid_loader = DataLoader(cifar10_test, batch_sampler=valid_sampler, num_workers=2)

    model = MyNet(num_classes=10)
    # 3. Build the network model for distributed training
    model = paddle.DataParallel(model)

    learning_rate = 0.001
    opt = paddle.optimizer.Adam(learning_rate=learning_rate, parameters=model.parameters())

    # dist.spawn(train, nprocs=4, gpus="0,1,2,3", args=(model, opt, train_loader))
    # dist.spawn(train, nprocs=2, gpus="0,1")

    train(model, opt, train_loader)

    # python -m paddle.distributed.launch demo.py --devices=0,1
    # python -m paddle.distributed.launch demo.py --devices=0,1,2,3


Log of the run on 4 V100S cards, about 20 s:

LAUNCH INFO 2023-01-27 16:17:46,244 -----------  Configuration  ----------------------
LAUNCH INFO 2023-01-27 16:17:46,244 devices: None
LAUNCH INFO 2023-01-27 16:17:46,244 elastic_level: -1
LAUNCH INFO 2023-01-27 16:17:46,244 elastic_timeout: 30
LAUNCH INFO 2023-01-27 16:17:46,244 gloo_port: 6767
LAUNCH INFO 2023-01-27 16:17:46,244 host: None
LAUNCH INFO 2023-01-27 16:17:46,244 ips: None
LAUNCH INFO 2023-01-27 16:17:46,244 job_id: default
LAUNCH INFO 2023-01-27 16:17:46,244 legacy: False
LAUNCH INFO 2023-01-27 16:17:46,244 log_dir: log
LAUNCH INFO 2023-01-27 16:17:46,244 log_level: INFO
LAUNCH INFO 2023-01-27 16:17:46,244 master: None
LAUNCH INFO 2023-01-27 16:17:46,244 max_restart: 3
LAUNCH INFO 2023-01-27 16:17:46,244 nnodes: 1
LAUNCH INFO 2023-01-27 16:17:46,244 nproc_per_node: None
LAUNCH INFO 2023-01-27 16:17:46,244 rank: -1
LAUNCH INFO 2023-01-27 16:17:46,244 run_mode: collective
LAUNCH INFO 2023-01-27 16:17:46,244 server_num: None
LAUNCH INFO 2023-01-27 16:17:46,244 servers: 
LAUNCH INFO 2023-01-27 16:17:46,244 start_port: 6070
LAUNCH INFO 2023-01-27 16:17:46,244 trainer_num: None
LAUNCH INFO 2023-01-27 16:17:46,244 trainers: 
LAUNCH INFO 2023-01-27 16:17:46,244 training_script: demo.py
LAUNCH INFO 2023-01-27 16:17:46,244 training_script_args: []
LAUNCH INFO 2023-01-27 16:17:46,245 with_gloo: 1
LAUNCH INFO 2023-01-27 16:17:46,245 --------------------------------------------------
LAUNCH INFO 2023-01-27 16:17:46,245 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2023-01-27 16:17:46,258 Run Pod: sedgso, replicas 4, status ready
LAUNCH INFO 2023-01-27 16:17:46,306 Watching Pod: sedgso, replicas 4, status running
I0127 16:17:48.042634 17109 tcp_utils.cc:181] The server starts to listen on IP_ANY:46296
I0127 16:17:48.042860 17109 tcp_utils.cc:130] Successfully connected to 10.156.36.186:46296
W0127 16:17:53.273406 17109 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 16:17:53.277031 17109 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.

  0%|                                                                                                                                           | 0/10 [00:00<?, ?it/s]epoch: 0, batch_id: 0, loss is: [2.4006782]

 10%|█████████████                                                                                                                      | 1/10 [00:04<00:40,  4.52s/it]epoch: 1, batch_id: 0, loss is: [1.1844883]

 20%|██████████████████████████▏                                                                                                        | 2/10 [00:06<00:24,  3.05s/it]epoch: 2, batch_id: 0, loss is: [1.3358431]

 30%|███████████████████████████████████████▎                                                                                           | 3/10 [00:08<00:18,  2.58s/it]epoch: 3, batch_id: 0, loss is: [1.28385]

 40%|████████████████████████████████████████████████████▍                                                                              | 4/10 [00:10<00:14,  2.34s/it]epoch: 4, batch_id: 0, loss is: [0.8001609]

 50%|█████████████████████████████████████████████████████████████████▌                                                                 | 5/10 [00:12<00:11,  2.24s/it]epoch: 5, batch_id: 0, loss is: [1.098891]

 60%|██████████████████████████████████████████████████████████████████████████████▌                                                    | 6/10 [00:14<00:08,  2.19s/it]epoch: 6, batch_id: 0, loss is: [0.9990262]

 70%|███████████████████████████████████████████████████████████████████████████████████████████▋                                       | 7/10 [00:16<00:06,  2.16s/it]epoch: 7, batch_id: 0, loss is: [0.62800074]

 80%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊                          | 8/10 [00:18<00:04,  2.10s/it]epoch: 8, batch_id: 0, loss is: [0.55639195]

 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉             | 9/10 [00:20<00:02,  2.07s/it]epoch: 9, batch_id: 0, loss is: [0.59547466]

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:22<00:00,  2.04s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:22<00:00,  2.27s/it]
I0127 16:18:22.867241 17181 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop
LAUNCH INFO 2023-01-27 16:18:25,354 Pod completed
LAUNCH INFO 2023-01-27 16:18:25,354 Exit code 0

Single-card program:

import paddle
import paddle.nn as nn
import paddle.nn.functional as F
import paddle.optimizer as opt
import paddle.distributed as dist
from tqdm import tqdm
import numpy as np
from paddle.io import Dataset, DistributedBatchSampler, BatchSampler
from paddle.vision.transforms import ToTensor
from paddle.io import DataLoader

class MyNet(paddle.nn.Layer):
    def __init__(self, num_classes=1):
        super(MyNet, self).__init__()

        self.conv1 = paddle.nn.Conv2D(in_channels=3, out_channels=32, kernel_size=(3, 3))
        self.pool1 = paddle.nn.MaxPool2D(kernel_size=2, stride=2)

        self.conv2 = paddle.nn.Conv2D(in_channels=32, out_channels=64, kernel_size=(3,3))
        self.pool2 = paddle.nn.MaxPool2D(kernel_size=2, stride=2)

        self.conv3 = paddle.nn.Conv2D(in_channels=64, out_channels=64, kernel_size=(3,3))

        self.flatten = paddle.nn.Flatten()

        self.linear1 = paddle.nn.Linear(in_features=1024, out_features=64)
        self.linear2 = paddle.nn.Linear(in_features=64, out_features=num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.pool1(x)

        x = self.conv2(x)
        x = F.relu(x)
        x = self.pool2(x)

        x = self.conv3(x)
        x = F.relu(x)

        x = self.flatten(x)
        x = self.linear1(x)
        x = F.relu(x)
        x = self.linear2(x)
        return x

def train(model, opt, train_loader):
 
    epoch_num = 10
    # batch_size = 32

    for epoch in tqdm(range(epoch_num)):
        model.train()
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = paddle.to_tensor(data[1])
            y_data = paddle.unsqueeze(y_data, 1)

            logits = model(x_data)
            loss = F.cross_entropy(logits, y_data)

            # print(dist.get_rank(), loss.item())

            if batch_id % 1000 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, loss.numpy()))
            loss.backward()
            opt.step()
            opt.clear_grad()

if __name__ == '__main__':
    
    # 1. initialize parallel environment
    # dist.init_parallel_env()   

    # init with dataset
    transform = ToTensor()
    cifar10_train = paddle.vision.datasets.Cifar10(mode='train', download=True,
                                           transform=transform)
    # cifar10_test = paddle.vision.datasets.Cifar10(mode='test', download=True,
    #                                       transform=transform)

    # Dataset pipeline for distributed training (replaced by a plain BatchSampler below)
    # train_sampler = DistributedBatchSampler(cifar10_train, 32, shuffle=True, drop_last=True)
    # train_loader = DataLoader(cifar10_train, batch_sampler=train_sampler, num_workers=4, use_shared_memory=True)

    train_sampler = BatchSampler(dataset=cifar10_train, batch_size=32, shuffle=True, drop_last=True)
    train_loader = DataLoader(cifar10_train, batch_sampler=train_sampler, num_workers=4, use_shared_memory=True)

    # valid_sampler = DistributedBatchSampler(cifar10_test, 32, drop_last=True)
    # valid_loader = DataLoader(cifar10_test, batch_sampler=valid_sampler, num_workers=2)

    model = MyNet(num_classes=10)
    # 3. Model wrapper for distributed training (not used in the single-card run)
    # model = paddle.DataParallel(model)

    learning_rate = 0.001
    opt = paddle.optimizer.Adam(learning_rate=learning_rate, parameters=model.parameters())

    # dist.spawn(train, nprocs=4, gpus="0,1,2,3", args=(model, opt, train_loader))
    # dist.spawn(train, nprocs=2, gpus="0,1")

    train(model, opt, train_loader)

    # python -m paddle.distributed.launch demo.py --devices=0,1
    # python -m paddle.distributed.launch demo.py --devices=0,1,2,3

Single-card log, about 58 s:

W0127 16:12:34.162158 32894 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 16:12:34.165545 32894 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
  0%|                                                                                                                                            | 0/10 [00:00<?, ?it/s]epoch: 0, batch_id: 0, loss is: [2.5992696]
epoch: 0, batch_id: 1000, loss is: [1.3871325]
 10%|█████████████▏                                                                                                                      | 1/10 [00:07<01:08,  7.62s/it]epoch: 1, batch_id: 0, loss is: [0.84788895]
epoch: 1, batch_id: 1000, loss is: [0.89925337]
 20%|██████████████████████████▍                                                                                                         | 2/10 [00:13<00:51,  6.44s/it]epoch: 2, batch_id: 0, loss is: [1.2445617]
epoch: 2, batch_id: 1000, loss is: [0.70835286]
 30%|███████████████████████████████████████▌                                                                                            | 3/10 [00:18<00:42,  6.12s/it]epoch: 3, batch_id: 0, loss is: [0.7969709]
epoch: 3, batch_id: 1000, loss is: [0.75992584]
 40%|████████████████████████████████████████████████████▊                                                                               | 4/10 [00:24<00:35,  5.97s/it]epoch: 4, batch_id: 0, loss is: [0.7136339]
epoch: 4, batch_id: 1000, loss is: [0.84568065]
 50%|██████████████████████████████████████████████████████████████████                                                                  | 5/10 [00:30<00:29,  5.90s/it]epoch: 5, batch_id: 0, loss is: [0.54131997]
epoch: 5, batch_id: 1000, loss is: [0.8754035]
 60%|███████████████████████████████████████████████████████████████████████████████▏                                                    | 6/10 [00:36<00:23,  5.85s/it]epoch: 6, batch_id: 0, loss is: [0.62274516]
epoch: 6, batch_id: 1000, loss is: [0.29540402]
 70%|████████████████████████████████████████████████████████████████████████████████████████████▍                                       | 7/10 [00:41<00:17,  5.79s/it]epoch: 7, batch_id: 0, loss is: [0.6250535]
epoch: 7, batch_id: 1000, loss is: [0.76928544]
 80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▌                          | 8/10 [00:47<00:11,  5.73s/it]epoch: 8, batch_id: 0, loss is: [0.63512653]
epoch: 8, batch_id: 1000, loss is: [0.57084846]
 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊             | 9/10 [00:53<00:05,  5.67s/it]epoch: 9, batch_id: 0, loss is: [0.44904262]
epoch: 9, batch_id: 1000, loss is: [0.54245126]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:58<00:00,  5.87s/it]
  • paddle.DataParallel requires dist.init_parallel_env() to be called first
  • If an ordinary BatchSampler is used instead of DistributedBatchSampler, this program hangs with no output at all (at least on the current Paddle 2.4.0)
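
One thing the multi-card demo above does not do: when shuffle=True, the DistributedBatchSampler can be told which epoch it is, so the shuffle order changes from epoch to epoch while staying consistent across ranks. A small sketch of how the training loop above could be adapted (based on my reading of the paddle.io.DistributedBatchSampler API; please double-check against your Paddle version):

def train(model, opt, train_loader, train_sampler):
    epoch_num = 10
    for epoch in range(epoch_num):
        # re-seed the distributed shuffle so every epoch sees a different ordering
        train_sampler.set_epoch(epoch)
        model.train()
        for batch_id, data in enumerate(train_loader()):
            ...  # same loop body as the train() above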