I've been wanting to run single-machine multi-GPU training with Paddle for a while, but never quite dared to try it; today I'm finally taking it for a spin through a few demos.
The environment is AI Studio's four-card setup, four V100s, and Fleet is not used.
Two things are required:
python -m paddle.distributed.launch
dist.init_parallel_env()
that is, launching through the launch module and calling init_parallel_env() to initialize the distributed environment. Now for the first demo: gathering data across all the cards (taken from the official Paddle docs).
The gist of the gather operation is that every process ends up holding the data of every other process.
import paddle
import paddle.distributed as dist
dist.init_parallel_env()   # <------ initialize the parallel training environment in dynamic-graph mode
object_list = []
if dist.get_rank() == 0:   # content of the variable on card 0
    obj = {"foo": [1, 2, 3]}
else:                      # content of the variable on the non-zero cards
    obj = {"bar": [4, 5, 6]}
dist.all_gather_object(object_list, obj)
print(object_list)
Now let's run it:
python -m paddle.distributed.launch --devices=0,1,2 demo.py
The result looks like this, with the data of the three processes gathered together:
[{'foo': [1, 2, 3]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}]
--devices=0,1,2 selects the first three cards, which is why there are only three entries above. If I instead specify --devices=0,1,2,3:
python -m paddle.distributed.launch --devices=0,1,2,3 demo.py
then the result is:
[{'foo': [1, 2, 3]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}]
By default, if --devices is not specified, all visible devices are included.
It's worth noting that if I skip the init_parallel_env initialization, or don't start the script with paddle.distributed.launch on the command line, I get this AssertionError:
AssertionError: Call paddle.distributed.init_parallel_env first to initialize the distributed environment.
If only dist.init_parallel_env() is used (i.e. the script is not started through the launch module), this warning appears before the error:
UserWarning: Currently not a parallel execution environment, `paddle.distributed.init_parallel_env` will not do anything.
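If you want the same script to also run cleanly as a plain single process, one option is to guard the collective call yourself. A minimal sketch; the helper name and the single-process fallback are my own, not something from the Paddle docs:

import paddle.distributed as dist

def gather_across_ranks(obj):
    # Outside a distributed launch there is nothing to gather, so just wrap the local object;
    # otherwise run the real all_gather_object across all ranks.
    if not dist.is_initialized():
        return [obj]
    object_list = []
    dist.all_gather_object(object_list, obj)
    return object_list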
Of course, starting from the command line with -m paddle.distributed.launch every time is inconvenient; we can also use dist.spawn to start the multi-process job.
import paddle
import paddle.distributed as dist
def all_gather(_obj):
    object_list = []
    dist.all_gather_object(object_list, _obj)
    return object_list

def train():
    dist.init_parallel_env()
    if dist.get_rank() == 0:
        obj = {"foo": [1, 2, 3]}
    else:
        obj = {"bar": [4, 5, 6]}
    object_list = all_gather(obj)
    print(object_list)

if __name__ == "__main__":
    dist.spawn(train)
Output of running the file above:
I0127 13:56:40.694171 3098 tcp_utils.cc:107] Retry to connect to 127.0.0.1:60373 while the server is not yet listening.
I0127 13:56:40.708599 3100 tcp_utils.cc:107] Retry to connect to 127.0.0.1:60373 while the server is not yet listening.
I0127 13:56:40.713651 3096 tcp_utils.cc:107] Retry to connect to 127.0.0.1:60373 while the server is not yet listening.
I0127 13:56:40.715276 3094 tcp_utils.cc:181] The server starts to listen on IP_ANY:60373
I0127 13:56:40.715495 3094 tcp_utils.cc:130] Successfully connected to 127.0.0.1:60373
I0127 13:56:43.694442 3098 tcp_utils.cc:130] Successfully connected to 127.0.0.1:60373
I0127 13:56:43.708822 3100 tcp_utils.cc:130] Successfully connected to 127.0.0.1:60373
I0127 13:56:43.713865 3096 tcp_utils.cc:130] Successfully connected to 127.0.0.1:60373
W0127 13:56:45.969120 3094 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 13:56:45.972702 3094 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
W0127 13:56:46.605268 3100 gpu_resources.cc:61] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 13:56:46.607916 3098 gpu_resources.cc:61] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 13:56:46.608922 3096 gpu_resources.cc:61] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 13:56:46.609696 3100 gpu_resources.cc:91] device: 3, cuDNN Version: 8.2.
W0127 13:56:46.611501 3098 gpu_resources.cc:91] device: 2, cuDNN Version: 8.2.
W0127 13:56:46.612416 3096 gpu_resources.cc:91] device: 1, cuDNN Version: 8.2.
[{'foo': [1, 2, 3]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}]
[{'foo': [1, 2, 3]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}]
[{'foo': [1, 2, 3]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}]
[{'foo': [1, 2, 3]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}, {'bar': [4, 5, 6]}]
You can see that the process for each GPU connects back to the master process, and that the print ran four times, which is different from what you see when starting with the launch command line. To print only once, the print can be changed to:
if dist.get_rank() == 3:
    print(object_list)
so that only the last card does the printing.
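Personally I would key this on rank 0 instead, which is the more usual convention; this is just a variant of the line above, not part of the original demo:

if dist.get_rank() == 0:
    print(object_list)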
By default, if the gpus argument is not given, all devices are used; we can pass gpus='1,3' to use cards 1 and 3:
if __name__ == "__main__":
    dist.spawn(train, gpus='1,3')
As with ordinary multi-process programs, this must be run under if __name__ == "__main__" to keep child processes from recursively spawning more child processes. For details, see part 4 of:
https://blog.csdn.net/HaoZiHuang/article/details/127267686
Of course, if dist.init_parallel_env() is not called here either, the same error is raised:
AssertionError: Call paddle.distributed.init_parallel_env first to initialize the distributed environment.
If the train function passed to dist.spawn needs arguments of its own, use args=(xxx, yyy, yyy):
if __name__ == '__main__':
    dist.spawn(train, args=(True,), gpus='4,5')
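For completeness, here is a small sketch of how such an argument shows up inside the spawned function; the parameter name verbose and the world-size print are purely my own illustration:

import paddle.distributed as dist

def train(verbose):
    dist.init_parallel_env()
    # everything in args=(...) is forwarded to train() positionally
    if verbose and dist.get_rank() == 0:
        print("world size:", dist.get_world_size())

if __name__ == "__main__":
    dist.spawn(train, args=(True,), gpus='0,1')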
The paddle.distributed.is_initialized API tells you whether the distributed environment has already been initialized; a small demo makes this clear.
import paddle
import paddle.distributed as dist
def train(gpu_id):
    if gpu_id == dist.get_rank():
        print("ID:%s" % gpu_id, paddle.distributed.is_initialized())

    paddle.distributed.init_parallel_env()

    if gpu_id == dist.get_rank():
        print("ID:%s" % gpu_id, paddle.distributed.is_initialized())

if __name__ == "__main__":
    dist.spawn(train, args=(3,))
The output:
ID:3 False
I0127 14:15:23.948392 6527 tcp_utils.cc:181] The server starts to listen on IP_ANY:58409
I0127 14:15:23.948485 6531 tcp_utils.cc:130] Successfully connected to 127.0.0.1:58409
I0127 14:15:23.948487 6529 tcp_utils.cc:130] Successfully connected to 127.0.0.1:58409
I0127 14:15:23.948484 6533 tcp_utils.cc:130] Successfully connected to 127.0.0.1:58409
I0127 14:15:23.948619 6527 tcp_utils.cc:130] Successfully connected to 127.0.0.1:58409
W0127 14:15:26.104260 6527 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 14:15:26.107730 6527 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
W0127 14:15:26.214732 6531 gpu_resources.cc:61] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 14:15:26.218250 6531 gpu_resources.cc:91] device: 2, cuDNN Version: 8.2.
W0127 14:15:26.711598 6533 gpu_resources.cc:61] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 14:15:26.715206 6533 gpu_resources.cc:91] device: 3, cuDNN Version: 8.2.
W0127 14:15:26.733402 6529 gpu_resources.cc:61] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 14:15:26.736800 6529 gpu_resources.cc:91] device: 1, cuDNN Version: 8.2.
ID:3 True
I0127 14:15:27.105729 6596 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop
Next, a complete multi-card training demo on CIFAR-10:
# required: distributed
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
import paddle.optimizer as opt
import paddle.distributed as dist
from tqdm import tqdm
import numpy as np
from paddle.io import Dataset, DistributedBatchSampler
from paddle.vision.transforms import ToTensor
from paddle.io import DataLoader

class MyNet(paddle.nn.Layer):
    def __init__(self, num_classes=1):
        super(MyNet, self).__init__()
        self.conv1 = paddle.nn.Conv2D(in_channels=3, out_channels=32, kernel_size=(3, 3))
        self.pool1 = paddle.nn.MaxPool2D(kernel_size=2, stride=2)
        self.conv2 = paddle.nn.Conv2D(in_channels=32, out_channels=64, kernel_size=(3, 3))
        self.pool2 = paddle.nn.MaxPool2D(kernel_size=2, stride=2)
        self.conv3 = paddle.nn.Conv2D(in_channels=64, out_channels=64, kernel_size=(3, 3))
        self.flatten = paddle.nn.Flatten()
        self.linear1 = paddle.nn.Linear(in_features=1024, out_features=64)
        self.linear2 = paddle.nn.Linear(in_features=64, out_features=num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.pool2(x)
        x = self.conv3(x)
        x = F.relu(x)
        x = self.flatten(x)
        x = self.linear1(x)
        x = F.relu(x)
        x = self.linear2(x)
        return x

def train(model, opt, train_loader):
    epoch_num = 10
    # batch_size = 32
    for epoch in tqdm(range(epoch_num)):
        model.train()
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = paddle.to_tensor(data[1])
            y_data = paddle.unsqueeze(y_data, 1)

            logits = model(x_data)
            loss = F.cross_entropy(logits, y_data)
            # print(dist.get_rank(), loss.item())

            if batch_id % 1000 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, loss.numpy()))

            loss.backward()
            opt.step()
            opt.clear_grad()

if __name__ == '__main__':
    # 1. initialize the parallel environment
    dist.init_parallel_env()

    # init with dataset
    transform = ToTensor()
    cifar10_train = paddle.vision.datasets.Cifar10(mode='train', download=True,
                                                   transform=transform)
    # cifar10_test = paddle.vision.datasets.Cifar10(mode='test', download=True,
    #                                               transform=transform)

    # 2. build the batch sampler used for distributed training
    train_sampler = DistributedBatchSampler(cifar10_train, 32, shuffle=True, drop_last=True)
    train_loader = DataLoader(cifar10_train, batch_sampler=train_sampler, num_workers=4, use_shared_memory=True)
    # valid_sampler = DistributedBatchSampler(cifar10_test, 32, drop_last=True)
    # valid_loader = DataLoader(cifar10_test, batch_sampler=valid_sampler, num_workers=2)

    model = MyNet(num_classes=10)

    # 3. wrap the network for distributed (data-parallel) training
    model = paddle.DataParallel(model)

    learning_rate = 0.001
    opt = paddle.optimizer.Adam(learning_rate=learning_rate, parameters=model.parameters())

    # dist.spawn(train, nprocs=4, gpus="0,1,2,3", args=(model, opt, train_loader))
    # dist.spawn(train, nprocs=2, gpus="0,1")
    train(model, opt, train_loader)
# python -m paddle.distributed.launch --devices=0,1 demo.py
# python -m paddle.distributed.launch --devices=0,1,2,3 demo.py
Log from running on 4 V100S cards, roughly 20 s:
LAUNCH INFO 2023-01-27 16:17:46,244 ----------- Configuration ----------------------
LAUNCH INFO 2023-01-27 16:17:46,244 devices: None
LAUNCH INFO 2023-01-27 16:17:46,244 elastic_level: -1
LAUNCH INFO 2023-01-27 16:17:46,244 elastic_timeout: 30
LAUNCH INFO 2023-01-27 16:17:46,244 gloo_port: 6767
LAUNCH INFO 2023-01-27 16:17:46,244 host: None
LAUNCH INFO 2023-01-27 16:17:46,244 ips: None
LAUNCH INFO 2023-01-27 16:17:46,244 job_id: default
LAUNCH INFO 2023-01-27 16:17:46,244 legacy: False
LAUNCH INFO 2023-01-27 16:17:46,244 log_dir: log
LAUNCH INFO 2023-01-27 16:17:46,244 log_level: INFO
LAUNCH INFO 2023-01-27 16:17:46,244 master: None
LAUNCH INFO 2023-01-27 16:17:46,244 max_restart: 3
LAUNCH INFO 2023-01-27 16:17:46,244 nnodes: 1
LAUNCH INFO 2023-01-27 16:17:46,244 nproc_per_node: None
LAUNCH INFO 2023-01-27 16:17:46,244 rank: -1
LAUNCH INFO 2023-01-27 16:17:46,244 run_mode: collective
LAUNCH INFO 2023-01-27 16:17:46,244 server_num: None
LAUNCH INFO 2023-01-27 16:17:46,244 servers:
LAUNCH INFO 2023-01-27 16:17:46,244 start_port: 6070
LAUNCH INFO 2023-01-27 16:17:46,244 trainer_num: None
LAUNCH INFO 2023-01-27 16:17:46,244 trainers:
LAUNCH INFO 2023-01-27 16:17:46,244 training_script: demo.py
LAUNCH INFO 2023-01-27 16:17:46,244 training_script_args: []
LAUNCH INFO 2023-01-27 16:17:46,245 with_gloo: 1
LAUNCH INFO 2023-01-27 16:17:46,245 --------------------------------------------------
LAUNCH INFO 2023-01-27 16:17:46,245 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2023-01-27 16:17:46,258 Run Pod: sedgso, replicas 4, status ready
LAUNCH INFO 2023-01-27 16:17:46,306 Watching Pod: sedgso, replicas 4, status running
I0127 16:17:48.042634 17109 tcp_utils.cc:181] The server starts to listen on IP_ANY:46296
I0127 16:17:48.042860 17109 tcp_utils.cc:130] Successfully connected to 10.156.36.186:46296
W0127 16:17:53.273406 17109 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 16:17:53.277031 17109 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
0%| | 0/10 [00:00<?, ?it/s]epoch: 0, batch_id: 0, loss is: [2.4006782]
10%|█████████████ | 1/10 [00:04<00:40, 4.52s/it]epoch: 1, batch_id: 0, loss is: [1.1844883]
20%|██████████████████████████▏ | 2/10 [00:06<00:24, 3.05s/it]epoch: 2, batch_id: 0, loss is: [1.3358431]
30%|███████████████████████████████████████▎ | 3/10 [00:08<00:18, 2.58s/it]epoch: 3, batch_id: 0, loss is: [1.28385]
40%|████████████████████████████████████████████████████▍ | 4/10 [00:10<00:14, 2.34s/it]epoch: 4, batch_id: 0, loss is: [0.8001609]
50%|█████████████████████████████████████████████████████████████████▌ | 5/10 [00:12<00:11, 2.24s/it]epoch: 5, batch_id: 0, loss is: [1.098891]
60%|██████████████████████████████████████████████████████████████████████████████▌ | 6/10 [00:14<00:08, 2.19s/it]epoch: 6, batch_id: 0, loss is: [0.9990262]
70%|███████████████████████████████████████████████████████████████████████████████████████████▋ | 7/10 [00:16<00:06, 2.16s/it]epoch: 7, batch_id: 0, loss is: [0.62800074]
80%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 8/10 [00:18<00:04, 2.10s/it]epoch: 8, batch_id: 0, loss is: [0.55639195]
90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 9/10 [00:20<00:02, 2.07s/it]epoch: 9, batch_id: 0, loss is: [0.59547466]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:22<00:00, 2.04s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:22<00:00, 2.27s/it]
I0127 16:18:22.867241 17181 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop
LAUNCH INFO 2023-01-27 16:18:25,354 Pod completed
LAUNCH INFO 2023-01-27 16:18:25,354 Exit code 0
The single-card version:
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
import paddle.optimizer as opt
import paddle.distributed as dist
from tqdm import tqdm
import numpy as np
from paddle.io import Dataset, DistributedBatchSampler, BatchSampler
from paddle.vision.transforms import ToTensor
from paddle.io import DataLoader

class MyNet(paddle.nn.Layer):
    def __init__(self, num_classes=1):
        super(MyNet, self).__init__()
        self.conv1 = paddle.nn.Conv2D(in_channels=3, out_channels=32, kernel_size=(3, 3))
        self.pool1 = paddle.nn.MaxPool2D(kernel_size=2, stride=2)
        self.conv2 = paddle.nn.Conv2D(in_channels=32, out_channels=64, kernel_size=(3, 3))
        self.pool2 = paddle.nn.MaxPool2D(kernel_size=2, stride=2)
        self.conv3 = paddle.nn.Conv2D(in_channels=64, out_channels=64, kernel_size=(3, 3))
        self.flatten = paddle.nn.Flatten()
        self.linear1 = paddle.nn.Linear(in_features=1024, out_features=64)
        self.linear2 = paddle.nn.Linear(in_features=64, out_features=num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.pool2(x)
        x = self.conv3(x)
        x = F.relu(x)
        x = self.flatten(x)
        x = self.linear1(x)
        x = F.relu(x)
        x = self.linear2(x)
        return x

def train(model, opt, train_loader):
    epoch_num = 10
    # batch_size = 32
    for epoch in tqdm(range(epoch_num)):
        model.train()
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = paddle.to_tensor(data[1])
            y_data = paddle.unsqueeze(y_data, 1)

            logits = model(x_data)
            loss = F.cross_entropy(logits, y_data)
            # print(dist.get_rank(), loss.item())

            if batch_id % 1000 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, loss.numpy()))

            loss.backward()
            opt.step()
            opt.clear_grad()

if __name__ == '__main__':
    # 1. initialize parallel environment (disabled for the single-card run)
    # dist.init_parallel_env()

    # init with dataset
    transform = ToTensor()
    cifar10_train = paddle.vision.datasets.Cifar10(mode='train', download=True,
                                                   transform=transform)
    # cifar10_test = paddle.vision.datasets.Cifar10(mode='test', download=True,
    #                                               transform=transform)

    # 2. a plain BatchSampler instead of the distributed one
    # train_sampler = DistributedBatchSampler(cifar10_train, 32, shuffle=True, drop_last=True)
    # train_loader = DataLoader(cifar10_train, batch_sampler=train_sampler, num_workers=4, use_shared_memory=True)
    train_sampler = BatchSampler(dataset=cifar10_train, batch_size=32, shuffle=True, drop_last=True)
    train_loader = DataLoader(cifar10_train, batch_sampler=train_sampler, num_workers=4, use_shared_memory=True)
    # valid_sampler = DistributedBatchSampler(cifar10_test, 32, drop_last=True)
    # valid_loader = DataLoader(cifar10_test, batch_sampler=valid_sampler, num_workers=2)

    model = MyNet(num_classes=10)

    # 3. the DataParallel wrapper is commented out as well
    # model = paddle.DataParallel(model)

    learning_rate = 0.001
    opt = paddle.optimizer.Adam(learning_rate=learning_rate, parameters=model.parameters())

    # dist.spawn(train, nprocs=4, gpus="0,1,2,3", args=(model, opt, train_loader))
    # dist.spawn(train, nprocs=2, gpus="0,1")
    train(model, opt, train_loader)
# python -m paddle.distributed.launch --devices=0,1 demo.py
# python -m paddle.distributed.launch --devices=0,1,2,3 demo.py
Single-card log, roughly 58 s:
W0127 16:12:34.162158 32894 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0127 16:12:34.165545 32894 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
0%| | 0/10 [00:00<?, ?it/s]epoch: 0, batch_id: 0, loss is: [2.5992696]
epoch: 0, batch_id: 1000, loss is: [1.3871325]
10%|█████████████▏ | 1/10 [00:07<01:08, 7.62s/it]epoch: 1, batch_id: 0, loss is: [0.84788895]
epoch: 1, batch_id: 1000, loss is: [0.89925337]
20%|██████████████████████████▍ | 2/10 [00:13<00:51, 6.44s/it]epoch: 2, batch_id: 0, loss is: [1.2445617]
epoch: 2, batch_id: 1000, loss is: [0.70835286]
30%|███████████████████████████████████████▌ | 3/10 [00:18<00:42, 6.12s/it]epoch: 3, batch_id: 0, loss is: [0.7969709]
epoch: 3, batch_id: 1000, loss is: [0.75992584]
40%|████████████████████████████████████████████████████▊ | 4/10 [00:24<00:35, 5.97s/it]epoch: 4, batch_id: 0, loss is: [0.7136339]
epoch: 4, batch_id: 1000, loss is: [0.84568065]
50%|██████████████████████████████████████████████████████████████████ | 5/10 [00:30<00:29, 5.90s/it]epoch: 5, batch_id: 0, loss is: [0.54131997]
epoch: 5, batch_id: 1000, loss is: [0.8754035]
60%|███████████████████████████████████████████████████████████████████████████████▏ | 6/10 [00:36<00:23, 5.85s/it]epoch: 6, batch_id: 0, loss is: [0.62274516]
epoch: 6, batch_id: 1000, loss is: [0.29540402]
70%|████████████████████████████████████████████████████████████████████████████████████████████▍ | 7/10 [00:41<00:17, 5.79s/it]epoch: 7, batch_id: 0, loss is: [0.6250535]
epoch: 7, batch_id: 1000, loss is: [0.76928544]
80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 8/10 [00:47<00:11, 5.73s/it]epoch: 8, batch_id: 0, loss is: [0.63512653]
epoch: 8, batch_id: 1000, loss is: [0.57084846]
90%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 9/10 [00:53<00:05, 5.67s/it]epoch: 9, batch_id: 0, loss is: [0.44904262]
epoch: 9, batch_id: 1000, loss is: [0.54245126]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:58<00:00, 5.87s/it]
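Before the takeaways, a rough back-of-the-envelope check on why the four-card run gets through its epochs so much faster; the arithmetic below is mine, not taken from the logs:

# CIFAR-10 has 50,000 training images; every card uses a per-card batch size of 32.
num_samples, batch_size, num_cards = 50_000, 32, 4
steps_single = num_samples // batch_size                 # ~1562 optimizer steps per epoch on one card
steps_multi  = num_samples // (batch_size * num_cards)   # ~390 steps per card with DistributedBatchSampler
print(steps_single, steps_multi)

This also explains why the single-card log still prints a batch_id of 1000 inside every epoch while the four-card log never reaches it.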
Two takeaways: paddle.DataParallel needs dist.init_parallel_env() to be called first for initialization, and multi-card training needs a DistributedBatchSampler. If a plain BatchSampler is used instead, the program simply hangs with no output at all (at least that's how it behaves on the current Paddle 2.4.0).
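To make one script usable both ways, I would probably pick the sampler based on the world size. A sketch, under the assumption that dist.get_world_size() returns 1 when the script is not started through launch or spawn:

import paddle.distributed as dist
from paddle.io import BatchSampler, DistributedBatchSampler, DataLoader

def build_train_loader(dataset, batch_size=32):
    # Hypothetical helper: use DistributedBatchSampler only when more than one card is involved.
    if dist.get_world_size() > 1:
        sampler = DistributedBatchSampler(dataset, batch_size, shuffle=True, drop_last=True)
    else:
        sampler = BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True, drop_last=True)
    return DataLoader(dataset, batch_sampler=sampler, num_workers=4, use_shared_memory=True)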