Since the project's code is fairly complex and hard to read…, I tried using Hugging Face's Accelerate to implement multi-GPU distributed training.
Accelerate mainly solves the problem of distributed training. At the beginning of a project you may only need to get things running on a single GPU, but to speed up training you will eventually want multiple GPUs. If you need to debug the code, it is recommended to run on the CPU first, because the errors produced there are more meaningful.
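For that kind of CPU-only debugging run, a minimal sketch is to force Accelerate onto the CPU with the cpu flag of the Accelerator constructor; the flag is part of Accelerate, the rest of the snippet is just illustrative:

from accelerate import Accelerator

# force everything onto the CPU so stack traces are easier to read
accelerator = Accelerator(cpu=True)
print(accelerator.device)  # prints "cpu"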
The main advantage of Accelerate is that only a few lines of an existing PyTorch training loop need to change, and the same script can then run on a CPU, a single GPU, multiple GPUs, or a TPU.
First, install Accelerate via pip or conda:
pip install accelerate
or
conda install -c conda-forge accelerate
On the machine you will train on, configure the training setup by running:
accelerate config
Answer the prompts to complete the configuration. For other configuration methods, such as writing the yaml file directly, see the official tutorial.
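As a sketch of the programmatic alternative, Accelerate also ships a small helper that writes a default config file from Python; write_basic_config lives in accelerate.utils, while the mixed_precision value below is only an illustrative choice:

from accelerate.utils import write_basic_config

# write a default config file (the same file `accelerate config` produces)
write_basic_config(mixed_precision="fp16")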
To check the resulting configuration:
accelerate env
The changes follow the migration tutorial:
https://huggingface.co/docs/accelerate/basic_tutorials/migration
A plain PyTorch training loop looks like this:
device = "cuda"
model.to(device)
for batch in training_dataloader:
optimizer.zero_grad()
inputs, targets = batch
inputs = inputs.to(device)
targets = targets.to(device)
outputs = model(inputs)
loss = loss_function(outputs, targets)
loss.backward()
optimizer.step()
scheduler.step()
How do we add Accelerate to this code?
from accelerate import Accelerator

accelerator = Accelerator()  # first, create the Accelerator instance

# pass everything related to training to prepare()
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

# device = "cuda"
# model.to(device)
for batch in training_dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    # inputs = inputs.to(device)
    # targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    # loss.backward()
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
That is all the changes needed; it is quite simple.
Note: if you still need a device, it is no longer hard-coded as "cuda"; use the one Accelerate picks instead:
# device = 'cuda'
device = accelerator.device
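For instance, any tensor you create yourself is not handled by prepare() and still has to be placed on that device explicitly; a minimal sketch (the noise tensor and its shape are purely illustrative):

import torch
from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device

# tensors built by hand must be moved to the device Accelerate selected
noise = torch.randn(8, 16, device=device)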
To launch the training, follow the launch tutorial:
https://huggingface.co/docs/accelerate/v0.17.1/en/basic_tutorials/launch
First, wrap the code above in a function and make it callable as a script, e.g.:
from accelerate import Accelerator

def main():
    accelerator = Accelerator()
    model, optimizer, training_dataloader, scheduler = accelerator.prepare(
        model, optimizer, training_dataloader, scheduler
    )
    for batch in training_dataloader:
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()

if __name__ == "__main__":
    main()
Since the configuration was already done above, this step can be skipped; but if you want to switch to a different training setup, say from 2 GPUs to 3, you need to reconfigure:
accelerate config
Then launch the training with:
accelerate launch {script_name.py} {--arg1} {--arg2} ...
This is only the simplest form of the command; for more complex usage, such as launching with a custom config file, see the official tutorial.
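Everything after the script name is forwarded to the script itself, so ordinary argparse arguments keep working; a minimal sketch (the script name train.py and the --lr flag are hypothetical):

# train.py (hypothetical name), launched with e.g.:
#   accelerate launch train.py --lr 1e-3
import argparse
from accelerate import Accelerator

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=1e-3)  # hypothetical argument
    args = parser.parse_args()

    accelerator = Accelerator()
    accelerator.print(f"learning rate: {args.lr}")  # prints only on the main process

if __name__ == "__main__":
    main()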
For experiment tracking with wandb, see:
https://huggingface.co/docs/accelerate/main/en/usage_guides/tracking
https://docs.wandb.ai/guides/integrations/accelerate
I stared at the Hugging Face tutorial for quite a while and still could not figure out how to pass additional wandb run parameters (I am clearly still a beginner!); in the end the wandb tutorial had the answer… pass them through the init_kwargs argument.
Example:
from accelerate import Accelerator

# Tell the Accelerator object to log with wandb
accelerator = Accelerator(log_with="wandb")

# Initialise your wandb run, passing wandb parameters and any config information
accelerator.init_trackers(
    project_name="my_project",
    config={"dropout": 0.1, "learning_rate": 1e-2},
    init_kwargs={"wandb": {"entity": "my-wandb-team"}},
)

...

# Log to wandb by calling `accelerator.log`, `step` is optional
accelerator.log({"train_loss": 1.12, "valid_loss": 0.8}, step=global_step)

# Make sure that the wandb tracker finishes correctly
accelerator.end_training()
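Any keyword that wandb.init accepts can go into the inner "wandb" dict of init_kwargs; a sketch along those lines (the entity, run name, and tags below are made-up values, not anything from the official example):

from accelerate import Accelerator

accelerator = Accelerator(log_with="wandb")

# everything inside init_kwargs["wandb"] is forwarded to wandb.init()
accelerator.init_trackers(
    project_name="my_project",
    config={"dropout": 0.1, "learning_rate": 1e-2},
    init_kwargs={
        "wandb": {
            "entity": "my-wandb-team",           # team/user that owns the project
            "name": "baseline-run",              # display name of the run (made up)
            "tags": ["baseline", "accelerate"],  # made-up tags
        }
    },
)

accelerator.end_training()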
Finally, the complete code looks like this:
from accelerate import Accelerator

def main():
    accelerator = Accelerator(log_with="wandb")  # first, create the Accelerator instance
    accelerator.init_trackers(
        project_name="my_project",
        config={"dropout": 0.1, "learning_rate": 1e-2},
        init_kwargs={"wandb": {"entity": "my-wandb-team"}},
    )

    # pass everything related to training to prepare()
    model, optimizer, training_dataloader, scheduler = accelerator.prepare(
        model, optimizer, training_dataloader, scheduler
    )

    # device = "cuda"
    # model.to(device)
    step = 0
    for batch in training_dataloader:
        optimizer.zero_grad()
        inputs, targets = batch
        # inputs = inputs.to(device)
        # targets = targets.to(device)
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.log({"train_loss": loss.item()}, step=step)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        step += 1

    # make sure the wandb tracker finishes correctly
    accelerator.end_training()

if __name__ == "__main__":
    main()
References:
https://huggingface.co/docs/accelerate/v0.17.1/en/index
https://docs.wandb.ai/guides/integrations/accelerate
Hugging Face Accelerate Super Charged With Weights & Biases