We observed the following when testing Colossal-AI:
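colossalai does not improve GPU memory utilization. We found this during testing: training models.shufflenet_v2_x1_0 with colossalai at BATCH_SIZE = 16384 runs out of GPU memory. We have filed issue #139 on hpcaitech/ColossalAI-Examples and have not received a reply yet.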
colossalai run is slightly faster than running the script directly with python.
However, debugging the code in PyCharm produces the following error:
/home/user/software/python/anaconda/anaconda3/envs/conda-general/bin/python /home/user/***/***/ColossalAI-Examples/image/resnet/train.py
Traceback (most recent call last):
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py", line 210, in launch_from_torch
    rank = int(os.environ['RANK'])
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'RANK'

During handling of the above exception, another exception occurred:

...
RuntimeError: Could not find 'RANK' in the torch environment, visit https://www.colossalai.org/ for more information on launching with torch
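As the traceback shows, the launch environment is not set up correctly: the RANK environment variable is missing. We filed an issue on the Colossal-AI GitHub repository: [BUG]: RuntimeError of “RANK” when running train.py of ResNet example on a single GPU #1074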
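The two-part traceback above can be reproduced with a small sketch. The function read_rank is a hypothetical stand-in for the lookup in launch_from_torch, not the library's actual code; it only illustrates why a missing RANK variable surfaces first as a KeyError and then as a RuntimeError.

```python
import os


def read_rank(env=None):
    """Hypothetical stand-in for the environment lookup that fails above."""
    env = os.environ if env is None else env
    try:
        return int(env["RANK"])
    except KeyError as exc:
        # Raising inside the except block chains the two exceptions, which is
        # why the traceback reports the KeyError first and the RuntimeError second.
        raise RuntimeError(f"Could not find {exc} in the torch environment")
```

Calling read_rank({"RANK": "0"}) returns 0, while calling it with an empty mapping raises the chained RuntimeError.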
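We later looked further into how colossalai launches programs and found that it relies on torchrun. torchrun is a launcher rather than a plain interpreter call: it is the command-line entry point for torch.distributed.run, which sets environment variables such as RANK for each worker process and then starts the training script. The ResNet example, for instance, has to be started with colossalai run --nproc_per_node 1 train.py. Because this command does not invoke the python interpreter on train.py directly, a PyCharm run configuration that simply executes train.py cannot debug a program that depends on colossalai run.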
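One possible workaround, which we have not verified against Colossal-AI itself, is to supply the variables that torchrun would normally export before the training script initializes. The sketch below assumes a single process on a single machine. The traceback confirms only RANK; that launch_from_torch also reads LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT is an assumption. The helper name ensure_torch_env is hypothetical, and port 29500 is assumed to be free.

```python
import os

# Variables that torchrun exports for each worker process.
# The values describe one process on one machine (an assumption for debugging).
SINGLE_PROCESS_DEFAULTS = {
    "RANK": "0",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "1",
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",  # assumed to be a free port
}


def ensure_torch_env(env=None):
    """Fill in missing torchrun-style variables without overwriting existing ones."""
    if env is None:
        env = os.environ
    for key, value in SINGLE_PROCESS_DEFAULTS.items():
        env.setdefault(key, value)
    return env


if __name__ == "__main__":
    ensure_torch_env()
    print({key: os.environ[key] for key in SINGLE_PROCESS_DEFAULTS})
```

Calling ensure_torch_env() at the top of train.py, or entering the same variables in the PyCharm run configuration, should at least get past the KeyError shown above. Because setdefault never overwrites existing values, the script would still behave normally when started through colossalai run.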