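When training an mmseg project across multiple machines with multiple GPUs, I ran the following commands.
On the first machine: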
NNODES=2 NODE_RANK=0 PORT=8888 MASTER_ADDR=192.168.XX.XX sh tools/dist_train.sh ./configs/temp.py 4
On the second machine:
NNODES=2 NODE_RANK=1 PORT=8888 MASTER_ADDR=192.168.XX.XX sh tools/dist_train.sh ./configs/temp.py 4
The error message was:
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:8888 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
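The error message shows that port 8888 is already in use (errno 98, Address already in use). The fix is simply to change PORT to a port that is free, for example 29050, 29051, and so on, using the same value in the command on both machines.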
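To avoid guessing at a replacement port, the candidates can be checked before launching. The following is only a minimal sketch using Python's standard socket module. The helper names is_port_free and first_free_port are hypothetical and are not part of mmseg or PyTorch. The check covers IPv4 only and is a snapshot in time, so another process could still take the port between the check and the launch.

```python
import socket


def is_port_free(port: int, host: str = "") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now.

    Hypothetical helper for illustration. An empty host means all local
    IPv4 interfaces.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically errno 98 (Address already in use) on Linux.
            return False
    return True


def first_free_port(candidates):
    """Return the first port in candidates that is free, or None."""
    for port in candidates:
        if is_port_free(port):
            return port
    return None


if __name__ == "__main__":
    # 8888 is the port from the failing command; 29050 and 29051 are the
    # replacement candidates mentioned above.
    print(first_free_port([8888, 29050, 29051]))
```

The check is most useful on the machine whose address is given as MASTER_ADDR (presumably the NODE_RANK=0 machine), since the failing server socket is opened there. The port it prints can then be passed as PORT on both machines.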
With that, the problem is solved!