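After deployment, the code ran correctly and the processes really were spread across the cluster. While running a variety of programs, I ran into an error: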
$ ~/OpenMpi/bin/mpiexec -np 10 ~/NetWorkTest
My rank is 2
My rank is 7
My rank is 0
My rank is 3
My rank is 6
My rank is 8
My rank is 4
My rank is 1
My rank is 5
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[18656,1],2]
  Exit code:    14
--------------------------------------------------------------------------
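What caused this? It turned out I had accidentally commented out the call to MPI_Finalize(). As a result, one process (rank 2 in the output above) exited first with a non-zero status, and the master process, mpiexec, caught it.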
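This reflects a fact: if a single process in the job dies or exits with a non-zero status, mpiexec terminates the whole set of processes.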
After putting the MPI_Finalize() call back, the error went away.