After installing mpich-3.4.2 on an Ubuntu 20 system, running the following sample code produces an error.
#include <mpi.h>
#include <iostream>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 17;
        int result = MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        if (result == MPI_SUCCESS)
            std::cout << "Rank 0 OK!" << std::endl;
    } else if (rank == 1) {
        int value;
        int result = MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                              MPI_STATUS_IGNORE);
        if (result == MPI_SUCCESS && value == 17)
            std::cout << "Rank 1 OK!" << std::endl;
    }

    MPI_Finalize();
    return 0;
}
The error output is as follows:
[dell@/usr/local/bin]$ mpic++ -o mpi-test mpi-test.cpp
[dell@/usr/local/bin]$ mpirun -np 2 ./mpi-test
No protocol specified
No protocol specified
[dell] *** An error occurred in MPI_Send
[dell] *** reported by process [1293090817,0]
[dell] *** on communicator MPI_COMM_WORLD
[dell] *** MPI_ERR_RANK: invalid rank
[dell] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dell] *** and potentially your MPI job)
[dell] *** An error occurred in MPI_Send
[dell] *** reported by process [1293025281,0]
[dell] *** on communicator MPI_COMM_WORLD
[dell] *** MPI_ERR_RANK: invalid rank
[dell] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dell] *** and potentially your MPI job)
Searching around online, my first suspicion was that two MPI implementations were installed side by side: Open MPI and MPICH. The error itself points in that direction: MPI_ERR_RANK on a send to rank 1 means each process sees an MPI_COMM_WORLD that contains only itself, which is what typically happens when the launcher and the MPI library linked into the program come from different implementations.
So I went into /usr/bin and /usr/local/bin and found:
[dell@/usr/bin]$ls -lh | grep mpi
-rwxr-xr-x 1 root root 11K Apr 27 2016 dumpiso
-rwxr-xr-x 1 root root 47K Mar 13 00:38 glib-compile-resources
lrwxrwxrwx 1 root root 53 Mar 13 00:38 glib-compile-schemas -> ../lib/x86_64-linux-gnu/glib-2.0/glib-compile-schemas
lrwxrwxrwx 1 root root 24 Nov 6 2020 mpic++ -> /etc/alternatives/mpic++
lrwxrwxrwx 1 root root 21 Nov 6 2020 mpicc -> /etc/alternatives/mpi
lrwxrwxrwx 1 root root 23 Nov 6 2020 mpiCC -> /etc/alternatives/mpiCC
lrwxrwxrwx 1 root root 12 Apr 15 2020 mpicc.openmpi -> opal_wrapper
lrwxrwxrwx 1 root root 12 Apr 15 2020 mpiCC.openmpi -> opal_wrapper
lrwxrwxrwx 1 root root 12 Apr 15 2020 mpic++.openmpi -> opal_wrapper
lrwxrwxrwx 1 root root 24 Nov 6 2020 mpicxx -> /etc/alternatives/mpicxx
lrwxrwxrwx 1 root root 12 Apr 15 2020 mpicxx.openmpi -> opal_wrapper
lrwxrwxrwx 1 root root 25 Nov 6 2020 mpiexec -> /etc/alternatives/mpiexec
-rwxr-xr-x 1 root root 20K Mar 22 2020 mpiexec.lam
lrwxrwxrwx 1 root root 7 Apr 15 2020 mpiexec.openmpi -> orterun
lrwxrwxrwx 1 root root 24 Nov 6 2020 mpif77 -> /etc/alternatives/mpif77
lrwxrwxrwx 1 root root 12 Apr 15 2020 mpif77.openmpi -> opal_wrapper
lrwxrwxrwx 1 root root 24 Nov 6 2020 mpif90 -> /etc/alternatives/mpif90
lrwxrwxrwx 1 root root 12 Apr 15 2020 mpif90.openmpi -> opal_wrapper
lrwxrwxrwx 1 root root 25 Nov 6 2020 mpifort -> /etc/alternatives/mpifort
lrwxrwxrwx 1 root root 12 Apr 15 2020 mpifort.openmpi -> opal_wrapper
-rwxr-xr-x 1 root root 27K Mar 22 2020 mpimsg
lrwxrwxrwx 1 root root 24 Nov 6 2020 mpirun -> /etc/alternatives/mpirun
-rwxr-xr-x 1 root root 39K Mar 22 2020 mpirun.lam
lrwxrwxrwx 1 root root 7 Apr 15 2020 mpirun.openmpi -> orterun
-rwxr-xr-x 1 root root 19K Mar 22 2020 mpitask
lrwxrwxrwx 1 root root 10 Apr 15 2020 ompi-clean -> orte-clean
-rwxr-xr-x 1 root root 31K Apr 15 2020 ompi_info
lrwxrwxrwx 1 root root 11 Apr 15 2020 ompi-server -> orte-server
-rwxr-xr-x 1 root root 12K Mar 13 2020 py3compile
-rwxr-xr-x 1 root root 12K Mar 13 2020 pycompile
-rwxr-xr-x 1 root root 15K Mar 26 2020 teckit_compile
[dell@/usr/local/bin]$ls -lh | grep mpi
lrwxrwxrwx 1 root root 6 Jun 26 19:00 mpic++ -> mpicxx
-rwxr-xr-x 1 root root 9.9K Jun 26 19:00 mpicc
-rwxr-xr-x 1 root root 18K Jun 26 18:59 mpichversion
-rwxr-xr-x 1 root root 9.5K Jun 26 19:00 mpicxx
lrwxrwxrwx 1 root root 13 Jun 26 18:59 mpiexec -> mpiexec.hydra
-rwxr-xr-x 1 root root 3.7M Jun 26 18:59 mpiexec.hydra
-rwxr-xr-x 1 root root 13K Jun 26 19:00 mpif77
lrwxrwxrwx 1 root root 7 Jun 26 19:00 mpif90 -> mpifort
-rwxr-xr-x 1 root root 13K Jun 26 19:00 mpifort
lrwxrwxrwx 1 root root 13 Jun 26 18:59 mpirun -> mpiexec.hydra
-rwxr-xr-x 1 root root 35K Jun 26 18:59 mpivars
Sure enough, two MPI installations coexist on this machine.
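Before (or instead of) deleting anything by hand, it can help to confirm which implementation each command on the PATH actually resolves to. The checks below were not part of my original session; they are just a sketch of how one could verify this:

# List every mpirun / mpic++ found on PATH, in order of precedence
which -a mpirun mpic++
# Open MPI's launcher identifies itself as "mpirun (Open MPI) x.y.z";
# MPICH's Hydra launcher prints its own build information instead
mpirun --version
# MPICH compiler wrappers print the underlying compile/link line with -show
# (the Open MPI wrappers use --showme for the same purpose)
mpic++ -show
# Inspect the Debian alternatives links seen in the /usr/bin listing above
update-alternatives --display mpirun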
To avoid a conflict, I deleted all of the MPI programs under /usr/bin (mpirun, mpicc, mpif77, mpif90 and so on, i.e. the Open MPI ones)
and kept only the MPI programs under /usr/local/bin (the MPICH ones).
But running the sample code still produced the same error.
I suspected the libraries might be the problem, so I checked which shared libraries mpi-test depends on:
[dell@~/projects/test/mpi]$ldd mpi-test
linux-vdso.so.1 (0x00007fff01b03000)
libmpi_cxx.so.40 => /lib/x86_64-linux-gnu/libmpi_cxx.so.40 (0x00007f694c0a6000)
libmpi.so.40 => /lib/x86_64-linux-gnu/libmpi.so.40 (0x00007f694bf81000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f694bd9f000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f694bd84000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f694bb92000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f694bb6d000)
libopen-rte.so.40 => /lib/x86_64-linux-gnu/libopen-rte.so.40 (0x00007f694bab3000)
libopen-pal.so.40 => /lib/x86_64-linux-gnu/libopen-pal.so.40 (0x00007f694ba05000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f694b8b6000)
libhwloc.so.15 => /lib/x86_64-linux-gnu/libhwloc.so.15 (0x00007f694b865000)
/lib64/ld-linux-x86-64.so.2 (0x00007f694c105000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f694b847000)
libevent-2.1.so.7 => /lib/x86_64-linux-gnu/libevent-2.1.so.7 (0x00007f694b7f1000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f694b7eb000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f694b7e6000)
libevent_pthreads-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_pthreads-2.1.so.7 (0x00007f694b7e1000)
libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007f694b7b4000)
libltdl.so.7 => /lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f694b7a7000)
Sure enough, one of the library references is wrong:
libmpi.so.40 => /lib/x86_64-linux-gnu/libmpi.so.40 (0x00007f694bf81000)
This libmpi.so.40 (together with libopen-rte/libopen-pal above) is Open MPI's MPI library; the executable should instead be linked against the MPICH library installed under /usr/local/lib.
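In hindsight, the compiler wrapper itself can reveal which libmpi it is going to link before ldd ever enters the picture. For example (not commands from my original session, just a suggested check):

# Ask the MPICH wrapper for the full compile/link line it generates;
# the -L paths and -lmpi entries show which libmpi will be used
/usr/local/bin/mpic++ -show
# Compare the libmpi variants installed on the system:
# MPICH provides libmpi.so.12, Open MPI provides libmpi.so.40
ls -l /usr/local/lib/libmpi.so* /lib/x86_64-linux-gnu/libmpi.so*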
So I changed the compile command to:
[dell@~/projects/test/mpi]$ mpic++ -o mpi-test mpi-test.cpp -L/usr/local/lib
[dell@~/projects/test/mpi]$ ldd mpi-test
linux-vdso.so.1 (0x00007ffdcf5c2000)
libmpi.so.12 => /usr/local/lib/libmpi.so.12 (0x00007f902a72b000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f902a523000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f902a331000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f902a1e2000)
libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007f902a1b5000)
libefa.so.1 => /lib/x86_64-linux-gnu/libefa.so.1 (0x00007f902a1a9000)
libibverbs.so.1 => /lib/x86_64-linux-gnu/libibverbs.so.1 (0x00007f902a18a000)
libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x00007f902a167000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f902a144000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f902a13e000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f902a133000)
/lib64/ld-linux-x86-64.so.2 (0x00007f902b523000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f902a116000)
libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x00007f902a09e000)
Now the executable resolves libmpi.so.12 from /usr/local/lib, i.e. the MPICH library. Running the program gives:
[dell@~/projects/test/mpi]$mpirun -np 2 ./mpi-test
Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
Rank 0 OK!
Rank 1 OK!
The program runs correctly! (The "Ignoring PCI device with non-16bit domain" lines appear to be harmless warnings from the hardware-topology layer (hwloc) and can be ignored here.)
To sum up: the code compiled cleanly but failed at run time because the executable was bound to the wrong MPI shared library. With two MPI implementations installed, the default library search path contains Open MPI's libmpi, and that is what the build picked up, even though the program was meant to run against MPICH. The fix is to tell the link step explicitly which library directory to use (here -L/usr/local/lib, so that the MPICH libmpi under /usr/local/lib is linked) and to launch the program with the matching mpirun.
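As a sketch of a build-and-run recipe that avoids the ambiguity altogether (assuming, as on my machine, that MPICH is installed under /usr/local; adjust the paths for other setups):

# Invoke the MPICH wrapper by its absolute path and embed an rpath so the
# dynamic loader also resolves libmpi from /usr/local/lib at run time
/usr/local/bin/mpic++ -o mpi-test mpi-test.cpp -L/usr/local/lib -Wl,-rpath,/usr/local/lib
# Verify the result: this should show /usr/local/lib/libmpi.so.12
ldd mpi-test | grep libmpi
# Launch with the matching MPICH launcher
/usr/local/bin/mpirun -np 2 ./mpi-test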