
MPICH example code fails at run time

董和泽
2023-12-01

After installing mpich-3.4.2 on Ubuntu 20, I ran the following example code and got an error.

#include <mpi.h>
#include <iostream>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    // Rank 0 sends the value 17 to rank 1.
    int value = 17;
    int result = MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    if (result == MPI_SUCCESS)
      std::cout << "Rank 0 OK!" << std::endl;
  } else if (rank == 1) {
    // Rank 1 receives the value from rank 0 and verifies it.
    int value;
    int result = MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                          MPI_STATUS_IGNORE);
    if (result == MPI_SUCCESS && value == 17)
      std::cout << "Rank 1 OK!" << std::endl;
  }
  MPI_Finalize();
  return 0;
}

The error output was as follows:

[dell@/usr/local/bin]$ mpic++ -o mpi-test mpi-test.cpp
[dell@/usr/local/bin]$ mpirun -np 2 ./mpi-test
No protocol specified
No protocol specified
[dell] *** An error occurred in MPI_Send
[dell] *** reported by process [1293090817,0]
[dell] *** on communicator MPI_COMM_WORLD
[dell] *** MPI_ERR_RANK: invalid rank
[dell] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dell] ***    and potentially your MPI job)
[dell] *** An error occurred in MPI_Send
[dell] *** reported by process [1293025281,0]
[dell] *** on communicator MPI_COMM_WORLD
[dell] *** MPI_ERR_RANK: invalid rank
[dell] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dell] ***    and potentially your MPI job)

After searching around online, my first suspicion was that the system had two MPI implementations installed: OpenMPI and MPICH.
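A quick way to check which implementation the commands on PATH actually belong to is shown below (a hypothetical session; the exact paths and output depend on the installation):

$ which mpirun mpic++     # see which directory wins on PATH
$ mpirun --version        # MPICH's launcher identifies itself as "HYDRA"; OpenMPI prints "mpirun (Open MPI) ..."
$ mpichversion            # exists only in an MPICH installation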

So I looked into the two directories /usr/bin and /usr/local/bin and found:

[dell@/usr/bin]$ls -lh | grep mpi
-rwxr-xr-x 1 root root      11K Apr 27  2016 dumpiso
-rwxr-xr-x 1 root root      47K Mar 13 00:38 glib-compile-resources
lrwxrwxrwx 1 root root       53 Mar 13 00:38 glib-compile-schemas -> ../lib/x86_64-linux-gnu/glib-2.0/glib-compile-schemas
lrwxrwxrwx 1 root root       24 Nov  6  2020 mpic++ -> /etc/alternatives/mpic++
lrwxrwxrwx 1 root root       21 Nov  6  2020 mpicc -> /etc/alternatives/mpi
lrwxrwxrwx 1 root root       23 Nov  6  2020 mpiCC -> /etc/alternatives/mpiCC
lrwxrwxrwx 1 root root       12 Apr 15  2020 mpicc.openmpi -> opal_wrapper
lrwxrwxrwx 1 root root       12 Apr 15  2020 mpiCC.openmpi -> opal_wrapper
lrwxrwxrwx 1 root root       12 Apr 15  2020 mpic++.openmpi -> opal_wrapper
lrwxrwxrwx 1 root root       24 Nov  6  2020 mpicxx -> /etc/alternatives/mpicxx
lrwxrwxrwx 1 root root       12 Apr 15  2020 mpicxx.openmpi -> opal_wrapper
lrwxrwxrwx 1 root root       25 Nov  6  2020 mpiexec -> /etc/alternatives/mpiexec
-rwxr-xr-x 1 root root      20K Mar 22  2020 mpiexec.lam
lrwxrwxrwx 1 root root        7 Apr 15  2020 mpiexec.openmpi -> orterun
lrwxrwxrwx 1 root root       24 Nov  6  2020 mpif77 -> /etc/alternatives/mpif77
lrwxrwxrwx 1 root root       12 Apr 15  2020 mpif77.openmpi -> opal_wrapper
lrwxrwxrwx 1 root root       24 Nov  6  2020 mpif90 -> /etc/alternatives/mpif90
lrwxrwxrwx 1 root root       12 Apr 15  2020 mpif90.openmpi -> opal_wrapper
lrwxrwxrwx 1 root root       25 Nov  6  2020 mpifort -> /etc/alternatives/mpifort
lrwxrwxrwx 1 root root       12 Apr 15  2020 mpifort.openmpi -> opal_wrapper
-rwxr-xr-x 1 root root      27K Mar 22  2020 mpimsg
lrwxrwxrwx 1 root root       24 Nov  6  2020 mpirun -> /etc/alternatives/mpirun
-rwxr-xr-x 1 root root      39K Mar 22  2020 mpirun.lam
lrwxrwxrwx 1 root root        7 Apr 15  2020 mpirun.openmpi -> orterun
-rwxr-xr-x 1 root root      19K Mar 22  2020 mpitask
lrwxrwxrwx 1 root root       10 Apr 15  2020 ompi-clean -> orte-clean
-rwxr-xr-x 1 root root      31K Apr 15  2020 ompi_info
lrwxrwxrwx 1 root root       11 Apr 15  2020 ompi-server -> orte-server
-rwxr-xr-x 1 root root      12K Mar 13  2020 py3compile
-rwxr-xr-x 1 root root      12K Mar 13  2020 pycompile
-rwxr-xr-x 1 root root      15K Mar 26  2020 teckit_compile
[dell@/usr/local/bin]$ls -lh | grep mpi
lrwxrwxrwx 1 root root    6 Jun 26 19:00 mpic++ -> mpicxx
-rwxr-xr-x 1 root root 9.9K Jun 26 19:00 mpicc
-rwxr-xr-x 1 root root  18K Jun 26 18:59 mpichversion
-rwxr-xr-x 1 root root 9.5K Jun 26 19:00 mpicxx
lrwxrwxrwx 1 root root   13 Jun 26 18:59 mpiexec -> mpiexec.hydra
-rwxr-xr-x 1 root root 3.7M Jun 26 18:59 mpiexec.hydra
-rwxr-xr-x 1 root root  13K Jun 26 19:00 mpif77
lrwxrwxrwx 1 root root    7 Jun 26 19:00 mpif90 -> mpifort
-rwxr-xr-x 1 root root  13K Jun 26 19:00 mpifort
lrwxrwxrwx 1 root root   13 Jun 26 18:59 mpirun -> mpiexec.hydra
-rwxr-xr-x 1 root root  35K Jun 26 18:59 mpivars

There were indeed two MPI installations: OpenMPI under /usr/bin and MPICH under /usr/local/bin.

To avoid the conflict, I deleted all the OpenMPI programs under /usr/bin (mpirun, mpicc, mpif77, mpif90, and so on), keeping only the MPICH programs under /usr/local/bin.
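As an aside, the /usr/bin entries are symlinks managed by Debian's alternatives system (they point into /etc/alternatives, as the listing above shows), so instead of deleting package-owned files one could switch the defaults, for example:

$ sudo update-alternatives --config mpirun   # choose what /usr/bin/mpirun points to
$ sudo update-alternatives --config mpi      # same for the compiler wrappers (mpicc, mpic++, ...)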

But running the example code still produced the same error.
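One way to narrow this down is to ask the compiler wrapper what it actually runs under the hood; both implementations offer a dry-run flag for this (MPICH's wrappers take -show, OpenMPI's take --showme):

$ mpic++ -show      # MPICH: prints the underlying g++ command with its -I/-L/-l flags
$ mpic++ --showme   # OpenMPI: the equivalent flag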

I then wondered whether the problem was in the libraries, so I listed the shared libraries that mpi-test depends on:

[dell@~/projects/test/mpi]$ldd mpi-test
        linux-vdso.so.1 (0x00007fff01b03000)
        libmpi_cxx.so.40 => /lib/x86_64-linux-gnu/libmpi_cxx.so.40 (0x00007f694c0a6000)
        libmpi.so.40 => /lib/x86_64-linux-gnu/libmpi.so.40 (0x00007f694bf81000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f694bd9f000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f694bd84000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f694bb92000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f694bb6d000)
        libopen-rte.so.40 => /lib/x86_64-linux-gnu/libopen-rte.so.40 (0x00007f694bab3000)
        libopen-pal.so.40 => /lib/x86_64-linux-gnu/libopen-pal.so.40 (0x00007f694ba05000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f694b8b6000)
        libhwloc.so.15 => /lib/x86_64-linux-gnu/libhwloc.so.15 (0x00007f694b865000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f694c105000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f694b847000)
        libevent-2.1.so.7 => /lib/x86_64-linux-gnu/libevent-2.1.so.7 (0x00007f694b7f1000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f694b7eb000)
        libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f694b7e6000)
        libevent_pthreads-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_pthreads-2.1.so.7 (0x00007f694b7e1000)
        libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007f694b7b4000)
        libltdl.so.7 => /lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f694b7a7000)

Sure enough, one of the library references was wrong:

 libmpi.so.40 => /lib/x86_64-linux-gnu/libmpi.so.40 (0x00007f694bf81000)

This is the MPI runtime library, but libmpi.so.40 is OpenMPI's (as are libopen-rte.so.40 and libopen-pal.so.40 above); MPICH's is libmpi.so.12. The executable should instead be linked against the library under MPICH's install prefix, /usr/local/lib. This mismatch also plausibly explains the MPI_ERR_RANK error: a binary running OpenMPI's library but launched by MPICH's mpirun most likely initializes each process as an independent singleton whose MPI_COMM_WORLD has size 1 (note that both error reports above come from a rank-0 process), so rank 1 is an invalid destination for MPI_Send.
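On Ubuntu, a quick way to confirm which package that stray library belongs to (assuming it was installed through apt) is to ask dpkg:

$ dpkg -S libmpi.so.40    # expect a match from an OpenMPI runtime package such as libopenmpi3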

So I changed the compile command to point the linker at MPICH's library directory:

[dell@~/projects/test/mpi]$ mpic++ -o mpi-test mpi-test.cpp -L/usr/local/lib
[dell@~/projects/test/mpi]$ ldd mpi-test
        linux-vdso.so.1 (0x00007ffdcf5c2000)
        libmpi.so.12 => /usr/local/lib/libmpi.so.12 (0x00007f902a72b000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f902a523000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f902a331000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f902a1e2000)
        libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007f902a1b5000)
        libefa.so.1 => /lib/x86_64-linux-gnu/libefa.so.1 (0x00007f902a1a9000)
        libibverbs.so.1 => /lib/x86_64-linux-gnu/libibverbs.so.1 (0x00007f902a18a000)
        libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x00007f902a167000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f902a144000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f902a13e000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f902a133000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f902b523000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f902a116000)
        libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x00007f902a09e000)

Now the library resolution is correct: the binary binds to MPICH's libmpi.so.12 under /usr/local/lib. Running the program gives:

[dell@~/projects/test/mpi]$mpirun -np 2 ./mpi-test
Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
Rank 0 OK!
Rank 1 OK!

The program runs correctly! (The remaining "Ignoring PCI device with non-16bit domain" lines are informational warnings from hwloc, the hardware-topology library, and do not affect the result.)

To summarize: the program compiled fine but failed at run time. An executable has to locate MPI's shared library when it starts, and a library with the expected name, OpenMPI's libmpi, happened to sit on the default dynamic-linker search path. The program therefore bound to the wrong MPI implementation, one that did not match the mpirun used to launch the job, and the mismatch produced the error. The fix is to specify the directory of the intended library when linking the executable (here, -L/usr/local/lib).
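For a more robust fix one can also embed the library directory into the executable itself via an rpath, or point the dynamic loader at it for a single run; a sketch, assuming MPICH is installed under /usr/local:

$ mpic++ -o mpi-test mpi-test.cpp -L/usr/local/lib -Wl,-rpath,/usr/local/lib
$ LD_LIBRARY_PATH=/usr/local/lib mpirun -np 2 ./mpi-test

Note that -L by itself only affects the link step; run-time lookup is governed by the executable's rpath, LD_LIBRARY_PATH, and the ld.so cache.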
