MPI LIBLINEAR FAQ

Last modified:

Answers to some questions may be found in the LIBLINEAR FAQ.


Q: Why am I getting the following warnings or even encountering crashes when running MPI LIBLINEAR?
$ mpirun -n 2 -npernode 1 --machinefile machinefile ./train ...
--------------------------------------------------------------------------
[[958,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
	Host: [machine-name]

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
...
...[btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 13335 on node peanuts exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------	

OpenFabrics (openib) is an Open MPI module for high-speed data transport across machines.
If a related component (e.g., InfiniBand) is not supported on your machine, Open MPI may print these warning messages.
To solve this issue, add --mca btl ^openib to the mpirun command to exclude the openib component:

$ mpirun -n 2 -npernode 1 --machinefile machinefile --mca btl ^openib ./train ...
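
As an alternative to repeating the option on every command line, Open MPI also reads MCA parameters from environment variables named OMPI_MCA_<parameter>. The following is a minimal sketch, assuming a POSIX shell and the same placeholder arguments as in the command above:

```shell
# Equivalent to passing "--mca btl ^openib" on the mpirun command line.
# The value is quoted so that the shell treats ^ literally.
export OMPI_MCA_btl="^openib"

# Later launches from this shell then need no extra flag, for example:
#   mpirun -n 2 -npernode 1 --machinefile machinefile ./train ...
```

The setting lasts only for the current shell session; the command-line form above remains the simplest choice for a one-off run.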

Q: Why am I getting the following errors?
...[btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] ... failed: Connection refused (111)

Sometimes MPI fails to choose a proper network interface. You need to specify the correct one manually by adding

--mca btl_tcp_if_include [your-network-interface]

to your mpirun command.
Note that [your-network-interface] should be replaced by the name of a network interface that is used to communicate with all the other nodes.
On Unix-like systems, you may type ifconfig to see all network interfaces on your machine. See Open MPI's FAQ for more details.
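
For illustration, the full command might be assembled as in the sketch below. The interface name eth0 is only an assumed placeholder, not something taken from your system, and the remaining options simply repeat those used in the earlier example:

```shell
# Assumption: "eth0" is only a placeholder. Use the interface name reported
# by ifconfig that sits on the network shared by all nodes.
IFACE="eth0"

# Assemble the launch command; "..." stands for your usual train arguments.
CMD="mpirun -n 2 -npernode 1 --machinefile machinefile --mca btl_tcp_if_include $IFACE ./train ..."

# Print the command for inspection before running it.
echo "$CMD"
```

Once the printed command looks right, replace ... with your own arguments and run it directly.
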

Q: Why am I getting the following errors?
Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
Please make sure that all the machines listed in your machinefile can be reached through SSH login without a password prompt.
Note that the first line of the machinefile should be the IP address of the master machine.
If you still encounter such errors, please try replacing localhost with <MASTER-MACHINE-IP> in your machinefile.
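
The checks above can be sketched as a small script. The two IP addresses and the helper name check_hosts are hypothetical and exist only for this example; the script assumes a POSIX shell and the standard OpenSSH client:

```shell
# Hypothetical addresses; replace them with those of your own machines.
MASTER_IP="192.168.0.1"
WORKER_IP="192.168.0.2"

# The master machine goes on the first line of the machinefile.
printf '%s\n%s\n' "$MASTER_IP" "$WORKER_IP" > machinefile

# check_hosts FILE: try a non-interactive SSH login to every host in FILE.
# BatchMode=yes makes ssh fail instead of asking for a password or for
# host-key confirmation, so problem hosts are reported rather than hanging.
check_hosts() {
    while read -r host rest; do
        case "$host" in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "$host: ok"
        else
            echo "$host: SSH login failed"
        fi
    done < "$1"
}
```

Running check_hosts machinefile on the master machine prints one line per host. If a host reports a failure, the usual remedies are to install your public key on it (for example with ssh-copy-id) and to log in once by hand so that its host key is accepted.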

Please contact Chih-Jen Lin if you have any questions.