Tuesday, July 8, 2014

mca_oob_tcp_msg_recv: readv failed: Connection reset by peer

One of our users ran into the following error:

[compute-node1:00864] [[44805,0],0]-[[44805,1],7] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[compute-node1:00864] [[44805,0],0]-[[44805,1],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[compute-node1:00864] [[44805,0],0]-[[44805,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[compute-node1:00864] [[44805,0],0]-[[44805,1],2] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[compute-node1:00864] [[44805,0],0]-[[44805,1],5] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun has exited due to process rank 3 with PID 869 on
node compute-node1 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
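When the log is long, it can help to list which peers reported a reset connection. The following is a minimal sketch, not part of the original troubleshooting: the function name `reset_peers` is hypothetical, and it assumes the last number in each bracketed process name (the `7` in `[[44805,1],7]`) identifies the peer process.

```python
import re

# Matches the peer name in lines such as:
# [host:pid] [[44805,0],0]-[[44805,1],7] mca_oob_tcp_msg_recv: readv failed: ...
# Assumption: the trailing number in the second bracketed name identifies the peer.
PEER_RE = re.compile(
    r"\]-\[\[\d+,\d+\],(\d+)\] mca_oob_tcp_msg_recv: readv failed"
)


def reset_peers(log_text):
    """Return the sorted, de-duplicated peer IDs that reported a failed readv."""
    return sorted({int(m.group(1)) for m in PEER_RE.finditer(log_text)})


if __name__ == "__main__":
    import sys

    # Usage: python reset_peers.py < job.err
    print(reset_peers(sys.stdin.read()))
```

For the five lines above this returns the IDs 0, 1, 2, 5 and 7. Rank 3, the one mpirun reports as exiting improperly, is not among them.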

There are several things to check:
  1. Disable SELinux
    vim /etc/selinux/config
    .....
    .....  
    SELINUX=disabled
    .....
    .....

  2. Diagnose your InfiniBand (IB) network
    For more information, see "Diagnostic Tools to diagnose Infiniband Fabric Information"

  3. Check that the locked-memory (memlock) ulimit configured in /etc/security/limits.conf is correct. See the blog entry "A relook at libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations"

  4. If your scheduler is Torque, you also have to configure pbs_mom. See "Default ulimit setting in torque overide ulimit setting"
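Checks 1 and 3 can be scripted on each compute node. The sketch below is only an illustration: the helper names `selinux_mode` and `memlock_unlimited` are hypothetical, and it assumes that a "correct" memlock setting means unlimited for both the soft and hard limit, which is the usual recommendation for InfiniBand but is not spelled out in this post. It reports the limit of the process that runs it, so run it inside a job to see what Torque's pbs_mom actually hands to MPI processes.

```python
import resource


def selinux_mode(config_text):
    """Return the value of SELINUX= from the text of /etc/selinux/config, or None."""
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments and lines that are not key=value pairs.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Exact match, so SELINUXTYPE= is not mistaken for SELINUX=.
        if key.strip() == "SELINUX":
            return value.strip()
    return None


def memlock_unlimited(limits=None):
    """Return True if both the soft and hard RLIMIT_MEMLOCK are unlimited.

    Assumption: "unlimited" is the desired setting. Pass a (soft, hard)
    tuple to check other values; by default the current process is queried.
    """
    if limits is None:
        limits = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    soft, hard = limits
    return soft == resource.RLIM_INFINITY and hard == resource.RLIM_INFINITY


if __name__ == "__main__":
    try:
        with open("/etc/selinux/config") as fh:
            print("SELINUX =", selinux_mode(fh.read()))
    except OSError:
        print("could not read /etc/selinux/config on this host")
    print("memlock unlimited:", memlock_unlimited())
```

If the script reports a finite memlock limit inside a job but an unlimited one in an interactive shell, the scheduler is the likely place to look (check 4).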
