[rds-devel] FW: RDS -- hanging kernel

Tang, Changqing changquing.tang at hp.com
Thu Apr 29 22:01:23 PDT 2010


Andy,
        I have a dynamic MPI test over RDS, where a rank is killed randomly and a new process is forked to join the 'game'. I get the following /var/log/messages output:

Apr 29 21:43:08 sq0n32 kernel: RDS/IB: recv completion on 172.31.64.33 had status 4, disconnecting and reconnecting
Apr 29 21:43:08 sq0n32 kernel: RDS/IB: connected to 172.31.64.33 version 3.1
Apr 29 21:44:18 sq0n32 kernel: RDS/IB: recv completion on 172.31.64.33 had status 4, disconnecting and reconnecting
Apr 29 21:44:18 sq0n32 kernel: RDS/IB: connected to 172.31.64.33 version 3.1
Apr 29 21:45:14 sq0n32 kernel: RDS/IB: send completion on 172.31.64.33 had status 11, disconnecting and reconnecting
Apr 29 21:45:14 sq0n32 kernel: RDS/IB: connected to 172.31.64.33 version 3.1
Apr 29 21:45:14 sq0n32 kernel: RDS/IB: send completion on 172.31.64.33 had status 10, disconnecting and reconnecting
Apr 29 21:45:14 sq0n32 kernel: RDS/IB: connected to 172.31.64.33 version 3.1

I understand that the send completion errors are remote access error (10) and remote operation error (11).

What is the possible reason for the recv completion error (4, local protection error)?
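
For reference, the status numbers in these kernel messages are the raw
ibv_wc_status codes from the verbs layer: 4 is IBV_WC_LOC_PROT_ERR, 10 is
IBV_WC_REM_ACCESS_ERR and 11 is IBV_WC_REM_OP_ERR. A minimal sketch
(assuming libibverbs is installed; this is only an illustration, not part
of the test) that maps the numbers to their names:

/* Build with: gcc wc_status.c -o wc_status -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
        /* the completion statuses seen in the RDS/IB log lines above */
        enum ibv_wc_status seen[] = {
                IBV_WC_LOC_PROT_ERR,    /*  4: local protection error (recv) */
                IBV_WC_REM_ACCESS_ERR,  /* 10: remote access error (send)    */
                IBV_WC_REM_OP_ERR,      /* 11: remote operation error (send) */
        };
        unsigned int i;

        for (i = 0; i < sizeof(seen) / sizeof(seen[0]); i++)
                printf("status %d: %s\n", seen[i], ibv_wc_status_str(seen[i]));
        return 0;
}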

--CQ

-----Original Message-----
From: Andy Grover [mailto:andy.grover at oracle.com]
Sent: Thursday, April 29, 2010 2:31 PM
To: Tang, Changqing
Cc: Tina Yang; RDS Devel
Subject: Re: [rds-devel] FW: RDS -- hanging kernel

On 04/28/2010 01:44 PM, Tang, Changqing wrote:
> I post a sendmsg() to RDS with both an rdma-write control message (using
> msg_control and msg_controllen) and a regular RDS message (using
> msg_iov and msg_iovlen, 48 bytes). If the rdma-write fails, I get a
> notification of RDS_RDMA_REMOTE_ERROR. Is the regular 48-byte RDS
> message removed from the RDS system as well?

Correct, when the RDMA fails, the bcopy portion of the message is not
re-transmitted.
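
In sketch form, that pattern (one sendmsg() carrying both an rdma-write
control message and a small bcopy payload, plus the later notification
check on recvmsg()) might look roughly like the following. The constant
and structure names are taken from linux/rds.h; field layouts vary
between kernel/OFED versions, so treat this as an untested illustration
rather than the actual test code:

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <linux/rds.h>   /* rds_rdma_args, rds_rdma_notify, SOL_RDS, ... */

/* sock is an AF_RDS/SOCK_SEQPACKET socket already bound to a local
 * address; dest holds the peer's IP/port; rdma describes the RDMA write
 * (cookie, remote_vec, local vector, flags such as
 * RDS_RDMA_READWRITE | RDS_RDMA_NOTIFY_ME, user_token); payload is the
 * regular 48-byte RDS message. */
static ssize_t send_rdma_with_payload(int sock, struct sockaddr_in *dest,
                                      struct rds_rdma_args *rdma,
                                      void *payload, size_t payload_len)
{
        struct iovec iov = { .iov_base = payload, .iov_len = payload_len };
        char cbuf[CMSG_SPACE(sizeof(*rdma))];
        struct msghdr msg = {
                .msg_name       = dest,
                .msg_namelen    = sizeof(*dest),
                .msg_iov        = &iov,         /* bcopy part of the message */
                .msg_iovlen     = 1,
                .msg_control    = cbuf,         /* rdma-write part */
                .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cmsg;

        memset(cbuf, 0, sizeof(cbuf));
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_RDS;
        cmsg->cmsg_type  = RDS_CMSG_RDMA_ARGS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(*rdma));
        memcpy(CMSG_DATA(cmsg), rdma, sizeof(*rdma));

        return sendmsg(sock, &msg, 0);
}

/* After a later recvmsg() on the same socket, scan the control messages
 * for the RDMA notification; returns 1 if it reported RDS_RDMA_REMOTE_ERROR. */
static int got_rdma_remote_error(struct msghdr *msg)
{
        struct cmsghdr *cmsg;

        for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
                if (cmsg->cmsg_level == SOL_RDS &&
                    cmsg->cmsg_type == RDS_CMSG_RDMA_STATUS) {
                        struct rds_rdma_notify note;

                        memcpy(&note, CMSG_DATA(cmsg), sizeof(note));
                        if (note.status == RDS_RDMA_REMOTE_ERROR)
                                return 1;
                }
        }
        return 0;
}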

Regards -- Andy

>
> Thanks.
> --CQ
>
> -----Original Message-----
> From: Andy Grover [mailto:andy.grover at oracle.com]
> Sent: Monday, April 26, 2010 8:15 PM
> To: Tang, Changqing
> Cc: Tina Yang; RDS Devel
> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>
> On 04/26/2010 03:41 PM, Tang, Changqing wrote:
>> Andy, how stable is this branch? Last time I applied the bug 2002 fix
>> to the OFED 1.5.1 RDS, it had more problems: without it, RDS works
>> normally; with it, it crashes the machine very quickly.
>
> I wouldn't run devel, I'd run oel-stable and cherry-pick the patch you
> want on top of that. oel-stable has a few additional follow-on patches
> to the fix in 2002 that may work better for you.
>
> Regards -- Andy
>
>>
>> --CQ
>>
>> -----Original Message-----
>> From: Andy Grover [mailto:andy.grover at oracle.com]
>> Sent: Monday, April 26, 2010 5:17 PM
>> To: Tang, Changqing
>> Cc: Tina Yang; RDS Devel
>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>>
>> On 04/26/2010 02:58 PM, Tang, Changqing wrote:
>>> OK, thanks, Andy. How could we get this latest development code to
>>> test?
>>
>> The current devel branch is here:
>>
>> http://www.openfabrics.org/git/?p=~agrover/ofed_1_5/linux-2.6.git;a=shortlog;h=devel
>>
>> This is what we're working on for OFED 1.6. If you're interested in
>> the connection-killing stuff, you should be able to pull that patch
>> out and apply it to 1.5.1 relatively easily. Look for "RDS: Destroy
>> unconnected connections after 5 minutes".
>>
>> Regards -- Andy
>>
>>>
>>> --CQ
>>>
>>> -----Original Message-----
>>> From: Andy Grover [mailto:andy.grover at oracle.com]
>>> Sent: Monday, April 26, 2010 4:41 PM
>>> To: Tang, Changqing
>>> Cc: Tina Yang; RDS Devel
>>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>>>
>>> On 04/23/2010 03:14 PM, Tang, Changqing wrote:
>>>> Tina, if one of the nodes is down and never recovers, the RDS IB
>>>> connection to this node is down and never reconnected. What about
>>>> the messages on other nodes destined for this down node? Do they sit
>>>> in RDS forever or get dropped after some time?
>>>
>>> Yes, the current behavior is to retry forever. See this bug:
>>>
>>> https://bugs.openfabrics.org/show_bug.cgi?id=1355
>>>
>>> so we would like RDS to get smarter. There's currently a patch in
>>> the devel branch to destroy unconnected connections after 5
>>> minutes.
>>>
>>> Regards -- Andy
>>>
>>>>
>>>> Thank you.
>>>>
>>>> --CQ
>>>>
>>>> -----Original Message-----
>>>> From: Tina Yang [mailto:tina.yang at oracle.com]
>>>> Sent: Saturday, April 10, 2010 12:03 AM
>>>> To: Tang, Changqing
>>>> Cc: Andy Grover; RDS Devel
>>>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>>>>
>>>> Can you put the core dump on an ftp site where I can access it?
>>>>
>>>>
>>>>
>>>> Tang, Changqing wrote:
>>>>> Tina, I applied the patch in bug 2002 to the file rdma.c and
>>>>> rebuilt RDS; it does NOT work for me. The Pallas MPI test still
>>>>> hangs the first node at almost the same test location.
>>>>>
>>>>> Any idea?
>>>>>
>>>>> --CQ
>>>>>
>>>>> -----Original Message-----
>>>>> From: Tang, Changqing
>>>>> Sent: Friday, April 09, 2010 1:14 PM
>>>>> To: 'Tina Yang'
>>>>> Cc: Andy Grover; RDS Devel
>>>>> Subject: RE: [rds-devel] FW: RDS -- hanging kernel
>>>>>
>>>>> OK, the one in bug 2002 is newer than OFED 1.5.1; is this the
>>>>> one that fixes my problem?
>>>>>
>>>>> --CQ
>>>>>
>>>>> -----Original Message-----
>>>>> From: Tina Yang [mailto:tina.yang at oracle.com]
>>>>> Sent: Friday, April 09, 2010 12:49 PM
>>>>> To: Tang, Changqing
>>>>> Cc: Andy Grover; RDS Devel
>>>>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>>>>>
>>>>> Looks like the signature of a known bug. Go fetch the following
>>>>> recently submitted patches from the kernel netdev archive:
>>>>> [rds-devel] [PATCH 06/13] RDS: Fix send locking issue
>>>>> [rds-devel] [PATCH 09/13] RDS: Fix locking in rds_send_drop_to()
>>>>> [rds-devel] [PATCH 12/13] RDS: Do not call set_page_dirty() with irqs off
>>>>> and https://bugs.openfabrics.org/show_bug.cgi?id=2002
>>>>>
>>>>>
>>>>>
>>>>> Tang, Changqing wrote:
>>>>>
>>>>>> Tina, eventually I got the vmcore for the hanging kernel. I
>>>>>> ran the MPI Pallas test over RDS on two nodes, 5 ranks on the
>>>>>> first node and 6 ranks on the second node.
>>>>>>
>>>>>> After running for a while, during the MPI_Allreduce() test, the
>>>>>> first node hangs. We managed to get the vmcore from this
>>>>>> hanging node.
>>>>>>
>>>>>> It shows the following:
>>>>>>
>>>>>> crash> bt
>>>>>> PID: 9353   TASK: ffff810416f8c7b0  CPU: 1   COMMAND: "pmb.x.32"
>>>>>>  #0 [ffff81042ff33e68] die_nmi at ffffffff8006634d
>>>>>>  #1 [ffff81042ff33e80] die_nmi at ffffffff80066393
>>>>>>  #2 [ffff81042ff33ea0] nmi_watchdog_tick at ffffffff80066af9
>>>>>>  #3 [ffff81042ff33ef0] default_do_nmi at ffffffff80066717
>>>>>>  #4 [ffff81042ff33f40] do_nmi at ffffffff80066984
>>>>>>  #5 [ffff81042ff33f50] nmi at ffffffff80065fd7
>>>>>>     [exception RIP: .text.lock.spinlock+17]
>>>>>>     RIP: ffffffff80065cf3  RSP: ffff810407383af0  RFLAGS: 00000082
>>>>>>     RAX: 0000000000000246  RBX: ffff810405c21b80  RCX: 0000000000002000
>>>>>>     RDX: ffff810407383ed8  RSI: ffff810407383ed8  RDI: ffff810405c21df8
>>>>>>     RBP: 0000000000000000   R8: 0000000080000040   R9: 0000000000002000
>>>>>>     R10: 0000000000002000  R11: 00000000ffb2a618  R12: 0000000000000001
>>>>>>     R13: ffff810407383ed8  R14: ffff810407383ed8  R15: ffff810405c21df8
>>>>>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>>>> --- <exception stack> ---
>>>>>>  #6 [ffff810407383af0] .text.lock.spinlock at ffffffff80065cf3 (via _spin_lock_irqsave)
>>>>>>  #7 [ffff810407383af0] rds_notify_queue_get at ffffffff8868c6a0
>>>>>>  #8 [ffff810407383b60] rds_recvmsg at ffffffff8868cbb2
>>>>>>  #9 [ffff810407383c10] sock_recvmsg at ffffffff80030ab8
>>>>>> #10 [ffff810407383db0] sys_recvmsg at ffffffff8003d77f
>>>>>> #11 [ffff810407383f50] compat_sys_socketcall at ffffffff8022b44b
>>>>>> #12 [ffff810407383f80] cstar_do_call at ffffffff80062618
>>>>>>
>>>>>> What else do I need to provide you?
>>>>>>
>>>>>> --CQ
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>



