[rds-devel] FW: RDS -- hanging kernel

Tang, Changqing changquing.tang at hp.com
Wed Apr 28 13:44:51 PDT 2010


Andy,
        Simple question: I post a sendmsg() to RDS with both an rdma-write control message (using msg_control and msg_controllen) and a regular RDS message (using msg_iov and msg_iovlen, 48 bytes). If the rdma-write fails, I get a notification of RDS_RDMA_REMOTE_ERROR. Is the regular 48-byte RDS message removed from the RDS system as well?

        Thanks.

--CQ
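
For readers following along, the request described above can be sketched roughly as follows. This is only an illustration of how such a sendmsg() might be assembled, not the code actually used in the test. It assumes the structures and constants from <linux/rds.h> (struct rds_rdma_args, struct rds_rdma_notify, RDS_CMSG_RDMA_ARGS, RDS_CMSG_RDMA_STATUS). The names rdma_write_request, build_rdma_write_request and find_rdma_status are hypothetical helpers invented for this sketch, and the cookie, remote address and token values are placeholders supplied by the caller.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/rds.h>

/* Hypothetical container: keeps everything the msghdr points to alive. */
struct rdma_write_request {
    struct msghdr msg;
    struct iovec data_iov;       /* the regular RDS payload (48 bytes above) */
    struct rds_iovec local_vec;  /* local source buffer for the RDMA write */
    union {
        char buf[CMSG_SPACE(sizeof(struct rds_rdma_args))];
        struct cmsghdr align;    /* forces correct alignment of buf */
    } control;
};

/* Fill in one msghdr carrying both the RDMA-write cmsg and the payload. */
static void build_rdma_write_request(struct rdma_write_request *req,
                                     struct sockaddr *dest, socklen_t dest_len,
                                     void *payload, size_t payload_len,
                                     void *local_buf, size_t local_len,
                                     uint64_t remote_cookie,
                                     uint64_t remote_addr, uint64_t token)
{
    struct rds_rdma_args args;
    struct cmsghdr *cmsg;

    memset(req, 0, sizeof(*req));
    req->data_iov.iov_base = payload;
    req->data_iov.iov_len = payload_len;
    req->local_vec.addr = (uint64_t)(uintptr_t)local_buf;
    req->local_vec.bytes = local_len;

    memset(&args, 0, sizeof(args));
    args.cookie = remote_cookie;           /* placeholder: obtained from the peer */
    args.remote_vec.addr = remote_addr;
    args.remote_vec.bytes = local_len;
    args.local_vec_addr = (uint64_t)(uintptr_t)&req->local_vec;
    args.nr_local = 1;
    /* READWRITE set means "write"; NOTIFY_ME requests a completion status. */
    args.flags = RDS_RDMA_READWRITE | RDS_RDMA_NOTIFY_ME;
    args.user_token = token;               /* echoed back in the notification */

    req->msg.msg_name = dest;
    req->msg.msg_namelen = dest_len;
    req->msg.msg_iov = &req->data_iov;     /* regular message: msg_iov/msg_iovlen */
    req->msg.msg_iovlen = 1;
    req->msg.msg_control = req->control.buf;   /* RDMA op: msg_control */
    req->msg.msg_controllen = sizeof(req->control.buf);

    cmsg = CMSG_FIRSTHDR(&req->msg);
    cmsg->cmsg_level = SOL_RDS;
    cmsg->cmsg_type = RDS_CMSG_RDMA_ARGS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(args));
    memcpy(CMSG_DATA(cmsg), &args, sizeof(args));
}

/* Scan a received msghdr for an RDMA status notification.
 * Returns 1 and fills token/status if one is present, otherwise 0. */
static int find_rdma_status(struct msghdr *msg, uint64_t *token, int32_t *status)
{
    struct cmsghdr *cmsg;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_RDS &&
            cmsg->cmsg_type == RDS_CMSG_RDMA_STATUS) {
            struct rds_rdma_notify n;
            memcpy(&n, CMSG_DATA(cmsg), sizeof(n));
            *token = n.user_token;
            *status = n.status;    /* e.g. RDS_RDMA_REMOTE_ERROR */
            return 1;
        }
    }
    return 0;
}
```

A caller would pass &req.msg to sendmsg() on an RDS socket, and later run find_rdma_status() over the control data returned by recvmsg() to match the token against the failed operation. The sketch says nothing about what RDS does with the 48-byte payload when the RDMA write fails; that is exactly the open question.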

-----Original Message-----
From: Andy Grover [mailto:andy.grover at oracle.com]
Sent: Monday, April 26, 2010 8:15 PM
To: Tang, Changqing
Cc: Tina Yang; RDS Devel
Subject: Re: [rds-devel] FW: RDS -- hanging kernel

On 04/26/2010 03:41 PM, Tang, Changqing wrote:
> Andy, how stable is this branch? Last time I applied the bug 2002 fix
> to OFED 1.5.1 RDS, and it has more problems: without it, RDS works
> normally; with it, it crashes the machine really quickly.

I wouldn't run devel, I'd run oel-stable and cherry-pick the patch you
want on top of that. oel-stable has a few additional follow-on patches
to the fix in 2002 that may work better for you.

Regards -- Andy

>
> --CQ
>
> -----Original Message-----
> From: Andy Grover [mailto:andy.grover at oracle.com]
> Sent: Monday, April 26, 2010 5:17 PM
> To: Tang, Changqing
> Cc: Tina Yang; RDS Devel
> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>
> On 04/26/2010 02:58 PM, Tang, Changqing wrote:
>> OK, thanks, Andy. How can we get this latest development code to
>> test?
>
> The current devel branch is here:
>
> http://www.openfabrics.org/git/?p=~agrover/ofed_1_5/linux-2.6.git;a=shortlog;h=devel
>
>  This is what we're working on for OFED 1.6. If you're interested in
> the connection-killing stuff, you should be able to pull that patch
> out and apply it to 1.5.1 relatively easily. Look for "RDS: Destroy
> unconnected connections after 5 minutes".
>
> Regards -- Andy
>
>>
>> --CQ
>>
>> -----Original Message-----
>> From: Andy Grover [mailto:andy.grover at oracle.com]
>> Sent: Monday, April 26, 2010 4:41 PM
>> To: Tang, Changqing
>> Cc: Tina Yang; RDS Devel
>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>>
>> On 04/23/2010 03:14 PM, Tang, Changqing wrote:
>>> Tina, if one of the nodes goes down and never recovers, the RDS IB
>>> connection to this node is down and never reconnects. What happens
>>> to the messages on other nodes destined for this down node? Do they
>>> sit in RDS forever, or get dropped after some time?
>>
>> Yes, the current behavior is to retry forever. See this bug:
>>
>> https://bugs.openfabrics.org/show_bug.cgi?id=1355
>>
>> so we would like RDS to get smarter. There's currently a patch in
>> the devel branch to destroy unconnected connections after 5
>> minutes.
>>
>> Regards -- Andy
>>
>>>
>>> Thank you.
>>>
>>> --CQ
>>>
>>> -----Original Message-----
>>> From: Tina Yang [mailto:tina.yang at oracle.com]
>>> Sent: Saturday, April 10, 2010 12:03 AM
>>> To: Tang, Changqing
>>> Cc: Andy Grover; RDS Devel
>>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>>>
>>> Can you put the core dump on an FTP site where I can access it?
>>>
>>>
>>>
>>> Tang, Changqing wrote:
>>>> Tina, I applied the patch in bug 2002 to the file rdma.c and
>>>> rebuilt RDS. It does NOT work for me: the Pallas MPI test still
>>>> hangs the first node at almost the same test location.
>>>>
>>>> Any idea?
>>>>
>>>> --CQ
>>>>
>>>> -----Original Message-----
>>>> From: Tang, Changqing
>>>> Sent: Friday, April 09, 2010 1:14 PM
>>>> To: 'Tina Yang'
>>>> Cc: Andy Grover; RDS Devel
>>>> Subject: RE: [rds-devel] FW: RDS -- hanging kernel
>>>>
>>>> OK, the one in bug 2002 is newer than OFED 1.5.1. Is this the
>>>> one to fix my problem?
>>>>
>>>> --CQ
>>>>
>>>> -----Original Message-----
>>>> From: Tina Yang [mailto:tina.yang at oracle.com]
>>>> Sent: Friday, April 09, 2010 12:49 PM
>>>> To: Tang, Changqing
>>>> Cc: Andy Grover; RDS Devel
>>>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>>>>
>>>> Looks like the signature of a known bug. Go fetch the
>>>> following recently submitted patches from kernel netdev
>>>> archive:
>>>>
>>>>   [rds-devel][PATCH 06/13] RDS: Fix send locking issue
>>>>   [rds-devel][PATCH 09/13] RDS: Fix locking in rds_send_drop_to()
>>>>   [rds-devel][PATCH 12/13] RDS: Do not call set_page_dirty() with irqs off
>>>>
>>>> and https://bugs.openfabrics.org/show_bug.cgi?id=2002
>>>>
>>>>
>>>>
>>>> Tang, Changqing wrote:
>>>>
>>>>> Tina, I eventually got the vmcore for the hanging kernel. I
>>>>> ran the MPI Pallas test over RDS on two nodes, with 5 ranks on
>>>>> the first node and 6 ranks on the second node.
>>>>>
>>>>> After running for a while, during the MPI_Allreduce() test, the
>>>>> first node hangs. We managed to get the vmcore from this
>>>>> hanging node.
>>>>>
>>>>> It shows the following:
>>>>>
>>>>> crash> bt
>>>>> PID: 9353   TASK: ffff810416f8c7b0  CPU: 1   COMMAND: "pmb.x.32"
>>>>>  #0 [ffff81042ff33e68] die_nmi at ffffffff8006634d
>>>>>  #1 [ffff81042ff33e80] die_nmi at ffffffff80066393
>>>>>  #2 [ffff81042ff33ea0] nmi_watchdog_tick at ffffffff80066af9
>>>>>  #3 [ffff81042ff33ef0] default_do_nmi at ffffffff80066717
>>>>>  #4 [ffff81042ff33f40] do_nmi at ffffffff80066984
>>>>>  #5 [ffff81042ff33f50] nmi at ffffffff80065fd7
>>>>>     [exception RIP: .text.lock.spinlock+17]
>>>>>     RIP: ffffffff80065cf3  RSP: ffff810407383af0  RFLAGS: 00000082
>>>>>     RAX: 0000000000000246  RBX: ffff810405c21b80  RCX: 0000000000002000
>>>>>     RDX: ffff810407383ed8  RSI: ffff810407383ed8  RDI: ffff810405c21df8
>>>>>     RBP: 0000000000000000   R8: 0000000080000040   R9: 0000000000002000
>>>>>     R10: 0000000000002000  R11: 00000000ffb2a618  R12: 0000000000000001
>>>>>     R13: ffff810407383ed8  R14: ffff810407383ed8  R15: ffff810405c21df8
>>>>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>>> --- <exception stack> ---
>>>>>  #6 [ffff810407383af0] .text.lock.spinlock at ffffffff80065cf3 (via _spin_lock_irqsave)
>>>>>  #7 [ffff810407383af0] rds_notify_queue_get at ffffffff8868c6a0
>>>>>  #8 [ffff810407383b60] rds_recvmsg at ffffffff8868cbb2
>>>>>  #9 [ffff810407383c10] sock_recvmsg at ffffffff80030ab8
>>>>> #10 [ffff810407383db0] sys_recvmsg at ffffffff8003d77f
>>>>> #11 [ffff810407383f50] compat_sys_socketcall at ffffffff8022b44b
>>>>> #12 [ffff810407383f80] cstar_do_call at ffffffff80062618
>>>>>
>>>>> What else do I need to provide you?
>>>>>
>>>>> --CQ



