[rds-devel] FW: RDS -- hanging kernel

Andy Grover andy.grover at oracle.com
Mon Apr 26 15:17:01 PDT 2010


On 04/26/2010 02:58 PM, Tang, Changqing wrote:
> OK, thanks, Andy. How can we get the latest development code to test?

The current devel branch is here:

http://www.openfabrics.org/git/?p=~agrover/ofed_1_5/linux-2.6.git;a=shortlog;h=devel

This is what we're working on for OFED 1.6. If you're interested in the 
connection-killing stuff, you should be able to pull that patch out and 
apply it to 1.5.1 relatively easily. Look for "RDS: Destroy unconnected 
connections after 5 minutes".
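
As a rough illustration of the idea behind that patch title (this is not
the kernel code, only a user-space sketch), the policy amounts to
remembering when a connection went down and destroying it once it has
stayed down past a limit. The names Conn, reap_unconnected and
GIVE_UP_SECS are hypothetical, and dropping queued messages along with
the connection is an assumption of the sketch:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical constant: the "5 minutes" from the patch title, in seconds.
GIVE_UP_SECS = 5 * 60


@dataclass
class Conn:
    """Toy stand-in for a per-peer connection (not the kernel structure)."""
    peer: str
    connected: bool = False
    down_since: Optional[float] = None      # time the connection went down
    pending: List[bytes] = field(default_factory=list)  # queued messages


def reap_unconnected(conns: Dict[str, Conn], now: float,
                     limit: float = GIVE_UP_SECS) -> List[str]:
    """Destroy connections that have been down for at least `limit` seconds.

    Returns the peers that were reaped. Assumption: messages still queued
    on a reaped connection are discarded together with it.
    """
    reaped = []
    for peer, conn in list(conns.items()):
        if conn.connected or conn.down_since is None:
            continue  # healthy, or never marked down
        if now - conn.down_since >= limit:
            del conns[peer]
            reaped.append(peer)
    return reaped
```

Called periodically with the current time, this stops the retry-forever
behavior for peers that never come back, while connections that
reconnect before the limit are left alone.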

Regards -- Andy

>
> --CQ
>
> -----Original Message-----
> From: Andy Grover [mailto:andy.grover at oracle.com]
> Sent: Monday, April 26, 2010 4:41 PM
> To: Tang, Changqing
> Cc: Tina Yang; RDS Devel
> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>
> On 04/23/2010 03:14 PM, Tang, Changqing wrote:
>> Tina, if one of the nodes goes down and never recovers, the RDS IB
>> connection to that node stays down and is never re-established. What
>> happens to the messages queued on other nodes for the down node? Do
>> they sit in RDS forever, or are they dropped after some time?
>
> Yes, the current behavior is to retry forever. See this bug:
>
> https://bugs.openfabrics.org/show_bug.cgi?id=1355
>
> So we would like RDS to get smarter. There's currently a patch in the
> devel branch to destroy unconnected connections after 5 minutes.
>
> Regards -- Andy
>
>>
>> Thank you.
>>
>> --CQ
>>
>> -----Original Message-----
>> From: Tina Yang [mailto:tina.yang at oracle.com]
>> Sent: Saturday, April 10, 2010 12:03 AM
>> To: Tang, Changqing
>> Cc: Andy Grover; RDS Devel
>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>>
>> Can you put the core dump on an FTP site where I can access it?
>>
>>
>>
>> Tang, Changqing wrote:
>>> Tina, I applied the patch from bug 2002 to rdma.c and rebuilt RDS,
>>> but it does NOT work for me: the Pallas MPI test still hangs the
>>> first node at almost the same point in the test.
>>>
>>> Any idea?
>>>
>>> --CQ
>>>
>>> -----Original Message-----
>>> From: Tang, Changqing
>>> Sent: Friday, April 09, 2010 1:14 PM
>>> To: 'Tina Yang'
>>> Cc: Andy Grover; RDS Devel
>>> Subject: RE: [rds-devel] FW: RDS -- hanging kernel
>>>
>>> OK, the patch in bug 2002 is newer than OFED 1.5.1. Is this the one
>>> that fixes my problem?
>>>
>>> --CQ
>>>
>>> -----Original Message-----
>>> From: Tina Yang [mailto:tina.yang at oracle.com]
>>> Sent: Friday, April 09, 2010 12:49 PM
>>> To: Tang, Changqing
>>> Cc: Andy Grover; RDS Devel
>>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>>>
>>> Looks like the signature of a known bug. Go fetch the following
>>> recently submitted patches from kernel netdev archive,
>>>   [rds-devel][PATCH 06/13] RDS: Fix send locking issue
>>>   [rds-devel][PATCH 09/13] RDS: Fix locking in rds_send_drop_to()
>>>   [rds-devel][PATCH 12/13] RDS: Do not call set_page_dirty() with irqs off
>>> and https://bugs.openfabrics.org/show_bug.cgi?id=2002
>>>
>>>
>>>
>>> Tang, Changqing wrote:
>>>
>>>> Tina, I eventually got the vmcore for the hanging kernel. I ran
>>>> the MPI Pallas test over RDS on two nodes, with 5 ranks on the
>>>> first node and 6 ranks on the second node.
>>>>
>>>> After running for a while, during the MPI_Allreduce() test, the
>>>> first node hangs. We managed to get the vmcore from this hanging
>>>> node.
>>>>
>>>> It shows the following:
>>>>
>>>> crash> bt
>>>> PID: 9353   TASK: ffff810416f8c7b0  CPU: 1   COMMAND: "pmb.x.32"
>>>>  #0 [ffff81042ff33e68] die_nmi at ffffffff8006634d
>>>>  #1 [ffff81042ff33e80] die_nmi at ffffffff80066393
>>>>  #2 [ffff81042ff33ea0] nmi_watchdog_tick at ffffffff80066af9
>>>>  #3 [ffff81042ff33ef0] default_do_nmi at ffffffff80066717
>>>>  #4 [ffff81042ff33f40] do_nmi at ffffffff80066984
>>>>  #5 [ffff81042ff33f50] nmi at ffffffff80065fd7
>>>>     [exception RIP: .text.lock.spinlock+17]
>>>>     RIP: ffffffff80065cf3  RSP: ffff810407383af0  RFLAGS: 00000082
>>>>     RAX: 0000000000000246  RBX: ffff810405c21b80  RCX: 0000000000002000
>>>>     RDX: ffff810407383ed8  RSI: ffff810407383ed8  RDI: ffff810405c21df8
>>>>     RBP: 0000000000000000   R8: 0000000080000040   R9: 0000000000002000
>>>>     R10: 0000000000002000  R11: 00000000ffb2a618  R12: 0000000000000001
>>>>     R13: ffff810407383ed8  R14: ffff810407383ed8  R15: ffff810405c21df8
>>>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>> --- <exception stack> ---
>>>>  #6 [ffff810407383af0] .text.lock.spinlock at ffffffff80065cf3 (via _spin_lock_irqsave)
>>>>  #7 [ffff810407383af0] rds_notify_queue_get at ffffffff8868c6a0
>>>>  #8 [ffff810407383b60] rds_recvmsg at ffffffff8868cbb2
>>>>  #9 [ffff810407383c10] sock_recvmsg at ffffffff80030ab8
>>>> #10 [ffff810407383db0] sys_recvmsg at ffffffff8003d77f
>>>> #11 [ffff810407383f50] compat_sys_socketcall at ffffffff8022b44b
>>>> #12 [ffff810407383f80] cstar_do_call at ffffffff80062618
>>>>
>>>> What else do I need to provide you?
>>>>
>>>> --CQ
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>



