[rds-devel] FW: RDS -- hanging kernel

Mon Apr 26 14:58:47 PDT 2010

OK, thanks, Andy. How can we get the latest development code to test?

--CQ

-----Original Message-----
From: Andy Grover [mailto:andy.grover at oracle.com]
Sent: Monday, April 26, 2010 4:41 PM
To: Tang, Changqing
Cc: Tina Yang; RDS Devel
Subject: Re: [rds-devel] FW: RDS -- hanging kernel

On 04/23/2010 03:14 PM, Tang, Changqing wrote:
> Tina, if one of the nodes goes down and never recovers, the RDS IB
> connection to that node is down and never reconnects. What happens to
> the messages on other nodes destined for the down node? Do they sit in
> RDS forever, or do they get dropped after some time?

Yes, the current behavior is to retry forever. See this bug:

https://bugs.openfabrics.org/show_bug.cgi?id=1355

so we would like RDS to get smarter. There's currently a patch in the
devel branch to destroy unconnected connections after 5 minutes.

Regards -- Andy

>
> Thank you.
>
> --CQ
>
> -----Original Message-----
> From: Tina Yang [mailto:tina.yang at oracle.com]
> Sent: Saturday, April 10, 2010 12:03 AM
> To: Tang, Changqing
> Cc: Andy Grover; RDS Devel
> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>
> Can you put the core dump on an FTP site where I can access it?
>
>
>
> Tang, Changqing wrote:
>> Tina, I applied the patch from bug 2002 to rdma.c and rebuilt RDS.
>> It does NOT work for me: the Pallas MPI test still hangs the first
>> node at almost the same test location.
>>
>> Any idea?
>>
>> --CQ
>>
>> -----Original Message-----
>> From: Tang, Changqing
>> Sent: Friday, April 09, 2010 1:14 PM
>> To: 'Tina Yang'
>> Cc: Andy Grover; RDS Devel
>> Subject: RE: [rds-devel] FW: RDS -- hanging kernel
>>
>> OK, the patch in bug 2002 is newer than OFED 1.5.1. Is this the one
>> that fixes my problem?
>>
>> --CQ
>>
>> -----Original Message-----
>> From: Tina Yang [mailto:tina.yang at oracle.com]
>> Sent: Friday, April 09, 2010 12:49 PM
>> To: Tang, Changqing
>> Cc: Andy Grover; RDS Devel
>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>>
>> Looks like the signature of a known bug. Go fetch the following
>> recently submitted patches from the kernel netdev archive:
>>
>> [rds-devel][PATCH 06/13] RDS: Fix send locking issue
>> [rds-devel][PATCH 09/13] RDS: Fix locking in rds_send_drop_to()
>> [rds-devel][PATCH 12/13] RDS: Do not call set_page_dirty() with irqs off
>>
>> and https://bugs.openfabrics.org/show_bug.cgi?id=2002
>>
>>
>>
>> Tang, Changqing wrote:
>>
>>> Tina, I eventually got the vmcore for the hanging kernel. I ran
>>> the MPI Pallas test over RDS on two nodes, with 5 ranks on the
>>> first node and 6 ranks on the second node.
>>>
>>> After running for a while, during the MPI_Allreduce() test, the
>>> first node hangs. We managed to get the vmcore from this hanging
>>> node.
>>>
>>> It shows the following:
>>>
>>> crash> bt
>>> PID: 9353   TASK: ffff810416f8c7b0  CPU: 1   COMMAND: "pmb.x.32"
>>>  #0 [ffff81042ff33e68] die_nmi at ffffffff8006634d
>>>  #1 [ffff81042ff33e80] die_nmi at ffffffff80066393
>>>  #2 [ffff81042ff33ea0] nmi_watchdog_tick at ffffffff80066af9
>>>  #3 [ffff81042ff33ef0] default_do_nmi at ffffffff80066717
>>>  #4 [ffff81042ff33f40] do_nmi at ffffffff80066984
>>>  #5 [ffff81042ff33f50] nmi at ffffffff80065fd7
>>>     [exception RIP: .text.lock.spinlock+17]
>>>     RIP: ffffffff80065cf3  RSP: ffff810407383af0  RFLAGS: 00000082
>>>     RAX: 0000000000000246  RBX: ffff810405c21b80  RCX: 0000000000002000
>>>     RDX: ffff810407383ed8  RSI: ffff810407383ed8  RDI: ffff810405c21df8
>>>     RBP: 0000000000000000   R8: 0000000080000040   R9: 0000000000002000
>>>     R10: 0000000000002000  R11: 00000000ffb2a618  R12: 0000000000000001
>>>     R13: ffff810407383ed8  R14: ffff810407383ed8  R15: ffff810405c21df8
>>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>> --- <exception stack> ---
>>>  #6 [ffff810407383af0] .text.lock.spinlock at ffffffff80065cf3 (via _spin_lock_irqsave)
>>>  #7 [ffff810407383af0] rds_notify_queue_get at ffffffff8868c6a0
>>>  #8 [ffff810407383b60] rds_recvmsg at ffffffff8868cbb2
>>>  #9 [ffff810407383c10] sock_recvmsg at ffffffff80030ab8
>>> #10 [ffff810407383db0] sys_recvmsg at ffffffff8003d77f
>>> #11 [ffff810407383f50] compat_sys_socketcall at ffffffff8022b44b
>>> #12 [ffff810407383f80] cstar_do_call at ffffffff80062618
>>>
>>> What else do I need to provide you?
>>>
>>> --CQ
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>