[rds-devel] FW: RDS -- hanging kernel
changquing.tang at hp.com
Mon Apr 26 14:58:47 PDT 2010
OK, thanks, Andy. How can we get the latest development code to test?
From: Andy Grover [mailto:andy.grover at oracle.com]
Sent: Monday, April 26, 2010 4:41 PM
To: Tang, Changqing
Cc: Tina Yang; RDS Devel
Subject: Re: [rds-devel] FW: RDS -- hanging kernel
On 04/23/2010 03:14 PM, Tang, Changqing wrote:
> Tina, if one of the nodes goes down and never recovers, the RDS IB
> connection to that node is down and never reconnects. What happens to
> the messages on other nodes destined for the down node? Do they sit
> in RDS forever, or are they dropped after some time?
Yes, the current behavior is to retry forever. See this bug:
so we would like RDS to get smarter. There's currently a patch in the
devel branch to destroy unconnected connections after 5 minutes.
Regards -- Andy
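The give-up policy described above (stop retrying once a connection has stayed down for 5 minutes) can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the devel-branch patch: struct conn_sketch, conn_should_destroy() and RECONNECT_GIVE_UP_SECS are hypothetical names, and the real RDS code tracks connection state differently.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical limit, taken from the "5 minutes" mentioned in the thread. */
#define RECONNECT_GIVE_UP_SECS (5 * 60)

/* Illustrative stand-in for a connection; NOT the kernel's struct rds_connection. */
struct conn_sketch {
    bool   connected;   /* true while the transport is up */
    time_t down_since;  /* time at which the connection last went down */
};

/*
 * Return true when the connection has been down for at least the
 * give-up interval, i.e. when retrying should stop and the connection
 * (with its queued messages) should be destroyed.
 */
static bool conn_should_destroy(const struct conn_sketch *c, time_t now)
{
    if (c->connected)
        return false;
    return difftime(now, c->down_since) >= RECONNECT_GIVE_UP_SECS;
}
```

A periodic worker could call conn_should_destroy() for each unconnected peer and drop the pending messages for any peer where it returns true, instead of retrying forever.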
> Thank you.
> -----Original Message-----
> From: Tina Yang [mailto:tina.yang at oracle.com]
> Sent: Saturday, April 10, 2010 12:03 AM
> To: Tang, Changqing
> Cc: Andy Grover; RDS Devel
> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
> Can you put the core dump on an FTP site where I can access it?
> Tang, Changqing wrote:
>> Tina, I applied the patch in bug 2002 to rdma.c and rebuilt RDS,
>> but it does NOT work for me. The Pallas MPI test still hangs the
>> first node at almost the same test location.
>> Any idea?
>> -----Original Message-----
>> From: Tang, Changqing
>> Sent: Friday, April 09, 2010 1:14 PM
>> To: 'Tina Yang'
>> Cc: Andy Grover; RDS Devel
>> Subject: RE: [rds-devel] FW: RDS -- hanging kernel
>> OK, the patches in bug 2002 are newer than OFED 1.5.1. Are these
>> the ones that fix my problem?
>> -----Original Message-----
>> From: Tina Yang [mailto:tina.yang at oracle.com]
>> Sent: Friday, April 09, 2010 12:49 PM
>> To: Tang, Changqing
>> Cc: Andy Grover; RDS Devel
>> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>> Looks like the signature of a known bug. Fetch the following
>> recently submitted patches from the kernel netdev archive:
>> [rds-devel][PATCH 06/13] RDS: Fix send locking issue
>> [rds-devel][PATCH 09/13] RDS: Fix locking in rds_send_drop_to()
>> [rds-devel][PATCH 12/13] RDS: Do not call set_page_dirty() with irqs off
>> and https://bugs.openfabrics.org/show_bug.cgi?id=2002
>> Tang, Changqing wrote:
>>> Tina, I finally got the vmcore for the hanging kernel. I ran the
>>> MPI Pallas test over RDS on two nodes, with 5 ranks on the first
>>> node and 6 ranks on the second node.
>>> After running for a while, during the MPI_Allreduce() test, the
>>> first node hangs. We managed to get the vmcore from this hanging
>>> node. It shows the following:
>>> crash> bt
>>> PID: 9353  TASK: ffff810416f8c7b0  CPU: 1  COMMAND: "pmb.x.32"
>>>  #0 [ffff81042ff33e68] die_nmi at ffffffff8006634d
>>>  #1 [ffff81042ff33e80] die_nmi at ffffffff80066393
>>>  #2 [ffff81042ff33ea0] nmi_watchdog_tick at ffffffff80066af9
>>>  #3 [ffff81042ff33ef0] default_do_nmi at ffffffff80066717
>>>  #4 [ffff81042ff33f40] do_nmi at ffffffff80066984
>>>  #5 [ffff81042ff33f50] nmi at ffffffff80065fd7
>>>     [exception RIP: .text.lock.spinlock+17]
>>>     RIP: ffffffff80065cf3  RSP: ffff810407383af0  RFLAGS: 00000082
>>>     RAX: 0000000000000246  RBX: ffff810405c21b80  RCX: 0000000000002000
>>>     RDX: ffff810407383ed8  RSI: ffff810407383ed8  RDI: ffff810405c21df8
>>>     RBP: 0000000000000000  R8:  0000000080000040  R9:  0000000000002000
>>>     R10: 0000000000002000  R11: 00000000ffb2a618  R12: 0000000000000001
>>>     R13: ffff810407383ed8  R14: ffff810407383ed8  R15: ffff810405c21df8
>>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>> --- <exception stack> ---
>>>  #6 [ffff810407383af0] .text.lock.spinlock at ffffffff80065cf3 (via _spin_lock_irqsave)
>>>  #7 [ffff810407383af0] rds_notify_queue_get at ffffffff8868c6a0
>>>  #8 [ffff810407383b60] rds_recvmsg at ffffffff8868cbb2
>>>  #9 [ffff810407383c10] sock_recvmsg at ffffffff80030ab8
>>> #10 [ffff810407383db0] sys_recvmsg at ffffffff8003d77f
>>> #11 [ffff810407383f50] compat_sys_socketcall at ffffffff8022b44b
>>> #12 [ffff810407383f80] cstar_do_call at
>>> What else do I need to provide you?