[rds-devel] FW: RDS -- hanging kernel

Tang, Changqing changquing.tang at hp.com
Fri Apr 23 15:14:03 PDT 2010


Tina,
        If one of the nodes is down and never recovers, the RDS IB connection to that node
is down and never reconnected. What happens to the messages other nodes have queued for
the down node? Do they sit in RDS forever, or do they get dropped after some time?
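
        For what it's worth, I know the application can drop its own queued messages
with the RDS_CANCEL_SENT_TO socket option; a minimal sketch of that, assuming
<linux/rds.h> provides SOL_RDS and RDS_CANCEL_SENT_TO (error handling trimmed):

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <linux/rds.h>

/* Drop every message this RDS socket still has queued for the given
 * peer address/port, instead of leaving it sitting in the transport. */
static int cancel_sent_to(int rds_fd, const char *peer_ip, unsigned short port)
{
        struct sockaddr_in sin;

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(port);
        sin.sin_addr.s_addr = inet_addr(peer_ip);

        return setsockopt(rds_fd, SOL_RDS, RDS_CANCEL_SENT_TO,
                          &sin, sizeof(sin));
}

        But my question is what RDS itself does with those messages when the peer
never comes back.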

        Thank you.

--CQ

-----Original Message-----
From: Tina Yang [mailto:tina.yang at oracle.com]
Sent: Saturday, April 10, 2010 12:03 AM
To: Tang, Changqing
Cc: Andy Grover; RDS Devel
Subject: Re: [rds-devel] FW: RDS -- hanging kernel

Can you put the core dump on an ftp site where I
can access it?



Tang, Changqing wrote:
> Tina,
>         I applied the patch from bug 2002 to rdma.c and rebuilt RDS, but it does NOT work
> for me; the Pallas MPI test still hangs the first node at almost the same test location.
>
>         Any idea?
>
> --CQ
>
> -----Original Message-----
> From: Tang, Changqing
> Sent: Friday, April 09, 2010 1:14 PM
> To: 'Tina Yang'
> Cc: Andy Grover; RDS Devel
> Subject: RE: [rds-devel] FW: RDS -- hanging kernel
>
> OK, the patch in bug 2002 is newer than OFED 1.5.1; is this the one that fixes my problem?
>
> --CQ
>
> -----Original Message-----
> From: Tina Yang [mailto:tina.yang at oracle.com]
> Sent: Friday, April 09, 2010 12:49 PM
> To: Tang, Changqing
> Cc: Andy Grover; RDS Devel
> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>
> Looks like the signature of a known bug.
> Fetch the following recently submitted patches from the
> kernel netdev archive:
> [rds-devel] [PATCH 06/13] RDS: Fix send locking issue
> [rds-devel] [PATCH 09/13] RDS: Fix locking in rds_send_drop_to()
> [rds-devel] [PATCH 12/13] RDS: Do not call set_page_dirty() with irqs off
> and
> https://bugs.openfabrics.org/show_bug.cgi?id=2002
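>
> For the archives, the set_page_dirty() patch addresses the pattern of
> dirtying pages while a spinlock is held with interrupts disabled.  The
> shape of the fix is roughly as below -- illustrative names only, not the
> actual patch (see the netdev postings for the real thing):
>
> /* Buggy shape: set_page_dirty() runs with irqs off under the lock. */
> spin_lock_irqsave(&q->lock, flags);
> set_page_dirty(sg_page(&ro->op_sg[i]));      /* BAD: irqs are off */
> spin_unlock_irqrestore(&q->lock, flags);
>
> /* Fixed shape: detach work under the lock, dirty pages afterwards. */
> spin_lock_irqsave(&q->lock, flags);
> list_move_tail(&ro->op_list, &to_dirty);     /* only unlink here */
> spin_unlock_irqrestore(&q->lock, flags);
>
> list_for_each_entry(ro, &to_dirty, op_list)
>         for (i = 0; i < ro->op_nents; i++)
>                 set_page_dirty(sg_page(&ro->op_sg[i]));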
>
>
>
> Tang, Changqing wrote:
>
>> Tina,
>>         I eventually got the vmcore for the hanging kernel. I ran the Pallas MPI test over
>> RDS on two nodes, with 5 ranks on the first node and 6 ranks on the second.
>>
>>         After running for a while, during the MPI_Allreduce() test, the first node hung. We
>> managed to get the vmcore from this hanging node.
>>
>>         It shows the following:
>>
>> crash> bt
>> PID: 9353   TASK: ffff810416f8c7b0  CPU: 1   COMMAND: "pmb.x.32"
>>  #0 [ffff81042ff33e68] die_nmi at ffffffff8006634d
>>  #1 [ffff81042ff33e80] die_nmi at ffffffff80066393
>>  #2 [ffff81042ff33ea0] nmi_watchdog_tick at ffffffff80066af9
>>  #3 [ffff81042ff33ef0] default_do_nmi at ffffffff80066717
>>  #4 [ffff81042ff33f40] do_nmi at ffffffff80066984
>>  #5 [ffff81042ff33f50] nmi at ffffffff80065fd7
>>     [exception RIP: .text.lock.spinlock+17]
>>     RIP: ffffffff80065cf3  RSP: ffff810407383af0  RFLAGS: 00000082
>>     RAX: 0000000000000246  RBX: ffff810405c21b80  RCX: 0000000000002000
>>     RDX: ffff810407383ed8  RSI: ffff810407383ed8  RDI: ffff810405c21df8
>>     RBP: 0000000000000000   R8: 0000000080000040   R9: 0000000000002000
>>     R10: 0000000000002000  R11: 00000000ffb2a618  R12: 0000000000000001
>>     R13: ffff810407383ed8  R14: ffff810407383ed8  R15: ffff810405c21df8
>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>> --- <exception stack> ---
>>  #6 [ffff810407383af0] .text.lock.spinlock at ffffffff80065cf3 (via _spin_lock_irqsave)
>>  #7 [ffff810407383af0] rds_notify_queue_get at ffffffff8868c6a0
>>  #8 [ffff810407383b60] rds_recvmsg at ffffffff8868cbb2
>>  #9 [ffff810407383c10] sock_recvmsg at ffffffff80030ab8
>> #10 [ffff810407383db0] sys_recvmsg at ffffffff8003d77f
>> #11 [ffff810407383f50] compat_sys_socketcall at ffffffff8022b44b
>> #12 [ffff810407383f80] cstar_do_call at ffffffff80062618
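>>
>> If I read this right, the NMI watchdog fired because CPU 1 was spinning
>> in _spin_lock_irqsave(), called from rds_notify_queue_get(), with
>> interrupts off.  A lock recursion of roughly this shape would produce
>> exactly that signature (illustrative only, not taken from the RDS
>> source):
>>
>> spin_lock_irqsave(&rs->rs_lock, flags);     /* first acquire succeeds */
>> /* ... path re-enters and takes the same lock again ... */
>> spin_lock_irqsave(&rs->rs_lock, flags2);    /* spins forever with irqs
>>                                              * off; watchdog fires    */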
>>
>> What else do I need to provide?
>>
>> --CQ



