[rds-devel] FW: RDS -- hanging kernel

Tang, Changqing changquing.tang at hp.com
Fri Apr 9 12:13:33 PDT 2010


Tina,
        I applied the patch from bug 2002 to rdma.c and rebuilt RDS, but it does NOT work
for me. The Pallas MPI test still hangs the first node at almost the same test location.

        Any idea?

--CQ

-----Original Message-----
From: Tang, Changqing
Sent: Friday, April 09, 2010 1:14 PM
To: 'Tina Yang'
Cc: Andy Grover; RDS Devel
Subject: RE: [rds-devel] FW: RDS -- hanging kernel

OK, the patch in bug 2002 is newer than OFED 1.5.1. Is this the one that fixes my problem?

--CQ

-----Original Message-----
From: Tina Yang [mailto:tina.yang at oracle.com]
Sent: Friday, April 09, 2010 12:49 PM
To: Tang, Changqing
Cc: Andy Grover; RDS Devel
Subject: Re: [rds-devel] FW: RDS -- hanging kernel

Looks like the signature of a known bug.
Go fetch the following recently submitted patches from the
kernel netdev archive:
[rds-devel] [PATCH 06/13] RDS: Fix send locking issue
[rds-devel] [PATCH 09/13] RDS: Fix locking in rds_send_drop_to()
[rds-devel] [PATCH 12/13] RDS: Do not call set_page_dirty() with irqs off
and
https://bugs.openfabrics.org/show_bug.cgi?id=2002



Tang, Changqing wrote:
> Tina,
>         I finally got the vmcore for the hanging kernel. I ran the Pallas MPI test over
> RDS on two nodes, with 5 ranks on the first node and 6 ranks on the second node.
>
>         After running for a while, during the MPI_Allreduce() test, the first node hangs. We
> managed to get the vmcore from this hanging node.
>
>         It shows the following:
>
> crash> bt
> PID: 9353   TASK: ffff810416f8c7b0  CPU: 1   COMMAND: "pmb.x.32"
>  #0 [ffff81042ff33e68] die_nmi at ffffffff8006634d
>  #1 [ffff81042ff33e80] die_nmi at ffffffff80066393
>  #2 [ffff81042ff33ea0] nmi_watchdog_tick at ffffffff80066af9
>  #3 [ffff81042ff33ef0] default_do_nmi at ffffffff80066717
>  #4 [ffff81042ff33f40] do_nmi at ffffffff80066984
>  #5 [ffff81042ff33f50] nmi at ffffffff80065fd7
>     [exception RIP: .text.lock.spinlock+17]
>     RIP: ffffffff80065cf3  RSP: ffff810407383af0  RFLAGS: 00000082
>     RAX: 0000000000000246  RBX: ffff810405c21b80  RCX: 0000000000002000
>     RDX: ffff810407383ed8  RSI: ffff810407383ed8  RDI: ffff810405c21df8
>     RBP: 0000000000000000   R8: 0000000080000040   R9: 0000000000002000
>     R10: 0000000000002000  R11: 00000000ffb2a618  R12: 0000000000000001
>     R13: ffff810407383ed8  R14: ffff810407383ed8  R15: ffff810405c21df8
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <exception stack> ---
>  #6 [ffff810407383af0] .text.lock.spinlock at ffffffff80065cf3 (via _spin_lock_irqsave)
>  #7 [ffff810407383af0] rds_notify_queue_get at ffffffff8868c6a0
>  #8 [ffff810407383b60] rds_recvmsg at ffffffff8868cbb2
>  #9 [ffff810407383c10] sock_recvmsg at ffffffff80030ab8
> #10 [ffff810407383db0] sys_recvmsg at ffffffff8003d77f
> #11 [ffff810407383f50] compat_sys_socketcall at ffffffff8022b44b
> #12 [ffff810407383f80] cstar_do_call at ffffffff80062618
>
> What else do I need to provide you?
>
> --CQ



