[rds-devel] FW: RDS -- hanging kernel
Tang, Changqing
changquing.tang at hp.com
Fri Apr 9 11:13:31 PDT 2010
OK, the ones in bug 2002 are newer than OFED 1.5.1. Are they the ones that fix my problem?
--CQ
-----Original Message-----
From: Tina Yang [mailto:tina.yang at oracle.com]
Sent: Friday, April 09, 2010 12:49 PM
To: Tang, Changqing
Cc: Andy Grover; RDS Devel
Subject: Re: [rds-devel] FW: RDS -- hanging kernel
Looks like the signature of a known bug.
Go fetch the following recently submitted patches from the
kernel netdev archive:
[rds-devel][PATCH 06/13] RDS: Fix send locking issue
[rds-devel][PATCH 09/13] RDS: Fix locking in rds_send_drop_to()
[rds-devel] [PATCH 12/13] RDS: Do not call set_page_dirty() with irqs off
and
https://bugs.openfabrics.org/show_bug.cgi?id=2002
Tang, Changqing wrote:
> Tina,
> I eventually got the vmcore for the hanging kernel. I ran the MPI Pallas test over
> RDS on two nodes, with 5 ranks on the first node and 6 ranks on the second node.
>
> After running for a while, during the MPI_Allreduce() test, the first node hung. We
> managed to get the vmcore from this hung node.
>
> It shows the following:
>
> crash> bt
> PID: 9353 TASK: ffff810416f8c7b0 CPU: 1 COMMAND: "pmb.x.32"
> #0 [ffff81042ff33e68] die_nmi at ffffffff8006634d
> #1 [ffff81042ff33e80] die_nmi at ffffffff80066393
> #2 [ffff81042ff33ea0] nmi_watchdog_tick at ffffffff80066af9
> #3 [ffff81042ff33ef0] default_do_nmi at ffffffff80066717
> #4 [ffff81042ff33f40] do_nmi at ffffffff80066984
> #5 [ffff81042ff33f50] nmi at ffffffff80065fd7
> [exception RIP: .text.lock.spinlock+17]
> RIP: ffffffff80065cf3 RSP: ffff810407383af0 RFLAGS: 00000082
> RAX: 0000000000000246 RBX: ffff810405c21b80 RCX: 0000000000002000
> RDX: ffff810407383ed8 RSI: ffff810407383ed8 RDI: ffff810405c21df8
> RBP: 0000000000000000 R8: 0000000080000040 R9: 0000000000002000
> R10: 0000000000002000 R11: 00000000ffb2a618 R12: 0000000000000001
> R13: ffff810407383ed8 R14: ffff810407383ed8 R15: ffff810405c21df8
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <exception stack> ---
> #6 [ffff810407383af0] .text.lock.spinlock at ffffffff80065cf3 (via _spin_lock_irqsave)
> #7 [ffff810407383af0] rds_notify_queue_get at ffffffff8868c6a0
> #8 [ffff810407383b60] rds_recvmsg at ffffffff8868cbb2
> #9 [ffff810407383c10] sock_recvmsg at ffffffff80030ab8
> #10 [ffff810407383db0] sys_recvmsg at ffffffff8003d77f
> #11 [ffff810407383f50] compat_sys_socketcall at ffffffff8022b44b
> #12 [ffff810407383f80] cstar_do_call at ffffffff80062618
>
> What else do I need to provide you?
>
> --CQ