[rds-devel] FW: RDS -- hanging kernel

Tang, Changqing changquing.tang at hp.com
Wed Apr 14 21:29:35 PDT 2010


I have the 500 MB kernel core file, but I don't have anywhere to host it for you to download. Do you have a place that I can easily upload it to?

--CQ

-----Original Message-----
From: Tina Yang [mailto:tina.yang at oracle.com]
Sent: Saturday, April 10, 2010 12:03 AM
To: Tang, Changqing
Cc: Andy Grover; RDS Devel
Subject: Re: [rds-devel] FW: RDS -- hanging kernel

Can you put the core dump on an FTP site where I
can access it?



Tang, Changqing wrote:
> Tina,
>         I applied the patch from bug 2002 to rdma.c and rebuilt RDS, but it does NOT work
> for me; the Pallas MPI test still hangs the first node at almost the same test location.
>
>         Any idea?
>
> --CQ
>
> -----Original Message-----
> From: Tang, Changqing
> Sent: Friday, April 09, 2010 1:14 PM
> To: 'Tina Yang'
> Cc: Andy Grover; RDS Devel
> Subject: RE: [rds-devel] FW: RDS -- hanging kernel
>
> OK, the patch in bug 2002 is newer than OFED 1.5.1. Is this the one that fixes my problem?
>
> --CQ
>
> -----Original Message-----
> From: Tina Yang [mailto:tina.yang at oracle.com]
> Sent: Friday, April 09, 2010 12:49 PM
> To: Tang, Changqing
> Cc: Andy Grover; RDS Devel
> Subject: Re: [rds-devel] FW: RDS -- hanging kernel
>
> Looks like the signature of a known bug.
> Go fetch the following recently submitted patches from
> the kernel netdev archive,
> [rds-devel][PATCH 06/13] RDS: Fix send locking issue
> [rds-devel][PATCH 09/13] RDS: Fix locking in rds_send_drop_to()
> [rds-devel][PATCH 12/13] RDS: Do not call set_page_dirty() with irqs off
> and
> https://bugs.openfabrics.org/show_bug.cgi?id=2002
>
>
>
> Tang, Changqing wrote:
>
>> Tina,
>>         I finally got the vmcore for the hanging kernel. I ran the Pallas MPI test over
>> RDS on two nodes, with 5 ranks on the first node and 6 ranks on the second node.
>>
>>         After running for a while, during the MPI_Allreduce() test, the first node hangs. We
>> managed to get the vmcore from this hanging node.
>>
>>         It shows the following:
>>
>> crash> bt
>> PID: 9353   TASK: ffff810416f8c7b0  CPU: 1   COMMAND: "pmb.x.32"
>>  #0 [ffff81042ff33e68] die_nmi at ffffffff8006634d
>>  #1 [ffff81042ff33e80] die_nmi at ffffffff80066393
>>  #2 [ffff81042ff33ea0] nmi_watchdog_tick at ffffffff80066af9
>>  #3 [ffff81042ff33ef0] default_do_nmi at ffffffff80066717
>>  #4 [ffff81042ff33f40] do_nmi at ffffffff80066984
>>  #5 [ffff81042ff33f50] nmi at ffffffff80065fd7
>>     [exception RIP: .text.lock.spinlock+17]
>>     RIP: ffffffff80065cf3  RSP: ffff810407383af0  RFLAGS: 00000082
>>     RAX: 0000000000000246  RBX: ffff810405c21b80  RCX: 0000000000002000
>>     RDX: ffff810407383ed8  RSI: ffff810407383ed8  RDI: ffff810405c21df8
>>     RBP: 0000000000000000   R8: 0000000080000040   R9: 0000000000002000
>>     R10: 0000000000002000  R11: 00000000ffb2a618  R12: 0000000000000001
>>     R13: ffff810407383ed8  R14: ffff810407383ed8  R15: ffff810405c21df8
>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>> --- <exception stack> ---
>>  #6 [ffff810407383af0] .text.lock.spinlock at ffffffff80065cf3 (via _spin_lock_irqsave)
>>  #7 [ffff810407383af0] rds_notify_queue_get at ffffffff8868c6a0
>>  #8 [ffff810407383b60] rds_recvmsg at ffffffff8868cbb2
>>  #9 [ffff810407383c10] sock_recvmsg at ffffffff80030ab8
>> #10 [ffff810407383db0] sys_recvmsg at ffffffff8003d77f
>> #11 [ffff810407383f50] compat_sys_socketcall at ffffffff8022b44b
>> #12 [ffff810407383f80] cstar_do_call at ffffffff80062618
>>
>> What else do I need to provide you?
>>
>> --CQ
>>
>>
>>
>>
>>
>>
>
>



