[rds-devel] FW: RDS -- kernel dump

Tang, Changqing changquing.tang at hp.com
Fri Apr 9 12:02:27 PDT 2010


Tina,
        Here is another situation, this time with a kernel core dump.

        I made some changes to the RDS code and built a new rds.ko file. I copied
the file directly to /lib/modules/2.6.18-128.1.14.el5.8hp/updates/kernel/net/rds/ to replace the
old one and then ran '/etc/init.d/openibd restart'. The kernel dumped core with the following
backtrace:

crash> bt
PID: 14623  TASK: ffff8103bf63c7b0  CPU: 0   COMMAND: "rmmod"
 #0 [ffff8103a0a63c50] die at ffffffff8006cc4d
 #1 [ffff8103a0a63c80] do_invalid_op at ffffffff8006d20d
 #2 [ffff8103a0a63d40] error_exit at ffffffff8005ede9
    [exception RIP: rds_conn_destroy+161]
    RIP: ffffffff88694a42  RSP: ffff8103a0a63df8  RFLAGS: 00010216
    RAX: ffff8103eef4d2c8  RBX: ffff810413839aa8  RCX: ffff8103eef4d2d8
    RDX: ffff8103dedc0ed8  RSI: 00000000000000ff  RDI: ffff8103eef4d2c0
    RBP: ffff8103dedc0ec0   R8: 0000000000000004   R9: 000000000000003c
    R10: ffffffff803d6260  R11: 0000000000000282  R12: ffff81041c6cdc20
    R13: 00007fffd8e778e0  R14: 00007fffd8e77930  R15: 0000000000000880
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #3 [ffff8103a0a63df0] rds_conn_destroy at ffffffff88694a13
 #4 [ffff8103a0a63e10] __rds_ib_destroy_conns at ffffffff886be878
 #5 [ffff8103a0a63e40] rds_ib_remove_one at ffffffff886ba3ef
 #6 [ffff8103a0a63e60] ib_unregister_device at ffffffff88434680
 #7 [ffff8103a0a63e90] mlx4_ib_remove at ffffffff884d9fc5
 #8 [ffff8103a0a63eb0] mlx4_remove_device at ffffffff882adc97
 #9 [ffff8103a0a63ed0] mlx4_unregister_interface at ffffffff882add5a
#10 [ffff8103a0a63ef0] cleanup_module at ffffffff884e12f4
#11 [ffff8103a0a63f00] sys_delete_module at ffffffff800a4cfa
#12 [ffff8103a0a63f80] tracesys at ffffffff8005e28d (via system_call)
    RIP: 00002af40c7180f7  RSP: 00007fffd8e778d8  RFLAGS: 00000206
    RAX: ffffffffffffffda  RBX: ffffffff8005e28d  RCX: ffffffffffffffff
    RDX: 0000000000000fdf  RSI: 0000000000000880  RDI: 00007fffd8e778e0
    RBP: 0000000000000003   R8: 0000000000b1c010   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000206  R12: 00007fffd8e7a1b0
    R13: 00007fffd8e7a130  R14: 00000000d8e77930  R15: ffff8103a0a63f64
    ORIG_RAX: 00000000000000b0  CS: 0033  SS: 002b
crash>

What is this problem? After the system reboots, RDS works fine, which means
the new rds.ko works.

--CQ

-----Original Message-----
From: Tina Yang [mailto:tina.yang at oracle.com]
Sent: Friday, April 09, 2010 12:49 PM
To: Tang, Changqing
Cc: Andy Grover; RDS Devel
Subject: Re: [rds-devel] FW: RDS -- hanging kernel

Looks like the signature of a known bug.
Go fetch the following recently submitted patches from the
kernel netdev archive:
[rds-devel][PATCH 06/13] RDS: Fix send locking issue
[rds-devel][PATCH 09/13] RDS: Fix locking in rds_send_drop_to()
[rds-devel][PATCH 12/13] RDS: Do not call set_page_dirty() with irqs off
and
https://bugs.openfabrics.org/show_bug.cgi?id=2002



Tang, Changqing wrote:
> Tina,
>         I eventually got the vmcore for the hanging kernel. I ran the MPI Pallas test over
> RDS on two nodes, with 5 ranks on the first node and 6 ranks on the second node.
>
>         After running for a while, the first node hung during the MPI_Allreduce() test. We
> managed to get the vmcore from this hanging node.
>
>         It shows the following:
>
> crash> bt
> PID: 9353   TASK: ffff810416f8c7b0  CPU: 1   COMMAND: "pmb.x.32"
>  #0 [ffff81042ff33e68] die_nmi at ffffffff8006634d
>  #1 [ffff81042ff33e80] die_nmi at ffffffff80066393
>  #2 [ffff81042ff33ea0] nmi_watchdog_tick at ffffffff80066af9
>  #3 [ffff81042ff33ef0] default_do_nmi at ffffffff80066717
>  #4 [ffff81042ff33f40] do_nmi at ffffffff80066984
>  #5 [ffff81042ff33f50] nmi at ffffffff80065fd7
>     [exception RIP: .text.lock.spinlock+17]
>     RIP: ffffffff80065cf3  RSP: ffff810407383af0  RFLAGS: 00000082
>     RAX: 0000000000000246  RBX: ffff810405c21b80  RCX: 0000000000002000
>     RDX: ffff810407383ed8  RSI: ffff810407383ed8  RDI: ffff810405c21df8
>     RBP: 0000000000000000   R8: 0000000080000040   R9: 0000000000002000
>     R10: 0000000000002000  R11: 00000000ffb2a618  R12: 0000000000000001
>     R13: ffff810407383ed8  R14: ffff810407383ed8  R15: ffff810405c21df8
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <exception stack> ---
>  #6 [ffff810407383af0] .text.lock.spinlock at ffffffff80065cf3 (via _spin_lock_irqsave)
>  #7 [ffff810407383af0] rds_notify_queue_get at ffffffff8868c6a0
>  #8 [ffff810407383b60] rds_recvmsg at ffffffff8868cbb2
>  #9 [ffff810407383c10] sock_recvmsg at ffffffff80030ab8
> #10 [ffff810407383db0] sys_recvmsg at ffffffff8003d77f
> #11 [ffff810407383f50] compat_sys_socketcall at ffffffff8022b44b
> #12 [ffff810407383f80] cstar_do_call at ffffffff80062618
>
> What else do I need to provide you?
>
> --CQ
>
>
>
>
>



