[rds-devel] Re: trying to reproduce the crash

Or Gerlitz ogerlitz at voltaire.com
Sun Feb 3 03:02:13 PST 2008


> Olaf Kirch wrote:
>  a.    when running with -d4 -t64 -T3 -D64k, I often get a crash
>     in ib_fmr_map_phys - always in the same location in mthca_arbel_map_phys. 


Hi Olaf,

I ran rds-stress with the options above on the same device you are 
using, an MT25204 (but with the latest firmware, 1.2.0), as both client 
and server, and I could not reproduce the crash. The code is OFED 1.3 
rc3, and the second node is a ConnectX. It would be best if you could 
share a script that reproduces the crash when run.

Or.

Other than not crashing, I did see some problems, specifically an 
atomic order-0 page allocation failure in rds_ib_recv_refill:

> The following is only an harmless informational message.
> Unless you get a _continuous_flood_ of these messages it means
> everything is working fine. Allocations from irqs cannot be
> perfectly reliable and the kernel is designed to handle that.
> kswapd0: page allocation failure. order:0, mode:0x22
> 
> Call Trace: <IRQ> <ffffffff8016334b>{__alloc_pages+727}
>        <ffffffff8837c207>{:rds:rds_ib_recv_refill+232} <ffffffff8837cc08>{:rds:rds_ib_recv_cq_comp_handler+1647}
>        <ffffffff88147a91>{:mlx4_core:mlx4_eq_int+44} <ffffffff88147c92>{:mlx4_core:mlx4_msi_x_interrupt+15}
>        <ffffffff80137767>{__do_softirq+95} <ffffffff8015afbc>{handle_IRQ_event+41}
>        <ffffffff8015b086>{__do_IRQ+153} <ffffffff8010d430>{do_IRQ+59}
>        <ffffffff8010b25a>{ret_from_intr+0} <EOI> <ffffffff802da558>{_spin_lock+3}
>        <ffffffff801716f8>{page_check_address+173} <ffffffff80171eda>{page_referenced_one+97}
>        <ffffffff80172230>{page_referenced+135} <ffffffff80172217>{page_referenced+110}
>        <ffffffff80166776>{shrink_zone+562} <ffffffff801677d0>{balance_pgdat+530}
>        <ffffffff8016819e>{kswapd+308} <ffffffff80145aa2>{autoremove_wake_function+0}
>        <ffffffff8010bdce>{child_rip+8} <ffffffff8016806a>{kswapd+0}
>        <ffffffff8010bdc6>{child_rip+0}


I also saw reports of a wrong sequence number on the client side:

> An incoming message had a header which
> didn't contain the fields we expected:
>     member        expected eq             got
>        seq              14 !=              15
>  from_addr   192.168.10.85  =   192.168.10.85
>  from_port            4003  =            4003
>    to_addr   192.168.10.85  =   192.168.10.85
>    to_port            4044  =            4044
>      index               3  =               3
>         op               1  =               1
> header from 192.168.10.85:4003 to id 4044 bogus
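
To make the report above easier to read, here is a minimal sketch of the kind of per-field header comparison that would print such a table. It is not the rds-stress source: the Header type and check_header function are hypothetical names chosen for illustration, and only the field names and values are taken from the output above.

```python
from collections import namedtuple

# Hypothetical header layout; field names mirror the report above.
Header = namedtuple(
    "Header", "seq from_addr from_port to_addr to_port index op"
)


def check_header(expected, got):
    """Compare every field and return (ok, report_lines)."""
    lines = ["%10s %15s eq %15s" % ("member", "expected", "got")]
    ok = True
    for name in Header._fields:
        e, g = getattr(expected, name), getattr(got, name)
        same = (e == g)
        ok = ok and same
        # Mark mismatching fields with "!=", matching ones with "=".
        lines.append("%10s %15s %2s %15s" % (name, e, " =" if same else "!=", g))
    return ok, lines


# Values taken from the report: only the sequence number differs.
expected = Header(14, "192.168.10.85", 4003, "192.168.10.85", 4044, 3, 1)
got = expected._replace(seq=15)

ok, report = check_header(expected, got)
print("\n".join(report))
if not ok:
    print("header from %s:%d to id %d bogus"
          % (got.from_addr, got.from_port, got.to_port))
```

Under this reading, every addressing field matches and only seq is off by one, which would be consistent with a single message being lost or delivered out of order on that socket pair; the output alone does not say which.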





