[rds-devel] Re: [PATCH] Report proper error code in [was: trying to reproduce the crash]

Or Gerlitz ogerlitz at voltaire.com
Tue Feb 5 06:17:41 PST 2008


Olaf Kirch wrote:
> I fixed this, using the patch below. Now I'm making some progress:

> First, the kernel reports:
> RDS/IB: ib_alloc_fmr failed (err=-12)
> which is good - now we get a decent error code instead of a crash.

So are you somehow limiting the number of FMRs allocated by RDS? Is the
only missing piece for that support for the max_fmr device attribute,
or is more instrumentation of the RDS code needed?

> A little later, it complains:
> ib_mthca 0000:05:00.0: SW2HW_MPT returned status 0x0a

Yes, I see that as well; in my case there are more prints, I guess because
I have set the debug_level module param of ib_mthca:

> rds_ib_conn_shutdown: failed to disconnect, cm: ffff810052452c00 err -22
> ib_mthca 0000:03:00.0: Command 0d completed with status 0a
> ib_mthca 0000:03:00.0: SW2HW_MPT returned status 0x0a
> RDS/IB: rds_ib_setup_qp failed (-22)

> which doesn't sound quite as good... and things are very hosed
> from that moment on; reloading ib_mthca seems to fix things, however.

I have applied the patch to mthca, which lets me run much longer than
before. However, at some point the client side (MT25204 / 1.2.0
firmware) crashes when the script attempts to unload the rds module:

> RDS/IB: No client_data for device mthca0
> RDS/IB: rds_ib_setup_qp failed (-95)
> RDS/IB: rds_ib_setup_qp failed (-95)
> rds_ib_conn_shutdown: failed to disconnect, cm: ffff810040c7e400 err -22
> RDS/IB: rds_ib_setup_qp failed (-95)
> RDS/IB: rds_ib_setup_qp failed (-95)
> NET: Unregistered protocol family 28
> Unregistered RDS/ib transport
> Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP: 
> <ffffffff88343f2c>{:rds:rds_ib_remove_one+18}
> PGD 4bfe4067 PUD 3b913067 PMD 0 
> Oops: 0000 [1] SMP 
> last sysfs file: /class/net/ib0/flags
> CPU 6 
> Modules linked in: rds ib_mthca ib_ipoib rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr nfs lockd nfs_acl sunrpc autofs4 edd ipv6 af_packet thermal processor fan button battery ac apparmor aamatch_pcre loop dm_mod usbhid shpchp pci_hotplug ehci_hcd uhci_hcd tg3 usbcore ext3 jbd sr_mod cdrom i2c_i801 i2c_core sg ata_piix libata sd_mod scsi_mod
> Pid: 24445, comm: rmmod Tainted: G     U 2.6.16.46-0.12-smp #1
> RIP: 0010:[<ffffffff88343f2c>] <ffffffff88343f2c>{:rds:rds_ib_remove_one+18}
> RSP: 0000:ffff81003705de88  EFLAGS: 00010296
> RAX: 0000000000000000 RBX: ffff810038025000 RCX: ffff8100380bab20
> RDX: ffff81003859e380 RSI: 0000000000000296 RDI: ffff810038025080
> RBP: ffffffff88356880 R08: ffff81003705dd78 R09: ffff810037e1a890
> R10: ffffffff8017f6e5 R11: ffffffff88356880 R12: ffffffff88356220
> R13: 0000000000000000 R14: 0000000000000880 R15: 00007fff020167d0
> FS:  00002b1ca8dda6d0(0000) GS:ffff810037f91e40(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000010 CR3: 000000005e2bd000 CR4: 00000000000006e0
> Process rmmod (pid: 24445, threadinfo ffff81003705c000, task ffff81004d9327d0)
> Stack: ffff81004194af40 ffff810038025000 ffffffff88356880 ffffffff881775b3 
>        ffffffff88356260 ffffffff88356880 0000000000000880 0000000000000000 
>        0000000000000880 ffffffff88347d86 
> Call Trace: <ffffffff881775b3>{:ib_core:ib_unregister_client+47}
>        <ffffffff88347d86>{:rds:rds_trans_exit+57} <ffffffff88347b0b>{:rds:rds_exit+47}
>        <ffffffff8014c984>{sys_delete_module+540} <ffffffff8016e7e7>{do_munmap+619}
>        <ffffffff801f088b>{__up_write+33} <ffffffff8010ad3e>{system_call+126}
> 
> Code: 48 8b 78 10 48 89 c3 48 8b 2f eb 29 48 8b 17 48 8b 47 08 48 
> RIP <ffffffff88343f2c>{:rds:rds_ib_remove_one+18} RSP <ffff81003705de88>
> CR2: 0000000000000010

Or