[rds-devel] Re: one_use.tgz (rds-stress.c + one_use.patch for rds driver).
Richard Frank
richard.frank at oracle.com
Wed Jan 9 19:25:39 PST 2008
Richard Frank wrote:
> One minor fix in __rds_free_mr ()
>
> Move rds_mr_put(rs, mr) under the trans_private test; otherwise we always
> auto-free the MR.
>
> if (trans_private) {
>         mr->r_trans->free_mr(trans_private,
>                              invalidate, mr->r_sg, mr->r_nents);
>         rds_mr_put(rs, mr);
> }
>
> Richard Frank wrote:
>> I reduced the FMR pool size to 2k (was 16k) and no longer get this
>> crash?
>>
>> Perhaps there are limits on the FMR pool size that vary across HCAs or
>> firmware versions?
>>
>> Richard Frank wrote:
>>> I applied these patches to OFED-1.3-20080107-0600:
>>>
>>> git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel
>>> commit 5c2b6d5ee97ebb96362048935f0780a7d772274e
>>>
>>> Running the basic rds-stress test (not RDMA) crashes both nodes.
>>>
>>> There was one rejected hunk in message.c when applying the patch, which
>>> I applied manually; perhaps that's the problem.
>>>
>>> [root@vosib6 ofa_kernel-1.3]# more net/rds/message.c.rej
>>> ***************
>>> *** 37,42 ****
>>>
>>> static unsigned int rds_exthdr_size[__RDS_EXTHDR_MAX] = {
>>> [RDS_EXTHDR_NONE] = 0,
>>> [RDS_EXTHDR_RDMA] = sizeof(struct rds_ext_header_rdma),
>>> };
>>>
>>> --- 37,43 ----
>>>
>>> static unsigned int rds_exthdr_size[__RDS_EXTHDR_MAX] = {
>>> [RDS_EXTHDR_NONE] = 0,
>>> + [RDS_EXTHDR_VERSION] = sizeof(struct rds_ext_header_version),
>>> [RDS_EXTHDR_RDMA] = sizeof(struct rds_ext_header_rdma),
>>> };
>>>
>>> Jan 9 21:12:18 vosib6 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000014
>>> Jan 9 21:12:18 vosib6 kernel: printing eip:
>>> Jan 9 21:12:18 vosib6 kernel: fab60541
>>> Jan 9 21:12:18 vosib6 kernel: *pde = 33fe6001
>>> Jan 9 21:12:18 vosib6 kernel: Oops: 0000 [#1]
>>> Jan 9 21:12:18 vosib6 kernel: SMP
>>> Jan 9 21:12:18 vosib6 kernel: Modules linked in: rds(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) nfsd exportfs parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc dm_mirror dm_multipath dm_mod button battery ac uhci_hcd ehci_hcd hw_random shpchp md5 ipv6 e1000 floppy ata_piix libata sg ext3 jbd aic79xx sd_mod scsi_mod
>>> Jan 9 21:12:18 vosib6 kernel: CPU: 3
>>> Jan 9 21:12:18 vosib6 kernel: EIP: 0060:[<fab60541>] Not tainted VLI
>>> Jan 9 21:12:18 vosib6 kernel: EFLAGS: 00010246 (2.6.9-67.0.0.0.1.ELsmp)
>>> Jan 9 21:12:18 vosib6 kernel: EIP is at rds_ib_setup_qp+0x1d/0x219 [rds]
>>> Jan 9 21:12:18 vosib6 kernel: eax: 00000000 ebx: e1571efc ecx: f6219064 edx: 00000200
>>> Jan 9 21:12:18 vosib6 kernel: esi: f7394400 edi: f4b3ce00 ebp: e1571d8c esp: f04d3ec8
>>> Jan 9 21:12:18 vosib6 kernel: ds: 007b es: 007b ss: 0068
>>> Jan 9 21:12:18 vosib6 kernel: Process rdma_cm (pid: 22898, threadinfo=f04d3000 task=f65031b0)
>>> Jan 9 21:12:18 vosib6 kernel: Stack: 00000000 00000000 f04d3ef8 00000003 c3654760 c3654760 c3653d80 c3654760
>>> Jan 9 21:12:18 vosib6 kernel: f04d3f08 f7f20800 f7f205b0 c3653d80 f6582080 f7f20720 c3831fbc c02d646d
>>> Jan 9 21:12:18 vosib6 kernel: f04d3f68 e1571efc e1571d8c f4b3ce00 f4b3ce00 fab60888 2f085740 f0
>>>
>>> Olaf Kirch wrote:
>>>> Here we go - latest set of patches attached.
>>>>
>>>> New stuff:
>>>> - RDS extension headers. Rather than stuffing more and more
>>>> things into the header, I decided to reserve 16 bytes
>>>> for "extensions" and write the plumbing for it.
>>>> - RDMA extension header. This goes with every SEND following an
>>>> RDMA operation and contains the R_Key.
>>>> The receiver uses it to check for (and release)
>>>> MRs marked as use_once.
>>>> - Changed the GET_MR interface: got rid of phys_addr and
>>>> added use_once.
>>>> The phys_addr stuff needs more cleanup.
>>>> - Version extension header. We now broadcast our supported
>>>> RDS protocol version as part of the initial CONG_MAP update.
>>>> This doesn't do much right now, but will be needed to
>>>> do rolling updates in the future.
>>>> I added this stuff now so that we don't have to rely
>>>> on advanced crystal-balling later, when the time comes
>>>> that we break the protocol.
>>>>
>>>> Vlad, you mentioned that it's possible to crash the RDS stack by
>>>> telnetting to the RDS TCP port. I was unable to reproduce this -
>>>> what did you do to trigger the crash?
>>>>
>>>> Olaf
>>>>
>>>
>>
>