[rds-devel] Re: one_use.tgz (rds-stress.c + one_use.patch for rds driver).

Richard Frank richard.frank at oracle.com
Wed Jan 9 19:25:15 PST 2008


Richard Frank wrote:
> I reduced the FMR pool size to 2k (was 16k) and no longer get this
> crash.
>
> Perhaps there are limits on the pool size that differ between HCAs or
> firmware versions?
>
> Richard Frank wrote:
>> I applied these patches to OFED-1.3-20080107-0600:
>>
>> git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel
>> commit 5c2b6d5ee97ebb96362048935f0780a7d772274e
>>
>> The test crashes both nodes when running the basic rds-stress test
>> (not RDMA).
>>
>> The patch produced one reject, in message.c, which I applied
>> manually. Perhaps that's the problem.
>>
>> [root@vosib6 ofa_kernel-1.3]# more net/rds/message.c.rej
>> ***************
>> *** 37,42 ****
>>
>>  static unsigned int   rds_exthdr_size[__RDS_EXTHDR_MAX] = {
>>  [RDS_EXTHDR_NONE]     = 0,
>>  [RDS_EXTHDR_RDMA]     = sizeof(struct rds_ext_header_rdma),
>>  };
>>
>> --- 37,43 ----
>>
>>  static unsigned int   rds_exthdr_size[__RDS_EXTHDR_MAX] = {
>>  [RDS_EXTHDR_NONE]     = 0,
>> + [RDS_EXTHDR_VERSION]  = sizeof(struct rds_ext_header_version),
>>  [RDS_EXTHDR_RDMA]     = sizeof(struct rds_ext_header_rdma),
>>  };
>>
>> Jan  9 21:12:18 vosib6 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000014
>> Jan  9 21:12:18 vosib6 kernel:  printing eip:
>> Jan  9 21:12:18 vosib6 kernel: fab60541
>> Jan  9 21:12:18 vosib6 kernel: *pde = 33fe6001
>> Jan  9 21:12:18 vosib6 kernel: Oops: 0000 [#1]
>> Jan  9 21:12:18 vosib6 kernel: SMP
>> Jan  9 21:12:18 vosib6 kernel: Modules linked in: rds(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) nfsd exportfs parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc dm_mirror dm_multipath dm_mod button battery ac uhci_hcd ehci_hcd hw_random shpchp md5 ipv6 e1000 floppy ata_piix libata sg ext3 jbd aic79xx sd_mod scsi_mod
>> Jan  9 21:12:18 vosib6 kernel: CPU:    3
>> Jan  9 21:12:18 vosib6 kernel: EIP:    0060:[<fab60541>]    Not tainted VLI
>> Jan  9 21:12:18 vosib6 kernel: EFLAGS: 00010246   (2.6.9-67.0.0.0.1.ELsmp)
>> Jan  9 21:12:18 vosib6 kernel: EIP is at rds_ib_setup_qp+0x1d/0x219 [rds]
>> Jan  9 21:12:18 vosib6 kernel: eax: 00000000   ebx: e1571efc   ecx: f6219064   edx: 00000200
>> Jan  9 21:12:18 vosib6 kernel: esi: f7394400   edi: f4b3ce00   ebp: e1571d8c   esp: f04d3ec8
>> Jan  9 21:12:18 vosib6 kernel: ds: 007b   es: 007b   ss: 0068
>> Jan  9 21:12:18 vosib6 kernel: Process rdma_cm (pid: 22898, threadinfo=f04d3000 task=f65031b0)
>> Jan  9 21:12:18 vosib6 kernel: Stack: 00000000 00000000 f04d3ef8 00000003 c3654760 c3654760 c3653d80 c3654760
>> Jan  9 21:12:18 vosib6 kernel:        f04d3f08 f7f20800 f7f205b0 c3653d80 f6582080 f7f20720 c3831fbc c02d646d
>> Jan  9 21:12:18 vosib6 kernel:        f04d3f68 e1571efc e1571d8c f4b3ce00 f4b3ce00 fab60888 2f085740 f0
>>
>> Olaf Kirch wrote:
>>> Here we go - latest set of patches attached.
>>>
>>> New stuff:
>>>  -    RDS extension headers. Rather than stuffing more and more
>>>     things into the header, I decided to reserve 16 bytes
>>>     for "extensions" and write the plumbing for it.
>>>  -    RDMA extension header. Goes with every SEND following an
>>>     RDMA operation, and contains the R_Key.
>>>     This is used by the receiver to check for (and release)
>>>     MRs marked as use_once.
>>>  -    Changed GET_MR interface - got rid of phys_addr, and
>>>     added use_once.
>>>     The phys_addr stuff needs more cleanup.
>>>  -    Version extension header. We now broadcast our supported
>>>     RDS protocol version as part of the initial CONG_MAP update.
>>>     This doesn't do much right now, but will be needed to
>>>     do rolling updates in the future.
>>>     I added this stuff now, so that we don't have to rely
>>>     on advanced crystal balling later when the time comes
>>>     where we break the protocol.
>>>
>>> Vlad, you mentioned that it's possible to crash the RDS stack by
>>> telnetting to the RDS TCP port. I was unable to reproduce this -
>>> what did you do to trigger the crash?
>>>
>>> Olaf
>>>   
>>
>
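
For anyone following the thread, here is a minimal sketch of how a size
table like rds_exthdr_size in the reject above could drive packing typed
extensions into the 16 reserved bytes Olaf describes. It is not taken
from the patch: the names (EXT_SPACE, ext_add, ext_find), the one-byte
type tag in front of each payload, and the payload structs are all
assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical; only the shape of the size table
 * mirrors the rds_exthdr_size hunk quoted in this thread. */
#define EXT_SPACE 16            /* bytes reserved for extensions */

enum ext_type {
    EXT_NONE = 0,               /* also terminates the extension list */
    EXT_VERSION,
    EXT_RDMA,
    EXT_MAX_
};

struct ext_version { uint32_t version; };  /* assumed payload layout */
struct ext_rdma    { uint32_t r_key; };    /* assumed payload layout */

/* Payload size for each extension type, indexed by type. */
static const unsigned int ext_size[EXT_MAX_] = {
    [EXT_NONE]    = 0,
    [EXT_VERSION] = sizeof(struct ext_version),
    [EXT_RDMA]    = sizeof(struct ext_rdma),
};

/* Append one extension (type byte + payload). Returns 1 on success,
 * 0 if the type is unknown or the area has no room left. */
static int ext_add(uint8_t area[EXT_SPACE], unsigned int type,
                   const void *data)
{
    size_t pos = 0;

    if (type == EXT_NONE || type >= EXT_MAX_)
        return 0;

    /* Skip over the extensions that are already present. */
    while (pos < EXT_SPACE && area[pos] != EXT_NONE) {
        if (area[pos] >= EXT_MAX_)
            return 0;           /* malformed area */
        pos += 1 + ext_size[area[pos]];
    }
    if (pos + 1 + ext_size[type] > EXT_SPACE)
        return 0;

    area[pos] = (uint8_t)type;
    memcpy(&area[pos + 1], data, ext_size[type]);
    return 1;
}

/* Copy the payload of the first extension of the given type into out.
 * Returns 1 if found, 0 otherwise. */
static int ext_find(const uint8_t area[EXT_SPACE], unsigned int type,
                    void *out)
{
    size_t pos = 0;

    while (pos < EXT_SPACE && area[pos] != EXT_NONE) {
        unsigned int t = area[pos];

        if (t >= EXT_MAX_ || pos + 1 + ext_size[t] > EXT_SPACE)
            return 0;           /* malformed area */
        if (t == type) {
            memcpy(out, &area[pos + 1], ext_size[t]);
            return 1;
        }
        pos += 1 + ext_size[t];
    }
    return 0;
}
```

In a scheme like this, a type that has no entry in the size table is
treated as zero-length, so a missed hunk such as the one in
message.c.rej would change how everything after that entry is parsed.
Whether that has anything to do with the oops above is a guess.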


