[rds-devel] Re: one_use.tgz (rds-stress.c + one_use.patch for rds driver).

Richard Frank richard.frank at oracle.com
Wed Jan 9 19:26:00 PST 2008


Richard Frank wrote:
> These patches are looking good with rds-stress and crload - with two 
> changes.
>
> 1) set default FMR pool to 2k - vs - 16k. Not sure why but more than 
> 2k and the pool fails to create and we end up crashing. We can always 
> increase with the module param if needed - but we need a stat to show 
> exhaustion...
> 2) freeing the key - moving put_mr under test for transport..
>
> We should check these patches in.
>
> We still need to sort out what to do with rds-tools/net/rds.h - this 
> file needs to be moved to /usr/include at OFED install time - as this 
> is where Oracle expects to find it.
>
> I am seeing two issues when running crload - with and without aligned 
> buffers and shared memory.
>
> 1) occasionally free_mr fails due to an invalid parameter (could not 
> find key in rb_tree). Once this occurs, the only way to clear the 
> condition is to reload the rds driver, and sometimes the HCA driver 
> as well. What's interesting is that allocating the MR is not 
> failing - only the attempt to free it. Feels like a problem in the 
> driver - nothing is reported in /var/log/messages.
>
> 2) crload (with skgxp ipc) thinks that barriers are lagging. We are 
> issuing the rdma operations + immediate - as the sender side of crload 
> (requesting rdmas) is sending more requests with keys for more rdmas 
> (can only have 16 on the wire at a time) - and the recv side (rdma 
> server) is happy to process more rdma requests - but the recv side is 
> not seeing the barrier hwm increase for prior rdmas (at least there 
> are large delays) - and is holding on to memory for the prior 
> operations - so we end up chewing up megs of memory - until the 
> barriers finally complete - which they seem to eventually do.
>
> This could be a bug in crload and/or skgxp - still looking at this.
>
>
>
> Olaf Kirch wrote:
>> Here we go - latest set of patches attached.
>>
>> New stuff:
>>  -    RDS extension headers. Rather than stuffing more and more
>>     things into the header, I decided to reserve 16 bytes
>>     for "extensions" and write the plumbing for it.
>>  -    RDMA extension header. Goes with every SEND following an
>>     RDMA operation, and contains the R_Key.
>>     This is used by the receiver to check for (and release)
>>     MRs marked as use_once
>>  -    Changed GET_MR interface - got rid of phys_addr, and
>>     added use_once.
>>     The phys_addr stuff needs more cleanup
>>  -    Version extension header. We now broadcast our supported
>>     RDS protocol version as part of the initial CONG_MAP update.
>>     This doesn't do much right now, but will be needed to
>>     do rolling updates in the future.
>>     I added this stuff now, so that we don't have to rely
>>     on advanced crystal balling later when the time comes
>>     where we break the protocol.
>>
>> Vlad, you mentioned that it's possible to crash the RDS stack by
>> telnetting to the RDS TCP port. I was unable to reproduce this -
>> what did you do to trigger the crash?
>>
>> Olaf
>>   
>
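
For readers following the extension-header change Olaf describes above
(16 bytes reserved in the RDS header, carrying an RDMA extension with the
R_Key and a version extension), here is a minimal sketch of how such an
area could be packed and parsed. It is not taken from the patches: the
names (EXT_SPACE, ext_add, ext_find), the type codes, the one-byte type
tag, and the 4-byte payload sizes are all assumptions made for
illustration, and the real layout may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Everything below is hypothetical; only the 16-byte size comes from the mail. */
#define EXT_SPACE   16  /* bytes reserved for extensions */
#define EXT_NONE     0  /* marks unused space / end of list (assumed) */
#define EXT_VERSION  1  /* assumed type code: protocol version */
#define EXT_RDMA     2  /* assumed type code: R_Key of a preceding RDMA */

/* Payload length per type; 0 means "unknown type" (assumed fixed sizes). */
static size_t ext_len(uint8_t type)
{
    switch (type) {
    case EXT_VERSION: return 4;
    case EXT_RDMA:    return 4;
    default:          return 0;
    }
}

/* Append a type byte plus payload; 0 on success, -1 on bad type/size or no room. */
static int ext_add(uint8_t area[EXT_SPACE], uint8_t type,
                   const void *data, size_t len)
{
    size_t pos = 0;

    if (ext_len(type) == 0 || ext_len(type) != len)
        return -1;
    /* Walk past the extensions that are already present. */
    while (pos < EXT_SPACE && area[pos] != EXT_NONE) {
        size_t l = ext_len(area[pos]);
        if (l == 0)
            return -1;          /* unknown type: treat area as corrupt */
        pos += 1 + l;
    }
    if (pos + 1 + len > EXT_SPACE)
        return -1;              /* not enough space left */
    area[pos] = type;
    memcpy(&area[pos + 1], data, len);  /* host byte order, for brevity */
    return 0;
}

/* Copy the payload of the first extension of the given type; 0 if found. */
static int ext_find(const uint8_t area[EXT_SPACE], uint8_t type,
                    void *out, size_t len)
{
    size_t pos = 0;

    while (pos < EXT_SPACE && area[pos] != EXT_NONE) {
        size_t l = ext_len(area[pos]);
        if (l == 0 || pos + 1 + l > EXT_SPACE)
            return -1;          /* malformed area */
        if (area[pos] == type) {
            if (l != len)
                return -1;
            memcpy(out, &area[pos + 1], l);
            return 0;
        }
        pos += 1 + l;
    }
    return -1;                  /* not present */
}

/* Usage: sender attaches version + R_Key, receiver reads the R_Key back. */
static uint32_t ext_roundtrip_rkey(uint32_t rkey)
{
    uint8_t area[EXT_SPACE] = {0};
    uint32_t version = 3, got = 0;  /* version value is arbitrary */
    int rc;

    rc = ext_add(area, EXT_VERSION, &version, sizeof(version));
    assert(rc == 0);
    rc = ext_add(area, EXT_RDMA, &rkey, sizeof(rkey));
    assert(rc == 0);
    rc = ext_find(area, EXT_RDMA, &got, sizeof(got));
    assert(rc == 0);
    (void)rc;
    return got;
}
```

In this sketch the receiver would call ext_find() for the RDMA type on
each incoming SEND and, if an R_Key is present, look up the matching MR
and release it when it was registered as use_once. A real wire format
would also need a fixed byte order for the payloads, which is omitted
here.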
