[rds-devel] Re: RDS/RDMA protocol

Richard Frank richard.frank at oracle.com
Mon Nov 5 09:57:32 PST 2007


Zach Brown wrote:
> You should be sending these mails to rds-devel.  There's a great chance
> that you'll want the people actively involved in RDS to respond.
>
>   
Yes - we're getting hammered about not using rds-devel - let's try to 
move all our RDS discussion that does not contain Oracle private 
information onto rds-devel.

> Or Gerlitz wrote:
>   
>> Hi Zach,
>>
>> (I hope you can address this, as from the git changelog it seems you
>> were doing at least part of the rdma related commits)
>>     
>
> I threw together the first version of the core support which sits above
> the transports.
>
>> Can you spare a few
>> lines on the rds rdma design? looking at the code and using some info
>> from an email I got once from Rick I can't nail this.
>>     
>
> Rick, can you publish your nice RDS/RDMA design document somewhere?
>
>   
I'll clean it up some more and post it - still a way to go before it's 
'nice'.

Guess it's time for a man page to describe the rdma interface.

>> From the code (OFED 1.3) I see that:
>>
>> A) the "client" (the side that does --not-- do the actual RDMA) should
>> do an RDS_GET_MR setsockopt() call which pins the data, fmr it and
>> deliver back to the caller the remote address and rkey to be used.
>>
>> B) the remote addr / rkey info are probably exchanged through regular
>> rds message.
>>
>> C) the "server" (the side that does the actual RDMA) calls sendmsg,
>> where the code of rds_rdma_msghdr_parse somehow magically realizes that
>> the msghdr contains rkey and raddr and hence it's an RDMA here, next RDMA
>> op is issued to the HCA
>>
>>     
The message send is a compound operation - it includes a regular RDS 
send (the immediate data) and the RDMA data.

The rdma operation is pushed down the wire and then the immediate data 
is pipelined right behind the RDMA.

Keep in mind that when the client requested the rdma, it sent along an 
identifier for the operation. When the rdma server responds, it encodes 
the client-provided identifier in the immediate data response message.

So when the client receives the immediate data message (the response to 
its request), it gets back the id of the request that completed, and it 
knows the rdma is complete because both ops traveled over the same RC 
(reliable connection), which delivers them in order.
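
To make the client side of this concrete, here is a minimal sketch of 
the bookkeeping in Python. It is a model only, not the RDS API: the 
class PendingRequests and its methods are hypothetical names, and the 
request id is assumed to be a plain integer chosen by the client.

```python
class PendingRequests:
    """Client-side table of rdma requests awaiting their reply."""

    def __init__(self):
        self._next_id = 1   # assumption: ids are small integers
        self._pending = {}  # request id -> buffer the server writes to

    def issue(self, buffer):
        """Record a request; the id returned goes out in the request."""
        req_id = self._next_id
        self._next_id += 1
        self._pending[req_id] = buffer
        return req_id

    def on_response(self, req_id):
        """Handle an immediate data reply that echoes req_id.

        The reply was sent behind the rdma on the same RC, so seeing it
        means the rdma data is already in the buffer. Returns that
        buffer, or None for an unknown or duplicate id.
        """
        return self._pending.pop(req_id, None)

    def outstanding(self):
        """Ids still waiting for a reply, oldest first."""
        return sorted(self._pending)
```

A caller would issue() before sending each request and call 
on_response() with the id found in each reply; a non-None result is the 
point at which the buffer can be read and its MR released.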

>> D) the RDMA op completes
>>
>> ...
>>
>>     
>> F) the "server" app side somehow know that the RDMA is done
>>
>>     
The rds_barrier is used to detect which rdma operations have completed. 
When the rdma + immediate data was initiated, an rdma operation ID was 
returned to the initiator at sendmsg completion. Note that sendmsg can 
return before the operations are initiated over the wire - they may be 
queued for delivery.

The rdma operation ID is a monotonically increasing value. The RDS 
driver maintains a high water mark of completed rdma operations - and 
the barrier operation returns this value to the client.

So a single barrier call (which can block on a blocking socket) can 
indicate that many operations have completed.
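
The high water mark idea can be sketched as follows, again as a Python 
model rather than the real interface. RdmaOpTracker, submit and reclaim 
are hypothetical names; the tracker hands out the ids itself to stand in 
for the value returned by sendmsg, and the barrier result is simply 
passed in as a number. It assumes every operation with an id at or 
below the high water mark has completed.

```python
class RdmaOpTracker:
    """Server-side list of rdma ops whose resources are still held."""

    def __init__(self):
        self._next_op_id = 1  # stands in for the id sendmsg returns
        self._held = {}       # op id -> resource tied up by that op

    def submit(self, resource):
        """Record one rdma + immediate data send; return its op id."""
        op_id = self._next_op_id
        self._next_op_id += 1
        self._held[op_id] = resource
        return op_id

    def reclaim(self, high_water):
        """Free everything covered by one barrier result.

        Assumption: every op id <= high_water has completed.
        Returns the freed resources in op id order.
        """
        done = sorted(op for op in self._held if op <= high_water)
        return [self._held.pop(op) for op in done]

    def in_flight(self):
        """Op ids not yet covered by any barrier result."""
        return sorted(self._held)
```

One reclaim() call with a single barrier value retires every operation 
at or below it, which is why the server only needs to ask when it is 
short on resources.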

The model for the rdma server is to blast rdma's back to the clients in 
a "pseudo" async fashion, and to check for RDMA completions 
periodically, when it needs to reclaim resources.

One more important point about RDMA operations - even though they are 
running over RDS, the rdma operations are themselves not reliable. That 
is, in the event of certain failures, the RDMA + immediate data may be 
lost. It is up to the requester of the RDMA operation to detect the loss 
and re-request the operation.
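
How the requester detects the loss is up to the application. One 
possible approach, shown here purely as an assumption, is a per-request 
deadline: if no immediate data reply has arrived by then, the request is 
sent again. RetryingRequester and its methods are hypothetical names, 
and the caller supplies the clock value.

```python
class RetryingRequester:
    """Tracks reply deadlines so lost rdma requests can be re-sent."""

    def __init__(self, timeout):
        self.timeout = timeout  # assumption: fixed wait per request
        self._deadline = {}     # request id -> time the wait runs out

    def sent(self, req_id, now):
        """Note that req_id was (re)sent at time now."""
        self._deadline[req_id] = now + self.timeout

    def completed(self, req_id):
        """Note that the reply for req_id arrived."""
        self._deadline.pop(req_id, None)

    def expired(self, now):
        """Ids whose reply is overdue; the caller re-requests these."""
        return sorted(r for r, d in self._deadline.items() if now >= d)
```

After re-sending an overdue request the caller invokes sent() again, 
which restarts the wait for that id.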

Take a look at the modified rds-stress.c for rdma - you can see the 
usage model there.

>> G) the "client" app side gets a message from the "server" side, telling
>> that the RDMA is done and it can issue an RDS_FREE_MR setsockopt() call
>>
>>
>> Am I correct in A-D and F-G above?
>>
>> What I mainly miss is how the RDMA is being acked to the client, Rick
>> was mentioning immediate data usage but I don't see an evidence to that
>> in the code.
>>     
>
> Rick, I'll let you explain to Or how skgxp is using RDS RDMA.
>
> - z
>   


