[rds-devel] Re: comments on the send CQ completion handler

Richard Frank richard.frank at oracle.com
Wed Jan 9 09:00:25 PST 2008

Or Gerlitz wrote:
> Richard Frank wrote:
>> All send operations should have signaled set as the last fragment is 
>> xmitted - and may have it set for one or more of the intervening 
>> fragments - based on the number of fragments sent for a message.
> Indeed, but this is not how rds_ib_xmit_rdma works, rdma is always 
> served by one posting to the qp (i.e. one fragment), and asking a 
> completion to be generated for this post is conditioned on the 
> max_unsig_bytes/wrs values.
Seems like a bug - in any case we do not need these completions - as we 
depend on the send rm for the immediate data to update the barrier hwm. 
We count on the ordering of IB to ensure the prior RDMA is complete when 
the immediate data send rm completes.
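
To make the signaling point concrete, here is a minimal sketch (a toy model, not the rds_ib code). It assumes a send queue whose work requests complete in posting order, as on a reliable connected QP, and the names WorkRequest, build_rdma_op and completions are made up for illustration. The rdma is posted unsignaled, the trailing immediate data send rm is signaled, and the only completions generated are for the sends.

```python
from dataclasses import dataclass


@dataclass
class WorkRequest:
    kind: str       # "rdma" or "send" (the immediate data send rm)
    signaled: bool  # whether a completion queue entry is requested


def build_rdma_op():
    """Hypothetical helper: work requests for one RDMA op, in posting order."""
    return [
        WorkRequest(kind="rdma", signaled=False),  # no completion needed here
        WorkRequest(kind="send", signaled=True),   # completion covers the rdma too
    ]


def completions(send_queue):
    """Model in-order processing: only signaled requests produce a completion."""
    return [(i, wr) for i, wr in enumerate(send_queue) if wr.signaled]


if __name__ == "__main__":
    queue = build_rdma_op() + build_rdma_op()
    for index, wr in completions(queue):
        # Everything posted before `index` is implied complete by ordering.
        print(index, wr.kind)
```

With in-order processing, seeing the completion at a given queue position implies every earlier request on that queue has been processed, which is why the rdma itself does not need to be signaled.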

>> This raises another point - we do not need signaling of rdma 
>> completions - as all rdmas are always followed with the immediate 
>> data send - which should be signaled. We use the immediate data send 
>> completion to set the barrier hwm.
> On the other thread you wrote that "Our planned use of immediate data 
> (notification message of rdma completion) keeps the immediate data 
> (msg) separate from the rdma data" 
Sorry for not being clearer - I was only saying that there is no overlap 
between the rdma buffer and the immediate data buffer - so there 
should be no issues with respect to ordering of updates to host memory 
for these.
> from which I understand you first want to get notification on the rdma 
> completion and only after this (and possibly some more processing is 
> done) send the immediate data, did I miss anything?
Immediate data is data sent along with an RDMA - in a single atomic op - 
from the appl perspective and rds driver perspective. On the client host 
(requester of rdma) - if the immediate send rm arrives - then it knows 
the rdma completed (data is in local memory or was read from local 
memory). On the rdma server side - when the local send rm completion 
fires - we know the rdma was placed into remote memory (write) and/or 
rdma data has arrived in local host memory (read) and the send rm was 
delivered to the remote hca - and that the local HCA has completed 
processing on the local rdma buffers.

The immediate data send rm carries RDS driver wire protocol and is 
always posted after the rdma operation - even if the client has posted 
zero len immediate data. The rdma server always has a local send 
completion following the completion of the RDMA (read or write) - which 
it can depend on to raise the barrier hwm.

The appl (rdma server) can use the immediate data to send information to 
the client - such as "your read request is complete" - or - "the server 
is done with your write buffer". But if the server must do more 
processing to complete the write (from client) - say put it on durable 
store - then a separate message must be sent by the server to the client 
indicating the additional processing is complete. This separate message 
is not immediate data...
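
As a rough illustration of the two notifications from the client side (an application-level sketch only, with hypothetical names WriteState and ClientWrite that are not part of RDS): the immediate data tells the client the server is done with the write buffer, and a later separate message tells it the data is durable.

```python
from enum import Enum


class WriteState(Enum):
    PENDING = 1          # write requested, nothing heard back yet
    BUFFER_RELEASED = 2  # immediate data arrived: server is done with the buffer
    DURABLE = 3          # separate message arrived: extra processing finished


class ClientWrite:
    """Hypothetical client-side record for one write request."""

    def __init__(self):
        self.state = WriteState.PENDING

    def on_immediate_data(self):
        # Says nothing about durability; never downgrade a later state.
        if self.state is WriteState.PENDING:
            self.state = WriteState.BUFFER_RELEASED

    def on_durable_message(self):
        # Ordinary message sent by the server after it finishes processing.
        self.state = WriteState.DURABLE


if __name__ == "__main__":
    w = ClientWrite()
    w.on_immediate_data()
    print(w.state.name)
    w.on_durable_message()
    print(w.state.name)
```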

Another important point about the barrier hwm - which is implementing 
"locally complete barriers" - is that the hwm by itself does not 
indicate the status of an rdma operation - just that the local buffers 
for the rdma are no longer in use (ownership is returned to the 
application) by the rds driver. The actual rdma may have succeeded or 
failed. It is important to ensure that we raise the hwm for failed rdma 
(immediate data send rm) ops! It is up to the rdma server / client to 
implement a protocol to determine if the actual rdma completed 
successfully - or not (e.g. immediate data message or out of band 
message arrives).

So if / when the RDS driver finishes processing an rdma request 
(read/write) it must raise the barrier hwm - including when it tosses 
rdma operations due to connections bouncing!
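
A small sketch of that rule, under assumptions not spelled out in this thread: each rdma op gets a monotonically increasing sequence number, ops finish in posting order, and the hwm is the highest sequence number whose buffers are back with the application. BarrierHwm and its methods are hypothetical names, not the driver's.

```python
class BarrierHwm:
    """Toy model of a 'locally complete' barrier high-water mark."""

    def __init__(self):
        self.hwm = 0       # highest seq whose local buffers were released
        self.pending = []  # seqs posted but not yet finished, in posting order
        self.status = {}   # seq -> "ok" | "failed" | "flushed", kept apart from hwm

    def post(self, seq):
        self.pending.append(seq)

    def finish(self, seq, status):
        """The trailing send rm completed for `seq`, successfully or not."""
        if not self.pending or self.pending[0] != seq:
            raise ValueError("ops are assumed to finish in posting order")
        self.pending.pop(0)
        self.status[seq] = status
        self.hwm = seq     # raised regardless of status: buffers are released

    def connection_bounced(self):
        """Toss whatever is still pending, but still raise the hwm."""
        for seq in list(self.pending):
            self.finish(seq, "flushed")


if __name__ == "__main__":
    barrier = BarrierHwm()
    for seq in (1, 2, 3, 4):
        barrier.post(seq)
    barrier.finish(1, "ok")
    barrier.finish(2, "failed")   # hwm still moves to 2
    barrier.connection_bounced()  # ops 3 and 4 are tossed, hwm moves to 4
    print(barrier.hwm, barrier.status)
```

The hwm only answers "can I reuse my buffers"; whether the rdma actually succeeded has to come from the separate status or from the application's own protocol.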

> Or.
