[rds-devel] what is rdma immediate data and how is it used ?

Wed Nov 14 18:01:12 PST 2007

Generally, rdma operations do not provide a notification to the remote 
host when they complete. It is possible to set up a notification - 
using IB native "immediate" data (4 bytes) for example - but the data 
size is limited.

RDS rdma immediate data is a normal RDS socket message which is 
guaranteed to be delivered after the rdma completes. The immediate data 
size is limited by the socket send message operation. Furthermore, the 
rdma data and immediate data are an atomic unit - either both arrive - 
or neither arrives.

In practice the immediate data is pushed down the RDS pipe immediately (a 
few instructions) after the rdma is posted. So the immediate data is 
racing behind the rdma operation and in theory will arrive with very low 
latency between the rdma completing and the immediate data arriving.
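
To make the ordering concrete, here is a toy model in Python. It is only a
sketch: ToyPipe and its methods are made-up names, not part of any RDS API,
and it assumes nothing more than an in-order pipe between the two hosts. It
shows why a receiver that sees the immediate data can rely on the rdma data
already being in place; it does not model the failure case.

```python
from collections import deque


class ToyPipe:
    """Hypothetical in-order pipe standing in for one RDS connection."""

    def __init__(self):
        self.queue = deque()

    def post_rdma_with_immediate(self, rdma_payload, immediate):
        # The rdma is posted first; the immediate data is queued right behind it.
        self.queue.append(("rdma", rdma_payload))
        self.queue.append(("immediate", immediate))

    def deliver_all(self, remote_memory, inbox):
        # Items are delivered strictly in the order they were posted.
        while self.queue:
            kind, body = self.queue.popleft()
            if kind == "rdma":
                remote_memory.extend(body)
            else:
                # Record what remote memory looked like when the message arrived.
                inbox.append((body, bytes(remote_memory)))


pipe = ToyPipe()
memory = bytearray()
inbox = []
pipe.post_rdma_with_immediate(b"block-data", b"done:7")
pipe.deliver_all(memory, inbox)
```

After deliver_all() returns, the single inbox entry pairs the message
b"done:7" with a snapshot of memory that already contains b"block-data".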

So what is the immediate data good for ?

Well for one, the client requesting the rdma would like to know when the 
rdma has completed. So generally, the immediate data is a message from 
the rdma server which contains some identifier that the rdma client 
provided, which the rdma client can use to recognize the operation 
completing and do things like free the rdma key, and if the rdma is 
incoming, process the data and free the buffer, etc.
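
The bookkeeping on the rdma client side might look like the sketch below.
All names here (RdmaClientTracker, start, on_immediate) are invented for
illustration, and the callbacks stand in for whatever the application does
to process a buffer and release an rdma key; none of this is the RDS API.

```python
class RdmaClientTracker:
    """Hypothetical table of outstanding rdma operations, keyed by identifier."""

    def __init__(self):
        self._next_id = 1
        self._pending = {}

    def start(self, rdma_key, buffer, incoming):
        # The identifier is what the client hands to the rdma server, and what
        # the server is expected to echo back in the immediate data.
        op_id = self._next_id
        self._next_id += 1
        self._pending[op_id] = (rdma_key, buffer, incoming)
        return op_id

    def on_immediate(self, op_id, process, release_key):
        # The immediate data arrived, so the rdma for op_id has completed.
        rdma_key, buffer, incoming = self._pending.pop(op_id)
        if incoming:
            process(buffer)        # the data landed in our buffer
        release_key(rdma_key)      # the key is no longer needed

    def outstanding(self):
        return len(self._pending)
```

For example, a client that started one incoming rdma would call
on_immediate(op_id, handle_block, free_key) when the completion message
carrying op_id shows up on the socket.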

Consider the case of an rdma server used to implement a simple zero copy 
disk block server over zero copy RDS sockets.

To read a disk block, the rdma client requests that the rdma server 
issue a disk read and then issue an rdma write back to the rdma client, 
and send a completion message (immediate data) indicating the rdma write 
is complete - the data is in rdma client memory.

So how does the rdma disk server know when the rdma read is complete - 
assuming it's not a sync rdma read - or sync rds barrier operation - and 
that the rdma server is not polling via the rds barrier operation ?

Via a common completion model implemented via poll() !

poll() can wait for:

a) incoming messages (pollin)
b) send space available (pollout)
c) any rdma completion or a specific rdma completion (pollin)
d) congestion removed from a destination (pollin)
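
As a rough illustration of the wait loop, the Python sketch below polls an
ordinary Unix socket pair, not an RDS socket, so it only exercises cases
(a) and (b). How rdma completions and congestion updates are reported
through pollin is RDS specific and is not shown. The helper name
wait_events is made up for this example.

```python
import select
import socket


def wait_events(sock, want_read=True, want_write=False, timeout_ms=1000):
    """Poll one socket and return the event mask (0 on timeout)."""
    mask = 0
    if want_read:
        mask |= select.POLLIN
    if want_write:
        mask |= select.POLLOUT
    poller = select.poll()
    poller.register(sock.fileno(), mask)
    events = poller.poll(timeout_ms)
    return events[0][1] if events else 0


# Stand-in for the two ends of a connection (plain Unix sockets, not RDS).
a, b = socket.socketpair()

writable = wait_events(a, want_read=False, want_write=True)  # case (b)
b.send(b"completion")
readable = wait_events(a)                                    # case (a)
msg = a.recv(64)

a.close()
b.close()
```

With a real RDS socket the same loop would also wake up for cases (c) and
(d), and the application would then look at what was received to tell them
apart.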

To write a disk block, the rdma client requests that the rdma server 
issue an rdma read from rdma client host memory into the rdma server 
memory, and then to issue the disk write and send back a completion 
message. What's interesting in this case is that the immediate data 
which could be sent as part of the rdma read is optional (it could be a 
send of size zero). If the immediate data is sent, the rdma client would know 
that the rdma server has completed pulling the data from the rdma client 
host memory - so it's possible for the rdma client to re-use the write 
buffer at that point. Of course this assumes that either the rdma client 
is willing to live with possible data loss in the event of a path failure - or 
that the rdma disk server is guaranteeing to commit the data. When the 
actual disk write completes a separate completion message is sent back 
to the rdma client.
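
The two completion points for a write can be sketched as a small state
machine on the rdma client. This is only an illustration under assumed
names: WriteBuffer, the message kinds "pulled" and "committed", and the
early_reuse_ok flag are all hypothetical, not part of RDS.

```python
class WriteBuffer:
    """Hypothetical client-side state for one disk block write."""

    def __init__(self, data):
        self.data = data
        self.state = "in_flight"    # the rdma server has not pulled the data yet

    def on_message(self, kind):
        if kind == "pulled":
            # Optional immediate data: the rdma read from our memory is done.
            if self.state == "in_flight":
                self.state = "pulled"
        elif kind == "committed":
            # Separate completion message: the disk write itself finished.
            self.state = "committed"
        else:
            raise ValueError("unknown completion kind: %r" % (kind,))

    def may_reuse(self, early_reuse_ok):
        # early_reuse_ok: the client accepts possible data loss on a path
        # failure, or the rdma server guarantees to commit what it has pulled.
        if self.state == "committed":
            return True
        return self.state == "pulled" and early_reuse_ok
```

A client that cannot tolerate loss passes early_reuse_ok=False and simply
waits for the "committed" message, in which case the "pulled" immediate
data could be skipped altogether.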