[rds-devel] what is rdma immediate data and how is it used ?
Richard Frank
richard.frank at oracle.com
Wed Nov 14 18:01:12 PST 2007
Generally, rdma operations do not provide a notification to the remote
host when they complete. It is possible to set up notification -
using IB native "immediate" data (4 bytes) for example - but the data
size is limited.
RDS rdma immediate data is a normal RDS socket message which is
guaranteed to be delivered after the rdma completes. The immediate data
size is limited by the socket send message operation. Furthermore, the
rdma data and immediate data are an atomic unit - either both arrive -
or neither arrives.
In practice the immediate data is pushed down the RDS pipe immediately
(a few instructions) after the rdma is posted. So the immediate data is
racing behind the rdma operation and in theory will arrive with very low
latency after the rdma completes.
So what is the immediate data good for ?
Well, for one, the client requesting the rdma would like to know when the
rdma has completed. So generally, the immediate data is a message from
the rdma server containing some identifier that the rdma client
provided, which the rdma client can use to recognize the completed
operation and do things like free the rdma key and, if the rdma is
incoming, process the data and free the buffer, etc.
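
As a rough illustration of that bookkeeping (this is not RDS code: the
names PendingRdma, post_rdma and on_immediate are made up for this sketch,
and the 8-byte identifier layout is an assumption), the rdma client could
keep a table of outstanding operations keyed by the identifier it handed
to the rdma server:

```python
import struct

# Hypothetical client-side bookkeeping; none of these names are RDS APIs.
class PendingRdma:
    def __init__(self):
        self._next_id = 1
        self._pending = {}  # identifier -> (rdma key, buffer)

    def post_rdma(self, rdma_key, buffer):
        """Record an outstanding rdma; return the identifier to send to the server."""
        op_id = self._next_id
        self._next_id += 1
        self._pending[op_id] = (rdma_key, buffer)
        return op_id

    def on_immediate(self, message):
        """Handle a completion message whose first 8 bytes echo the identifier.

        Returns the rdma key and buffer so the caller can free or process them.
        Raises KeyError if the identifier is not outstanding.
        """
        (op_id,) = struct.unpack_from("!Q", message)
        return self._pending.pop(op_id)

    def outstanding(self):
        return len(self._pending)
```

When the completion message arrives, on_immediate() hands back the rdma
key and buffer, which is the point where the client would free the key
and, for incoming data, process and release the buffer.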
Consider the case of an rdma server used to implement a simple zero copy
disk block server over zero copy RDS sockets.
To read a disk block, the rdma client requests that the rdma server
issue a disk read, then issue an rdma write back to the rdma client,
and send a completion message (immediate data) indicating the rdma write
is complete - the data is in rdma client memory.
So how does the rdma disk server know when the rdma read is complete -
assuming it's not a sync rdma read or a sync rds barrier operation, and
that the rdma server is not polling via the rds barrier operation?
Via a common completion model implemented via poll()!
Poll() can wait for:
a) incoming messages (pollin)
b) send space available (pollout)
c) any rdma completion or a specific rdma completion (pollin)
d) congestion removed from a destination (pollin)
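
A minimal sketch of that event-loop shape, using an ordinary Unix
socketpair as a stand-in for an RDS socket, since the point here is only
the poll() pattern. How RDS reports rdma completions and congestion
updates on the socket is not shown, and wait_for_events and demo are
hypothetical helpers for this example:

```python
import select
import socket

def wait_for_events(sock, want_send, timeout_ms):
    """Poll one socket; return (readable, writable) flags."""
    mask = select.POLLIN          # incoming messages / completions
    if want_send:
        mask |= select.POLLOUT    # send space available
    poller = select.poll()
    poller.register(sock.fileno(), mask)
    readable = writable = False
    for _fd, events in poller.poll(timeout_ms):
        if events & select.POLLIN:
            readable = True
        if events & select.POLLOUT:
            writable = True
    return readable, writable

def demo():
    # Stand-in sockets only; a real user would poll an RDS socket instead.
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        before = wait_for_events(a, want_send=False, timeout_ms=0)
        b.send(b"rdma done")      # plays the role of a completion message
        after = wait_for_events(a, want_send=True, timeout_ms=1000)
        msg = a.recv(64)
        return before, after, msg
    finally:
        a.close()
        b.close()
```

The same loop serves all four cases above because three of them show up
as pollin; the application tells them apart by what it reads from the
socket, not by the poll event.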
To write a disk block, the rdma client requests that the rdma server
issue an rdma read from rdma client host memory into rdma server
memory, then issue the disk write and send back a completion
message. What's interesting in this case is that the immediate data
which could be sent as part of the rdma read is optional (it could be a
send of size zero). If the immediate data is sent, the rdma client knows
that the rdma server has finished pulling the data from rdma client
host memory - so the rdma client can re-use the write
buffer at that point. Of course this assumes that either the rdma client
is willing to live with possible data loss in light of a path failure - or
that the rdma disk server guarantees to commit the data. When the
actual disk write completes, a separate completion message is sent back
to the rdma client.
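
One way to picture the two completion messages on the write path is as a
small state machine on the rdma client. This is only a sketch: WriteState,
DiskWrite and the tolerate_loss flag are invented names, and the flag
stands for the assumption discussed above (the client accepts possible
data loss, or the server guarantees the commit).

```python
from enum import Enum

class WriteState(Enum):
    REQUESTED = 1   # write request sent; server has not pulled the data yet
    PULLED = 2      # immediate data received: the rdma read has finished
    COMMITTED = 3   # separate completion received: the disk write is done

# Hypothetical client-side tracker for one disk block write.
class DiskWrite:
    def __init__(self, buffer):
        self.buffer = buffer
        self.state = WriteState.REQUESTED

    def on_rdma_read_done(self):
        # Optional message; only moves the state forward, never backward.
        if self.state is WriteState.REQUESTED:
            self.state = WriteState.PULLED

    def on_disk_write_done(self):
        # The immediate data is optional, so this may arrive without PULLED.
        self.state = WriteState.COMMITTED

    def can_reuse_buffer(self, tolerate_loss):
        if self.state is WriteState.COMMITTED:
            return True
        return tolerate_loss and self.state is WriteState.PULLED
```

A client that cannot tolerate loss simply passes tolerate_loss=False and
waits for the disk write completion before touching the buffer.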