[rds-devel] Re: RDS/RDMA protocol

Mon Nov 5 11:10:42 PST 2007

Attached is the latest rds_v3.h - proposal for zero copy extensions.

I will shortly post rds_v3_api.h - which matches the actual RDS v3 
driver design and implementation we settled on so far.

I updated the section dealing with waiting for rdma operations -> 
Unified wait model for RDS rdma and non-rdma operations.

In summary, when an rdma server has nothing to do - it will wait in 
poll. It may have rdma operations outstanding (rdma reads / writes) - 
which when complete need to wake poll waiters - along with the usual 
suspects - incoming messages, the availability of send space, and 
removal of congestion... When awakened from poll - the rdma server will 
issue calls to recv / send / barrier to handle the poll conditions 
satisfied...

For example, consider a posted rdma read to pull data in from a remote 
host. Once the read is posted the rdma server may wait in poll - for any 
of - rdma read completion (poll_in) + incoming messages (poll_in) + send 
space availability (poll_out) + congestion removed (poll_out).
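
A minimal sketch of that wait, assuming an ordinary file descriptor stands in for an RDS socket. The name wait_for_work and the WORK_* bits are hypothetical and not part of rds_v3.h; how the driver actually reports an rdma completion through poll is still open.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical result bits; not part of any RDS header. */
enum {
    WORK_NONE = 0,
    WORK_IN   = 1, /* incoming message or rdma read completion */
    WORK_OUT  = 2  /* send space available or congestion removed */
};

/*
 * Wait in poll() on one socket for any of the conditions listed above.
 * Returns a mask of WORK_* bits, WORK_NONE on timeout, -1 on poll() error.
 */
static int wait_for_work(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int mask = WORK_NONE;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN | POLLOUT;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return WORK_NONE;

    if (pfd.revents & POLLIN)
        mask |= WORK_IN;   /* caller would issue recv / barrier */
    if (pfd.revents & POLLOUT)
        mask |= WORK_OUT;  /* caller would issue send */
    return mask;
}
```

A server loop would call wait_for_work() whenever it has nothing to do and then issue recv / send / barrier calls according to the bits returned.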

Richard Frank wrote:
> Zach Brown wrote:
>> You should be sending these mails to rds-devel.  There's a great chance
>> that you'll want the people actively involved in RDS to respond.
>>
>>   
> Yes - we're getting hammered about not using rds-dev - let's try and 
> move all our RDS discussion - that does not contain Oracle private 
> information - into rds-dev.
>
>> Or Gerlitz wrote:
>>  
>>> Hi Zach,
>>>
>>> (I hope you can address this, as from the git changelog it seems you
>>> were doing at least part of the rdma related commits)
>>>     
>>
>> I threw together the first version of the core support which sits above
>> the transports.
>>
>>> Can you spare few lines on the rds rdma design? looking at the code and
>>> using some info from an email I got once from Rick I can't nail this.
>>>     
>>
>> Rick, can you publish your nice RDS/RDMA design document somewhere?
>>
>>   
> I'll clean it up some more and post it - still a way to go before it's 
> 'nice'.
>
> Guess it's time for a man page to describe the rdma interface.
>
>>> From the code (OFED 1.3) I see that:
>>>
>>> A) the "client" (the side that does --not-- do the actual RDMA) should
>>> do an RDS_GET_MR setsockopt() call which pins the data, maps it with an
>>> FMR, and delivers back to the caller the remote address and rkey to be used.
>>>
>>> B) the remote addr / rkey info are probably exchanged through regular
>>> rds message.
>>>
>>> C) the "server" (the side that does the actual RDMA) calls sendmsg,
>>> where the code of rds_rdma_msghdr_parse somehow magically realizes that
>>> the msghdr contains rkey and raddr and hence it's an RDMA here, next the
>>> RDMA op is issued to the HCA
>>>
>>>     
> The message send is a compound operation - it includes a regular RDS 
> send (the immediate data) and the RDMA data.
>
> The rdma operation is pushed down the wire and then the immediate data 
> is pipelined right behind the RDMA.
>
> Keep in mind when the client requested the rdma - it sent along an 
> identifier for the operation. When the rdma server responds - it 
> encodes the client provided identifier in the immediate  data response 
> message.
>
> So when the client recv's the immediate data message (response to 
> request) - it gets back the id of the request which completed and it 
> knows the rdma is complete due to ordering of the RC that both ops 
> traveled over.
>
>>> D) the RDMA op completes
>>>
>>> ...
>>>
>>>     F) the "server" app side somehow knows that the RDMA is done
>>>
>>>     
> The rds_barrier is used to detect which rdma operations have 
> completed.  When the rdma + immediate data was initiated an rdma 
> operation ID was returned at sendmsg completion to the initiator. Note 
> that sendmsg can return before the operations are initiated over the 
> wire - they may be queued for delivery.
>
> The rdma operation ID is a monotonically increasing value. The RDS 
> driver maintains a high water mark of completed rdma operations - and 
> the barrier operation returns this value to the client.
>
> So a single barrier call (which can block on a blocking socket) can 
> indicate many operations have completed.
>
> The model for the rdma server is to "pseudo" async blast rdma's back 
> to the clients - and periodically check for RDMA completions - when it 
> needs to reclaim resources.
>
> One more important point about RDMA operations - even though they are 
> running over RDS - the rdma operations are themselves not reliable. 
> That is, in light of certain failures, the RDMA + immediate data may 
> be lost. It is up to the requester of the RDMA operation to detect the 
> loss and re-request the operation.
>
> Take a look at the modified rds-stress.c for rdma - you can see the 
> usage model there.
>>> G) the "client" app side gets a message from the "server" side, telling
>>> that the RDMA is done and it can issue an RDS_FREE_MR setsockopt() call
>>>
>>>
>>> Am I correct in A-D and F-G above?
>>>
>>> What I mainly miss is how the RDMA is being acked to the client. Rick
>>> was mentioning immediate data usage but I don't see evidence of that
>>> in the code.
>>>     
>>
>> Rick, I'll let you explain to Or how skgxp is using RDS RDMA.
>>
>> - z
>>   
>
-------------- next part --------------
/* RDS V3 interface extensions for zero copy directed (rdma) operations.
 *
 * NOTE: This is a draft proposal - it is not a design specification.
 *
 * Reliable Datagram Sockets (RDS) provide in order, non-duplicating,
 * highly available, reliable delivery of datagrams. Zero copy "directed"
 * operations (rds_rdma) however, are not reliable and have less stringent 
 * ordering requirements.
 * 
 * Directed operations involve performing an rdma to, or rdma from, a
 * specified remote buffer and possibly chasing the rdma operation with a
 * completion message (immediate data). 
 *
 * Immediate data must not arrive before the associated rdma and must use 
 * the same physical path as the RDMA operation. It is invalid to send an 
 * rdma over one path and immediate data over another path (consider
 * path / HCA failover). If an rdma operation fails, then its respective 
 * immediate data must not arrive. 
 *
 * RDMA operation identifier: The RDS driver returns a unique (system wide) 
 * rdma operation identifier at operation submit. 
 * This operation identifier can be used in subsequent barrier operations to 
 * detect completion of all operations up to and including the specified 
 * operation id.
 *
 * A possible implementation of "operation ID" in an RDS driver would be a 
 * sequence number. As each operation is submitted it is assigned the next 
 * sequence number in order. As an RDMA operation is locally completed 
 * (success or failure), its operation ID is recorded as a high water mark 
 * of completed operations. Subsequent barrier(op_id) operations can compare 
 * the submitted barrier operation id with the high water mark in the driver.
 *
 * Barriers: A barrier operation is used to detect that all previously 
 * issued operations (up to the specified operation id) are complete. 
 * Furthermore, a barrier can be scoped to indicate local completion or 
 * remote completion.
 *
 * A "locally complete" barrier indicates that the local end of the transport 
 * is finished with buffers provided for the rdma operations. Whereas a 
 * "remotely complete" barrier indicates that all preceding operations have 
 * been completed all the way to remote host memory.
 *  
 * NOTE: We currently do not have a requirement for "remote barriers".
 *
 * An rds client can depend on a barrier local completion indicating:
 *
 * 1) Ownership of all local buffers used by rdma operations issued prior to 
 *    the completing barrier is now returned to the client.
 * 2) The local transport has completed all local processing of all preceding 
 *    operations. With IB this implies that all operations have either failed 
 *    or have been successfully acked by the remote HCA.
 *
 * An rds client can depend on a barrier remote completion indicating:
 *
 * 1) Local transport processing is complete.
 * 2) Remote transport processing has completed - implying that all preceding
 *    operations have been placed in remote host memory.
 *
 * Ordering between RDMA operations is not required. However,
 * ordering between barriers, RDMA operations, and immediate data is.
 *
 * Clients of RDS can depend on the ordering of barriers, wrt other 
 * rds operations, to "bracket" a set of rds operations such that the barrier  
 * completion indicates all prior rds operations have completed.
 *
 * For example, if the issue order is rdma r1,r2,r3, barrier b1, rdma r4;
 * r1,r2,r3 can complete in any order wrt each other - but r1,r2,r3
 * must all complete before b1. r4 must complete after b1.
 * 
 * To perform a directed operation (rds_rdma), clients obtain keys for local 
 * buffers and send the keys to rdma servers.  A key is a descriptor for a 
 * local memory buffer which is obtained from a local HCA and that can be 
 * used for RDMA operations. An rdma server will use a supplied key 
 * to rdma data to or from a remote host, and optionally send back a 
 * completion message (immediate data). 
 * 
 * Keys are local to a specific HCA. If path failover occurs, keys obtained 
 * from the now non-active HCA are invalid. It is up to clients of RDS to 
 * detect rdma failures and retry operations using new keys obtained from 
 * an active HCA. Generally, a simple operation timeout on the client can be 
 * used to re-request an rdma operation.
 *
 * Gathered writes and scattered reads: The local source buffer for an rdma 
 * write and the local destination buffer for an rdma read may be an array of 
 * pointers (iobuf[n]) supporting gathered write and scattered read 
 * operations. 
 *
 * A gathered write pulls data from an array of local buffers and rdma writes 
 * a single linear remote buffer. A scattered read pulls data from a single 
 * linear remote buffer writing into an array of local buffers.
 *    
 * Pseudo async operations are achieved by not requiring an rds_rdma operation 
 * to wait for an explicit transport completion message. An RDS 
 * client can determine when local buffers are no longer in use via a 
 * barrier operation. 
 *
 * For async operations, rds_rdma completion status only reflects successful 
 * submit of rdma operation to the local RDS driver (arg checking passed).
 *
 * If an RDS client must know that an rdma operation was successfully
 * delivered to the remote HCA, it can issue the operation with sync = TRUE. 
 * When a sync operation returns the completion status indicates status of 
 * rdma operation.
 * 
 * Note: we currently have no requirement for sync rdma operations.
 *
 * An alternative to using sync operations would be to issue a barrier with 
 * the operation id of the submitted operation. If the operation is complete 
 * the barrier will return with success, otherwise with eagain, until the 
 * specified RDMA (and all preceding operations) complete.
 *
 * In general, a "barrier" operation 
 * can be used to chase multiple async ops. When the barrier completes 
 * it indicates that all prior ops are also complete. Successful
 * completion of a barrier does not imply anything about the status of prior 
 * async operations (which may have failed or succeeded) - just that they 
 * have all completed.
 *
 * Many async operations may be queued for transmission. An RDS implementation
 * is encouraged to maintain a full pipeline on the transport taking into 
 * account bandwidth delay product, and to internally queue operations which 
 * exceed internal resource limitations rather than returning resource 
 * exhaustion status.
 *
 * Load Balancing: implemented in RDS client application via multiple paths 
 * in conjunction with the bonding driver for HA. When multiple paths are 
 * configured - the RDS client will establish communication over each path
 * issuing / balancing requests over the paths. 
 *
 * Failover: Directed (rdma) operations and associated immediate data do not 
 * failover (are not reliable) - however, non directed RDS operations are 
 * reliable. It is up to the client
 * of an rds_rdma operation to detect a lost operation and re-issue the op by
 * freeing the old key, acquiring a new key, and sending it to the rdma 
 * server.
 *
 * NOTE: The Oracle rdma server client will be modified to time out an I/O 
 * and retry.
 *
 * Fast Fail detection / Supporting Detection of path failover: 
 * Clients will probably use a long timeout to detect lost requests and 
 * re-request the data. In the case of a true path failover we need a 
 * mechanism for clients to detect that a failover has occurred so they can 
 * quickly re-issue requests. 
 *
 * Perhaps the socket error queue can be used for this - an error would wake 
 * a poll waiter with pollin set.
 *
 * Wait model: The general wait model for RDS zcopy operations is to issue 
 * many async rdma operations
 * and then to check for completion with barrier(last_op). If the last op is 
 * complete, then all prior ops are complete too. If the last op is not 
 * complete, then the barrier returns the current completed operation id.
 *
 * If the socket is blocking, then a barrier operation will wait until
 * the requested operation is completed.
 * 
 * Unified wait model for rds rdma and rds non-rdma operations:
 *
 * RDS rdma initiators waiting in poll() with POLLIN set should be awakened 
 * when RDMA reads complete, and those waiting in poll() with POLLOUT set 
 * should be awakened when rdma write ops complete.
 * 
 * NOTE: at this point we are not planning to use POLLOUT / wait for rdma 
 * write completions. 
 * 
 * For example, an rdma server processing a request to read data from a remote
 * system - will post the rdma read operation and then continue processing 
 * additional incoming requests via RDS. The rdma server may wait in 
 * poll for additional requests as well as for initiated RDMA operations to
 * complete. The select list passed to poll will include the list of
 * interesting sockets. When the posted rdma operation completes, poll returns
 * and the client will use a barrier operation to see which rdma ops have
 * completed, issuing recv / send / rdma operations accordingly.
 * 
 * MTU: RDS should report an MTU of at least 1 MByte.
 *
 * Local memory pinning: RDS must pin local memory buffers, which are used
 * as source data (local buffers) in zero copy rdma operations, for the 
 * duration of the operation. 
 *
 * Pinning of VA for the RDMA buffer for which the client obtains an FMR key - 
 * is done within rds_get_mr(va). This memory must stay pinned until the client
 * calls rds_free_mr(va) - or the process dies - in which case 
 * the driver must clean up the pins. 
 *
 * Local buffers used in issuing the RDMA (read / write) to the remote 
 * buffer are pinned inside of rds_rdma() - and only stay pinned
 * for the duration of the RDMA read / write post to the xport (until acked 
 * by transport). This pinning/unpinning of local buffers 
 * all happens inside the driver.
 * 
 * Process death: Keys obtained by a process must be invalidated before any 
 * memory referenced by the keys is released from the dying process and
 * before any other process can detect the death of the process. Local 
 * memory pins must also be released.
 * 
 * Protection Domains: RDS should use a single PD for all node to node
 * connections. It is a requirement for keys to be valid over different
 * node to node RCs. It is legal for an RDS client to acquire a key, send
 * the key to an RDMA server - an RDMA server can pass a key to
 * another RDMA server for use.
 *
 * Protection Domains which RDS uses should be isolated from other IB 
 * / RDMA transport clients. It is illegal for a non RDS client to access
 * memory pinned for RDMA by RDS clients. 
 * 
 * Driver Versioning: 
 *
 * Wire Protocol Versioning: 
 * 
 * Rolling Upgrade: 
 *
 */
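
As a sketch of the bracketing rule in the r1,r2,r3,b1,r4 example above, the check below validates an observed completion order against the issue order. struct op and ordering_ok are hypothetical names used only for illustration; nothing like them is proposed for the interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one issued operation. */
struct op {
    int is_barrier;   /* nonzero for a barrier, zero for an rdma */
    int done_at;      /* position in the observed completion order */
};

/*
 * Check the bracketing rule: every operation issued before a barrier
 * completes before the barrier, and every operation issued after it
 * completes after it. Rdma operations between two barriers may
 * complete in any order. ops[] is in issue order.
 * Returns 1 if the completion order is legal, 0 otherwise.
 */
static int ordering_ok(const struct op *ops, size_t n)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        if (!ops[i].is_barrier)
            continue;
        for (j = 0; j < i; j++)
            if (ops[j].done_at >= ops[i].done_at)
                return 0;
        for (j = i + 1; j < n; j++)
            if (ops[j].done_at <= ops[i].done_at)
                return 0;
    }
    return 1;
}
```

With the issue order r1,r2,r3,b1,r4, a completion order of r2,r3,r1,b1,r4 passes, while any order in which r3 completes after b1 fails.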

/* io buffer */
struct iobuf
{
  ub1 *va;  /* address of buffer */    
  ub4 len;  /* len of buffer */
};

/* message hdr for RDS rdma operation */
/* lcount and localva are an array of local buffer pointers and lengths 
 * (iobuf). */
/* An implementation must support an lcount of at least 4 entries */
struct rdsmhdr
{
  ub1   lcount;    /* count of entries in localva iobuf vector */
  iobuf *localva;  /* vector of local iobufs */
                   /* supports gathered write, and scattered read */
  iobuf remoteva;  /* describing remote va - single linear extent */
  key_t key;       /* memory region key for remote va */
  sockaddr sk;     /* source or destination of rdma + immediate data */
  ub8   operation_id; /* system wide unique identifier of rdma operation */
};
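
The gathered write and scattered read semantics that localva supports can be sketched with plain memory copies standing in for the rdma. This is only an illustration under assumptions: struct sg_buf mirrors struct iobuf using standard C types in place of ub1 / ub4, and gather_write / scatter_read are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for struct iobuf, using standard C types instead of ub1 / ub4. */
struct sg_buf {
    unsigned char *va;   /* address of buffer */
    unsigned int   len;  /* length of buffer */
};

/*
 * Gathered write: copy each local buffer in turn into one linear
 * destination. Returns bytes written, or -1 if dst is too small.
 */
static long gather_write(const struct sg_buf *local, size_t count,
                         unsigned char *dst, size_t dst_len)
{
    size_t off = 0, i;

    for (i = 0; i < count; i++) {
        if (local[i].len > dst_len - off)
            return -1;
        memcpy(dst + off, local[i].va, local[i].len);
        off += local[i].len;
    }
    return (long)off;
}

/*
 * Scattered read: fill each local buffer in turn from one linear
 * source. Stops when the source is exhausted. Returns bytes read.
 */
static long scatter_read(const struct sg_buf *local, size_t count,
                         const unsigned char *src, size_t src_len)
{
    size_t off = 0, i;

    for (i = 0; i < count && off < src_len; i++) {
        size_t n = local[i].len;

        if (n > src_len - off)
            n = src_len - off;
        memcpy(local[i].va, src + off, n);
        off += n;
    }
    return (long)off;
}
```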

/*
 *     Return memory region key for input iobuf which is a single linear
 *     va extent.
 * 
 *     RDS driver must pin VA for I/O and obtain memory key which 
 *     can be used by remote system to perform RDMA operations to local
 *     memory.
 *
 *     NOTE: What about protection domains - should we use a single PD 
 *     for RDS - or perhaps on a cluster instance basis ?
 *
 *     Once a key is obtained it may be used for multiple directed
 *     operations. 
 *
 *     This should be a very lightweight operation as its typical
 *     usage would be to acquire and release a key for every RDMA operation.
 *     For this reason it should use IB Fast Memory Regions (fmr).
 * 
 *     Cleanup at process death. The RDS driver must detect process death
 *     and free memory regions plus unpin memory held by the dying process. 
 *     All resources must be cleaned up before any other process can detect
 *     the death of the dying process. Further, the memory regions freed
 *     must be invalidated and the invalidations must be flushed through to
 *     the HCA. 
 *
 *     Returns:
 *
 *     esucc -> operation was successful.
 *     ebusy (eagain) -> temporary resource exhaustion
 *     einbuf -> buffer can not be mapped due to structure of buffer
 *     einval -> invalid args - programmatic errors
 *     enotsupported -> This transport does not support RDMA operations.
 *     efatal -> ? no such device, etc...
 *
 */ 
int rds_get_mr(iobuf *va, key_t *key, int socket);

/* 
 *     Free the memory region key and possibly invalidate.
 *
 *     Key can be NULL.
 *
 *     socket = identifies local HCA to create key on.
 *
 *     If key is specified, then the key is freed. Freeing a key revokes 
 *     use of the key by all future RDMA operations - but is not required
 *     to be visible immediately unless an invalidate is requested.
 *
 *     If Invalidate == TRUE, issue sync_tpt to HCA forcing Key changes to 
 *     be visible immediately.
 *     
 *     Returns:
 *
 *     esucc -> operation was successful.
 *     enotsupported -> This transport does not support RDMA operations.
 *     efatal -> 
 */
int rds_free_mr(key_t key, boolean invalidate);
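
The failover notes in the header comment leave it to the client to detect a lost operation and re-request it with a fresh key. A rough sketch of that loop follows, assuming callbacks that stand in for rds_get_mr, rds_free_mr, and the request / reply exchange with the rdma server. struct mr_ops, request_with_retry, and the fake_* transport are all hypothetical and exist only to make the example self-contained.

```c
#include <assert.h>

/* Hypothetical callbacks standing in for the key and request calls. */
struct mr_ops {
    int  (*get_key)(void *ctx, unsigned long *key);  /* like rds_get_mr  */
    void (*free_key)(void *ctx, unsigned long key);  /* like rds_free_mr */
    /* Send key to the rdma server and wait for the reply:
     * 0 on success, nonzero on timeout. */
    int  (*request)(void *ctx, unsigned long key);
};

/*
 * Request an rdma, re-requesting with a fresh key after each timeout.
 * Returns the number of attempts used, or -1 if all attempts failed.
 */
static int request_with_retry(const struct mr_ops *ops, void *ctx,
                              int max_attempts)
{
    int attempt;

    for (attempt = 1; attempt <= max_attempts; attempt++) {
        unsigned long key;
        int rc;

        if (ops->get_key(ctx, &key) != 0)
            return -1;
        rc = ops->request(ctx, key);
        ops->free_key(ctx, key);   /* a key is never reused across attempts */
        if (rc == 0)
            return attempt;
    }
    return -1;
}

/* Fake transport for demonstration: the first fail_first requests time out. */
struct fake_path {
    int fail_first;
    int requests;
    unsigned long next_key;
    int live_keys;
};

static int fake_get_key(void *ctx, unsigned long *key)
{
    struct fake_path *p = ctx;

    *key = p->next_key++;
    p->live_keys++;
    return 0;
}

static void fake_free_key(void *ctx, unsigned long key)
{
    struct fake_path *p = ctx;

    (void)key;
    p->live_keys--;
}

static int fake_request(void *ctx, unsigned long key)
{
    struct fake_path *p = ctx;

    (void)key;
    return p->requests++ < p->fail_first ? 1 : 0;
}
```

In a real client the timeout would come from the missing immediate data response, and the new key would be obtained from whichever HCA is active after failover.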

/* 
 *    rdma data to specified remote buffer described in rdsmhdr and
 *    send imd data as message.
 *
 *    mhdr - must contain a valid key for remote buffer and the destination
 *    ip:port (sockaddr) for both the rdma and optional immediate data.
 *
 *    if sync == TRUE -> wait for rdma completion event else return after 
 *    submitting request to transport.
 *
 *    imd - immediate data to be delivered as send message
 *    (not rdma) over same path as mhdr data. Immediate has the same size 
 *    limitations as normal RDS send operations.
 *
 *    direction - direction of rdma operation (read/write)
 *
 *    rds_rdma() is used to rdma a local buffer to/from a specified 
 *    remote buffer, and then to send immediate data as separate 
 *    message over same path.
 *
 *    Both rdma and immediate data must flow over same path. If the
 *    rdma operation fails, then the imd data must not arrive at the
 *    destination. If after the rdma is initiated, a path failover
 *    occurs, the imd send must not be performed. An assumption is 
 *    made that if the rdma fails, the imd data will not be delivered.
 *     
 *    The local source buffer for rdma write and the local destination buffer
 *    for an rdma read may be an array of pointers (iobuf[n]) supporting 
 *    gathered write and scattered read operations. 
 *
 *    A gathered write pulls data from an array of local buffers and rdma 
 *    writes a single linear remote buffer. A scattered read pulls data from 
 *    a single linear remote buffer writing into an array of local buffers.
 *
 *  Returns:
 *
 *  mhdr->operation_id: RDS driver returns a unique system wide identifier
 *      for an rdma operation. This operation id can be used in a barrier to 
 *      determine if this operation (and all preceding) has/have completed.
 *
 *  esucc -> operation was successfully submitted. If "sync" then operation
 *           was successfully completed (transport ack). Does not imply data has been
 *           placed into remote host memory.
 *  enotsupported -> This transport does not support RDMA operations.
 *  efatal ->
 *
 */
int rds_rdma(int socket, rdsmhdr *mhdr, boolean sync, iobuf *imd, 
	     boolean direction);

/* 
 *  Issue a barrier operation. When a barrier completes all previously
 *  issued operations to the specified destination are known to have 
 *  completed either with success or failure status. All buffers specified
 *  for all completed operations have been returned to the RDS client.
 *
 *  Currently, the definition of "locally completed" is that the transport has 
 *  sent the message and has received a transport level ack indicating the remote
 *  transport end has the data. Further, that the local transport has completed its 
 *  use of the buffers and ownership of the buffers is returned to the client. A local
 *  completion does not indicate that the operation has made it to remote
 *  host memory (which would be a remote completion).
 *
 *  socket = local socket
 *
 *  operation_id -> check if zcopy operations up to operation id are complete. If NULL
 *                  then check for any / all outstanding operations.
 *
 *  If no operation_id is specified - then the barrier checks for any outstanding
 *  zcopy operation.
 *
 *  flags = SYNC -> operation is synchronous, otherwise should honor mode of socket.
 *          REMOTE -> barrier completion indicates operations have completed to remote
 *                    host memory. If not set, then local completion is indicated.
 *  NOTE: we currently are not planning to use REMOTE barrier completions.
 *
 *  dest = destination end point to flush messages to.
 *
 *  Returns:
 *
 *  esucc -> operation was successfully completed (transport ack).
 *  eagain -> operation is not complete - non blocking socket.
 *  efatal -> other fatal errors.
 *
 */
int rds_barrier(int socket, ub8 operation_id, ub1 flags, sockaddr dest);
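
One way a driver might back the non-blocking form of this call with the sequence number and high water mark idea from the header comment is sketched below. It is an assumption-laden illustration, not the driver design: struct op_tracker and the tracker_* functions are hypothetical, the tracker is a single instance rather than system wide, and it assumes completions can arrive out of order, so the high water mark only advances over contiguous completed ids.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define WINDOW 64  /* max operations in flight; an arbitrary choice */

struct op_tracker {
    uint64_t next_id;            /* id to assign at next submit */
    uint64_t high_water;         /* all ids <= high_water are complete */
    unsigned char done[WINDOW];  /* completion flags, indexed by id % WINDOW */
};

static void tracker_init(struct op_tracker *t)
{
    memset(t, 0, sizeof *t);
    t->next_id = 1;              /* id 0 means "nothing completed yet" */
}

/* Assign the next sequence number; returns 0 if the window is full. */
static uint64_t tracker_submit(struct op_tracker *t)
{
    if (t->next_id - t->high_water > WINDOW)
        return 0;
    return t->next_id++;
}

/* Record local completion (success or failure) of one operation. */
static void tracker_complete(struct op_tracker *t, uint64_t id)
{
    assert(id > t->high_water && id < t->next_id);
    t->done[id % WINDOW] = 1;
    /* Advance over every contiguous completed id. */
    while (t->high_water + 1 < t->next_id &&
           t->done[(t->high_water + 1) % WINDOW]) {
        t->high_water++;
        t->done[t->high_water % WINDOW] = 0;
    }
}

/*
 * Non-blocking barrier: 0 if every operation up to op_id is complete,
 * EAGAIN otherwise. *completed receives the current high water mark.
 */
static int tracker_barrier(const struct op_tracker *t, uint64_t op_id,
                           uint64_t *completed)
{
    *completed = t->high_water;
    return t->high_water >= op_id ? 0 : EAGAIN;
}
```

A single tracker_barrier() call on the last submitted id then reports many completions at once, which matches the "blast, then periodically check" usage described for the rdma server.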