[rds-devel] Re: RDS/RDMA protocol
Richard Frank
richard.frank at oracle.com
Mon Nov 5 11:10:42 PST 2007
Attached is the latest rds_v3.h - proposal for zero copy extensions.
I will shortly post rds_v3_api.h - which matches the actual RDS v3
driver design and implementation we settled on so far.
I updated the section dealing with waiting for rdma operations -
"Unified wait model for RDS rdma and non-rdma operations".
In summary, when an rdma server has nothing to do - it will wait in
poll. It may have rdma operations outstanding (rdma reads / writes) -
which when complete need to wake poll waiters - along with the usual
suspects - incoming messages, the availability of send space, and
removal of congestion... When awakened from poll - the rdma server will
issue calls to recv / send / barrier to handle the poll conditions
satisfied...
For example, consider a posted rdma read to pull data in from a remote
host. Once the read is posted the rdma server may wait in poll - for any
of - rdma read completion (poll_in) + incoming messages (poll_in) + send
space availability (poll_out) + congestion removed (poll_out).
Richard Frank wrote:
> Zach Brown wrote:
>> You should be sending these mails to rds-devel. There's a great chance
>> that you'll want the people actively involved in RDS to respond.
>>
>>
> Yes - we're getting hammered about not using rds-dev - let's try and
> move all our RDS discussion - that does not contain Oracle private
> information - into rds-dev.
>
>> Or Gerlitz wrote:
>>
>>> Hi Zach,
>>>
>>> (I hope you can address this, as from the git changelog it seems you
>>> were doing at least part of the rdma related commits)
>>>
>>
>> I threw together the first version of the core support which sits above
>> the transports.
>>
>>> Can you spare a few
>>
>>> lines on the rds rdma design? looking at the code and using some info
>>> from an email I got once from Rick I can't nail this.
>>>
>>
>> Rick, can you publish your nice RDS/RDMA design document somewhere?
>>
>>
> I'll clean it up some more and post it - still a way to go before it's
> 'nice'.
>
> Guess it's time for a man page to describe the rdma interface.
>
>>> From the code (OFED 1.3) I see that:
>>>
>>> A) the "client" (the side that does --not-- do the actual RDMA) should
>>> do an RDS_GET_MR setsockopt() call which pins the data, fmr it and
>>> deliver back to the caller the remote address and rkey to be used.
>>>
>>> B) the remote addr / rkey info are probably exchanged through regular
>>> rds message.
>>>
>>> C) the "server" (the side that does the actual RDMA) calls sendmsg,
>>> where the code of rds_rdma_msghdr_parse somehow magically realizes that
>>> the msghdr contains rkey and raddr and hence it's an RDMA here; next the RDMA
>>> op is issued to the HCA
>>>
>>>
> The message send is a compound operation - it includes a regular RDS
> send (the immediate data) and the RDMA data.
>
> The rdma operation is pushed down the wire and then the immediate data
> is pipelined right behind the RDMA.
>
> Keep in mind when the client requested the rdma - it sent along an
> identifier for the operation. When the rdma server responds - it
> encodes the client provided identifier in the immediate data response
> message.
>
> So when the client recv's the immediate data message (response to
> request) - it gets back the id of the request which completed and it
> knows the rdma is complete due to ordering of the RC that both ops
> traveled over.
>
>>> D) the RDMA op completes
>>>
>>> ...
>>>
>>> F) the "server" app side somehow know that the RDMA is done
>>>
>>>
> The rds_barrier is used to detect which rdma operations have
> completed. When the rdma + immediate data was initiated an rdma
> operation ID was returned at sendmsg completion to the initiator. Note
> that sendmsg can return before the operations are initiated over the
> wire - they may be queued for delivery.
>
> The rdma operation ID is a monotonically increasing value. The RDS
> driver maintains a high water mark of completed rdma operations - and
> the barrier operation returns this value to the client.
>
> So a single barrier call (which can block if blocking socket) can
> indicate many operations have completed.
>
> The model for the rdma server is to "pseudo" async blast rdma's back
> to the clients - and periodically check for RDMA completions - when it
> needs to reclaim resources.
>
> One more important point about RDMA operations - even though they are
> running over RDS - the rdma operations are themselves not reliable.
> That is, in light of certain failures, the RDMA + immediate data may
> be lost. It is up to the requester of the RDMA operation to detect the
> loss and re-request the operation.
>
> Take a look at the modified rds-stress.c for rdma - you can see the
> usage model there.
>>> G) the "client" app side gets a message from the "server" side, telling
>>> that the RDMA is done and it can issue an RDS_FREE_MR setsockopt() call
>>>
>>>
>>> Am I correct in A-D and F-G above?
>>>
>>> What I mainly miss is how the RDMA is being acked to the client, Rick
>>> was mentioning immediate data usage but I don't see an evidence to that
>>> in the code.
>>>
>>
>> Rick, I'll let you explain to Or how skgxp is using RDS RDMA.
>>
>> - z
>>
>
-------------- next part --------------
/* RDS V3 interface extensions for zero copy directed (rdma) operations.
*
* NOTE: This is a draft proposal - it is not a design specification.
*
* Reliable Datagram Sockets (RDS) provide in order, non-duplicating,
* highly available, reliable delivery of datagrams. Zero copy "directed"
* operations (rds_rdma) however, are not reliable and have less stringent
* ordering requirements.
*
* Directed operations involve performing an rdma to, or rdma from, a
 * specified remote buffer and possibly chasing the rdma operation with a
 * completion message (immediate data).
*
* Immediate data must not arrive before the associated rdma and must use
* the same physical path as the RDMA operation. It is invalid to send an
* rdma over one path and immediate data over another path (consider
* path / HCA failover). If an rdma operation fails, then its respective
* immediate data must not arrive.
*
* RDMA operation identifier: The RDS driver returns a unique (system wide)
* rdma operation identifier at operation submit.
* This operation identifier can be used in subsequent barrier operations to
 * detect completion of all operations up to and including the specified
* operation id.
*
* A possible implementation of "operation ID" in an RDS driver would be a
* sequence number. As each operation is submitted it is assigned the next
* sequence number in order. As an RDMA operation is locally completed
* (success or failure), its operation ID is recorded as a high water mark
 * of completed operations. Subsequent barrier(op_id) operations can compare
* the submitted barrier operation id with the high water mark in the driver.
*
* Barriers: A barrier operation is used to detect that all previously
 * issued operations (up to the specified operation id) are complete.
* Furthermore, a barrier can be scoped to indicate local completion or
* remote completion.
*
* A "locally complete" barrier indicates that the local end of the transport
 * is finished with buffers provided for the rdma operations, whereas a
 * "remotely complete" barrier indicates that all preceding operations have
* been completed all the way to remote host memory.
*
* NOTE: We currently do not have a requirement for "remote barriers".
*
* An rds client can depend on a barrier local completion indicating:
*
* 1) ownership of all local buffers used by rdma operations issued prior to
* the completing barrier are now returned to the client.
 * 2) The local transport has completed all local processing of all preceding
* operations. With IB this implies that all operations have either failed
* or have been successfully acked by the remote HCA.
*
* An rds client can depend on a barrier remote completion indicating:
*
 * 1) Local transport processing is complete.
 * 2) Remote transport processing has completed - implying that all preceding
* operations have been placed in remote host memory.
*
* Ordering between RDMA operations is not required. However,
* ordering between barriers, RDMA operations, and immediate data is.
*
* Clients of RDS can depend on the ordering of barriers, wrt other
* rds operations, to "bracket" a set of rds operations such that the barrier
* completion indicates all prior rds operations have completed.
*
* For example, if the issue order is rdma r1,r2,r3, barrier b1, rdma r4;
 * r1,r2,r3 can complete in any order wrt each other - but r1,r2,r3
* must all complete before b1. r4 must complete after b1.
*
* To perform a directed operation (rds_rdma), clients obtain keys for local
* buffers and send the keys to rdma servers. A key is a descriptor for a
* local memory buffer which is obtained from a local HCA and that can be
* used for RDMA operations. An rdma server will use a supplied key
* to rdma data to or from a remote host, and optionally send back a
* completion message (immediate data).
*
* Keys are local to a specific HCA. If path failover occurs, keys obtained
* from the now non-active HCA are invalid. It is up to clients of RDS to
* detect rdma failures and retry operations using new keys obtained from
* an active HCA. Generally, a simple operation timeout on the client can be
* used to re-request an rdma operation.
*
* Gathered writes and scattered reads: The local source buffer for an rdma
* write and the local destination buffer for an rdma read may be an array of
 * pointers (iobuf[n]) supporting gathered write and scattered read
 * operations.
*
* A gathered write pulls data from an array of local buffers and rdma writes
* a single linear remote buffer. A scattered read pulls data from a single
* linear remote buffer writing into an array of local buffers.
*
* Pseudo async operations are achieved by not requiring an rds_rdma operation
* to wait for an explicit transport completion message. An RDS
* client can determine when local buffers are no longer in use via a
* barrier operation.
*
* For async operations, rds_rdma completion status only reflects successful
 * submission of the rdma operation to the local RDS driver (arg checking passed).
*
* If an RDS client must know that an rdma operation was successfully
* delivered to the remote HCA, it can issue the operation with sync = TRUE.
 * When a sync operation returns, its completion status reflects the status
 * of the rdma operation.
*
* Note: we currently have no requirement for sync rdma operations.
*
* An alternative to using sync operations would be to issue a barrier with
* the operation id of the submitted operation. If the operation is complete
* the barrier will return with success, otherwise with eagain, until the
 * specified RDMA (and all preceding) completes.
*
* In general, a "barrier" operation
* can be used to chase multiple async ops. When the barrier completes
* it indicates that all prior ops are also complete. Successful
* completion of a barrier does not imply anything about the status of prior
* async operations (which may have failed or succeeded) - just that they
* have all completed.
*
* Many async operations may be queued for transmission. An RDS implementation
* is encouraged to maintain a full pipeline on the transport taking into
 * account bandwidth delay product, and to internally queue operations which
* exceed internal resource limitations - vs - returning resource exhaustion
* status.
*
 * Load Balancing: implemented in the RDS client application via multiple paths
* in conjunction with the bonding driver for HA. When multiple paths are
* configured - the RDS client will establish communication over each path
* issuing / balancing requests over the paths.
*
* Failover: Directed (rdma) operations and associated immediate data do not
* failover (are not reliable) - however, non directed RDS operations are
 * reliable. It is up to the client
* of an rds_rdma operation to detect a lost operation and re-issue the op by
* freeing the old key, acquiring a new key, and sending it to the rdma
* server.
*
* NOTE: The Oracle rdma server client will be modified to timeout an I/O
* and retry.
*
* Fast Fail detection / Supporting Detection of path failover:
* Clients will probably use a long timeout to detect lost requests and
* re-request the data. In the case of a true path failover we need a
 * mechanism for clients to detect that a failover has occurred so they can
* quickly re-issue requests.
*
* Perhaps the socket error q can be used for this - an error would wake a
* poll waiter with pollin set.
*
* Wait model: The general wait model for RDS zcopy operations, is to issue
* many async rdma operations,
* and then to check for completion with barrier(last_op). If the last op is
* complete, then all prior ops are complete too. If the last op is not
* complete, then the barrier returns the current completed operation id.
*
 * If the socket is blocking, a barrier operation will wait until the
 * requested operation has completed; on a non-blocking socket it returns
 * eagain instead.
*
* Unified wait model for rds rdma and rds non-rdma operations:
*
 * RDS rdma initiators waiting in poll() with POLLIN should be awakened
 * when RDMA reads complete, and when waiting in poll() with POLLOUT set
 * should be awakened when rdma write ops complete.
*
 * NOTE: at this point we are not planning to use POLLOUT / wait for rdma
* write completions.
*
* For example, an rdma server processing a request to read data from a remote
* system - will post the rdma read operation and then continue processing
* additional incoming requests via RDS. The rdma server may wait in
* poll for additional requests and as well for initiated RDMA operations to
* complete. The select list to poll will include the list of interesting
* sockets. When the posted rdma operation completes, poll returns and the
 * client will use a barrier operation to see which rdma ops have completed,
 * issuing the corresponding recv / send / rdma operations...
*
 * MTU: RDS should report an MTU of at least 1 Mbyte.
*
* Local memory pinning: RDS must pin local memory buffers, which are used
* as source data (local buffers) in zero copy rdma operations, for the
* duration of the operation.
*
* Pinning of VA for the RDMA buffer for which the client obtains an FMR key -
* is done within rds_get_mr(va). This memory must stay pinned until the client
* calls rds_free_mr(va) - or the process dies - in which case
* the driver must clean up the pins.
*
* Pinning of VA for local buffers to use in issuing the RDMA (read / write)
* to the remote buffer are pinned inside of rds_rdma() - and only stay pinned
* for the duration of the RDMA read / write post to the xport (until acked
* by transport). This pinning/unpinning of local buffers
* all happens inside the driver.
*
* Process death: Keys obtained by a process must be invalidated before any
 * memory referenced by the keys is released from the dying process and
 * before any other process can detect the death of the dead process. Local
* memory pins must also be released.
*
* Protection Domains: RDS should use a single PD for all node to node
* connections. It is a requirement for keys to be valid over different
* node to node RCs. It is legal for an RDS client to acquire a key, send
* the key to an RDMA server - an RDMA server can pass a key to
* another RDMA server for use.
*
* Protection Domains which RDS uses should be isolated from other IB
* / RDMA transport clients. It is illegal for a non RDS client to access
* memory pinned for RDMA by RDS clients.
*
* Driver Versioning:
*
* Wire Protocol Versioning:
*
* Rolling Upgrade:
*
*/
/* io buffer */
struct iobuf
{
ub1 *va; /* address of buffer */
ub4 len; /* len of buffer */
};
/* message hdr for RDS rdma operation */
/* lcount and localva are an array of local buffer pointers and lengths
* (iobuf). */
/* An implementation must support an lcount of at least 4 entries */
struct rdsmhdr
{
ub1 lcount; /* count of entries in localva iobuf vector */
iobuf *localva; /* vector of local iobufs */
/* supports gathered write, and scattered read */
iobuf remoteva; /* describing remote va - single linear extent */
key_t key; /* memory region key for remote va */
sockaddr sk; /* source of or destination of rdma + immediate data*/
ub8 operation_id; /* system wide unique identifier of rdma operation */
};
/*
* Return memory region key for input iobuf which is a single linear
* va extent.
*
* RDS driver must pin VA for I/O and obtain memory key which
* can be used by remote system to perform RDMA operations to local
* memory.
*
* NOTE: What about protection domains - should we use a single PD
* for RDS - or perhaps on a cluster instance basis ?
*
* Once a key is obtained it may be used for multiple directed
* operations.
*
 * This should be a very lightweight operation as its typical
* usage would be to acquire and release a key for every RDMA operation.
* For this reason it should use IB Fast Memory Regions (fmr).
*
 * Cleanup at process death. The RDS driver must detect process death
 * and free memory regions plus unpin memory held by the dying process.
 * All resources must be cleaned up before any other process can detect
 * the death of the dying process. Further, the memory regions freed
 * must be invalidated and the invalidations must be flushed through to
 * the HCA.
*
* Returns:
*
* esucc -> operation was successful.
* ebusy (eagain) -> temporary resource exhaustion
 * einbuf -> buffer cannot be mapped due to structure of buffer
 * einval -> invalid args - programmatic errors
* enotsupported -> This transport does not support RDMA operations.
* efatal -> ? no such device, etc...
*
*/
int rds_get_mr(iobuf *va, key_t *key, int socket);
/*
* Free the memory region key and possibly invalidate.
*
* Key can be NULL.
*
* socket = identifies local HCA to create key on.
*
* If key is specified, then the key is freed. Freeing a key revokes
* use of the key by all future RDMA operations - but is not required
* to be visible immediately unless an invalidate is requested.
*
* If Invalidate == TRUE, issue sync_tpt to HCA forcing Key changes to
* be visible immediately.
*
* Returns:
*
* esucc -> operation was successful.
* enotsupported -> This transport does not support RDMA operations.
* efatal ->
*/
int rds_free_mr(key_t key, boolean invalidate);
/*
* rdma data to specified remote buffer described in rdsmhdr and
* send imd data as message.
*
* mhdr - must contain a valid key for remote buffer and the destination
* ip:port (sockaddr) for both the rdma and optional immediate data.
*
* if sync == TRUE -> wait for rdma completion event else return after
* submitting request to transport.
*
* imd - immediate data to be delivered as send message
* (not rdma) over same path as mhdr data. Immediate has the same size
* limitations as normal RDS send operations.
*
* direction - direction of rdma operation (read/write)
*
* rds_rdma() is used to rdma a local buffer to/from a specified
* remote buffer, and then to send immediate data as separate
* message over same path.
*
* Both rdma and immediate data must flow over same path. If the
* rdma operation fails, then the imd data must not arrive at the
* destination. If after the rdma is initiated, a path failover
* occurs, the imd send must not be performed. An assumption is
* made that if the rdma fails, the imd data will not be delivered.
*
* The local source buffer for rdma write and the local destination buffer
 * for an rdma read may be an array of pointers (iobuf[n]) supporting
 * gathered write and scattered read operations.
*
 * A gathered write pulls data from an array of local buffers and rdma
* writes a single linear remote buffer. A scattered read pulls data from
* a single linear remote buffer writing into an array of local buffers.
*
* Returns:
*
* mhdr->operation_id: RDS driver returns a unique system wide identifier
* for an rdma operation. This operation id can be used in a barrier to
 * determine if this operation (and all preceding) has/have completed.
*
* esucc -> operation was successfully submitted. If "sync" then operation
* was successfully completed (transport ack). Does not imply data has been
 * placed into remote host memory.
* enotsupported -> This transport does not support RDMA operations.
* efatal ->
*
*/
int rds_rdma(int socket, rdsmhdr *mhdr, boolean sync, iobuf *imd,
boolean direction);
/*
* Issue a barrier operation. When a barrier completes all previously
* issued operations to the specified destination are known to have
* completed either with succ or failure status. All buffers specified
* for all completed operations have been returned to the RDS client.
*
 * Currently, the definition of "locally completed" is that the transport has
 * sent the message and has received a transport level ack indicating the
 * remote transport end has the data. Further, the local transport has
 * completed its use of the buffers and ownership of the buffers is returned
 * to the client. A local completion does not indicate that the operation has
 * made it to remote host memory (which would be a remote completion).
*
* socket = local socket
*
 * operation_id -> check if zcopy operations up to operation id are complete.
 * If no operation_id is specified (NULL), then the barrier checks for any /
 * all outstanding zcopy operations.
*
* flags = SYNC -> operation is synchronous, otherwise should honor mode of socket.
* REMOTE -> barrier completion indicates operations have completed to remote
* host memory. If not set, then local completion is indicated.
* NOTE: we currently are not planning to use REMOTE barrier completions.
*
 * dest = destination end point to flush messages to.
*
* Returns:
*
* esucc -> operation was successfully completed (transport ack).
* eagain -> operation is not complete - non blocking socket.
* efatal -> other fatal errors.
*
*/
int rds_barrier(int socket, ub8 operation_id, ub1 flags, sockaddr dest);