RDS Wire Specification 3.1

Last updated Feb 24 2017

Reliable Datagram Sockets (RDS) is a high-performance, low-latency, reliable, connectionless protocol for delivering datagrams. RDS provides reliable, ordered datagram delivery by using a single reliable transport between two nodes. RDS may be built over any transport that provides reliable datagram delivery, such as TCP or InfiniBand Verbs Reliable Connected connections.

From the application's point of view, the RDS connection is set up using IP addresses that uniquely identify the sending and receiving nodes, and 16-bit port numbers that identify the RDS socket end-points at each node. The RDS port space is entirely independent of TCP, UDP or any other port-based protocol.

When RDS uses TCP as the underlying transport, the application data is encapsulated in an RDS header and tunnelled over TCP to the destination node at a well-known TCP port, where it is decapsulated and passed up on a PF_RDS socket. The TCP port 16385 [RDGS] has been assigned by IANA for the RDS-over-TCP service. The setup and management of the TCP connection is transparent to the application, which can access the RDS socket through a simple BSD socket API. Multiple RDS services may thus share a single TCP connection, and a common congestion-management algorithm for that TCP connection. The application may use a single local socket endpoint to reliably send and receive datagrams to multiple destinations without needing any additional setup explicitly initiated by the application for each destination.

When the application attempts to send a datagram using the POSIX sendmsg() on a PF_RDS socket with TCP as the underlying transport, the implementation at the socket layer will initiate the TCP three-way handshake to the RDS decapsulation service listening at the well-known port [RDGS]. The application's datagram is then encapsulated in an RDS header and sent to the server as a unicast TCP packet whose TCP source port is an anonymous port and whose TCP destination port is [RDGS]. At the server, the RDS service listening at the [RDGS] port strips the RDS header and delivers the datagram to any PF_RDS socket that matches the destination port number specified in the RDS header.

RDS Protocol Limits

Maximum Datagram payload: 1MB
Maximum RDMA Read/Write size: 1MB
Atomic operation size: 8 bytes

RDS 3.1 header

type or size (bits)	name	description
be64	h_sequence	sequence number
be64	h_ack	sequence number of last received message
be32	h_len	length of the message payload
be16	h_sport	port on source node
be16	h_dport	port on destination node
8	h_flags	Described below
8	h_credit	Credits given (used for credit-based flow control)
32	h_padding	padding for 64-bit struct alignment
16	h_csum	1's complement header checksum
128	h_exthdr	optional extension header space

Total size: 48 (0x30) bytes
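
The header layout in the table above can be sketched with Python's struct module. This is an illustration, not a reference implementation. The helper names (pack_header, unpack_header, ones_complement_sum16) are hypothetical, and the checksum rule is an assumption: the table only says "1's complement header checksum", so the sketch uses the Internet-style 16-bit ones'-complement sum over the whole header with h_csum zeroed while summing.

```python
import struct

# Field order and widths follow the header table; ">" means big-endian
# with no implicit padding. "4x" is the 32-bit h_padding field.
RDS_HEADER = struct.Struct(">QQIHHBB4xH16s")
CSUM_OFFSET = struct.calcsize(">QQIHHBB4x")  # byte offset of h_csum
FIELDS = ("h_sequence", "h_ack", "h_len", "h_sport", "h_dport",
          "h_flags", "h_credit", "h_csum", "h_exthdr")


def ones_complement_sum16(data: bytes) -> int:
    """Fold the sum of big-endian 16-bit words into 16 bits."""
    total = sum(word for (word,) in struct.iter_unpack(">H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pack_header(seq, ack, length, sport, dport, flags=0, credit=0, exthdr=b""):
    """Build a header; the checksum rule is an assumption (see lead-in)."""
    if len(exthdr) > 16:
        raise ValueError("h_exthdr holds at most 16 bytes")
    raw = RDS_HEADER.pack(seq, ack, length, sport, dport, flags, credit, 0, exthdr)
    csum = ~ones_complement_sum16(raw) & 0xFFFF
    return raw[:CSUM_OFFSET] + struct.pack(">H", csum) + raw[CSUM_OFFSET + 2:]


def unpack_header(raw: bytes) -> dict:
    """Split a raw header into a dict keyed by the field names in the table."""
    return dict(zip(FIELDS, RDS_HEADER.unpack(raw)))


hdr = pack_header(seq=1, ack=0, length=5, sport=4000, dport=4001)
print(RDS_HEADER.size, unpack_header(hdr)["h_dport"])
```

Under the assumed checksum rule, a receiver can verify a header by summing all of its 16-bit words, h_csum included, and checking that the folded result is 0xFFFF.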

Flags

0x01: CONG_BITMAP: Message payload is CONG_MAP_BYTES (8192) bytes long and contains the other node's RDS port congestion map. Local node will not send messages for those ports until a later congestion update arrives, marking the port as no longer congested. See Congestion Handling section below.
0x02: ACK_REQUIRED: Receiving node may not lazily ack this message.
0x04: RETRANSMITTED: Message was retransmitted. Receiving node will discard if an old sequence number is seen along with this flag; an old sequence number without this flag could mean the remote node has restarted, so it should be accepted.
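
As an illustration of the RETRANSMITTED rule, the sketch below decides whether a received message should be accepted. The function name and the next_expected_seq parameter are hypothetical. The sketch also assumes that "old" means a sequence number lower than the next one the receiver expects, and it ignores sequence wraparound.

```python
CONG_BITMAP = 0x01
ACK_REQUIRED = 0x02
RETRANSMITTED = 0x04


def should_accept(h_sequence: int, next_expected_seq: int, h_flags: int) -> bool:
    """Discard only an old sequence number that is marked as a retransmit.

    An old sequence number without RETRANSMITTED may come from a restarted
    peer, so it is accepted.
    """
    is_old = h_sequence < next_expected_seq
    return not (is_old and (h_flags & RETRANSMITTED))
```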

Extension Headers

The 128 bits of space in h_exthdr may optionally be used for extension headers. The RDS header is capable of accommodating multiple exthdrs, but the current specification is limited to either one or zero.

EXTHDR code	Name	Format	Description
0	EXTHDR_NONE	N/A	No further exthdrs in RDS header
1	EXTHDR_VERSION	be32: version	UNUSED. Indicate RDS version supported on first message sent
2	EXTHDR_RDMA	be32: rdma_rkey	List RDMA r_key just used by RDMA operation preceding this message, so it may be freed. See RDMA section for more details.
3	EXTHDR_RDMA_DEST	be32: rdma_rkey; be32: rdma_offset	Passes values to remote node that may be used to perform an RDMA operation. See RDMA section for more details.
5	EXTHDR_NPATHS	be16: number of mprds paths	Number of Multipath RDS paths supported by the sender. See MPRDS section for details
6	EXTHDR_GEN_NUM	be32: Generation number of sender	Unique number identifying the sender. See section titled "RDS Generation Number" below

The initial byte should be the extension header type, followed by the extension header payload. Maximum payload is therefore 15 bytes.
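
A minimal sketch of encoding and decoding the single extension header allowed by this specification follows. The payload formats are taken from the table above. The function names are hypothetical, and zero-filling the unused bytes of h_exthdr is an assumption, chosen so that the remaining space reads as EXTHDR_NONE.

```python
import struct

EXTHDR_SPACE = 16   # 128 bits of h_exthdr
EXTHDR_NONE = 0
# Payload layout per extension header code, from the table (big-endian).
EXTHDR_FORMATS = {
    1: ">I",    # EXTHDR_VERSION: version
    2: ">I",    # EXTHDR_RDMA: rdma_rkey
    3: ">II",   # EXTHDR_RDMA_DEST: rdma_rkey, rdma_offset
    5: ">H",    # EXTHDR_NPATHS: number of paths
    6: ">I",    # EXTHDR_GEN_NUM: generation number
}


def encode_exthdr(code: int, *values: int) -> bytes:
    """Type byte, then payload, zero-filled to 16 bytes (assumption)."""
    body = bytes([code]) + struct.pack(EXTHDR_FORMATS[code], *values)
    return body.ljust(EXTHDR_SPACE, b"\x00")


def decode_exthdr(space: bytes):
    """Return (code, values), or None if there is no recognised exthdr."""
    code = space[0]
    fmt = EXTHDR_FORMATS.get(code)
    if code == EXTHDR_NONE or fmt is None:
        return None  # unknown types are simply ignored in this sketch
    return code, struct.unpack_from(fmt, space, 1)


print(decode_exthdr(encode_exthdr(3, 0x1234, 64)))
```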

Sequence numbers and acknowledgements

Rationale

While the lower-level connection-oriented transports that RDS uses are themselves reliable, this guarantees only that the low-level transport has received a message, not that the message has actually been delivered to the receiving socket. RDS implements its own reliability mechanism to protect against this case.

Implementation

When a transport-layer connection is established, a sequence number is assigned to every message except congestion map updates and ack-only messages, starting at 1 and incrementing by 1 for each full message (not for each frag).

The receiving node should place the latest successfully received sequence number in the h_ack field of outgoing messages. Upon receiving a header with ACK_REQUIRED bit set in h_flags, the receiving node will send an ack-only message, if it is unable to piggyback the ack in an outgoing message of its own. It is recommended to limit ack-only messages, possibly by ensuring only one ack-only message is in the transmit queue at any time.

ACK-only Messages

ACK-only messages consist of just the RDS header. h_sport, h_dport, and h_len are 0. There are no extension headers. All flags are 0.
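
The sequence and acknowledgement rules above can be sketched as a small per-peer state object. The class and method names are hypothetical, and headers are represented as plain dicts for brevity. Setting h_sequence to 0 in an ack-only message is an assumption, since the text only says that ack-only messages do not consume a sequence number.

```python
ACK_REQUIRED = 0x02


class PeerState:
    """Tracks sequence and ack numbers for one transport connection."""

    def __init__(self):
        self.next_tx_seq = 1          # sequence numbers start at 1
        self.last_rx_seq = 0          # nothing received yet
        self.ack_only_queued = False  # at most one ack-only message queued

    def next_data_header(self, length, sport, dport, flags=0):
        """Header for a full message; the latest ack is piggybacked in h_ack."""
        seq = self.next_tx_seq
        self.next_tx_seq += 1
        return {"h_sequence": seq, "h_ack": self.last_rx_seq, "h_len": length,
                "h_sport": sport, "h_dport": dport, "h_flags": flags}

    def ack_only_header(self):
        """Ports, length and flags are 0; h_sequence = 0 is an assumption."""
        return {"h_sequence": 0, "h_ack": self.last_rx_seq, "h_len": 0,
                "h_sport": 0, "h_dport": 0, "h_flags": 0}

    def on_receive(self, h_sequence, h_flags, have_outgoing_data):
        """Record the message; return an ack-only header if one is needed."""
        self.last_rx_seq = h_sequence
        if (h_flags & ACK_REQUIRED) and not have_outgoing_data \
                and not self.ack_only_queued:
            self.ack_only_queued = True
            return self.ack_only_header()
        return None

    def ack_only_sent(self):
        """Call once the queued ack-only message has left the transmit queue."""
        self.ack_only_queued = False
```

The ack_only_queued flag implements the recommendation to keep no more than one ack-only message in the transmit queue at a time.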

Congestion control

Rationale

Once data arrives for a socket, there is no guarantee that that socket's process will process that data in a timely manner. The socket's receive buffer may fill up. If this happens, all additional data arriving for that socket will cause it to exceed per-socket buffer size limits until the socket's process can handle the backlog. RDS implements per-socket congestion notification, to prevent messages being sent to ports with full receive buffers.

Congestion is expected to be rare.

Implementation

A port is congested when the number of bytes in the socket's receive buffer is greater than or equal to the maximum socket receive buffer, as set by SO_RCVBUF, for example.

When a receiving port becomes congested, or when it is no longer congested, RDS sends a congestion map update to all remote nodes it is connected to.

A congestion map update has the CONG_BITMAP bit set in the RDS header's h_flags field, and contains 8192 bytes of data. This 8KB of data is treated as an array of 1024 64-bit little-endian bitfields. array[0] contains the congestion status of ports 0-63, array[1] contains 64-127, and so on.
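
The bitmap layout can be illustrated with the sketch below, which sets and tests the congestion bit for a port. The function names are hypothetical. The sketch assumes that, within each 64-bit little-endian word, bit 0 is the least significant bit and corresponds to the lowest-numbered port of that word.

```python
CONG_MAP_BYTES = 8192  # 1024 words x 64 bits = 65536 ports


def _locate(port: int):
    """Return (byte offset, bit mask) of a port's bit in the map."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("RDS ports are 16-bit")
    word, bit = divmod(port, 64)
    # In a little-endian word, bit b lives in byte b // 8 of that word.
    return word * 8 + bit // 8, 1 << (bit % 8)


def set_congested(cmap: bytearray, port: int, congested: bool) -> None:
    """Mark a port as congested or not congested in the map."""
    offset, mask = _locate(port)
    if congested:
        cmap[offset] |= mask
    else:
        cmap[offset] &= ~mask & 0xFF


def is_congested(cmap, port: int) -> bool:
    """Report whether the map marks the port as congested."""
    offset, mask = _locate(port)
    return bool(cmap[offset] & mask)


cmap = bytearray(CONG_MAP_BYTES)
set_congested(cmap, 64, True)  # first bit of array[1]
print(is_congested(cmap, 64), is_congested(cmap, 63))
```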

Messages received for a port while the port is congested must still be accepted.

Connection re-establishment

If the transport connection is broken for any reason, both nodes will attempt to reconnect indefinitely. If reconnection attempts fail, a node may retry less often, backing off to as little as one attempt per second.

Ping/Pong

An RDS socket may ping another node by sending a message to the remote node's port 0. The RDS implementation on the remote node will send a reply to the sending port, from port 0. The response will contain no data payload.

Protocol negotiation

Negotiation of RDS features and level of support is transport-specific; please see the relevant section.

RDMA

Some transports support RDMA operations, and there are extension headers used to enable this. The operations themselves do not use the RDS wire protocol, but an RDS message is typically sent to request an RDMA op, as well as trailing an RDMA op. The trailing message indicates the op is complete and resources may be freed. Since RDS is responsible for key and memory region management, r_key data is not passed as part of the message payload, but as part of the RDS header.

EXTHDR RDMA DEST

When an RDS client asks another node to perform an RDMA operation to the client's memory, the remote node must be given a token that allows it access to that memory. The RDMA DEST exthdr is used to pass this information to the remote node's RDS implementation. The remote node's RDS will give the remote client an opaque reference "cookie" that allows the remote client to initiate an RDMA operation to the local client's memory.

EXTHDR RDMA

The RDMA exthdr is used in the message trailing the RDMA operation, to indicate the r_key just used. The receiving RDS implementation should ensure the associated memory region is accessible coherently by the CPU, and may also free the memory region mapping, if the local client requested use-once behavior.

RDS/IB Transport

RDS traffic to a remote node is conveyed over a single Reliable Connected (RC) connection.

Fragmented messages (frags)

Messages with payloads over 4096 (FRAG_SIZE) bytes will be broken up into individual WRs and sent in-order without intervening non-ack-only messages to the receiving node. Each frag except the last shall contain a payload of FRAG_SIZE bytes. Each frag's RDS header shall be identical. (h_len shall be the length of the entire message, not just the frag.)
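
The fragmentation rule can be sketched as follows. The function names are hypothetical, and only payload handling is shown. In a real implementation each frag would be preceded by the same RDS header, with h_len set to the length of the whole message.

```python
FRAG_SIZE = 4096


def fragment(payload: bytes) -> list:
    """Split a payload into frags; every frag but the last is FRAG_SIZE bytes."""
    frags = [payload[i:i + FRAG_SIZE] for i in range(0, len(payload), FRAG_SIZE)]
    return frags or [b""]  # an empty message is still one frag in this sketch


def frags_expected(h_len: int) -> int:
    """Number of frags a receiver should expect for a message of h_len bytes."""
    return max(1, -(-h_len // FRAG_SIZE))  # ceiling division


def reassemble(frags, h_len: int) -> bytes:
    """Concatenate in-order frags and check the result against h_len."""
    data = b"".join(frags)
    if len(data) != h_len:
        raise ValueError("reassembled length does not match h_len")
    return data


frags = fragment(bytes(10000))
print([len(f) for f in frags])
```

Because every frag carries the total h_len, the receiver knows from the first frag how many frags follow.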

Connection establishment

RDS/IB listens for incoming connections on port [RDGS]. RDS uses RDMA private connect parameter data, both when initiating and accepting a connection.

type or size (bits)	name	description
be32	dp_saddr	originating IPv4 address
be32	dp_daddr	destination IPv4 address
8	dp_protocol_major	RDS major version number. Different major versions are incompatible.
8	dp_protocol_minor	RDS minor version number
be16	dp_protocol_minor_mask	indicates which minor versions are supported. bit 0 = "x.0", bit 1 = "x.1" etc.
be32	dp_reserved1	reserved for future use
be64	dp_ack_seq	if reconnecting, sequence of last received message
be32	dp_credit	initial flow control credits, 0 disables FC
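
The private connect data in the table above can be packed as in the sketch below. The function names are hypothetical, and the sketch assumes the fields are laid out back to back in table order with no padding between them.

```python
import socket
import struct

# Table order: saddr, daddr, major, minor, minor_mask, reserved1, ack_seq, credit.
CONNECT_PRIVATE = struct.Struct(">4s4sBBHIQI")


def minor_mask(*minors: int) -> int:
    """Bit n set means minor version "x.n" is supported."""
    mask = 0
    for minor in minors:
        mask |= 1 << minor
    return mask


def pack_connect_private(saddr, daddr, major, minor, supported_minors,
                         ack_seq=0, credit=0) -> bytes:
    """credit=0 disables flow control; ack_seq is used when reconnecting."""
    return CONNECT_PRIVATE.pack(
        socket.inet_aton(saddr), socket.inet_aton(daddr),
        major, minor, minor_mask(*supported_minors),
        0,              # dp_reserved1
        ack_seq, credit)


# Example addresses are from the documentation range 192.0.2.0/24.
blob = pack_connect_private("192.0.2.1", "192.0.2.2", 3, 1, (0, 1), credit=32)
print(len(blob), blob.hex())
```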

Flow Control

Flow control is an optional feature that may be enabled, depending on protocol negotiation. If either end fails to offer initial flow control credits, it is disabled.

Flow control prevents a sending node from queuing messages for transmission that may be dropped by the receiver due to lack of available receive buffers. For protocols such as InfiniBand that implement hardware-level flow control and retry, RDS flow control is not needed. However, other transports such as iWARP do not implement HW flow control, and therefore RDS flow control may be used.

RDS implements a standard credit-based flow control mechanism. Initial credits to send are granted during connection negotiation. The sending node may initiate that many Work Requests (WRs) to the receiving node before it must stop sending. As the receiver handles Work Completion Events and re-posts Receive WRs, it will give the sender more credits piggybacked on outgoing messages. Each side must ensure that it sends more credits to the other end before running out of credits itself.
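
The credit mechanism can be sketched with two small classes, one for each role. The class and method names are hypothetical. The cap of 255 credits per message follows from h_credit being an 8-bit field. How credits are batched is left to the implementation and is not modelled here.

```python
class CreditSender:
    """Sending side: consumes one credit per Work Request posted."""

    def __init__(self, granted_by_peer: int, offered_locally: int):
        # Flow control is disabled if either end offers no initial credits.
        self.enabled = granted_by_peer > 0 and offered_locally > 0
        self.credits = granted_by_peer

    def try_send(self) -> bool:
        """Return True if a WR may be posted now."""
        if not self.enabled:
            return True
        if self.credits == 0:
            return False  # must wait for more credits from the receiver
        self.credits -= 1
        return True

    def on_header(self, h_credit: int) -> None:
        """Add credits piggybacked on an incoming message."""
        self.credits += h_credit


class CreditReceiver:
    """Receiving side: accumulates credits as Receive WRs are re-posted."""

    def __init__(self):
        self.pending = 0

    def on_receive_reposted(self) -> None:
        self.pending += 1

    def take_credits(self) -> int:
        """Value to place in h_credit of the next outgoing message."""
        grant = min(self.pending, 255)  # h_credit is 8 bits wide
        self.pending -= grant
        return grant
```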

RDS/TCP Transport Security Considerations

Since RDS-over-TCP uses TCP/IP as the transport, it is vulnerable to all forms of Internet attack such as those described in [RFC4953]. When RDS is run over Internet paths that have not been secured by other means, the underlying TCP channel used by the RDS connection SHOULD be protected using some form of authentication such as TCP-AO [RFC5925]. As with STT, standard IP security mechanisms such as IPsec encryption can be applied to RDS-over-TCP packets, though the interaction with middle-boxes must be taken into account. When the RDS connection traverses long-haul Internet paths and the underlying transport is disconnected, the client MUST NOT try to re-establish connectivity indefinitely, but should apply exponential back-off to reconnect attempts. The parameters controlling the exponential back-off SHOULD be tunable on the system.
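
One possible shape for a bounded exponential back-off is sketched below. The function name, the parameter names and the default values are all hypothetical. This text does not define the tunables, so they are placeholders for whatever the system exposes.

```python
def reconnect_delays(initial=1.0, factor=2.0, max_delay=60.0, max_attempts=8):
    """Yield the wait before each reconnect attempt, then stop.

    The sequence is finite, so the caller cannot retry indefinitely.
    All four parameters stand in for system tunables (hypothetical).
    """
    delay = initial
    for _ in range(max_attempts):
        yield delay
        delay = min(delay * factor, max_delay)


print(list(reconnect_delays(initial=1, factor=2, max_delay=8, max_attempts=6)))
```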

Multipath RDS (MPRDS)

MPRDS is multipathed RDS, primarily intended for RDS-over-TCP (though the concept can be extended to other transports). The classical implementation of RDS-over-TCP multiplexes multiple PF_RDS sockets between any two endpoints (where endpoint == [IP address, port]) over a single TCP socket between the two IP addresses involved. This funnels multiple RDS flows over a single TCP flow. As a result, the connection

is upper-bounded by the single-flow bandwidth, and
suffers from head-of-line blocking across all the RDS sockets.

Better throughput (for a fixed, small packet size or MTU) can be achieved by having multiple TCP/IP flows per RDS/TCP connection, i.e., multipathed RDS (MPRDS). Each such TCP/IP flow constitutes a path for the RDS/TCP connection. RDS sockets will be attached to a path based on some hash (e.g., of local address and RDS port number), and packets for that RDS socket will be sent over the attached path, using TCP to segment and reassemble RDS datagrams on that path.

Transports may announce themselves as multipath-capable when registering with the RDS core module. When the transport is multipath-capable, the packet egress path in the RDS core module will hash outgoing traffic across multiple paths. The outgoing hash is computed based on the local address and port that the PF_RDS socket is bound to.

Additionally, even if the transport is MP capable, we may be peering with some node that does not support mprds, or supports a different number of paths. As a result, the peering nodes need to agree on the number of paths to be used for the connection. This is done by sending out a control packet exchange before the first data packet. The control packet exchange must have completed prior to outgoing hash completion in the egress path when the transport is multipath capable.

The control packet is an RDS ping packet (i.e., a packet to RDS destination port 0) carrying an RDS extension header of type RDS_EXTHDR_NPATHS, length 2 bytes, whose value is the number of paths supported by the sender. The "probe" ping packet is sent from a reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>). The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately be able to compute min(sender_npaths, receiver_npaths). The pong sent in response to a probe-ping should contain the receiver's npaths when the receiver is mprds-capable.

If the receiver is not mprds-capable, the exthdr in the ping will be ignored. In this case the pong will not have any exthdrs, so the sender of the probe-ping can default to single-path mprds.
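
The path negotiation and path selection described above can be sketched as follows. The function names are hypothetical. CRC32 is used only as a stand-in hash, since the text requires some hash of the local address and port without naming one.

```python
import socket
import struct
import zlib


def negotiated_paths(local_npaths: int, peer_npaths) -> int:
    """Paths to use for a connection.

    peer_npaths is None when the peer's ping or pong carried no NPATHS
    extension header, i.e. the peer is not mprds-capable.
    """
    if peer_npaths is None:
        return 1  # fall back to a single path
    return max(1, min(local_npaths, peer_npaths))


def select_path(local_addr: str, local_port: int, npaths: int) -> int:
    """Map a bound PF_RDS socket to a path index in [0, npaths)."""
    key = socket.inet_aton(local_addr) + struct.pack(">H", local_port)
    return zlib.crc32(key) % npaths  # CRC32 is an illustrative choice only


npaths = negotiated_paths(local_npaths=8, peer_npaths=4)
print(npaths, select_path("192.0.2.1", 4000, npaths))
```

Because the hash depends only on the bound local address and port, all datagrams from one socket stay on one path, which preserves their ordering.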

RDS Generation Number

The RDS transport has to be able to distinguish between two types of failure events:

(a) when the transport fails (e.g., TCP connection reset) but the RDS socket/connection layer on both sides stays the same
(b) when the peer's RDS layer itself resets (e.g., due to module reload or machine reboot at the peer)

In case (a) both sides must reconnect and continue the RDS messaging without any message loss or disruption to the message sequence numbers, and this is achieved by rds_send_path_reset().

In case (b) we should reset all rds_connection state to the new incarnation of the peer. Examples of state that needs to be reset are next expected rx sequence number from, or messages to be retransmitted to, the new incarnation of the peer.

To achieve this, the RDS handshake probe added as part of MPRDS is enhanced so that sender and receiver of the RDS ping-probe will add a generation number as part of the RDS_EXTHDR_GEN_NUM extension header. Each peer stores local and remote generation numbers as part of each rds_connection. Changes in generation number will be detected via incoming handshake probe ping request or response and will allow the receiver to reset rds_connection state.
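
The generation number check can be sketched as below. The class and attribute names are hypothetical and model only the two pieces of state named above (the next expected receive sequence number and the retransmit queue). The sketch assumes the first generation number seen from a peer is simply recorded, without a reset.

```python
class Connection:
    """Per-connection state relevant to generation number handling."""

    def __init__(self, local_gen: int):
        self.local_gen = local_gen
        self.peer_gen = None        # unknown until the first probe is seen
        self.next_rx_seq = 1        # sequence numbers start at 1
        self.retransmit_queue = []

    def on_probe(self, peer_gen: int) -> bool:
        """Handle the GEN_NUM exthdr from a probe ping or pong.

        Returns True if the peer is a new incarnation and state was reset.
        """
        reset = self.peer_gen is not None and peer_gen != self.peer_gen
        if reset:
            # Case (b): the peer's RDS layer restarted, so old state is stale.
            self.next_rx_seq = 1
            self.retransmit_queue.clear()
        self.peer_gen = peer_gen
        return reset
```

A transport-only failure, case (a), leaves the generation number unchanged, so on_probe returns False and sequence state is preserved.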