[rds-devel] What is RDS and why did we build it ?

Richard Frank richard.frank at oracle.com
Thu Nov 15 08:26:11 PST 2007

Reliable Datagram Sockets (RDS) provides in-order, non-duplicating, 
highly available, low-overhead, reliable delivery of datagrams between 
hundreds of thousands of non-connected endpoints.

The design center for RDS is:

-  implement reliable delivery!
-  high scalability: several hundred thousand end points and tens of 
thousands of local processes.
-  hundreds of nodes.
-  does not require large O/S or NIC resources to maintain end points 
(for example, QPs on HCAs).
-  very low operation cost in terms of CPU cycles per send/recv operation.
-  is a socket interface: existing UDP apps usually run on RDS with no 
modifications, or require minimal change.
-  must not introduce latencies when sending to new end points.
-  sends complete immediately from the sender's perspective; completion 
does not imply the send made it to the destination, but the datagram 
will be delivered if the destination exists.
-  extremely low latency, as close to the wire as possible: no 
artificial latencies introduced via flow-control protocols, acking, 
windowing, etc. (i.e., what we do today to make UDP reliable).
-  must deal with a slow receiver process that is not emptying its 
limited-size recv queues, by back-pressuring senders.
-  must provide transparent HA via port failover within and across 
HCAs (NICs).
-  must support local loopback destinations.
-  be extensible to zero-copy send/recv ops and zero-copy RDMA operations.
-  leverage native RNIC (RDMA NIC) capabilities such as IB and iWARP.
-  minimal code base, to enhance supportability.

The usage model:

Connectionless reliable delivery of datagrams:

Using a single local endpoint (socket), an application must be able to 
send and receive reliably delivered datagrams to and from hundreds of 
thousands of end points, with no additional setup (no connecting) for 
destination end points.

Messages must be delivered unless RDS discovers that the destination 
socket does not exist, in which case the message can be tossed. Tossing 
messages is done only on the destination host, if/when it discovers 
that the destination socket (ip:port) does not exist.

Memory model:

Local send memory is unlimited (in theory) from the sender's 
perspective. The sender app does not expect to tune or set up send 
space; that is, the application expects to be able to queue an 
unlimited number of send operations. Realistically, however, O/S-level 
send space limitations (so_sndbuf) on sockets may result in ENOBUFS 
back pressure on senders. In the exhaustion case, the sender will wait 
for send buffer space via POLLOUT.

Receive-side memory is tuned by the application via so_rcvbuf. In 
essence, the application tunes receive memory to implement an elastic 
store of incoming messages while it is doing other work. If/when the 
recv memory limit for an endpoint is reached (the end point is 
congested), then RDS must back-pressure senders to that destination, 
without blocking sends to non-congested destinations.

High Availability:

The RDS implementation must provide no-loss, transparent failover 
across ports within and across multiple NICs for reliable datagrams.

Message Sizes:

From 0 bytes (yes, zero-byte messages) to 1 MB, assuming the socket's 
so_sndbuf and so_rcvbuf are set up accordingly.

So why did we build RDS?

Oracle provides a reliable datagram IPC to its internal clients. 
Currently, for portability and scalability, we use UDP and implement 
the reliability in user mode; that is, we implement acking, windowing, 
fragmenting, re-ordering, etc., to provide a reliable IPC to our 
internal clients.

Our IPC over UDP works, but we run into stability issues under heavy 
load because the reliability depends on user-mode processes 
implementing retransmit timers, etc., which are subject to skew under 
heavy CPU loading. The result is lost datagrams (the recv queue could 
not be serviced, so IP dropped the datagram), causing retransmits, 
which in turn get dropped, and so on.

While our UDP IPC hangs together, deterministic low-latency sends do 
not happen: 1) the user-mode process on the remote end must ack the 
send, and 2) these acks can get lost, requiring retransmits, etc.

The net is that during crunch time - you know, end-of-month processing 
- what should be a six-hour job running on an n-node cluster, consuming 
100% CPU on all nodes, could and would stretch to 8 or 10 or more 
hours, and sometimes our IPC logical connections would break due to 
timing out messages that virtually never arrived.

Now, from a pure performance perspective, some of our internal IPC 
clients are extremely sensitive to send latency. Not requiring 
user-mode acks (or driver acks, for that matter) significantly improves 
real-work performance for Oracle. Several Oracle customers have shown 
large gains with RDS and IB.

Basically, we wanted to build an IPC that matches the Oracle IPC model: 
1) allowing us to reduce our IPC library to a small pass-through to an 
O/S IPC (improved supportability), and 2) knowing that a driver would 
have more control and more deterministic processing to deal with 
reliable delivery of datagrams.

We investigated user-mode IPCs for two years - both uDAPL and iTAPI - 
which, while interesting, turned out to have many other issues, 
including:

1) very high resource requirements on NICs (hundreds of thousands of QPs).
2) connection orientation, with very slow connection setup - rates of a 
few thousand per second versus tens of thousands per second.
3) a requirement that all memory (gigabytes) be pre-registered, plus 
very slow registration interfaces that precluded dynamic registration. 
Again, registration required significant NIC resources.
4) problematic cleanup of registered memory in process-death scenarios 
(we could not get it stable).
5) a cost for zero-copy operations - even with static pre-registration 
- approximately equal to the CPU utilized by an 8k buffer copy.

Since most Oracle transfers that could leverage zero copy were 8k and 
below - along with the issues stated above, and the large new IPC 
libraries needed to support them - we opted out.

The combination of requirements listed above was not achievable with 
the available connection-oriented or non-connection-oriented protocols 
and interfaces - especially with respect to the desire to ultimately 
have zero-copy operations and extremely low-overhead, low-latency send 
operations, coupled with our receive-side elastic-store / back-pressure 
model.

While IPoIB showed improved throughput, the cost in terms of increased 
CPU per operation - and the fact that we still had to run our user-mode 
reliability layer - leaves lots of room for improvement on the table.

Hence our original proposal for a reliable datagram socket was born - 
which I have attached to this email.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Proposal_for_a_Reliable_Datagram_Socket_Interface.doc
Type: application/vnd.ms-word
Size: 51712 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/rds-devel/attachments/20071115/9b3e8d6c/Proposal_for_a_Reliable_Datagram_Socket_Interface-0001.bin
