[rds-devel] What is RDS and why did we build it ?
richard.frank at oracle.com
Thu Nov 15 08:26:11 PST 2007
Reliable Datagram Sockets (RDS) provide in-order, non-duplicating,
highly available, low-overhead, reliable delivery of datagrams between
hundreds of thousands of non-connected endpoints.
Furthermore, the design center for RDS is:
- implement reliable delivery!
- high scalability - several hundred thousand endpoints and tens of
thousands of local processes.
- hundreds of nodes.
- does not require large O/S or NIC resources to maintain endpoints
(for example, QPs on HCAs).
- very low operation cost in terms of CPU cycles per send / recv operation.
- is a socket interface - existing UDP apps usually run on RDS with no
modifications, or require minimal changes.
- must not introduce latencies when sending to new end points.
- sends complete immediately from the sender's perspective - this does
not imply the send has reached the destination - but it will
be delivered if the destination exists.
- extremely low latency - as close to the wire as possible - no
artificial latencies introduced via flow control
protocols, acking, windowing, etc - e.g. what we do to make UDP reliable.
- must deal with a slow receiver process not emptying its limited-size
recv queues - by back-pressuring senders.
- must provide transparent HA for inter and intra HCA (NIC) port failover.
- must support local loopback destinations.
- be extensible for zero-copy send / recv ops and zero-copy RDMA operations.
- leverage native RNIC (RDMA NIC) capabilities such as IB and iWARP.
- minimal code base to enhance supportability.
The usage model:
Connectionless reliable delivery of datagrams:
Using a single local endpoint (socket), an application must be able to
send and recv reliably delivered datagrams to hundreds of thousands of
endpoints with no additional setup (connecting) for destination endpoints.
Messages must be delivered unless RDS discovers that the destination
socket does not exist - in which case the message can be tossed. Tossing
messages is only done on the destination host - if / when it discovers
that the destination socket (ip:port) does not exist.
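As a rough sketch of this usage model - the call pattern is just the plain
datagram socket API. The example below uses UDP (AF_INET / SOCK_DGRAM) on
loopback as a stand-in, since RDS presents the same sendto()/recv()
interface; the three receivers and payloads are made up for illustration:

```python
import socket

# Three receivers stand in for many remote endpoints; in the RDS
# model these could be hundreds of thousands of ip:port pairs.
receivers = []
for _ in range(3):
    r = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    r.bind(("127.0.0.1", 0))              # kernel picks a free port
    receivers.append(r)

# A single sender socket: sendto() names the destination on every
# call, so no per-destination connect() or setup is required.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for i, r in enumerate(receivers):
    sender.sendto(b"msg-%d" % i, r.getsockname())

messages = [r.recv(64) for r in receivers]
print(messages)

sender.close()
for r in receivers:
    r.close()
```

With RDS the same pattern holds, except the kernel - not the application -
guarantees the datagrams arrive in order and exactly once.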
Local send memory is unlimited (in theory) from the sender's
perspective. The sender app does not expect to tune / set up send space.
What this says is that the application expects to be able to queue an
unlimited number of send operations. However, realistically, O/S-level
send space limitations (so_sndbuf) on sockets may result in enobuf back
pressure on senders. In the exhaustion case, the sender will wait for
send buffer space via POLLOUT.
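A minimal sketch of that sender-side back pressure, again using a plain
UDP socket as a stand-in (RDS would surface the same errno-then-poll
pattern); `send_with_backpressure` is a hypothetical helper name, not an
RDS API:

```python
import errno
import select
import socket

def send_with_backpressure(sock, data, dest, timeout=5.0):
    """Queue a datagram; if the kernel reports send-buffer
    exhaustion (ENOBUFS / EAGAIN), wait for writability and retry."""
    while True:
        try:
            return sock.sendto(data, dest)
        except OSError as exc:
            if exc.errno not in (errno.ENOBUFS, errno.EAGAIN):
                raise
            # Back pressure: block until send buffer space frees up
            # (the "wait for send buffer space via POLLOUT" case).
            _, writable, _ = select.select([], [sock], [], timeout)
            if not writable:
                raise TimeoutError("no send space within timeout")

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
sent = send_with_backpressure(sender, b"ping", receiver.getsockname())
print(sent)   # number of bytes queued locally
```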
Receive side memory is tuned by the application - via so_rcvbuf. In
essence the application tunes receive memory to implement an elastic
store of incoming messages while it is doing other work. If / when the
recv memory limit for an endpoint is reached (the endpoint is
congested), then RDS must back-pressure senders to that destination -
without blocking sends to non-congested destinations.
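For instance, sizing that elastic store might look like the following
sketch (the 64 KB figure is an arbitrary example; on Linux the kernel
doubles the requested value to cover bookkeeping overhead):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# The application sizes its elastic store of incoming datagrams by
# tuning SO_RCVBUF; once this fills, the transport must push back on
# senders rather than silently drop messages.
REQUESTED = 64 * 1024
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED)

# getsockopt reports the effective size (on Linux, the doubled
# bookkeeping-inclusive value, capped by net.core.rmem_max).
effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(effective >= REQUESTED)
```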
The RDS implementation must provide no-loss transparent failover across
ports within and across multiple NICs for reliable datagrams.
Messages range from 0 bytes (yes, zero-byte messages) to 1 MB in size,
assuming socket so_sndbuf and so_rcvbuf are set up accordingly.
So why did we build RDS ?
Oracle provides a reliable datagram IPC to its internal clients.
Currently, for portability and scalability - we use UDP and implement
the reliability in user mode - that is - we implement acking /
windowing / fragmenting / re-ordering, etc to provide a reliable IPC to
our internal clients.
Our IPC over UDP works, but we run into stability issues under heavy load,
as the reliability is based on user-mode processes implementing
retransmit timers, etc. - which are subject to skew under heavy CPU
load - resulting in lost datagrams (IP dropped the datagram because the
recv q could not be serviced) - causing retransmits - which get dropped, etc.
While our UDP IPC hangs together, deterministic low-latency sends do
not happen: 1) the user-mode process on the remote end must ack the send,
and 2) these acks can get lost, requiring retransmits, etc.
The net is that during crunch time - you know, end-of-month processing -
what should be a six-hour job running on an n-node cluster - consuming
100% CPU on all nodes - could and would stretch to 8 or 10 or more hours,
and sometimes our IPC logical connections would break due to timing out
messages that virtually never arrived.
Now from a pure performance perspective - some of our internal IPC
clients are extremely sensitive to send latency. Not requiring user-mode
acks (or a driver, for that matter) significantly improves real-work
performance for Oracle. Several Oracle customers have shown large gains
with RDS and IB.
Basically - we wanted to build an IPC that matches the Oracle IPC model
- 1) allowing us to reduce our IPC library to a small pass-thru to an O/S
IPC (improved supportability), and 2) knowing that a driver would have more
control / more deterministic processing to deal with reliable delivery of
datagrams.
We investigated user mode IPCs (two years) - both uDAPL and iTAPI -
which while interesting turned out to have many other issues including:
1) very high resource requirements on NICS (hundreds of thousands of QPs).
2) were connection oriented - and had very slow connection setup
latencies - rates of a few thousand per second - vs - tens of thousands
per second.
3) required all memory (gigs) to be pre-registered / had very slow
registration interfaces, precluding dynamic registration. Again,
registration required significant NIC resources.
4) clean up of registered memory was problematic given process death
scenarios (could not get it stable).
5) the cost of performing zero-copy operations - even with static
pre-registration - was approximately equal to the CPU utilized by an
8 KB buffer copy.
Since most Oracle transfers (which could have leveraged zero copy) were
8 KB and below - along with the issues stated above - and the large new
IPC libraries needed to support them - we opted out.
The combination of requirements listed above was not achievable with
connection-oriented or non-connection-oriented protocols / interfaces.
Especially wrt the desire to ultimately have zero-copy operations,
extremely low-overhead and low-latency send operations, coupled with our
receive-side elastic store / back pressure model.
While IPoIB showed improved throughput - the cost in terms of increased
CPU per operation, and the fact that we still had to run our user-mode
reliability layer, leaves lots of room for improvement on the table.
Hence our original proposal for a reliable datagram socket was born -
which I have attached to this email.
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 51712 bytes
Desc: not available
URL: http://oss.oracle.com/pipermail/rds-devel/attachments/20071115/9b3e8d6c/Proposal_for_a_Reliable_Datagram_Socket_Interface-0001.bin