[rds-devel] waking rds poll() sleepers on congestion notification

Zach Brown zach.brown at oracle.com
Thu May 17 12:07:06 PDT 2007


Hi All.  Rick asked that I look into what it would take to implement the
notion of waking poll() when a remote peer sees congestion lift.  I have
some hopes of moving towards the list and away from unarchived personal
emails, so I'm sending it here.  *crosses fingers*.

rds_poll() always returns POLLOUT.  If you try to send and get
EWOULDBLOCK because the remote receiver is congested then you don't get
to wait for POLLOUT to be raised before sending again -- it's always
raised.  This is worked around by implementing some kind of exponential
back-off while retrying the send.  This can be done by increasing the
timeout given to poll().  At each poll() timeout expiry the send is
tried again.
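
For concreteness, here is a rough userspace sketch of that back-off workaround. It is not taken from any real RDS application. The names next_backoff_ms() and send_with_backoff() and their parameters are made up for illustration, and it assumes a socket on which plain send() is valid; a real RDS sender would more likely pass a destination address through sendto() or sendmsg(). poll() is called with an empty event mask so that it acts purely as a timed sleep, since asking for POLLOUT would return immediately.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: double the timeout, saturating at max_ms. */
int next_backoff_ms(int cur_ms, int max_ms)
{
	if (cur_ms >= max_ms / 2)
		return max_ms;
	return cur_ms * 2;
}

/*
 * Hypothetical helper: retry a non-blocking send, sleeping in poll()
 * between attempts with an exponentially growing timeout.  Returns
 * the send() result, or -1 with errno set to EWOULDBLOCK once
 * max_tries attempts have all hit congestion.
 */
ssize_t send_with_backoff(int fd, const void *buf, size_t len,
			  int first_ms, int max_ms, int max_tries)
{
	/* No events requested: POLLOUT is always raised, so asking
	 * for it would never sleep.  Only the timeout matters here. */
	struct pollfd pfd = { .fd = fd, .events = 0 };
	int timeout_ms = first_ms;

	for (int i = 0; i < max_tries; i++) {
		ssize_t ret = send(fd, buf, len, MSG_DONTWAIT);

		/* Success, or an error other than "would block". */
		if (ret >= 0 || (errno != EWOULDBLOCK && errno != EAGAIN))
			return ret;

		/* Congested: sleep, then retry with a longer timeout. */
		poll(&pfd, 1, timeout_ms);
		timeout_ms = next_backoff_ms(timeout_ms, max_ms);
	}

	errno = EWOULDBLOCK;
	return -1;
}
```

With first_ms = 1 and max_ms = 64, a sender that stays congested sleeps for 1, 2, 4, ... up to 64 ms between attempts.  The sleep that is in progress when congestion lifts is the latency in question.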

This idea of waking poll() waiters on congestion notification hopes to
cut down on the latency between when the send could succeed and when the
poll() timeout hits and the send is retried.  If we receive a congestion
bitmap update from the remote node we wake poll() waiters, giving them a
chance to retry their send.

I'll first describe the simplest possible way to implement this.  It'd
give us something to profile and improve upon, or discard in disgust.
This is only feasible if congestion updates are infrequent, as they
should be.

rds_poll() doesn't know if the caller is using a timeout.  It doesn't
know if the poll() caller has a send that has experienced congestion.
All it knows is that a task *might* sleep in poll() with this socket in
a pollfd.  It doesn't even know the mask that applies to this socket.

The simplest mechanism works with this near-total lack of context and
wakes all tasks in poll() that include an rds socket when any congestion
notification update comes in.  The simplest implementation of this would
be about four lines of code, in three steps:

	declare a global wait_queue_head_t, rds_poll_waitqueue, above rds_poll()
	call poll_wait(file, &rds_poll_waitqueue, wait); in rds_poll()
	wake rds_poll_waitqueue, if active, in rds_cong_map_updated()

I'd be tempted to leave it at that to start.  It could be made more
clever if it is shown to be a problem in the future.

- z