[rds-devel] RDS IB transport software flow control?

Olaf Kirch olaf.kirch at oracle.com
Wed Nov 7 01:05:37 PST 2007


On Tuesday 06 November 2007 19:26, Richard Frank wrote:
> Memory consumption is bounded by so_sndbuf + so_rcvbuf * number of 
> active endpoints.

Mostly, yes. Except for two problems.

First, if you just use so_rcvbuf to bound your memory consumption,
every process has to be given max_amount_of_memory / num_processes,
which is perfectly fair but doesn't allow for peaks in network traffic.
This is why TCP has those tcp_rmem tunables, which provide a global
upper bound across all sockets. One of my recent patches introduced
a global RX limit very similar to that.
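As a rough illustration of the idea (not the actual patch - the names
and the 16 MB number below are made up), the accounting boils down to
something like this:

/* Hypothetical sketch of two-level receive accounting: every socket is
 * bounded by its own so_rcvbuf, and on top of that one global counter
 * caps the total memory all RDS sockets may pin for RX.  Names and
 * policy are illustrative only.
 */
#include <stdbool.h>
#include <stddef.h>

struct rds_sock_sketch {
	size_t rcvbuf;		/* per-socket limit (so_rcvbuf) */
	size_t rcv_queued;	/* bytes currently queued on this socket */
};

static size_t rds_rx_total;				/* bytes queued across all sockets */
static const size_t rds_rx_global_limit = 16 << 20;	/* e.g. 16 MB, tunable */

/* Returns true if the incoming message may be queued, false if the
 * socket should be marked congested (or the message dropped). */
static bool rds_rx_charge(struct rds_sock_sketch *rs, size_t bytes)
{
	if (rs->rcv_queued + bytes > rs->rcvbuf)
		return false;			/* per-socket bound hit */
	if (rds_rx_total + bytes > rds_rx_global_limit)
		return false;			/* global bound hit */
	rs->rcv_queued += bytes;
	rds_rx_total += bytes;
	return true;
}

static void rds_rx_uncharge(struct rds_sock_sketch *rs, size_t bytes)
{
	rs->rcv_queued -= bytes;
	rds_rx_total -= bytes;
}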

Second, there's a delay in there:

 -	several peers sending to the same socket
 -	recvbuf fills up, we flag congestion
 -	peers keep sending
 -	congestion map is transmitted
 -	peers receive congestion map
 -	peers do not allow apps to queue more packets
	for the congested destination, but drain their
	send queues

So you have a delay of several milliseconds during which clients
can keep sending - and the receiver keeps allocating recvbufs for
those.
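To put a rough (purely illustrative) number on it: an IB sender can
push on the order of 1 GB/s, so a 2-3 ms window until the congestion
map takes effect lets each sender land another 2-3 MB of recvbufs on
the receiver, on top of what so_rcvbuf was supposed to bound.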

On the other hand, it's hard to think of something better.
Or Gerlitz referred to the credit-based system of SDP, but SDP
is a stream protocol where you have one pipe with a single
sender/consumer at each end.

With RDS, the prevalent mode of operation is the disconnected
socket, where you have no idea who is going to send how much to
whom, so it's rather difficult to assign credit in advance
without doing something silly like so_rcvbuf / number_of_ib_connections.
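Just to show how silly that split gets (numbers invented for the sake
of the example): with a 256 KB so_rcvbuf and 128 IB connections, every
peer gets a standing credit of 2 KB - barely one datagram - even
though at any given moment most of those peers aren't sending at all.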

It's probably worth thinking about a way that allows the
receiver to drop packets that cannot be delivered because
the queue is full. This would require some sort of NACK packet
that tells the peer that some particular packet was not accepted
and needs to be postponed/retransmitted (see the sketch after
this list):

 -	When the receiver drops a packet because the queue is
	full, it immediately sends a NACK that contains the
	sequence number of the packet that was dropped.

	To be examined - is there any situation where a NACK can
	get lost, e.g. during HCA failover?

 -	When the sender receives a NACK, it puts the rejected
	message (and any other messages for that port) onto 
	a special queue (conn->c_cong_queue), and marks the
	destination port in the congestion map.

 -	The sender keeps transmitting messages off its queue
	as usual.

	When the application tries to send to a destination
	that is marked as congested, we could queue the message
	on the c_cong_queue rather than returning an error.
	This would take care of all the hassles of having to
	deal with congestion in user space at all.

 -	When the application on the destination host is ready
	to take a few messages off its receive queue, the port
	is marked uncongested in the map, and a map update is
	queued to all connections.

 -	When the sender receives the map update, it checks its
	queue of rejected messages (c_cong_queue) and retransmits
	those for which the port is no longer marked as congested.
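Here is a rough sketch, in plain C, of what the sender side could do
when it gets such a NACK. Everything in it (the structs, the
port_congested array, the queue handling) is a made-up illustration of
the idea above, not existing RDS code, and message ordering on the
congestion queue is glossed over:

/* Illustrative sketch of the proposed sender-side NACK handling.
 * All types and helpers here are stand-ins, not actual RDS code.
 */
#include <stdbool.h>
#include <stdint.h>

struct rds_message_sketch {
	uint64_t seq;			/* sequence number of the message */
	uint16_t dport;			/* destination port */
	struct rds_message_sketch *next;
};

struct rds_conn_sketch {
	struct rds_message_sketch *retrans_queue; /* sent, not yet acked */
	struct rds_message_sketch *c_cong_queue;  /* rejected, waiting for uncongest */
	bool port_congested[65536];		  /* local view of the peer's congestion map */
};

/* Peer told us it dropped message <seq>: park it, and anything else
 * queued for the same port, on c_cong_queue, and remember that the
 * port is congested so we stop sending to it. */
static void rds_handle_nack(struct rds_conn_sketch *conn, uint64_t seq)
{
	struct rds_message_sketch **pp = &conn->retrans_queue;
	uint16_t dport = 0;
	bool found = false;

	/* First pass: find the rejected message to learn its port. */
	for (struct rds_message_sketch *m = conn->retrans_queue; m; m = m->next) {
		if (m->seq == seq) {
			dport = m->dport;
			found = true;
			break;
		}
	}
	if (!found)
		return;		/* already acked, nothing to do */

	conn->port_congested[dport] = true;

	/* Second pass: move every queued message for that port
	 * (ordering not preserved in this sketch). */
	while (*pp) {
		struct rds_message_sketch *m = *pp;
		if (m->dport == dport) {
			*pp = m->next;
			m->next = conn->c_cong_queue;
			conn->c_cong_queue = m;
		} else {
			pp = &m->next;
		}
	}
}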

In its entirety, this certainly isn't something we should introduce
for OFED 1.2.5.x, but let's think about it for 1.3.

However, it may be worth at least implementing the c_cong_queue,
and using it to queue messages an application may send to a congested
destination. Right now, we have this very bad interaction between
congestion and poll - any congestion map update will essentially
wake up all processes waiting in poll. Thundering herd all over...

If you do not expose the fact that a given destination is congested
to user space, this problem just vanishes. The application will happily
queue up messages, and if the send queue is full, it stops - period.
When a destination becomes uncongested, the kernel will move these
messages to the send queue, and as ACKs start coming back, the
application is woken up to send more.
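For completeness, a similarly hand-wavy sketch of the send path and
the map-update path, reusing the made-up rds_conn_sketch /
rds_message_sketch types from the NACK snippet above; again, none of
these names are real RDS code:

/* Sketch only: what the send path and the congestion-map update could
 * do if congestion is kept entirely inside the kernel. */

/* Append to a singly linked queue, keeping order. */
static void queue_tail(struct rds_message_sketch **q, struct rds_message_sketch *m)
{
	m->next = NULL;
	while (*q)
		q = &(*q)->next;
	*q = m;
}

/* Send path: if the destination port is congested, park the message on
 * c_cong_queue instead of failing the send or telling user space
 * anything.  The application only ever blocks on its own send queue. */
static void rds_sendmsg_sketch(struct rds_conn_sketch *conn,
			       struct rds_message_sketch *m,
			       struct rds_message_sketch **send_queue)
{
	if (conn->port_congested[m->dport])
		queue_tail(&conn->c_cong_queue, m);
	else
		queue_tail(send_queue, m);
}

/* Congestion map update: for every port that just became uncongested,
 * move its parked messages back onto the send queue for retransmission. */
static void rds_cong_update_sketch(struct rds_conn_sketch *conn,
				   const bool *new_map,
				   struct rds_message_sketch **send_queue)
{
	struct rds_message_sketch **pp = &conn->c_cong_queue;

	for (int port = 0; port < 65536; port++)
		conn->port_congested[port] = new_map[port];

	while (*pp) {
		struct rds_message_sketch *m = *pp;
		if (!conn->port_congested[m->dport]) {
			*pp = m->next;
			queue_tail(send_queue, m);
		} else {
			pp = &m->next;
		}
	}
}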

And it makes for a much saner poll implementation too :)

Any comments?

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir at lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax
