[rds-devel] [PATCH 2/2] RDS/IB: add connection level flow control

Olaf Kirch olaf.kirch at oracle.com
Wed Nov 14 06:04:20 PST 2007


On Wednesday 14 November 2007 11:23, Or Gerlitz wrote:
> I think this flow can be simplified to have the --receiver-- send a credit
> update packet when it sees that the credits at the sender are below some
> low-water mark.

It works both ways. The sender requests an ACK packet when its credits
run too low. The receiver will also schedule an ACK when the sender has
consumed half its window.
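That credit exchange can be sketched roughly as below. The struct, the
function names, and the low-water constant are all invented for
illustration; they are not the actual RDS/IB symbols:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-connection credit state; names are illustrative,
 * not the real RDS/IB data structures. */
struct credit_state {
	int	send_credits;	/* receive buffers still posted at the peer */
	int	peer_window;	/* total credits the peer advertised */
	int	unacked;	/* fragments received since our last update */
};

#define CREDIT_LOW_WATER	2	/* made-up threshold */

/* Sender side: consume one credit per fragment sent, and request an
 * ACK (which carries fresh credits back) once we hit the low-water
 * mark. Returns true if the ACK-requested flag should be set. */
static bool send_consume_credit(struct credit_state *cs)
{
	assert(cs->send_credits > 0);
	cs->send_credits--;
	return cs->send_credits <= CREDIT_LOW_WATER;
}

/* Receiver side: schedule a credit update once the sender has eaten
 * half its window, even if it did not explicitly ask for one. */
static bool recv_should_send_update(struct credit_state *cs)
{
	cs->unacked++;
	return cs->unacked >= cs->peer_window / 2;
}
```

The two triggers are deliberately redundant: either side noticing the
window running down is enough to get a credit update on the wire.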

> Also: with this approach, can we remove the ACK packets from RDS/IB
> altogether? Since you know what the last sequence number was, and the
> sequence number is incremental, IB RC in-order delivery guarantees that
> all the packets (RDS fragments) up to this number are in memory.

Yes, that's certainly a possibility.

However, after experimenting with this code some more, I have some doubts.

 a)	with a credit-based scheme, we will never be able to use
	a shared receive queue (receive buffers are shared across
	connections, so there is no way to know in advance how to
	allocate credits to each peer)

 b)	We're currently making heavy use of ACK packets, which complicates
	matters quite a bit - the whole send path is serialized using the
	c_send_sem semaphore, except for ACKs, so additional locking is
	required for them.

	Also, right now we always send an ACK when we think we need to.
	With flow control, you can't do that anymore: an ACK is itself a
	send, and may have to wait until credits become available.

 c)	The patch as it is now does not completely prevent situations
	where both sides are stalled after using up their credits.

> Second point, I have failed so far to fully understand the RDS congestion
> management; however, is it possible that with the flow control protocol
> there's no need to have a congestion control wire protocol, and it can be
> based on the information present in the credit management?

No. The point of RDS is you have *one* reliable connection (TCP, IB RC)
connecting two hosts (or rather, two IP addresses A and B). This single
connection carries all traffic from any port X at address A to any
port at address B.

We do not want to throttle the entire connection (i.e. all traffic flowing
from A -> B) just because a single port on B is not serviced quickly
enough by the application listening on it.

So when the application owning port X on address B is slow in picking up
the packets on its receive queue, we actually need to throttle the sender(s)
on all the sending hosts who want to send to that port.

This is what the congestion map is for. It's a bit map that the receiver
maintains, with one bit for each port. We have one local congestion map
reflecting the state of all local ports, and for each connection to a remote
IP, we have a second congestion map reflecting the state of the remote
ports.
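In outline, such a per-port bit map might look like the following; the
names and the fixed port count are illustrative, not the actual RDS
structures:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative congestion map: one bit per 16-bit port number,
 * i.e. 65536 bits = 8 KB per map. */
#define NPORTS	65536
#define NBITS	(sizeof(unsigned long) * CHAR_BIT)

struct cong_map {
	unsigned long bits[NPORTS / NBITS];
};

/* Mark a port congested in the map. */
static void cong_set_bit(struct cong_map *map, unsigned int port)
{
	map->bits[port / NBITS] |= 1UL << (port % NBITS);
}

/* Mark a port uncongested again. */
static void cong_clear_bit(struct cong_map *map, unsigned int port)
{
	map->bits[port / NBITS] &= ~(1UL << (port % NBITS));
}

/* Check a port's congestion bit, e.g. from the sendmsg path. */
static bool cong_test_bit(const struct cong_map *map, unsigned int port)
{
	return map->bits[port / NBITS] & (1UL << (port % NBITS));
}
```

Keeping the map this small is what makes it cheap to retransmit the
whole thing to every active peer whenever a bit changes.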

When we receive a packet and find that this pushes the number of messages
queued to port X over the SO_RCVBUF limit, we set the corresponding bit
in the local congestion map, and transmit the updated map across all
connections currently active.

Likewise, when the application pulls messages from the receive queue
and we find that the memory charged to that receive queue drops below
SO_RCVBUF, the congestion map bit corresponding to the port is
cleared, and another update is sent.

The congestion map is checked in sendmsg, prior to queueing new
messages.
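The set/clear logic around the SO_RCVBUF threshold, and the check on
the send side, could be sketched like this (all names are invented for
the sketch; the real code charges memory to the socket rather than
keeping a per-port struct):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative per-port receive accounting: a port becomes congested
 * when queued bytes exceed its SO_RCVBUF limit, and uncongested again
 * when the application drains below it. */
struct port_rcv {
	unsigned int	queued;		/* bytes waiting on the receive queue */
	unsigned int	rcvbuf;		/* SO_RCVBUF limit of the socket */
	bool		congested;	/* mirrors this port's map bit */
};

/* Called on packet arrival. Returns true if the congestion map bit
 * must be set and the updated map broadcast to all peers. */
static bool port_charge(struct port_rcv *p, unsigned int len)
{
	p->queued += len;
	if (!p->congested && p->queued > p->rcvbuf) {
		p->congested = true;
		return true;
	}
	return false;
}

/* Called when the application dequeues a message. Returns true if the
 * bit must be cleared and another map update sent. */
static bool port_uncharge(struct port_rcv *p, unsigned int len)
{
	p->queued -= len;
	if (p->congested && p->queued < p->rcvbuf) {
		p->congested = false;
		return true;
	}
	return false;
}

/* sendmsg-side check against the remote congestion map: refuse to
 * queue new messages for a congested destination port. */
static bool port_may_send(const struct port_rcv *remote)
{
	return !remote->congested;
}
```

Note that only transitions across the threshold trigger a map update,
so a steadily full or steadily empty queue generates no extra traffic.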

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir at lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax


