[rds-devel] Re: RDS IB transport software flow control?

Tue Nov 6 08:25:41 PST 2007

Or Gerlitz wrote:
> Richard Frank wrote:
>> Yes - it relies on IB RNRs - reasoning was simply to keep the RDS 
>> wire protocol to a minimum.
>
> Thinking on the matter, on the one hand the current RDS/IB code does 
> not have software flow control but it does do explicit ACK-ing on each 
> message.
>
Just to be clear - RDS does not explicitly ack individual socket sends. 
It periodically (when requested) sends back an RC level high water mark 
ack covering all sends that have arrived over that RC. The reason for 
the ack is to enable replay of sends when paths fail. In practice we 
issue the ack to 1) free send side resources and 2) reduce the amount 
of replay after a path failure. The send side requests the ack when it 
crosses a threshold of send side resource consumption.

So we do have RC level flow control in RDS - but it is driven by send 
side resource consumption, not by recv side buffer management, which is 
what you are proposing.
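
To make the high water mark ack concrete, here is a minimal sketch in 
Python. It is a toy model, not the RDS code: the class names, the 
sequence numbering and the ack_threshold parameter are all assumptions 
made for illustration.

```python
from collections import OrderedDict


class Sender:
    """Send side of one RC connection (hypothetical model, not RDS code)."""

    def __init__(self, ack_threshold):
        # Assumed knob: number of unacked sends that triggers an ack request.
        self.ack_threshold = ack_threshold
        self.next_seq = 1
        self.unacked = OrderedDict()  # seq -> payload, retained for replay

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        # Request an ack only once send side resource use crosses the threshold.
        ack_requested = len(self.unacked) >= self.ack_threshold
        return seq, payload, ack_requested

    def on_ack(self, high_water_mark):
        # One ack frees every send at or below the high water mark.
        for seq in [s for s in self.unacked if s <= high_water_mark]:
            del self.unacked[seq]

    def replay(self):
        # After a path failure, only sends not yet covered by an ack are replayed.
        return list(self.unacked.items())


class Receiver:
    """Recv side: tracks the highest sequence number that has arrived."""

    def __init__(self):
        self.high_water_mark = 0

    def on_message(self, seq, payload, ack_requested):
        self.high_water_mark = max(self.high_water_mark, seq)
        # Ack only when the sender asked for one.
        return self.high_water_mark if ack_requested else None


sender, receiver = Sender(ack_threshold=3), Receiver()
for data in ("a", "b", "c", "d"):
    ack = receiver.on_message(*sender.send(data))
    if ack is not None:
        sender.on_ack(ack)

print(sender.replay())  # sends that would be replayed after a path failure
```

In this run the third send crosses the threshold, a single ack for 
sequence 3 frees the first three sends, and only the fourth would be 
replayed if the path failed at that point.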

Our design principle for RDS has been to keep it simple - a minimalist 
approach - the idea being - that less code is good from a 
maintainability perspective.

We do leverage IB hardware flow control for RDS - which seems to work 
well: 1) in practice (real application load) we do not see RNRs, so 
RNRs are not swamping the network; 2) when RNRs do occur (test driven 
loads), the hardware flow control coupled with the driver reposting of 
recv buffers is very efficient.

A couple of additional optimizations for our existing flow control would 
be to 1) add SRQ support - this will reduce the possibility of RNRs, and 
2) use an RDMA write instead of a message for the ack, which removes the 
requirement of having a recv buffer posted to handle the ack.

Perhaps we will need recv side flow control - if/when we find that the 
IB hardware flow control becomes an issue - maybe that's just around the 
corner. We'd like to see some data showing this is a real problem.

> It might be a win/win to have a software flow control protocol which 
> indirectly provides ACK-ing functionality: since IB RC keeps the order 
> of messages, if you know how many credits you have on the remote side, 
> you know which was the last message it accepted, and the credit update 
> is sent after the message is placed into memory. The flow control info 
> can be piggybacked on messages going the other direction and, below 
> some water mark, we would need a "credit-update" message. It's not so 
> complex; the SDP flow control defined in the IB spec can be used as a 
> starting point.
>
> The win/win here comes from not doing ACK on every packet as is being 
> done now (and I hear that ways are being searched re how to optimize 
> this, eg with RDMA, etc), not doing ACK for a protocol based on IB RC, 
> something which creates doubts and confusion for users / reviewers / 
> implementors and, more importantly, not loading the IB network with RNR 
> NAK messages where you have for some reason an imbalanced setup and 
> not having the HCA arm and act on RNR timers, etc.
>
> Or.
>
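
For reference, the credit scheme proposed in the quoted text could look 
roughly like the sketch below. It is a toy model in Python, not the SDP 
flow control from the IB spec and not RDS code: the class names, the 
low_water_mark parameter and the confirmed counter are assumptions made 
for illustration.

```python
class CreditSender:
    """Sends only while it holds credits for recv buffers posted on the peer."""

    def __init__(self, initial_credits):
        self.credits = initial_credits
        self.confirmed = 0  # messages known to have been placed on the peer

    def can_send(self):
        return self.credits > 0

    def send(self, payload):
        if not self.can_send():
            raise RuntimeError("no credits: peer has no recv buffer for this send")
        self.credits -= 1
        return payload

    def on_credit_update(self, granted):
        # RC keeps messages in order and credits are granted only after
        # placement, so a grant of N also confirms the next N messages.
        self.credits += granted
        self.confirmed += granted


class CreditReceiver:
    """Tracks reposted buffers and decides when to advertise them."""

    def __init__(self, buffers, low_water_mark):
        self.low_water_mark = low_water_mark
        self.remote_credits = buffers  # credits the sender still holds, seen from here
        self.pending_grant = 0         # reposted buffers not yet advertised

    def on_message(self):
        # A message landed in a posted buffer; the buffer is reposted
        # but the sender does not know about it yet.
        self.remote_credits -= 1
        self.pending_grant += 1

    def take_grant(self):
        # Called when building any message going the other direction
        # (piggyback) or an explicit credit-update message.
        granted, self.pending_grant = self.pending_grant, 0
        self.remote_credits += granted
        return granted

    def needs_explicit_update(self):
        # Below the water mark with nothing to piggyback on, an explicit
        # "credit-update" message is needed to keep the sender going.
        return self.remote_credits < self.low_water_mark and self.pending_grant > 0


tx, rx = CreditSender(4), CreditReceiver(4, low_water_mark=2)
for data in ("a", "b", "c"):
    tx.send(data)
    rx.on_message()
if rx.needs_explicit_update():
    tx.on_credit_update(rx.take_grant())

print(tx.credits, tx.confirmed)
```

The sender stops on its own when credits reach zero, so no RNR NAK is 
generated in this model, and each credit update doubles as an ack for 
the messages it covers.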