[rds-devel] Re: RDS IB transport software flow control?

Wed Nov 7 06:33:11 PST 2007

Richard Frank wrote:
> Just to be clear - RDS does not explicitly ack individual socket sends - 
> it periodically (when requested) sends back an RC level high water mark 
> ack indicating all sends that have arrived over an RC. The reason for 
> the ack is to enable replay of sends when paths fail. In practice we 
> issue the ack to 1) free send side resources 2) reduce replay in light 
> of a path failure.  The ack is requested by the send side when we cross 
> a threshold of send side resource consumption.

Looking at the code (v3, OFED 1.3), I see the following comment:

>  * When the remote host receives our ack they'll free the sent message from
>  * their send queue.  To decrease the latency of this we always send an ack
>  * immediately after we've received messages.
>  *
>  * For simplicity, we only have one ack in flight at a time.  This puts
>  * pressure on senders to have deep enough send queues to absorb the latency of
>  * a single ack frame being in flight.  This might not be good enough.

which says that an ack is sent immediately after receiving messages,
rather than only when the send side requests it?
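
To make the "high water mark" ack concrete, here is a minimal sketch of
the sender-side bookkeeping as I understand it from the description
above. This is not the RDS code: struct sendq, SENDQ_DEPTH and the
function names are made up for illustration, and messages are reduced
to bare sequence numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SENDQ_DEPTH 8   /* hypothetical queue depth, not taken from RDS */

/* Hypothetical sender-side state: messages with sequence numbers in
 * [acked + 1, next_seq - 1] are sent but not yet acknowledged, so they
 * must be kept around for replay if the path fails. */
struct sendq {
    uint64_t next_seq;  /* sequence number the next send will get */
    uint64_t acked;     /* highest sequence number the peer has acked */
};

static void sendq_init(struct sendq *q)
{
    q->next_seq = 1;
    q->acked = 0;
}

/* Number of messages still held for possible replay. */
static size_t sendq_unacked(const struct sendq *q)
{
    return (size_t)(q->next_seq - 1 - q->acked);
}

/* Queue one message; returns its sequence number, or 0 if the queue
 * is full and the sender has to wait for an ack. */
static uint64_t sendq_push(struct sendq *q)
{
    if (sendq_unacked(q) >= SENDQ_DEPTH)
        return 0;
    return q->next_seq++;
}

/* Apply a cumulative ack: everything up to and including ack_seq is
 * freed. Returns the number of messages freed; a stale or duplicate
 * ack frees nothing. */
static size_t sendq_ack(struct sendq *q, uint64_t ack_seq)
{
    size_t freed;

    assert(ack_seq < q->next_seq);  /* cannot ack what was never sent */
    if (ack_seq <= q->acked)
        return 0;
    freed = (size_t)(ack_seq - q->acked);
    q->acked = ack_seq;
    return freed;
}
```

With a single ack in flight, the queue depth has to cover everything
sent while that ack is on the wire, which is the pressure on senders
the comment talks about.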

> Our design principle for RDS has been to keep it simple - a minimalist 
> approach - the idea being - that less code is good from a 
> maintainability perspective.

Still, I am not sure that a credit management protocol, from which you 
can deduce which messages are acked (thanks to the in-order property of 
IB RC), would be any more complex than this acking, say in terms of LOC.
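
A rough sketch of what I mean, under assumptions that are mine and not
taken from RDS: the receiver advertises a cumulative send limit and
raises it by one each time it has consumed a message and reposted the
recv buffer. Since RC delivers in order, the sender can then read the
ack point directly off the credit update. All names here (struct
credit_tx and its functions) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sender-side credit state. */
struct credit_tx {
    uint64_t window;  /* recv buffers the peer posted at connection setup */
    uint64_t limit;   /* cumulative number of sends the peer has allowed */
    uint64_t sent;    /* cumulative number of sends issued so far */
};

static void credit_tx_init(struct credit_tx *tx, uint64_t window)
{
    assert(window > 0);
    tx->window = window;
    tx->limit = window;   /* initial credits equal the posted buffers */
    tx->sent = 0;
}

/* Consume one credit; returns false when the sender must wait, so it
 * never posts a send that has no recv buffer waiting for it. */
static bool credit_tx_try_send(struct credit_tx *tx)
{
    if (tx->sent == tx->limit)
        return false;
    tx->sent++;
    return true;
}

/* The peer advertises a new cumulative limit, for example piggybacked
 * on a data message. Older (smaller) values are ignored. */
static void credit_tx_update(struct credit_tx *tx, uint64_t new_limit)
{
    if (new_limit > tx->limit)
        tx->limit = new_limit;
}

/* Assumption: the peer raises the limit only after consuming a message
 * and reposting its buffer. With in-order delivery, messages
 * 1..(limit - window) have therefore arrived and can be freed. */
static uint64_t credit_tx_implied_acked(const struct credit_tx *tx)
{
    return tx->limit - tx->window;
}
```

So the credit update would do double duty as flow control and as the
ack, with no separate ack request logic on the send side.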

> We do leverage IB hardware flow control for RDS - which seems to work 
> well 1) in practice (real application load) we do not see RNRs - and 
> therefore RNRs are not swamping the network. 2) When RNRs do occur (test 
> driven loads) the hardware flow control coupled with the driver 
> reposting of recv buffers is very efficient .

AFAIK, iWARP does not have HW flow control, so relying on RNR NAKs 
narrows the scope of RDS to IB only.

> A couple of additional optimizations for our existing flow control would 
> be to 1) add srq support - this will reduce the possibility of RNRs 2) 
> use an rdma write - vs - message for the ack to remove the requirement 
> of having a recv buffer posted to handle the ack.

Can you elaborate a little on RDMA ACKs vs SEND ACKs? I guess you don't 
mean RDMA write with immediate, since that also consumes a WR at the 
receiver side. If it is a plain RDMA write, how would the sender be 
notified of the ACK's arrival? Would it poll on memory?
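
To illustrate what I mean by polling on memory, here is a small
user-space sketch. It makes no verbs calls: the peer's RDMA write is
simulated by an ordinary atomic store, and struct ack_slot and the
function names are invented for the example. In a real setup the slot
would live in registered memory that the peer writes to.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical ack slot: one word that the peer overwrites with the
 * highest sequence number it has received. */
struct ack_slot {
    _Atomic uint64_t seq;
};

static void ack_slot_init(struct ack_slot *slot)
{
    atomic_init(&slot->seq, 0);
}

/* Stand-in for the peer's RDMA write landing in local memory. */
static void ack_slot_remote_write(struct ack_slot *slot, uint64_t seq)
{
    atomic_store_explicit(&slot->seq, seq, memory_order_release);
}

/* Poll the slot. Returns how many additional messages have been acked
 * since the previous poll and advances *last_seen accordingly. No
 * completion is generated, so the sender has to call this itself, for
 * example before each send or when its send queue fills up. */
static uint64_t ack_slot_poll(struct ack_slot *slot, uint64_t *last_seen)
{
    uint64_t now, newly = 0;

    assert(slot && last_seen);
    now = atomic_load_explicit(&slot->seq, memory_order_acquire);
    if (now > *last_seen) {
        newly = now - *last_seen;
        *last_seen = now;
    }
    return newly;
}
```

The open question is who triggers the poll, since nothing wakes the
sender up when the write lands.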

> Perhaps we will need recv side flow control - if/when we find that the 
> IB hardware flow control becomes an issue - maybe that's just around the 
> corner. We'd like to see some data showing this is a real problem.

My feeling is that this scenario is waiting for you around the corner, 
as you say, but I can't prove it at this point, given my limited 
knowledge of and hands-on experience with RDS. We will see.

Or.