[rds-devel] Re: RDS IB transport software flow control?

Richard Frank richard.frank at oracle.com
Wed Nov 7 07:36:53 PST 2007


Or Gerlitz wrote:
> Richard Frank wrote:
>> Just to be clear - RDS does not explicitly ack individual socket 
>> sends - it periodically (when requested) sends back an RC level high 
>> water mark ack indicating all sends that have arrived over an RC. The 
>> reason for the ack is to enable replay of sends when paths fail. In 
>> practice we issue the ack to 1) free send side resources 2) reduce 
>> replay in light of a path failure.  The ack is requested by the send 
>> side when we cross a threshold of send side resource consumption.
>
> Looking on the code (v3, ofed 1.3) I see the following comment:
>
>>  * When the remote host receives our ack they'll free the sent 
>> message from
>>  * their send queue.  To decrease the latency of this we always send 
>> an ack
>>  * immediately after we've received messages.
>>  *
>>  * For simplicity, we only have one ack in flight at a time.  This puts
>>  * pressure on senders to have deep enough send queues to absorb the 
>> latency of
>>  * a single ack frame being in flight.  This might not be good enough.
>
> which says that the "ack is sent immediately after receiving messages"?
>
I'm pretty sure the code is not doing this - we should update the 
comments - and get out the design docs, as you keep reminding us to do.
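
For what it's worth, the behavior we do implement - requesting an ack 
once send side resource consumption crosses a threshold - has roughly 
this shape (a sketch only, every name below is invented, not the actual 
source):

    #define SEND_RING_SIZE      256                   /* assumed ring depth */
    #define SEND_ACK_THRESHOLD  (SEND_RING_SIZE / 2)  /* assumed policy     */
    #define FLAG_ACK_REQUIRED   0x1

    struct send_wr { unsigned int flags;         /* ... */ };
    struct conn    { unsigned int unacked_sends; /* ... */ };

    extern void post_send(struct conn *c, struct send_wr *wr);

    static void queue_send(struct conn *c, struct send_wr *wr)
    {
            /* piggyback the ack request once we cross the threshold */
            if (++c->unacked_sends >= SEND_ACK_THRESHOLD)
                    wr->flags |= FLAG_ACK_REQUIRED;
            post_send(c, wr);
    }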

>> Our design principle for RDS has been to keep it simple - a 
>> minimalist approach - the idea being - that less code is good from a 
>> maintainability perspective.
>
> Still, I am not sure that a credit management protocol from which you 
> can deduce which messages are acked (given the in-order property of 
> IB RC) would be any more complex than this acking, say LOC-wise.
>
>> We do leverage IB hardware flow control for RDS - which seems to work 
>> well: 1) in practice (real application load) we do not see RNRs, so 
>> RNRs are not swamping the network, and 2) when RNRs do occur 
>> (test-driven loads) the hardware flow control, coupled with the 
>> driver reposting recv buffers, is very efficient.
>
> AFAIK, iWARP does not have HW flow control, so relying on RNR NAKs 
> narrows the scope of RDS to IB only.
>
Then the RDS transport module for iWARP will need to add the flow 
control - but right now the IB transport for RDS is using the IB HW 
flow control.
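
To make that concrete - the software credit scheme an iWARP transport 
might layer in (since it has no RNR NAK to fall back on) could look 
roughly like this; purely a sketch, none of these names are in the 
tree, and locking is elided:

    struct msg  { int granted_credits; /* piggybacked in the header */ };
    struct conn {
            int send_credits;    /* recv buffers we may still consume */
            int pending_grants;  /* recvs reposted but not yet granted */
    };

    extern int post_send(struct conn *c, struct msg *m);

    static int try_send(struct conn *c, struct msg *m)
    {
            if (c->send_credits == 0)
                    return -1;             /* peer has no recv posted */
            c->send_credits--;
            m->granted_credits = c->pending_grants;  /* return grants */
            c->pending_grants = 0;
            return post_send(c, m);
    }

    static void on_recv(struct conn *c, struct msg *m)
    {
            c->send_credits += m->granted_credits;
            /* repost the recv buffer, then remember to grant it back */
            c->pending_grants++;
    }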

BTW - if we could get by without any ACKs we'd be glad to do it - they 
are a pain with lots of side effects. Adding more seems like more of a 
bad thing - perhaps we must do it - but it's going to take some real 
data to make a convincing case.

One thing we might want to consider: if the IB transport knows that we 
have a single HCA configured, then we can do away with the current HA 
ACKing, as it is only needed for HA with multiple adapters! This sounds 
a bit off - but consider all the single HCA boards out there... running 
the HA acking on them is pure overhead.
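
The check itself would be trivial - something like this sketch, where 
the device-count helper is hypothetical:

    extern int rdma_device_count(void);   /* hypothetical helper */

    static int ha_acking_needed(void)
    {
            /* >1 HCA: keep the hwm acks so sends can be replayed
             * on the surviving path after a failover. */
            return rdma_device_count() > 1;
    }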

>> A couple of additional optimizations for our existing flow control 
>> would be to 1) add srq support - this will reduce the possibility of 
>> RNRs 2) use an rdma write - vs - message for the ack to remove the 
>> requirement of having a recv buffer posted to handle the ack.
>
> Can you elaborate a little on RDMA ACKs vs SEND ACKs? I guess you 
> don't mean rdma-write-with-immediate, since this also consumes a WR 
> at the receiver side. If it is just an rdma-write, how would the 
> sender be notified of the ACK reception - would it do 
> polling-on-memory?
>
We RDMA back the ack frame for an RC. And yes, on the send side - when 
a send completes, and/or when the CQ is drained, or when a send is 
initiated - check the current ACK hwm for the RC and release send 
buffers which have been acked.
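
In rough pseudo-C the polling-on-memory side might look like this - 
again a sketch with invented names, not a patch:

    #include <stdint.h>

    struct conn {
            volatile uint64_t peer_ack_hwm; /* peer rdma-writes hwm here */
            uint64_t          last_freed;   /* last send seq we released */
    };

    extern void free_send_buffer(struct conn *c, uint64_t seq);

    /* Called on send completion, CQ drain, and send initiation. */
    static void reap_acked_sends(struct conn *c)
    {
            uint64_t hwm = c->peer_ack_hwm; /* polling-on-memory */

            while (c->last_freed < hwm)
                    free_send_buffer(c, ++c->last_freed);
    }

The win is that acks no longer compete with data for posted recv 
buffers; the cost is that the sender only notices an ack when one of 
those trigger points happens to fire.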

>> Perhaps we will need recv side flow control - if/when we find that 
>> the IB hardware flow control becomes an issue - maybe that's just 
>> around the corner. We'd like to see some data showing this is a real 
>> problem.
>
> My feeling is that this scenario is waiting for you around the corner, 
> as you say, but I can't prove it at this point of my knowledge and 
> hands-on experience with RDS - we will see.
>
> Or.


