[rds-devel] Re: RDS IB transport software flow control?
Richard Frank
richard.frank at oracle.com
Wed Nov 7 07:36:53 PST 2007
Or Gerlitz wrote:
> Richard Frank wrote:
>> Just to be clear - RDS does not explicitly ack individual socket
>> sends - it periodically (when requested) sends back an RC level high
>> water mark ack indicating all sends that have arrived over an RC. The
>> reason for the ack is to enable replay of sends when paths fail. In
>> practice we issue the ack to 1) free send side resources 2) reduce
>> replay in light of a path failure. The ack is requested by the send
>> side when we cross a threshold of send side resource consumption.
>
> Looking on the code (v3, ofed 1.3) I see the following comment:
>
>> * When the remote host receives our ack they'll free the sent message
>> * from their send queue. To decrease the latency of this we always
>> * send an ack immediately after we've received messages.
>> *
>> * For simplicity, we only have one ack in flight at a time. This puts
>> * pressure on senders to have deep enough send queues to absorb the
>> * latency of a single ack frame being in flight. This might not be
>> * good enough.
>
> which says that "ack is sent immediately after receiving messages"?
>
I'm pretty sure the code is not doing this - we should update the
comments - and get out the design docs, as you keep reminding us to do.
>> Our design principle for RDS has been to keep it simple - a
>> minimalist approach - the idea being - that less code is good from a
>> maintainability perspective.
>
> Still, I am not sure that a credit management protocol from which you
> can deduce which messages are acked (given the in-order property of IB
> RC) would not be more complex than this acking, say LOC-wise.
>
>> We do leverage IB hardware flow control for RDS - which seems to work
>> well: 1) in practice (real application load) we do not see RNRs - and
>> therefore RNRs are not swamping the network; 2) when RNRs do occur
>> (test-driven loads), the hardware flow control coupled with the driver
>> reposting of recv buffers is very efficient.
>
> AFAIK, iWARP does not have HW flow control, so relying on RNR NAKs
> narrows the scope of RDS to IB only.
>
Then the RDS transport module for iWARP will need to add the flow
control - but right now the IB transport for RDS is using the IB HW
flow control.
BTW - if we could get by without any ACKs we'd be glad to do it - they
are a pain with lots of side effects. Adding more seems like more of a
bad thing - perhaps we must do it - but it's going to take some real
data to make a convincing case.
One thing we might want to consider is - if the IB transport knows that
we have a single HCA configured - then we can do away with the current
HA ACKing, as it is only needed for HA with multiple adapters! This
sounds a bit off - but consider all the single-HCA boards out
there... running the HA acking there is pure overhead.
>> A couple of additional optimizations for our existing flow control
>> would be to 1) add SRQ support - this will reduce the possibility of
>> RNRs; 2) use an rdma write - vs. a send message - for the ack, to
>> remove the requirement of having a recv buffer posted to handle the ack.
>
> Can you elaborate a little on RDMA ACKs vs SEND ACKs? I guess you
> don't mean rdma-write-with-immediate, since this also consumes a WR
> at the receiver side. If it is just an rdma-write, how would the sender
> be notified of the ACK reception - would it do polling-on-memory?
>
RDMA back the ack frame for an RC. And yes, on the send side - when a
send completes, and/or when the CQ is drained, or when a send is
initiated - check the current ACK hwm for the RC and release send
buffers which have been acked.
>> Perhaps we will need recv side flow control - if/when we find that
>> the IB hardware flow control becomes an issue - maybe that's just
>> around the corner. We'd like to see some data showing this is a real
>> problem.
>
> My feeling is that this scenario is waiting for you around the corner,
> as you say, but I can't prove it at this point of my knowledge and
> hands-on experience with RDS; we will see.
>
> Or.
>