[rds-devel] Re: Looking at a proposal from the folks to have use_once memory keys.

Fri Jan 4 07:23:15 PST 2008

On Friday 04 January 2008 15:06, Richard Frank wrote:
>  >>
> 
> So, one could possibly make a case for ignoring the RDMA op
> on a retransmit.
> <<
> 
> Yes - by design we choose to drop RDMA ops - they are not reliable - and both the rdma op
> and the immediate data rm must be dropped together - they are an atomic unit.  
> 
> If the client app sends in a bad key - this may break the connection as you've outlined
> - but the bad rdma will get dropped and the connection will reform.

Alright. I think I'm beginning to understand.

First off, we indeed drop RDMA ops on the floor. I looked in the wrong
place - it's in rds_rdma_drop_ops.

Second, this explains the trouble I'm having with rds-stress: with too 
many threads/too deep queues, we run out of MRs almost immediately, and
a few threads die - after having asked a peer to RDMA to/from one of
its local buffers. Now the peer tries to perform the RDMA, which fails
because the thread died, and the MR went away. Now the connection goes
down, and all messages on the send queue with RDMA ops get thrown away.
Yes, *all* pending messages with RDMA ops, not just the ones that failed.
Of course, rds-stress gets confused because messages disappeared, and
the whole thing falls apart.

Is this really what we want? I think I'd like something more robust.

I think we should handle this differently. For instance, the send
code could easily catch retransmitted messages with RDMA ops and
skip them. This would allow other threads to continue operating.
Optionally, we could inform the sending thread that an RDMA op failed.
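
To make the difference concrete, here is a minimal sketch of the two
policies. It is not the actual RDS send path: struct msg, its fields and
both function names are made up for illustration, and the real code would
walk the connection's send queue under the appropriate locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a message on the send queue. */
struct msg {
    bool has_rdma_op;   /* message carries an RDMA op */
    bool retransmit;    /* message was already handed to the HCA once */
    bool dropped;       /* message has been discarded */
    bool rdma_failed;   /* sender should be told the RDMA op failed */
};

/* Behaviour described above: after a reconnect, every queued message
 * with an RDMA op is thrown away, whether or not it was the one that
 * failed. Returns the number of messages dropped. */
size_t drop_all_rdma(struct msg *q, size_t n)
{
    size_t count = 0;

    assert(q != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (q[i].has_rdma_op) {
            q[i].dropped = true;
            count++;
        }
    }
    return count;
}

/* Proposed alternative: skip only messages that carry an RDMA op AND
 * are being retransmitted. Everything else stays queued. The skipped
 * message is flagged so the sending thread could be notified. */
size_t skip_retransmitted_rdma(struct msg *q, size_t n)
{
    size_t count = 0;

    assert(q != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (q[i].has_rdma_op && q[i].retransmit) {
            q[i].dropped = true;
            q[i].rdma_failed = true;
            count++;
        }
    }
    return count;
}
```

With a queue holding one retransmitted RDMA message and one RDMA message
that was never sent, the first policy drops both, the second only the
retransmitted one.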

> <<
> BTW: Can you explain what *exactly* goes wrong with HCA failover,
> that makes us do all this ACK/retransmit business? Up to which point
> can a WC get lost on the receiving host? And how does that relate
> to RDMA ops?
> >>
> 
> Here's my limited understanding...
> 
> The HCA issues the hardware ack (put on the wire) for a send it recv'd before 1) the send data
> (could be rdma) is pushed to host memory and 2) the WC is queued. So it is possible for the
> sending side HCA to get back an IB ack - and queue a local completion for the send (or RDMA).
> At this point the remote HCA fails and neither the data nor the WC makes it to remote host memory.
> However, the local host thinks that it did due to the ACK and completion generated - so we lose the data.

I see. I was under the impression that data (RDMA, data going to S/G buffers)
would hit host memory right away without being buffered on the HCA, and
that it was a matter of WCs not getting copied to host memory. But if
we're not even guaranteed that RDMA ops hit memory, then there's really
no way to do RDMA reliably in the face of HCA failure.
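
Spelled out as a toy model, under the assumption that the remote HCA goes
through the steps in the order Richard describes (ack on the wire, then
data to host memory, then WC queued). The enum and function names are
invented for this sketch and do not correspond to anything in the verbs
API or the RDS code.

```c
#include <assert.h>
#include <stdbool.h>

/* How far the remote HCA got before it failed (assumed ordering). */
enum rx_stage {
    RX_NOTHING,         /* failed before acking anything */
    RX_ACK_ON_WIRE,     /* hardware ack sent, data still on the HCA */
    RX_DATA_IN_MEMORY,  /* data pushed to host memory, no WC yet */
    RX_WC_QUEUED        /* WC visible to the remote host */
};

/* The sending side gets a local completion as soon as the ack is back. */
bool sender_saw_completion(enum rx_stage s)
{
    return s >= RX_ACK_ON_WIRE;
}

/* The remote host only learns about the message once the WC is queued. */
bool receiver_saw_message(enum rx_stage s)
{
    return s >= RX_WC_QUEUED;
}

/* Silent loss: the sender believes the op completed, the receiver
 * never hears about it. */
bool silent_loss(enum rx_stage s)
{
    assert(s >= RX_NOTHING && s <= RX_WC_QUEUED);
    return sender_saw_completion(s) && !receiver_saw_message(s);
}
```

In this model the window covers both RX_ACK_ON_WIRE and RX_DATA_IN_MEMORY;
only a failure before the ack or after the WC is harmless.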

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir at lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax