[rds-devel] Re: Looking at a proposal from the folks to have use_once memory keys.

Fri Jan 4 08:39:00 PST 2008

Olaf Kirch wrote:
> On Friday 04 January 2008 15:06, Richard Frank wrote:
>   
>>  >>
>>
>> So, one could possibly make a case for ignoring the RDMA op
>> on a retransmit.
>> <<
>>
>> Yes - by design we choose to drop RDMA ops - they are not reliable - and both the rdma op
>> and the immediate data rm must be dropped together - they are an atomic unit.  
>>
>> If the client app sends in a bad key - this may break the connection as you've outlined
>> - but the bad rdma will get dropped and the connection will reform.
>>     
>
> Alright. I think I'm beginning to understand.
>
> First off, we indeed drop RDMA ops on the floor. I looked in the wrong
> place - it's in rds_rdma_drop_ops.
>
> Second, this explains the trouble I'm having with rds-stress: with too 
> many threads/too deep queues, we run out of MRs almost immediately, and
> a few threads die - after having asked a peer to RDMA to/from one of
> its local buffers. 
We should be seeing EAGAIN from get_mr in this case - and then take some ugly 
wait / wake path - waiting for MRs to become available?
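
A minimal sketch of that behavior, assuming a hypothetical fixed-size MR pool. The names mr_pool, mr_get, mr_put and MR_POOL_SIZE are illustrative only and are not the actual RDS get_mr code. The sleeping side of the wait / wake path is left out; the sketch shows only where EAGAIN would come from and where a waiter would be woken.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MR_POOL_SIZE 2          /* hypothetical limit, tiny on purpose */

/* Hypothetical pool that only counts how many MRs are handed out. */
struct mr_pool {
    int in_use;
};

/* Take one MR. Returns 0 on success or -EAGAIN when the pool is empty,
 * which is the point where a caller would go to sleep and wait. */
static int mr_get(struct mr_pool *pool)
{
    assert(pool != NULL);
    if (pool->in_use >= MR_POOL_SIZE)
        return -EAGAIN;
    pool->in_use++;
    return 0;
}

/* Give one MR back. A real wait / wake path would wake a sleeper here. */
static void mr_put(struct mr_pool *pool)
{
    assert(pool != NULL && pool->in_use > 0);
    pool->in_use--;
}
```

With a pool of two, a third mr_get fails with -EAGAIN until some thread calls mr_put, instead of the thread dying outright.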

> Now the peer tries to perform the RDMA, which fails
> because the thread died, and the MR went away. 
OK - this is expected behavior when the key is invalid.

> Now the connection goes
> down, and all messages on the send queue with RDMA ops get thrown away.
>   
We should only drop the RDMA ops which have been sent at least once.

Our thinking was to let the clients detect the loss of the RDMA ops / 
immediate data completions via a long timeout and retry the operations, 
allocating new keys, etc.

This seems inefficient - but again, losing an RDMA op (connection breaking 
/ failover) is a rare case - that was the bet.
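
The client-side retry described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the wait_fn callback stands in for "wait for the immediate data completion or hit the long timeout", and the key counter stands in for allocating a fresh MR key. It is not taken from any real client.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true if the completion for `key`
 * arrived before the client's long timeout expired. */
typedef bool (*wait_fn)(unsigned key, void *ctx);

/* Retry an RDMA request up to max_attempts times, using a new key for
 * every attempt. Returns the key that completed, or 0 if all timed out. */
static unsigned rdma_retry(wait_fn wait_done, void *ctx,
                           int max_attempts, unsigned *next_key)
{
    assert(wait_done != NULL && next_key != NULL);
    for (int i = 0; i < max_attempts; i++) {
        unsigned key = (*next_key)++;   /* never reuse the key of a lost op */
        if (wait_done(key, ctx))
            return key;
    }
    return 0;
}

/* Example callback: times out until the counter in ctx reaches zero. */
static bool times_out_n_times(unsigned key, void *ctx)
{
    int *remaining = ctx;

    (void)key;
    if (*remaining > 0) {
        (*remaining)--;
        return false;
    }
    return true;
}
```

Starting the key counter at 1 and letting the example callback time out twice, the third attempt completes with key 3. The cost is one long timeout per lost op, which is the inefficiency accepted in the bet above.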

In the case of a connection reforming over the same HCA (connection 
broke due to bad key) - we should not need to toss all RDMA ops - as the 
keys are still valid - and the HCA has not lost any state. However, this 
opens the issue of sorting out which RDMA op caused the connection to fail 
- and tossing just that one - as you have pointed out.

> Yes, *all* pending messages with RDMA ops, not just the ones that failed.
> Of course, rds-stress gets confused because messages disappeared, and
> the whole thing falls apart.
>
> Is this really what we want? I think I'd like something more robust.
>
> I think we should handle this differently. For instance, the send
> code could easily catch retransmitted messages with RDMA ops and
> skip them. This would allow other threads to continue operating.
>   
Yes - this would be better.
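
A minimal sketch of that filtering, using a simplified, hypothetical message structure (not the real struct rds_message or the real send path). It assumes each queued message records whether it carries an RDMA op and whether it has already been transmitted at least once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a queued message. */
struct msg {
    bool has_rdma_op;   /* carries an RDMA op plus its immediate data */
    bool sent_once;     /* was handed to the HCA at least once */
    bool dropped;       /* set when the send path skips the message */
};

/* Walk the send queue after a reconnect. A message that carries an RDMA
 * op and was already sent is dropped as a unit (RDMA op and immediate
 * data together); everything else stays queued for (re)transmission.
 * Returns the number of messages dropped. */
static size_t skip_retransmitted_rdma(struct msg *queue, size_t n)
{
    size_t dropped = 0;

    assert(queue != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (queue[i].has_rdma_op && queue[i].sent_once) {
            queue[i].dropped = true;
            dropped++;
        }
    }
    return dropped;
}
```

Messages with RDMA ops that were never sent, and plain messages, are left alone, so other threads sharing the connection keep operating.
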
> Optionally, we could inform the sending thread that an RDMA op failed.
>
>   
RDMAs are submitted async (pseudo async) via sendmsg which only 
indicates the request was submitted / queued. The clients are using the 
barrier op to detect rdma op completions but not the status of the 
individual RDMA ops (success or failure). We'd need to add a new 
interface to return status for submitted ops.
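
Purely as an illustration of what such a status interface might track, here is a sketch with invented names (op_record, op_complete, op_query and the cookie field are all hypothetical; no such RDS interface is implied to exist). The assumption is that the client tags each submitted op with a cookie and can later ask for its outcome.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum op_status { OP_PENDING, OP_SUCCESS, OP_FAILED };

/* Hypothetical per-op record, keyed by a client-chosen cookie. */
struct op_record {
    uint64_t cookie;
    enum op_status status;
};

static struct op_record *op_find(struct op_record *tbl, size_t n,
                                 uint64_t cookie)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].cookie == cookie)
            return &tbl[i];
    return NULL;
}

/* Record the final outcome of an op. Returns 0, or -1 if unknown cookie. */
static int op_complete(struct op_record *tbl, size_t n,
                       uint64_t cookie, enum op_status st)
{
    struct op_record *rec = op_find(tbl, n, cookie);

    assert(st != OP_PENDING);
    if (rec == NULL)
        return -1;
    rec->status = st;
    return 0;
}

/* Report the current status of an op. Returns 0, or -1 if unknown cookie. */
static int op_query(struct op_record *tbl, size_t n,
                    uint64_t cookie, enum op_status *out)
{
    struct op_record *rec = op_find(tbl, n, cookie);

    assert(out != NULL);
    if (rec == NULL)
        return -1;
    *out = rec->status;
    return 0;
}
```

An op stays OP_PENDING from submission until its outcome is recorded, so a client could tell a dropped op from one that is merely still queued.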

>> <<
>> BTW: Can you explain what *exactly* goes wrong with HCA failover,
>> that makes us do all this ACK/retransmit business? Up to which point
>> can a WC get lost on the receiving host? And how does that relate
>> to RDMA ops?
>>     
>> Here's my limited understanding...
>>
>> The HCA issues the hardware ack (put on the wire) for a send it recv'd before 1) the send data
>> (could be rdma) is pushed to host memory and 2) the WC is queued. So it is possible for the 
>> sending side HCA to get back an IB ack - and queue a local completion for the send (or RDMA). 
>> At this point the remote HCA fails and neither the data nor the WC makes it to remote host memory. 
>> However, the local host thinks that it did due to the ACK and completion generated - so we lose the data.
>>     
>
> I see. I was under the impression that data (RDMA, data going to S/G buffers)
> would hit host memory right away without being buffered on the HCA, and
> that it was a matter of WCs not getting copied to host memory. But if
> we're not even guaranteed that RDMA ops hit memory, then there's really
> no way to do RDMA reliably in the face of HCA failure.
>   
This is true for normal sends too.

If I understand this correctly, the HCA is not buffering all the data before 
moving it to host memory - but part of it may still be in the HCA and lost.

> Olaf
>