[rds-devel] Re: Looking at a proposal from the folks to have
use_once memory keys.
Richard Frank
richard.frank at oracle.com
Fri Jan 4 08:39:00 PST 2008
Olaf Kirch wrote:
> On Friday 04 January 2008 15:06, Richard Frank wrote:
>
>> >>
>>
>> So, one could possibly make a case for ignoring the RDMA op
>> on a retransmit.
>> <<
>>
>> Yes - by design we choose to drop RDMA ops - they are not reliable - and both the RDMA op
>> and the immediate-data rm must be dropped together - they are an atomic unit.
>>
>> If the client app sends in a bad key - this may break the connection as you've outlined
>> - but the bad RDMA will get dropped and the connection will reform.
>>
>
> Alright. I think I'm beginning to understand.
>
> First off, we indeed drop RDMA ops on the floor. I looked in the wrong
> place - it's in rds_rdma_drop_ops.
>
> Second, this explains the trouble I'm having with rds-stress: with too
> many threads/too deep queues, we run out of MRs almost immediately, and
> a few threads die - after having asked a peer to RDMA to/from one of
> its local buffers.
We should be seeing EAGAIN from get_mr in this case - and do some ugly
wait / wake path, waiting for MRs to become available?
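For illustration only, here is a userspace sketch of the kind of wait / wake path meant here - a pool whose get blocks until an MR is freed instead of failing with EAGAIN. The names (mr_pool, mr_get, mr_put) are invented for the example and are not the RDS API:

```c
#include <pthread.h>

/* Hypothetical userspace analogue of an MR pool with a wait / wake path.
 * A real kernel implementation would use a wait queue instead. */
struct mr_pool {
    pthread_mutex_t lock;
    pthread_cond_t  avail;    /* signalled when an MR is returned */
    int             free_mrs; /* MRs currently available */
};

void mr_pool_init(struct mr_pool *p, int nmrs)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->avail, NULL);
    p->free_mrs = nmrs;
}

/* Block until an MR is available, rather than returning EAGAIN. */
void mr_get(struct mr_pool *p)
{
    pthread_mutex_lock(&p->lock);
    while (p->free_mrs == 0)
        pthread_cond_wait(&p->avail, &p->lock);
    p->free_mrs--;
    pthread_mutex_unlock(&p->lock);
}

/* Return an MR to the pool and wake one waiter. */
void mr_put(struct mr_pool *p)
{
    pthread_mutex_lock(&p->lock);
    p->free_mrs++;
    pthread_cond_signal(&p->avail);
    pthread_mutex_unlock(&p->lock);
}
```

The ugly part is exactly the blocking in mr_get: a sendmsg caller that can't get an MR sleeps instead of seeing the error.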
> Now the peer tries to perform the RDMA, which fails
> because the thread died, and the MR went away.
OK - this is expected behavior when the key is invalid.
> Now the connection goes
> down, and all messages on the send queue with RDMA ops get thrown away.
>
We should only drop the RDMA ops which have been sent at least once.
Our thinking was to let the clients detect the loss of the RDMA ops /
immediate-data completions via a long timeout and retry the operations,
allocating new keys, etc.
This seems inefficient - but the bet was that losing an RDMA (connection
breaking / failover) is a rare case.
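A client-side retry loop of the sort described might look like the sketch below. The shape and names are hypothetical, not an RDS interface; the attempt callback stands in for "allocate a new key, resubmit the op, wait out the long timeout":

```c
#include <stdbool.h>

/* The callback returns true when the RDMA op's completion arrived
 * before the timeout expired.  A real client would allocate a fresh
 * MR key inside each attempt. */
typedef bool (*rdma_attempt_fn)(void *ctx);

/* Retry the operation up to max_tries times.  Returns the number of
 * attempts used on success, or -1 if the client gives up. */
int rdma_retry(rdma_attempt_fn attempt, void *ctx, int max_tries)
{
    for (int try = 0; try < max_tries; try++) {
        if (attempt(ctx))
            return try + 1;
    }
    return -1;
}
```

The cost of the bet shows up here: every lost RDMA pays a full timeout plus a re-registration before the retry.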
In the case of a connection reforming over the same HCA (connection
broke due to a bad key) - we should not need to toss all RDMA ops, as
the keys are still valid and the HCA has not lost any state. However,
this opens the issue of sorting out which RDMA caused the connection to
fail - and tossing just that one - as you have pointed out.
> Yes, *all* pending messages with RDMA ops, not just the ones that failed.
> Of course, rds-stress gets confused because messages disappeared, and
> the whole thing falls apart.
>
> Is this really what we want? I think I'd like something more robust.
>
> I think we should handle this differently. For instance, the send
> code could easily catch retransmitted messages with RDMA ops and
> skip them. This would allow other threads to continue operating.
>
Yes - this would be better.
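As a rough sketch of the skip-on-retransmit idea (the structures and field names are invented for the example, not the real RDS send path):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative send-queue message. */
struct msg {
    bool has_rdma;   /* message carries an RDMA op */
    bool sent_once;  /* already transmitted at least once */
    struct msg *next;
};

/* Walk the send queue on a retransmit pass.  Messages carrying RDMA
 * ops that already went out once are skipped rather than re-issued,
 * so the rest of the queue (and other threads' messages) keeps
 * flowing.  Returns the number of messages skipped. */
int retransmit_queue(struct msg *head)
{
    int skipped = 0;
    for (struct msg *m = head; m; m = m->next) {
        if (m->has_rdma && m->sent_once) {
            skipped++;          /* don't replay the RDMA op */
            continue;
        }
        m->sent_once = true;    /* (re)transmit the message here */
    }
    return skipped;
}
```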
> Optionally, we could inform the sending thread that an RDMA op failed.
>
>
RDMAs are submitted async (pseudo-async) via sendmsg, which only
indicates the request was submitted / queued. The clients are using the
barrier op to detect RDMA op completions, but not the status of the
individual RDMA ops (success or failure). We'd need to add a new
interface to return status for submitted ops.
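A minimal sketch of what such a status-return interface could hand back to clients - the record layout and rdma_op_query are assumptions for illustration, not a proposed ABI:

```c
#include <stdint.h>

/* Hypothetical per-op status record.  Today sendmsg only reports that
 * the request was queued; this would let a client reap the outcome of
 * an individual RDMA op by a cookie it supplied at submit time. */
enum rdma_op_status { OP_PENDING, OP_COMPLETE, OP_FAILED };

struct rdma_op_record {
    uint64_t            cookie;  /* user-supplied op identifier */
    enum rdma_op_status status;
};

/* Look up the status of a submitted op by cookie.  Unknown cookies
 * read as OP_PENDING (not yet reaped). */
enum rdma_op_status rdma_op_query(const struct rdma_op_record *tbl,
                                  int n, uint64_t cookie)
{
    for (int i = 0; i < n; i++)
        if (tbl[i].cookie == cookie)
            return tbl[i].status;
    return OP_PENDING;
}
```

The barrier op would keep its current meaning (all earlier ops completed); this query would add per-op success / failure on top of it.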
>> <<
>> BTW: Can you explain what *exactly* goes wrong with HCA failover,
>> that makes us do all this ACK/retransmit business? Up to which point
>> can a WC get lost on the receiving host? And how does that relate
>> to RDMA ops?
>>
>> Here's my limited understanding...
>>
>> The HCA issues the hardware ACK (puts it on the wire) for a send it received before 1) the send data
>> (could be RDMA) is pushed to host memory and 2) the WC is queued. So it is possible for the
>> sending-side HCA to get back an IB ACK - and queue a local completion for the send (or RDMA).
>> At this point the remote HCA fails, and neither the data nor the WC makes it to remote host memory.
>> However, the local host thinks it did, due to the ACK and the completion generated - so we lose the data.
>>
>
> I see. I was under the impression that data (RDMA, data going to S/G buffers)
> would hit host memory right away without being buffered on the HCA, and
> that it was a matter of WCs not getting copied to host memory. But if
> we're not even guaranteed that RDMA ops hit memory, then there's really
> no way to do RDMA reliably in the face of HCA failure.
>
This is true for normal sends too.
"IF" I understand this - the HCA is not buffering all the data before
moving it to host memory, but part of it may be in the HCA and lost.
> Olaf
>