[rds-devel] Re: Looking at a proposal from the SAGE folks to have use_once memory keys.

Fri Jan 4 05:45:53 PST 2008

Olaf Kirch wrote:
> Hi Rick,
>
> if you have no objections, I'd like to take this discussion to the rds-devel
> list. Okay for me to bounce your and my message to the list?
>   
Done !
> On Friday 04 January 2008 00:44, Richard Frank wrote:
>   
>> The idea is to piggy back the key for an rdma send op - in the message 
>> header of the rm chasing the rdma (immediate data) - so when the rm is 
>> recv'd by the driver - the driver can auto free the memory key (assuming 
>> one exists in the rm header)..
>>
>> 1) apps do not explicitly free a use_once key - this closes a possible 
>> race when a duplicate update may be sent by the rdma server.
>> 2) this also reduces the window / lifetime of memory keys (a very 
>> limited resource).
>>
>> If this makes sense to do  - then we might want a flag on the rdma send 
>> call to enable auto freeing of the key ?
>>
>> What do you think ?
>>     
>
> The reason to try this is certainly valid. But I'm not sure the proposed
> approach is the best way to do it. We rely on the peer to present the
> key of a MR that we should destroy - what if the peer lies, or is confused?
> The semantics of use_once buffers should be a local affair, if possible.
>   
The MR would be handed in with the RDMA send operation - so if the peer 
is using the wrong MR - it was able to issue an RDMA too !
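
A minimal sketch of the auto-free idea, assuming the key travels in the header of the message that chases the RDMA. The names KeyTable, on_message_received and the "rdma_key" header field are hypothetical and are not part of the RDS code. The sketch frees a key only if it was registered locally as use_once, which is one possible way to keep the semantics a local affair.

```python
class KeyTable:
    """Hypothetical per-socket table of registered memory keys."""

    def __init__(self):
        self._keys = {}  # r_key -> True if registered as use_once

    def register(self, r_key, use_once=False):
        self._keys[r_key] = use_once

    def is_registered(self, r_key):
        return r_key in self._keys

    def on_message_received(self, header):
        """Free the key named in the header if it is a local use_once key.

        Returns True if a key was freed, False otherwise.
        """
        r_key = header.get("rdma_key")  # hypothetical header field
        if r_key is None:
            return False
        # Only free keys that were registered locally as use_once, so a
        # confused or lying peer cannot tear down an ordinary MR.
        if self._keys.get(r_key) is not True:
            return False
        del self._keys[r_key]
        return True
```

With this shape a duplicate update that names an already freed key is simply ignored, and the application never frees a use_once key itself.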

> Isn't there a way for the completion queue handler to identify the MR
> based on the RDMA work completion? Alternatively, we could always include
> the R_Key in the immediate data of the RDMA WR. Then the peer should
> get a work completion with IB_WC_RECV_RDMA_WITH_IMM. It can then go and
> grab the R_Key, and check whether the memory range can now be freed.
>
> There is a problem with RDMA and retransmits, and use_once buffers make
> it worse. Assume
>
>   
We do not retransmit RDMA operations - at least that's our intention -
RDMA ops are not reliable.  I think we already do this ?

>  -	node A sets up a memory range, and sends a message X requesting
> 	that node B reads from that range.
>  -	node B performs the RDMA read, and sends along a message Y.
> 	It also sends an RDS ACK that says it received message X
>  -	node A receives the message, and the application destroys
> 	the MR
>  -	Something causes a connection re-establishment
>  -	Node B looks at its message queue and finds message Y, for
> 	which it hasn't received an RDS ACK yet. So it retransmits
> 	the message
>  -	Node A receives the RDMA operation, but the r_key is now
> 	invalid. It tears down the connection and re-establishes it.
>  -	Repeat the last two steps ad infinitum
>
>   
If a connection breaks - we throw rdma ops on the floor - so we should 
not resend "Y".
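
A rough sketch of that policy, assuming each queued message records whether it carries an RDMA op. The "has_rdma" field and the function name are hypothetical, and dropping the whole message (rather than only its RDMA part) is an assumption of this sketch.

```python
from collections import deque


def drop_rdma_on_disconnect(retrans_queue):
    """Return a new queue holding only the messages without an RDMA op.

    The input queue is left untouched; order is preserved.
    """
    return deque(m for m in retrans_queue if not m["has_rdma"])
```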

> With normal operations, there's a certain window between delivery of
> message Y, and the application destroying the MR. With use_once buffers
> this window goes away. Connection recovery will be impossible if the
> connection goes down while we have an RDMA to a use_once buffer in flight.
>
> On one hand, it is certainly the application's job to make sure no peer
> has a pending RDMA read/write when it destroys a MR. On the other hand,
> the application has no control over the RDS layer retransmission,
> and it doesn't know whether a peer has a RDMA operation on its retransmit
> queue.
>
> One way to fix this would be to track which MRs have outstanding RDMA
> operations in progress. But that would mean the application would
> need to inform the kernel about who is going to access that buffer,
> and we would require another round trip to that peer ("ACK, I received
> your message Y" - "Double ACK, I received your ACK and removed message
> Y from the retransmit queue") before we allow the application to release
> that MR.
>
> This kind of sucks.
>
> So what we really need to do at the RDS layer is to make sure that a
> message that has been received successfully will never be retransmitted
> by the peer. That's certainly a departure from previous semantics,
> but I can't see how to fix this problem without either making RDMA
> slower, or changing connection recovery.
>
> What we currently do after a connection is re-established is we
> take all messages on the retransmit queue and stick them on the
> send queue. We could change that and wait until we receive the
> initial packet on that connection, which will be a congestion map
> update. This packet comes with a piggy-backed ACK that tells us
> which packets the peer has received already, so we can discard
> them from the retrans queue - and *then* queue the remaining
> messages.
>
> Olaf
>
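
The recovery change described in the last paragraph could look roughly like the sketch below. It assumes each message carries a per-connection sequence number, that the retransmit queue is ordered by that number, and that the first packet after reconnect reports the highest sequence number the peer has received. Connection, Message, on_reconnect and on_first_packet are hypothetical names, not the RDS implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    seq: int            # hypothetical per-connection sequence number
    payload: bytes = b""


class Connection:
    def __init__(self):
        self.send_queue = deque()
        self.retrans_queue = deque()   # assumed ordered by seq
        self.awaiting_first_ack = False

    def on_reconnect(self):
        # Do not requeue anything yet; wait for the peer's first packet.
        self.awaiting_first_ack = True

    def on_first_packet(self, acked_seq):
        """Handle the first packet after reconnect and its piggy-backed ACK.

        acked_seq is the highest sequence number the peer reports as received.
        """
        if not self.awaiting_first_ack:
            return
        self.awaiting_first_ack = False
        # Discard everything the peer already has ...
        while self.retrans_queue and self.retrans_queue[0].seq <= acked_seq:
            self.retrans_queue.popleft()
        # ... and only then queue the remaining messages for sending.
        self.send_queue.extend(self.retrans_queue)
        self.retrans_queue.clear()
```

In the scenario above, message Y would already be covered by the piggy-backed ACK, so it would be discarded instead of being retransmitted against a destroyed MR.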