[rds-devel] Re: process death - with outstanding rdma operations

Richard Frank richard.frank at oracle.com
Tue Jan 22 07:58:59 PST 2008


Or Gerlitz wrote:
> Richard Frank wrote:
>> Or Gerlitz wrote:
>
>>> So a possible design you suggest here is that the client app would 
>>> set a per-key timer and revoke the key when the timer expires, OK. 
>>> As for client process death making RDS revoke the key, it means 
>>> that RDS has to manage per-process bookkeeping for registrations 
>>> (namely pages locked and keys) done by it on behalf of that process, 
>>> correct?
>
>> Yes - I believe we are doing this now.
>
> It's unclear from your response whether you refer to the first case 
> (client app revokes) or the second case (RDS driver revokes).

The client can always revoke a key and request immediate invalidation.

The driver will automatically revoke keys and invalidate (fence) when a 
process dies.

The driver will also ensure that in-progress rdma operations (on the 
wire) are run down before the dying process is allowed to exit.
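The per-process bookkeeping this implies can be sketched roughly as below. This is an illustrative sketch only, not the actual RDS driver code; the structure and function names (`mr_entry`, `mr_track`, `mr_revoke_all`) are invented for the example. The idea is simply that each registered key is recorded under the owning pid, so that on process death the driver can walk the list and revoke everything that process owned:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-process registration record; the real driver
 * would also track the locked pages behind each key. */
struct mr_entry {
    int pid;                /* owning process */
    unsigned int key;       /* rkey handed out for rdma */
    struct mr_entry *next;
};

static struct mr_entry *mr_list;

/* record a registration done on behalf of a process */
static void mr_track(int pid, unsigned int key)
{
    struct mr_entry *e = malloc(sizeof(*e));
    e->pid = pid;
    e->key = key;
    e->next = mr_list;
    mr_list = e;
}

/* on process death: revoke every key the process owned;
 * returns the number of keys revoked */
static int mr_revoke_all(int pid)
{
    struct mr_entry **pp = &mr_list;
    int n = 0;

    while (*pp) {
        if ((*pp)->pid == pid) {
            struct mr_entry *dead = *pp;
            /* the real driver would invalidate (fence) the key
             * and run down in-flight rdmas before freeing */
            *pp = dead->next;
            free(dead);
            n++;
        } else {
            pp = &(*pp)->next;
        }
    }
    return n;
}
```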

>
>> longer valid resulting in an access error. In this case the RC is 
>> silently reconnected by the transport. Ideally, only the rdmas with 
>> access errors are dropped.
>
> Does it mean that rdmas that complete with a flush error are 
> retried? I was thinking that rdmas are not reliable.
Yes - rdmas are not reliable in V3 - they can be dropped - but we 
try hard not to drop them. Basically, it means that if an rdma 
gets an access error it is dropped. Access errors could be due to a 
poorly behaved client / rdma server handing in bad keys, etc. - or due 
to failover between HCAs.

>
>> The idea is to limit the potential bad behavior of processes (DOS) to 
>> the set of processes within a group - or more importantly to exclude 
>> processes outside of a group from affecting another group.
>>
>> The RDS driver / transport creates an RC per group - vs - system 
>> wide. Each group has a private RC that it shares with all 
>> processes in the group for send / rdma operations.
>
> So with this design, from the viewpoint of the IB stack, each RDS 
> process group is associated with a port (listener) etc. 
> (pd, qp, fmr_pool) on the local node (vs. the situation today, where 
> rds calls rdma_listen once on one port)?
>
We still call rdma_listen on a single port - to listen for incoming 
connections.
> How do you manage this? Today RDS connects to the remote side based 
> on the --ip-- address provided to sendmsg() and ignores the port. So 
> now per-process-group ports are exchanged out of band, and RDS would 
> take into account the port provided by the app when it calls sendmsg()?
>
Correct - with the current proposal (patch has not been accepted) - the 
destination address is composed of destination (IP:PORT) + sender GID. On 
a sendmsg, this tuple is hashed to look up a local RDS connection. If 
one does not exist, a new connect is requested. We pass along the 
connecting side's GID in the connect private data. The accepting side 
hashes the IP:PORT:GID (from private data) to look up the local accept-side 
connect struct - and will create a new one if needed. If one already 
exists, then a connect race has occurred - which sorts itself out.
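The lookup-or-create step described above can be sketched as follows. This is a simplified illustration, not the proposed patch; the table size, hash function, and names (`rds_conn`, `conn_lookup`) are assumptions made for the example. Both sides hash the same (IP, PORT, GID) tuple, so they converge on the same connection:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CONN_BUCKETS 64

/* Hypothetical connection struct keyed by the (IP:PORT, sender GID)
 * tuple from the proposal. */
struct rds_conn {
    unsigned int ip;
    unsigned short port;
    unsigned char gid[16];
    struct rds_conn *next;
};

static struct rds_conn *conn_tbl[CONN_BUCKETS];

/* combine the tuple into a bucket index (illustrative hash) */
static unsigned int conn_hash(unsigned int ip, unsigned short port,
                              const unsigned char *gid)
{
    unsigned int h = ip ^ port;
    int i;

    for (i = 0; i < 16; i++)
        h = h * 31 + gid[i];
    return h % CONN_BUCKETS;
}

/* look up the connection for a destination tuple; on a miss,
 * create one (the real driver would also kick off the
 * transport-level connect here) */
static struct rds_conn *conn_lookup(unsigned int ip, unsigned short port,
                                    const unsigned char *gid)
{
    unsigned int b = conn_hash(ip, port, gid);
    struct rds_conn *c;

    for (c = conn_tbl[b]; c; c = c->next)
        if (c->ip == ip && c->port == port && !memcmp(c->gid, gid, 16))
            return c;

    c = calloc(1, sizeof(*c));
    c->ip = ip;
    c->port = port;
    memcpy(c->gid, gid, 16);
    c->next = conn_tbl[b];
    conn_tbl[b] = c;
    return c;
}
```

Because the sender GID is part of the key, two senders on different HCAs reaching the same IP:PORT get distinct connections, which is what gives each group its own RC.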

So from a sender perspective the send side is hashing to find an RDS 
connection to use for sending to a destination. Of course on the recv 
side - recv messages can be arriving over any of the established RDS 
connections.
 
The net is that processes of the same group share an RDS connection for 
sends (including rdmas) which provides a level of isolation.

From an HA perspective, RDS connections are reliable - transport-level 
connections transparently come and go underneath them.

> Or
>
