[rds-devel] Re: [PATCH 1/2] RDS/IB: Handle connection request in case of failover.

Vladimir Sokolovsky vlad at dev.mellanox.co.il
Tue Jun 12 03:56:33 PDT 2007


Zach Brown wrote:
> 
> On Jun 10, 2007, at 10:59 AM, Vladimir Sokolovsky wrote:
> 
>> Hi Zach,
>> During the testing I get into the BUG_ON(ic->i_cm_id) (file:
>> net/rds/ib-cm.c line: 252)
>> So, checking _CONNECTED and _CONNECTING is not enough.
> 
> Why not?  That test for CONNECTING just above the BUG should see that an 
> established connection already has resources, queue shutdown, and quit.  
> Are you seeing this bug while running with your patch to clear 
> CONNECTING as connections become established?
> 
> - z

I think that the following flow happens:

* Setup:
	Node A, connected with 2 IB ports to the IB switch
	Node B, connected with 2 IB ports to the same IB switch

* Flow:
	Run rds-sink on node A
	Run rds-gen on node B

	Then disconnect the active IB port on node A.
	After ~20 sec, node A calls rds_shutdown_worker and then tries to
	recreate the connection.

From here, I see one of three flows:

1. The connection is established after a number of retries and the test
continues to run over the second IB port. This happens in about 5% of tests.

2. Node B, still unaware of the connection failure, gets
RDMA_CM_EVENT_CONNECT_REQUEST (both _CONNECTED and _CONNECTING are set)
and calls rds_shutdown_worker, which clears both _CONNECTED and
_CONNECTING. Node B may then get another RDMA_CM_EVENT_CONNECT_REQUEST
(from node A, whose previous request was rejected) before
rds_ib_conn_shutdown has finished, and so hits BUG_ON(ic->i_cm_id).
This case also occurs in about 5% of tests.
Patch 1/2 fixes this issue.
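
The window in flow 2 can be illustrated with a small user-space model. This is
only a sketch under stated assumptions: every name below (model_conn,
model_handle_connect_request, and so on) is hypothetical, the two-step
shutdown is a simplification, and the extra cm_id check is one possible guard,
not necessarily what patch 1/2 does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; these are NOT the real RDS structures or flags. */
enum { CONN_CONNECTING = 1u << 0, CONN_CONNECTED = 1u << 1 };

struct model_conn {
    unsigned flags;        /* stands in for the _CONNECTING/_CONNECTED bits */
    void *cm_id;           /* stands in for ic->i_cm_id */
    bool shutdown_queued;  /* set when a shutdown has been requested */
};

enum req_result { REQ_ACCEPT, REQ_REJECT };

/* First half of the modelled shutdown: bits are cleared, cm_id still set. */
void model_shutdown_begin(struct model_conn *c)
{
    c->flags = 0;
}

/* Second half: resources are released and the connection is fully down. */
void model_shutdown_finish(struct model_conn *c)
{
    c->cm_id = NULL;
    c->shutdown_queued = false;
}

/* Models handling of an incoming connect request. */
enum req_result model_handle_connect_request(struct model_conn *c,
                                             void *new_cm_id)
{
    if (c->flags & (CONN_CONNECTING | CONN_CONNECTED)) {
        /* Existing connection: queue shutdown and reject. */
        c->shutdown_queued = true;
        return REQ_REJECT;
    }
    /*
     * Assumed guard: bits are clear but teardown has not finished.
     * Without this check the model would reach the equivalent of
     * BUG_ON(ic->i_cm_id) below.
     */
    if (c->cm_id != NULL)
        return REQ_REJECT;

    c->cm_id = new_cm_id;
    c->flags |= CONN_CONNECTING;
    return REQ_ACCEPT;
}
```

In this model, a request that arrives between model_shutdown_begin and
model_shutdown_finish is rejected instead of tripping the assertion, and the
peer is expected to retry once teardown has completed.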

3. There is a race between setting and clearing the _CONNECTING bit in
rds_shutdown_worker and rds_connect_worker. In this case the connection is
never established, and conn_reset and ib_connect_raced keep growing. This
case occurs in about 90% of tests.
Patch 2/2 fixes this issue.
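
One common way to keep two workers from stepping on each other's state bits is
to make every state change a single compare-and-swap. The following is a
user-space sketch with C11 atomics, not the RDS code: the sm_* names and the
four states are assumptions made for illustration, and this is not
necessarily the approach taken in patch 2/2.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical connection states for the sketch. */
enum sm_state { SM_DOWN, SM_CONNECTING, SM_UP, SM_DISCONNECTING };

struct sm_conn {
    _Atomic int state;
};

/* Move from 'from' to 'to' only if the state is still 'from'. */
bool sm_transition(struct sm_conn *c, int from, int to)
{
    int expected = from;
    return atomic_compare_exchange_strong(&c->state, &expected, to);
}

/* Connect worker: may start only from a fully torn-down connection. */
bool sm_connect_worker_begin(struct sm_conn *c)
{
    return sm_transition(c, SM_DOWN, SM_CONNECTING);
}

/* Shutdown worker: claims the connection from UP or CONNECTING. */
bool sm_shutdown_worker_begin(struct sm_conn *c)
{
    return sm_transition(c, SM_UP, SM_DISCONNECTING) ||
           sm_transition(c, SM_CONNECTING, SM_DISCONNECTING);
}

/* Shutdown worker: teardown is complete, reconnecting is allowed again. */
void sm_shutdown_worker_finish(struct sm_conn *c)
{
    atomic_store(&c->state, SM_DOWN);
}
```

With this shape, the connect worker cannot mark the connection as connecting
while a shutdown is in progress, and the shutdown worker cannot clear a state
it did not claim.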

With these 2 patches, failover works in most cases. But sometimes I hit a
race on rdma_destroy_id in the rds_ib_conn_shutdown function.
I am now working to resolve this issue.


Zach,
OFED-1.2-rc5 should be released tomorrow, so I want to add these
patches to this OFED rc. We can remove or change them for the main OFED
release.

Regards,
Vladimir









