[rds-devel] Re: [PATCH 1/2] RDS/IB: Handle connection request in case of failover.
Vladimir Sokolovsky
vlad at dev.mellanox.co.il
Tue Jun 12 03:56:33 PDT 2007
Zach Brown wrote:
>
> On Jun 10, 2007, at 10:59 AM, Vladimir Sokolovsky wrote:
>
>> Hi Zach,
>> During the testing I get into the BUG_ON(ic->i_cm_id) (file:
>> net/rds/ib-cm.c line: 252)
>> So, checking _CONNECTED and _CONNECTING is not enough.
>
> Why not? That test for CONNECTING just above the BUG should see that an
> established connection already has resources, queue shutdown, and quit.
> Are you seeing this bug while running with your patch to clear
> CONNECTING as connections become established?
>
> - z
>
I think the following flow happens:
* Setup:
Node A, connected with 2 IB ports to the IB switch
Node B, connected with 2 IB ports to the same IB switch
* Flow:
Run rds-sink on node A
Run rds-gen on node B
Then disconnect the active IB port on node A
After ~20 sec, node A calls rds_shutdown_worker and then tries
to recreate the connection.
From here I get one of three flows:
1. The connection is established after a number of retries and the test
continues to run over the second IB port. This happens in about 5% of tests.
2. Node B, still not aware of the connection failure, gets
RDMA_CM_EVENT_CONNECT_REQUEST (both _CONNECTED and _CONNECTING are set)
and calls rds_shutdown_worker, which clears both _CONNECTED and
_CONNECTING. Node B may then get another RDMA_CM_EVENT_CONNECT_REQUEST
(from node A, whose previous request was rejected) before
rds_ib_conn_shutdown has finished, and so hits BUG_ON(ic->i_cm_id).
This case also occurs in about 5% of tests.
Patch 1/2 fixes this issue.
3. There is a race between setting and clearing the _CONNECTING bit in
rds_shutdown_worker and rds_connect_worker. In this case the connection
is never established, and the conn_reset and ib_connect_raced counters
keep growing. This case occurs in about 90% of tests.
Patch 2/2 fixes this issue.
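To make flows 2 and 3 easier to reason about, here is a small userspace
model of the ordering problem. It is only a sketch and is not the code from
either patch: struct model_conn, conn_transition, handle_connect_request and
finish_shutdown are hypothetical names, and the single atomic state word is
an assumption that replaces the separate _CONNECTED and _CONNECTING bits.
The idea it illustrates is that a connect request should only be accepted
after shutdown has released the old cm_id, and that state changes should be
compare-and-swap transitions so two workers cannot both win.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; the states only loosely mirror the RDS flag bits. */
enum conn_state { CONN_DOWN, CONN_CONNECTING, CONN_UP, CONN_SHUTTING_DOWN };
enum req_result { REQ_ACCEPTED, REQ_REJECTED };

struct model_conn {
    _Atomic int state;  /* one state word instead of two independent bits */
    void *cm_id;        /* stands in for ic->i_cm_id */
};

/* Move from 'from' to 'to' only if the state is still 'from'. */
static bool conn_transition(struct model_conn *c, int from, int to)
{
    return atomic_compare_exchange_strong(&c->state, &from, to);
}

/* Model of handling an incoming connect request. */
static enum req_result handle_connect_request(struct model_conn *c,
                                              void *new_cm_id)
{
    /* Only a fully torn-down connection may take a new request. */
    if (!conn_transition(c, CONN_DOWN, CONN_CONNECTING)) {
        /* An established connection starts shutting down; a connection
         * that is already connecting or shutting down is left alone. */
        conn_transition(c, CONN_UP, CONN_SHUTTING_DOWN);
        return REQ_REJECTED;
    }
    /* Stands in for BUG_ON(ic->i_cm_id): holds because finish_shutdown()
     * clears cm_id before the state becomes CONN_DOWN. */
    assert(c->cm_id == NULL);
    c->cm_id = new_cm_id;
    return REQ_ACCEPTED;
}

/* Model of the end of shutdown: release resources, then reopen. */
static void finish_shutdown(struct model_conn *c)
{
    c->cm_id = NULL;                     /* release the old id first */
    atomic_store(&c->state, CONN_DOWN);  /* only now accept new requests */
}
```

In this model a second request that arrives while shutdown is still running
is simply rejected, and the retry from node A succeeds once finish_shutdown
has run.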
With these 2 patches failover works in most cases. But sometimes
I hit a race on rdma_destroy_id in the rds_ib_conn_shutdown function.
I am now working on resolving this issue.
Zach,
OFED-1.2-rc5 should be released tomorrow, so I want to add these
patches to this OFED rc. We can remove or change them for the main OFED
release.
Regards,
Vladimir