[rds-devel] Re: [PATCH] workaround for RDMA error disconnects

Or Gerlitz ogerlitz at voltaire.com
Wed Jan 30 06:50:34 PST 2008


>     This is actually more of a workaround than a fix. When we tear down
>     a connection that errored out on a RDMA gone into the weeds (eg because
>     the remote key is no longer valid), we were seeing hangs in the shutdown
>     code, waiting for send WQEs to get flushed. This did not happen all the
>     time - so we add a timeout of 1 second and proceed. This is not a great
>     fix, but a practical one that keeps us going until we have a real fix.

>     Signed-off-by: Olaf Kirch <olaf.kirch at oracle.com>
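
The bounded wait that the commit message describes (wait for the send WQEs to be flushed, but give up after a timeout and proceed) can be modeled in userspace roughly as below. This is only a sketch under stated assumptions, not the code from the patch: struct fake_conn, its fields, and wait_for_flush() are hypothetical names invented for illustration, and the "HCA" is simulated by a counter that drains on every poll.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() and nanosleep() */
#endif
#include <stdbool.h>
#include <time.h>

/* Hypothetical model of a connection; not the real RDS structures. */
struct fake_conn {
	int pending_sends;   /* send work requests not yet flushed */
	int flush_per_poll;  /* simulated completions per poll; 0 models the hang */
};

/* Monotonic time in seconds, so wall-clock adjustments do not matter. */
static double now_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Wait until all pending sends are flushed, or until timeout_sec elapses.
 * Returns true if everything was flushed, false if we gave up.
 */
static bool wait_for_flush(struct fake_conn *c, double timeout_sec)
{
	const struct timespec pause = { 0, 1000000 };  /* 1 ms between polls */
	double deadline = now_seconds() + timeout_sec;

	while (c->pending_sends > 0) {
		if (now_seconds() >= deadline)
			return false;  /* timed out: caller proceeds with teardown */

		/* Simulate the hardware flushing some work requests. */
		c->pending_sends -= c->flush_per_poll;
		if (c->pending_sends < 0)
			c->pending_sends = 0;

		nanosleep(&pause, NULL);
	}
	return true;
}
```

With flush_per_poll set to 0 the function returns false once the timeout expires and leaves pending_sends untouched, which corresponds to the "proceed anyway" case in the description. A timeout_sec of 1.0 would match the 1 second mentioned above.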

Hi Olaf,

First, when you reported the problem to Roland, he asked which HCA and 
FW version you see this on. He also said that if it's ConnectX, it could 
be a FW issue, since it's quite a new product. I don't think you replied 
to his question, so can you run $ ibv_devinfo on your machines and send 
us the output?

Second, I'd like to reproduce this, so what rds-stress command line do 
I need to run in order to see these lost RDMAs?

> --- a/net/rds/ib_cm.c
> +++ b/net/rds/ib_cm.c
> @@ -512,35 +512,38 @@ void rds_ib_conn_shutdown(struct rds_connection *conn)

> +		/* Always move the QP to error state */
> +		if (ic->i_cm_id->qp) {
> +			qp_attr.qp_state = IB_QPS_ERR;
> +			err = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_STATE);
> +			if (err) {
> +				printk(KERN_WARNING "rds_ib_conn_shutdown: failed to"
> +					   " modify QP to ERR state: id %p qp %p err %d\n",
> +					   ic->i_cm_id, ic->i_cm_id->qp, err);
>  			}

Do you suspect that the RDMA-CM does not move the QP to the error state 
even when rdma_disconnect returned 0?

Or