[rds-devel] QP error event with RDS

Cristian Dittamo c.dittamo at list-group.com
Wed May 11 05:59:11 PDT 2011


Thank you Venkat.

I looked for the problem at the fabric level by running the perfquery and ibqueryerrors tools.

Their output follows. The counters suggest a cable or connection problem (for example, LinkDowned = 1).

I will try moving the cable to a different port on the switch.

 

[root@host1 RDSINFO]# perfquery
# Port counters: Lid 5 port 1
PortSelect:......................1
CounterSelect:...................0x1400
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................1
RcvErrors:.......................125
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................10
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
CounterSelect2:..................0x00
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................4294967295
RcvData:.........................4294967295
XmtPkts:.........................51939068
RcvPkts:.........................75491204
XmtWait:.........................2601824
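For anyone who wants to compare these counters across several runs, here is a minimal Python sketch that turns perfquery text into a dictionary and lists the non-zero error counters. The function names and the choice of which counters count as "errors" are my own assumptions, not part of the tool. Note that XmtData and RcvData both read 4294967295, which is 2^32 - 1, so those two counters have most likely saturated and carry no information until they are reset.

```python
import re

# Counter names as printed by perfquery above; treating exactly these as
# "error" counters is an assumption made for this sketch.
ERROR_COUNTERS = (
    "SymbolErrors", "LinkRecovers", "LinkDowned", "RcvErrors",
    "RcvRemotePhysErrors", "RcvSwRelayErrors", "XmtDiscards",
    "XmtConstraintErrors", "RcvConstraintErrors",
    "LinkIntegrityErrors", "ExcBufOverrunErrors", "VL15Dropped",
)

SATURATED_32BIT = 0xFFFFFFFF  # 4294967295, the maximum of a 32-bit counter

# Excerpt of the output above, used only to exercise the parser.
SAMPLE = """\
# Port counters: Lid 5 port 1
PortSelect:......................1
CounterSelect:...................0x1400
LinkDowned:......................1
RcvErrors:.......................125
XmtDiscards:.....................10
VL15Dropped:.....................0
XmtData:.........................4294967295
"""

LINE = re.compile(r"^(\w+):\.+(\S+)$")


def parse_perfquery(text):
    """Return {counter_name: int} for every 'Name:.....value' line."""
    counters = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            # base 0 accepts both decimal ("125") and hex ("0x1400") values
            counters[match.group(1)] = int(match.group(2), 0)
    return counters


def nonzero_errors(counters):
    """Keep only the error counters that are greater than zero."""
    return {n: counters[n] for n in ERROR_COUNTERS if counters.get(n, 0) > 0}


def saturated(counters):
    """Names of counters stuck at the 32-bit maximum."""
    return [n for n, v in counters.items() if v == SATURATED_32BIT]


if __name__ == "__main__":
    parsed = parse_perfquery(SAMPLE)
    print(nonzero_errors(parsed))
    print(saturated(parsed))
```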

 

[root@host1 conf]# ibqueryerrors -r
Suppressing:
Errors for 0x1e8c0000dc93fb "host3 HCA-1"
   GUID 0x1e8c0000dc93fb port 1: [XmtDiscards == 1] [XmtWait == 4969894]
       Link info:      4   1[  ] ==( 4X  5.0 Gbps Active/  LinkUp)==>  0x000b8cffff004944      2   12[  ] "MT47396 Infiniscale-III Mellanox Technologies" ( )
Errors for 0x1fc6000004d976 "host4 HCA-1"
   GUID 0x1fc6000004d976 port 1: [XmtDiscards == 1] [XmtWait == 5821946]
       Link info:      3   1[  ] ==( 4X  5.0 Gbps Active/  LinkUp)==>  0x000b8cffff004944      2   11[  ] "MT47396 Infiniscale-III Mellanox Technologies" ( )
Errors for 0x1fc600000587b6 "host2 HCA-1"
   GUID 0x1fc600000587b6 port 1: [LinkDowned == 1] [RcvErrors == 125] [XmtDiscards == 10] [XmtWait == 2601824]
       Link info:      5   1[  ] ==( 4X  5.0 Gbps Active/  LinkUp)==>  0x000b8cffff004944      2    2[  ] "MT47396 Infiniscale-III Mellanox Technologies" ( )
Errors for 0xb8cffff004944 "MT47396 Infiniscale-III Mellanox Technologies"
   GUID 0xb8cffff004944 port ALL: [LinkRecovers == 13] [LinkDowned == 6] [RcvSwRelayErrors == 492] [XmtDiscards == 1]
   GUID 0xb8cffff004944 port 1: [RcvSwRelayErrors == 139]
       Link info:      2   1[  ] ==( 4X  5.0 Gbps Active/  LinkUp)==>  0x001e8c0000dc942f      1    1[  ] "host1 HCA-1" ( )
   GUID 0xb8cffff004944 port 2: [LinkRecovers == 13] [LinkDowned == 6] [RcvSwRelayErrors == 155]
       Link info:      2   2[  ] ==( 4X  5.0 Gbps Active/  LinkUp)==>  0x001fc600000587b6      5    1[  ] "host2 HCA-1" ( )
   GUID 0xb8cffff004944 port 11: [RcvSwRelayErrors == 97] [XmtDiscards == 1]
       Link info:      2  11[  ] ==( 4X  5.0 Gbps Active/  LinkUp)==>  0x001fc6000004d976      3    1[  ] "host4 HCA-1" ( )
   GUID 0xb8cffff004944 port 12: [RcvSwRelayErrors == 101]
       Link info:      2  12[  ] ==( 4X  5.0 Gbps Active/  LinkUp)==>  0x001e8c0000dc93fb      4    1[  ] "host3 HCA-1" ( )
Errors for 0x1e8c0000dc942f "host1 HCA-1"
   GUID 0x1e8c0000dc942f port 1: [XmtDiscards == 1] [XmtWait == 6270721]
       Link info:      1   1[  ] ==( 4X  5.0 Gbps Active/  LinkUp)==>  0x000b8cffff004944      2    1[  ] "MT47396 Infiniscale-III Mellanox Technologies" ( )
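To see at a glance which ports report link-level trouble, the ibqueryerrors report can be reduced with a short script. The sketch below is only an illustration: the regular expressions are fitted to the output format shown above, and the function names and the set of counters treated as link-related are my own choices. On the data above it would single out port 1 of host2's HCA and port 2 of the switch, which are the two ends of the same cable.

```python
import re

# Matches lines such as:
#   GUID 0x1fc600000587b6 port 1: [LinkDowned == 1] [RcvErrors == 125]
PORT_LINE = re.compile(r"GUID (0x[0-9a-fA-F]+) port (\w+): (.*)")
COUNTER = re.compile(r"\[(\w+) == (\d+)\]")

# Which counters point at the physical link is an assumption of this sketch.
LINK_COUNTERS = ("LinkDowned", "LinkRecovers", "RcvErrors", "SymbolErrors")

# Excerpt of the report above, used only to exercise the parser.
SAMPLE = """\
   GUID 0x1e8c0000dc93fb port 1: [XmtDiscards == 1] [XmtWait == 4969894]
   GUID 0x1fc600000587b6 port 1: [LinkDowned == 1] [RcvErrors == 125] [XmtDiscards == 10] [XmtWait == 2601824]
   GUID 0xb8cffff004944 port 2: [LinkRecovers == 13] [LinkDowned == 6] [RcvSwRelayErrors == 155]
   GUID 0xb8cffff004944 port 12: [RcvSwRelayErrors == 101]
"""


def parse_ibqueryerrors(text):
    """Return {(guid, port): {counter: value}} for every 'GUID ... port ...' line."""
    report = {}
    for line in text.splitlines():
        match = PORT_LINE.search(line)
        if not match:
            continue  # skip 'Errors for ...' and 'Link info: ...' lines
        guid, port, rest = match.groups()
        report[(guid, port)] = {n: int(v) for n, v in COUNTER.findall(rest)}
    return report


def link_trouble(report):
    """Keep only ports with a non-zero link-related counter."""
    flagged = {}
    for key, counters in report.items():
        hits = {n: counters[n] for n in LINK_COUNTERS if counters.get(n, 0) > 0}
        if hits:
            flagged[key] = hits
    return flagged


if __name__ == "__main__":
    for (guid, port), hits in link_trouble(parse_ibqueryerrors(SAMPLE)).items():
        print(guid, "port", port, hits)
```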

 

From: Venkat Venkatsubra [mailto:venkat.x.venkatsubra at oracle.com] 
Sent: Wednesday, May 11, 2011 2:50 PM
To: c.dittamo at list-group.com
Cc: rds-devel at oss.oracle.com
Subject: Re: [rds-devel] QP error event with RDS

 

Hello,

 

Status 12 is IB_WC_RETRY_EXC_ERR (see enum ib_wc_status in include/rdma/ib_verbs.h).

The IB transport layer retried the transmission a number of times without receiving an acknowledgement, then gave up.
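As a quick way to translate the number in the RDS log message into a name, here is a small Python sketch. The table is copied by hand from the first entries of enum ib_wc_status and is deliberately incomplete, so please check it against the ib_verbs.h of your own kernel. The helper name and the regular expression are assumptions fitted to the log line quoted in this thread.

```python
import re

# First entries of enum ib_wc_status (include/rdma/ib_verbs.h), in order.
# Hand-copied excerpt for illustration; verify against your kernel headers.
IB_WC_STATUS = (
    "IB_WC_SUCCESS",            # 0
    "IB_WC_LOC_LEN_ERR",        # 1
    "IB_WC_LOC_QP_OP_ERR",      # 2
    "IB_WC_LOC_EEC_OP_ERR",     # 3
    "IB_WC_LOC_PROT_ERR",       # 4
    "IB_WC_WR_FLUSH_ERR",       # 5
    "IB_WC_MW_BIND_ERR",        # 6
    "IB_WC_BAD_RESP_ERR",       # 7
    "IB_WC_LOC_ACCESS_ERR",     # 8
    "IB_WC_REM_INV_REQ_ERR",    # 9
    "IB_WC_REM_ACCESS_ERR",     # 10
    "IB_WC_REM_OP_ERR",         # 11
    "IB_WC_RETRY_EXC_ERR",      # 12: transport retry count exceeded
    "IB_WC_RNR_RETRY_EXC_ERR",  # 13: receiver-not-ready retry count exceeded
)

STATUS_IN_LOG = re.compile(r"had status (\d+)")


def status_name(log_line):
    """Return the symbolic name for the status in an RDS/IB log line, or None."""
    match = STATUS_IN_LOG.search(log_line)
    if not match:
        return None
    code = int(match.group(1))
    if code < len(IB_WC_STATUS):
        return IB_WC_STATUS[code]
    return "unknown status %d (not in this excerpt)" % code


if __name__ == "__main__":
    line = ("RDS/IB: send completion on address had status 12, "
            "disconnecting and reconnecting")
    print(status_name(line))
```

The distinction between 12 and 13 matters here: a status of 13 would point at the receiver having no buffers posted, while 12 means the packets were not acknowledged at all.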

 

Can you mail me the rds-info output from both the sending and the receiving side (a snapshot before and after the run)?

Did the receiving side have buffers posted to receive?

 

Venkat


----- Original Message -----
From: c.dittamo at list-group.com
To: rds-devel at oss.oracle.com
Sent: Wednesday, May 11, 2011 5:38:42 AM GMT -06:00 US/Canada Central
Subject: [rds-devel] QP error event with RDS

Hi, 

I am hitting the following QP error

RDS/IB: send completion on address had status 12, disconnecting and reconnecting

while running my application on RHEL 5.5 (kernel 2.6.32.32).

My application is a 4-node distributed client-server program that uses only the basic RDS features (i.e. sendmsg and recvmsg), without RDMA. I checked all cable connections and they are fine. All IB (Mellanox) drivers are loaded.

Any ideas why RDS returns this error?

Thank you.