[rds-devel] Re: [PATCH] rds-stress: always send up-to-date sequence number

Thu Jan 3 14:06:54 PST 2008

On Thursday 03 January 2008 22:37, Olaf Kirch wrote:
> From: Olaf Kirch <olaf.kirch at oracle.com>
> 
> rds-stress: always send up-to-date sequence number
> 
> With the recent changes in support of RDMA, ACK packets would sometimes
> get sent with the wrong sequence number. This patch makes sure we always
> put the current seqno in the header.

This patch fixes rds-stress bailing out when run with several threads
in the non-RDMA case. Embarrassing bug, so it would certainly be
nice to have this in the git tree :-)

However, rds-stress still barfs when I run with -D65536 -d2 -t 32. It seems
the failure scenario goes like this:

 -	we happily send stuff
 -	then we run out of MRs, presumably because we simply allocated
	too many of them.
	Bug #1: rds-stress exits when it encounters this
 -	But then something else happens, which is rather strange. The
	kernel log says

	RDS/IB: map_fmr failed (errno=-11)
	RDS/IB: map_fmr failed (errno=-11)
	RDS/ib: unhandled QP event 3 on connection to 10.1.2.98
	RDS/IB: completion on 10.1.2.98 had status 5, disconnecting and reconnecting

	If I'm not mistaken, event 3 is IB_EVENT_QP_ACCESS_ERR.
	What are we supposed to do when we see this event? Is there a way
	to handle this gracefully?

 -	Then things go downhill quite rapidly. Most if not all rds-stress
	threads die because of a missing RDS message:

	An incoming message had a header which
	didn't contain the fields we expected:
	    member        expected eq             got
	       seq              12 !=              13
	 from_addr       10.1.2.98  =       10.1.2.98
	 from_port            4010  =            4010
	   to_addr       10.1.2.98  =       10.1.2.98
	   to_port            4029  =            4029
	     index               0  =               0
	        op               1  =               1

	As you can see, the sequence number skipped...

	This does not seem to be a numbering problem on the send side,
	but rather related to the FMR EAGAIN problem (the errno=-11 from
	map_fmr above). This message corruption occurs *only* when I choose
	a thread count large enough to trigger the FMR EAGAIN
	(-D65536 -d2 -t20 works, -t21 doesn't).

	So my initial suspicion was that these messages get dropped quietly
	by the RDS kernel code when the connection bounces, which would be
	rather bad. But after staring at this a little longer, this does not
	seem to be the case, as everything else matches - in particular, the
	index field should be different if we had indeed lost a message.
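
For Bug #1, one option would be for rds-stress to back off and retry
rather than exit when it runs out of MRs. Below is a minimal sketch of
that idea. It assumes the failure reaches user space as EAGAIN, and it
is written in Python for brevity rather than in the C of rds-stress
itself. The names retry_on_eagain and make_flaky_op and the "mr-cookie"
value are made up for illustration; none of them exist in rds-stress or
in the RDS API.

```python
import errno
import time


def retry_on_eagain(op, max_tries=5, delay=0.001):
    """Call op() until it succeeds or fails with something other than EAGAIN.

    op is a hypothetical callable standing in for whatever call maps an MR.
    Returns (result, number_of_tries). Re-raises the OSError if the error is
    not EAGAIN, or if max_tries attempts have all failed.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        try:
            return op(), attempt
        except OSError as e:
            # Only EAGAIN is treated as transient; anything else is fatal.
            if e.errno != errno.EAGAIN or attempt == max_tries:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts


def make_flaky_op(failures):
    """Fake MR mapping: raises EAGAIN `failures` times, then succeeds."""
    state = {"left": failures}

    def op():
        if state["left"] > 0:
            state["left"] -= 1
            raise OSError(errno.EAGAIN, "no MR available (simulated)")
        return "mr-cookie"  # placeholder for a real MR handle

    return op


if __name__ == "__main__":
    result, tries = retry_on_eagain(make_flaky_op(2))
    print(result, "after", tries, "tries")
```

Whether retrying is the right policy depends on how quickly MRs get
freed again. If they are never released, the retries only delay the
exit.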

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir at lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax