[rds-devel] Re: [PATCH] rds-stress: always send up-to-date sequence number
Olaf Kirch
olaf.kirch at oracle.com
Thu Jan 3 14:06:54 PST 2008
On Thursday 03 January 2008 22:37, Olaf Kirch wrote:
> From: Olaf Kirch <olaf.kirch at oracle.com>
>
> rds-stress: always send up-to-date sequence number
>
> With the recent changes in support of RDMA, ACK packets would sometimes
> get sent with the wrong sequence number. This patch makes sure we always
> put the current seqno in the header.

This patch fixes the problem of rds-stress bailing out when run with several
threads in the non-RDMA case. Embarrassing bug, so it would certainly be
nice to have this in the git tree :-)

However, rds-stress still barfs when I run with -D65536 -d2 -t 32. It seems
the failure scenario goes like this:

 - we happily send stuff
 - then we run out of MRs, supposedly because we just allocated
   too many of them.

   Bug #1: rds-stress exits when it encounters this

 - But then something else happens, which is rather strange. The
   kernel log says

	RDS/IB: map_fmr failed (errno=-11)
	RDS/IB: map_fmr failed (errno=-11)
	RDS/ib: unhandled QP event 3 on connection to 10.1.2.98
	RDS/IB: completion on 10.1.2.98 had status 5, disconnecting and reconnecting

   If I'm not mistaken, event 3 is IB_EVENT_QP_ACCESS_ERR.
   What are we supposed to do when we see this event? Is there a way
   to handle this gracefully?

 - Then things go downhill quite rapidly. Most if not all rds-stress
   threads die because of a missing RDS message:

	An incoming message had a header which
	didn't contain the fields we expected:
	    member        expected eq             got
	       seq              12 !=              13
	 from_addr       10.1.2.98  =       10.1.2.98
	 from_port            4010  =            4010
	   to_addr       10.1.2.98  =       10.1.2.98
	   to_port            4029  =            4029
	     index               0  =               0
	        op               1  =               1

As you can see, the sequence number skipped...

This does not seem to be a numbering problem on the send side,
but rather related to the FMR EAGAIN problem. This message corruption
occurs *only* when I choose a thread count large enough to trigger
the FMR EAGAIN (-D65536 -d2 -t20 works, -t21 doesn't).

So my initial suspicion was that these messages get dropped quietly
by the RDS kernel code when the connection bounces. Which would be
rather bad. But after staring at this a little longer, this does not
seem to be the case, as everything else matches - in particular, the
index field should be different if we indeed lost a message.

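
For reference, the numeric codes in the kernel log quoted above can be decoded with a small lookup table. This is only a sketch, not code from rds-stress or the RDS module: the tables assume the enum ordering in the kernel's include/rdma/ib_verbs.h (only the leading entries are listed), and the helper names decode_qp_event, decode_wc_status and is_eagain are made up for illustration.

```c
#include <errno.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Leading entries of enum ib_event_type (assumed order, see above). */
static const char *const ib_event_names[] = {
	"IB_EVENT_CQ_ERR",		/* 0 */
	"IB_EVENT_QP_FATAL",		/* 1 */
	"IB_EVENT_QP_REQ_ERR",		/* 2 */
	"IB_EVENT_QP_ACCESS_ERR",	/* 3 */
};

/* Leading entries of enum ib_wc_status (assumed order, see above). */
static const char *const ib_wc_status_names[] = {
	"IB_WC_SUCCESS",		/* 0 */
	"IB_WC_LOC_LEN_ERR",		/* 1 */
	"IB_WC_LOC_QP_OP_ERR",		/* 2 */
	"IB_WC_LOC_EEC_OP_ERR",		/* 3 */
	"IB_WC_LOC_PROT_ERR",		/* 4 */
	"IB_WC_WR_FLUSH_ERR",		/* 5 */
};

/* Return the table entry for code, or a marker if it is out of range. */
static const char *lookup_name(const char *const *table, size_t n, int code)
{
	if (code < 0 || (size_t)code >= n)
		return "(not in table)";
	return table[code];
}

/* Hypothetical helper: name for the N in "unhandled QP event N". */
const char *decode_qp_event(int event)
{
	return lookup_name(ib_event_names, ARRAY_SIZE(ib_event_names), event);
}

/* Hypothetical helper: name for the N in "completion ... had status N". */
const char *decode_wc_status(int status)
{
	return lookup_name(ib_wc_status_names,
			   ARRAY_SIZE(ib_wc_status_names), status);
}

/* The kernel reports errors as negative errno values, so the
 * "errno=-11" in the map_fmr message is -EAGAIN on Linux. */
int is_eagain(int kernel_err)
{
	return kernel_err == -EAGAIN;
}
```

Under these assumptions, decode_qp_event(3) gives IB_EVENT_QP_ACCESS_ERR and decode_wc_status(5) gives IB_WC_WR_FLUSH_ERR, which would fit work requests being flushed after the QP has gone into the error state.
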
Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax