[rds-devel] Re: trying to reproduce the crash
Olaf Kirch
olaf.kirch at oracle.com
Sun Feb 3 23:47:41 PST 2008
On Sunday 03 February 2008 12:02, Or Gerlitz wrote:
> I did some rds-stress runs as above over the device you are using
> MT25204 (but with the latest firmware 1.2.0) both as client and server
> and I don't manage to reproduce a crash. The code is ofed 1.3 rc3, the
> second node is connectx. So, it would be best if you can share a script
> that reproduces the crash when it is run.
Hrm, very strange.
Okay, so here's what I do. On one side (call it host_a), I do something like
    while true; do
        rds-stress -R -r $host_a -p 4000
    done
On the other side (host_b) I do this:
    while sleep 1; do
        rmmod rds
        sleep 1
        insmod rds.ko
        rds-stress -R -r $host_b -s $host_a -p 4000 -c -d4 -t32 -T3 -D64k
    done
This reproduces the crash in 5-20 minutes. It's possible that this requires SMP
machines on both ends; this setup has a 4-node and a 2-node machine, both 64-bit
Xeons, and both with 1GB of RAM.
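Since the box goes down when this triggers, it can help to leave a marker on
disk for each pass of the host_b loop, so you can tell afterwards how many
iterations it took. A rough sketch only; the helper name log_iteration and the
log file path are made up for illustration and are not part of rds-tools:

```shell
# Hypothetical helper: append a timestamped iteration marker to a log
# file and flush it to disk so the record survives a crash.
#   $1 = log file, $2 = iteration number
log_iteration() {
    printf '%s iteration %s\n' "$(date '+%F %T')" "$2" >> "$1"
    # Force buffered data out before the next rmmod/insmod cycle.
    sync
}
```

Call it at the top of each pass, e.g. "log_iteration /var/tmp/rds-repro.log $n"
with a counter n that you increment yourself.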
> Other than not crashing, I did see some problems, specifically, atomic
> order zero page allocation failure in rds_ib_recv_refill
Yeah, that's one of the uglier parts of RDS. It wants to refill the recv ring
from the recv cq handler.
> and also reports on wrong sequence number in the client side
>
> > An incoming message had a header which
> > didn't contain the fields we expected:
> > member expected eq got
> > seq 14 != 15
> > from_addr 192.168.10.85 = 192.168.10.85
> > from_port 4003 = 4003
> > to_addr 192.168.10.85 = 192.168.10.85
> > to_port 4044 = 4044
> > index 3 = 3
> > op 1 = 1
> > header from 192.168.10.85:4003 to id 4044 bogus
That is a symptom of RDMA operations getting dropped on the floor, usually when
the connection is dropped and re-established. Is there a message in syslog
that coincides with this?
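To line these reports up with syslog it is handy to pull out only the fields
that differ. A minimal sketch, assuming the dump keeps the four-column layout
quoted above (member, expected, eq, got) and that any "> " quote prefixes have
been stripped; the function name mismatched_fields is just for illustration:

```shell
# Print only the header fields whose "eq" column is "!=".
# Assumes the four-column layout from the quoted dump:
#   member expected eq got
# Reads the files given as arguments, or stdin if none are given.
mismatched_fields() {
    awk '$3 == "!=" { printf "%s expected=%s got=%s\n", $1, $2, $4 }' "$@"
}
```

Feed it the saved rds-stress output, e.g. "mismatched_fields stress.log"; lines
where the third column is "=" are skipped.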
Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax