[rds-devel] Re: trying to reproduce the crash

Olaf Kirch olaf.kirch at oracle.com
Sun Feb 3 23:47:41 PST 2008


On Sunday 03 February 2008 12:02, Or Gerlitz wrote:
> I did some rds-stress runs as above over the device you are using,
> MT25204 (but with the latest firmware, 1.2.0), both as client and server,
> and I could not reproduce a crash. The code is OFED 1.3 rc3; the
> second node is ConnectX. So it would be best if you could share a script
> that reproduces the crash when run.

Hrm, very strange.

Okay, so here's what I do. On one side (call it host_a), I do something like

	while true; do
		rds-stress -R -r $host_a -p 4000
	done

On the other side (host_b) I do this:

	while sleep 1; do
		rmmod rds
		sleep 1
		insmod rds.ko
		rds-stress -R -r $host_b -s $host_a -p 4000 -c -d4 -t32 -T3 -D64k
	done

This reproduces the crash in 5-20 minutes. It's possible that this requires SMP
machines on both ends; this setup has a 4-node and a 2-node machine, both 64-bit
Xeons, and both with 1GB of RAM.
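
For anyone trying to reproduce this, the host_b loop above could be wrapped so
that each iteration leaves a timestamp behind, which makes it easier to match a
crash against syslog later. This is only a sketch: LOGFILE, RUN_REPRO, log_line
and stress_cmd are hypothetical names, not part of rds-tools, and host_a and
host_b are assumed to be set in the environment as in the loops above. The
rds-stress flags are copied unchanged from the host_b loop.

```shell
#!/bin/sh
# Sketch only: the host_b reload/stress loop with a timestamp per iteration.
# LOGFILE, RUN_REPRO, log_line and stress_cmd are hypothetical names.

LOGFILE=${LOGFILE:-/tmp/rds-repro.log}

# Print the rds-stress command line used on host_b.
# $1 = local address (host_b), $2 = peer address (host_a)
stress_cmd() {
	echo "rds-stress -R -r $1 -s $2 -p 4000 -c -d4 -t32 -T3 -D64k"
}

# Append one timestamped line to the log file.
log_line() {
	echo "$(date '+%Y-%m-%d %H:%M:%S') $*" >> "$LOGFILE"
}

# Only run the loop when explicitly asked to; rmmod/insmod need root.
if [ "${RUN_REPRO:-0}" = 1 ]; then
	n=0
	while sleep 1; do
		n=$((n + 1))
		log_line "iteration $n: reloading rds"
		rmmod rds
		sleep 1
		insmod rds.ko
		log_line "iteration $n: starting rds-stress"
		$(stress_cmd "$host_b" "$host_a")
	done
fi
```

Run it as root with RUN_REPRO=1 from the directory holding rds.ko. If the
machine goes down hard, the last log lines may not reach the disk, so a log
file on a remote filesystem or a serial console is probably the safer choice.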

> Other than not crashing, I did see some problems, specifically, atomic 
> order zero page allocation failure in rds_ib_recv_refill

Yeah, that's one of the uglier parts of RDS. It wants to refill the recv ring
from the recv cq handler.

> and also reports of a wrong sequence number on the client side
> 
> > An incoming message had a header which
> > didn't contain the fields we expected:
> >     member        expected eq             got
> >        seq              14 !=              15
> >  from_addr   192.168.10.85  =   192.168.10.85
> >  from_port            4003  =            4003
> >    to_addr   192.168.10.85  =   192.168.10.85
> >    to_port            4044  =            4044
> >      index               3  =               3
> >         op               1  =               1
> > header from 192.168.10.85:4003 to id 4044 bogus

That is a symptom of RDMA operations getting dropped on the floor, usually when
the connection is dropped and re-established. Is there a message in syslog
that coincides with this?
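
One way to check for such a message is to pull the syslog lines from the minute
in which rds-stress printed the error and keep those that mention RDS. The
helper below is a rough sketch, not a tool that exists anywhere: syslog_at is a
hypothetical name, and both the default path /var/log/messages and the
traditional "Mon DD HH:MM" timestamp prefix are assumptions about how the host
logs.

```shell
#!/bin/sh
# Sketch only: list syslog lines containing a given timestamp prefix that
# also mention "rds" (case-insensitive). syslog_at is a hypothetical helper.

# $1 = timestamp prefix, e.g. "Feb  3 12:02"
# $2 = log file (optional; /var/log/messages is an assumed default)
syslog_at() {
	grep -F -- "$1" "${2:-/var/log/messages}" | grep -i 'rds'
}
```

For example, syslog_at "Feb  3 12:02" would show RDS-related lines logged
during that minute; note the two spaces before a single-digit day in the
traditional syslog format.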

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir at lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax


