[rds-devel] Re: trying to reproduce the crash

Sun Feb 3 23:56:10 PST 2008

Olaf Kirch wrote:
> Okay, so here's what I do. On one side (call it host_a), I do something like
> 	while true; do
> 		rds-stress -R -r $host_a -p 4000
> 	done
> 
> On the other side (host_b) I do this:
> 
> 	while sleep 1; do
> 		rmmod rds
> 		sleep 1
> 		insmod rds.ko
> 		rds-stress -R -r $host_b -s $host_a -p 4000 -c -d4 -t32 -T3 -D64k
> 	done
> 
> This reproduces the crash in 5-20 minutes. It's possible that this requires SMP
> machines on both ends; this setup has a 4-node and a 2-node machine, both 64bit
> Xeons, and both with 1GB of RAM.

Okay, I will try the exact scripts. You also use -R and -c on the client
side, which I don't, so I will add them as well. My nodes have two CPUs,
each with four cores, and 2GB RAM, so it's a somewhat different
configuration in that respect.

>> and also reports on wrong sequence number in the client side
>>> An incoming message had a header which
>>> didn't contain the fields we expected:
>>>     member        expected eq             got
>>>        seq              14 !=              15
>>> header from 192.168.10.85:4003 to id 4044 bogus

> That is a symptom of RDMA operations getting dropped on the floor, usually when
> the connection is dropped and re-established. Is there a message in syslog
> that coincides with this?

Not sure I follow here. Do you mean a message from the rds kernel module?
There are plenty of messages about QP error 3, recv completion with error
10, 5, 4, etc. Do you mean these messages?
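
If those are the ones, a rough way to pull them out of the kernel log and
line them up against the time of the seq mismatch could be something like
the sketch below. The helper name and the grep patterns are only my guess,
based on the messages mentioned above, not the module's exact wording:

```shell
# Sketch only: keep kernel log lines that look RDS/QP related.
# filter_rds_msgs is a hypothetical helper; the patterns are assumptions
# and will need adjusting to the real message text.
filter_rds_msgs() {
	grep -i -E 'rds|qp error|completion with error'
}

# Usage, on the node that reported the bogus header:
#	dmesg | filter_rds_msgs | tail -n 50
```
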

Or.