[rds-devel] Re: trying to reproduce the crash

Or Gerlitz ogerlitz at voltaire.com
Sun Feb 3 23:56:10 PST 2008


Olaf Kirch wrote:
> Okay, so here's what I do. On one side (call it host_a), I do something like
> 	while true; do
> 		rds-stress -R -r $host_a -p 4000
> 	done
> 
> On the other side (host_b) I do this:
> 
> 	while sleep 1; do
> 		rmmod rds
>     		sleep 1
>     		insmod rds.ko
> 		rds-stress -R -r $host_b -s $host_a -p 4000 -c -d4 -t32 -T3 -D64k
> 	done
> 
> This reproduces the crash in 5-20 minutes. It's possible that this requires SMP
> machines on both ends; this setup has a 4-node and a 2-node machine, both 64bit
> Xeons, and both with 1GB of RAM.

Okay, I will try these exact scripts. You also use -R and -c on the client 
side, which I don't, so I will add them as well. My nodes have two CPUs, 
each with four cores, and 2GB RAM, so it's a somewhat different 
configuration in that respect.


>> and also reports on wrong sequence number in the client side
>>> An incoming message had a header which
>>> didn't contain the fields we expected:
>>>     member        expected eq             got
>>>        seq              14 !=              15
>>> header from 192.168.10.85:4003 to id 4044 bogus

> That is a symptom of RDMA operations getting dropped on the floor, usually when
> the connection is dropped and re-established. Is there a message in syslog
> that coincides with this?

Not sure I follow here. Do you mean a message from the rds kernel module? 
There are plenty of messages about QP error 3, recv completions with error 
10, 5, 4, etc. Are these the messages you mean?

Or.