[Ocfs2-users] Upgrading debian etch to lenny causes host crash in a vmware.server 2.0 environment

Mon Mar 30 16:47:30 PDT 2009

Set up netconsole to capture the oops log.

The messages you have provided show only the node death detection,
not the cause of the node death. The netconsole log from the
oopsed node will tell us why it oopsed.
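
netconsole sends kernel messages as plain UDP datagrams, so anything
listening on the target port of another machine can record them. A
minimal sketch of such a receiver, assuming the oopsing node sends to
UDP port 6666 (netconsole's default target port); the function names
and the log file name are made up for illustration:

```python
import socket


def record_datagrams(sock, log, max_datagrams=None):
    """Append each datagram received on `sock` to the binary file `log`.

    Stops after `max_datagrams` datagrams (if given) or when the socket's
    timeout expires. Returns the number of datagrams written.
    """
    count = 0
    while max_datagrams is None or count < max_datagrams:
        try:
            data, _sender = sock.recvfrom(65535)
        except socket.timeout:
            break
        log.write(data)
        log.flush()  # keep the file current while the sender is dying
        count += 1
    return count


def listen_forever(port=6666, out_path="netconsole.log"):
    """Bind to `port` on all interfaces and log until interrupted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", port))
        with open(out_path, "ab") as log:
            record_datagrams(sock, log)
```

Calling listen_forever() on a machine that stays up should be enough
to collect the trace. On the sending node the module takes a parameter
of the form netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr];
the values to fill in depend on your network.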

Christoph Ackermann wrote:
> Hello.
>
> We have run a ten-host cluster for a vmware-server 2.0.0 environment
> (development and QA) on the Debian etch 2.6.18-6 kernel (32- and 64-bit
> versions) on different hardware with absolutely no crashes for almost
> two years. Most servers are based on TYAN TA 26 with dual quad-core
> Xeons and 32 GB RAM, plus a few TA24 with Opterons. The heartbeat network
> is completely separate, with its own HP switch. Our SAN is a good ol'
> ESA12000, a SUN 2540, Brocade switches and Emulex LPe1150 HBAs.
>
> Last weekend I did a general upgrade from Debian etch to lenny with the
> 2.6.26 kernel and OCFS2 V1.4.1-1, and everything looked good at first.
> SAN/OCFS2 performance is very good and multipathing works fine. We
> can copy, move and stat files under very heavy load and everything works.
> BUT if we start one or a few VMs (2-3 GB each) on a host, it crashes
> immediately or after a while, even if the VMs are idling or produce
> only a small amount of I/O.
>
> Today I did some tests with only three nodes and one LUN formatted
> with OCFS2 V1.4 (and a LUN with FS v1.2 too). After copying some files
> into a VM (W2K3) we observed a node shutdown and eviction:
>
> Mar 30 16:53:31 vmserver1 -- MARK --
> Mar 30 17:08:49 vmserver1 kernel: [225184.104062] o2net: connection to 
> node vmserver0 (num 0) at 10.17.93.100:7777 has been idle for 120.0 
> seconds, shutting it down.
> Mar 30 17:08:49 vmserver1 kernel: [225184.104137] 
> (0,0):o2net_idle_timer:1468 here are some times that might help debug 
> the situation: (tmr 1238425609.435866 now 1238425729.433160 dr 
> 1238425609.435849 adv 1238425609.435873:1238425609.435874 func 
> (70f40876:504) 1238425539.438056:1238425539.438068)
> Mar 30 17:08:49 vmserver1 kernel: [225184.104052] o2net: no longer 
> connected to node vmserver0 (num 0) at 10.17.93.100:7777
> Mar 30 17:10:52 vmserver1 kernel: [225327.140209] 
> (5871,0):o2dlm_eviction_cb:258 o2dlm has evicted node 0 from group 
> FF8AF63DC502444687E77BB08635D07C
> Mar 30 17:10:53 vmserver1 kernel: [225329.019274] 
> (5812,0):o2dlm_eviction_cb:258 o2dlm has evicted node 0 from group 
> FF8AF63DC502444687E77BB08635D07C
> Mar 30 17:23:03 vmserver1 kernel: [226171.570861] o2net: connected to 
> node vmserver0 (num 0) at 10.17.93.100:7777
> Mar 30 17:23:05 vmserver1 kernel: [226172.945357] ocfs2_dlm: Node 0 
> joins domain FF8AF63DC502444687E77BB08635D07C
> Mar 30 17:23:05 vmserver1 kernel: [226172.945357] ocfs2_dlm: Nodes in 
> domain ("FF8AF63DC502444687E77BB08635D07C"): 0 1 5
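
The timestamps in the o2net_idle_timer line are epoch seconds. As a
quick check, assuming "tmr" is when the idle timer was last armed and
"now" is when it fired (my reading of the field names, not something
the log itself states), the gap between them should match the idle
timeout:

```python
# Values copied from the o2net_idle_timer line in the log above.
tmr = 1238425609.435866  # assumed: time the idle timer was last armed
now = 1238425729.433160  # assumed: time the idle timer fired

idle_seconds = now - tmr
print(f"idle for {idle_seconds:.1f} s")  # prints "idle for 120.0 s"
```

That agrees with the "idle for 120.0 seconds" message and with
O2CB_IDLE_TIMEOUT_MS=120000, so the surviving node heard nothing from
vmserver0 for the full timeout.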
>
>
>
> The vmware-server configuration has not changed; only the network
> devices were rebuilt by the vmware-config.pl tool against the current
> kernel headers. We have absolutely no idea why this happens, and we are
> unable to work until this system runs the way it always has. We moved some
> VMs to local disks because I don't want to get strangled by my colleagues... ;-)
>
> Currently we have three LUNs mapped:
> ESA12000 with LVM2, FS: 4 KB blocks and a cluster size of 128 KB (running
> for years with 1.2 FS "default"...)
> SUNVOL01 and SUNVOL02, FS: 4 KB blocks and a cluster size of 512 KB (running
> for some weeks and now with V1.4 FS "max-feature")
>
>
> A second issue is that kernel oopses don't turn into kernel panics:
>
> /proc/sys/kernel/panic_on_oops is "1" and /proc/sys/kernel/panic is
> "30", as the user manual advises, BUT the kernel hangs on an oops and no
> reboot happens...
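
To rule out a node whose settings differ from the others, the two
values can be read back on each host. A small sketch, assuming the
usual /proc/sys/kernel layout; the helper names are illustrative only:

```python
from pathlib import Path


def read_panic_settings(sysctl_dir="/proc/sys/kernel"):
    """Return (panic_on_oops, panic) as integers read from sysctl_dir."""
    base = Path(sysctl_dir)
    panic_on_oops = int((base / "panic_on_oops").read_text().strip())
    panic_timeout = int((base / "panic").read_text().strip())
    return panic_on_oops, panic_timeout


def reboots_after_oops(panic_on_oops, panic_timeout):
    """True if an oops is escalated to a panic and the panic has a
    positive reboot delay in seconds."""
    return panic_on_oops == 1 and panic_timeout > 0
```

If this reports the expected 1 and 30 on every node and a host still
hangs instead of rebooting, the settings themselves are not the
problem and the console output at the time of the hang is what is
needed.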
>
> /etc/default/o2cb: I increased all values....
> O2CB_ENABLED=true
> O2CB_BOOTCLUSTER=ocfs2
> O2CB_HEARTBEAT_THRESHOLD=121
> O2CB_IDLE_TIMEOUT_MS=120000
> O2CB_KEEPALIVE_DELAY_MS=10000
> O2CB_RECONNECT_DELAY_MS=10000
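
For orientation, the OCFS2 documentation relates the disk heartbeat
threshold to a timeout as threshold = (timeout in seconds / 2) + 1,
with one heartbeat iteration every two seconds. A sketch of that
conversion applied to the values above (the function names are mine):

```python
HEARTBEAT_INTERVAL_S = 2  # seconds between disk heartbeat iterations


def threshold_to_seconds(threshold):
    """Disk heartbeat timeout implied by an O2CB_HEARTBEAT_THRESHOLD value."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def seconds_to_threshold(seconds):
    """O2CB_HEARTBEAT_THRESHOLD value for a desired timeout in seconds."""
    return seconds // HEARTBEAT_INTERVAL_S + 1


print(threshold_to_seconds(121))   # 240
print(seconds_to_threshold(240))   # 121
```

So a threshold of 121 corresponds to a disk heartbeat timeout of 240
seconds, twice the 120 second network idle timeout set by
O2CB_IDLE_TIMEOUT_MS=120000.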
>
>
> These two kernel versions are used:
>
> Linux vmserver2 2.6.26-1-amd64 #1 SMP Fri Mar 13 17:46:45 UTC 2009 
> x86_64 GNU/Linux
> Linux vmserver1 2.6.26-1-686-bigmem #1 SMP Fri Mar 13 18:52:29 UTC 
> 2009 i686 GNU/Linux
>
>
> /etc/ocfs2/cluster.conf:
> node:
>         ip_port = 7777
>         ip_address = 10.17.93.100
>         number = 0
>         name = vmserver0
>         cluster = ocfs2
>
> node:
>         ip_port = 7777
>         ip_address = 10.17.93.101
>         number = 1
>         name = vmserver1
>         cluster = ocfs2
>
> [snip]
>
> cluster:
>         node_count = 10
>         name = ocfs2
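
Since part of the file was snipped, here is a rough way to sanity-check
the full cluster.conf on each node: parse the stanzas and compare
node_count with the number of node stanzas. This is only a sketch of
the stanza layout shown above, not the parser o2cb itself uses, and the
function names are hypothetical:

```python
def parse_cluster_conf(text):
    """Parse stanza-style text into a list of (kind, fields) pairs."""
    stanzas = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # A header such as "node:" or "cluster:" starts a new stanza.
            stanzas.append((line[:-1], {}))
        elif "=" in line and stanzas:
            key, value = line.split("=", 1)
            stanzas[-1][1][key.strip()] = value.strip()
    return stanzas


def find_problems(stanzas):
    """Return a list of human-readable inconsistencies, empty if none."""
    nodes = [fields for kind, fields in stanzas if kind == "node"]
    clusters = [fields for kind, fields in stanzas if kind == "cluster"]
    problems = []
    numbers = [n.get("number") for n in nodes]
    if len(set(numbers)) != len(numbers):
        problems.append("duplicate node numbers")
    for c in clusters:
        if c.get("node_count") != str(len(nodes)):
            problems.append(
                f"node_count is {c.get('node_count')} "
                f"but {len(nodes)} node stanzas were found"
            )
    return problems
```

Running find_problems(parse_cluster_conf(open("/etc/ocfs2/cluster.conf").read()))
on every node and comparing the results would show whether the files
have drifted apart.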
>
>
> We don't know what is causing these crashes. Wait and idle values on the
> hosts are very low, and VM network traffic is minimal to zero. Should we
> increase O2CB_HEARTBEAT_THRESHOLD to a value of some minutes to check the
> behavior?
>
>
> Any hints are welcome
>
> Regards,
>
>  Christoph