<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-15"

 http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Sunil<br>

<br>

Sorry for this mistake, you can fetch an actual log at

<a class="moz-txt-link-freetext"

 href="http://exturl.blue-order.com/192.168.34.100-netconsole.log.gz">http://exturl.blue-order.com/192.168.34.100-netconsole.log.gz</a>

<br>

<br>

As I observed, the VM started at 12:24 and froze at about 12:26:20;

the cluster kicked off node0 at 12:28:10, and I left this machine untouched

until 13:23:00. No reboot happened, so I did a hard reset (panic_on_oops

is enabled).<br>
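<br>
To put these times in relation to each other, here is a minimal Python sketch that computes the intervals between the observations. It assumes the VM start was at exactly 12:24:00 (only the minute was noted) and treats the approximate freeze time as 12:26:20; the variable names are purely illustrative.<br>
<pre>
```python
from datetime import datetime

FMT = "%H:%M:%S"

# Times of day as observed; seconds for the VM start are an assumption.
vm_start = datetime.strptime("12:24:00", FMT)
freeze = datetime.strptime("12:26:20", FMT)   # approximate
evicted = datetime.strptime("12:28:10", FMT)  # cluster kicks off node0
reset = datetime.strptime("13:23:00", FMT)    # manual hard reset

# Differences between consecutive events, in whole seconds.
start_to_freeze_s = int((freeze - vm_start).total_seconds())
freeze_to_evict_s = int((evicted - freeze).total_seconds())
evict_to_reset_s = int((reset - evicted).total_seconds())

print(f"VM start to freeze:   {start_to_freeze_s} s")
print(f"freeze to eviction:   {freeze_to_evict_s} s")
print(f"eviction to reset:    {evict_to_reset_s} s")
```
</pre>
With these values the freeze comes 140 s after the VM start, the eviction 110 s after the freeze, and the hard reset 3290 s after the eviction.<br>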

<br>

VMs running on the same and other hosts don't disturb any function

related to ocfs2 and/or the SAN environment if they use local disks

(/dev/cciss.. md's etc.).<br>

<br>

Thanks in advance,<br>

<br>

 Christoph<br>

<br>

<br>

<br>

Sunil Mushran wrote:

<blockquote cite="mid:49D15A12.2090700@oracle.com" type="cite">

  <pre wrap="">Setup netconsole to catch the oops log.

The message you have provided shows the node death detection

only. Not the cause of the node death. The netconsole log of the

oopsed node will tell us as to why it oopsed.

Christoph Ackermann wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">Hello.

We have used a ten-host cluster for a vmware-server 2.0.0 environment

(Development and QA) running debian etch 2.6.18-6 kernel (32 AND 64

bit versions) on different hardware with absolutely no crashes for almost

two years. Most servers are based on TYAN TA 26 with dual quad-core

Xeon and 32 GB RAM and a few TA24 with Opterons. The heartbeat network is

completely separated, with its own HP switch. Our SAN is a good ol'

ESA12000, a SUN 2540, Brocade switches and Emulex LPe1150 HBAs.

Last weekend I did a general upgrade from debian etch to lenny with

2.6.26 kernel and OCFS2 V1.4.1-1 and everything looked good so far.

SAN/OCFS2 performance is very good and multipathing works fine. We

can copy/move/stat files in a very heavy way - everything works fine.

BUT, if we start one or a few VMs (2-3GB each) on a host it crashes

immediately or after a while, even if the VMs are idling or produce

only a small amount of I/O.

Today I did some tests with only three nodes and one LUN formatted

with OCFS2 V1.4 (a LUN with FS v1.2 too). After copying some files into

a VM (W2K3) we observed a node shutdown and eviction:

Mar 30 16:53:31 vmserver1 -- MARK --

Mar 30 17:08:49 vmserver1 kernel: [225184.104062] o2net: connection to

node vmserver0 (num 0) at 10.17.93.100:7777 has been idle for 120.0

seconds, shutting it down.

Mar 30 17:08:49 vmserver1 kernel: [225184.104137]

(0,0):o2net_idle_timer:1468 here are some times that might help debug

the situation: (tmr 1238425609.435866 now 1238425729.433160 dr

1238425609.435849 adv 1238425609.435873:1238425609.435874 func

(70f40876:504) 1238425539.438056:1238425539.438068)

Mar 30 17:08:49 vmserver1 kernel: [225184.104052] o2net: no longer

connected to node vmserver0 (num 0) at 10.17.93.100:7777

Mar 30 17:10:52 vmserver1 kernel: [225327.140209]

(5871,0):o2dlm_eviction_cb:258 o2dlm has evicted node 0 from group

FF8AF63DC502444687E77BB08635D07C

Mar 30 17:10:53 vmserver1 kernel: [225329.019274]

(5812,0):o2dlm_eviction_cb:258 o2dlm has evicted node 0 from group

FF8AF63DC502444687E77BB08635D07C

Mar 30 17:23:03 vmserver1 kernel: [226171.570861] o2net: connected to

node vmserver0 (num 0) at 10.17.93.100:7777

Mar 30 17:23:05 vmserver1 kernel: [226172.945357] ocfs2_dlm: Node 0

joins domain FF8AF63DC502444687E77BB08635D07C

Mar 30 17:23:05 vmserver1 kernel: [226172.945357] ocfs2_dlm: Nodes in

domain ("FF8AF63DC502444687E77BB08635D07C"): 0 1 5

The vmware-server configuration was not changed; only new network

devices were built by the vmware-config.pl tool with recent kernel headers.

We have absolutely no idea why this happens, and we are unable to work

until this system runs as it always has. We moved some VMs to local

disks, because I don't want to get strangled by my colleagues... ;-)

Currently we have three LUNs mapped:

ESA12000 with LVM2 FS: 4KB blocks and cluster size of 128K (running

for years with 1.2 FS "default"...)

SUNVOL01 and SUNVOL02 FS: 4KB blocks and cluster size of 512KB (running

for some weeks and now with V1.4 FS "max-feature")

A second issue is that kernel oopses don't turn into kernel panics:

/proc/sys/kernel/panic_on_oops is "1" and /proc/sys/kernel/panic is

"30" as the user manual advises, BUT the kernel hangs on an oops and no reboot

happens...

/etc/default/o2cb: I increased all values....

O2CB_ENABLED=true

O2CB_BOOTCLUSTER=ocfs2

O2CB_HEARTBEAT_THRESHOLD=121

O2CB_IDLE_TIMEOUT_MS=120000

O2CB_KEEPALIVE_DELAY_MS=10000

O2CB_RECONNECT_DELAY_MS=10000

These two kernel versions are used:

Linux vmserver2 2.6.26-1-amd64 #1 SMP Fri Mar 13 17:46:45 UTC 2009

x86_64 GNU/Linux

Linux vmserver1 2.6.26-1-686-bigmem #1 SMP Fri Mar 13 18:52:29 UTC

2009 i686 GNU/Linux

/etc/ocfs2/cluster.conf:

node:

        ip_port = 7777

        ip_address = 10.17.93.100

        number = 0

        name = vmserver0

        cluster = ocfs2

node:

        ip_port = 7777

        ip_address = 10.17.93.101

        number = 1

        name = vmserver1

        cluster = ocfs2

[snip]

cluster:

        node_count = 10

        name = ocfs2

We don't know what is causing these crashes. Wait Idle on the hosts is very

low, VM network traffic is minimal to zero. Should we increase

O2CB_HEARTBEAT_THRESHOLD to a value of some minutes to check behavior?

Any hints are welcome

Regards,

 Christoph


    </pre>

  </blockquote>


</blockquote>
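<br>
For reference, the o2cb settings and the o2net log line quoted above can be put into seconds as follows. This is a minimal Python sketch, assuming the rule of thumb from the OCFS2 documentation that the disk heartbeat timeout is (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds; the function name is hypothetical, and reading "now" minus "tmr" as the observed idle time is an interpretation of the log line, not something the log states.<br>
<pre>
```python
from decimal import Decimal


def heartbeat_timeout_seconds(threshold):
    """Assumed rule: (threshold - 1) heartbeat iterations of 2 seconds each."""
    return (threshold - 1) * 2


# Values from the quoted /etc/default/o2cb
O2CB_HEARTBEAT_THRESHOLD = 121
O2CB_IDLE_TIMEOUT_MS = 120000

disk_timeout_s = heartbeat_timeout_seconds(O2CB_HEARTBEAT_THRESHOLD)
idle_timeout_s = O2CB_IDLE_TIMEOUT_MS / 1000

# "tmr" and "now" copied from the o2net_idle_timer log line;
# Decimal avoids float rounding at microsecond precision.
tmr = Decimal("1238425609.435866")
now = Decimal("1238425729.433160")
observed_idle_s = now - tmr

print(f"disk heartbeat timeout: {disk_timeout_s} s")
print(f"network idle timeout:   {idle_timeout_s} s")
print(f"now - tmr in the log:   {observed_idle_s} s")
```
</pre>
Under these assumptions the disk heartbeat timeout is 240 s and the network idle timeout is 120 s, which matches the "idle for 120.0 seconds" message; the difference between "now" and "tmr" in the log is 119.997294 s.<br>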

<br>

<pre class="moz-signature" cols="72">-- 

____________________________________

Christoph Ackermann

System Integration Engineer

____________________________________

Blue Order Technologies AG

Europaallee 10

D-67657 Kaiserslautern

T +49 (0) 631 303-5251

F +49 (0) 631 303-5209

<a class="moz-txt-link-freetext" href="http://www.blue-order.com">http://www.blue-order.com</a> 

___________________________________

Registered office: Kaiserslautern

District court: Kaiserslautern HRB 3476

VAT ID No. DE202956676

Executive board: Andreas Eder, Klaus Gesmann

Chairman of the supervisory board: Dr. Wilhelm Krüger

</pre>

</body>

</html>