<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">

<HTML>

<HEAD>

  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">

  <META NAME="GENERATOR" CONTENT="GtkHTML/3.18.3">

</HEAD>

<BODY>

Hi,<BR>

<BR>

I have an OCFS2 cluster with 8 nodes and 2 devices mapped from disk storage to these nodes (the disks are formatted and the OCFS2 file systems have been created).<BR>

<BR>

I can start the cluster on each node and mount the device; this works fine.<BR>

<BR>

Let's say my first node is named host1, with node number 0 and IP address 172.28.4.1;<BR>

my second node is named host2, with node number 1 and IP address 172.28.4.2. I do nothing on the other nodes (but the device is mounted on every node).<BR>

<BR>

When I run find /mount_point -type f on host1, it searches and displays files.<BR>

Before the find finishes, I remove the IP address from the interface on host2 (so the network connection is broken), and the find on host1 freezes.<BR>

This is the log on host1:<BR>

<BR>

Jun 24 12:36:33 host1 kernel: [ 1816.861233] o2net: connection to node host2 (num 1) at 172.28.4.2:7777 has been idle for 30.0 seconds, shutting it down.<BR>

Jun 24 12:36:33 host1 kernel: [ 1816.861242] (0,5):o2net_idle_timer:1468 here are some times that might help debug the situation: (tmr 1245839763.115691 now 1245839793.115494 dr 1245839763.115676 adv 1245839763.115691:1245839763.115691 func (cd6c8a07:500) 1245839758.695001:1245839758.695003)<BR>

Jun 24 12:36:33 host1 kernel: [ 1816.861260] o2net: no longer connected to node host2 (num 1) at 172.28.4.2:7777<BR>

<BR>

A few minutes later the find resumes searching (I do not kill the process),<BR>

and I have this in my logs:<BR>

Jun 24 12:38:41 host1 kernel: [ 2011.612478] (5935,0):o2dlm_eviction_cb:258 o2dlm has evicted node 1 from group C9113043842642AD9694FDF0E9BE6E29<BR>

Jun 24 12:38:42 host1 kernel: [ 2013.370655] (5950,5):dlm_get_lock_resource:839 C9113043842642AD9694FDF0E9BE6E29:$RECOVERY: at least one node (1) to recover before lock mastery can begin<BR>

Jun 24 12:38:42 host1 kernel: [ 2013.370661] (5950,5):dlm_get_lock_resource:873 C9113043842642AD9694FDF0E9BE6E29: recovery map is not empty, but must master $RECOVERY lock now<BR>

Jun 24 12:38:42 host1 kernel: [ 2013.378061] (5950,5):dlm_do_recovery:524 (5950) Node 0 is the Recovery Master for the Dead Node 1 for Domain C9113043842642AD9694FDF0E9BE6E29<BR>

<BR>

Is it normal that I can't access the OCFS2 file system from any of the healthy nodes during these few minutes? I do not need to write for 2 minutes, but a break like this for reads is unacceptable.<BR>

<BR>

I have the default settings for HB:<BR>

O2CB_ENABLED=true<BR>

O2CB_BOOTCLUSTER=ocfs2<BR>

O2CB_HEARTBEAT_THRESHOLD=31<BR>

O2CB_IDLE_TIMEOUT_MS=30000<BR>

O2CB_KEEPALIVE_DELAY_MS=2000<BR>

O2CB_RECONNECT_DELAY_MS=2000<BR>
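<BR>

If I read these settings correctly, they translate into roughly the timeouts computed below. This is only a minimal sketch: it assumes the commonly documented rule that the disk heartbeat timeout is (threshold - 1) * 2 seconds, and the helper names are made up for illustration, not part of ocfs2-tools.<BR>

<PRE>
```python
# Values copied from the o2cb settings listed above.
settings = {
    "O2CB_HEARTBEAT_THRESHOLD": 31,
    "O2CB_IDLE_TIMEOUT_MS": 30000,
    "O2CB_KEEPALIVE_DELAY_MS": 2000,
    "O2CB_RECONNECT_DELAY_MS": 2000,
}

# Assumption: one disk heartbeat iteration every 2 seconds.
HEARTBEAT_INTERVAL_S = 2


def disk_heartbeat_timeout_s(threshold):
    """Seconds without a disk heartbeat before a node is declared dead
    (assumed formula: (threshold - 1) * interval)."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def ms_to_s(value_ms):
    """Convert a millisecond setting to seconds."""
    return value_ms / 1000.0


disk_timeout = disk_heartbeat_timeout_s(settings["O2CB_HEARTBEAT_THRESHOLD"])
idle_timeout = ms_to_s(settings["O2CB_IDLE_TIMEOUT_MS"])
keepalive_delay = ms_to_s(settings["O2CB_KEEPALIVE_DELAY_MS"])
reconnect_delay = ms_to_s(settings["O2CB_RECONNECT_DELAY_MS"])

print("disk heartbeat timeout: {} s".format(disk_timeout))
print("network idle timeout:   {} s".format(idle_timeout))
print("keepalive delay:        {} s".format(keepalive_delay))
print("reconnect delay:        {} s".format(reconnect_delay))
```
</PRE>

The 30 second network idle timeout matches the "has been idle for 30.0 seconds" message in the host1 log above.<BR>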

<BR>

ocfs2-tools 1.4.1 (debian lenny)<BR>

kernel 2.6.26-2-amd64<BR>

+multipath<BR>

+bonding<BR>

<BR>

modinfo ocfs2<BR>

filename:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /lib/modules/2.6.26-2-amd64/kernel/fs/ocfs2/ocfs2.ko<BR>

license:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; GPL<BR>

author:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Oracle<BR>

version:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.5.0<BR>

description:&nbsp;&nbsp;&nbsp; OCFS2 1.5.0<BR>

srcversion:&nbsp;&nbsp;&nbsp;&nbsp; B19D847BA86E871E41B7A64<BR>

depends:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; jbd,ocfs2_stackglue,ocfs2_nodemanager<BR>

vermagic:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.6.26-2-amd64 SMP mod_unload modversions<BR>

<BR>

Any advice?<BR>

<BR>

Peter<BR>

<BR>

</BODY>

</HTML>