[Ocfs2-users] CRS keeps killing the other node.

Tue Dec 27 09:27:06 CST 2005

> I have an Oracle RAC test environment that consists of 2 nodes. The
> nodes are running Redhat ES4 update 2.
> These two nodes are using a FireWire disk with an OCFS2 filesystem as the
> shared disk.
>
> Oracle Clusterware is installed perfectly fine on these two nodes.
>
> The problem now is that the hosts always seem to kill (fence) each other.
> For example, say node 1 is currently hung (fenced) and node 2 is
> active. If I cold restart node 1, node 2 will hang (fenced) within a few
> minutes, with caps lock and scroll lock blinking continuously.
> If I then cold restart node 2, node 1 will hang (fenced) with caps
> lock and scroll lock blinking continuously.
>
> This is the output of syslog from one of the nodes:
> Dec 19 13:44:35 testdb02 kernel: (2398,0):o2net_set_nn_state:421 no longer connected to node testdb01 at 172.16.1.1:7777
> Dec 19 13:44:35 testdb02 kernel: (4468,0):ocfs2_replay_journal:1125 Recovering node 0 from slot 1 on device (8,17)
> Dec 19 13:44:35 testdb02 kernel: (4469,0):ocfs2_replay_journal:1125 Recovering node 0 from slot 1 on device (8,18)
>
>
> This is the one from testdb01:
>
> Dec 18 17:47:08 testdb01 kernel: (0,0):o2net_idle_timer:1330 connection to node testdb02 num 1 at 172.16.1.2:7777 has been idle for 10 seconds, shutting it down.
> Dec 18 17:47:08 testdb01 kernel: (0,0):o2net_idle_timer:1341 here are some times that might help debug the situation: (tmr 1134953218.565854 now 1134953228.564548 dr 1134953218.565842 adv 1134953218.565855:1134953218.565856 func (df59be0e:505) 1134953138.728170:1134953138.728179)
> Dec 18 17:47:08 testdb01 kernel: (2342,0):o2net_set_nn_state:421 no longer connected to node testdb02 at 172.16.1.2:7777
> Dec 18 17:47:17 testdb01 kernel: (5061,1):ocfs2_replay_journal:1125 Recovering node 1 from slot 0 on device (8,17)
> Dec 18 17:47:17 testdb01 kernel: (5062,0):ocfs2_replay_journal:1125 Recovering node 1 from slot 0 on device (8,18)
> Dec 18 17:47:18 testdb01 kernel: kjournald starting. Commit interval 5 seconds
> Dec 18 17:47:18 testdb01 kernel: kjournald starting. Commit interval 5 seconds
>
>
> Does anybody know what is going on?
>
> Thank You

I have had the same problem with nodes fencing each other, even with faster
disks (over fibre from a SAN). I have seen someone set the heartbeat timeout
as high as 10 minutes with external disks (USB 2 and FireWire).

Maybe someone closer to the development of OCFS2 can shed some more light
on this and what the caveats are.

As suggested by other people on this list, I have increased the heartbeat
threshold from the default of 7 to 30. This gives an effective timeout of
(30 - 1) x 2 = 58 seconds, up from (7 - 1) x 2 = 12 seconds with the default.

On SLES, this is set in /etc/sysconfig/o2cb (I am not sure about Red Hat):
O2CB_HEARTBEAT_THRESHOLD=30
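
To make the arithmetic above easier to reuse, here is a rough Python sketch of the
(threshold - 1) x 2 rule. The function names are my own and purely illustrative,
and the 2-second factor is simply taken from the "x2" in the formula above; I have
not checked this against the OCFS2 source, so treat it as a calculator, not as a
reference.

```python
# Sketch of the timeout rule quoted above: timeout = (threshold - 1) * 2 seconds.
# HEARTBEAT_INTERVAL_SECONDS is an assumption taken from the "x2" in that formula.
HEARTBEAT_INTERVAL_SECONDS = 2


def effective_timeout(threshold: int) -> int:
    """Return the effective timeout in seconds for a given heartbeat threshold."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECONDS


def threshold_for(timeout_seconds: int) -> int:
    """Return the smallest threshold whose timeout is at least timeout_seconds."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    # Ceiling division: number of whole heartbeat intervals needed.
    intervals = -(-timeout_seconds // HEARTBEAT_INTERVAL_SECONDS)
    return intervals + 1


if __name__ == "__main__":
    for threshold in (7, 30):
        print(f"O2CB_HEARTBEAT_THRESHOLD={threshold} -> "
              f"{effective_timeout(threshold)} seconds")
    # The 10-minute setting mentioned earlier, expressed as a threshold.
    print(f"10 minutes -> O2CB_HEARTBEAT_THRESHOLD={threshold_for(600)}")
```

By this rule, the 10-minute timeout mentioned earlier would correspond to a
threshold of 301, if the formula holds at values that large.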

HTH
--
mike