[Ocfs2-users] CRS keeps killing the other node.

Irwan Hadi ihblist at gmail.com
Mon Dec 19 15:30:21 CST 2005


I have an Oracle RAC test environment that consists of two nodes, both
running Red Hat ES4 Update 2. The two nodes share a FireWire disk
formatted with OCFS2.

Oracle Clusterware installed without problems on both nodes.

The problem is that the hosts seem to keep killing (fencing) each other.
For example, say node 1 is currently hung (fenced) and node 2 is
active. If I cold restart node 1, then within a few minutes node 2 hangs
(fenced), with the Caps Lock and Scroll Lock LEDs blinking continuously.
If I then cold restart node 2, node 1 hangs (fenced) in the same way,
with Caps Lock and Scroll Lock blinking continuously.

This is the syslog output from one of the nodes (testdb02):
Dec 19 13:44:35 testdb02 kernel: (2398,0):o2net_set_nn_state:421 no
longer connected to node testdb01 at 172.16.1.1:7777
Dec 19 13:44:35 testdb02 kernel: (4468,0):ocfs2_replay_journal:1125
Recovering node 0 from slot 1 on device (8,17)
Dec 19 13:44:35 testdb02 kernel: (4469,0):ocfs2_replay_journal:1125
Recovering node 0 from slot 1 on device (8,18)
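
To make the "device (8,17)" and "device (8,18)" pairs easier to read, here
is a minimal sketch that maps a (major, minor) pair to the conventional
Linux SCSI disk name. It assumes major 8 with 16 minors per disk, which is
the usual sd layout; the helper name sd_device_name is made up for this
example, and the actual names under /dev on these nodes may differ.

```python
def sd_device_name(major, minor):
    """Map a (major, minor) pair to a conventional SCSI disk name.

    Assumption: major 8 is the first block of Linux SCSI disks, with
    16 minor numbers per disk (minor 0 = whole disk, 1..15 = partitions).
    """
    if major != 8:
        raise ValueError("this sketch only handles major 8 (sd)")
    if not 0 <= minor <= 255:
        raise ValueError("minor out of range for major 8")
    disk, partition = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk)
    return name if partition == 0 else name + str(partition)


if __name__ == "__main__":
    # The two devices named in the ocfs2_replay_journal messages.
    for pair in [(8, 17), (8, 18)]:
        print(pair, "->", sd_device_name(*pair))
```

Under that assumption the two devices would be the first and second
partitions of the second SCSI disk, which would fit a FireWire disk
showing up as a SCSI device.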


This is the one from testdb01:

Dec 18 17:47:08 testdb01 kernel: (0,0):o2net_idle_timer:1330 connection to node
testdb02 num 1 at 172.16.1.2:7777 has been idle for 10 seconds, shutting it
down.
Dec 18 17:47:08 testdb01 kernel: (0,0):o2net_idle_timer:1341 here are some
times that might help debug the situation: (tmr 1134953218.565854 now
1134953228.564548 dr 1134953218.565842 adv 1134953218.565855:1134953218.565856
func (df59be0e:505) 1134953138.728170:1134953138.728179)
Dec 18 17:47:08 testdb01 kernel: (2342,0):o2net_set_nn_state:421 no longer
connected to node testdb02 at 172.16.1.2:7777
Dec 18 17:47:17 testdb01 kernel: (5061,1):ocfs2_replay_journal:1125 Recovering
node 1 from slot 0 on device (8,17)
Dec 18 17:47:17 testdb01 kernel: (5062,0):ocfs2_replay_journal:1125 Recovering
node 1 from slot 0 on device (8,18)
Dec 18 17:47:18 testdb01 kernel: kjournald starting. Commit interval 5 seconds
Dec 18 17:47:18 testdb01 kernel: kjournald starting. Commit interval 5 seconds
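
The values in the o2net_idle_timer message look like Unix epoch seconds.
Assuming that is what they are, the following rough sketch pulls out the
"tmr", "now" and "dr" fields and computes the gap between "tmr" and "now".
The function names are invented for this example; only the numbers come
from the log above.

```python
import re
from datetime import datetime, timezone

# Message text copied from the testdb01 log (line wrapping removed).
LOG = ("(tmr 1134953218.565854 now 1134953228.564548 dr 1134953218.565842 "
       "adv 1134953218.565855:1134953218.565856 "
       "func (df59be0e:505) 1134953138.728170:1134953138.728179)")


def parse_idle_times(message):
    """Return the tmr, now and dr values as floats (assumed epoch seconds)."""
    fields = re.findall(r"\b(tmr|now|dr)\s+(\d+\.\d+)", message)
    return {name: float(value) for name, value in fields}


def idle_seconds(message):
    """Seconds between the 'tmr' and 'now' timestamps in the message."""
    times = parse_idle_times(message)
    return times["now"] - times["tmr"]


if __name__ == "__main__":
    for name, value in parse_idle_times(LOG).items():
        stamp = datetime.fromtimestamp(value, tz=timezone.utc)
        print(name, stamp.isoformat())
    print("gap between tmr and now: %.3f seconds" % idle_seconds(LOG))
```

The gap works out to just under 10 seconds, which matches the "idle for
10 seconds" in the first message.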


Does anybody know what is going on?

Thank you.

