[Ocfs2-users] General questions about ocfs2 errors

Randy Ramsdell rramsdell at livedatagroup.com
Mon Nov 6 13:40:59 PST 2006


Hi,

Maybe someone could elaborate on these recurring ocfs2 errors that
always result in a reboot of one or more systems.

Our setup:

3-node cluster
OCFS2 v1.2.1
openSUSE 10.1
SAN storage accessed over iSCSI


Cluster settings:

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=60
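
For reference, if I am reading the docs right, the disk heartbeat runs
every 2 seconds, so that threshold works out to roughly two minutes
before a node is declared dead. Rough math below (my assumption about
the 2 second interval, not something taken from the logs):

# rough math, assuming the documented 2 second heartbeat interval
threshold = 60                     # O2CB_HEARTBEAT_THRESHOLD above
interval = 2                       # assumed heartbeat interval, seconds
print((threshold - 1) * interval)  # 118 seconds before a node is considered dead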


Kernel line parameters:

elevator=deadline panic=5
I have tried with and without the "deadline" elevator to see if it helps.


The messages we receive are simply these:


Node 0

Nov  4 10:54:10 atl02010305 kernel: o2net: connection to node
atl02010310 (num 0) at 192.168.3.110:7777 has been idle for 10 seconds,
shutting it down.
Nov  4 10:54:10 atl02010305 kernel: (0,0):o2net_idle_timer:1309 here are
some times that might help debug the situation: (tmr 1162655640.698739
now 1162655650.695937 dr 1162655640.698734 adv
1162655640.698739:1162655640.698739 func (ca3835ec:504)
1162654980.779007:1162654980.779011)
Nov  4 10:54:10 atl02010305 kernel: o2net: no longer connected to node
atl02010310 (num 0) at 192.168.3.110:7777
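
For what it is worth, the gap between the logged timestamps is just
under 10 seconds, which matches the idle timeout in the message. Quick
check (numbers copied from the node 0 message above):

# timestamps from the o2net_idle_timer line on atl02010305
tmr = 1162655640.698739   # last recorded activity
now = 1162655650.695937   # when the idle timer fired
print(now - tmr)          # ~9.997 seconds, right at the 10 second idle timeout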


And the complementary messages:

Node 1
Nov  4 10:54:11 atl02010310 kernel: o2net: connection to node
atl02010305 (num 1) at 192.168.3.105:7777 has been idle for 10 seconds,
shutting it down.
Nov  4 10:54:11 atl02010310 kernel: (32479,0):o2net_idle_timer:1309 here
are some times that might help debug the situation: (tmr
1162655640.698521 now 1162655650.701661 dr 1162655650.695829 adv
1162655640.698525:1162655640.698525 func
(ca3835ec:505) 1162654980.778881:1162654980.778886)
Nov  4 10:54:11 atl02010310 kernel: o2net: no longer connected to node
atl02010305 (num 1) at 192.168.3.105:7777


This showed up shortly after and repeated for hours:


Nov  4 11:00:00 atl02010310 kernel:
(32540,1):dlm_send_remote_convert_request:398 ERROR: status = -107
Nov  4 11:00:00 atl02010310 kernel:
(32540,1):dlm_wait_for_node_death:371 32E007178FA24E87B45ECDDE6F7D5D52:
waiting 5000ms for notification of death of node 1
Nov  4 11:00:04 atl02010310 sshd[5242]: Accepted publickey for nagios
from 192.168.3.102 port 44292 ssh2
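
As far as I can tell, status = -107 is just -ENOTCONN (transport
endpoint is not connected), which would fit the lost o2net connection.
Quick check, assuming Linux errno values:

import errno, os
print(errno.ENOTCONN)    # 107 on Linux
print(os.strerror(107))  # 'Transport endpoint is not connected'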


Node 3 saw nothing.


So I wonder: why did neither node reboot from a kernel panic? And what
happened here, in general?

Weren't they supposed to fence, etc.?

Randy Ramsdell
