[Ocfs2-users] Ocfs2 errors on 3 node cluster

Tue Nov 14 14:04:16 PST 2006

It will be easier if you file a bug on oss.oracle.com/bugzilla with all
the details, such as the messages files from all nodes.

Why are you using 1.2.1? 1.2.3 has been out for a few months now.

Randy Ramsdell wrote:
> Hi,
>
> Maybe someone could elaborate on these recurring ocfs2 errors that
> always result in a reboot of one or more systems.
>
> Our setup:
>
> 3 node cluster
> Ocfs2 v. 1.2.1
> OpenSuse 10.1
> SAN storage uses iSCSI for disk access.
>
>
> Cluster settings:
>
> # O2CB_ENABLED: 'true' means to load the driver on boot.
> O2CB_ENABLED=true
>
> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
> O2CB_BOOTCLUSTER=ocfs2
>
> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
> O2CB_HEARTBEAT_THRESHOLD=60
>
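
For reference, a rough sketch of what that threshold means in seconds. It
assumes the relationship given in the OCFS2 FAQ, timeout = (threshold - 1) * 2
seconds, with a 2 second disk heartbeat interval. The function name is made up
for illustration and is not part of any ocfs2 tool.

```python
def heartbeat_timeout_seconds(threshold, interval_seconds=2):
    """Approximate disk heartbeat timeout for a given O2CB_HEARTBEAT_THRESHOLD.

    Assumption: timeout = (threshold - 1) * interval, with a 2 second
    heartbeat interval, as described in the OCFS2 FAQ.
    """
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * interval_seconds


if __name__ == "__main__":
    # O2CB_HEARTBEAT_THRESHOLD=60 from the settings quoted above
    print(heartbeat_timeout_seconds(60))
```

Under that assumption a threshold of 60 works out to 118 seconds, which is much
longer than the 10 second o2net idle timeout reported in the logs, so the two
appear to be separate timers.
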
>
> Kernel line parameters:
>
> elevator=deadline panic=5
> I have tested both with and without the "deadline" elevator to see
> whether it helps.
>
>
> The messages we receive are simply this:
>
>
> Node 0
>
> Nov  4 10:54:10 atl02010305 kernel: o2net: connection to node
> atl02010310 (num 0) at 192.168.3.110:7777 has been idle for 10 seconds,
> shutting it down.
> Nov  4 10:54:10 atl02010305 kernel: (0,0):o2net_idle_timer:1309 here are
> some times that might help debug the situation: (tmr 1162655640.698739
> now 1162655650.695937 dr 1162655640.698734 adv
> 1162655640.698739:1162655640.698739 func (ca3835ec:504)
> 1162654980.779007:1162654980.779011)
> Nov  4 10:54:10 atl02010305 kernel: o2net: no longer connected to node
> atl02010310 (num 0) at 192.168.3.110:7777
>
>
> And the complementary
>
> Node 1
> Nov  4 10:54:11 atl02010310 kernel: o2net: connection to node
> atl02010305 (num 1) at 192.168.3.105:7777 has been idle for 10 seconds,
> shutting it down.
> Nov  4 10:54:11 atl02010310 kernel: (32479,0):o2net_idle_timer:1309 here
> are some times that might help debug the situation: (tmr
> 1162655640.698521 now 1162655650.701661 dr
> 1162655650.695829 adv 1162655640.698525:1162655640.698525 func
> (ca3835ec:505) 1162654980.778881:1162654980.778886)
> Nov  4 10:54:11 atl02010310 kernel: o2net: no longer connected to node
> atl02010305 (num 1) at 192.168.3.105:7777
>
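
A small sketch for reading the o2net_idle_timer line, in case it helps when
comparing nodes. It assumes "tmr" is the time the idle timer was last armed and
"now" is the time it fired, so now - tmr is how long the connection sat idle.
The helper name is hypothetical; the sample line is the node 0 message quoted
above, joined onto one line.

```python
import re
from decimal import Decimal

# Node 0's o2net_idle_timer message from the log above, unwrapped.
LINE = (
    "(0,0):o2net_idle_timer:1309 here are some times that might help debug "
    "the situation: (tmr 1162655640.698739 now 1162655650.695937 "
    "dr 1162655640.698734 adv 1162655640.698739:1162655640.698739 "
    "func (ca3835ec:504) 1162654980.779007:1162654980.779011)"
)


def idle_gap(line):
    """Return now - tmr in seconds as a Decimal.

    Assumption: tmr = when the idle timer was armed, now = when it fired.
    Decimal avoids float rounding on the microsecond timestamps.
    """
    tmr = re.search(r"\btmr (\d+\.\d+)", line)
    now = re.search(r"\bnow (\d+\.\d+)", line)
    if tmr is None or now is None:
        raise ValueError("line does not contain tmr/now timestamps")
    return Decimal(now.group(1)) - Decimal(tmr.group(1))


if __name__ == "__main__":
    print(idle_gap(LINE))
```

For the node 0 line this gives a gap of just under 10 seconds, which matches
the "idle for 10 seconds" message.
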
>
> This showed up shortly after and repeated for hours:
>
>
> Nov  4 11:00:00 atl02010310 kernel:
> (32540,1):dlm_send_remote_convert_request:398 ERROR: status = -107
> Nov  4 11:00:00 atl02010310 kernel:
> (32540,1):dlm_wait_for_node_death:371 32E007178FA24E87B45ECDDE6F7D5D52:
> waiting 5000ms for notification of death of node 1
> Nov  4 11:00:04 atl02010310 sshd[5242]: Accepted publickey for nagios
> from 192.168.3.102 port 44292 ssh2
>
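
On "status = -107": kernel code usually reports errors as negative errno
values, and on Linux errno 107 is ENOTCONN ("Transport endpoint is not
connected"). The lookup below is only an illustrative sketch with a few Linux
errno values hard-coded; the table and function name are not from ocfs2.

```python
# A few Linux (asm-generic) errno values related to networking.
# Hard-coded so the lookup gives the same answer on any platform.
LINUX_ERRNO = {
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    110: ("ETIMEDOUT", "Connection timed out"),
    111: ("ECONNREFUSED", "Connection refused"),
    112: ("EHOSTDOWN", "Host is down"),
    113: ("EHOSTUNREACH", "No route to host"),
}


def describe_status(status):
    """Map a negative kernel status code to (name, description)."""
    return LINUX_ERRNO.get(-status, ("UNKNOWN", "not in this small table"))


if __name__ == "__main__":
    name, text = describe_status(-107)
    print(name, "-", text)
```

If that reading is right, the dlm convert request failed because the o2net
connection to the other node had already been shut down.
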
>
> Node 3
>
> saw nothing.
>
>
> So I wonder why neither node rebooted from a kernel panic, or what
> happened in general.
>
> Weren't they supposed to fence, etc.?
>
> Randy Ramsdell
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>