[Ocfs2-users] re: o2hb_write_timeout:270 ERROR: Heartbeat write timeout

Thu Nov 16 11:49:33 PST 2006

On nodes db01 and db03, the heartbeat timed out at 17:12:49. However,
the nodes did not fully panic: the network was shut down, but the hb
thread was still running for some reason.
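
For reference, the 12000 ms in the quoted "Heartbeat write timeout"
messages is consistent with what I understand to be the usual o2cb
arithmetic: timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, with a
default threshold of 7. Both the formula and the default are assumptions
recalled from the OCFS2 FAQ, not something stated in this thread, and the
function name below is made up for illustration. A minimal sketch:

```python
# Assumed relationship (not confirmed in this thread):
#   timeout_ms = (threshold - 1) * interval_ms
# where the heartbeat thread writes to disk once per interval.

def hb_write_timeout_ms(threshold: int, interval_ms: int = 2000) -> int:
    """Return the disk heartbeat write timeout implied by a threshold."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * interval_ms


if __name__ == "__main__":
    # Threshold 7 is the assumed default; it yields the value seen in the logs.
    print(hb_write_timeout_ms(7), "ms")
```

If the storage stalls for longer than this while a drive is being
rebuilt, raising the threshold is the usual knob to look at.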

Within 10 seconds of that, by 17:12:59, db02 had detected loss of
network connectivity with both db01 and db03. However, it was still
seeing those nodes' heartbeats on disk and assumed that they were alive.
With only itself reachable out of 3 heartbeating nodes, it panicked as
per the quorum rules.
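
The quorum check can be sketched roughly as follows. This is a simplified
model built only from the o2quo_make_decision message in the quoted log
("only connected to 1 nodes and 2 is needed to make a quorum out of 3
heartbeating nodes"); it is not the kernel code. The function names are
hypothetical, and the tie-break that an even number of heartbeating nodes
needs is deliberately not modeled.

```python
def quorum_needed(heartbeating: int) -> int:
    """Nodes (counting ourselves) that must be reachable over the network."""
    return (heartbeating + 1) // 2


def should_fence(heartbeating: int, connected: int) -> bool:
    """True if this node falls short of quorum and has to fence itself.

    heartbeating: nodes still seen writing to the heartbeat area on disk
    connected:    nodes reachable over the network, including this one
    Ties in an even-sized cluster are NOT handled by this sketch.
    """
    return connected < quorum_needed(heartbeating)


if __name__ == "__main__":
    # Values from the quoted log: 3 nodes heartbeating on disk,
    # but the node can only reach itself over the network.
    hb, conn = 3, 1
    print(f"connected={conn} needed={quorum_needed(hb)} "
          f"fence={should_fence(hb, conn)}")
```

The point is that a node which still sees its peers on disk but cannot
talk to them counts them as alive, so it concludes that it is the one in
the minority.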

So the question is: what was happening on nodes db01 and db03 after 17:12:49?
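
To make the timeline easier to see, here is a small sketch that lines up
the timestamps from the quoted /var/log/messages excerpts (hostnames as
they appear in the logs). The year is an assumption, since syslog does not
record it, and the event labels are my own paraphrases.

```python
from datetime import datetime

# (label, syslog timestamp) pairs copied from the quoted log excerpts.
EVENTS = [
    ("dbo1/dbo3: heartbeat write timeout", "Nov 15 17:12:49"),
    ("dbo2: o2net link to dbo1 shut down", "Nov 15 17:12:57"),
    ("dbo2: o2net link to dbo3 shut down", "Nov 15 17:12:59"),
    ("dbo2: fences itself (quorum)", "Nov 15 17:33:03"),
]


def parse(stamp: str, year: int = 2006) -> datetime:
    """Parse a syslog-style timestamp; the year is assumed, not logged."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def offsets(events):
    """Return (label, seconds since the first event) for each event."""
    start = parse(events[0][1])
    return [(label, int((parse(stamp) - start).total_seconds()))
            for label, stamp in events]


if __name__ == "__main__":
    for label, secs in offsets(EVENTS):
        print(f"+{secs:5d}s  {label}")
```

Laid out this way, the network links drop within 10 seconds of the
heartbeat timeout, but the fence on dbo2 is only logged about 20 minutes
later.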

Peter Santos wrote:
> Folks,
> 	
> I'm trying to piece together what happened during a recent event where our 3-node RAC cluster had problems.
> It appears that all 3 nodes restarted, which is likely to occur if all 3 nodes cannot communicate with the
> shared ocfs2 storage.
>
> I did find out from our SA that this happened while he was replacing a failed drive on the storage
> and the storage was in a degraded mode.  I'm trying to understand whether the 3 nodes had a difficult time
> accessing the shared ocfs2 volume or whether it was a tcp connectivity issue. Nobody is currently using the
> cluster, so it should have been idle from a user perspective.
>
>
> prompt># cat /etc/fstab | grep ocfs2
>
> /dev/sdb1  /ocfs2       ocfs2      _netdev,datavolume,nointr  0 0
> /dev/sdb2  /backups     ocfs2      _netdev,datavolume,nointr  0 0
>
> We have 2 ocfs2 volumes: one is for the voting and ocr files, while the other is to be used as
> shared storage for backups of archivelog files etc.
>
>
> /var/log/messages
>
>
> NODE1 (dbo1)
> ========================================================================================================
> Nov 15 17:12:49 dbo1 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2
> 				    after 12000 milliseconds
> Nov 15 17:12:49 dbo1 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 13):
> Nov 16 05:44:58 dbo1 syslogd 1.4.1: restart.
>
>
> NODE2 (dbo2)
> ========================================================================================================
>
> Nov 15 17:12:57 dbo2 kernel: o2net: connection to node dbo1 (num 0) at 192.168.134.140:7777 has been idle for 10
> seconds, shutting it down.
> Nov 15 17:12:57 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr
> 1163628767.826089 now 1163628777.825614 dr 1163628767.826070 adv 1163628767.826104:1163628767.826105 func (f0735f96
>    :506) 1163454320.893701:1163454320.893708)
> Nov 15 17:12:57 dbo2 kernel: o2net: no longer connected to node dbo1 (num 0) at 192.168.134.140:7777
> Nov 15 17:12:59 dbo2 kernel: o2net: connection to node dbo3 (num 2) at 192.168.134.142:7777 has been idle for 10
> seconds, shutting it down.
> Nov 15 17:12:59 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr
> 1163628769.44144 now 1163628779.43640 dr 1163628769.44123 adv 1163628769.44159:1163628769.44160 func (f7e0383f:504)
>     1163540424.444236:1163540424.444248)
> Nov 15 17:12:59 dbo2 kernel: o2net: no longer connected to node dbo3 (num 2) at 192.168.134.142:7777
> Nov 15 17:32:37 dbo2 -- MARK --
> Nov 15 17:33:03 dbo2 kernel: (11,1):o2quo_make_decision:121 ERROR: fencing this node because it is only connected to 1
> nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
> Nov 15 17:33:03 dbo2 kernel: (11,1):o2hb_stop_all_regions:1889 ERROR: stopping heartbeat on all active regions.
> Nov 15 17:33:03 dbo2 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing
> Nov 15 17:33:03 dbo2 kernel:
>
> NODE3 (dbo3)
> ========================================================================================================
> Nov 15 17:12:49 dbo3 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2
> 				    after 12000 milliseconds
> Nov 15 17:12:49 dbo3 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 11):
> Nov 16 10:45:32 dbo3 syslogd 1.4.1: restart.
>
>
> any help is greatly appreciated (BTW, I've read the ocfs2 user guide).
>
> thanks
> -peter