[Ocfs2-users] re: o2hb_write_timeout:270 ERROR: Heartbeat write timeout

Thu Nov 16 11:54:17 PST 2006

BTW, you appear to be running with the default 12 sec timeout.
From what I've gathered, most users appear to be using a higher
60 sec timeout. 12 secs does not leave much room for error.

We are thinking of increasing the default in the next drop.
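
For reference, a rough sketch of how the timeout maps to the o2cb
setting. It assumes the relation I recall from the OCFS2 FAQ, where
O2CB_HEARTBEAT_THRESHOLD (in /etc/sysconfig/o2cb) counts 2 sec heartbeat
iterations and timeout = (threshold - 1) * 2 secs. The function names are
made up for illustration; check the relation against the docs for your
version before relying on it.

```python
# Sketch only: relates the o2cb heartbeat threshold to the disk heartbeat
# timeout. ASSUMPTION: one heartbeat iteration every 2 seconds and
# timeout = (threshold - 1) * 2 seconds.
HB_INTERVAL_SECS = 2  # assumed interval between heartbeat iterations


def timeout_secs(threshold: int) -> int:
    """Seconds without a successful heartbeat write before the timeout fires."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HB_INTERVAL_SECS


def threshold_for(timeout: int) -> int:
    """Smallest threshold that gives at least `timeout` seconds."""
    if timeout <= 0:
        raise ValueError("timeout must be positive")
    # Ceiling division, so a timeout that is not a multiple of the
    # interval rounds up rather than down.
    iterations = -(-timeout // HB_INTERVAL_SECS)
    return iterations + 1


if __name__ == "__main__":
    for secs in (12, 60):
        print(f"{secs} sec timeout -> threshold {threshold_for(secs)}")
```

Under that assumption the 12 sec default corresponds to a threshold of 7
(matching the "after 12000 milliseconds" in the log), and a 60 sec
timeout to a threshold of 31.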

Sunil Mushran wrote:
> On nodes dbo1 and dbo3 the hb timed out at 17:12:49. However, the nodes
> did not fully panic. That is, the network was shut down but the hb thread
> was still going strong for some reason.
>
> Within 10 secs of that, by 17:12:59, dbo2 detected loss of network
> connectivity with both nodes dbo1 and dbo3. However, it was still
> seeing the nodes' hb on disk and assumed that they were alive. As per
> quorum rules, it panicked.
>
> So the question is: what was happening on nodes dbo1 and dbo3 after 17:12:49?
>
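
The quorum rule referred to above can be sketched roughly as follows.
This is modeled only on the wording of the o2quo_make_decision message
in the log further down ("only connected to 1 nodes and 2 is needed to
make a quorum out of 3 heartbeating nodes"), not on the actual kernel
code. The function name is hypothetical, and how an even split is
resolved is not modeled.

```python
# Sketch only: simple majority rule inferred from the log message.
# ASSUMPTION: "connected" counts the local node itself, as the log's
# "only connected to 1 nodes" suggests. Tie-breaking for an even number
# of heartbeating nodes is deliberately left out.


def must_fence(connected: int, heartbeating: int) -> bool:
    """Return True if this node lacks a majority and should fence itself.

    connected    -- nodes reachable over the network, including this node
    heartbeating -- nodes currently seen heartbeating on disk
    """
    if connected < 1 or heartbeating < 1:
        raise ValueError("counts include the local node, so both are >= 1")
    needed = heartbeating // 2 + 1  # strict majority
    return connected < needed


if __name__ == "__main__":
    # dbo2's view: all 3 nodes still heartbeat on disk, but only dbo2
    # itself is reachable over the network.
    print(must_fence(connected=1, heartbeating=3))
    # Had dbo1 and dbo3 stopped heartbeating on disk, dbo2 alone would
    # have been the whole cluster and would have kept its quorum.
    print(must_fence(connected=1, heartbeating=1))
```

This is why the lingering disk heartbeats matter: they kept the
heartbeating count at 3 while the connected count dropped to 1.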
> Peter Santos wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Folks,
>>     
>> I'm trying to piece together what happened during a recent event
>> where our 3 node RAC cluster had problems. It appears that all 3
>> nodes restarted, which is likely to occur if all 3 nodes cannot
>> communicate with the shared ocfs2 storage.
>>
>> I did find out from our SA that this happened while he was replacing
>> a failed drive on the storage, and the storage was in a degraded
>> mode. I'm trying to understand whether the 3 nodes had a difficult
>> time accessing the shared ocfs2 volume or whether it was a tcp
>> connectivity issue. Nobody is currently using the cluster, so it
>> should have been idle from a user perspective.
>>
>>
>> prompt># cat /etc/fstab | grep ocfs2
>>
>> /dev/sdb1  /ocfs2       ocfs2      _netdev,datavolume,nointr  0 0
>> /dev/sdb2  /backups     ocfs2      _netdev,datavolume,nointr  0 0
>>
>> We have 2 ocfs2 volumes: one is for the voting and ocr files, while
>> the other is to be used as shared storage for backups of archivelog
>> files etc.
>>
>>
>> /var/log/messages
>>
>>
>> NODE1 (dbo1)
>> ======================================================================================================== 
>>
>> Nov 15 17:12:49 dbo1 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2 after 12000 milliseconds
>> Nov 15 17:12:49 dbo1 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 13):
>> Nov 16 05:44:58 dbo1 syslogd 1.4.1: restart.
>>
>>
>> NODE2 (dbo2)
>> ======================================================================================================== 
>>
>>
>> Nov 15 17:12:57 dbo2 kernel: o2net: connection to node dbo1 (num 0) at 192.168.134.140:7777 has been idle for 10 seconds, shutting it down.
>> Nov 15 17:12:57 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1163628767.826089 now 1163628777.825614 dr 1163628767.826070 adv 1163628767.826104:1163628767.826105 func (f0735f96:506) 1163454320.893701:1163454320.893708)
>> Nov 15 17:12:57 dbo2 kernel: o2net: no longer connected to node dbo1 (num 0) at 192.168.134.140:7777
>> Nov 15 17:12:59 dbo2 kernel: o2net: connection to node dbo3 (num 2) at 192.168.134.142:7777 has been idle for 10 seconds, shutting it down.
>> Nov 15 17:12:59 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1163628769.44144 now 1163628779.43640 dr 1163628769.44123 adv 1163628769.44159:1163628769.44160 func (f7e0383f:504) 1163540424.444236:1163540424.444248)
>> Nov 15 17:12:59 dbo2 kernel: o2net: no longer connected to node dbo3 (num 2) at 192.168.134.142:7777
>> Nov 15 17:32:37 dbo2 -- MARK --
>> Nov 15 17:33:03 dbo2 kernel: (11,1):o2quo_make_decision:121 ERROR: fencing this node because it is only connected to 1 nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
>> Nov 15 17:33:03 dbo2 kernel: (11,1):o2hb_stop_all_regions:1889 ERROR: stopping heartbeat on all active regions.
>> Nov 15 17:33:03 dbo2 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing
>> Nov 15 17:33:03 dbo2 kernel:
>>
>> NODE3 (dbo3)
>> ======================================================================================================== 
>>
>> Nov 15 17:12:49 dbo3 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2 after 12000 milliseconds
>> Nov 15 17:12:49 dbo3 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 11):
>> Nov 16 10:45:32 dbo3 syslogd 1.4.1: restart.
>>
>>
>> any help is greatly appreciated (BTW, I've read the ocfs2 user guide).
>>
>> thanks
>> - -peter
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.1 (GNU/Linux)
>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>>
>> iD8DBQFFXKstoyy5QBCjoT0RAv82AJ9cAGUON4K2/ixbB3NxTtjL/yORlACeJFvH
>> RVxoqk930affeEnK3yw5SIU=
>> =eqqi
>> -----END PGP SIGNATURE-----
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> Ocfs2-users at oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>   