[Ocfs2-users] re: o2hb_write_timeout:270 ERROR: Heartbeat write timeout

Andy Phillips andrew.phillips at betfair.com
Tue Nov 28 09:13:29 PST 2006


It's a #define in the code at the moment, so the 10-second value is not
configurable without rebuilding.

The timeout is triggered if the o2net code does not receive a valid
message within the timeout interval. "Valid" here means valid from the
point of view of the o2net layer, not the operating system.
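
As a rough illustration (a sketch, not the o2net code), the "times that
might help debug the situation" line in the quoted dbo3 log can be checked
against the 10-second interval. This assumes that `tmr` is when the idle
timer was last armed and `now` is when it fired, and that the part after
the dot is a microsecond count that may not be zero-padded. The function
name is hypothetical.

```python
import re

# Matches the "tmr <sec>.<usec> now <sec>.<usec>" part of an o2net_idle_timer line.
TIMER_RE = re.compile(r"tmr\s+(\d+)\.(\d+)\s+now\s+(\d+)\.(\d+)")

def idle_microseconds(log_text):
    """Return microseconds between 'tmr' and 'now', or None if not found."""
    # Syslog and mail clients wrap long lines, so collapse whitespace first.
    flat = " ".join(log_text.split())
    m = TIMER_RE.search(flat)
    if m is None:
        return None
    tmr_s, tmr_us, now_s, now_us = (int(g) for g in m.groups())
    # Assumption: the fractional field is an integer microsecond count.
    return (now_s - tmr_s) * 1_000_000 + (now_us - tmr_us)

sample = """(0,1):o2net_idle_timer:1310 here are some
times that might help debug the situation: (tmr
1164127226.293816 now 1164127236.291931 dr 1164127226.293797"""

print(idle_microseconds(sample) / 1_000_000)  # elapsed seconds
```

For the dbo3 sample this comes to 9,998,115 microseconds, so the timer
fired right at the 10-second mark.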

Andy


On Tue, 2006-11-28 at 12:00 -0500, Peter Santos wrote:
> 
> So are you saying that the "10 second" message is currently not configurable?
> 
> Also, does this message result from a failure to ping the heartbeat device or
> from a network ping against the host's network card?
> 
> -peter
> 
> 
> Sunil Mushran wrote:
> > As ocfs2 heartbeats on the same device, unplugging a different device on
> > the storage should not affect ocfs2 as long as the ios are completing. But
> > the logs indicate otherwise. HB ios are erroring out.
> > 
> > The o2net message is the tcp connect message. We will be providing a way
> > to configure that too.
> > 
> > Peter Santos wrote:
> > Sunil,
> > 
> > after trying to chase this down, I think one of our SAs might have
> > restarted the storage without notifying anyone.
> > 
> > Similarly, today a disk that was not in use was re-initialized and
> > caused everything to come down. I don't know if this is an issue with
> > ocfs2 or (old_storage + our SA doing this incorrectly).
> > 
> > The idea was to re-initialize a disk that was not being used (sdc) and
> > not have it affect the ocfs2 storage (sdb).
> > 
> > After the re-initialization completed, I noticed that all 3 nodes
> > weren't working, and this was what I found on dbo3:
> > 
> > =======================================================================================================================
> > 
> > Nov 21 11:40:36 dbo3 kernel: o2net: connection to node dbo2 (num 1) at
> > 192.168.134.141:7777 has been idle for 10
> > seconds, shutting it down.
> > 
> > Nov 21 11:40:36 dbo3 kernel: (0,1):o2net_idle_timer:1310 here are some
> > times that might help debug the situation: (tmr
> > 1164127226.293816 now 1164127236.291931 dr 1164127226.293797 adv
> > 1164127226.293818:1164127226.293819 func (a77953f3:2)
> > 1164124426.747626:1164124426.747628)
> > 
> > 
> > Nov 21 11:40:36 dbo3 kernel: o2net: no longer connected to node dbo2
> > (num 1) at 192.168.134.141:7777
> > 
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector
> > 591502543
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > ...
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector
> > 591502568
> > Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954
> > ERROR: status = -5
> > Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954
> > ERROR: status = -5
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 1983
> > Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO
> > Error -5
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector
> > 3921780
> > Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO
> > Error -5
> > Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954
> > ERROR: status = -5
> > Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954
> > ERROR: status = -5
> > ...
> > Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954
> > ERROR: status = -5
> > Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954
> > ERROR: status = -5
> > Nov 21 11:41:11 dbo3 su: pam_unix2: session finished for user oracle,
> > service su
> > Nov 21 11:41:11 dbo3 logger: Oracle CSSD failure 134.
> > Nov 21 11:45:07 dbo3 syslogd 1.4.1: restart.
> > 
> > I'm curious about the message
> > "o2net: connection to node dbo2 (num 1) at 192.168.134.141:7777 has
> > been idle for 10 seconds, shutting it down."
> > 
> > I have increased my O2CB_HEARTBEAT_THRESHOLD to 61, but where is this
> > message getting "10 seconds" from?
> > Also, is this message displayed because dbo2 was not able to check into
> > the heartbeat filesystem?
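
For reference, a sketch of the disk heartbeat arithmetic, assuming the
relationship given in the OCFS2 FAQ: timeout in seconds =
(O2CB_HEARTBEAT_THRESHOLD - 1) * 2, i.e. a 2-second heartbeat interval.
The function name is made up for illustration. Note that this disk
heartbeat timeout is separate from the o2net idle timeout that prints
"10 seconds".

```python
HEARTBEAT_INTERVAL_MS = 2000  # assumed: one disk heartbeat every 2 seconds

def hb_write_timeout_ms(threshold):
    """Disk heartbeat write timeout implied by O2CB_HEARTBEAT_THRESHOLD."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_MS

for threshold in (7, 61):
    print(threshold, hb_write_timeout_ms(threshold), "ms")
```

A threshold of 7 gives 12000 ms, which matches the "after 12000
milliseconds" in the Nov 15 logs further down; 61 would give 120000 ms.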
> > 
> > -peter
> > 
> > 
> > 
> > 
> > 
> > Sunil Mushran wrote:
> >  
> >>>> On nodes db01 and db03 hb timed out at 17:12:49. However, the nodes
> >>>> did not fully panic. As in, the network was shut down but the hb thread
> >>>> was still going strong for some reason.
> >>>>
> >>>> Within 10 secs of that, by 17:12:59, db02 detected loss of network
> >>>> connectivity with both nodes db01 and db03. However, it was still
> >>>> seeing the nodes' hb on disk and assumed that they were alive. As per
> >>>> quorum rules, it panicked.
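
The quorum decision mentioned here can be sketched as follows. This is a
simplified illustration for an odd number of heartbeating nodes only,
based on the "only connected to 1 nodes and 2 is needed to make a quorum
out of 3 heartbeating nodes" message in the dbo2 log. It assumes the
connected count includes the local node, and the function name is
hypothetical. The real o2quo code handles even-sized clusters differently.

```python
def should_fence(connected, heartbeating):
    """Odd-sized case only: fence when connected nodes fall below a majority."""
    if heartbeating % 2 == 0:
        raise ValueError("sketch covers an odd number of heartbeating nodes only")
    quorum = heartbeating // 2 + 1  # 2 out of 3, as in the dbo2 log message
    return connected < quorum

# dbo2 still saw 3 nodes heartbeating on disk but was connected to only 1.
print(should_fence(connected=1, heartbeating=3))
```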
> >>>>
> >>>> So the question is: what was happening on nodes db01 and db03 after 17:12:49?
> >>>>
> >>>> Peter Santos wrote:
> >>>> Folks,
> >>>>     I'm trying to piece together what happened during a recent event
> >>>> where our 3 node RAC cluster had problems.
> >>>> It appears that all 3 nodes restarted, which is likely to occur if
> >>>> all 3 nodes cannot communicate with the shared ocfs2 storage.
> >>>>
> >>>> I did find out from our SA that this happened during the time he was
> >>>> replacing a failed drive on the storage and the storage was in a
> >>>> degraded mode.  I'm trying to understand if the 3 nodes had a difficult
> >>>> time accessing the shared ocfs2 volume or whether it was a tcp
> >>>> connectivity issue. Nobody is currently using the cluster, so it
> >>>> should have been idle from a user perspective.
> >>>>
> >>>>
> >>>> prompt># cat /etc/fstab | grep ocfs2
> >>>>
> >>>> /dev/sdb1  /ocfs2       ocfs2      _netdev,datavolume,nointr  0 0
> >>>> /dev/sdb2  /backups     ocfs2      _netdev,datavolume,nointr  0 0
> >>>>
> >>>> We have 2 ocfs2 volumes: one is for the voting and ocr files, while
> >>>> the other is to be used as shared storage for backups of archivelog
> >>>> files etc.
> >>>>
> >>>>
> >>>> /var/log/messages
> >>>>
> >>>>
> >>>> NODE1 (dbo1)
> >>>> ========================================================================================================
> >>>>
> >>>>
> >>>> Nov 15 17:12:49 dbo1 kernel: (13,3):o2hb_write_timeout:270 ERROR:
> >>>> Heartbeat write timeout to device sdb2
> >>>>                     after 12000 milliseconds
> >>>> Nov 15 17:12:49 dbo1 kernel: Heartbeat thread (13) printing last 24
> >>>> blocking operations (cur = 13):
> >>>> Nov 16 05:44:58 dbo1 syslogd 1.4.1: restart.
> >>>>
> >>>>
> >>>> NODE2 (dbo2)
> >>>> ========================================================================================================
> >>>>
> >>>>
> >>>>
> >>>> Nov 15 17:12:57 dbo2 kernel: o2net: connection to node dbo1 (num 0) at
> >>>> 192.168.134.140:7777 has been idle for 10
> >>>> seconds, shutting it down.
> >>>> Nov 15 17:12:57 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some
> >>>> times that might help debug the situation: (tmr
> >>>> 1163628767.826089 now 1163628777.825614 dr 1163628767.826070 adv
> >>>> 1163628767.826104:1163628767.826105 func (f0735f96
> >>>>    :506) 1163454320.893701:1163454320.893708)
> >>>> Nov 15 17:12:57 dbo2 kernel: o2net: no longer connected to node dbo1
> >>>> (num 0) at 192.168.134.140:7777
> >>>> Nov 15 17:12:59 dbo2 kernel: o2net: connection to node dbo3 (num 2) at
> >>>> 192.168.134.142:7777 has been idle for 10
> >>>> seconds, shutting it down.
> >>>> Nov 15 17:12:59 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some
> >>>> times that might help debug the situation: (tmr
> >>>> 1163628769.44144 now 1163628779.43640 dr 1163628769.44123 adv
> >>>> 1163628769.44159:1163628769.44160 func (f7e0383f:504)
> >>>>     1163540424.444236:1163540424.444248)
> >>>> Nov 15 17:12:59 dbo2 kernel: o2net: no longer connected to node dbo3
> >>>> (num 2) at 192.168.134.142:7777
> >>>> Nov 15 17:32:37 dbo2 -- MARK --
> >>>> Nov 15 17:33:03 dbo2 kernel: (11,1):o2quo_make_decision:121 ERROR:
> >>>> fencing this node because it is only connected to 1
> >>>> nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
> >>>> Nov 15 17:33:03 dbo2 kernel: (11,1):o2hb_stop_all_regions:1889 ERROR:
> >>>> stopping heartbeat on all active regions.
> >>>> Nov 15 17:33:03 dbo2 kernel: Kernel panic: ocfs2 is very sorry to be
> >>>> fencing this system by panicing
> >>>> Nov 15 17:33:03 dbo2 kernel:
> >>>>
> >>>> NODE3 (dbo3)
> >>>> ========================================================================================================
> >>>>
> >>>>
> >>>> Nov 15 17:12:49 dbo3 kernel: (13,3):o2hb_write_timeout:270 ERROR:
> >>>> Heartbeat write timeout to device sdb2
> >>>>                     after 12000 milliseconds
> >>>> Nov 15 17:12:49 dbo3 kernel: Heartbeat thread (13) printing last 24
> >>>> blocking operations (cur = 11):
> >>>> Nov 16 10:45:32 dbo3 syslogd 1.4.1: restart.
> >>>>
> >>>>
> >>>> any help is greatly appreciated (BTW, I've read the ocfs2 user guide).
> >>>>
> >>>> thanks
> >>>> -peter
> >>>>
> >>>>     
> > _______________________________________________
> > Ocfs2-users mailing list
> > Ocfs2-users at oss.oracle.com
> > http://oss.oracle.com/mailman/listinfo/ocfs2-users
> > 
> 
-- 
Andy Phillips
Systems Architecture Manager, Betfair.com

Office: 0208 8348436

Betfair Ltd|Winslow Road|Hammersmith Embankment|London|W69HP 
Company No. 5140986 