[Ocfs2-users] re: o2hb_write_timeout:270 ERROR: Heartbeat write timeout

Thu Nov 23 02:52:09 PST 2006

Peter,

   I'm attempting to run down a similar problem. Can you tell me how
your heartbeat network is configured? I'm particularly interested in the
make and type of switch, the use of bonding and which mode, and whether 
you're using jumbo frames on that interface.

    Thanks
         Andy

On Wed, 2006-11-22 at 17:24 -0800, Sunil Mushran wrote:
> As ocfs2 heartbeats on the same device, unplugging a different device on the
> storage should not affect ocfs2 as long as the ios are completing. But the
> logs indicate otherwise. HB ios are erroring out.
> 
> The o2net message is the tcp connect message. We will be providing a way
> to configure that too.
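
A rough way to check the "10 seconds" figure against the dbo3 log quoted below: the o2net_idle_timer line prints a "tmr" and a "now" timestamp, and their difference should be the idle period. This is only a sketch. It assumes each timestamp is seconds followed by an unpadded microsecond count (the ".44144" values in the dbo2 log suggest this), and the helper name idle_seconds is made up for illustration.

```python
import re

# Log text copied from the dbo3 message quoted in this thread.
LOG = ("(0,1):o2net_idle_timer:1310 here are some times that might help debug "
       "the situation: (tmr 1164127226.293816 now 1164127236.291931 "
       "dr 1164127226.293797 adv 1164127226.293818:1164127226.293819 "
       "func (a77953f3:2) 1164124426.747626:1164124426.747628)")


def idle_seconds(line):
    """Return now - tmr in seconds, or None if the fields are missing.

    Assumption: the part after the dot is a microsecond count that is not
    zero-padded, so it is parsed as an integer rather than as a decimal
    fraction.
    """
    m = re.search(r"tmr (\d+)\.(\d+) now (\d+)\.(\d+)", line)
    if m is None:
        return None
    tmr_usec = int(m.group(1)) * 1_000_000 + int(m.group(2))
    now_usec = int(m.group(3)) * 1_000_000 + int(m.group(4))
    return (now_usec - tmr_usec) / 1_000_000


print(idle_seconds(LOG))  # 9.998115
```

The result is just under 10 seconds, which is consistent with the message coming from a network idle timeout rather than from the disk heartbeat setting.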
> 
> Peter Santos wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Sunil,
> >
> > After trying to chase this down, I think one of our SAs might have restarted the storage without
> > notifying anyone.
> >
> > Similarly, today a disk that was not in use was re-initialized and caused everything to come down. I don't
> > know if this is an issue with ocfs2 or ( old_storage + our sa doing this incorrectly).
> >
> > The idea was to re-initialize a disk that was not being used (sdc) and not have it affect
> > the ocfs2 storage (sdb).
> >
> > After the re-initialization completed, I noticed that all 3 nodes weren't working and this was
> > what I found on dbo3
> >
> > =======================================================================================================================
> > Nov 21 11:40:36 dbo3 kernel: o2net: connection to node dbo2 (num 1) at 192.168.134.141:7777 has been idle for 10
> > seconds, shutting it down.
> >
> > Nov 21 11:40:36 dbo3 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr
> > 1164127226.293816 now 1164127236.291931 dr 1164127226.293797 adv 1164127226.293818:1164127226.293819 func (a77953f3:2)
> > 1164124426.747626:1164124426.747628)
> >
> >
> > Nov 21 11:40:36 dbo3 kernel: o2net: no longer connected to node dbo2 (num 1) at 192.168.134.141:7777
> >
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 591502543
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > ...
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 591502568
> > Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> > Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 1983
> > Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO Error -5
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 3921780
> > Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO Error -5
> > Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> > Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> > ...
> > Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> > Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> > Nov 21 11:41:11 dbo3 su: pam_unix2: session finished for user oracle, service su
> > Nov 21 11:41:11 dbo3 logger: Oracle CSSD failure 134.
> > Nov 21 11:45:07 dbo3 syslogd 1.4.1: restart.
> >
> > I'm curious about the message
> > "o2net: connection to node dbo2 (num 1) at 192.168.134.141:7777 has been idle for 10 seconds, shutting it down."
> >
> > I have increased my O2CB_HEARTBEAT_THRESHOLD to 61, but where is this message getting "10 seconds" from?
> > Also, this message is displayed because dbo2 was not able to check into the heartbeat filesystem, right?
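
For comparison, the disk heartbeat timeout is a separate setting from the o2net idle timeout. A minimal sketch, assuming the commonly documented relation timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds; the function and constant names are hypothetical.

```python
# Assumed: one disk heartbeat write every 2 seconds.
HEARTBEAT_INTERVAL_MS = 2000


def hb_write_timeout_ms(threshold):
    """Disk heartbeat write timeout implied by O2CB_HEARTBEAT_THRESHOLD."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_MS


print(hb_write_timeout_ms(7))   # 12000
print(hb_write_timeout_ms(61))  # 120000
```

Under that assumption a threshold of 7 gives the "after 12000 milliseconds" seen in the o2hb_write_timeout lines further down, and 61 would give 120 seconds. Neither value produces 10 seconds, so the o2net figure has to come from somewhere else.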
> >
> > - -peter
> >
> >
> >
> >
> >
> > Sunil Mushran wrote:
> >   
> >> On nodes dbo1 and dbo3, hb timed out at 17:12:49. However, the nodes
> >> did not fully panic. That is, the network was shut down but the hb thread
> >> was still going strong for some reason.
> >>
> >> Within 10 secs of that, by 17:12:59, dbo2 detected loss of network
> >> connectivity with both nodes dbo1 and dbo3. However, it was still
> >> seeing the nodes' hb on disk and assumed that they were alive. As per
> >> quorum rules, it panicked.
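
The quorum decision described here can be sketched roughly as follows. This is an illustration of the majority rule as reflected in the o2quo_make_decision log line quoted further down, not the kernel code. The tie-break for an even node count (the half that can still reach the lowest-numbered node survives) is an assumption, and all names are hypothetical.

```python
def should_fence(heartbeating, connected, lowest_reachable=True):
    """Return True if this node should fence itself.

    heartbeating: nodes seen heartbeating on disk, including this one.
    connected: nodes reachable over the network, including this one.
    lowest_reachable: whether the lowest-numbered heartbeating node is
        reachable (assumed tie-break, only used for an even split).
    """
    if heartbeating % 2 == 1:
        # Odd count: a strict majority is required.
        quorum = (heartbeating + 1) // 2
        return connected < quorum
    # Even count: half is enough only with the tie-break in our favour.
    quorum = heartbeating // 2
    if connected < quorum:
        return True
    return connected == quorum and not lowest_reachable


# dbo2 still saw 3 nodes heartbeating on disk but could reach only itself.
print(should_fence(heartbeating=3, connected=1))  # True
```

With 3 heartbeating nodes the quorum is 2, so a node connected only to itself fences, which matches "only connected to 1 nodes and 2 is needed".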
> >>
> >> So the question is: what was happening on nodes dbo1 and dbo3 after 17:12:49?
> >>
> >> Peter Santos wrote:
> >> Folks,
> >>     
> >> I'm trying to piece together what happened during a recent event where
> >> our 3 node RAC cluster had problems.
> >> It appears that all 3 nodes restarted .. which is likely to occur if
> >> all 3 nodes cannot communicate with the
> >> shared ocfs2 storage.
> >>
> >> I did find out from our SA, that this happened during the time he was
> >> replacing a failed drive on the storage
> >> and the storage was in a degraded mode.  I'm trying to understand if
> >> the 3 nodes had a difficult time accessing
> >> the shared ocfs2 volume or was it a tcp connectivity issue. There is
> >> nobody currently using the cluster ..so
> >> it should have been idle from a user perspective.
> >>
> >>
> >> prompt># cat /etc/fstab | grep ocfs2
> >>
> >> /dev/sdb1  /ocfs2       ocfs2      _netdev,datavolume,nointr  0 0
> >> /dev/sdb2  /backups     ocfs2      _netdev,datavolume,nointr  0 0
> >>
> >> We have 2 ocfs2 volumes: one is for the voting and OCR files, while
> >> the other is to be used as
> >> shared storage for backups of archivelog files etc.
> >>
> >>
> >> /var/log/messages
> >>
> >>
> >> NODE1 (dbo1)
> >> ========================================================================================================
> >>
> >> Nov 15 17:12:49 dbo1 kernel: (13,3):o2hb_write_timeout:270 ERROR:
> >> Heartbeat write timeout to device sdb2
> >>                     after 12000 milliseconds
> >> Nov 15 17:12:49 dbo1 kernel: Heartbeat thread (13) printing last 24
> >> blocking operations (cur = 13):
> >> Nov 16 05:44:58 dbo1 syslogd 1.4.1: restart.
> >>
> >>
> >> NODE2 (dbo2)
> >> ========================================================================================================
> >>
> >>
> >> Nov 15 17:12:57 dbo2 kernel: o2net: connection to node dbo1 (num 0) at
> >> 192.168.134.140:7777 has been idle for 10
> >> seconds, shutting it down.
> >> Nov 15 17:12:57 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some
> >> times that might help debug the situation: (tmr
> >> 1163628767.826089 now 1163628777.825614 dr 1163628767.826070 adv
> >> 1163628767.826104:1163628767.826105 func (f0735f96
> >>    :506) 1163454320.893701:1163454320.893708)
> >> Nov 15 17:12:57 dbo2 kernel: o2net: no longer connected to node dbo1
> >> (num 0) at 192.168.134.140:7777
> >> Nov 15 17:12:59 dbo2 kernel: o2net: connection to node dbo3 (num 2) at
> >> 192.168.134.142:7777 has been idle for 10
> >> seconds, shutting it down.
> >> Nov 15 17:12:59 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some
> >> times that might help debug the situation: (tmr
> >> 1163628769.44144 now 1163628779.43640 dr 1163628769.44123 adv
> >> 1163628769.44159:1163628769.44160 func (f7e0383f:504)
> >>     1163540424.444236:1163540424.444248)
> >> Nov 15 17:12:59 dbo2 kernel: o2net: no longer connected to node dbo3
> >> (num 2) at 192.168.134.142:7777
> >> Nov 15 17:32:37 dbo2 -- MARK --
> >> Nov 15 17:33:03 dbo2 kernel: (11,1):o2quo_make_decision:121 ERROR:
> >> fencing this node because it is only connected to 1
> >> nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
> >> Nov 15 17:33:03 dbo2 kernel: (11,1):o2hb_stop_all_regions:1889 ERROR:
> >> stopping heartbeat on all active regions.
> >> Nov 15 17:33:03 dbo2 kernel: Kernel panic: ocfs2 is very sorry to be
> >> fencing this system by panicing
> >> Nov 15 17:33:03 dbo2 kernel:
> >>
> >> NODE3 (dbo3)
> >> ========================================================================================================
> >>
> >> Nov 15 17:12:49 dbo3 kernel: (13,3):o2hb_write_timeout:270 ERROR:
> >> Heartbeat write timeout to device sdb2
> >>                     after 12000 milliseconds
> >> Nov 15 17:12:49 dbo3 kernel: Heartbeat thread (13) printing last 24
> >> blocking operations (cur = 11):
> >> Nov 16 10:45:32 dbo3 syslogd 1.4.1: restart.
> >>
> >>
> >> any help is greatly appreciated (BTW, I've read the ocfs2 user guide).
> >>
> >> thanks
> >> -peter
> >>
> >>     
> > _______________________________________________
> > Ocfs2-users mailing list
> > Ocfs2-users at oss.oracle.com
> > http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.1 (GNU/Linux)
> > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >
> > iD8DBQFFYzlCoyy5QBCjoT0RAp5hAJ9tQfMhZKnXmZC4+WwKkN7qpey/4QCeImS0
> > W6wm2WuTikOoZJxvpjMhxy0=
> > =4IZ/
> > -----END PGP SIGNATURE-----
> >   
> 
-- 
Andy Phillips
Systems Architecture Manager, Betfair.com

Office: 0208 8348436

Betfair Ltd|Winslow Road|Hammersmith Embankment|London|W69HP 
Company No. 5140986 