[Ocfs2-users] Re: o2hb_write_timeout:270 ERROR: Heartbeat write timeout
Peter Santos
psantos at cheetahmail.com
Tue Nov 21 09:37:06 PST 2006
Sunil,
After trying to chase this down, I think one of our SAs might have restarted the storage without
notifying anyone.
Similarly, today a disk that was not in use was re-initialized, and it brought everything down. I don't
know if this is an issue with ocfs2 or with (old storage + our SA doing this incorrectly).
The idea was to re-initialize a disk that was not being used (sdc) without affecting
the ocfs2 storage (sdb).
After the re-initialization completed, I noticed that all 3 nodes had stopped working; this is
what I found on dbo3:
=======================================================================================================================
Nov 21 11:40:36 dbo3 kernel: o2net: connection to node dbo2 (num 1) at 192.168.134.141:7777 has been idle for 10
seconds, shutting it down.
Nov 21 11:40:36 dbo3 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr
1164127226.293816 now 1164127236.291931 dr 1164127226.293797 adv 1164127226.293818:1164127226.293819 func (a77953f3:2)
1164124426.747626:1164124426.747628)
Nov 21 11:40:36 dbo3 kernel: o2net: no longer connected to node dbo2 (num 1) at 192.168.134.141:7777
Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 591502543
Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
...
Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 591502568
Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 1983
Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO Error -5
Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 3921780
Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO Error -5
Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
...
Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
Nov 21 11:41:11 dbo3 su: pam_unix2: session finished for user oracle, service su
Nov 21 11:41:11 dbo3 logger: Oracle CSSD failure 134.
Nov 21 11:45:07 dbo3 syslogd 1.4.1: restart.
I'm curious about the message
"o2net: connection to node dbo2 (num 1) at 192.168.134.141:7777 has been idle for 10 seconds, shutting it down."
I have increased my O2CB_HEARTBEAT_THRESHOLD to 61, but where is this message getting "10 seconds" from?
Also, this message is displayed because dbo2 was not able to check into the heartbeat filesystem, right?
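For my own sanity check: if I'm reading the OCFS2 FAQ correctly, the disk heartbeat timeout is derived from the threshold as (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, while the "10 seconds" above looks like the separate o2net idle timer, which I believe is fixed in this kernel version rather than driven by the threshold. A quick sketch of that arithmetic (the formula is my assumption from the FAQ, not something I've verified in the source):

```shell
#!/bin/sh
# Effective disk-heartbeat timeout, assuming the (threshold - 1) * 2s rule.
THRESHOLD=61                              # our O2CB_HEARTBEAT_THRESHOLD
echo "threshold $THRESHOLD -> $(( (THRESHOLD - 1) * 2000 )) ms"
# -> threshold 61 -> 120000 ms

# The old default threshold of 7 would give 12000 ms, which matches the
# "after 12000 milliseconds" in the Nov 15 write-timeout messages below.
DEFAULT_THRESHOLD=7
echo "threshold $DEFAULT_THRESHOLD -> $(( (DEFAULT_THRESHOLD - 1) * 2000 )) ms"
# -> threshold 7 -> 12000 ms
```

If that's right, the 12000 ms in the Nov 15 logs suggests the raised threshold wasn't yet in effect on those nodes, and the 10-second figure is unrelated to it either way.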
- -peter
Sunil Mushran wrote:
> On nodes db01 and db03 hb timed-out at 17:12:49. However, the nodes
> did not fully panic. As in, the network was shutdown but the hb thread
> was still going strong for some reason.
>
> Within 10 secs of that, by 17:12:59, db02 detected loss of network
> connectivity with both nodes db01 and db03. However, it was still
> seeing the nodes hb on disk and assumed that they were alive. As per
> quorum rules, it panicked.
>
> So the qs is: what was happening on nodes db01 and db03 after 17:12:49?
>
> Peter Santos wrote:
> Folks,
>
> I'm trying to piece together what happened during a recent event where
> our 3 node RAC cluster had problems.
> It appears that all 3 nodes restarted .. which is likely to occur if
> all 3 nodes cannot communicate with the
> shared ocfs2 storage.
>
> I did find out from our SA, that this happened during the time he was
> replacing a failed drive on the storage
> and the storage was in a degraded mode. I'm trying to understand if
> the 3 nodes had a difficult time accessing
> the shared ocfs2 volume or was it a tcp connectivity issue. There is
> nobody currently using the cluster ..so
> it should have been idle from a user perspective.
>
>
> prompt># cat /etc/fstab | grep ocfs2
>
> /dev/sdb1 /ocfs2 ocfs2 _netdev,datavolume,nointr 0 0
> /dev/sdb2 /backups ocfs2 _netdev,datavolume,nointr 0 0
>
> we have 2 ocfs2 volumes .. one is for the voting and OCR files, while
> the other is to be used as
> shared storage for backups of archivelog files etc.
>
>
> /var/log/messages
>
>
> NODE1 (dbo1)
> ========================================================================================================
>
> Nov 15 17:12:49 dbo1 kernel: (13,3):o2hb_write_timeout:270 ERROR:
> Heartbeat write timeout to device sdb2
> after 12000 milliseconds
> Nov 15 17:12:49 dbo1 kernel: Heartbeat thread (13) printing last 24
> blocking operations (cur = 13):
> Nov 16 05:44:58 dbo1 syslogd 1.4.1: restart.
>
>
> NODE2 (dbo2)
> ========================================================================================================
>
>
> Nov 15 17:12:57 dbo2 kernel: o2net: connection to node dbo1 (num 0) at
> 192.168.134.140:7777 has been idle for 10
> seconds, shutting it down.
> Nov 15 17:12:57 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some
> times that might help debug the situation: (tmr
> 1163628767.826089 now 1163628777.825614 dr 1163628767.826070 adv
> 1163628767.826104:1163628767.826105 func (f0735f96
> :506) 1163454320.893701:1163454320.893708)
> Nov 15 17:12:57 dbo2 kernel: o2net: no longer connected to node dbo1
> (num 0) at 192.168.134.140:7777
> Nov 15 17:12:59 dbo2 kernel: o2net: connection to node dbo3 (num 2) at
> 192.168.134.142:7777 has been idle for 10
> seconds, shutting it down.
> Nov 15 17:12:59 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some
> times that might help debug the situation: (tmr
> 1163628769.44144 now 1163628779.43640 dr 1163628769.44123 adv
> 1163628769.44159:1163628769.44160 func (f7e0383f:504)
> 1163540424.444236:1163540424.444248)
> Nov 15 17:12:59 dbo2 kernel: o2net: no longer connected to node dbo3
> (num 2) at 192.168.134.142:7777
> Nov 15 17:32:37 dbo2 -- MARK --
> Nov 15 17:33:03 dbo2 kernel: (11,1):o2quo_make_decision:121 ERROR:
> fencing this node because it is only connected to 1
> nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
> Nov 15 17:33:03 dbo2 kernel: (11,1):o2hb_stop_all_regions:1889 ERROR:
> stopping heartbeat on all active regions.
> Nov 15 17:33:03 dbo2 kernel: Kernel panic: ocfs2 is very sorry to be
> fencing this system by panicing
> Nov 15 17:33:03 dbo2 kernel:
>
> NODE3 (dbo3)
> ========================================================================================================
>
> Nov 15 17:12:49 dbo3 kernel: (13,3):o2hb_write_timeout:270 ERROR:
> Heartbeat write timeout to device sdb2
> after 12000 milliseconds
> Nov 15 17:12:49 dbo3 kernel: Heartbeat thread (13) printing last 24
> blocking operations (cur = 11):
> Nov 16 10:45:32 dbo3 syslogd 1.4.1: restart.
>
>
> any help is greatly appreciated (BTW, I've read the ocfs2 user guide).
>
> thanks
> -peter
>
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users