[Ocfs2-users] re: o2hb_write_timeout:270 ERROR: Heartbeat write timeout
Peter Santos
psantos at cheetahmail.com
Tue Nov 28 11:35:35 PST 2006
Andy,
Sorry for the late reply.
Since this is just a test system right now, the shared storage is presented
as a single volume (/dev/sdb). The volume is made up of about 13 disks.
Disk /dev/sdb: 545.6 GB, 545669775360 bytes
255 heads, 63 sectors/track, 66340 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         244     1959898+  83  Linux
/dev/sdb2             245       12403    97667167+  83  Linux
/dev/sdb3           12404       36719   195318270   83  Linux
/dev/sdb4           36720       66340   237930682+  83  Linux
/dev/sdb1 - ocfs2 partition for the ocr.crs and vote.crs files
/dev/sdb2 - ocfs2 partition for sharing backup files such as archivelogs
/dev/sdb3 - set up as an ASM disk for Oracle
/dev/sdb4 - set up as an ASM disk for Oracle
This is how I'm mounting them (from /etc/fstab):
/dev/sdb1 /ocfs2 ocfs2 _netdev,datavolume,nointr 0 0
/dev/sdb2 /backups ocfs2 _netdev,datavolume,nointr 0 0
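
If it helps, I believe the hand-mount equivalent of those fstab entries is the
following (mounted.ocfs2 comes with ocfs2-tools; -f is how I check which nodes
have the volumes mounted):

  mount -t ocfs2 -o _netdev,datavolume,nointr /dev/sdb1 /ocfs2
  mount -t ocfs2 -o _netdev,datavolume,nointr /dev/sdb2 /backups
  # list ocfs2 volumes and the nodes that have them mounted
  mounted.ocfs2 -f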
Switch - Dell 2716, a 16-port Gigabit switch
Storage - an old IBM H16 storage array
Physical servers - Dell 1850s
We don't use bonding, and jumbo frames have not been turned on yet.
We also use the private IPs on the machines for the ocfs2 interconnect.
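
For completeness, our /etc/ocfs2/cluster.conf looks roughly like this (I'm
reconstructing the node numbers and private IPs from the logs below, so treat
it as a sketch rather than a paste):

node:
        ip_port = 7777
        ip_address = 192.168.134.140
        number = 0
        name = dbo1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.134.141
        number = 1
        name = dbo2
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.134.142
        number = 2
        name = dbo3
        cluster = ocfs2

cluster:
        node_count = 3
        name = ocfs2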
Let me know if you need more information.
-peter
Andy Phillips wrote:
> Peter,
>
> I'm attempting to run down a similar problem. Can you tell me how
> your heartbeat network is configured? I'm particularly interested in the
> make and type of switch, the use of bonding and which mode, and whether
> you're using jumbo frames on that interface.
>
> Thanks
> Andy
>
> On Wed, 2006-11-22 at 17:24 -0800, Sunil Mushran wrote:
>> As ocfs2 heartbeats on the same device, unplugging a different device on the
>> storage should not affect ocfs2 as long as the IOs are completing. But the
>> logs indicate otherwise: the heartbeat IOs are erroring out.
>>
>> The o2net message is the TCP connect message. We will be providing a way
>> to configure that timeout too.
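>>
>> (When it lands, expect it as a sysconfig knob next to the heartbeat
>> threshold, something like the line below in /etc/sysconfig/o2cb; the name
>> and value are a guess, not a released interface:)
>>
>> O2CB_IDLE_TIMEOUT_MS=10000   # guessed knob name; 10000 ms matches the current hardcoded timer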
>>
>> Peter Santos wrote:
> Sunil,
>
> after trying to chase this down, I think one of our SAs might have restarted
> the storage without notifying anyone.
>
> Similarly, today a disk that was not in use was re-initialized, and that caused
> everything to come down. I don't know if this is an issue with ocfs2 or with
> (old storage + our SA doing this incorrectly).
>
> The idea was to re-initialize a disk that was not being used (sdc) and not have it affect
> the ocfs2 storage (sdb).
>
> After the re-initialization completed, I noticed that all 3 nodes were down, and
> this is what I found on dbo3:
>
> =======================================================================================================================
> Nov 21 11:40:36 dbo3 kernel: o2net: connection to node dbo2 (num 1) at 192.168.134.141:7777 has been idle for 10
> seconds, shutting it down.
>
> Nov 21 11:40:36 dbo3 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr
> 1164127226.293816 now 1164127236.291931 dr 1164127226.293797 adv 1164127226.293818:1164127226.293819 func (a77953f3:2)
> 1164124426.747626:1164124426.747628)
>
>
> Nov 21 11:40:36 dbo3 kernel: o2net: no longer connected to node dbo2 (num 1) at 192.168.134.141:7777
>
> Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 591502543
> Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> ...
> Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 591502568
> Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 1983
> Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO Error -5
> Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 3921780
> Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO Error -5
> Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> ...
> Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> Nov 21 11:41:11 dbo3 su: pam_unix2: session finished for user oracle, service su
> Nov 21 11:41:11 dbo3 logger: Oracle CSSD failure 134.
> Nov 21 11:45:07 dbo3 syslogd 1.4.1: restart.
>
> I'm curious about the message
> "o2net: connection to node dbo2 (num 1) at 192.168.134.141:7777 has been idle for 10 seconds, shutting it down."
>
> I have increased my O2CB_HEARTBEAT_THRESHOLD to 61, but where is this message getting "10 seconds" from?
> Also, this message is displayed because dbo2 was not able to check into the heartbeat filesystem, right?
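>
> My current understanding, which I'd like confirmed: the "10 seconds" is the
> o2net network idle timer, which is separate from the disk heartbeat and is
> hardcoded in this release, while O2CB_HEARTBEAT_THRESHOLD only controls the
> disk timeout, roughly:
>
> # /etc/sysconfig/o2cb -- disk heartbeat timeout = (threshold - 1) * 2 seconds
> O2CB_HEARTBEAT_THRESHOLD=61   # (61 - 1) * 2 = 120s before a node is declared dead
> # the default of 7 gives (7 - 1) * 2 = 12s, which matches the
> # "after 12000 milliseconds" in the o2hb_write_timeout messages below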
>
> -peter
>
> Sunil Mushran wrote:
>
>>>>> On nodes db01 and db03, hb timed out at 17:12:49. However, the nodes
>>>>> did not fully panic. As in, the network was shut down but the hb thread
>>>>> was still going strong for some reason.
>>>>>
>>>>> Within 10 secs of that, by 17:12:59, db02 detected loss of network
>>>>> connectivity with both nodes db01 and db03. However, it was still
>>>>> seeing the nodes heartbeat on disk and assumed that they were alive. As per
>>>>> quorum rules, it panicked.
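>>>>>
>>>>> As a sketch of the quorum arithmetic (not the actual o2quo code, just the
>>>>> rule it applies, visible in the o2quo_make_decision message below):
>>>>>
>>>>> HEARTBEATING=3                        # nodes still heartbeating on disk
>>>>> CONNECTED=1                           # db02 could only reach itself
>>>>> NEEDED=$(( HEARTBEATING / 2 + 1 ))    # majority of 3 is 2
>>>>> if [ "$CONNECTED" -lt "$NEEDED" ]; then
>>>>>     echo "fence (panic) this node"
>>>>> fi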
>>>>>
>>>>> So the question is: what was happening on nodes db01 and db03 after 17:12:49?
>>>>>
>>>>> Peter Santos wrote:
>>>>> Folks,
>>>>>
>>>>> I'm trying to piece together what happened during a recent event where
>>>>> our 3-node RAC cluster had problems. It appears that all 3 nodes
>>>>> restarted, which is likely to occur if all 3 nodes cannot communicate
>>>>> with the shared ocfs2 storage.
>>>>>
>>>>> I did find out from our SA that this happened while he was replacing a
>>>>> failed drive on the storage, and the storage was in a degraded mode. I'm
>>>>> trying to understand whether the 3 nodes had a difficult time accessing
>>>>> the shared ocfs2 volume or whether it was a TCP connectivity issue. There
>>>>> is nobody currently using the cluster, so it should have been idle from a
>>>>> user perspective.
>>>>>
>>>>>
>>>>> prompt># cat /etc/fstab | grep ocfs2
>>>>>
>>>>> /dev/sdb1 /ocfs2 ocfs2 _netdev,datavolume,nointr 0 0
>>>>> /dev/sdb2 /backups ocfs2 _netdev,datavolume,nointr 0 0
>>>>>
>>>>> We have 2 ocfs2 volumes: one is for the voting and OCR files, while the
>>>>> other is used as shared storage for backups of archivelog files, etc.
>>>>>
>>>>>
>>>>> /var/log/messages
>>>>>
>>>>>
>>>>> NODE1 (dbo1)
>>>>> ========================================================================================================
>>>>>
>>>>> Nov 15 17:12:49 dbo1 kernel: (13,3):o2hb_write_timeout:270 ERROR:
>>>>> Heartbeat write timeout to device sdb2
>>>>> after 12000 milliseconds
>>>>> Nov 15 17:12:49 dbo1 kernel: Heartbeat thread (13) printing last 24
>>>>> blocking operations (cur = 13):
>>>>> Nov 16 05:44:58 dbo1 syslogd 1.4.1: restart.
>>>>>
>>>>>
>>>>> NODE2 (dbo2)
>>>>> ========================================================================================================
>>>>>
>>>>>
>>>>> Nov 15 17:12:57 dbo2 kernel: o2net: connection to node dbo1 (num 0) at
>>>>> 192.168.134.140:7777 has been idle for 10
>>>>> seconds, shutting it down.
>>>>> Nov 15 17:12:57 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some
>>>>> times that might help debug the situation: (tmr
>>>>> 1163628767.826089 now 1163628777.825614 dr 1163628767.826070 adv
>>>>> 1163628767.826104:1163628767.826105 func (f0735f96
>>>>> :506) 1163454320.893701:1163454320.893708)
>>>>> Nov 15 17:12:57 dbo2 kernel: o2net: no longer connected to node dbo1
>>>>> (num 0) at 192.168.134.140:7777
>>>>> Nov 15 17:12:59 dbo2 kernel: o2net: connection to node dbo3 (num 2) at
>>>>> 192.168.134.142:7777 has been idle for 10
>>>>> seconds, shutting it down.
>>>>> Nov 15 17:12:59 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some
>>>>> times that might help debug the situation: (tmr
>>>>> 1163628769.44144 now 1163628779.43640 dr 1163628769.44123 adv
>>>>> 1163628769.44159:1163628769.44160 func (f7e0383f:504)
>>>>> 1163540424.444236:1163540424.444248)
>>>>> Nov 15 17:12:59 dbo2 kernel: o2net: no longer connected to node dbo3
>>>>> (num 2) at 192.168.134.142:7777
>>>>> Nov 15 17:32:37 dbo2 -- MARK --
>>>>> Nov 15 17:33:03 dbo2 kernel: (11,1):o2quo_make_decision:121 ERROR:
>>>>> fencing this node because it is only connected to 1
>>>>> nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
>>>>> Nov 15 17:33:03 dbo2 kernel: (11,1):o2hb_stop_all_regions:1889 ERROR:
>>>>> stopping heartbeat on all active regions.
>>>>> Nov 15 17:33:03 dbo2 kernel: Kernel panic: ocfs2 is very sorry to be
>>>>> fencing this system by panicing
>>>>> Nov 15 17:33:03 dbo2 kernel:
>>>>>
>>>>> NODE3 (dbo3)
>>>>> ========================================================================================================
>>>>>
>>>>> Nov 15 17:12:49 dbo3 kernel: (13,3):o2hb_write_timeout:270 ERROR:
>>>>> Heartbeat write timeout to device sdb2
>>>>> after 12000 milliseconds
>>>>> Nov 15 17:12:49 dbo3 kernel: Heartbeat thread (13) printing last 24
>>>>> blocking operations (cur = 11):
>>>>> Nov 16 10:45:32 dbo3 syslogd 1.4.1: restart.
>>>>>
>>>>>
>>>>> Any help is greatly appreciated (BTW, I've read the ocfs2 user guide).
>>>>>
>>>>> thanks
>>>>> -peter
>>>>>
>>>>>