[Ocfs2-users] ocfs2 issue? : unexplained reboots of RHEL 4 server (kernel:2.6.9-42.0.2.ELs)

Derek Hazell derek.hazell at gmail.com
Sun Aug 24 04:08:01 PDT 2008


Hi Sunil,
I checked the grub.conf file on the machine that reboots and there is no
reference to the io scheduler in it (so no deadline setting). I will check
when back at work on Monday, but I suspect that we are just using the
default io scheduler, which would be cfq.
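
For anyone repeating the check, here is a minimal sketch of it in Python. It
assumes a legacy GRUB grub.conf where boot options sit on the "kernel" lines
and the io scheduler, if set, appears as an elevator= option. The function
name find_elevator and the sample stanza are hypothetical, not taken from our
system.

```python
import re

def find_elevator(grub_conf_text):
    """Return (kernel_line, scheduler_or_None) for each kernel line."""
    results = []
    for line in grub_conf_text.splitlines():
        words = line.split()
        # Only "kernel ..." lines carry boot parameters; skip everything else.
        if not words or words[0] != "kernel":
            continue
        match = re.search(r"\belevator=(\S+)", line)
        results.append((line.strip(), match.group(1) if match else None))
    return results

# Hypothetical stanza, for illustration only.
SAMPLE = """\
title Example Linux
    root (hd0,0)
    kernel /vmlinuz-example ro root=/dev/example
    initrd /initrd-example.img
"""

for kernel_line, scheduler in find_elevator(SAMPLE):
    print(scheduler or "no elevator= option (kernel default applies)")
```

A result of None for every kernel line matches what I saw: nothing selects a
scheduler at boot, so the kernel default is in use.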

Just to briefly elaborate, our ocfs2 cluster consists of three nodes: one
node (or its backup) mounts the ocfs2 filesystem read/write, while the two
other nodes mount it read only. It is always the read/write node that
automatically reboots (fences, as we now know), though sometimes, but not
always, the other systems also need to be rebooted to get everything working
properly. The problem could be load-related but it is difficult to be sure.
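
To make the trace quoted further down easier to read, here is a rough sketch
that picks the slowest heartbeat step out of the "Index N: took X ms to do
..." lines and compares it with the timeout implied by the heartbeat dead
threshold. It assumes the (threshold - 1) * 2 seconds rule described in the
OCFS2 FAQ. The function names are made up for illustration, and the sample
lines are copied from the serial console trace in this thread.

```python
import re

INDEX_RE = re.compile(r"Index (\d+): took (\d+) ms to do (.+)")

def heartbeat_timeout_ms(dead_threshold):
    # Assumption: timeout = (threshold - 1) * 2 seconds, per the OCFS2 FAQ.
    return (dead_threshold - 1) * 2 * 1000

def slowest_operation(trace_text):
    """Return (index, ms, description) of the slowest logged step, or None."""
    entries = [
        (int(m.group(1)), int(m.group(2)), m.group(3).strip())
        for m in INDEX_RE.finditer(trace_text)
    ]
    return max(entries, key=lambda entry: entry[1]) if entries else None

# Three lines copied from the serial console trace quoted in this thread.
TRACE = """\
Index 3: took 5240 ms to do waiting for write completion
Index 7: took 0 ms to do submit_bio for read
Index 8: took 120303 ms to do waiting for read completion
"""

timeout = heartbeat_timeout_ms(61)  # 61 is our configured dead threshold
index, ms, what = slowest_operation(TRACE)
print(f"timeout {timeout} ms; slowest step: index {index}, {ms} ms, {what}")
print("exceeded timeout" if ms >= timeout else "within timeout")
```

With our threshold of 61 this gives 120000 ms, which agrees with the
"Heartbeat write timeout ... after 120000 milliseconds" line on the serial
console, and the read at index 8 is the step that ran past it.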

I will discuss with my colleagues whether to try the deadline option and/or
set up a private network for the ocfs2 members. The deadline option is very
easy to try (it involves a small change to grub.conf and a reboot), while
setting up the private network is a little more work but not hard.
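
As a sketch of what the grub.conf change amounts to, the Python below appends
elevator=deadline to any kernel line that does not already set an elevator.
It works on text only and does not touch the real file. The function name
add_deadline and the sample stanza are hypothetical; elevator=deadline is the
usual kernel boot option for selecting the deadline scheduler, but please
check the ocfs2 FAQ before relying on this.

```python
def add_deadline(grub_conf_text):
    """Append elevator=deadline to kernel lines that set no elevator."""
    out = []
    for line in grub_conf_text.splitlines():
        words = line.split()
        is_kernel = bool(words) and words[0] == "kernel"
        has_elevator = any(word.startswith("elevator=") for word in words)
        if is_kernel and not has_elevator:
            line = line.rstrip() + " elevator=deadline"
        out.append(line)
    return "\n".join(out) + "\n"

# Hypothetical stanza, for illustration only.
SAMPLE = """\
title Example Linux
    root (hd0,0)
    kernel /vmlinuz-example ro root=/dev/example
    initrd /initrd-example.img
"""

print(add_deadline(SAMPLE), end="")
```

Running it twice changes nothing further, since a kernel line that already
carries an elevator= option is left alone. The new setting only takes effect
after the reboot.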
rgds
Derek

2008/8/24 Sunil Mushran <sunil.mushran at oracle.com>

> Which io scheduler are you using? On el4, it is best to use deadline.
> cfq is the default. Check the faq for details on using deadline.
>
> Derek Hazell wrote:
>
>>
>> Hi Ocfs2 user
>> We got some relevant log messages via a serial console and via a putty
>> session logged in as root.
>> I suspect we need to set up a private network between the ocfs2 cluster
>> members, is this right? Anything else we might need to do?
>>  regards, I appreciate your help
>>
>> Derek
>> ########################################################
>> CURRENT O2CB CONFIG
>>  [root at sysname fs]# /etc/init.d/o2cb configure
>> Configuring the O2CB driver.
>> This will configure the on-boot properties of the O2CB driver.
>> The following questions will determine whether the driver is loaded on
>> boot.  The current values will be shown in brackets ('[]').  Hitting
>> <ENTER> without typing an answer will keep that current value.  Ctrl-C
>> will abort.
>> Load O2CB driver on boot (y/n) [y]:
>> Cluster to start on boot (Enter "none" to clear) [ocfs2]:
>> Specify heartbeat dead threshold (>=7) [61]:
>> Specify network idle timeout in ms (>=5000) [60000]: 120000
>> Specify network keepalive delay in ms (>=1000) [2000]:
>> Specify network reconnect delay in ms (>=2000) [2000]:
>> Writing O2CB configuration: OK
>> O2CB cluster ocfs2 already online
>> [root at sysname fs]#
>> ##################
>> TRACE OF ROOT PUTTY LOGIN
>>
>> [root at sysname ~]#
>> Message from syslogd at sysname at Fri Aug 22 23:12:03 2008 ...
>> sysname kernel: Heartbeat thread (11) printing last 24 blocking operations
>> (cur = 8):
>>
>> Message from syslogd at sysname at Fri Aug 22 23:12:03 2008 ...
>> sysname kernel: Heartbeat thread stuck at waiting for read completion,
>> stuffing current time into that blocker (index 8)
>>
>> Message from syslogd at sysname at Fri Aug 22 23:12:03 2008 ...
>> sysname kernel: Index 9: took 0 ms to do bio alloc read
>>
>> .
>> .
>> .
>>
>> Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
>> sysname kernel: Index 3: took 5240 ms to do waiting for write completion
>>
>> Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
>> sysname kernel: Index 4: took 0 ms to do allocating bios for read
>>
>> Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
>> sysname kernel: Index 5: took 0 ms to do bio alloc read
>>
>> Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
>> sysname kernel: Index 6: took 0 ms to do bio add page read
>>
>> Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
>> sysname kernel: Index 7: took 0 ms to do submit_bio for read
>>
>> Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
>> sysname kernel: Index 8: took 120303 ms to do waiting for read completion
>>  #############
>> TRACE OF SERIAL CONSOLE:
>> (11,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device
>> emcpowerb1 after 120000 milliseconds
>> Heartbeat thread (11) printing last 24 blocking operations (cur = 8):
>> Heartbeat thread stuck at waiting for read completion, stuffing current
>> time into that blocker (index 8)
>> Index 9: took 0 ms to do bio alloc read
>> Index 10: took 0 ms to do bio add page read
>> Index 11: took 0 ms to do submit_bio for read
>> Index 12: took 3025 ms to do waiting for read completion
>> Index 13: took 0 ms to do bio alloc write
>> Index 14: took 0 ms to do bio add page write
>> Index 15: took 0 ms to do submit_bio for write
>> Index 16: took 0 ms to do checking slots
>> Index 17: took 7221 ms to do waiting for write completion
>> Index 18: took 0 ms to do allocating bios for read
>> Index 19: took 0 ms to do bio alloc read
>> Index 20: took 0 ms to do bio add page read
>> Index 21: took 0 ms to do submit_bio for read
>> Index 22: took 3892 ms to do waiting for read completion
>> Index 23: took 0 ms to do bio alloc write
>> Index 0: took 0 ms to do bio add page write
>> Index 1: took 0 ms to do submit_bio for write
>> Index 2: took 0 ms to do checking slots
>> Index 3: took 5240 ms to do waiting for write completion
>> Index 4: took 0 ms to do allocating bios for read
>> Index 5: took 0 ms to do bio alloc read
>> Index 6: took 0 ms to do bio add page read
>> Index 7: took 0 ms to do submit_bio for read
>> Index 8: took 120303 ms to do waiting for read completion
>> *** ocfs2 is very sorry to be fencing this system by restarting ***
>> Bootdata ok (command line is ro root=/dev/VolGroup_ID_12182/LogVol1
>> console=ttyS0,9600n8)
>>
>>   ################################################################################
>> -----Original Message-----
>> From: ocfs2-users-bounces at oss.oracle.com
>> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Sunil Mushran
>> Sent: Tuesday, 19 August 2008 3:56 AM
>> To: _Derek Hazell (Internet)
>> Cc: ocfs2-users at oss.oracle.com
>> Subject: Re: [Ocfs2-users] ocfs2 issue? : unexplained reboots of RHEL 4
>> server (kernel:2.6.9-42.0.2.ELs)
>>
>> Configure a netdump or netconsole server. It will catch the relevant
>> messages.
>>
>>
>> ################################################################################

