[Ocfs2-users] strange node reboot in RAC environment

Schmitter, Martin Martin.Schmitter at opitz-consulting.de
Wed Feb 4 01:06:57 PST 2009


Hi All,

Based on my experience with OCFS2 and RAC 10g (10.2.0.3 and 10.2.0.4), I would suggest the following:

* Update to OCFS2 1.2.9-1.

* Raise the timeouts. The current values
  (heartbeat dead threshold: 31, network idle timeout: 10000)
  are definitely too low. Keep in mind that the OCFS2 timeouts should be higher than the CRS timeouts. Sorry, but finding the right timeout values is largely guesswork; you have to do a lot of testing.
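For what it's worth, here is a rough sketch of where these knobs live once you are on the newer tools. The variable names are the ones used by the 1.2.7/1.4-era /etc/sysconfig/o2cb, the values are examples only (not recommendations), and older init scripts may spell them differently, so running "service o2cb configure" and answering the prompts is the safer route. Note that a heartbeat dead threshold of 31 means (31 - 1) * 2 s = 60 s, which is exactly the "60000 milliseconds" in the heartbeat write timeout message further down.

    # /etc/sysconfig/o2cb -- must be identical on every node
    O2CB_HEARTBEAT_THRESHOLD=61      # disk heartbeat: (61 - 1) * 2 s = 120 s
    O2CB_IDLE_TIMEOUT_MS=60000       # network idle timeout in ms
    O2CB_KEEPALIVE_DELAY_MS=2000
    O2CB_RECONNECT_DELAY_MS=2000

    # apply on each node with all OCFS2 volumes unmounted:
    #   service o2cb offline && service o2cb online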

Don't put a shared CRS_HOME or ORACLE_HOME on OCFS2, and do not put the voting disk and OCR on OCFS.

Everything without warranty, of course… ;-)


Good Luck


Martin Schmitter




________________________________________
From: ocfs2-users-bounces at oss.oracle.com [ocfs2-users-bounces at oss.oracle.com] on behalf of Ulf Zimmermann [ulf at openlane.com]
Sent: Tuesday, February 3, 2009 20:48
To: Pedro Figueira; ocfs2-users at oss.oracle.com
Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
Subject: Re: [Ocfs2-users] strange node reboot in RAC environment

> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-
> bounces at oss.oracle.com] On Behalf Of Pedro Figueira
> Sent: 02/03/2009 09:07
> To: ocfs2-users at oss.oracle.com
> Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
> Subject: [Ocfs2-users] strange node reboot in RAC environment
>
> Hi all
>
> We have a 4-node Oracle RAC cluster with the following software
> versions:
>
> Oracle and clusterware version 10.2.0.4
> Red Hat Enterprise Linux AS release 4 with kernel version 2.6.9-
> 55.ELlargesmp
> ocfs2-tools-1.2.4-1
> ocfs2-2.6.9-55.ELlargesmp-1.2.5-2
> ocfs2console-1.2.4-1
> timeout parameters:
>   Heartbeat dead threshold: 31
>   Network idle timeout: 10000
>   Network keepalive delay: 5000
>   Network reconnect delay: 2000
>
> Until late last year the cluster was rock solid (hundreds). Since
> January all of the servers have started rebooting at the same time, but
> the strange thing is that there are no log messages in /var/log/messages,
> so we don't know whether this is an ocfs2-related problem. The reboots
> seem to be related to the backup process (maybe extra load?). Other
> reboots only affect 2 out of 4 nodes.

As ocfs2 prints its messages to the console, and those may not be captured
anywhere, I recommend setting up the iLO virtual serial port and using
something like conserver to attach a console to it. I do this for all our
OCFS2 hosts and get a log of everything that happens, including the BIOS
screens. If ocfs2 is fencing because of I/O issues, it will show up there.
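
As a rough illustration of the kernel side of this (the serial device and
speed below are assumptions; which ttyS the iLO virtual serial port shows
up as depends on the BIOS settings of the box):

    # /boot/grub/grub.conf: append to the existing "kernel" line, e.g.
    #   console=tty0 console=ttyS1,115200n8
    # The kernel logs to every console= device listed; the last one
    # becomes /dev/console, and console=tty0 keeps the local VGA output.

conserver then only has to be attached to the iLO virtual serial and log it,
and the o2hb/o2net messages printed right before a fence end up in that log.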

>
> Last night we updated the firmware and drivers from HP of the DL580G4
> server and today we had another reboot (now with the following messages
> in /var/log/messages):
>
> NODE 1:
> ------------------------------------------------------
> Feb  3 14:12:52 grid2db1 kernel: o2net: connection to node grid2db4
> (num 3) at 10.0.2.52:7777 has been idle for 10.0 seconds, shutting it
> down.
> Feb  3 14:12:52 grid2db1 kernel: (0,0):o2net_idle_timer:1418 here are
> some times that might help debug the situation: (tmr 1233670362.97595
> now 1233670372.96280 dr 1233670362.97580 adv
> 1233670362.97604:1233670362.97604 func (c77ed98a:504)
> 1233670067.138220:1233670067.138233)
> Feb  3 14:12:52 grid2db1 kernel: o2net: no longer connected to node
> grid2db4 (num 3) at 10.0.2.52:7777
> Feb  3 14:16:26 grid2db1 syslogd 1.4.1: restart.
> Feb  3 14:16:26 grid2db1 syslog: syslogd startup succeeded
>
> NODE 4:
> ------------------------------------------------------
> Feb  3 14:12:46 grid2db4 kernel: (20,2):o2hb_write_timeout:269 ERROR:
> Heartbeat write timeout to device sdl after 60000 milliseconds
> Feb  3 14:12:46 grid2db4 kernel: Heartbeat thread (20) printing last 24
> blocking operations (cur = 18):
> Feb  3 14:16:27 grid2db4 syslogd 1.4.1: restart.
> Feb  3 14:16:27 grid2db4 syslog: syslogd startup succeeded
>
> Other reboots simply don't log any error message.
>
> So my questions are: is it possible these reboots are triggered by OCFS2,
> and how can we debug this problem? Should I change the timeout parameters?
>
> We are also planning to upgrade to OCFS2 1.2.9-1, OCFS2 Tools 1.2.7-1,
> and the latest distro kernel. Anything to watch out for?
>
> Best regards and thanks for any answer.
>
> Pedro Figueira
> Serviço de Estrangeiros e Fronteiras
> Direcção Central de Informática
> Departamento de Produção
> Telefone: + 351 217 115 153
>
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-
> bounces at oss.oracle.com] On Behalf Of Sunil Mushran
> Sent: Saturday, January 31, 2009 15:59
> To: Carl Benson
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] one node rejects connection from new node
>
> Nodes can be added to an online cluster. The instructions are listed
> in the user's guide.
>
> On Jan 31, 2009, at 7:53 AM, Carl Benson <cbenson at fhcrc.org> wrote:
>
> > Sunil,
> >
> > Thank you for responding. I will try o2cb_ctl on Monday, when I have
> > physical access to hit Reset in case one or more nodes lock up.
> >
> > If there really is a requirement to restart the cluster on wilson1
> > every time
> > I add a new node (and I have five or six more nodes to add), that is
> > too
> > bad. Wilson1 is a 24x7 production system.
> >
> > --Carl Benson
> >
> > Sunil Mushran wrote:
> >> Could be that the cluster was already online on wilson1 when you
> >> propagated the cluster.conf to all nodes. If so, restart the cluster
> >> on that node.
> >>
> >> To add a node to an online cluster, you need to use the o2cb_ctl
> >> command. Details are in the 1.4 user's guide.
> >>
> >>
> >> Carl J. Benson wrote:
> >>
> >>> Hello.
> >>>
> >>> I have three systems that share an ocfs2 filesystem, and I'm
> >>> trying to add a fourth system.
> >>>
> >>> These are all openSUSE 11.1, x86_64, kernel 2.6.27.7-9-default.
> >>> All have RPMs ocfs2-tools-1.4.1-6.9 and ocfs2console-1.4.1-6.9
> >>>
> >>> cluster.conf looks like this:
> >>> node:
> >>>        ip_port = 7777
> >>>        ip_address = 140.107.170.116
> >>>        number = 0
> >>>        name = merlot1
> >>>        cluster = ocfs2
> >>>
> >>> node:
> >>>        ip_port = 7777
> >>>        ip_address = 140.107.158.54
> >>>        number = 1
> >>>        name = merlot2
> >>>        cluster = ocfs2
> >>>
> >>> node:
> >>>        ip_port = 7777
> >>>        ip_address = 140.107.158.82
> >>>        number = 2
> >>>        name = wilson1
> >>>        cluster = ocfs2
> >>>
> >>> node:
> >>>        ip_port = 7778
> >>>        ip_address = 140.107.170.108
> >>>        number = 3
> >>>        name = gladstone
> >>>        cluster = ocfs2
> >>>
> >>> cluster:
> >>>        node_count = 4
> >>>        name = ocfs2
> >>>
> >>> gladstone is the new node.
> >>>
> >>> I edited the cluster.conf on wilson1 using ocfs2console, and
> >>> propagated it to the other systems from there.
> >>>
> >>> When I try to bring my ocfs2 online with /etc/init.d/o2cb online
> >>> ocfs2,
> >>> merlot1 accepts the connection from gladstone, as does merlot2.
> >>> However, wilson1 rejects it as an unknown node! For example:
> >>>
> >>> Jan 30 14:11:46 wilson1 kernel: (4447,3):o2net_accept_one:1795
> >>> attempt
> >>> to connect from unknown node at 140.107.170.108:37795
> >>>
> >>> Why would this happen?
> >>>
> >>>
> >>
> >>
> >
>

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

