[Ocfs2-users] strange node reboot in RAC environment

Pedro Figueira Pedro.Figueira at sef.pt
Tue Feb 3 09:06:46 PST 2009


Hi all

We have a 4-node Oracle RAC cluster with the following software versions:

Oracle and clusterware version 10.2.0.4
Red Hat Enterprise Linux AS release 4 with kernel version 2.6.9-55.ELlargesmp
ocfs2-tools-1.2.4-1
ocfs2-2.6.9-55.ELlargesmp-1.2.5-2
ocfs2console-1.2.4-1
timeout parameters:
  Heartbeat dead threshold: 31
  Network idle timeout: 10000
  Network keepalive delay: 5000
  Network reconnect delay: 2000
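For reference, these timeouts normally map onto the following variables in /etc/sysconfig/o2cb (a sketch based on the stock o2cb init script; exact variable names may differ between ocfs2-tools releases):

```shell
# /etc/sysconfig/o2cb -- O2CB cluster timeout settings (values from above)
O2CB_HEARTBEAT_THRESHOLD=31     # heartbeat iterations before a node is considered dead
O2CB_IDLE_TIMEOUT_MS=10000      # network idle timeout
O2CB_KEEPALIVE_DELAY_MS=5000    # network keepalive delay
O2CB_RECONNECT_DELAY_MS=2000    # network reconnect delay
```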

Until late last year the cluster was rock solid (hundreds). Since January, all the servers have started rebooting in sync, but the strange thing is that there are no log messages in /var/log/messages, so we don't know whether this is an OCFS2-related problem. The reboots seem to be related to the backup process (maybe extra load?). Other reboots affect only 2 of the 4 nodes.

Last night we updated the HP firmware and drivers on the DL580G4 server, and today we had another reboot (this time with the following messages in /var/log/messages):

NODE 1:
------------------------------------------------------
Feb  3 14:12:52 grid2db1 kernel: o2net: connection to node grid2db4 (num 3) at 10.0.2.52:7777 has been idle for 10.0 seconds, shutting it down.
Feb  3 14:12:52 grid2db1 kernel: (0,0):o2net_idle_timer:1418 here are some times that might help debug the situation: (tmr 1233670362.97595 now 1233670372.96280 dr 1233670362.97580 adv 1233670362.97604:1233670362.97604 func (c77ed98a:504) 1233670067.138220:1233670067.138233)
Feb  3 14:12:52 grid2db1 kernel: o2net: no longer connected to node grid2db4 (num 3) at 10.0.2.52:7777
Feb  3 14:16:26 grid2db1 syslogd 1.4.1: restart.
Feb  3 14:16:26 grid2db1 syslog: syslogd startup succeeded

NODE 4:
------------------------------------------------------
Feb  3 14:12:46 grid2db4 kernel: (20,2):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device sdl after 60000 milliseconds
Feb  3 14:12:46 grid2db4 kernel: Heartbeat thread (20) printing last 24 blocking operations (cur = 18):
Feb  3 14:16:27 grid2db4 syslogd 1.4.1: restart.
Feb  3 14:16:27 grid2db4 syslog: syslogd startup succeeded
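For what it's worth, the 60000 ms in the o2hb_write_timeout message on node 4 looks consistent with the configured heartbeat dead threshold: as I understand the O2CB docs, the disk heartbeat fires every 2 seconds and a node self-fences after (threshold - 1) intervals of blocked I/O:

```python
# Sketch of the O2CB disk-heartbeat fence window (assumes the documented
# 2-second heartbeat interval; verify against your ocfs2-tools release).
HB_INTERVAL_MS = 2000

def fence_window_ms(dead_threshold: int) -> int:
    """Milliseconds of failed heartbeat writes before the node self-fences."""
    return (dead_threshold - 1) * HB_INTERVAL_MS

print(fence_window_ms(31))  # 60000 -- matches the "60000 milliseconds" in the log
```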

Other reboots simply don't log any error messages.

So my questions are: is it possible these reboots are triggered by OCFS2, and how can we debug this problem? Should I change the timeout parameters?

We are also planning to upgrade to OCFS2 1.2.9-1, OCFS2 Tools 1.2.7-1, and the latest distro kernel; is there any catch we should be aware of?

Best regards and thanks for any answer.

Pedro Figueira 
Serviço de Estrangeiros e Fronteiras 
Direcção Central de Informática
Departamento de Produção 
Telefone: + 351 217 115 153

-----Original Message-----
From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Saturday, 31 January 2009 15:59
To: Carl Benson
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] one node rejects connection from new node

Nodes can be added to an online cluster. The instructions are listed  
in the user's guide.
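[For the archives: the o2cb_ctl invocation for adding a node to a running cluster looks roughly like the following, using the gladstone entry from the cluster.conf later in this thread; please check the 1.4 user's guide for the exact flags.]

```shell
# Run on every node of the online cluster; -i also updates the live
# (configfs) configuration, not just /etc/ocfs2/cluster.conf.
o2cb_ctl -C -i -n gladstone -t node \
    -a number=3 -a ip_address=140.107.170.108 -a ip_port=7778 \
    -a cluster=ocfs2
```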

On Jan 31, 2009, at 7:53 AM, Carl Benson <cbenson at fhcrc.org> wrote:

> Sunil,
>
> Thank you for responding. I will try o2cb_ctl on Monday, when I have
> physical access to hit Reset in case one or more nodes lock up.
>
> If there really is a requirement to restart the cluster on wilson1  
> every time
> I add a new node (and I have five or six more nodes to add), that is  
> too
> bad. Wilson1 is a 24x7 production system.
>
> --Carl Benson
>
> Sunil Mushran wrote:
>> Could be that the cluster was already online on wilson1 when you
>> propagated the cluster.conf to all nodes. If so, restart the cluster
>> on that node.
>>
>> To add a node to an online cluster, you need to use the o2cb_ctl
>> command. Details are in the 1.4 user's guide.
>>
>>
>> Carl J. Benson wrote:
>>
>>> Hello.
>>>
>>> I have three systems that share an ocfs2 filesystem, and I'm
>>> trying to add a fourth system.
>>>
>>> These are all openSUSE 11.1, x86_64, kernel 2.6.27.7-9-default.
>>> All have RPMs ocfs2-tools-1.4.1-6.9 and ocfs2console-1.4.1-6.9
>>>
>>> cluster.conf looks like this:
>>> node:
>>>        ip_port = 7777
>>>        ip_address = 140.107.170.116
>>>        number = 0
>>>        name = merlot1
>>>        cluster = ocfs2
>>>
>>> node:
>>>        ip_port = 7777
>>>        ip_address = 140.107.158.54
>>>        number = 1
>>>        name = merlot2
>>>        cluster = ocfs2
>>>
>>> node:
>>>        ip_port = 7777
>>>        ip_address = 140.107.158.82
>>>        number = 2
>>>        name = wilson1
>>>        cluster = ocfs2
>>>
>>> node:
>>>        ip_port = 7778
>>>        ip_address = 140.107.170.108
>>>        number = 3
>>>        name = gladstone
>>>        cluster = ocfs2
>>>
>>> cluster:
>>>        node_count = 4
>>>        name = ocfs2
>>>
>>> gladstone is the new node.
>>>
>>> I edited the cluster.conf on wilson1 using ocfs2console, and
>>> propagated it to the other systems from there.
>>>
>>> When I try to bring my ocfs2 online with /etc/init.d/o2cb online  
>>> ocfs2,
>>> merlot1 accepts the connection from gladstone, as does merlot2.
>>> However, wilson1 rejects it as an unknown node! For example:
>>>
>>> Jan 30 14:11:46 wilson1 kernel: (4447,3):o2net_accept_one:1795  
>>> attempt
>>> to connect from unknown node at 140.107.170.108:37795
>>>
>>> Why would this happen?
>>>
>>>
>>
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> Ocfs2-users at oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>
>
