[Ocfs2-users] strange node reboot in RAC environment

Sunil Mushran sunil.mushran at oracle.com
Tue Feb 3 11:55:42 PST 2009


Yes, do set up netconsole or something like what Ulf uses.

The one bug you could be hitting is bugzilla#919 that was fixed in 1.2.9.
http://oss.oracle.com/projects/ocfs2/news/article_18.html
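For reference, a bare-bones netconsole setup is just a modprobe with the
kernel's documented parameter string. The addresses, interface name and log
host below are placeholders for your environment; note that netconsole only
captures kernel messages, not BIOS output (that is where Ulf's iLO approach
helps):

    # on each cluster node: stream kernel messages to a log host over UDP
    modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.250/00:11:22:33:44:55

    # on the log host: capture whatever arrives on that port
    nc -u -l -p 6666 >> /var/log/netconsole.log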

Ulf Zimmermann wrote:
>> -----Original Message-----
>> From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-
>> bounces at oss.oracle.com] On Behalf Of Pedro Figueira
>> Sent: 02/03/2009 09:07
>> To: ocfs2-users at oss.oracle.com
>> Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
>> Subject: [Ocfs2-users] strange node reboot in RAC environment
>>
>> Hi all
>>
>> We have a 4-node Oracle RAC cluster with the following software
>> versions:
>>
>> Oracle and clusterware version 10.2.0.4
>> Red Hat Enterprise Linux AS release 4 with kernel version 2.6.9-
>> 55.ELlargesmp
>> ocfs2-tools-1.2.4-1
>> ocfs2-2.6.9-55.ELlargesmp-1.2.5-2
>> ocfs2console-1.2.4-1
>> timeout parameters:
>>   Heartbeat dead threshold: 31
>>   Network idle timeout: 10000
>>   Network keepalive delay: 5000
>>   Network reconnect delay: 2000
>>
>> Until late last year the cluster was rock solid (hundreds of days of
>> uptime). From January onwards all the servers started to reboot at the
>> same time, but the strange thing is that there are no log messages in
>> /var/log/messages, so we don't know whether this is an OCFS2-related
>> problem. These reboots seem to be related to the backup process (maybe
>> extra load?). Other reboots only affect 2 out of 4 nodes.
>>     
>
> As ocfs2 will print messages to the console and they might not get captured by anything,
> I recommend setting up the iLO virtual serial port and using something like conserver to
> attach a console to that virtual serial port. I do this for all our OCFS2 hosts and have a log
> of everything going on, including the BIOS screen. If ocfs2 is fencing because of I/O issues,
> it will show up there.
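Not Ulf's exact setup, but as a rough illustration: a conserver.cf console
entry pointed at an iLO virtual serial port could look something like the
sketch below (host name, port and log path are placeholders). The kernel also
needs console=ttyS1,115200 (or whichever tty the virtual serial port maps to)
on its command line so kernel messages actually reach the serial console.

    console grid2db1 {
        master localhost;
        type host;
        host grid2db1-ilo;                    # iLO management address (placeholder)
        port 23;                              # telnet to the iLO virtual serial port
        logfile /var/consoles/grid2db1.log;   # captures everything, fence messages included
    }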
>
>   
>> Last night we updated the firmware and drivers from HP of the DL580G4
>> server and today we had another reboot (now with the following messages
>> in /var/log/messages):
>>
>> NODE 1:
>> ------------------------------------------------------
>> Feb  3 14:12:52 grid2db1 kernel: o2net: connection to node grid2db4
>> (num 3) at 10.0.2.52:7777 has been idle for 10.0 seconds, shutting it
>> down.
>> Feb  3 14:12:52 grid2db1 kernel: (0,0):o2net_idle_timer:1418 here are
>> some times that might help debug the situation: (tmr 1233670362.97595
>> now 1233670372.96280 dr 1233670362.97580 adv
>> 1233670362.97604:1233670362.97604 func (c77ed98a:504)
>> 1233670067.138220:1233670067.138233)
>> Feb  3 14:12:52 grid2db1 kernel: o2net: no longer connected to node
>> grid2db4 (num 3) at 10.0.2.52:7777
>> Feb  3 14:16:26 grid2db1 syslogd 1.4.1: restart.
>> Feb  3 14:16:26 grid2db1 syslog: syslogd startup succeeded
>>
>> NODE 4:
>> ------------------------------------------------------
>> Feb  3 14:12:46 grid2db4 kernel: (20,2):o2hb_write_timeout:269 ERROR:
>> Heartbeat write timeout to device sdl after 60000 milliseconds
>> Feb  3 14:12:46 grid2db4 kernel: Heartbeat thread (20) printing last 24
>> blocking operations (cur = 18):
>> Feb  3 14:16:27 grid2db4 syslogd 1.4.1: restart.
>> Feb  3 14:16:27 grid2db4 syslog: syslogd startup succeeded
>>
>> Other reboots simply don't log any error messages.
>>
>> So my question is whether it's possible these reboots are triggered by
>> OCFS2, and how can I debug this problem? Should I change the timeout
>> parameters?
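For reference, with recent 1.2 tools those four timeouts normally live in
/etc/sysconfig/o2cb (written by "service o2cb configure"). The snippet below
simply mirrors the values you listed (the cluster name is a placeholder);
raising them can hide short I/O or network stalls but does not fix their cause.

    # /etc/sysconfig/o2cb -- values matching the settings quoted above
    O2CB_ENABLED=true
    O2CB_BOOTCLUSTER=ocfs2              # placeholder cluster name
    O2CB_HEARTBEAT_THRESHOLD=31         # disk heartbeat: (31-1)*2 = 60s before self-fencing
    O2CB_IDLE_TIMEOUT_MS=10000          # network idle timeout
    O2CB_KEEPALIVE_DELAY_MS=5000        # network keepalive delay
    O2CB_RECONNECT_DELAY_MS=2000        # network reconnect delay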
>>
>> We are also planning to upgrade to OCFS2 1.2.9-1, OCFS2 Tools 1.2.7-1
>> and the latest distro kernel. Is there any catch?
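As a rough per-node sketch of the usual order of operations (package file
names are illustrative, and the ocfs2 module rpm must match the kernel you
will actually boot):

    umount -a -t ocfs2                        # unmount all ocfs2 volumes on this node
    /etc/init.d/o2cb offline <clustername>    # <clustername> is a placeholder
    /etc/init.d/o2cb unload
    rpm -Uvh ocfs2-tools-1.2.7-1.x86_64.rpm ocfs2console-1.2.7-1.x86_64.rpm
    rpm -ivh kernel-largesmp-<new_version>.rpm ocfs2-<new_kernel>-1.2.9-1.x86_64.rpm
    reboot                                    # come back up on the new kernel and module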
>>
>> Best regards and thanks for any answer.
>>
>> Pedro Figueira
>> Serviço de Estrangeiros e Fronteiras
>> Direcção Central de Informática
>> Departamento de Produção
>> Phone: + 351 217 115 153
>>
>> -----Original Message-----
>> From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-
>> bounces at oss.oracle.com] On Behalf Of Sunil Mushran
>> Sent: Saturday, 31 January 2009 15:59
>> To: Carl Benson
>> Cc: ocfs2-users at oss.oracle.com
>> Subject: Re: [Ocfs2-users] one node rejects connection from new node
>>
>> Nodes can be added to an online cluster. The instructions are listed
>> in the user's guide.
>>
>> On Jan 31, 2009, at 7:53 AM, Carl Benson <cbenson at fhcrc.org> wrote:
>>
>>     
>>> Sunil,
>>>
>>> Thank you for responding. I will try o2cb_ctl on Monday, when I have
>>> physical access to hit Reset in case one or more nodes lock up.
>>>
>>> If there really is a requirement to restart the cluster on wilson1
>>> every time
>>> I add a new node (and I have five or six more nodes to add), that is
>>> too
>>> bad. Wilson1 is a 24x7 production system.
>>>
>>> --Carl Benson
>>>
>>> Sunil Mushran wrote:
>>>       
>>>> Could be that the cluster was already online on wilson1 when you
>>>> propagated the cluster.conf to all nodes. If so, restart the cluster
>>>> on that node.
>>>>
>>>> To add a node to an online cluster, you need to use the o2cb_ctl
>>>> command. Details are in the 1.4 user's guide.
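For the archives, I believe the invocation for this case looks roughly like
the line below, run on every node of the live cluster (values taken from the
cluster.conf quoted further down; -i also writes the change to
/etc/ocfs2/cluster.conf):

    o2cb_ctl -C -i -n gladstone -t node -a number=3 -a ip_address=140.107.170.108 \
             -a ip_port=7778 -a cluster=ocfs2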
>>>>
>>>>
>>>> Carl J. Benson wrote:
>>>>
>>>>         
>>>>> Hello.
>>>>>
>>>>> I have three systems that share an ocfs2 filesystem, and I'm
>>>>> trying to add a fourth system.
>>>>>
>>>>> These are all openSUSE 11.1, x86_64, kernel 2.6.27.7-9-default.
>>>>> All have RPMs ocfs2-tools-1.4.1-6.9 and ocfs2console-1.4.1-6.9
>>>>>
>>>>> cluster.conf looks like this:
>>>>> node:
>>>>>        ip_port = 7777
>>>>>        ip_address = 140.107.170.116
>>>>>        number = 0
>>>>>        name = merlot1
>>>>>        cluster = ocfs2
>>>>>
>>>>> node:
>>>>>        ip_port = 7777
>>>>>        ip_address = 140.107.158.54
>>>>>        number = 1
>>>>>        name = merlot2
>>>>>        cluster = ocfs2
>>>>>
>>>>> node:
>>>>>        ip_port = 7777
>>>>>        ip_address = 140.107.158.82
>>>>>        number = 2
>>>>>        name = wilson1
>>>>>        cluster = ocfs2
>>>>>
>>>>> node:
>>>>>        ip_port = 7778
>>>>>        ip_address = 140.107.170.108
>>>>>        number = 3
>>>>>        name = gladstone
>>>>>        cluster = ocfs2
>>>>>
>>>>> cluster:
>>>>>        node_count = 4
>>>>>        name = ocfs2
>>>>>
>>>>> gladstone is the new node.
>>>>>
>>>>> I edited the cluster.conf on wilson1 using ocfs2console, and
>>>>> propagated it to the other systems from there.
>>>>>
>>>>> When I try to bring my ocfs2 online with /etc/init.d/o2cb online
>>>>> ocfs2,
>>>>> merlot1 accepts the connection from gladstone, as does merlot2.
>>>>> However, wilson1 rejects it as an unknown node! For example:
>>>>>
>>>>> Jan 30 14:11:46 wilson1 kernel: (4447,3):o2net_accept_one:1795
>>>>> attempt
>>>>> to connect from unknown node at 140.107.170.108:37795
>>>>>
>>>>> Why would this happen?
>>>>>
>>>>>
>>>>>           
>>>>         
>



