[Ocfs2-users] Random node reboots

Neil Campbell neil.campbell at downeredi.com
Tue Dec 7 15:56:12 PST 2010


Thanks Sunil.

 

OK, so when another node says "......been idle for 60.0 seconds,
shutting it down", it doesn't actually mean it is going to shut down the
other node. Instead it means something like it is no longer including it
in the list of available nodes?

 

Do you think I should add the nointr mount option for this as a general
purpose filesystem? 

 

In the meantime, I will look at setting up netconsole.

 

Thanks

Neil

 

 

________________________________

From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
Sent: Wednesday, 8 December 2010 10:50 AM
To: Neil Campbell
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] Random node reboots

 

It's the other way round. These messages indicate that the other node
died. When these nodes detect that, they need to do recovery/cleanup.
The messages all relate to that activity.

We will only know why the first node died once we have the oops trace,
and that is captured by netdump/netconsole.
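
For illustration, the receiving end of a netconsole setup can be as small
as a UDP listener on a second machine that appends every datagram to a
file. The sketch below is a minimal example under assumptions, not a
replacement for a proper netdump or syslog server: the names
format_record and serve, the log file name, and the port are hypothetical
choices. Port 6666 is the default target port of the mainline netconsole
module; the RHEL 4 tooling may send to a different port, so check the
sender's configuration.

```python
import socket
from datetime import datetime, timezone


def format_record(data: bytes, addr: tuple, when: datetime) -> str:
    """Render one received datagram as a timestamped, single log line."""
    text = data.decode("utf-8", errors="replace").rstrip("\r\n")
    return f"{when.isoformat()} {addr[0]} {text}"


def serve(bind_addr: str = "0.0.0.0", port: int = 6666,
          path: str = "netconsole.log") -> None:
    """Append every datagram received on (bind_addr, port) to `path`.

    The port and file name are assumptions for this sketch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(path, "a") as log:
        sock.bind((bind_addr, port))
        while True:
            data, addr = sock.recvfrom(65535)
            line = format_record(data, addr, datetime.now(timezone.utc))
            log.write(line + "\n")
            # Flush at once: an oops may be the last thing the node sends.
            log.flush()
```

Calling serve() on a machine outside the cluster and pointing each node's
netconsole at it would leave the panic output in netconsole.log even
though the local /var/log/messages on the crashed node stays empty.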

On 12/07/2010 03:37 PM, Neil Campbell wrote: 

Hi Sunil,

 

Thanks for responding.

 

It is being evicted by the other nodes in the OCFS2 cluster, isn't it?

 

/var/log/messages from dcapp02 shows 

 

Dec  8 01:56:01 dcapp02 kernel: o2net: connection to node dcapp01 (num 0) at 10.255.255.1:10007 has been idle for 60.0 seconds, shutting it down.

 

This appears in all the other messages files of the nodes in the cluster
(except the one that got re-booted).

 

So that is why dcapp01 got shut down, isn't it? What I think I need help
with is why.

 

Many thanks

Neil

 

________________________________

From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
Sent: Wednesday, 8 December 2010 10:32 AM
To: Neil Campbell
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] Random node reboots

 

==> /var/log/messages from the node being rebooted doesn't show
anything, just the reboot shows the following
 
That's how Linux works. You should configure netconsole or netdump
to capture the logs. Only then will we know why the node is
panicking.


On 12/07/2010 03:27 PM, Neil Campbell wrote: 

Hi all,
 
I keep getting node reboots across my cluster. They seem random, in that
the node being evicted changes
and it only happens every now and then. I'm running RHEL 4 kernel
2.6.9-89.0.26.ELsmp,
and OCFS is OCFS2 1.2.9 Mon Jun 21 20:03:07 PDT 2010 (build
5e8325ec7f66b5189c65c7a8710fe8cb)
 
I am using OCFS2 as a general purpose filesystem (i.e. not for Oracle
datafiles or OCR etc),
with the following entries in /etc/fstab   
 
/dev/emcpowera1         /u01/cfs                ocfs2   _netdev         0 0
 
As a general purpose filesystem, should I be using the nointr mount
option?
 
/etc/init.d/o2cb status
 
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster UATocfs2: Online
  Heartbeat dead threshold: 61
  Network idle timeout: 60000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Active
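
To make the numbers above easier to read, here is a small sketch that
converts them to seconds. It relies on the documented O2CB rule that the
disk heartbeat timeout is (threshold - 1) * 2 seconds and on the network
values being in milliseconds. The function names parse_status and
timeouts_in_seconds are hypothetical helpers, not part of any OCFS2 tool.

```python
import re

# The four tunables as printed by "/etc/init.d/o2cb status" above.
STATUS = """\
  Heartbeat dead threshold: 61
  Network idle timeout: 60000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
"""


def parse_status(text: str) -> dict:
    """Collect the 'name: integer' lines of the status output into a dict."""
    pattern = re.compile(r"^[ \t]*([A-Za-z ]+):[ \t]*(\d+)[ \t]*$", re.M)
    return {m.group(1).strip(): int(m.group(2))
            for m in pattern.finditer(text)}


def timeouts_in_seconds(values: dict) -> dict:
    """Convert the raw O2CB settings to seconds."""
    return {
        # Disk heartbeat: (threshold - 1) iterations of 2 seconds each.
        "disk heartbeat": (values["Heartbeat dead threshold"] - 1) * 2,
        # The network values are reported in milliseconds.
        "network idle": values["Network idle timeout"] / 1000,
        "network keepalive": values["Network keepalive delay"] / 1000,
        "network reconnect": values["Network reconnect delay"] / 1000,
    }


if __name__ == "__main__":
    for name, seconds in timeouts_in_seconds(parse_status(STATUS)).items():
        print(f"{name}: {seconds} s")
```

With these settings the disk heartbeat timeout works out to 120 seconds
and the network idle timeout to 60 seconds, which is the "60.0 seconds"
figure in the o2net messages.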
 
/var/log/messages from the node being rebooted doesn't show anything,
just the reboot shows the following
 
  Dec  8 00:59:02 dcapp01 syslogd 1.4.1: restart.
 
On the other nodes, I see the following entries
 
Dec  8 01:56:01 dcapp02 kernel: o2net: connection to node dcapp01 (num 0) at 10.255.255.1:10007 has been idle for 60.0 seconds, shutting it down.
Dec  8 01:56:01 dcapp02 kernel: (0,3):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1291733701.691575 now 1291733761.692608 dr 1291733701.690949 adv 1291733701.696965:1291733701.696967 func (d399da91:500) 1291733701.691576:1291733701.696950)
Dec  8 01:56:01 dcapp02 kernel: o2net: no longer connected to node dcapp01 (num 0) at 10.255.255.1:10007
Dec  8 01:57:01 dcapp02 kernel: (16082,3):o2net_connect_expired:1585 ERROR: no connection established with node 0 after 60.0 seconds, giving up and returning errors.
Dec  8 01:57:01 dcapp02 kernel: (4215,2):dlm_send_remote_convert_request:398 ERROR: status = -107
Dec  8 01:57:01 dcapp02 kernel: (4215,2):dlm_wait_for_node_death:365 C5C06C9B675D41B99B60DE2EB28CE0F7: waiting 5000ms for notification of death of node 0
Dec  8 01:57:04 dcapp02 kernel: (16082,3):ocfs2_dlm_eviction_cb:119 device (120,1): dlm has evicted node 0
Dec  8 01:57:05 dcapp02 kernel: (4269,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Dec  8 01:57:05 dcapp02 kernel: (4269,0):dlm_wait_for_node_death:365 D43AF814A25845F7B103EBBEA440BA18: waiting 5000ms for notification of death of node 0
Dec  8 01:57:05 dcapp02 kernel: (16082,3):ocfs2_dlm_eviction_cb:119 device (120,66): dlm has evicted node 0
Dec  8 01:57:06 dcapp02 kernel: (16082,3):ocfs2_dlm_eviction_cb:119 device (120,65): dlm has evicted node 0
Dec  8 02:00:15 dcapp02 kernel: o2net: connected to node dcapp01 (num 0) at 10.255.255.1:10007
Dec  8 02:00:29 dcapp02 kernel: ocfs2_dlm: Node 0 joins domain C5C06C9B675D41B99B60DE2EB28CE0F7
Dec  8 02:00:29 dcapp02 kernel: ocfs2_dlm: Nodes in domain ("C5C06C9B675D41B99B60DE2EB28CE0F7"): 0 1 2 3 6 7 8 9 10 11 14 15
Dec  8 02:00:35 dcapp02 kernel: ocfs2_dlm: Node 0 joins domain 97F22666B5A6494AAF38C53909275DB2
Dec  8 02:00:35 dcapp02 kernel: ocfs2_dlm: Nodes in domain ("97F22666B5A6494AAF38C53909275DB2"): 0 1 2 3
Dec  8 02:00:39 dcapp02 kernel: ocfs2_dlm: Node 0 joins domain D43AF814A25845F7B103EBBEA440BA18
Dec  8 02:00:39 dcapp02 kernel: ocfs2_dlm: Nodes in domain ("D43AF814A25845F7B103EBBEA440BA18"): 0 1 2 3
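
The "times that might help debug" entry in the log above carries raw
epoch timestamps. As a rough sketch, the two that matter for the idle
timeout can be pulled out and subtracted. The reading of "tmr" as the
moment the idle timer was last armed is an assumption here, and
idle_times and utc are hypothetical helper names.

```python
import re
from datetime import datetime, timezone
from decimal import Decimal

# The o2net_idle_timer entry from dcapp02, joined onto one line.
LINE = ("Dec  8 01:56:01 dcapp02 kernel: (0,3):o2net_idle_timer:1426 here are "
        "some times that might help debug the situation: "
        "(tmr 1291733701.691575 now 1291733761.692608 dr 1291733701.690949 "
        "adv 1291733701.696965:1291733701.696967 func (d399da91:500) "
        "1291733701.691576:1291733701.696950)")


def idle_times(line: str) -> tuple:
    """Return the (tmr, now) epoch timestamps of an o2net_idle_timer entry.

    Decimal keeps the microsecond digits exact; a float would round them.
    """
    m = re.search(r"\(tmr (\d+\.\d+) now (\d+\.\d+)", line)
    if m is None:
        raise ValueError("not an o2net_idle_timer entry")
    return Decimal(m.group(1)), Decimal(m.group(2))


def utc(epoch: Decimal) -> datetime:
    """Convert an epoch timestamp to a UTC datetime (whole seconds)."""
    return datetime.fromtimestamp(int(epoch), tz=timezone.utc)


if __name__ == "__main__":
    tmr, now = idle_times(LINE)
    print("idle for", now - tmr, "seconds")
    print("fired at", utc(now).isoformat())
```

The difference between "now" and "tmr" is 60.001033 seconds, in line with
the 60000 ms network idle timeout. The "now" value corresponds to
2010-12-07 14:56:01 UTC, which is 11 hours behind the 01:56:01 local time
in the syslog prefix.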
 
I would really appreciate some help with this, as I'm not sure where to
go from here.
 
Thanks
Neil
 
________________________________


Downer
This message is for the named person's use only. It may contain
confidential, proprietary or legally privileged information. No
confidentiality or privilege is waived or lost by any mistransmission.
If you receive this message in error, please immediately delete it and
all copies of it from your system, destroy any hard copies of it and
notify the sender. You must not, directly or indirectly, use, disclose,
distribute, print, or copy any part of this message if you are not the
intended recipient. Downer EDI and any of its subsidiaries each reserve
the right to monitor all e-mail communications through its networks. Any
views expressed in this message are those of the individual sender,
except where the message states otherwise and the sender is authorized
to state them to be the views of any such entity. 

________________________________

 

 
 
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

 


 

