[Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.

Fri Apr 12 01:18:13 PDT 2013

All nodes were powered down cleanly. The database was stopped and Grid Infrastructure was shut down manually, but the ocfs2 cluster was not stopped manually. Nothing happened when the nodes were shut down the first time. After the first planned reboot, nodes 1, 2, 3 and 7 seemed to be OK, but the sysadmins had to look into nodes 4, 5 and 6 because of disk problems. Grid Infrastructure was started on nodes 1, 2 and 3, but it wouldn't start on node 7. The DBA checked that the node had disks, but not that the disks were in the proper order, i.e. that disk02 really was disk02, and so on. The DBA thought a reboot would probably fix it, so he disabled Grid Infrastructure, did nothing to ocfs2, and rebooted the server. That seemed to reboot all the other nodes as well.

After the second reboot, which was uncontrolled, Grid Infrastructure was started one node at a time on nodes 1, 2 and 3. At 03:00 the three nodes were running the database. At 04:06 the cluster went down again, unplanned. Nobody knows why, but the sysadmins said they saw a kernel panic. In /var/log/messages the "Kernel BUG at ...shran/BUILD/ocfs2-1.4.7..." message appeared again, as it had at 02:25.
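
For anyone who wants to line up these events across nodes, here is a minimal sketch of pulling the relevant lines out of /var/log/messages. It is not part of the original exchange: the function name, the regular expression, and the keyword list are illustrative assumptions based only on the message formats quoted further down in this thread, and may need adjusting for other syslog layouts.

```python
import re

# Assumed syslog layout: "Mon DD HH:MM:SS host message".
# re.search is used so that grep-style prefixes such as "10826:" are tolerated.
SYSLOG_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) (?P<rest>.*)"
)

# Substrings taken from the log lines quoted in this thread (an assumption,
# not an exhaustive list of OCFS2 messages).
KEYWORDS = (
    "o2net: connection to node",
    "dlm_begin_reco_handler",
    "dlm_finalize_reco_handler",
    "Kernel BUG at",
)


def extract_events(lines):
    """Return (timestamp, host, message) tuples for OCFS2-related lines."""
    events = []
    for line in lines:
        match = SYSLOG_RE.search(line)
        if match and any(key in match.group("rest") for key in KEYWORDS):
            events.append(
                (match.group("ts"), match.group("host"), match.group("rest"))
            )
    return events


if __name__ == "__main__":
    # Path is the one mentioned in the thread; reading it usually needs root.
    with open("/var/log/messages", errors="replace") as handle:
        for ts, host, message in extract_events(handle):
            print(ts, host, message)
```

Running this on each node and sorting the combined output by timestamp should make it easier to see which node logged the BUG first and which o2net disconnects followed.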

Then, when the nodes came up again, the database was started on nodes 4, 5 and 6. It wasn't possible to start CRS on nodes 1, 2, 3 and 7. The sysadmins did something with those nodes, and after another reboot of just those nodes CRS was able to start again. The instances on nodes 1, 2 and 3 were then started, but we didn't start anything on node 7 because we were afraid of bringing down the cluster again.

Got a mail from Sunil saying I had to "ping Oracle", so I guess I'll do that.

Morten K. 
Tlf: +47 76 16 61 81 | Mob: +47 906 52 903 
Quality - Safety - Respect

-----Original Message-----
From: Joel Becker [mailto:jlbec at ftp.linux.org.uk] On Behalf Of Joel Becker
Sent: 11 April 2013 21:04
To: Kristiansen Morten
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.

Did you power down nodes uncleanly?  The message says that one node lost track of who was doing a particular recovery.  If nodes are shut down cleanly, they should be communicating that information.

Joel

On Thu, Apr 11, 2013 at 12:10:22PM +0200, Kristiansen Morten wrote:
> I've had no response to my problem. Is there anybody who can help me with this?
> 
> Morten K.
> 
> Tlf: +47 76 16 61 81 | Mob: +47 906 52 903
> Quality - Safety - Respect
> 
> 
> 
> From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Kristiansen Morten
> Sent: 21 March 2013 14:47
> To: ocfs2-users at oss.oracle.com
> Subject: [Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.
> 
> Hi,
> 
> We are running an 8-node cluster on RHEL (kernel 2.6.18-128, 64-bit). Yesterday the server/SAN team moved the ocfs2 disks to another SAN by mirroring and synchronizing the disks. When they rebooted the servers, one of the nodes, tos-dipsprod-07, wasn't able to start Oracle Grid Infrastructure; the voting disk was not found. We then tried to reboot that node, which caused all nodes to reboot, at about 02:25. When examining /var/log/messages I discovered a BUG message on one of the nodes that rebooted unexpectedly, tos-dipsprod-02. I've tried to google it, but I couldn't find any solution. Is this a well-known bug? Does anybody have a solution to this problem?
> 
> Below is an extract of the o2net and ocfs2 messages from the /var/log/messages file.
> 
> /var/log/messages on tos-dipsprod-07:
> Mar 21 02:08:49 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:35 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:40 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-03 (num 2) at 192.168.7.102:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:45 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 02:25:54 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-04 (num 5) at 192.168.7.103:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 04:03:17 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 04:06:32 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 04:06:37 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 04:06:47 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-03 (num 2) at 192.168.7.102:7777 has been idle for 10.0 seconds, shutting it down.
> Mar 21 06:04:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101:7777 has been idle for 10.0 seconds, shutting it down.
> 
> And here from tos-dipsprod-02:
> 10474-Mar 21 02:25:15 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_begin_reco_handler:2730 992D008CD522447C8333FC34BD46F8CD: dead_node previously set to 7, node 3 changing it to 7
> 10646-Mar 21 02:25:25 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 3 is supposed to be the new master, dead=7
> 10826:Mar 21 02:25:25 tos-dipsprod-02 kernel: Kernel BUG at ...shran/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmrecovery.c:2840
> 10939-Mar 21 02:43:01 tos-dipsprod-02 syslogd 1.4.1: restart.
> 10995-Mar 21 02:43:02 tos-dipsprod-02 modprobe: FATAL: Module ocfs2_stackglue not found.
> --
> 17537-Mar 21 04:06:19 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_begin_reco_handler:2730 992D008CD522447C8333FC34BD46F8CD: dead_node previously set to 6, node 6 changing it to 7
> 17709-Mar 21 04:06:29 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 255 is supposed to be the new master, dead=7
> 17891:Mar 21 04:06:29 tos-dipsprod-02 kernel: Kernel BUG at ...shran/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmrecovery.c:2840
> 18004-Mar 21 04:38:04 tos-dipsprod-02 syslogd 1.4.1: restart.
> 18060-Mar 21 04:41:33 tos-dipsprod-02 modprobe: FATAL: Module ocfs2_stackglue not found.
> 
> 
> Morten Kristiansen    | Counsellor
> Helse Nord IKT         | Department of Service Production
> 
> Tlf: +47 76 16 61 81 | Mob: +47 906 52 903
> Office address: Amtmann Worsøes gate 63, 8012 Bodø, Norway
> Quality - Safety - Respect
> 
> 
> 
> 


-- 

"Against stupidity the Gods themselves contend in vain."
	- Friedrich von Schiller

			http://www.jlbec.org/
			jlbec at evilplan.org