[Ocfs2-users] ocfs2 kernel panic

Saranya Sivakumar sarlavk at yahoo.com
Mon Feb 4 14:55:12 PST 2008


Hi,

We have a 4-node production cluster running an Oracle 10.2.0.2 RAC database with Oracle Clusterware.
The cluster files are shared on an ocfs2 filesystem. We are running ocfs2 version 1.2.3 and have O2CB_HEARTBEAT_THRESHOLD set to 60.
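
(For reference: if our reading of the documentation is correct, the disk heartbeat runs every 2 seconds, so a threshold of 60 should give roughly (60 - 1) * 2 = 118 seconds before a node is fenced. Please correct us if that arithmetic is off.)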

Quite frequently, one of the nodes gets a kernel panic due to ocfs2.

We see messages similar to the following in the database alert log:

Reconfiguration started (old inc 8, new inc 10)
List of nodes:
 0 2 3
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 * domain 0 not valid according to instance 3
 * domain 0 not valid according to instance 2
Mon Feb  4 15:28:40 2008
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Mon Feb  4 15:28:40 2008
 LMS 0: 10 GCS shadows cancelled, 3 closed
*******************************************************************************
/var/log/messages on one of the surviving nodes shows the following:

Feb  4 15:24:32 db0 kernel: o2net: connection to node db1 (num 1) at 10.10.100.51:7777 has been idle for 10 seconds, shutting it down.
Feb  4 15:24:32 db0 kernel: (0,2):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1202160262.751965 now 1202160272.750632 dr 1202160262.751951
adv 1202160262.751968:1202160262.751970 func (f6ed8616:502) 1202119336.222326:1202119336.222328)
Feb  4 15:24:32 db0 kernel: o2net: no longer connected to node db1 (num 1) at 10.10.100.51:7777
Feb  4 15:28:39 db0 kernel: (13349,2):ocfs2_dlm_eviction_cb:119 device (120,386): dlm has evicted node 1
Feb  4 15:28:41 db0 kernel: (15944,7):dlm_get_lock_resource:847 138B67103BE042A784A6D419278F891D:$RECOVERY: at least one node (1) torecover before lock mastery can begin
Feb  4 15:28:41 db0 kernel: (15944,7):dlm_get_lock_resource:874 138B67103BE042A784A6D419278F891D: recovery map is not empty, but must master $RECOVERY lock now
*******************************************************************************
cat /proc/version
Linux version 2.6.9-42.0.3.ELsmp (brewbuilder at hs20-bc2-2.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Mon Sep 25 17:24:31 EDT 2006
*******************************************************************************
cat /etc/sysconfig/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver.  It is generated by running /etc/init.d/o2cb configure.
# Please use that method to modify this file
#

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=60
*******************************************************************************

Our system administrator found that the NIC hung right before we lost the node.
We suspect that by the time the NIC came back up, the cluster had already declared the node dead and evicted it.
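
For what it's worth, this is roughly how we have been checking the interconnect NIC after each incident (eth1 here is only a placeholder for our actual interconnect interface):

# confirm the link state and look at error/drop counters
ethtool eth1 | grep "Link detected"
ethtool -S eth1 | grep -iE 'err|drop'
# look for driver resets or link flaps around the eviction time
grep -i eth1 /var/log/messages | tail -50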

This has been happening frequently, but we are not sure of the root cause.
Would setting the keepalive timeout avoid the instance eviction?
Are there options to set the network idle timeout and keepalive timeout in ocfs2 1.2.3?
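
Our reading of the 1.2.5 release notes is that the network timeouts become tunable there through /etc/sysconfig/o2cb. The following is what we would expect to add after the upgrade (variable names and values are taken from the documentation as we understand it and are not yet verified on our systems):

# proposed additions to /etc/sysconfig/o2cb after a 1.2.5 upgrade
O2CB_IDLE_TIMEOUT_MS=30000        # network idle timeout; 1.2.3 appears to hardcode 10 seconds
O2CB_KEEPALIVE_DELAY_MS=2000      # delay before a keepalive packet is sent
O2CB_RECONNECT_DELAY_MS=2000      # delay between reconnect attempts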

We are considering upgrading ocfs2 to 1.2.5, but would like a temporary workaround before we deploy it to production.
Please share your suggestions and help us keep this problem from recurring.

Thanks,
Sincerely,
Saranya Sivakumar 

Database Administrator



