<html><head><style type="text/css"><!-- DIV {margin:0px;} --></style></head><body><div style="font-family:times new roman, new york, times, serif;font-size:10pt">Hi,<br><br>We have a 4-node production cluster running an Oracle 10.2.0.2 RAC database with Oracle Clusterware.<br>The cluster files are shared on an ocfs2 filesystem. We are using ocfs2 version 1.2.3 with O2CB_HEARTBEAT_THRESHOLD set to 60.<br><br>Quite frequently, one of the nodes gets a kernel panic due to ocfs2.<br><br>We see messages similar to the following in the alert log:<br><br>Reconfiguration started (old inc 8, new inc 10)<br>List of nodes:<br> 0 2 3<br> Global Resource Directory frozen<br> * dead instance detected - domain 0 invalid = TRUE<br> Communication channels reestablished<br> * domain 0 not valid according to instance 3<br> * domain 0 not valid according to instance 2<br>Mon Feb 4 15:28:40 2008<br> Master
broadcasted resource hash value bitmaps<br> Non-local Process blocks cleaned out<br>Mon Feb 4 15:28:40 2008<br> LMS 0: 10 GCS shadows cancelled, 3 closed<br>*******************************************************************************<br>/var/log/messages on one of the surviving nodes shows the following:<br><br><font size="3"><font size="2">Feb 4 15:24:32 db0 kernel: o2net: connection to node db1 (num 1) at 10.10.100.51:7777 has been idle for 10 seconds, shutting it down.<br>Feb 4 15:24:32 db0 kernel: (0,2):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1202160262.751965 now 1202160272.750632 dr 1202160262.751951<br>adv 1202160262.751968:1202160262.751970 func (f6ed8616:502) 1202119336.222326:1202119336.222328)<br>Feb 4 15:24:32 db0 kernel: o2net: no longer connected to node db1 (num 1) at 10.10.100.51:7777<br>Feb 4 15:28:39 db0 kernel: (13349,2):ocfs2_dlm_eviction_cb:119 device (120,386): dlm
has evicted node 1<br>Feb 4 15:28:41 db0 kernel: (15944,7):dlm_get_lock_resource:847 138B67103BE042A784A6D419278F891D:$RECOVERY: at least one node (1) torecover before lock mastery can begin<br>Feb 4 15:28:41 db0 kernel: (15944,7):dlm_get_lock_resource:874 138B67103BE042A784A6D419278F891D: recovery map is not empty, but must master $RECOVERY lock now</font><br></font>*******************************************************************************<font size="3"><br></font>cat /proc/version<br>Linux version 2.6.9-42.0.3.ELsmp (brewbuilder@hs20-bc2-2.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Mon Sep 25 17:24:31 EDT 2006<br>*******************************************************************************<br>cat /etc/sysconfig/o2cb<br>#<br># This is a configuration file for automatic startup of the O2CB<br># driver. It is generated by running /etc/init.d/o2cb configure.<br># Please use that method to modify this
file<br>#<br><br># O2CB_ENABELED: 'true' means to load the driver on boot.<br>O2CB_ENABLED=true<br><br># O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.<br>O2CB_BOOTCLUSTER=ocfs2<br><br># O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.<br>O2CB_HEARTBEAT_THRESHOLD=60<br>*******************************************************************************<br><br>Our system administrator found that the NIC hung right before we lost the node.<br>Our guess is that before the NIC could come back up, the cluster had already declared the node dead and evicted it.<br><br>This has been happening frequently, but we are not sure what the root cause is.<br>Would setting the keepalive timeout avoid the instance eviction?<br>Are there options to set the network idle timeout and keepalive timeout with ocfs2 1.2.3?<br><br>We are considering upgrading ocfs2 to 1.2.5, but would like to create a temporary workaround
before we deploy it in production.<br>Please give us your suggestions and help us prevent this problem from recurring.<br><br>Thanks,<br>Sincerely,<br>Saranya Sivakumar <br><br>Database Administrator<br><br></div><br>
</body></html>