[Ocfs2-users] ocfs2 kernel panic

Sunil Mushran Sunil.Mushran at oracle.com
Mon Feb 4 15:03:56 PST 2008


The useful info would be the oops stack trace. The messages provided
are standard messages and not relevant to the problem per se.

Having said that, 1.2.3 is 1.5 years old. Even 1.2.4 is a year old.
My suggestion would be to upgrade. We are about to release
1.2.8.
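
For context on the O2CB_HEARTBEAT_THRESHOLD=60 setting quoted below, here
is a minimal sketch of how the threshold maps to a disk heartbeat timeout.
It assumes the relation given in the OCFS2 FAQ, threshold = (timeout in
seconds / 2) + 1, with a 2 second heartbeat interval; the function names
are hypothetical, and the formula should be checked against the
documentation for the installed version. Note that this governs the disk
heartbeat, which is separate from the 10 second o2net network idle timeout
that appears in the quoted log.

```python
# Sketch only: assumes the OCFS2 FAQ relation
#   threshold = (timeout_seconds / 2) + 1
# with a 2 second disk heartbeat interval. Function names are hypothetical.

def heartbeat_timeout_seconds(threshold, interval_seconds=2):
    """Disk heartbeat timeout implied by an O2CB_HEARTBEAT_THRESHOLD value."""
    return (threshold - 1) * interval_seconds


def threshold_for_timeout(timeout_seconds, interval_seconds=2):
    """O2CB_HEARTBEAT_THRESHOLD value needed for a desired timeout."""
    return timeout_seconds // interval_seconds + 1


print(heartbeat_timeout_seconds(60))  # 118
print(threshold_for_timeout(120))     # 61
```

Under that assumption, the quoted setting of 60 corresponds to roughly
118 seconds before a node that stops writing its disk heartbeat is
considered dead.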

Saranya Sivakumar wrote:
> Hi,
>
> We have a 4-node production cluster running Oracle 10.2.0.2 RAC 
> database using Oracle Clusterware.
> The cluster files are shared on an ocfs2 filesystem. We are using ocfs2 
> version 1.2.3 and have O2CB_HEARTBEAT_THRESHOLD set to 60.
>
> Quite frequently, one of the nodes gets a kernel panic due to ocfs2.
>
> We see messages similar to the following in the alert log:
>
> Reconfiguration started (old inc 8, new inc 10)
> List of nodes:
>  0 2 3
>  Global Resource Directory frozen
>  * dead instance detected - domain 0 invalid = TRUE
>  Communication channels reestablished
>  * domain 0 not valid according to instance 3
>  * domain 0 not valid according to instance 2
> Mon Feb  4 15:28:40 2008
>  Master broadcasted resource hash value bitmaps
>  Non-local Process blocks cleaned out
> Mon Feb  4 15:28:40 2008
>  LMS 0: 10 GCS shadows cancelled, 3 closed
> *******************************************************************************
> /var/log/messages on one of the surviving nodes shows the following:
>
> Feb 4 15:24:32 db0 kernel: o2net: connection to node db1 (num 1) at 
> 10.10.100.51:7777 has been idle for 10 seconds, shutting it down.
> Feb 4 15:24:32 db0 kernel: (0,2):o2net_idle_timer:1309 here are some 
> times that might help debug the situation: (tmr 1202160262.751965 now 
> 1202160272.750632 dr 1202160262.751951
> adv 1202160262.751968:1202160262.751970 func (f6ed8616:502) 
> 1202119336.222326:1202119336.222328)
> Feb 4 15:24:32 db0 kernel: o2net: no longer connected to node db1 (num 
> 1) at 10.10.100.51:7777
> Feb 4 15:28:39 db0 kernel: (13349,2):ocfs2_dlm_eviction_cb:119 device 
> (120,386): dlm has evicted node 1
> Feb 4 15:28:41 db0 kernel: (15944,7):dlm_get_lock_resource:847 
> 138B67103BE042A784A6D419278F891D:$RECOVERY: at least one node (1) 
> torecover before lock mastery can begin
> Feb 4 15:28:41 db0 kernel: (15944,7):dlm_get_lock_resource:874 
> 138B67103BE042A784A6D419278F891D: recovery map is not empty, but must 
> master $RECOVERY lock now
> *******************************************************************************
> cat /proc/version
> Linux version 2.6.9-42.0.3.ELsmp 
> (brewbuilder at hs20-bc2-2.build.redhat.com) (gcc version 3.4.6 20060404 
> (Red Hat 3.4.6-3)) #1 SMP Mon Sep 25 17:24:31 EDT 2006
> *******************************************************************************
> cat /etc/sysconfig/o2cb
> #
> # This is a configuration file for automatic startup of the O2CB
> # driver.  It is generated by running /etc/init.d/o2cb configure.
> # Please use that method to modify this file
> #
>
> # O2CB_ENABELED: 'true' means to load the driver on boot.
> O2CB_ENABLED=true
>
> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
> O2CB_BOOTCLUSTER=ocfs2
>
> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
> O2CB_HEARTBEAT_THRESHOLD=60
> *******************************************************************************
>
> Our system administrator found that the NIC hung right before we 
> lost the node.
> Our guess is that, before the NIC could come back up, the cluster 
> declared the node dead and evicted it.
>
> This has been happening frequently, but we are not sure what the 
> root cause is.
> Would setting the keepalive timeout avoid the instance eviction?
> Are there options to set the network idle timeout and keepalive 
> timeout in ocfs2 1.2.3?
>
> We are considering upgrading ocfs2 to 1.2.5, but would like to create 
> a temporary workaround before we deploy it in production.
> Please give us your suggestions and help us keep this problem from 
> recurring.
>
> Thanks,
> Sincerely,
> Saranya Sivakumar
>
> Database Administrator
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
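
As an aid to reading the o2net_idle_timer line in the quoted log, the
sketch below pulls out the "tmr" and "now" timestamps and reports the gap
between them. It assumes, from the field names only, that "tmr" is when
the idle timer was last armed and "now" is when it fired; the helper name
idle_seconds is hypothetical, and the log line is the quoted one rejoined
onto a single line.

```python
import re

# The o2net_idle_timer line from the quoted log, rejoined onto one line.
LOG_LINE = (
    "Feb 4 15:24:32 db0 kernel: (0,2):o2net_idle_timer:1309 here are some "
    "times that might help debug the situation: (tmr 1202160262.751965 now "
    "1202160272.750632 dr 1202160262.751951 "
    "adv 1202160262.751968:1202160262.751970 func (f6ed8616:502) "
    "1202119336.222326:1202119336.222328)"
)


def idle_seconds(line):
    """Return now - tmr in seconds (hypothetical helper, see assumptions)."""
    tmr = re.search(r"\btmr (\d+\.\d+)", line)
    now = re.search(r"\bnow (\d+\.\d+)", line)
    if tmr is None or now is None:
        raise ValueError("no tmr/now fields found in line")
    return float(now.group(1)) - float(tmr.group(1))


print(f"{idle_seconds(LOG_LINE):.3f}")  # 9.999
```

The gap of about 10 seconds agrees with the "has been idle for 10
seconds" message logged at the same moment for the connection to db1.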



