[Ocfs2-devel] shutdown by o2net_idle_timer causes Xen to hang

Sunil Mushran sunil.mushran at oracle.com
Wed Mar 18 05:05:43 PDT 2009


Set up a netconsole server to capture the logs. There is not much to go
on with the info you have provided.
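
If no netconsole receiver is at hand, something like the following could
stand in for one. It is only a minimal sketch: it assumes the nodes send
their kernel messages over UDP to port 6666 (netconsole's default target
port) on a machine outside the cluster. The function names open_listener
and capture are made up for this example.

```python
import socket
import time


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket; 6666 is assumed to be the netconsole target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock, out, max_packets=None):
    """Write each received datagram to `out`, tagged with local time and sender."""
    count = 0
    while max_packets is None or count < max_packets:
        data, (addr, _port) = sock.recvfrom(65535)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        text = data.decode("utf-8", errors="replace").rstrip("\n")
        out.write(f"{stamp} {addr} {text}\n")
        out.flush()  # flush per packet so nothing is lost if the receiver dies
        count += 1
    return count
```

Calling capture(open_listener(), open("netconsole.log", "a")) would run
until interrupted. Tagging each line with the sender's address keeps the
output of the two nodes and the Dom0 apart when they all log to the same
receiver.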

On Wed, Mar 18, 2009 at 12:17:36PM +0100, David Winter wrote:
> Hello,
> 
> we've had some serious trouble with a two-node Xen-based OCFS2  
> cluster. In brief: we had two incidents where one node detects an idle  
> timeout and shuts the other node down which causes the other node and  
> the Dom0 to hang. Both times this could only be resolved by rebooting  
> the whole machine using the built-in IPMI card.
> 
> All machines (including the other DomUs) run CentOS 5.2 and the OCFS2  
> nodes use ocfs2-tools-1.4.1-1.el5 and  
> ocfs2-2.6.18-92.1.13.el5xen-1.4.1-1.el5.
> 
> Unfortunately, not much of relevance was logged, except for the  
> /var/log/messages of the node that issued the shutdown (see below) and  
> the nearly five-hour gap in the logs of the other node.
> 
> Mar 15 14:39:47 ugc-1 kernel: o2net: connection to node cod-2 (num 3)  
> at 10.0.0.42:7777 has been idle for 30.0 seconds, shutting it down.
> Mar 15 14:39:47 ugc-1 kernel: (0,0):o2net_idle_timer:1476 here are  
> some times that might help debug the situation: (tmr 1237124357.624587  
> now 1237124387.624394 dr 1237124357.624578 adv  
> 1237124357.624588:1237124357.624589 func (be795f6d:507)  
> 1237124191.594238:1237124191.594242)
> Mar 15 14:39:47 ugc-1 kernel: o2net: no longer connected to node cod-2  
> (num 3) at 10.0.0.42:7777
> Mar 15 14:39:47 ugc-1 kernel: (24452,0):dlm_do_master_request:1335  
> ERROR: link to 3 went down!
> Mar 15 14:39:47 ugc-1 kernel: (24452,0):dlm_get_lock_resource:912  
> ERROR: status = -112
> Mar 15 14:40:17 ugc-1 kernel: (1743,0):o2net_connect_expired:1637  
> ERROR: no connection established with node 3 after 30.0 seconds,  
> giving up and returning errors.
> Mar 15 14:44:29 ugc-1 kernel: (16225,0):dlm_do_master_request:1335  
> ERROR: link to 3 went down!
> Mar 15 14:44:29 ugc-1 kernel: (16225,0):dlm_get_lock_resource:912  
> ERROR: status = -107
> Mar 15 14:44:47 ugc-1 kernel: (1743,0):o2net_connect_expired:1637  
> ERROR: no connection established with node 3 after 30.0 seconds,  
> giving up and returning errors.
> Mar 15 14:46:59 ugc-1 kernel: (24456,0):dlm_do_master_request:1335  
> ERROR: link to 3 went down!
> Mar 15 14:46:59 ugc-1 kernel: (24456,0):dlm_get_lock_resource:912  
> ERROR: status = -107
> Mar 15 14:46:59 ugc-1 kernel: (27195,0):dlm_do_master_request:1335  
> ERROR: link to 3 went down!
> Mar 15 14:46:59 ugc-1 kernel: (27195,0):dlm_get_lock_resource:912  
> ERROR: status = -107
> 
> Is this already a known issue and if so, is there a workaround or fix?
> 
> Thanks in advance.
> 
> 
> Regards, David
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-devel
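
For reading the quoted log, a small sketch like the one below may help.
It takes the "tmr" and "now" values from the o2net_idle_timer line and
looks up the two dlm status codes. Treating "tmr" as the moment the idle
timer was last armed is an assumption, and the errno names are the Linux
values (112 is EHOSTDOWN, 107 is ENOTCONN). The helper names are made up
for this example.

```python
import re
from datetime import datetime, timezone
from decimal import Decimal

# Fragment copied from the o2net_idle_timer line in the quoted log.
FRAGMENT = "tmr 1237124357.624587 now 1237124387.624394 dr 1237124357.624578"

# Linux errno names for the two status values in the dlm errors.
LINUX_ERRNO = {107: "ENOTCONN", 112: "EHOSTDOWN"}


def parse_times(fragment):
    """Return {'tmr': Decimal, 'now': Decimal, 'dr': Decimal} from the fragment."""
    pairs = re.findall(r"\b(tmr|now|dr) (\d+\.\d+)", fragment)
    return {name: Decimal(value) for name, value in pairs}


def idle_seconds(fragment):
    """Seconds between 'tmr' and 'now' (Decimal avoids float rounding)."""
    times = parse_times(fragment)
    return times["now"] - times["tmr"]


def when_utc(fragment):
    """The 'now' timestamp as a UTC datetime, truncated to whole seconds."""
    return datetime.fromtimestamp(int(parse_times(fragment)["now"]), tz=timezone.utc)
```

With the values above, idle_seconds(FRAGMENT) gives 29.999807, which
matches the "idle for 30.0 seconds" message, and when_utc(FRAGMENT) gives
2009-03-15 13:39:47 UTC, one hour behind the syslog stamp of 14:39:47.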


