[Ocfs2-devel] shutdown by o2net_idle_timer causes Xen to hang
Joel Becker
Joel.Becker at oracle.com
Wed Mar 18 10:07:13 PDT 2009
On Wed, Mar 18, 2009 at 12:17:36PM +0100, David Winter wrote:
> Hello,
>
> we've had some serious trouble with a two-node Xen-based OCFS2
> cluster. In brief: we had two incidents where one node detects an idle
> timeout and shuts the other node down which causes the other node and
> the Dom0 to hang. Both times this could only be resolved by rebooting
> the whole machine using the built-in IPMI card.
>
> All machines (including the other DomUs) run CentOS 5.2 and the OCFS2
> nodes use ocfs2-tools-1.4.1-1.el5 and
> ocfs2-2.6.18-92.1.13.el5xen-1.4.1-1.el5.
>
> Unfortunately there wasn't logged much of relevance, except for the
> /var/log/messages of the node that issued the shutdown (see below) and
> the nearly five hour gap in the logs of the other node.

Just to clarify, the o2cb stack doesn't shut down other nodes.
Nodes can only self-fence. The 'shutting it down' message in the logs
refers to the network connection, not the node. In other words, cod-2
was already hung; ugc-1 noticed and closed the network connection.

So you want to figure out why cod-2 hung or crashed. Sunil is
right that you'll want netconsole for a better idea of what's going on.
We can't diagnose cod-2 from this information.
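
For reference, a minimal sketch of what loading netconsole on cod-2 could
look like. Every address, port, interface name and MAC below is a
placeholder, not a value from this thread; the script only prints the
command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Placeholders (assumptions): substitute real values for your network.
SRC_PORT=6665                 # local UDP port on cod-2
SRC_IP=192.0.2.10             # cod-2's own address
DEV=eth0                      # interface cod-2 sends from
TGT_PORT=6666                 # UDP port the log receiver listens on
TGT_IP=192.0.2.20             # machine collecting the console output
TGT_MAC=00:11:22:33:44:55     # MAC of the receiver (or of the gateway)

# netconsole parameter format:
#   src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
PARAM="${SRC_PORT}@${SRC_IP}/${DEV},${TGT_PORT}@${TGT_IP}/${TGT_MAC}"

# Print the command for review instead of running it.
echo "modprobe netconsole netconsole=${PARAM}"
```

On the receiving machine, anything that listens on the chosen UDP port
(a syslog daemon accepting remote messages, or netcat in UDP listen mode)
can capture the output. The point is to have kernel messages from cod-2
land somewhere that survives the hang.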

If your dom0 is hanging, that's a separate issue. A hanging
domU, no matter the cause, shouldn't hang dom0.
Joel
--
"Sometimes when reading Goethe I have the paralyzing suspicion
that he is trying to be funny."
- Guy Davenport
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127