[Ocfs-devel] RE: [Ocfs-users] [BUG] node 0 hangs until disk unmounted on node 1

Wed Dec 10 14:02:31 CST 2003

I had something along those lines too and worked it out with an HP engineer. 

We were using an HP XP512 as the shared disk, with dual channels to the disk. The nodes hung because we had the disk mounted using the secondary channel instead of the primary channel (something about cluster filesystems not liking a secondary channel to shared disk; I'm not exactly clear on the details, though). Once I mounted the disk using the primary channel, the problem went away.

Maybe you are having this issue? Just something I have encountered.

HOH

-----Original Message-----
From: Jeremy Schneider [mailto:jer1887 at asugroup.com]
Sent: December 10, 2003 1:44 PM
To: [; [
Subject: [Ocfs-users] [BUG] node 0 hangs until disk unmounted on node 1

I'm currently part of a project implementing Oracle eBusiness Suite 11i
on RAC.  We're using a two-node cluster with shared storage; both nodes
are configured identically.  Kernel is 2.4.9-e.27enterprise and ocfs is
1.0.9-11.  I have checked and the shared storage can be accessed
directly without any problems from both nodes (/dev/sdx).

Curious if anyone has any suggestions or comments regarding a problem
we've been having.

After mounting the ocfs partitions, eventually one of the nodes will
hang: the oracle server processes will get stuck in a "D" Disk wait
state, and when I go into the folder with the datafiles and type "ls"
that process also hangs in a "D" state.  It's interesting that I can
list the contents of other folders, but as soon as I try to list the
contents of the directory with the datafiles, the process hangs in a
Disk Wait state.  "strace -p" also hangs when I try to run it on the
process.

This only happens when the volume is mounted on both nodes.  Last
Friday I had several oracle processes hung and several terminal windows
with hung /bin/ls processes.  The *moment* I unmounted /u02 from the
other node, *all* of the "D" processes (even strace) instantly came out
of Disk Wait and continued.  Seems like an ocfs issue to me.

Any ideas how I can further narrow this problem down?  Would a process
dump from the magic sysrq key (t) help?  Or the wait channel from a "ps
l"?  Is one node designated a "master" node, and would identifying
whether or not this happened on the master node help?
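
For anyone wanting to collect the "D" state and wait channel data
mentioned above in one pass, here is a rough sketch in Python.  It is
not an ocfs tool, just an illustration: it reads the state field from
/proc/<pid>/stat and, where the kernel provides it, the symbol in
/proc/<pid>/wchan.  The function names (parse_stat, read_wchan,
find_d_state) are made up for this example, and the wchan file is an
assumption that may not hold on older kernels, in which case the
script prints "?" and "ps l" remains the fallback.

```python
import os

def parse_stat(text):
    """Split a /proc/<pid>/stat line into (pid, command, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' rather than on whitespace.
    left, right = text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    return int(pid_str), comm, right.split()[0]

def read_wchan(proc_root, pid):
    """Return the wait channel symbol, or '?' if it cannot be read."""
    # Assumption: /proc/<pid>/wchan exists; older kernels may lack it.
    try:
        with open(os.path.join(proc_root, str(pid), "wchan")) as f:
            return f.read().strip() or "?"
    except OSError:
        return "?"

def find_d_state(proc_root="/proc"):
    """List (pid, command, wchan) for processes in uninterruptible sleep."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no procfs at this path
    hung = []
    for entry in entries:
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was unparseable
        if state == "D":
            hung.append((pid, comm, read_wchan(proc_root, pid)))
    return sorted(hung)

if __name__ == "__main__":
    for pid, comm, wchan in find_d_state():
        print(f"{pid:>7}  {comm:<16}  {wchan}")
```

Running it on the hung node before and after unmounting the volume on
the other node would show whether the stuck processes all share the
same wait channel, which is the kind of detail that might help narrow
this down.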

Regards,
Jeremy

Jeremy Schneider
Systems/Database Administrator
The ASU Group - IS Dept
email: jer1887 at asugroup.com

Life is either a daring adventure or nothing.
  -- Helen Keller, Let Us Have Faith