[Ocfs-devel] [BUG] node 0 hangs until disk unmounted on node 1

Jeremy Schneider jer1887 at asugroup.com
Wed Dec 10 13:44:00 CST 2003


I'm currently part of a project implementing Oracle eBusiness Suite 11i
on RAC.  We're using a two-node cluster with shared storage, and both
nodes are configured identically.  The kernel is 2.4.9-e.27enterprise
and ocfs is 1.0.9-11.  I have verified that the shared storage
(/dev/sdx) can be accessed directly from both nodes without any problems.

Curious if anyone has any suggestions or comments regarding a problem
we've been having.

After the ocfs partitions are mounted, one of the nodes eventually
hangs: the oracle server processes get stuck in "D" (disk wait) state,
and when I go into the directory with the datafiles and type "ls", that
process also hangs in "D" state.  I can still list the contents of
other directories; only the directory with the datafiles triggers the
hang.  "strace -p" also hangs when I try to run it on one of the stuck
processes.

This only happens when the volume is mounted on both nodes.  Last
Friday I had several oracle processes hung, along with several terminal
windows with hung /bin/ls processes.  The *moment* I unmounted /u02 from
the other node, *all* of the "D" processes (even strace) instantly came
out of disk wait and continued.  This looks like an ocfs issue to me.

Any ideas how I can further narrow this problem down?  Would a process
dump from the magic sysrq key (t) help?  Or the wait channel from a "ps
l"?  Is one node designated a "master" node, and would identifying
whether or not this happened on the master node help?

Regards,
Jeremy


Jeremy Schneider
Systems/Database Administrator
The ASU Group - IS Dept
email: jer1887 at asugroup.com

Life is either a daring adventure or nothing.
  -- Helen Keller, Let Us Have Faith

