[Ocfs-users] OCFS Hang

Mon Apr 19 13:54:11 CDT 2004

Hi Randy,

It looks like you have some process stuck that had previously done a
down() on a semaphore in the /u06/oradata/database directory.  Pretty
much every operation inside that directory from that node will hang once
the first hang occurs.
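
A process sleeping in down() is in uninterruptible sleep, which is why the "ls" cannot be killed. A rough way to list everything on the node in that state (STAT beginning with "D") is sketched below. The function names are made up for illustration, and the wait channel column will not necessarily name an ocfs function.

```shell
#!/bin/sh
# Sketch: list processes in uninterruptible sleep (STAT starts with "D").
# filter_dstate and list_dstate are hypothetical helper names.

# Keep the header row plus any row whose second column (STAT) starts with D.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Show pid, state, kernel wait channel and command line for D-state processes.
list_dstate() {
    ps -eo pid,stat,wchan,args | filter_dstate
}
```

Running "list_dstate" on the hung node should show the stuck "ls" and anything else waiting behind it. A process that is spinning rather than sleeping would show up as "R" and is not listed here.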

The best place to go at this point is Oracle Support.  In any case, the
information they will want is the output of
"debugocfs -f /oradata/database/ /dev/raw/raw##",
"debugocfs -d /oradata/database/ /dev/raw/raw##" and
"fsck.ocfs -v /dev/raw/raw##".
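
To keep the three outputs together for the support request, something like the following sketch may help. The function names, the output file names and the DRY_RUN switch are assumptions made for illustration. Only the debugocfs and fsck.ocfs invocations come from the commands above, and the raw device still has to be filled in by hand.

```shell
#!/bin/sh
# Sketch: gather the three outputs named above into one directory.
# run_to, collect_ocfs_diag, DRY_RUN and the *.out names are hypothetical.
# DRY_RUN defaults to 1, which only prints what would be run.

# Run a command with stdout and stderr captured in a file, or just print it.
run_to() {
    out_file=$1
    shift
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$* > $out_file"
    else
        "$@" > "$out_file" 2>&1
    fi
}

# Arguments: raw device, directory as passed to debugocfs, output directory.
collect_ocfs_diag() {
    if [ $# -ne 3 ]; then
        echo "usage: collect_ocfs_diag RAW_DEV OCFS_DIR OUT_DIR" >&2
        return 1
    fi
    raw_dev=$1
    ocfs_dir=$2
    out_dir=$3

    if [ "${DRY_RUN:-1}" != "1" ]; then
        mkdir -p "$out_dir" || return 1
    fi

    run_to "$out_dir/debugocfs-f.out" debugocfs -f "$ocfs_dir" "$raw_dev"
    run_to "$out_dir/debugocfs-d.out" debugocfs -d "$ocfs_dir" "$raw_dev"
    run_to "$out_dir/fsck-ocfs.out" fsck.ocfs -v "$raw_dev"
}
```

For example, "collect_ocfs_diag /dev/raw/raw## /oradata/database/ /tmp/ocfs-diag" (with the real raw device substituted, and /tmp/ocfs-diag being an arbitrary choice) prints the three commands. Setting DRY_RUN=0 runs them for real, and the output directory should be on a filesystem other than the hung one.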

My guess is either that the fsck.ocfs output will show an ERROR that
says you have a system file locked by another node, or that you have
some process actively spinning in the ocfs code.  If it turns out to be
the latter, you would also want to get the output of /var/log/messages
after running this:
"echo -1 > /proc/sys/kernel/ocfs/debug_level"
"echo -1 > /proc/sys/kernel/ocfs/debug_context"
making sure to set both of these values back to 0 after a couple of
minutes.  Also, make sure to get "ps -ef" or "ps awux" output too,
in order to match up the process IDs.
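
Since leaving the debug values at -1 will keep flooding the log, it may be safer to script the on/off window. This is only a sketch: OCFS_PROC, OCFS_LOG, the function names and the output file names are made up for illustration, and the default of 120 seconds is just one reading of "a couple of minutes".

```shell
#!/bin/sh
# Sketch: turn OCFS debug tracing on for a fixed window, then back off.
# OCFS_PROC, OCFS_LOG, ocfs_debug_set and trace_ocfs are hypothetical names.
OCFS_PROC=${OCFS_PROC:-/proc/sys/kernel/ocfs}
OCFS_LOG=${OCFS_LOG:-/var/log/messages}

# Write the same value to both debug settings.
ocfs_debug_set() {
    printf '%s\n' "$1" > "$OCFS_PROC/debug_level" &&
    printf '%s\n' "$1" > "$OCFS_PROC/debug_context"
}

# Arguments: seconds to trace (default 120), output directory (default ".").
trace_ocfs() {
    secs=${1:-120}
    out_dir=${2:-.}
    mkdir -p "$out_dir" || return 1

    # If interrupted, still put both values back to 0 before exiting.
    trap 'ocfs_debug_set 0; exit 1' INT TERM

    ocfs_debug_set -1 || { ocfs_debug_set 0; trap - INT TERM; return 1; }
    # Process list taken during the trace, for matching up process IDs.
    ps -ef > "$out_dir/ps-ef.out" 2>&1
    sleep "$secs"
    ocfs_debug_set 0
    trap - INT TERM

    # Copy the log after tracing is off so the copy covers the whole window.
    cp "$OCFS_LOG" "$out_dir/messages.copy"
}
```

Run as root, "trace_ocfs 120 /tmp/ocfs-diag" would leave ps-ef.out and messages.copy in /tmp/ocfs-diag (again an arbitrary directory, which should not be on the hung filesystem).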

The solution to any of the bugs I have mentioned will likely involve
taking down one node, depending upon which bug you have hit.  Since in
your case it unfortunately looks like the trouble partition contains
your datafiles, I would prepare to shut down the database on this node
in anticipation of a reboot.  The other RAC node can likely remain up
and running.  (If this were a partition containing only archives, for
instance, you could possibly keep the database up by just switching
the archive destination temporarily.)

Thanks!
-kurt

On Mon, Apr 19, 2004 at 03:02:23PM -0400, Doering, Randy wrote:
> Greetings,
>
> Having read about the previous OCFS hangs, I think this one that we are
> seeing is different, but I'm not sure if this is caused by OCFS or the
> Linux OS.
>
> We are running OCFS Version 1.09 with Linux AS 3.0/9i RAC.
>
> We have a 2 node Intel Cluster (Node 1 and Node 2). This morning the DBA
> tried to do an "ls" command on /u06/oradata/database and his process
> hung. I tried to kill his "ls" process and it is unkillable. On Node 2,
> the "ls" on /u06/oradata/database worked fine. All of the other file
> systems (on both nodes) are fine.
>
> Also, what we can't get rid of is this process:
>
> oracle   23593     1 95 10:00 ?        04:45:11 oracleXYZ2
> (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
>
> and it's been accumulating CPU time since the hang. I'm unsure if this
> process is a victim or the cause of the hangs.
>
> I hope that I have provided enough information about the situation. If
> not, let me know and I'll get more.
>
> Regards,
> Randy
