Fwd: RE: [Ocfs-users] OCFS Hang

Sunil Mushran Sunil.Mushran at oracle.com
Wed Apr 21 11:01:29 CDT 2004


Your patch has been checked in and will be included in
the next rev, 1.0.12.

Jeremy Schneider wrote:

>Oh yeah - easy way to check, Randy:
>
>Next time your node hangs, get on the OTHER NODE and go into each
>directory where files are being opened (datafiles, archive logs,
>control files, redo logs, etc.) and delete a file (you can create one
>first, then delete it).  If this causes the hung node to recover, then
>you're having the same problem I was having.
>
>Jeremy
>
>
>>>> "Jeremy Schneider" <jer1887 at asugroup.com> 04/21/2004 10:14:04 AM >>>
>Just a thought, but you might be having the same problem I was having.
>
>Symptoms sound *very* similar.  The patch has supposedly been merged
>into the source tree but I don't think they've released a new version
>of OCFS since the merge.  (Sunil or Wim - do you know if this bugfix
>was included in 1.0.11-1?)
>
>Check
>http://oss.oracle.com/pipermail/ocfs-users/2004-March/000192.html 
>
>For the geek [technical] description, check
>http://oss.oracle.com/pipermail/ocfs-users/2004-March/000185.html or
>http://www.asugroup.com/ocfsbugfix.txt 
>
>Jeremy
>
>
>>>> "Doering, Randy" <Randy.Doering at ventersciencejtc.org> 04/19/2004 6:23:52 PM >>>
>Kurt, Thanks for the info. We ended up stopping/restarting the DB.
>That was successful, although trying to get to /u06/oradata/database
>was still hanging. We then rebooted the node, and after that
>everything is fine now. I'll look more into this using your
>suggestions and hopefully if/when it happens again, I'll have more
>information for you all.
> 
>BTW, using ocfstool, I was able to "browse" over and see the contents
>of that directory fine.
> 
>Thanks again,
>Randy
> 
>PS: We had also logged a case with oracle support.
> 
>
>	-----Original Message----- 
>	From: Kurt Hackel [mailto:Kurt.Hackel at oracle.com] 
>	Sent: Mon 4/19/2004 3:54 PM 
>	To: Doering, Randy 
>	Cc: ocfs-users at oss.oracle.com 
>	Subject: Re: [Ocfs-users] OCFS Hang
>	
>	
>
>	Hi Randy,
>	
>	It looks like you have some process stuck that had previously done
>	a down() on a semaphore in the /u06/oradata/database directory.
>	Pretty much every operation inside that directory from that node
>	will hang once the first hang occurs.
>	
>	The best place to go is to Oracle Support at this point.  But in
>	any case, the information they will want is a
>	"debugocfs -f /oradata/database/ /dev/raw/raw##" and a
>	"debugocfs -d /oradata/database/ /dev/raw/raw##" and a
>	"fsck.ocfs -v /dev/raw/raw##".
>	
>	My guess is either that the fsck.ocfs output will show an ERROR
>	that says you have a system file locked by another node, or that
>	you have some process actively spinning in the ocfs code.  If it
>	turns out to be the latter, you would also want to get the output
>	of /var/log/messages after running this:
>	"echo -1 > /proc/sys/kernel/ocfs/debug_level"
>	"echo -1 > /proc/sys/kernel/ocfs/debug_context"
>	making sure to set both of these values back to 0 after a couple
>	minutes.  Also, make sure to get a "ps -ef" or "ps awux" output
>	too, in order to match up the process ids.
>	
>	The solution to any of the bugs I have mentioned will likely
>	involve taking down one node, depending upon which bug you have
>	hit.  Since in your case it unfortunately looks like the trouble
>	partition contains your datafiles, I would prepare to shut down the
>	database on this node in anticipation of a reboot.  The other RAC
>	node can likely remain up and running.  (If this were a partition
>	containing only archives, for instance, you could possibly keep the
>	database up by just switching archive destination temporarily.)
>	
>	Thanks!
>	-kurt
>	
>	
>	
>	On Mon, Apr 19, 2004 at 03:02:23PM -0400, Doering, Randy wrote:
>	> 
>	>
>	> Greetings,
>	>
>	> 
>	>
>	> Having read about the previous OCFS hangs, I think this one
>	> that we are seeing is different, but I'm not sure if this is
>	> caused by OCFS or the Linux OS.
>	>
>	> 
>	>
>	> We are running OCFS Version 1.09 with Linux AS 3.0/9i RAC.
>	>
>	> 
>	>
>	> We have a 2 node Intel Cluster (Node 1 and Node 2). This morning
>	> the DBA tried to do an "ls" command on /u06/oradata/database and
>	> his process hung. I tried to kill his "ls" process and it is
>	> unkillable. On Node 2, the "ls" on /u06/oradata/database worked
>	> fine. All of the other file systems (on both nodes) are fine.
>	>
>	> 
>	>
>	> Also, what we can't get rid of is this process:
>	>
>	> 
>	>
>	> oracle   23593     1 95 10:00 ?        04:45:11 oracleXYZ2
>	> (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
>	>
>	> 
>	>
>	> and it's been accumulating CPU time since the hang. I'm unsure
>	> if this process is a victim or the cause of the hangs.
>	>
>	> 
>	>
>	> I hope that I have provided enough information about the
>	> situation. If not, let me know and I'll get more.
>	>
>	> 
>	>
>	> Regards,
>	>
>	> Randy
>	>
>	> 
>	>
>	
>	> _______________________________________________
>	> Ocfs-users mailing list
>	> Ocfs-users at oss.oracle.com 
>	> http://oss.oracle.com/mailman/listinfo/ocfs-users 
>	
>	
>
>
>  
>
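
For anyone who hits the "process spinning" case Kurt describes above, the
debug-level toggling he outlines could be wrapped roughly as below. This is
only a sketch: the /proc paths are the ones quoted in the thread, while the
OCFS_PROC variable, the function names, the ps-ef.txt file name and the
120-second default are assumptions made for illustration. It must run as
root on the hung node, and /var/log/messages still has to be collected
afterwards.

```shell
#!/bin/sh
# Sketch only. OCFS_PROC defaults to the /proc directory quoted in the
# thread; the variable and function names are hypothetical.
OCFS_PROC="${OCFS_PROC:-/proc/sys/kernel/ocfs}"

set_ocfs_debug() {
    # Write the same value to both debug knobs named in the thread.
    printf '%s\n' "$1" > "$OCFS_PROC/debug_level" &&
    printf '%s\n' "$1" > "$OCFS_PROC/debug_context"
}

capture_ocfs_trace() {
    # $1 = seconds to leave tracing on (assumed default: 120)
    # $2 = directory for the ps snapshot (assumed default: current dir)
    secs="${1:-120}"
    out="${2:-.}"

    set_ocfs_debug -1 || return 1

    # Snapshot the process list so pids in the log can be matched up.
    ps -ef > "$out/ps-ef.txt" || echo "ps failed" >&2

    sleep "$secs"

    # Set both values back to 0, as the thread advises.
    set_ocfs_debug 0
}
```

Calling "capture_ocfs_trace 120 /tmp" would leave tracing on for two
minutes, matching the "couple minutes" suggested above, and then reset both
values to 0.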



