[Ocfs-users] OCFS file system used as archived redo destination is corrupted
Pei Ku
pku at autotradecenter.com
Fri Feb 11 14:45:52 CST 2005
This file system was created under 1.0.12.
Does upgrading from 1.0.12 to 1.0.13 require reformatting the file systems? I don't care about the file system we are using for archived redos (it's pretty screwed up anyway and is going to need a clean wipe). But should I do a full FS dump before the upgrade and a restore afterward for the file system used for storing Oracle datafiles? Of course I'll do a full db backup in any case.
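If the answer is yes, I assume something along these lines would do, with the database shut down first. The backup location below is made up (and I'm guessing /export/u01 is the datafile FS), and I don't know how well plain tar behaves against OCFS's O_DIRECT requirements, so treat it as a sketch rather than a recipe:

$ # dump the datafile FS before the upgrade
$ cd /export/u01 && tar cpf /backup/u01_pre_upgrade.tar .
$ # ... upgrade OCFS, reformat the volume if required, remount ...
$ cd /export/u01 && tar xpf /backup/u01_pre_upgrade.tar   # restore afterward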
Are you saying the problem I described was a known problem in 1.0.12 that has been fixed in 1.0.13?
Before this problem, our production db had archive_lag_target set to 15 minutes (the parameter takes seconds, so 900) in order for the standby db not to lag too far behind production. Since this is a four-node RAC, that means at least (60/15)*4 = 16 archived redos are generated per hour (16*24 = 384 per day). And the fact that this problem only appeared after several months tells me the OCFS QA process needs to be more thorough and run for a long time in order to catch bugs like this.
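Incidentally, that kind of churn (the same thing your archiver-simulation test below does) is easy to reproduce. A minimal sketch to run on every node at once; the scratch directory and file size are made up, so don't point it at a live volume:

$ cat archstress.sh
#!/bin/sh
# crude archiver simulation: constantly create, compress, and remove files
DIR=/export/u10/oraarch/TEST     # hypothetical scratch directory on OCFS
i=0
while true
do
    i=`expr $i + 1`
    F=$DIR/fake_`hostname`_$$_$i.arc
    # ~400 KB of zeroes, about the size of our archived redos
    dd if=/dev/zero of=$F bs=1024 count=400 2>/dev/null
    gzip $F
    rm -f $F.gz
done

Run for months rather than hours, that is the kind of soak test I mean.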
Another weird thing: when I do an 'ls /export/u10/oraarch/AUCP', it takes about 15 seconds, and CPU usage is 25+% for that duration. If I run the same command on multiple nodes at once, the elapsed time can be 30 seconds on each node. It concerns me that a simple command like 'ls' can be that resource-intensive and slow. Maybe it's related to the FS corruption...
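A quick way to see where the time goes, the directory read itself versus the per-entry inode lookups, is to compare an unsorted, stat-free listing against a full one:

$ time ls -f /export/u10/oraarch/AUCP > /dev/null   # readdir only: no sort, no stat
$ time ls -l /export/u10/oraarch/AUCP > /dev/null   # also stats every entry

If the '-f' form is just as slow, the time is going into reading the directory blocks themselves, which would fit a mangled dirnode index.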
thanks
Pei
> -----Original Message-----
> From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
> Sent: Friday, February 11, 2005 11:51 AM
> To: Pei Ku
> Cc: ocfs-users at oss.oracle.com
> Subject: Re: [Ocfs-users] OCFS file system used as archived redo destination is corrupted
>
>
> Looks like the dirnode index is screwed up. The file is showing up
> twice, but there is only one copy of the file.
> We had detected a race which could cause this; it has been fixed.
>
> Did you start on 1.0.12, or did you run an older version of the module
> with this device?
> You may want to look into upgrading to at least 1.0.13. We made some
> memory allocation changes which were sorely needed.
>
> As part of our tests, we simulate the archiver: we run a script on
> multiple nodes which constantly creates files.
>
> Pei Ku wrote:
>
> >
> > We started using an OCFS file system about 4 months ago as the shared
> > archived redo destination for the 4-node RAC instances (HP DL380,
> > MSA1000, RH AS 2.1). Last night we started seeing some weird behavior,
> > and my guess is that the inode directory in the file system is getting
> > corrupted. I've always had a bad feeling about OCFS not being very
> > robust at handling constant file creation and deletion (which is what
> > happens when you use it for archived redo logs).
> >
> > ocfs-2.4.9-e-smp-1.0.12-1 is what we are using in production.
> >
> > For now, we set up an archive redo dest on a local ext3 FS on each
> > node and made that dest the mandatory dest; we changed the OCFS dest
> > to an optional one. (The parameter settings are sketched at the end
> > of this message.) The reason we made the OCFS arch redo dest the
> > primary dest a few months ago was that we are planning to migrate to
> > RMAN-based backup (as opposed to the current hot backup scheme); it's
> > easier (required?) to manage RAC archived redo logs with RMAN if the
> > archived redos reside in a shared file system.
> >
> > below are some diagnostics:
> > $ ls -l rdo_1_21810.arc*
> >
> > -rw-r----- 1 oracle dba 397312 Feb 10 22:30 rdo_1_21810.arc
> > -rw-r----- 1 oracle dba 397312 Feb 10 22:30 rdo_1_21810.arc
> >
> > (they have the same inode, btw -- I had done an 'ls -li' earlier but
> > the output had rolled off the screen)
> >
> > After a while, one of the dba scripts gzipped the file(s). Now they
> > look like this:
> >
> > $ ls -liL /export/u10/oraarch/AUCP/rdo_1_21810.arc*
> > 1457510912 -rw-r----- 1 oracle dba 36 Feb 10 23:00 /export/u10/oraarch/AUCP/rdo_1_21810.arc.gz
> > 1457510912 -rw-r----- 1 oracle dba 36 Feb 10 23:00 /export/u10/oraarch/AUCP/rdo_1_21810.arc.gz
> >
> > These two files also have the same inode, but the size is way too
> > small.
> >
> > yeah, /export/u10 is pretty hosed...
> >
> > Pei
> >
> > -----Original Message-----
> > *From:* Pei Ku
> > *Sent:* Thu 2/10/2005 11:16 PM
> > *To:* IT
> > *Cc:* ADS
> > *Subject:* possible OCFS /export/u10/ corruption on dbprd*
> >
> > Ulf,
> >
> > AUCP had problems creating archive file
> > "/export/u10/oraarch/AUCP/rdo_1_21810.arc". After a few tries, it
> > appeared that it was able to -- except that there are *two*
> > rdo_1_21810.arc files in it (by the time you look at it, they will
> > probably have been gzipped). We also have a couple of zero-length
> > gzipped redo log files (which is not normal) in there.
> >
> > At least the problem has not brought any of the AUCP instances down.
> > Manoj and I turned on archiving to an ext3 file system on each node;
> > archiving to /export/u10/ is still active but has been made optional
> > for now.
> >
> > My guess is that /export/u10/ is corrupted in some way. I still say
> > OCFS can't take constant file creation and removal.
> >
> > We are one rev behind (1.0.12 vs 1.0.13 on ocfs.org). No guarantee
> > that 1.0.13 contains the cure...
> >
> > Pei
> >
> > -----Original Message-----
> > *From:* Oracle [mailto:oracle at dbprd01.autc.com]
> > *Sent:* Thu 2/10/2005 10:26 PM
> > *To:* DBA; Page DBA; Unix Admin
> > *Cc:*
> > *Subject:* SL1:dbprd01.autc.com:050210_222600:oalert_mon> Alert Log Errors
> >
> > SEVER_LVL=1 PROG=oalert_mon
> > **** oalert_mon.pl: DB=AUCP SID=AUCP1
> > [Thu Feb 10 22:25:21] ORA-19504: failed to create file "/export/u10/oraarch/AUCP/rdo_1_21810.arc"
> > [Thu Feb 10 22:25:21] ORA-19504: failed to create file "/export/u10/oraarch/AUCP/rdo_1_21810.arc"
> > [Thu Feb 10 22:25:21] ORA-27040: skgfrcre: create error, unable to create file
> > [Thu Feb 10 22:25:28] ORA-16038: log 12 sequence# 21810 cannot be archived
> > [Thu Feb 10 22:25:28] ORA-19504: failed to create file ""
> > [Thu Feb 10 22:25:28] ORA-00312: online log 12 thread 1: '/export/u01/oradata/AUCP/redo12m1.log'
> > [Thu Feb 10 22:25:28] ORA-00312: online log 12 thread 1: '/export/u01/oradata/AUCP/redo12m2.log'
> > [Thu Feb 10 22:25:28] ORA-16038: log 12 sequence# 21810 cannot be archived
> > [Thu Feb 10 22:25:28] ORA-19504: failed to create file ""
> > [Thu Feb 10 22:25:28] ORA-00312: online log 12 thread 1: '/export/u01/oradata/AUCP/redo12m1.log'
> > [Thu Feb 10 22:25:28] ORA-00312: online log 12 thread 1: '/export/u01/oradata/AUCP/redo12m2.log'
> > [Thu Feb 10 22:25:28] ORA-16038: log 12 sequence# 21810 cannot be archived
> > [Thu Feb 10 22:25:28] ORA-19504: failed to create file ""
> > [Thu Feb 10 22:25:28] ORA-00312: online log 12 thread 1: '/export/u01/oradata/AUCP/redo12m1.log'
> > [Thu Feb 10 22:25:28] ORA-00312: online log 12 thread 1: '/export/u01/oradata/AUCP/redo12m2.log'
> >
>
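P.S. For the record, the destination change described in the quoted thread boils down to a pair of init parameters, roughly as below. The local ext3 path is hypothetical and the attribute details are from memory, so double-check against the 9i docs before relying on this:

# on each node: local ext3 destination that must succeed
log_archive_dest_1 = 'LOCATION=/oraarch/AUCP MANDATORY'
# shared OCFS destination: still written to, but allowed to fail
log_archive_dest_2 = 'LOCATION=/export/u10/oraarch/AUCP OPTIONAL'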