[Ocfs2-users] [Fwd: Re: Unable to fix corrupt directories with fsck.ocfs2]

Joel Becker Joel.Becker at oracle.com
Wed May 20 11:11:20 PDT 2009


On Wed, May 20, 2009 at 12:05:59PM +1000, Robin Garner wrote:
> Joel Becker wrote:
> > On Tue, May 19, 2009 at 02:49:31PM +1000, Robin Garner wrote:
> >> Robin Garner wrote:
> >>> Yes.  This is a 24/7 application (at least during semester), and 
> >>> arranging extended downtime is a challenge.
> > 
> > 	Ok, you ran fsck against a live filesystem and skipped the
> > cluster locking with the '-F' option.  So now you have two problems.
> > 
> > 1) The original directory problem.
> > 2) The duplicate blocks created by your fsck of a mounted filesystem.
> > 
> > 	Do you have backups?
> > 
> > Joel
> > 
> 
> OK, now I'm confused:
> 
> The man page for fsck.ocfs2 says
> 
>         -F     Usually fsck.ocfs2 will check with cluster
>                services  and the DLM to make sure that no
>                one else in the cluster is actively  using
>                the  device  before  proceeding.  -F skips
>                this check and should only be used when it
>                can  be  guaranteed  that  there can be no
>                other users of the device while fsck.ocfs2
>                is running.
> 
> To me & my colleagues "no one else in the cluster is actively using the 
>   device" means that the filesystem must be mounted on *at most* one 
> node in the cluster (the node doing the fsck).  That's what we did.

	It means 'no software is using it', and that includes mounting
it.  I checked this text after I emailed you, and I agree it needs to be
updated.

> I can't see any reference in the man page about not doing an fsck on a 
> mounted disk.
> 
> e2fsck for example says this:
> 
>  > WARNING!!! Running e2fsck on a mounted file system may cause
>  > SEVERE filesystem damage.
>  >
>  > Do you really want to continue (y/n)?
> 
> when you try to fsck a mounted filesystem.  May I suggest that 
> fsck.ocfs2 do something similar ?  Perhaps 'everyone knows' you can't 
> run fsck on a mounted filesystem, but we were assuming that ocfs2 being 
> a modern cluster filesystem might be a little more advanced.  Apparently 
> not.

	We don't assume "everyone knows", but we apparently assume that
people will understand that "mounted" means "using it".  Adding a check
like e2fsck's is hard, because -F just skipped the check that would tell
us about other nodes.
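	To illustrate the local half of such a check, here is a minimal
Python sketch, assuming a Linux-style /proc/mounts table, that lists
where a device is mounted on this node.  The function name mounted_at
is hypothetical, and this is not what fsck.ocfs2 does.  It also says
nothing about other nodes in the cluster, which is exactly the
information -F throws away.

```python
import os


def mounted_at(device, mounts_path="/proc/mounts"):
    """Return the local mount points of `device`.

    Reads a /proc/mounts-style table (assumed format: source, mount
    point, fstype, options, ...).  Only local mounts are visible here;
    mounts on other cluster nodes are not.
    """
    target = os.path.realpath(device)
    points = []
    with open(mounts_path) as table:
        for line in table:
            fields = line.split()
            if len(fields) < 2:
                continue
            source, mount_point = fields[0], fields[1]
            # Skip pseudo-filesystems such as "proc" or "sysfs".
            if not source.startswith("/"):
                continue
            # Resolve symlinks so /dev/disk/by-* aliases also match.
            if os.path.realpath(source) == target:
                points.append(mount_point)
    return points
```

	A wrapper script could call mounted_at() on the device and refuse
to run fsck if the list is not empty.  An empty list only proves the
device is not mounted on this one node.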
	What about fsck on a mounted cluster filesystem?  It's REALLY
HARD.  To ensure an allocator is in a good state, you have to find all
the things that use the blocks in that allocator.  But that means
scanning the entire disk for possible use of those blocks.  And you
can't let anything happen to the areas of the disk you have already
checked.  So essentially an online fsck would have to lock out the
entire filesystem anyway.
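	The argument above can be sketched with a toy model in Python.
It is an illustration only, not fsck.ocfs2's implementation: the
"allocator" is just a set of cluster numbers marked in use, each file
is a list of the clusters it claims, and the name cross_check is
hypothetical.  The point is that no verdict about the allocator is
possible until every file has been scanned, and any write during the
scan invalidates what has already been counted.

```python
from collections import defaultdict


def cross_check(allocated, files):
    """Compare an allocator against the clusters files actually claim.

    allocated: set of cluster numbers the allocator marks as in use.
    files:     dict mapping a file name to the clusters it points at.
    Returns (duplicates, unmarked, leaked).
    """
    owners = defaultdict(list)
    # The full scan: every file and every cluster it references.
    for name, clusters in files.items():
        for cluster in clusters:
            owners[cluster].append(name)

    # Clusters claimed by more than one file.
    duplicates = {c: names for c, names in owners.items() if len(names) > 1}
    # Clusters a file uses although the allocator thinks they are free.
    unmarked = set(owners) - allocated
    # Clusters marked in use that no file claims.
    leaked = allocated - set(owners)
    return duplicates, unmarked, leaked


if __name__ == "__main__":
    files = {"a": [1, 2, 3], "b": [3, 4]}
    print(cross_check({1, 2, 3, 5}, files))
```

	In this example cluster 3 is claimed by both "a" and "b", which
is the duplicate-cluster situation a check of a mounted filesystem can
create.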

> We'll try to salvage the data another way (we believe the directory 
> corruption is some way down the directory tree), and pull missing data 
> back from backups.

	Do you know which directory it is, or have a way of finding it?
That would make salvaging easier, I bet.  I'm sure that the duplicate
clusters won't be a problem when reading the filesystem (other than
possibly corrupting one of the two files pointing to the duplicate
cluster).
	Unfortunately, fsck.ocfs2 doesn't have the ability to fix
duplicate clusters yet.  I'd actually started looking at that already
for a different reason, but it won't be ready soon.  We haven't run
into a lot of cases, thankfully.

Joel

-- 

"I always thought the hardest questions were those I could not answer.
 Now I know they are the ones I can never ask."
			- Charlie Watkins

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127


