[Ocfs2-users] Fwd: fsck fails & volume mount fails, is my data lost?

khaije rock khaije1 at gmail.com
Sat May 30 20:12:51 PDT 2009


Thanks Sunil.

I treat the ocfs2/LVM volumes as static partitions, so that shouldn't cause
problems unless I'm attempting to resize or something like that, right?

---recovery---
In the past I tried to recover using "debugfs.ocfs2 rdump", but it would
always fail with this error message:

debugfs.ocfs2: extend_file.c:211: ocfs2_new_path: Assertion
`root_el->l_tree_depth < 5' failed.

After your suggestion I'm trying it again, breaking the rdump operation
into parts so that it never needs to traverse more than 5 subdirectories
deep, and it seems to be working very nicely. I've recovered about 60% so
far, and it looks like smooth sailing from here.
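
A rough sketch of what that looks like, in case it helps anyone else. The
destination directory and the top-level directory names here are made up
for illustration; only the device path is from my setup, and it assumes
debugfs.ocfs2's -R option to run a single command non-interactively:

  # hypothetical recovery target on a healthy local filesystem
  mkdir -p /mnt/recovery
  # dump one subtree at a time instead of rdumping the whole volume at once
  for d in documents photos audio; do
      debugfs.ocfs2 -R "rdump /$d /mnt/recovery" \
          /dev/vg.chronovore/lv.medea.share._multimedia_store
  done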

I was exporting this volume over iSCSI, and I suspect the filesystem problem
came about as a result of fencing resets?

About 90% of the way through the recovery, it turned out that each of the
earlier volume-wide rdump attempts was choking at the same specific point: a
symlink pointing to a directory on a different filesystem that then
descended back into the ocfs2 volume.

It looks like this:
debugfs: stat lossless.from_folks
        Inode: 258749   Mode: 0777   Generation: 667446219 (0x27c86bcb)
        FS Generation: 781535612 (0x2e95497c)
        Type: Symbolic Link   Attr: 0x0   Flags: Valid
        User: 1000 (khaije)   Group: 1000 (khaije)   Size: 68
        Links: 1   Clusters: 0
        ctime: 0x49f5b25d -- Mon Apr 27 09:25:49 2009
        atime: 0x49f5b25d -- Mon Apr 27 09:25:49 2009
        mtime: 0x4994683e -- Thu Feb 12 13:19:42 2009
        dtime: 0x0 -- Wed Dec 31 19:00:00 1969
        ctime_nsec: 0x0a0f5a36 -- 168778294
        atime_nsec: 0x00000000 -- 0
        mtime_nsec: 0x00000000 -- 0
        Last Extblk: 0
        Sub Alloc Slot: 0   Sub Alloc Bit: 699
        Fast Symlink Destination:
/home/khaije/documents/shared_multimedia/audio/music/lossy.from_folks


I'm guessing this either added enough directories to the path to exceed
rdump's depth threshold, or rdump had problems handling the combination of
local and non-local filesystems. I would simply delete the symlink, but
debugfs.ocfs2 doesn't seem to allow that. (I'm not sure why I didn't just use
a relative-path symlink, since the target is in the same directory.)

Anyway, by avoiding that symlink I was able to make a full recovery.
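
In case it's useful to anyone else, the workaround was simply to rdump the
bad symlink's sibling entries one at a time rather than the directory that
contains it. Another sketch only: the in-volume directory and album names
below are invented stand-ins, and /mnt/recovery is again a hypothetical
destination:

  mkdir -p /mnt/recovery/music
  # rdump each sibling individually so the problem symlink is never visited
  for d in album_a album_b album_c; do
      debugfs.ocfs2 -R "rdump /audio/music/$d /mnt/recovery/music" \
          /dev/vg.chronovore/lv.medea.share._multimedia_store
  done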

Cheers and thanks again,


On Fri, May 29, 2009 at 2:18 PM, Sunil Mushran <sunil.mushran at oracle.com> wrote:

> You are using ocfs2 atop lvm - a non-cluster-aware volume manager.
> A lot of things can go wrong in this combination. Quite a few have
> been reported on this forum.
>
> debugfs.ocfs2 has commands dump and rdump that allow users to
> read the files directly off the disk. Use them to recover your data.
>
> khaije rock wrote:
>
>> I can simplify this question:
>>
>> What can I do to try to recover data from a problematic ocfs2 filesystem?
>>
>> For example, would I get any traction if I build tools from upstream
>> sources?
>>
>> Thanks all!
>>
>> ---------- Forwarded message ----------
>> From: *khaije rock* <khaije1 at gmail.com <mailto:khaije1 at gmail.com>>
>> Date: Mon, May 25, 2009 at 8:06 AM
>> Subject: fsck fails & volume mount fails, is my data lost?
>> To: ocfs2-users at oss.oracle.com <mailto:ocfs2-users at oss.oracle.com>
>>
>>
>> Hi,
>>
>> I hope it's appropriate for me to post my issue to this list. Thanks in
>> advance for any help!
>>
>> I don't know exactly what the underlying cause is, but here is what it
>> looks like:
>>  - mount the filesystem
>>  - cd into the directory with no errors, however
>>  - the shell seizes when I attempt to 'ls' or interact with any data in
>> any way.
>>
>> I've found that when running fsck.ocfs2 against the block device (it's a
>> logical volume using LVM), it completes successfully and reports the
>> following:
>>
>> khaije at chronovore:~$ sudo fsck
>> /dev/vg.chronovore/lv.medea.share._multimedia_store
>> fsck 1.41.3 (12-Oct-2008)
>> Checking OCFS2 filesystem in
>> /dev/vg.chronovore/lv.medea.share._multimedia_store:
>>  label:              lv.medea.share._multimedia_store
>>  uuid:               28 f3 65 1c 1d 04 4e 28 af f0 37 7f 30 13 fc 38
>>  number of blocks:   65536000
>>  bytes per block:    4096
>>  number of clusters: 65536000
>>  bytes per cluster:  4096
>>  max slots:          4
>>
>> o2fsck_should_replay_journals:564 | slot 0 JOURNAL_DIRTY_FL: 1
>> o2fsck_should_replay_journals:564 | slot 1 JOURNAL_DIRTY_FL: 0
>> o2fsck_should_replay_journals:564 | slot 2 JOURNAL_DIRTY_FL: 0
>> o2fsck_should_replay_journals:564 | slot 3 JOURNAL_DIRTY_FL: 0
>>
>> /dev/vg.chronovore/lv.medea.share._multimedia_store is clean.  It will be
>> checked after 20 additional mounts.
>>
>>
>> The command prints this output and returns control to the shell. As you
>> can see, it indicates that the 'journal dirty' flag is set for slot 0, which
>> is the host machine. You'll notice that immediately after stating that the
>> journal is dirty, it says that the filesystem is clean.
>>
>> In order to try to make the filesystem usable, I ran fsck.ocfs2 with the
>> -fvv flags. This process never fully completes: after several minutes of
>> happily chugging along, it seizes. One of the last blocks of output
>> generated has this to say:
>>
>> o2fsck_verify_inode_fields:435 | checking inode 14119181's fields
>> check_el:249 | depth 0 count 243 next_free 1
>> check_er:164 | cpos 0 clusters 1 blkno 14677109
>> verify_block:705 | adding dir block 14677109
>> update_inode_alloc:157 | updated inode 14119181 alloc to 1 from 1 in slot
>> 0
>> o2fsck_verify_inode_fields:435 | checking inode 14119182's fields
>> check_el:249 | depth 0 count 243 next_free 1
>> check_er:164 | cpos 0 clusters 1 blkno 14677110
>> o2fsck_mark_cluster_allocated: Internal logic faliure !! duplicate cluster
>> 14677110
>> verify_block:705 | adding dir block 14677110
>>
>> This 'Internal logic failure' seems significant, so I googled and found
>> the following passage (
>> http://oss.oracle.com/osswiki/OCFS2/DesignDocs/RemoveSlotsTunefs), which
>> seems to have some bearing on my case:
>>
>> -=-=-=-=-=-
>> Duplicate groups or missing groups
>>
>> When we relink the groups in extent_alloc and inode_alloc, there are two
>> steps: deleting from the old inode and relinking to the new inode. The
>> question is which should be carried out first, since we may panic between
>> the two steps.
>>
>>      Deleting from the old inode first: if deletion is carried out first
>> and tunefs panics, fsck.ocfs2 doesn't know that the inode and extent blocks
>> are allocated (it decides this by reading inode_alloc and extent_alloc), so
>> all of the space will be freed. This is too bad.
>>
>>      Relinking to the new inode first: if relinking is carried out first
>> and tunefs panics, the two alloc inodes now contain some duplicated chains,
>> so the error "GROUP_PARENT" is prompted every time, along with many
>> internal errors "o2fsck_mark_cluster_allocated: Internal logic failure !!
>> duplicate cluster".
>> Although this is also boring, we at least have the chain information in
>> hand, so I'd like to revise fsck.ocfs2 to fit this scenario.
>> One more thing has to be mentioned: fsck.ocfs2 will loop forever in
>> o2fsck_add_dir_block since it doesn't handle the condition
>> dbe->e_blkno == tmp_dbe->e_blkno, so we have to handle this as well.
>> =-=-=-=-=-
>>
>> Later on that page the author suggests that fsck.ocfs2 would need to be
>> modified to handle this case (which I gather hasn't happened yet); however,
>> there must be some other way to remedy this situation and recover the nearly
>> 250 GB of data I have on this share?
>>
>> Can anyone help?
>>
>> I've tried copying to a new partition by using debugfs.ocfs2 but I'm not
>> sure if I'm doing it right or if there is a more sensible approach to try.
>>
>> Thanks all,
>> Nick
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> Ocfs2-users at oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>
>
>