[Ocfs2-users] Filesystem corruption and OCFS2 errors

Christian van Barneveld c.van.barneveld at zx.nl
Sat May 16 06:55:22 PDT 2009


Hi,

Our OCFS2 cluster has been stable for approximately 8 months, but this week things went wrong. First we had high-load problems: a couple of directories had filled up with files, one of them with over 1.5 million files (symlinks), and NFS (the mounts are exported over NFS) caused high load because of that. Directory listing wasn't possible anymore.
I cleaned up the directories and after that the load became normal again and everything seemed to be fine.

But within a day our customer reported that files kept disappearing. Those files were not in the directories that I had cleaned, but scattered randomly across the filesystem. There are also files that are no longer accessible, and a read-only fsck showed some inode errors.
We have 3 OCFS2 filesystems mounted and 2 of them had problems. Last night I brought down the cluster, unmounted the filesystems and ran a filesystem check. The 2 affected filesystems reported several errors like:
[DIRENT_INODE_FREE] Directory entry 'f5377cd11ee628fe7c76c7f5b47f3bee.jpg' refers to inode number 811823124 which isn't allocated, clear the entry? <y> y
[INODE_ORPHANED] Inode 800661759 was found in the orphan directory. Delete its contents and unlink it? <y> y

I fixed the 2 filesystems which had problems and decided to also check the third filesystem, which had no problems, and after that something went terribly wrong.
The first error was like this:
[SUPERBLOCK_CLUSTERS] Superblock has clusters set to 40959872 instead of 999936 recorded in global_bitmap, it may be caused by an unsuccessful resize. Trust global_bitmap? <y>
And I think I gave the wrong answer. After that there were a lot of inode errors, and when it finished there was no data anymore! Also, after a remount the filesystem is not 2.5 TB, but 500 GB. LVM is used to create a 2.5 TB filesystem from one 2 TB LUN and one 500 GB LUN:
  VG Size               2.44 TB
But fdisk says:
Disk /dev/mapper/vg04-FS1: 485.3 GB, 485322915840 bytes

OCFS2:
  number of blocks:   118702080
  bytes per block:    4096
  number of clusters: 7418880
  bytes per cluster:  65536

After that I tried:
 tunefs.ocfs2 -S /dev/vg04/FS1
 tunefs.ocfs2 1.4.1
 tunefs.ocfs2: Cannot shrink volume size from 118702080 blocks to 118487040 blocks
 tunefs.ocfs2: Nothing to do. Exiting.
But that gave no results.
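
As a sanity check on the numbers above, the sizes can be converted to bytes and compared. This is only a sketch of the arithmetic in Python, not a repair procedure. The variable names are mine, and it assumes the 64 KiB cluster size also applies to the two cluster counts in the fsck message.

```python
# Values copied from the fdisk, fsck.ocfs2 and OCFS2 output above.
BLOCK_SIZE = 4096        # bytes per block
CLUSTER_SIZE = 65536     # bytes per cluster (assumed for the fsck counts too)

device_bytes = 485322915840      # fdisk: /dev/mapper/vg04-FS1
fs_blocks = 118702080            # OCFS2: number of blocks
fs_clusters = 7418880            # OCFS2: number of clusters
superblock_clusters = 40959872   # fsck: cluster count found in the superblock
bitmap_clusters = 999936         # fsck: cluster count found in global_bitmap

GIB = 2 ** 30
TIB = 2 ** 40

# The block count and the cluster count should describe the same size.
fs_bytes = fs_blocks * BLOCK_SIZE
assert fs_bytes == fs_clusters * CLUSTER_SIZE

# How many filesystem blocks actually fit on the device fdisk sees?
device_blocks = device_bytes // BLOCK_SIZE
excess_blocks = fs_blocks - device_blocks

superblock_bytes = superblock_clusters * CLUSTER_SIZE
bitmap_bytes = bitmap_clusters * CLUSTER_SIZE

print(f"filesystem size:        {fs_bytes / GIB:.2f} GiB ({fs_blocks} blocks)")
print(f"device size:            {device_bytes / GIB:.2f} GiB ({device_blocks} blocks)")
print(f"blocks beyond device:   {excess_blocks}")
print(f"old superblock size:    {superblock_bytes / TIB:.2f} TiB")
print(f"global_bitmap size:     {bitmap_bytes / GIB:.2f} GiB")
```

With these inputs the device holds 118487040 blocks, which is the same figure tunefs.ocfs2 prints, so the filesystem now claims 215040 blocks more than the device provides. The cluster count that was in the superblock before the fsck works out to about 2.44 TiB, which matches the VG size.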

Is there anything I can do to fix this? I have tried a lot of things, but without results. 

I also tried a new kernel (2.6.29.3), but after booting and mounting it crashed (dm-17 is NOT the corrupted third filesystem, but the second one, which no longer had problems):

May 15 23:47:31 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #664384: signature = u^P£è\¼z
May 15 23:47:31 fileserver-1 kernel:
May 15 23:47:31 fileserver-1 kernel: File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
May 15 23:47:31 fileserver-1 kernel: (14610,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:31 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:31 fileserver-1 kernel: (14613,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:31 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #659658: signature = ^Bu^SÔ¤\237\235
May 15 23:47:31 fileserver-1 kernel:
May 15 23:47:31 fileserver-1 kernel: (14606,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:31 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:32 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:32 fileserver-1 kernel: (14613,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:32 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:32 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:32 fileserver-1 kernel: (14612,1):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:32 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:32 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:32 fileserver-1 kernel: (14611,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:33 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:33 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:33 fileserver-1 kernel: (14612,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:33 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:33 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:34 fileserver-1 kernel: (14611,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:34 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #664384: signature = u^P£è\¼z
May 15 23:47:34 fileserver-1 kernel:
May 15 23:47:34 fileserver-1 kernel: (14613,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:34 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #659658: signature = ^Bu^SÔ¤\237\235
May 15 23:47:34 fileserver-1 kernel:
May 15 23:47:34 fileserver-1 kernel: (14610,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:34 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #664384: signature = u^P£è\¼z
May 15 23:47:34 fileserver-1 kernel:
May 15 23:47:34 fileserver-1 kernel: (14612,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:34 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:34 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:34 fileserver-1 kernel: (14611,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:34 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:34 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:34 fileserver-1 kernel: (14612,1):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_orphan_del:1978 ERROR: status = -2
May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_remove_inode:619 ERROR: status = -2
May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_wipe_inode:753 ERROR: status = -2
May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_delete_inode:990 ERROR: status = -2
May 16 00:28:39 fileserver-1 kernel: ocfs2_dlm: Nodes in domain ("296B7CF537094A9BA5F193A426D92440"): 0

May 16 00:40:19 fileserver-1 kernel: ------------[ cut here ]------------
May 16 00:40:19 fileserver-1 kernel: kernel BUG at fs/ocfs2/inode.c:244!
May 16 00:40:19 fileserver-1 kernel: invalid opcode: 0000 [#1] SMP
May 16 00:40:19 fileserver-1 kernel: last sysfs file: /sys/fs/o2cb/interface_revision
May 16 00:40:19 fileserver-1 kernel: Modules linked in: ocfs2 jbd2 xt_multiport nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs dm_round_robin scsi_dh_rdac dm_multipath dm_mod scsi_dh qla2xxx
May 16 00:40:19 fileserver-1 kernel:
May 16 00:40:19 fileserver-1 kernel: Pid: 14609, comm: nfsd Not tainted (2.6.29.3-amd-mods-qla2xxx-mpath-fw-cluster-hm64 #1) Sun Fire V40z
May 16 00:40:19 fileserver-1 kernel: EIP: 0060:[<fa8c2580>] EFLAGS: 00010246 CPU: 0
May 16 00:40:19 fileserver-1 kernel: EIP is at ocfs2_populate_inode+0x550/0x560 [ocfs2]
May 16 00:40:19 fileserver-1 kernel: EAX: 00000000 EBX: f49ae000 ECX: 00000000 EDX: fa9002aa
May 16 00:40:19 fileserver-1 kernel: ESI: e44eddfc EDI: f66f1000 EBP: f2821cb8 ESP: f2821c6c
May 16 00:40:19 fileserver-1 kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
May 16 00:40:19 fileserver-1 kernel: Process nfsd (pid: 14609, ti=f2820000 task=f6660080 task.ti=f2820000)
May 16 00:40:19 fileserver-1 kernel: Stack:
May 16 00:40:19 fileserver-1 kernel:  00000001 00000000 e44eda80 00000000 00000000 e44eddfc 00000001 f2821cac
May 16 00:40:19 fileserver-1 kernel:  f2821cf4 00000001 f2821cb8 00000000 00000001 f2821cac 00000000 fa8c07f0
May 16 00:40:19 fileserver-1 kernel:  f66f1000 e44eddfc 00000001 f2821d04 fa8c2b7b 00000000 f2821ce0 f3d0b0c0
May 16 00:40:19 fileserver-1 kernel: Call Trace:
May 16 00:40:19 fileserver-1 kernel:  [<fa8c07f0>] ? ocfs2_validate_inode_block+0x0/0x280 [ocfs2]
May 16 00:40:19 fileserver-1 kernel:  [<fa8c2b7b>] ? ocfs2_iget+0x5eb/0x930 [ocfs2]
May 16 00:40:19 fileserver-1 kernel:  [<fa8b708a>] ? ocfs2_get_dentry+0x9a/0x1e0 [ocfs2]
May 16 00:40:19 fileserver-1 kernel:  [<c04d80d2>] ? skb_copy_datagram_iovec+0x132/0x1d0
May 16 00:40:19 fileserver-1 kernel:  [<fa8b7277>] ? ocfs2_fh_to_dentry+0x47/0x60 [ocfs2]
May 16 00:40:19 fileserver-1 kernel:  [<c0251cc5>] ? exportfs_decode_fh+0x35/0x1f0
May 16 00:40:19 fileserver-1 kernel:  [<c02c470f>] ? security_task_setgroups+0xf/0x20
May 16 00:40:19 fileserver-1 kernel:  [<c0132de6>] ? set_groups+0x16/0x1f0
May 16 00:40:19 fileserver-1 kernel:  [<c057794d>] ? cache_check+0x2d/0x3e0
May 16 00:40:19 fileserver-1 kernel:  [<c013305a>] ? groups_alloc+0x3a/0xc0
May 16 00:40:19 fileserver-1 kernel:  [<c025babc>] ? nfsd_setuser+0x17c/0x360
May 16 00:40:19 fileserver-1 kernel:  [<c0254bca>] ? nfsd_setuser_and_check_port+0x5a/0x60
May 16 00:40:19 fileserver-1 kernel:  [<c02599c4>] ? exp_find+0x54/0x80
May 16 00:40:19 fileserver-1 kernel:  [<c0259a26>] ? rqst_exp_find+0x36/0xd0
May 16 00:40:19 fileserver-1 kernel:  [<c0254fe4>] ? fh_verify+0x414/0x650
May 16 00:40:19 fileserver-1 kernel:  [<c02556f0>] ? nfsd_acceptable+0x0/0xe0
May 16 00:40:19 fileserver-1 kernel:  [<c011fa3b>] ? default_wake_function+0xb/0x10
May 16 00:40:19 fileserver-1 kernel:  [<c057794d>] ? cache_check+0x2d/0x3e0
May 16 00:40:19 fileserver-1 kernel:  [<c025d6f9>] ? nfsd3_proc_getattr+0x69/0xe0
May 16 00:40:19 fileserver-1 kernel:  [<c025fbb0>] ? nfs3svc_decode_fhandle+0x0/0x40
May 16 00:40:19 fileserver-1 kernel:  [<c025fbb0>] ? nfs3svc_decode_fhandle+0x0/0x40
May 16 00:40:19 fileserver-1 kernel:  [<c025208a>] ? nfsd_dispatch+0x9a/0x220
May 16 00:40:19 fileserver-1 kernel:  [<c0251ff0>] ? nfsd_dispatch+0x0/0x220
May 16 00:40:19 fileserver-1 kernel:  [<c057106b>] ? svc_process+0x3eb/0x6c0
May 16 00:40:19 fileserver-1 kernel:  [<c0252746>] ? nfsd+0x136/0x240
May 16 00:40:19 fileserver-1 kernel:  [<c011c5d8>] ? complete+0x48/0x60
May 16 00:40:19 fileserver-1 kernel:  [<c0252610>] ? nfsd+0x0/0x240
May 16 00:40:19 fileserver-1 kernel:  [<c0138972>] ? kthread+0x42/0x70
May 16 00:40:19 fileserver-1 kernel:  [<c0138930>] ? kthread+0x0/0x70
May 16 00:40:19 fileserver-1 kernel:  [<c010389b>] ? kernel_thread_helper+0x7/0x1c
May 16 00:40:19 fileserver-1 kernel: Code: 8f fa 85 d2 ba 20 dc 8f fa 0f 44 c2 89 86 9c 00 00 00 e9 39 ff ff ff 83 8e 44 01 00 00 20 e9 a1 fc ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 90 8d b4 26 00 00 00 00 55 89 e5 57 56
May 16 00:40:19 fileserver-1 kernel: EIP: [<fa8c2580>] ocfs2_populate_inode+0x550/0x560 [ocfs2] SS:ESP 0068:f2821c6c
May 16 00:40:19 fileserver-1 kernel: ---[ end trace 3b05f9cfd74396a1 ]---
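
To put the "attempt to access beyond end of device" lines in the log above into perspective, the want/limit values can be converted to sizes. This is a rough sketch that assumes both values are counted in 512-byte sectors, as is usual for this block-layer message; the variable names are mine.

```python
SECTOR_SIZE = 512  # assumption: want/limit are in 512-byte sectors

want_sectors = 6635799728   # "want=" from the dm-17 log lines
limit_sectors = 419422208   # "limit=" from the dm-17 log lines

GIB = 2 ** 30
TIB = 2 ** 40

want_bytes = want_sectors * SECTOR_SIZE
limit_bytes = limit_sectors * SECTOR_SIZE

print(f"device limit:   {limit_bytes / GIB:.1f} GiB")
print(f"requested end:  {want_bytes / TIB:.2f} TiB")
print(f"ratio:          {want_sectors / limit_sectors:.1f}x the device size")
```

Under that assumption the requested offset lies far outside the device, which would fit with a corrupted inode or extent pointing at a bogus block number rather than with a slightly too small device.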

NFS with OCFS2 problems?
I went back to my previous kernel, 2.6.25.5, and it seemed to be stable. At the moment I have 2 mounted (production) filesystems and 1 unmounted, corrupted filesystem. This morning I looked in the logs and there were errors again!
Many like this:
(249,1):ocfs2_orphan_del:1869 ERROR: status = -2
(249,1):ocfs2_remove_inode:610 ERROR: status = -2
(249,1):ocfs2_wipe_inode:736 ERROR: status = -2
(249,1):ocfs2_delete_inode:970 ERROR: status = -2
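
For reference, the status values in these messages look like negated Linux errno codes. Assuming that is what they are, a small Python sketch can translate them into names (decode_status is just a helper name I made up for this):

```python
import errno
import os


def decode_status(status):
    """Return the errno name for a negative kernel-style status value."""
    return errno.errorcode[-status]


# Status values that appear in the kernel messages in this mail.
for status in (-2, -5, -22):
    print(status, decode_status(status), os.strerror(-status))
```

On Linux this maps -2 to ENOENT, -5 to EIO and -22 to EINVAL.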

These came from the 2 filesystems that seemed to be clean last night.
 
- What can I do to prevent filesystem corruption on my 2 production OCFS2 filesystems and get rid of the above errors?
- Is it possible to fix the corrupted third filesystem?
- What is the most stable kernel (or setup) in my case? For the last year I have been using 2.6.25.5. The 2.6.29.3 kernel I tried crashed after a couple of minutes.

Versions:
OS: Debian Etch (4.0)
kernel: custom 2.6.25.5

o2cb_ctl version 1.4.1
ocfs2-tools         1.4.1

OCFS2 DLM 1.5.0
OCFS2 DLMFS 1.5.0

I hope that you can help me with these problems.

Best regards,
Christian van Barneveld

