[Ocfs2-users] OCFS2 Error: Group Descriptor Mismatch

Tue Mar 17 04:37:29 PDT 2009

Hello,

We recently ran into an issue on another of our OCFS2 clusters, where OCFS2 detected on-disk corruption. The filesystem in question has a capacity of 641GB, and I was attempting to remove about 500GB of data, roughly 20,000 files.

I started the 'rm -rf' on host2. Shortly after that, the filesystem was automatically remounted read-only and the following errors were logged:

Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_commit_truncate:6490 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_delete_inode:974 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):__ocfs2_flush_truncate_log:5111 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_free_clusters:1842 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_free_suballoc_bits:1755 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_replay_truncate_records:5039 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_truncate_for_delete:562 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_wipe_inode:733 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
Mar 15 14:31:34 host2 kernel: OCFS2: ERROR (device xvdb1): ocfs2_check_group_descriptor: Group descriptor # 12257280 has bit count 32256 but claims that 32639 are free
Mar 15 14:31:36 host2 kernel: (1796,1):__ocfs2_flush_truncate_log:5111 ERROR: status = -5
Mar 15 14:31:36 host2 kernel: (1796,1):ocfs2_free_clusters:1842 ERROR: status = -5
Mar 15 14:31:36 host2 kernel: (1796,1):ocfs2_free_suballoc_bits:1755 ERROR: status = -5
Mar 15 14:31:36 host2 kernel: (1796,1):ocfs2_replay_truncate_records:5039 ERROR: status = -5
Mar 15 14:31:36 host2 kernel: (1796,1):ocfs2_truncate_log_worker:5150 ERROR: status = -5
Mar 15 14:31:36 host2 kernel: OCFS2: ERROR (device xvdb1): ocfs2_check_group_descriptor: Group descriptor # 12257280 has bit count 32256 but claims that 32639 are free
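To make the repeated complaint easier to pick out of a long log, here is a minimal Python sketch that extracts the group descriptor errors from kernel log lines. It is only an illustration: the regular expression is modelled on the two messages quoted above, and the function name and the "excess_free" field are my own, not anything provided by OCFS2.

```python
import re

# Pattern modelled on the ocfs2_check_group_descriptor message quoted above.
# It is an assumption that other kernel versions word the message the same way.
GD_RE = re.compile(
    r"Group descriptor # (?P<block>\d+) has bit count (?P<bits>\d+) "
    r"but claims that (?P<free>\d+) are free"
)

def parse_group_descriptor_errors(lines):
    """Return one record per distinct group descriptor complaint in `lines`."""
    records = []
    for line in lines:
        match = GD_RE.search(line)
        if match is None:
            continue
        rec = {key: int(value) for key, value in match.groupdict().items()}
        # How far the claimed free count exceeds the group's bit count.
        rec["excess_free"] = rec["free"] - rec["bits"]
        if rec not in records:  # the same message is logged more than once
            records.append(rec)
    return records

sample = [
    "Mar 15 14:31:34 host2 kernel: OCFS2: ERROR (device xvdb1): "
    "ocfs2_check_group_descriptor: Group descriptor # 12257280 has bit "
    "count 32256 but claims that 32639 are free",
    "Mar 15 14:31:36 host2 kernel: OCFS2: ERROR (device xvdb1): "
    "ocfs2_check_group_descriptor: Group descriptor # 12257280 has bit "
    "count 32256 but claims that 32639 are free",
]

for rec in parse_group_descriptor_errors(sample):
    print(rec)
```

Both messages refer to the same block and the same counts, so the sketch reports a single record, with the claimed free count exceeding the bit count by 383.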

I unmounted the filesystem from all three nodes and ran 'fsck.ocfs2 -f' on it. I had to run fsck.ocfs2 twice before it reported the filesystem as clean. A snippet of the fsck output:

[GROUP_FREE_BITS] Group descriptor at block 12257280 claims to have 32639 free bits which is more than 32238 bits indicated by the bitmap. Drop its free bit count down to the total? <y> y
[CHAIN_BITS] Chain 137 in allocator inode 11 has 249271 bits marked free out of 677376 total bits but the block groups in the chain have 248870 free out of 677376 total.  Fix this by updating the chain record? <y> y
[CHAIN_GROUP_BITS] Allocator inode 11 has 109425928 bits marked used out of 167772803 total bits but the chains have 109426329 used out of 167772803 total.  Fix this by updating the inode counts <y> y
[CLUSTER_ALLOC_BIT] Cluster 12268148 is marked in the global cluster bitmap but it isn't in use.  Clear its bit in the bitmap? <y> y
[INODE_ORPHANED] Inode 15975682 was found in the orphan directory. Delete its contents and unlink it? <y> y
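As a quick sanity check on the counts in the fsck snippet above, the following Python sketch compares the recorded and recomputed figures for the three count-related findings. The numbers are copied from the output; the dictionary layout and variable names are mine, and the interpretation is only a guess.

```python
# (recorded value, value recomputed by fsck), copied from the snippet above.
# For GROUP_FREE_BITS and CHAIN_BITS these are free-bit counts.
# For CHAIN_GROUP_BITS they are used-bit counts, so the pair is given as
# (used per chains, used per inode) to express the same over-count of free bits.
findings = {
    "GROUP_FREE_BITS": (32639, 32238),
    "CHAIN_BITS": (249271, 248870),
    "CHAIN_GROUP_BITS": (109426329, 109425928),
}

deltas = {name: high - low for name, (high, low) in findings.items()}

for name, delta in deltas.items():
    print(f"{name}: off by {delta} bits")

# True when every level of the allocator is off by the same amount.
consistent = len(set(deltas.values())) == 1
print("same discrepancy at every level:", consistent)
```

All three differences come out to 401 bits, which suggests (but does not prove) that a single miscount in one block group propagated up through the chain record and the allocator inode totals.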

See the attachment for the full list of logged errors and more output from fsck.ocfs2.

Following the fsck, I remounted the filesystem on all nodes and was able to delete the remainder of the files. I ran a quick test using our application on one of the remaining files and it appeared to be intact, with no data corruption.

Our application was running during the maintenance and may have been writing some files to disk. However, the amount would have been very small compared to our busy times (I see a 493K file created at 14:34). Our monitoring graphs show a much lighter load than during normal operations, so the OCFS2 filesystem was not under any unusual load apart from the 'rm -rf'.

We are running RHEL 5.2 as Xen guests on all nodes in the cluster, with kernel 2.6.18-92.el5xen and ocfs2-2.6.18-92.el5xen-1.4.1-1.el5 installed.

I did a quick search of the mailing list for some of the errors we encountered, but couldn't find any posts that seemed to describe a similar issue.

I have a snapshot of the LUN with the original data, and I can run some tests on it if necessary and try to reproduce the problem. Note that we had another OCFS2 issue recently (http://oss.oracle.com/pipermail/ocfs2-users/2009-February/003369.html), and we're still investigating that. However, that problem was on a database cluster, which is on a different storage array, and it does not seem to be the same problem.

Has anyone seen this issue before, or does anyone have any advice on how we can troubleshoot it?

Regards,

Jari
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ocfs2-group-descriptor.txt
Url: http://oss.oracle.com/pipermail/ocfs2-users/attachments/20090317/be7354f5/attachment.txt