[Ocfs2-users] Diagnosing some OCFS2 error messages

Mon Jun 14 09:44:18 PDT 2010

----- lopresti at gmail.com wrote:

> Hello.  I am experimenting with OCFS2 on Suse Linux Enterprise Server
> 11 Service Pack 1.
> 
> I am performing various stress tests.  My current exercise involves
> writing to files using a shared-writable mmap() from two nodes.  (Each
> node mmaps and writes to different files; I am not trying to access
> the same file from multiple nodes.)
> 
> Both nodes are logging messages like these:
> 
> [94355.116255] (ocfs2_wq,5995,6):ocfs2_block_check_validate:443 ERROR:
> CRC32 failed: stored: 2715161149, computed 575704001.  Applying ECC.
> 
> [94355.116344] (ocfs2_wq,5995,6):ocfs2_block_check_validate:457 ERROR:
> Fixed CRC32 failed: stored: 2715161149, computed 2102707465
> 
> [94355.116348] (ocfs2_wq,5995,6):ocfs2_validate_extent_block:903
> ERROR: Checksum failed for extent block 2321665
> 
> [94355.116352] (ocfs2_wq,5995,6):__ocfs2_find_path:1861 ERROR: status
> = -5
> 
> [94355.116355] (ocfs2_wq,5995,6):ocfs2_find_leaf:1958 ERROR: status =
> -5
> 
> [94355.116358] (ocfs2_wq,5995,6):ocfs2_find_new_last_ext_blk:6655
> ERROR: status = -5
> 
> [94355.116361] (ocfs2_wq,5995,6):ocfs2_do_truncate:6900 ERROR: status
> = -5
> 
> [94355.116364] (ocfs2_wq,5995,6):ocfs2_commit_truncate:7559 ERROR:
> status = -5
> 
> [94355.116370] (ocfs2_wq,5995,6):ocfs2_truncate_for_delete:597 ERROR:
> status = -5
> 
> [94355.116373] (ocfs2_wq,5995,6):ocfs2_wipe_inode:770 ERROR: status =
> -5
> 
> [94355.116376] (ocfs2_wq,5995,6):ocfs2_delete_inode:1062 ERROR: status
> = -5
> 
> 
> ...although the particular extent block number varies somewhat.
> 
> In addition, when I run "fsck.ocfs2 -y -f /dev/md0", I get an I/O
> error:
> 
> dp-1:~ # fsck.ocfs2 -y -f /dev/md0
> 
> fsck.ocfs2 1.4.3
> 
> Checking OCFS2 filesystem in /dev/md0:
> 
>   Label:              <NONE>
> 
>   UUID:               29BB12B5AA4C449E9DDE906405F5BDE4
> 
>   Number of blocks:   3221225472
> 
>   Block size:         4096
> 
>   Number of clusters: 12582912
> 
>   Cluster size:       1048576
> 
>   Number of slots:    4
> 
> 
> 
> /dev/md0 was run with -f, check forced.
> 
> Pass 0a: Checking cluster allocation chains
> 
> Pass 0b: Checking inode allocation chains
> 
> Pass 0c: Checking extent block allocation chains
> 
> Pass 1: Checking inodes and blocks.
> 
> extent.c: I/O error on channel reading extent block at 2321665 in
> owner 9704867 for verification
> 
> pass1: I/O error on channel while iterating over the blocks for inode
> 9704867
> 
> fsck.ocfs2: I/O error on channel while performing pass 1
> 
> 
> 
> This looks like a straightforward I/O error, right?  The only problem
> is that there is nothing in any log (dmesg, /var/log/messages, event
> log on the hardware RAID) to indicate any hardware problem.  That is,
> when fsck.ocfs2 reports this I/O error, no other errors are logged
> anywhere as far as I can tell.  Shouldn't the kernel log a message if
> a block device gets an I/O error?
> 
> I am using a pair of hardware RAID chassis accessed via iSCSI, and
> then using Linux md (RAID-0) to stripe between them.
> 
> Questions:
> 
> 1) I would like to confirm this I/O error for myself using dd.  How do
> I map the numbers above ("extent block at 2321665 in owner 9704867")
> to an actual offset on the block device so I can try to read the
> blocks by hand?
> 
> 2) Is there any plausible explanation for these errors other than bad
> hardware?
> 
> Thanks!

Well, it is bad hardware, bad software, or both. We will need additional
information, and to collect it you will have to disable metaecc. You can
re-enable it later.

Disable metaecc
# tunefs.ocfs2 --fs-features=nometaecc /dev/sdX

File a bugzilla and attach the three outputs below. Also, just cut and
paste your email into the report.
# debugfs.ocfs2 -R "stats" /dev/sdX  >/tmp/sb.out
# debugfs.ocfs2 -R "stat <9704867>" /dev/sdX >/tmp/inode.out
# dd if=/dev/sdX of=/tmp/ext2321665 bs=4K skip=2321665 count=1
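
On mapping the extent block number to a device offset: dd counts skip= in
units of bs, so with bs=4K the block number reported by fsck.ocfs2 can be
passed directly. A minimal sketch of the arithmetic, assuming the block size
and cluster size printed in the fsck.ocfs2 header above (the variable names
are only illustrative):

```shell
# Values copied from the fsck.ocfs2 output quoted above.
block=2321665          # extent block number reported by fsck.ocfs2
block_size=4096        # "Block size" from the fsck.ocfs2 header
cluster_size=1048576   # "Cluster size" from the fsck.ocfs2 header

# Byte offset of the block from the start of the device.
offset=$((block * block_size))

# Cluster that contains the block, in case a cluster number is needed.
blocks_per_cluster=$((cluster_size / block_size))
cluster=$((block / blocks_per_cluster))

echo "byte offset: $offset"    # byte offset: 9509539840
echo "cluster:     $cluster"   # cluster:     9069
```

This needs a shell with 64-bit arithmetic, since the offset exceeds 2^32.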

Rerun fsck.
# fsck.ocfs2 -fy /dev/sdX
See if it runs clean.

Do you have a test that reproduces this problem, and can it be shared?
If so, attach that to the bz too.

Definitely log this with Novell. If there is a fix, they'll be the ones
providing it for sles11. It would be good if you can file it on
oss.oracle.com too.

Sunil