[Ocfs2-users] Diagnosing some OCFS2 error messages

Sun Jun 13 19:14:04 PDT 2010

Hello.  I am experimenting with OCFS2 on SUSE Linux Enterprise Server
11 Service Pack 1.

I am performing various stress tests.  My current exercise involves
writing to files using a shared-writable mmap() from two nodes.  (Each
node mmaps and writes to different files; I am not trying to access
the same file from multiple nodes.)

Both nodes are logging messages like these:

[94355.116255] (ocfs2_wq,5995,6):ocfs2_block_check_validate:443 ERROR: CRC32 failed: stored: 2715161149, computed 575704001.  Applying ECC.
[94355.116344] (ocfs2_wq,5995,6):ocfs2_block_check_validate:457 ERROR: Fixed CRC32 failed: stored: 2715161149, computed 2102707465
[94355.116348] (ocfs2_wq,5995,6):ocfs2_validate_extent_block:903 ERROR: Checksum failed for extent block 2321665
[94355.116352] (ocfs2_wq,5995,6):__ocfs2_find_path:1861 ERROR: status = -5
[94355.116355] (ocfs2_wq,5995,6):ocfs2_find_leaf:1958 ERROR: status = -5
[94355.116358] (ocfs2_wq,5995,6):ocfs2_find_new_last_ext_blk:6655 ERROR: status = -5
[94355.116361] (ocfs2_wq,5995,6):ocfs2_do_truncate:6900 ERROR: status = -5
[94355.116364] (ocfs2_wq,5995,6):ocfs2_commit_truncate:7559 ERROR: status = -5
[94355.116370] (ocfs2_wq,5995,6):ocfs2_truncate_for_delete:597 ERROR: status = -5
[94355.116373] (ocfs2_wq,5995,6):ocfs2_wipe_inode:770 ERROR: status = -5
[94355.116376] (ocfs2_wq,5995,6):ocfs2_delete_inode:1062 ERROR: status = -5

...although the particular extent block number varies somewhat.
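
The "status = -5" in the trailing messages is presumably a negated errno value, as is conventional for kernel return codes. A quick way to decode it, assuming the standard Linux errno numbering:

```python
import errno
import os

status = -5  # value from the "ERROR: status = -5" lines above

# Assumption: the kernel is reporting a negated errno, so flip the sign.
code = -status

# Print the symbolic name and the C library's description of the code.
print(errno.errorcode[code], "-", os.strerror(code))
```

On Linux this decodes to EIO, which matches the "I/O error" wording in the fsck.ocfs2 output.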

In addition, when I run "fsck.ocfs2 -y -f /dev/md0", I get an I/O error:

dp-1:~ # fsck.ocfs2 -y -f /dev/md0
fsck.ocfs2 1.4.3
Checking OCFS2 filesystem in /dev/md0:
  Label:              <NONE>
  UUID:               29BB12B5AA4C449E9DDE906405F5BDE4
  Number of blocks:   3221225472
  Block size:         4096
  Number of clusters: 12582912
  Cluster size:       1048576
  Number of slots:    4

/dev/md0 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
extent.c: I/O error on channel reading extent block at 2321665 in owner 9704867 for verification
pass1: I/O error on channel while iterating over the blocks for inode 9704867
fsck.ocfs2: I/O error on channel while performing pass 1

This looks like a straightforward I/O error, right?  The only problem
is that there is nothing in any log (dmesg, /var/log/messages, event
log on the hardware RAID) to indicate any hardware problem.  That is,
when fsck.ocfs2 reports this I/O error, no other errors are logged
anywhere as far as I can tell.  Shouldn't the kernel log a message if
a block device gets an I/O error?

I am using a pair of hardware RAID chassis accessed via iSCSI, and
then using Linux md (RAID-0) to stripe between them.

Questions:

1) I would like to confirm this I/O error for myself using dd.  How do
I map the numbers above ("extent block at 2321665 in owner 9704867")
to an actual offset on the block device so I can try to read the
blocks by hand?

2) Is there any plausible explanation for these errors other than bad hardware?
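
One way to attempt the mapping in question 1 is sketched below. It rests on an unconfirmed assumption: that the block number fsck.ocfs2 reports is an absolute index from the start of /dev/md0, counted in units of the 4096-byte block size shown above. The helper names block_to_offset and read_block are made up for illustration and are not part of any OCFS2 tool.

```python
import os

BLOCK_SIZE = 4096        # "Block size" from the fsck.ocfs2 output
EXTENT_BLOCK = 2321665   # block number from the error message


def block_to_offset(block_number, block_size=BLOCK_SIZE):
    """Byte offset of a block, ASSUMING blocks are numbered from device start."""
    return block_number * block_size


def read_block(device_path, block_number, block_size=BLOCK_SIZE):
    """Read one block from the device (hypothetical helper; needs read access)."""
    fd = os.open(device_path, os.O_RDONLY)
    try:
        # pread reads at an explicit offset without moving the file position.
        return os.pread(fd, block_size, block_to_offset(block_number, block_size))
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Only the arithmetic runs here; call read_block("/dev/md0", EXTENT_BLOCK)
    # by hand to attempt the actual read.
    print(block_to_offset(EXTENT_BLOCK))
```

Under that assumption the offset is 9509539840 bytes, and the equivalent dd invocation would be "dd if=/dev/md0 bs=4096 skip=2321665 count=1". Working out which of the two iSCSI devices holds that offset would additionally require the md chunk size, which is not given here.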

Thanks!

 - Pat