[Ocfs2-users] Diagnosing some OCFS2 error messages

Sunil Mushran sunil.mushran at oracle.com
Mon Jun 14 09:27:27 PDT 2010


----- bpkroth at gmail.com wrote:

> Patrick J. LoPresti <lopresti at gmail.com> 2010-06-13 19:14:
> > Hello.  I am experimenting with OCFS2 on Suse Linux Enterprise
> > Server 11 Service Pack 1.
> > 
> > I am performing various stress tests.  My current exercise involves
> > writing to files using a shared-writable mmap() from two nodes.  (Each
> > node mmaps and writes to different files; I am not trying to access
> > the same file from multiple nodes.)
> > 
> > Both nodes are logging messages like these:
> > 
> > [94355.116255] (ocfs2_wq,5995,6):ocfs2_block_check_validate:443 ERROR: CRC32 failed: stored: 2715161149, computed 575704001.  Applying ECC.
> > 
> > [94355.116344] (ocfs2_wq,5995,6):ocfs2_block_check_validate:457 ERROR: Fixed CRC32 failed: stored: 2715161149, computed 2102707465
> > 
> > [94355.116348] (ocfs2_wq,5995,6):ocfs2_validate_extent_block:903 ERROR: Checksum failed for extent block 2321665
> > 
> > [94355.116352] (ocfs2_wq,5995,6):__ocfs2_find_path:1861 ERROR: status = -5
> > 
> > [94355.116355] (ocfs2_wq,5995,6):ocfs2_find_leaf:1958 ERROR: status = -5
> > 
> > [94355.116358] (ocfs2_wq,5995,6):ocfs2_find_new_last_ext_blk:6655 ERROR: status = -5
> > 
> > [94355.116361] (ocfs2_wq,5995,6):ocfs2_do_truncate:6900 ERROR: status = -5
> > 
> > [94355.116364] (ocfs2_wq,5995,6):ocfs2_commit_truncate:7559 ERROR: status = -5
> > 
> > [94355.116370] (ocfs2_wq,5995,6):ocfs2_truncate_for_delete:597 ERROR: status = -5
> > 
> > [94355.116373] (ocfs2_wq,5995,6):ocfs2_wipe_inode:770 ERROR: status = -5
> > 
> > [94355.116376] (ocfs2_wq,5995,6):ocfs2_delete_inode:1062 ERROR: status = -5
> > 
> > 
> > ...although the particular extent block number varies somewhat.
> > 
> > In addition, when I run "fsck.ocfs2 -y -f /dev/md0", I get an I/O
> > error:
> >
> > dp-1:~ # fsck.ocfs2 -y -f /dev/md0
> > fsck.ocfs2 1.4.3
> > Checking OCFS2 filesystem in /dev/md0:
> >   Label:              <NONE>
> >   UUID:               29BB12B5AA4C449E9DDE906405F5BDE4
> >   Number of blocks:   3221225472
> >   Block size:         4096
> >   Number of clusters: 12582912
> >   Cluster size:       1048576
> >   Number of slots:    4
> > 
> > /dev/md0 was run with -f, check forced.
> > Pass 0a: Checking cluster allocation chains
> > Pass 0b: Checking inode allocation chains
> > Pass 0c: Checking extent block allocation chains
> > Pass 1: Checking inodes and blocks.
> > extent.c: I/O error on channel reading extent block at 2321665 in
> > owner 9704867 for verification
> > pass1: I/O error on channel while iterating over the blocks for
> > inode 9704867
> > fsck.ocfs2: I/O error on channel while performing pass 1
> > 
> > This looks like a straightforward I/O error, right?  The only problem
> > is that there is nothing in any log (dmesg, /var/log/messages, event
> > log on the hardware RAID) to indicate any hardware problem.  That is,
> > when fsck.ocfs2 reports this I/O error, no other errors are logged
> > anywhere as far as I can tell.  Shouldn't the kernel log a message if
> > a block device gets an I/O error?
> > 
> > I am using a pair of hardware RAID chassis accessed via iSCSI, and
> > then using Linux md (RAID-0) to stripe between them.
> > 
> > Questions:
> > 
> > 1) I would like to confirm this I/O error for myself using dd.  How do
> > I map the numbers above ("extent block at 2321665 in owner 9704867")
> > to an actual offset on the block device so I can try to read the
> > blocks by hand?
> > 
> > 2) Is there any plausible explanation for these errors other than bad
> > hardware?
> > 
> > Thanks!
> > 
> >  - Pat
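
On (1): OCFS2 metadata block numbers (both the extent block and the "owner"
inode number) are absolute block offsets from the start of the device, so the
byte offset is simply blkno * blocksize.  A minimal sketch, using the device
and block size from the fsck output above:

  # extent block 2321665 * block size 4096 = byte offset 9509539840
  # iflag=direct so the read actually hits the device, not the page cache
  dd if=/dev/md0 bs=4096 skip=2321665 count=1 iflag=direct | od -Ax -tx1 | head

If dd itself fails, the problem is below the filesystem.  If the block reads
back cleanly, then the -5 (EIO) cascade in your kernel log starts at the
failed metadata checksum (ocfs2_block_check_validate), not at the device.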
> 
> I don't believe OCFS2 can currently support any logical volume manager
> other than a simple concatenation (and even then it's with extreme
> caution).  The overhead involved in the lower software layer doing
> striping needs to somehow be coordinated among all the nodes in the
> cluster else all fs consistency guarantees provided by the SCSI layer
> are lost.

Not true. For OCFS2 to work with an LVM, the volume manager needs to
(a) be cluster-aware and (b) use the same cluster stack as the fs.

SLES11 has the Pacemaker (pcmk) cluster stack. Just configure both
OCFS2 and cLVM2 to use pcmk.
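
Roughly, that means running controld (DLM), o2cb and clvmd as cloned Pacemaker
resources and marking the volume group clustered.  A rough sketch via the crm
shell; the resource agent names here are as used in the SLES 11 HAE setup and
the vg/lv/device names are placeholders, so verify all of it against your
release before relying on it:

  # DLM, o2cb and clvmd have to run on every node that mounts the fs
  crm configure primitive dlm ocf:pacemaker:controld op monitor interval="60"
  crm configure primitive o2cb ocf:ocfs2:o2cb op monitor interval="60"
  crm configure primitive clvm ocf:lvm2:clvmd op monitor interval="60"
  crm configure group base-group dlm o2cb clvm
  crm configure clone base-clone base-group meta interleave="true"

  # clustered VG: LVM metadata updates are then coordinated through the DLM,
  # which is the coordination a plain md stripe cannot provide
  vgcreate -c y vg_shared /dev/sdX /dev/sdY
  lvcreate -i 2 -I 64 -l 100%FREE -n lv_shared vg_shared

A striped LV on that clustered VG then takes the place of the md RAID-0 layer.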

Sunil


