[Ocfs2-users] OCFS2 fencing

Joel Becker Joel.Becker at oracle.com
Fri Mar 13 00:04:52 PDT 2009


On Fri, Mar 13, 2009 at 01:26:55PM +0800, Tao Ma wrote:
> ramya tn wrote:
> > Feb 20 23:36:41 ImageInt1 kernel: SCSI error : <1 0 2 1> return code = 
> > 0x20000
> > Feb 20 23:36:41 ImageInt1 kernel: end_request: I/O error, dev sdc, 
> > sector 656216192
> > Feb 20 23:36:41 ImageInt1 kernel: SCSI error : <1 0 2 1> return code = 
> > 0x20000
> > Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, 
> > sector 657248384
> > Feb 20 23:36:42 ImageInt1 kernel: SCSI error : <1 0 2 1> return code = 
> > 0x20000
> > Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, 
> > sector 667312256
> > Feb 20 23:36:42 ImageInt1 kernel: SCSI error : <1 0 2 1> return code = 
> > 0x20000
> > Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, 
> > sector 670408832
> > Feb 20 23:36:42 ImageInt1 kernel: SCSI error : <1 0 2 1> return code = 
> > 0x20000
> > Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, 
> > sector 670666880
> > .
> > Feb 20 23:53:21 ImageInt1 kernel: Index 13: took 0 ms to do submit_bio 
> > for write
> > Feb 20 23:53:21 ImageInt1 kernel: Index 14: took 0 ms to do checking slots
> > Feb 20 23:53:21 ImageInt1 kernel: Index 15: took 50 ms to do waiting for 
> > write completion
> > Feb 20 23:53:21 ImageInt1 kernel: Index 16: took 1904 ms to do msleep
> > Feb 20 23:53:21 ImageInt1 kernel: Index 17: took 0 ms to do allocating 
> > bios for read
> > Feb 20 23:53:21 ImageInt1 kernel: Index 18: took 0 ms to do bio alloc read
> > Feb 20 23:53:21 ImageInt1 kernel: Index 19: took 0 ms to do bio add page 
> > read
> > Feb 20 23:53:21 ImageInt1 kernel: Index 20: took 0 ms to do submit_bio 
> > for read
> > Feb 20 23:53:21 ImageInt1 kernel: Index 21: took 44652 ms to do waiting 
> > for read completion
> > Feb 20 23:53:21 ImageInt1 kernel: Index 22: took 0 ms to do bio alloc write
> > Feb 20 23:53:21 ImageInt1 kernel: Index 23: took 0 ms to do bio add page 
> > write
> > Feb 20 23:53:21 ImageInt1 kernel: Index 0: took 0 ms to do submit_bio 
> > for write
> > Feb 20 23:53:21 ImageInt1 kernel: Index 1: took 0 ms to do checking slots
> > Feb 20 23:53:21 ImageInt1 kernel: Index 2: took 9307 ms to do waiting 
> > for write completion
> > Feb 20 23:53:21 ImageInt1 kernel: Index 3: took 0 ms to do allocating 
> > bios for read
> > Feb 20 23:53:21 ImageInt1 kernel: Index 4: took 0 ms to do bio alloc read
> > Feb 20 23:53:21 ImageInt1 kernel: Index 5: took 0 ms to do bio add page read
> > Feb 20 23:53:21 ImageInt1 kernel: Index 6: took 0 ms to do submit_bio 
> > for read
> > Feb 20 23:53:22 ImageInt1 kernel: Index 7: took 35756 ms to do waiting 
> > for read completion
> > Feb 20 23:53:22 ImageInt1 kernel: Index 8: took 0 ms to do bio alloc write
> > Feb 20 23:53:22 ImageInt1 kernel: Index 9: took 0 ms to do bio add page 
> > write
> > Feb 20 23:53:22 ImageInt1 kernel: Index 10: took 0 ms to do submit_bio 
> > for write
> > Feb 20 23:53:22 ImageInt1 kernel: Index 11: took 0 ms to do checking slots
> > Feb 20 23:53:22 ImageInt1 kernel: Index 12: took 84549 ms to do waiting 
> > for write completion
> > Feb 20 23:53:22 ImageInt1 kernel: *** ocfs2 is very sorry to be fencing 
> > this system by restarting ***
> > I found the same SCSI errors each time it fences. Can anyone suggest 
> > what could be causing these SCSI errors, and whether they are what 
> > triggers the fencing?
> I don't know the reason for the SCSI errors, so I'll just answer your 
> second question. Yes, SCSI errors can cause ocfs2 to fence. OCFS2 needs 
> to heartbeat to the disk, so if it tries many times and still fails to 
> write because of the SCSI errors, it will fence itself.

	Like Tao says, if ocfs2 can't read or write the disk in a timely
fashion, it will fence.  I think there's an issue with your storage.
	That second hunk of log messages shows some I/Os taking 85
seconds (84549ms) to complete.  Your heartbeat timeouts are probably
shorter than that, and so ocfs2 eventually has to give up.
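
	To pick the slow steps out of a longer log, something like the
following might help. It is only a sketch: it assumes the raw syslog
lines (not re-wrapped the way mail wraps them), and ASSUMED_TIMEOUT_MS
is a made-up placeholder, not your cluster's real heartbeat timeout.

```python
import re

# Matches heartbeat timing lines such as:
#   "... kernel: Index 12: took 84549 ms to do waiting for write completion"
TIMING_RE = re.compile(r"Index (\d+): took (\d+) ms to do (.+)")


def slow_steps(log_text, timeout_ms):
    """Return (index, ms, step) for each I/O wait slower than timeout_ms."""
    hits = []
    for match in TIMING_RE.finditer(log_text):
        index = int(match.group(1))
        ms = int(match.group(2))
        step = match.group(3).strip()
        # Only the "waiting for ... completion" steps measure disk I/O;
        # msleep is a deliberate pause, so it is skipped.
        if step.startswith("waiting for") and ms > timeout_ms:
            hits.append((index, ms, step))
    return hits


# A few lines from the log above, unwrapped.
SAMPLE_TIMINGS = """\
Feb 20 23:53:21 ImageInt1 kernel: Index 16: took 1904 ms to do msleep
Feb 20 23:53:21 ImageInt1 kernel: Index 21: took 44652 ms to do waiting for read completion
Feb 20 23:53:21 ImageInt1 kernel: Index 2: took 9307 ms to do waiting for write completion
Feb 20 23:53:22 ImageInt1 kernel: Index 7: took 35756 ms to do waiting for read completion
Feb 20 23:53:22 ImageInt1 kernel: Index 12: took 84549 ms to do waiting for write completion
"""

# Hypothetical value for illustration; substitute your configured timeout.
ASSUMED_TIMEOUT_MS = 12000

for index, ms, step in slow_steps(SAMPLE_TIMINGS, ASSUMED_TIMEOUT_MS):
    print(f"Index {index}: {ms / 1000:.1f} s spent {step}")
```

	Anything it prints is an I/O that outlasted the assumed timeout.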
	The earlier log messages, about I/O errors for sdc, are even
more worrying.  Those are I/Os that failed.  I would check your I/O
topology.  Is it an overloaded SAN?  Is it iSCSI without enough
throughput?  Do you just have a failing disk?
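
	As a first step in checking, it can help to see which devices
the failed I/Os land on. A rough sketch, again assuming unwrapped
syslog lines in the format quoted above (the function name is just
for illustration):

```python
import re

# Matches block-layer failures such as:
#   "... kernel: end_request: I/O error, dev sdc, sector 656216192"
IO_ERROR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def io_errors_by_device(log_text):
    """Count failed I/O requests per block device."""
    counts = {}
    for match in IO_ERROR_RE.finditer(log_text):
        device = match.group(1)
        counts[device] = counts.get(device, 0) + 1
    return counts


# Three of the failures from the log above, unwrapped.
SAMPLE_ERRORS = """\
Feb 20 23:36:41 ImageInt1 kernel: end_request: I/O error, dev sdc, sector 656216192
Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, sector 657248384
Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, sector 667312256
"""

for device, count in sorted(io_errors_by_device(SAMPLE_ERRORS).items()):
    print(f"{device}: {count} failed I/Os")
```

	If every failure is on one device, that points at that disk or
its path; failures spread across devices would point more at the
shared fabric. That is a rule of thumb, not a diagnosis.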

Joel

-- 

You can use a screwdriver to screw in screws or to clean your ears,
however, the latter needs real skill, determination and a lack of fear
of injuring yourself.  It is much the same with JavaScript.
	- Chris Heilmann

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127


