[Ocfs2-users] ocfs2 fencing on reboot of 2nd node

Thu Sep 21 17:04:10 PDT 2006

What is your O2CB_HEARTBEAT_THRESHOLD set to?

For more, refer to:
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#HEARTBEAT
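
As a quick reference, the FAQ section linked above relates the threshold to the disk heartbeat timeout as O2CB_HEARTBEAT_THRESHOLD = ((timeout in seconds) / 2) + 1, with one heartbeat iteration every 2 seconds. The sketch below only restates that arithmetic; the function names are made up for illustration and are not part of any ocfs2 tool.

```python
# Sketch of the O2CB_HEARTBEAT_THRESHOLD arithmetic described in the ocfs2 FAQ.
# Function and constant names are hypothetical; only the formula comes from the FAQ.

HEARTBEAT_INTERVAL_SECS = 2  # one disk heartbeat iteration every 2 seconds


def threshold_to_timeout_secs(threshold: int) -> int:
    """Seconds of failed heartbeat I/O tolerated before a node fences itself."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def timeout_secs_to_threshold(timeout_secs: int) -> int:
    """Smallest threshold that tolerates at least timeout_secs of I/O stall."""
    if timeout_secs < 1:
        raise ValueError("timeout must be positive")
    # Round up to a whole number of heartbeat iterations, then add one.
    iterations = -(-timeout_secs // HEARTBEAT_INTERVAL_SECS)
    return iterations + 1


if __name__ == "__main__":
    for threshold in (7, 16, 31):
        print(threshold, "->", threshold_to_timeout_secs(threshold), "seconds")
```

For example, a threshold of 31 corresponds to a 60 second timeout. The value would normally be set in /etc/sysconfig/o2cb to the same number on every node, and it should be longer than the time multipath needs to fail over.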

SRuff at fiberlink.com wrote:
>
> I'm testing ocfs2 on 2 nodes running Red Hat AS4 Update 4 (x86_64) 
> with the multipath support included in the 2.6 kernel, and am 
> running into some issues when cleanly rebooting the 2nd node while the 
> 1st node is still up.
>
> So if I do the following on the 2nd node, the 1st node does not fence 
> itself:
>
> /etc/init.d/ocfs2 stop
> /etc/init.d/o2cb stop
> wait more than 60 seconds
> init 6
>
> I get the following on the 1st node, but everything is fine:
>
> Sep 21 21:44:49 bbflgrid11 kernel: SCSI error : <0 0 0 12> return code 
> = 0x20000
> Sep 21 21:44:49 bbflgrid11 kernel: end_request: I/O error, dev sdm, 
> sector 1.
> Sep 21 21:44:49 bbflgrid11 kernel: device-mapper: dm-multipath: 
> Failing path 8:192.
> Sep 21 21:44:49 bbflgrid11 kernel: SCSI error : <0 0 0 14> return code 
> = 0x20000
> Sep 21 21:44:49 bbflgrid11 kernel: end_request: I/O error, dev sdo, 
> sector 193297
> Sep 21 21:44:49 bbflgrid11 kernel: device-mapper: dm-multipath: 
> Failing path 8:224.
> Sep 21 21:44:49 bbflgrid11 kernel: SCSI error : <0 0 0 13> return code 
> = 0x20000
> Sep 21 21:44:49 bbflgrid11 kernel: end_request: I/O error, dev sdn, 
> sector 192785
> Sep 21 21:44:49 bbflgrid11 kernel: device-mapper: dm-multipath: 
> Failing path 8:208.
> Sep 21 21:44:49 bbflgrid11 multipathd: 8:192: mark as failed
> Sep 21 21:44:49 bbflgrid11 multipathd: mpath1: remaining active paths: 1
> Sep 21 21:44:49 bbflgrid11 multipathd: 8:224: mark as failed
> Sep 21 21:44:49 bbflgrid11 multipathd: mpath3: remaining active paths: 1
> Sep 21 21:44:49 bbflgrid11 multipathd: 8:208: mark as failed
> Sep 21 21:44:49 bbflgrid11 multipathd: mpath2: remaining active paths: 1
> Sep 21 21:44:58 bbflgrid11 multipathd: 8:192: readsector0 checker 
> reports path is up
> Sep 21 21:44:58 bbflgrid11 multipathd: 8:192: reinstated
> Sep 21 21:44:58 bbflgrid11 multipathd: mpath1: remaining active paths: 2
> Sep 21 21:44:58 bbflgrid11 multipathd: 8:208: readsector0 checker 
> reports path is up
> Sep 21 21:44:58 bbflgrid11 multipathd: 8:208: reinstated
> Sep 21 21:44:58 bbflgrid11 multipathd: mpath2: remaining active paths: 2
> Sep 21 21:44:58 bbflgrid11 multipathd: 8:224: readsector0 checker 
> reports path is up
> Sep 21 21:44:58 bbflgrid11 multipathd: 8:224: reinstated
> Sep 21 21:44:58 bbflgrid11 multipathd: mpath3: remaining active paths: 2
> Sep 21 21:46:06 bbflgrid11 kernel: SCSI error : <1 0 0 11> return code 
> = 0x20000
> Sep 21 21:46:06 bbflgrid11 kernel: end_request: I/O error, dev sdaa, 
> sector 1920
> Sep 21 21:46:06 bbflgrid11 kernel: device-mapper: dm-multipath: 
> Failing path 65:160.
> Sep 21 21:46:06 bbflgrid11 multipathd: 65:160: mark as failed
> Sep 21 21:46:06 bbflgrid11 multipathd: mpath0: remaining active paths: 1
> Sep 21 21:46:06 bbflgrid11 multipathd: 65:160: readsector0 checker 
> reports path is up
> Sep 21 21:46:06 bbflgrid11 multipathd: 65:160: reinstated
> Sep 21 21:46:06 bbflgrid11 multipathd: mpath0: remaining active paths: 2
>
>
>
> Now if I do the following on the 2nd node, the 1st node fences itself 
> (same as above, except I don't wait 60 seconds after o2cb stop):
>
> /etc/init.d/ocfs2 stop
> /etc/init.d/o2cb stop
> init 6
>
> Node 1 logs the following and fences itself. I have to power cycle the 
> server to get it back; it doesn't reboot or shut down, it just hangs:
>
> Sep 21 21:28:00 bbflgrid11 kernel: SCSI error : <0 0 0 13> return code 
> = 0x20000
> Sep 21 21:28:00 bbflgrid11 kernel: end_request: I/O error, dev sdn, 
> sector 192785
> Sep 21 21:28:00 bbflgrid11 kernel: device-mapper: dm-multipath: 
> Failing path 8:208.
> Sep 21 21:28:00 bbflgrid11 multipathd: 8:208: mark as failed
> Sep 21 21:28:00 bbflgrid11 multipathd: mpath2: remaining active paths: 1
> Sep 21 21:28:00 bbflgrid11 kernel: SCSI error : <1 0 0 12> return code 
> = 0x20000
> Sep 21 21:28:00 bbflgrid11 kernel: end_request: I/O error, dev sdab, 
> sector 192784
> Sep 21 21:28:00 bbflgrid11 kernel: end_request: I/O error, dev sdab, 
> sector 192786
> Sep 21 21:28:00 bbflgrid11 kernel: device-mapper: dm-multipath: 
> Failing path 65:176.
> Sep 21 21:28:00 bbflgrid11 kernel: SCSI error : <1 0 0 13> return code 
> = 0x20000
> Sep 21 21:28:00 bbflgrid11 kernel: end_request: I/O error, dev sdac, 
> sector 192785
> Sep 21 21:28:00 bbflgrid11 kernel: device-mapper: dm-multipath: 
> Failing path 65:192.
> Sep 21 21:28:00 bbflgrid11 multipathd: 65:176: mark as failed
> Sep 21 21:28:00 bbflgrid11 multipathd: mpath1: remaining active paths: 1
> Sep 21 21:28:01 bbflgrid11 multipathd: 65:192: mark as failed
> Sep 21 21:28:01 bbflgrid11 multipathd: mpath2: remaining active paths: 0
> Sep 21 21:28:01 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: 
> IO Error -5
> Sep 21 21:28:01 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 
> ERROR: status = -5
> Sep 21 21:28:01 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: 
> IO Error -5
> Sep 21 21:28:01 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 
> ERROR: status = -5
> Sep 21 21:28:01 bbflgrid11 multipathd: 65:176: readsector0 checker 
> reports path is up
> Sep 21 21:28:01 bbflgrid11 multipathd: 65:176: reinstated
> Sep 21 21:28:01 bbflgrid11 multipathd: mpath1: remaining active paths: 2
> Sep 21 21:28:03 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: 
> IO Error -5
> Sep 21 21:28:03 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 
> ERROR: status = -5
> Sep 21 21:28:03 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: 
> IO Error -5
> Sep 21 21:28:03 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 
> ERROR: status = -5
> Sep 21 21:28:05 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: 
> IO Error -5
> Sep 21 21:28:05 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 
> ERROR: status = -5
> Sep 21 21:28:05 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: 
> IO Error -5
> Sep 21 21:28:05 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 
> ERROR: status = -5
> Sep 21 21:28:07 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: 
> IO Error -5
> Sep 21 21:28:07 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 
> ERROR: status = -5
> Sep 21 21:28:07 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: 
> IO Error -5
> Sep 21 21:28:07 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 
> ERROR: status = -5
> Sep 21 21:28:09 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: 
> IO Error -5
> Sep 21 21:28:09 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 
> ERROR: status = -5
> Sep 21 21:28:09 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: 
> IO Error -5
> Sep 21 21:28:09 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 
> ERROR: status = -5
> Sep 21 21:28:09 bbflgrid11 multipathd: 8:208: readsector0 checker 
> reports path is up
> Sep 21 21:28:09 bbflgrid11 multipathd: 8:208: reinstated
> Sep 21 21:28:09 bbflgrid11 multipathd: mpath2: remaining active paths: 1
> Sep 21 21:28:10 bbflgrid11 multipathd: 65:192: readsector0 checker 
> reports path is up
> Sep 21 21:28:10 bbflgrid11 multipathd: 65:192: reinstated
> Sep 21 21:28:10 bbflgrid11 multipathd: mpath2: remaining active paths: 2
>
>
> ...
> Index 14: took 0 ms to do submit_bio for read
> Index 15: took 0 ms to do waiting for read completion
> (11,1):o2hb_stop_all_regions:1908 ERROR: stopping heartbeat on all 
> active regions
> Kernel panic - not syncing:  ocfs2 is very sorry to be fencing this 
> system by panicing
>
>
> It seems that if I wait for node 1 to heartbeat to node 2 with o2cb 
> down before rebooting, it's fine, but if I reboot before node 1 has 
> had a chance to heartbeat to node 2 with o2cb down, it panics.
>
>
>
> Shawn E. Ruff
> Senior Oracle DBA
> Fiberlink Communications
>
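
One way to compare the quoted logs against the heartbeat timeout is to measure how long each multipath map sat at zero active paths. The following is a rough sketch, not a supported tool: it assumes syslog lines shaped like the multipathd "remaining active paths" messages quoted above, assumes the year 2006 (syslog timestamps carry no year), and uses a helper name invented for this example.

```python
# Sketch: find windows where a multipath map reported 0 active paths.
# Assumes the multipathd syslog format seen in the quoted logs.
import re
from datetime import datetime

LOG_YEAR = "2006"  # assumption: syslog lines omit the year

PATHS_RE = re.compile(
    r"(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+multipathd:\s+"
    r"(\S+):\s+remaining active paths:\s+(\d+)"
)


def zero_path_windows(lines):
    """Return (map, start, end, seconds) for each span with no active paths."""
    down_since = {}
    windows = []
    for line in lines:
        match = PATHS_RE.search(line)
        if not match:
            continue  # not a "remaining active paths" message
        stamp = datetime.strptime(
            LOG_YEAR + " " + match.group(1), "%Y %b %d %H:%M:%S"
        )
        name, paths = match.group(2), int(match.group(3))
        if paths == 0:
            # Remember the first moment this map lost its last path.
            down_since.setdefault(name, stamp)
        elif name in down_since:
            start = down_since.pop(name)
            windows.append((
                name,
                start.strftime("%H:%M:%S"),
                stamp.strftime("%H:%M:%S"),
                int((stamp - start).total_seconds()),
            ))
    return windows


if __name__ == "__main__":
    sample = [
        "> Sep 21 21:28:00 bbflgrid11 multipathd: mpath2: remaining active paths: 1",
        "> Sep 21 21:28:01 bbflgrid11 multipathd: mpath2: remaining active paths: 0",
        "> Sep 21 21:28:09 bbflgrid11 multipathd: mpath2: remaining active paths: 1",
    ]
    for window in zero_path_windows(sample):
        print(window)
```

On the second log above, this reports mpath2 without any active path from 21:28:01 to 21:28:09, about 8 seconds, which is the same period in which the o2hb I/O errors appear. The first log never shows a map dropping to 0 paths.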