[Ocfs2-users] server reboots due to heartbeat error - ocfs node

Thu Jun 28 12:18:37 PDT 2012

I have been trying to submit this message to the group for 3 days now.  I hope it works this time around.  Any help would be appreciated.

I am troubleshooting an issue with my development RAC server where node 1 reboots due to a heartbeat timeout.  I am getting multiple errors from ocfs2 and am wondering whether my configuration or my procedure is wrong.  The issue comes up when I clone the OCFS lun from the production system to a separate lun used for development.  I take the standard precautions, yet node 1 of the 2-node development cluster still reboots.

System design:
2 ocfs2 clusters (prod and development)
production = 3 nodes, development = 2 nodes
All systems are running Oracle Linux Server release 5.8
kernel: [root@node1 ~]# uname -a
Linux node1.xxxxx.com 2.6.18-308.1.1.0.1.el5 #1 SMP Wed Mar 7 11:39:17 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
OCFS release ocfs2-2.6.18-308.1.1.0.1.el5-1.4.9-1.el5

cluster node setup on prod and development clusters.
Unique cluster names for each.
Nodes in the production cluster are numbered 1,2,3
Nodes in the development cluster are numbered 1,2
QUESTION 1:  Should the node numbers be unique across both clusters, since I am cloning a LUN between them?  See the error below about another node heartbeating in the same slot.

Procedure:
Shut down all processes using the luns to be cloned.
Unmount the luns to be cloned on both development nodes (devnode1, devnode2).
Synchronize the clone to the production lun on the SAN (EMC CLARiiON), then fracture the luns.
On devnode2 I run the following commands:
fsck.ocfs2 (fix errors)
tunefs.ocfs2 --label=dev.index /dev/path (set new label)
tunefs.ocfs2 --uuid-reset /dev/path (set random uuid)
On devnode1 I run:
sfdisk -R /dev/path (re-read partitions to grab new label and uuid)
Then I mount the volumes onto both nodes
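
Written out as a script, the node-side steps above would look roughly like the sketch below. It uses only the commands already listed; the `run` helper, the `DRY_RUN` switch and the `DEV`/`LABEL` variables are placeholders added for illustration, and `/dev/path` stands in for the real device path. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the post-clone steps. DEV and LABEL are placeholders;
# /dev/path is the elided device path, not a real one.
DEV="${DEV:-/dev/path}"
LABEL="${LABEL:-dev.index}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, anything else = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On devnode2: check the cloned volume, then give it its own label and UUID.
refresh_clone() {
    run fsck.ocfs2 "$DEV"
    run tunefs.ocfs2 --label="$LABEL" "$DEV"
    run tunefs.ocfs2 --uuid-reset "$DEV"
}

# On devnode1: re-read the partition table to pick up the new label and UUID.
reread_partitions() {
    run sfdisk -R "$DEV"
}

refresh_clone
reread_partitions
```

Run with the defaults, this prints the four commands in order, each prefixed with `+ `. Setting `DRY_RUN=0` would execute them, which only makes sense once `DEV` points at the real cloned device.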

Errors:
Both nodes constantly report the same error on the cloned luns.
kernel: (o2hb-D34207AE9F,4086,16):o2hb_do_disk_heartbeat:781 ERROR: Device "emcpowerj1": another node is heartbeating in our slot!
However, the error above does not cause any instability.
After I unmount the luns and start the clone, I get the following error for a few minutes:
(MpxTestDaemon  ,14513,8):o2hb_bio_end_io:241 ERROR: IO Error -5
kernel: (o2hb-023EBFE1B5,3945,8):o2hb_do_disk_heartbeat:772 ERROR: status = -5
After 3 minutes the system gets rebooted, the logs show the following:
kernel: (events/8,70,8):o2hb_write_timeout:176 ERROR: Heartbeat write timeout to device emcpowerl1 after 150000 milliseconds
(events/8,70,8):o2hb_stop_all_regions:2026 ERROR: stopping heartbeat on all active regions.
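
As an aid for reading the timeout line above, here is a small sketch that pulls the device name and timeout out of the log message and converts the timeout to the corresponding O2CB_HEARTBEAT_THRESHOLD value. The conversion assumes the rule from the OCFS2 documentation as I understand it, timeout in seconds = (threshold - 1) * 2; the function names are made up for this example.

```shell
#!/bin/sh
# Extract "<device> <timeout_ms>" from an o2hb_write_timeout log line on stdin.
parse_o2hb_timeout() {
    sed -n 's/.*Heartbeat write timeout to device \([^ ]*\) after \([0-9]*\) milliseconds.*/\1 \2/p'
}

# Convert a timeout in milliseconds to a heartbeat threshold.
# Assumption: timeout_seconds = (threshold - 1) * 2.
threshold_for_ms() {
    echo $(( $1 / 2000 + 1 ))
}

# The log line quoted above, used as sample input.
line='kernel: (events/8,70,8):o2hb_write_timeout:176 ERROR: Heartbeat write timeout to device emcpowerl1 after 150000 milliseconds'

printf '%s\n' "$line" | parse_o2hb_timeout | while read -r dev ms; do
    echo "device=$dev timeout_ms=$ms threshold=$(threshold_for_ms "$ms")"
done
```

For the line above this prints `device=emcpowerl1 timeout_ms=150000 threshold=76`. Note that the device that timed out (emcpowerl1) is not the one reporting the heartbeat slot error (emcpowerj1).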

QUESTION 2:  Do I need to stop the heartbeat on the unmounted luns before the SAN unpresents them from the server?  I found a command as follows:
ocfs2_hb_ctl -K -d /dev/device
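
If stopping the heartbeat does turn out to be necessary, applying that command to each unmounted device before the clone starts might look like the sketch below. This is untested: `DEVICES`, `DRY_RUN` and the `run` helper are placeholders for illustration, `/dev/device` stands in for the real paths, and the `-I` query (print heartbeat reference information) is used as I understand it from the ocfs2_hb_ctl man page. By default the script only prints the commands.

```shell
#!/bin/sh
# DEVICES is a placeholder, space-separated list of unmounted cloned devices.
DEVICES="${DEVICES:-/dev/device}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, anything else = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# For each device, show the heartbeat reference information first,
# then stop the heartbeat on that device.
stop_heartbeats() {
    for dev in $DEVICES; do
        run ocfs2_hb_ctl -I -d "$dev"
        run ocfs2_hb_ctl -K -d "$dev"
    done
}

stop_heartbeats
```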

QUESTION 3:  Am I doing anything else wrong in my procedure that would be causing the heartbeat issue and server reboot?

Thanks in advance for your time and replies.