[Ocfs2-users] OCFS2 v1.8 on VMware VMs global heartbeat woes

Jon Norris jon_norris at apple.com
Wed Nov 12 18:26:51 PST 2014


Running two VMs on ESXi 5.1.0 and trying to get global heartbeat (HB) working, with no luck (I am on about my 20th rebuild and redo).

Environment:

Two VMware based VMs running

# cat /etc/oracle-release

Oracle Linux Server release 6.5

# uname -r

2.6.32-400.36.8.el6uek.x86_64

# yum list installed  | grep ocfs

ocfs2-tools.x86_64               1.8.0-11.el6           @oel-latest 

# yum list installed | grep uek

kernel-uek.x86_64                2.6.32-400.36.8.el6uek @oel-latest             
kernel-uek-firmware.noarch       2.6.32-400.36.8.el6uek @oel-latest             
kernel-uek-headers.x86_64        2.6.32-400.36.8.el6uek @oel-latest    

Configuration:

The shared data stores (HB and mounted OCFS2) are set up in a similar way to what VMware and Oracle describe for shared, VMware-based RAC data stores. The blogs, wikis, and VMware KB docs I have found all show a similar setup: VM shared SCSI settings [multi-writer], shared disks [independent + persistent], etc., for example:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1034165

Both VMs can see the devices in the OS. I have used the same configuration to run an OCFS2 setup with local heartbeat, and that works fine (the cluster starts up and the OCFS2 filesystem mounts with no issues).

I followed procedures similar to those shown in the Oracle docs and blog, https://docs.oracle.com/cd/E37670_01/E37355/html/ol_instcfg_ocfs2.html and https://blogs.oracle.com/wim/entry/ocfs2_global_heartbeat, with no luck.

The shared SCSI controllers are VMware paravirtual and set to SCSI bus sharing "none", as suggested by the VMware RAC shared-disk KB mentioned above.
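For reference, the relevant per-disk entries in each VM's .vmx end up looking roughly like the sketch below (the controller/disk numbers and vmdk name are placeholders, not copied from my actual files; the shared vmdks themselves are eager-zeroed thick as the KB requires):

scsi1.present = "TRUE"
scsi1.virtualDev = "pvscsi"
scsi1.sharedBus = "none"
scsi1:0.present = "TRUE"
scsi1:0.fileName = "shared_ocfs2_disk1.vmdk"
scsi1:0.mode = "independent-persistent"
scsi1:0.sharing = "multi-writer"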

After the shared devices have been added to both VMs and are visible in the OS on each (ls /dev/sd* shows them on both), I format the global HB devices from one VM, roughly as follows:

# mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 --cluster-name=test --cluster-stack=o2cb --global-heartbeat /dev/sdc
# mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol2 --cluster-name=test --cluster-stack=o2cb --global-heartbeat /dev/sdd

Running the following on either VM then shows:

# mounted.ocfs2 -d

Device    Stack  Cluster  F  UUID                              Label
/dev/sdc  o2cb   test     G  5620F19D43D840C7A46523019AE15A96  ocfs2vol1
/dev/sdd  o2cb   test     G  9B9182279ABD4FD99F695F91488C94C1  ocfs2vol2

I then add the global HB devices to the OCFS2 cluster config file with commands like:

# o2cb add-heartbeat test 5620F19D43D840C7A46523019AE15A96
# o2cb add-heartbeat test 9B9182279ABD4FD99F695F91488C94C1

So far, so good (heh, though all we have done is format OCFS2 with some options and update a text file). Then I do the following:

# o2cb heartbeat-mode test global

All of this is done on one node in the cluster; I then copy the resulting file to the other node (hostnames are changed here, but the actual names match the output of the hostname command on each node):

# cat /etc/ocfs2/cluster.conf 

node:
	name = clusterhost1.mydomain.com
	cluster = test
	number = 0
	ip_address = 10.143.144.12
	ip_port = 7777

node:
	name = clusterhost2.mydomain.com
	cluster = test
	number = 1
	ip_address = 10.143.144.13
	ip_port = 7777

cluster:
	name = test
	heartbeat_mode = global
	node_count = 2

heartbeat:
	cluster = test
	region = 5620F19D43D840C7A46523019AE15A96

heartbeat:
	cluster = test
	region = 9B9182279ABD4FD99F695F91488C94C1

The same config works fine with heartbeat_mode set to local and the global heartbeat devices removed, and I can mount a shared FS. The HB interfaces are IPv4 on a private, non-routed L2 VLAN; they are up, and each node can ping the other.
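(For completeness, the mount that works in local heartbeat mode is just a plain OCFS2 mount; the device and mount point below are only illustrative:)

# mount -t ocfs2 /dev/sdc /mnt/ocfs2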

The config is copied to each node, and on both I have already run:

# service o2cb configure

which completes fine (as it does for local heartbeat mode), so the cluster will start on boot and the timeout parameters etc. are left at their defaults.
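For reference, the resulting /etc/sysconfig/o2cb looks something like the following (these should be the defaults, as best I can tell; exact variables/values may differ between ocfs2-tools versions):

O2CB_ENABLED=true
O2CB_STACK=o2cb
O2CB_BOOTCLUSTER=test
O2CB_HEARTBEAT_THRESHOLD=31
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000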

I check that the service on both nodes unloads and loads modules with no issues:

# service o2cb unload

Clean userdlm domains: OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unloading module "ocfs2_stack_o2cb": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK

# service o2cb load

Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading stack plugin "o2cb": OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK

# mount -v
…
….
debugfs on /sys/kernel/debug type debugfs (rw)
….
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)

#  lsmod | grep ocfs

ocfs2_dlmfs            18026  1 
ocfs2_stack_o2cb        3606  0 
ocfs2_dlm             196778  1 ocfs2_stack_o2cb
ocfs2_nodemanager     202856  3 ocfs2_dlmfs,ocfs2_stack_o2cb,ocfs2_dlm
ocfs2_stackglue        11283  2 ocfs2_dlmfs,ocfs2_stack_o2cb
configfs               25853  2 ocfs2_nodemanager

Looks good on both nodes…. then (sigh)

# service o2cb enable

Writing O2CB configuration: OK
Setting cluster stack "o2cb": OK
Registering O2CB cluster "test": Failed
o2cb: Unable to access cluster service while registering heartbeat mode 'global'
Unregistering O2CB cluster "test": OK

I have searched for the error string and come up with a big ZERO on help, and the local OS log messages are equally unhelpful:

# tail /var/log/messages

Nov 12 21:54:53 clusterhost1 o2cb.init: online test
Nov 13 00:58:38 clusterhost1 o2cb.init: online test
Nov 13 01:00:06 clusterhost1 o2cb.init: offline test 0
Nov 13 01:00:06 clusterhost1 kernel: ocfs2: Unregistered cluster interface o2cb
Nov 13 01:01:14 clusterhost1 kernel: OCFS2 Node Manager 1.6.3
Nov 13 01:01:14 clusterhost1 kernel: OCFS2 DLM 1.6.3
Nov 13 01:01:14 clusterhost1 kernel: ocfs2: Registered cluster interface o2cb
Nov 13 01:01:14 clusterhost1 kernel: OCFS2 DLMFS 1.6.3
Nov 13 01:01:14 clusterhost1 kernel: OCFS2 User DLM kernel interface loaded
Nov 13 01:03:32 clusterhost1 o2cb.init: online test

Dmesg shows the same:

# dmesg

OCFS2 Node Manager 1.6.3
OCFS2 DLM 1.6.3
ocfs2: Registered cluster interface o2cb
OCFS2 DLMFS 1.6.3
OCFS2 User DLM kernel interface loaded
Slow work thread pool: Starting up
Slow work thread pool: Ready
FS-Cache: Loaded
FS-Cache: Netfs 'nfs' registered for caching
eth0: no IPv6 routers present
eth1: no IPv6 routers present
ocfs2: Unregistered cluster interface o2cb
OCFS2 Node Manager 1.6.3
OCFS2 DLM 1.6.3
ocfs2: Registered cluster interface o2cb
OCFS2 DLMFS 1.6.3
OCFS2 User DLM kernel interface loaded
ocfs2: Unregistered cluster interface o2cb
OCFS2 Node Manager 1.6.3
OCFS2 DLM 1.6.3
ocfs2: Registered cluster interface o2cb
OCFS2 DLMFS 1.6.3
OCFS2 User DLM kernel interface loaded

The filesystem looks fine and this can be run from both hosts in the cluster:

# fsck.ocfs2 -n /dev/sdc 

fsck.ocfs2 1.8.0
Checking OCFS2 filesystem in /dev/sdc:
  Label:              ocfs2vol1
  UUID:               5620F19D43D840C7A46523019AE15A96
  Number of blocks:   524288
  Block size:         4096
  Number of clusters: 524288
  Cluster size:       4096
  Number of slots:    4

# fsck.ocfs2 -n /dev/sdd

fsck.ocfs2 1.8.0
Checking OCFS2 filesystem in /dev/sdd:
  Label:              ocfs2vol2
  UUID:               9B9182279ABD4FD99F695F91488C94C1
  Number of blocks:   524288
  Block size:         4096
  Number of clusters: 524288
  Cluster size:       4096
  Number of slots:    4

What am I missing? I have redone this and re-created the devices a few too many times (thinking I may have missed something), but I am mystified. From all outward appearances I have two VMs that can see, mount, and access a shared OCFS2 filesystem in local heartbeat mode (I have exactly that running for a cluster of rsyslog servers load balanced by an F5 LTM virtual server, with no issues). I am stumped on how to get the global HB devices set up, though I have read and re-read the user guides, troubleshooting guides, and wikis/blogs on how to make that work until my eyes hurt.

I mounted debugfs and ran the debugfs.ocfs2 utility, but I am not sure what I should be looking for there (or whether this is even the place to look for errors about the cluster not coming online).
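About the only thing I know to do there is dump the superblock of one of the heartbeat devices, e.g.:

# debugfs.ocfs2 -R "stats" /dev/sdc

but I do not know which fields (if any) would point at a heartbeat registration problem.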

As the o2cb/ocfs2 pieces are all kernel modules, I am not 100% sure how to turn up the debug output without digging into the source code and mucking around there.
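The closest thing I have found is the kernel log masks that debugfs.ocfs2 can toggle (assuming I am reading the troubleshooting docs correctly; HEARTBEAT is just the mask that looked most relevant to me):

# debugfs.ocfs2 -l
# debugfs.ocfs2 -l HEARTBEAT allow
(re-run the failing step, check /var/log/messages, then)
# debugfs.ocfs2 -l HEARTBEAT off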

Any guidance or lessons learned (or things to check) would be super :) and, if it works, will warrant a happy scream of joy from my frustrated cube!


Warm regards,

Jon