[Ocfs2-users] OCFS2 v1.8 on VMware VMs global heartbeat woes
Jon Norris
jon_norris at apple.com
Wed Nov 12 18:26:51 PST 2014
Running two VMs on ESXi 5.1.0 and trying to get global heartbeat (HB) working, with no luck (I'm on about my 20th rebuild and redo).
Environment:
Two VMware-based VMs running:
# cat /etc/oracle-release
Oracle Linux Server release 6.5
# uname -r
2.6.32-400.36.8.el6uek.x86_64
# yum list installed | grep ocfs
ocfs2-tools.x86_64 1.8.0-11.el6 @oel-latest
# yum list installed | grep uek
kernel-uek.x86_64 2.6.32-400.36.8.el6uek @oel-latest
kernel-uek-firmware.noarch 2.6.32-400.36.8.el6uek @oel-latest
kernel-uek-headers.x86_64 2.6.32-400.36.8.el6uek @oel-latest
Configuration:
The shared data stores (HB and mounted OCFS2) are set up as described by VMware and Oracle for shared RAC data stores on VMware. All the blogs, wikis, and VMware KB docs show a similar setup: VM shared SCSI settings [multi-writer], shared disk [independent + persistent], etc., for example:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1034165
The devices are visible to both VMs in the OS. I have used the same configuration to run an OCFS2 setup with local heartbeat, and that works fine (the cluster starts up and the OCFS2 file system mounts with no issues).
I followed procedures similar to those shown in the Oracle docs and blog, https://docs.oracle.com/cd/E37670_01/E37355/html/ol_instcfg_ocfs2.html and https://blogs.oracle.com/wim/entry/ocfs2_global_heartbeat, with no luck.
The shared SCSI controllers are VMware paravirtual and set to “shared none” as suggested by the VMware RAC shared-disk KB mentioned above.
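For reference, my understanding of the per-disk .vmx entries that KB calls for looks roughly like this (controller and disk indices, datastore path, and vmdk name are illustrative placeholders, not my actual values):

```
# Hypothetical .vmx fragment for one shared disk on a paravirtual controller
scsi1.present = "TRUE"
scsi1.virtualDev = "pvscsi"
scsi1.sharedBus = "none"
scsi1:0.present = "TRUE"
scsi1:0.fileName = "/vmfs/volumes/shared-ds/ocfs2-hb.vmdk"
scsi1:0.sharing = "multi-writer"
scsi1:0.mode = "independent-persistent"
```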
After the shared Linux devices have been added to both VMs and are visible in the OS on each (ls /dev/sd* shows the devices), I format the global HB devices from one VM, similar to the following:
# mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 --cluster-name=test --cluster-stack=o2cb --global-heartbeat /dev/sdc
# mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol2 --cluster-name=test --cluster-stack=o2cb --global-heartbeat /dev/sdd
From both VMs you can run the following and see:
# mounted.ocfs2 -d
Device Stack Cluster F UUID Label
/dev/sdc o2cb test G 5620F19D43D840C7A46523019AE15A96 ocfs2vol1
/dev/sdd o2cb test G 9B9182279ABD4FD99F695F91488C94C1 ocfs2vol2
I then add the global HB devices to the ocfs config file with similar commands:
# o2cb add-heartbeat test 5620F19D43D840C7A46523019AE15A96
# o2cb add-heartbeat test 9B9182279ABD4FD99F695F91488C94C1
Thus far looking good (heh, but then all we’ve done is format ocfs2 with options and updated a text file) - then I do the following:
# o2cb heartbeat-mode test global
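At this point the config can at least be read back to confirm the file matches expectations; o2cb(8) in ocfs2-tools 1.8 has list subcommands for this (the expected results below are my assumption of what should appear, not output captured from my nodes):

```
# o2cb list-clusters          (should print: test)
# o2cb list-heartbeats test   (should show both region UUIDs)
# o2cb list-nodes test        (should show both nodes, once their stanzas are in place)
```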
All this is done on one node in the cluster; I then copy the following to the other node (hostnames changed here, though the actual name matches the output of the hostname command on each node):
# cat /etc/ocfs2/cluster.conf
node:
	name = clusterhost1.mydomain.com
	cluster = test
	number = 0
	ip_address = 10.143.144.12
	ip_port = 7777
node:
	name = clusterhost2.mydomain.com
	cluster = test
	number = 1
	ip_address = 10.143.144.13
	ip_port = 7777
cluster:
	name = test
	heartbeat_mode = global
	node_count = 2
heartbeat:
	cluster = test
	region = 5620F19D43D840C7A46523019AE15A96
heartbeat:
	cluster = test
	region = 9B9182279ABD4FD99F695F91488C94C1
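As a quick sanity check on the file itself (my own ad-hoc script, not an ocfs2 tool), the stanza counts can be compared against node_count; the inline copy of the conf is just so the snippet is self-contained:

```shell
#!/bin/sh
# Ad-hoc consistency check on cluster.conf (not an ocfs2 utility).
# Writes an inline copy of the conf so this runs standalone; on a real
# node, point CONF at /etc/ocfs2/cluster.conf instead.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
node:
	name = clusterhost1.mydomain.com
	cluster = test
	number = 0
node:
	name = clusterhost2.mydomain.com
	cluster = test
	number = 1
cluster:
	name = test
	heartbeat_mode = global
	node_count = 2
heartbeat:
	cluster = test
	region = 5620F19D43D840C7A46523019AE15A96
heartbeat:
	cluster = test
	region = 9B9182279ABD4FD99F695F91488C94C1
EOF
# Stanza headers sit at column 0, so count them directly.
nodes=$(grep -c '^node:' "$CONF")
count=$(awk -F= '/node_count/ {gsub(/[[:space:]]/,"",$2); print $2}' "$CONF")
hbs=$(grep -c '^heartbeat:' "$CONF")
echo "node_stanzas=$nodes node_count=$count heartbeat_stanzas=$hbs"
```

With two node stanzas, node_count = 2, and two heartbeat regions, all three numbers should line up.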
The same config works fine with heartbeat_mode set to local and the global heartbeat devices removed, and I can mount a shared FS. The local HB interfaces are IPv4 on a private, non-routed L2 VLAN; they are up, and the nodes can ping each other.
Once the config is copied to each node, I have already run:
# service o2cb configure
which completes fine in local heartbeat mode, so the cluster starts on boot and the timeout parameters etc. are at their defaults.
I check that the service on both nodes unloads and loads modules with no issues:
# service o2cb unload
Clean userdlm domains: OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unloading module "ocfs2_stack_o2cb": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK
# service o2cb load
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading stack plugin "o2cb": OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
# mount -v
…
….
debugfs on /sys/kernel/debug type debugfs (rw)
….
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
# lsmod | grep ocfs
ocfs2_dlmfs 18026 1
ocfs2_stack_o2cb 3606 0
ocfs2_dlm 196778 1 ocfs2_stack_o2cb
ocfs2_nodemanager 202856 3 ocfs2_dlmfs,ocfs2_stack_o2cb,ocfs2_dlm
ocfs2_stackglue 11283 2 ocfs2_dlmfs,ocfs2_stack_o2cb
configfs 25853 2 ocfs2_nodemanager
Looks good on both nodes…. then (sigh)
# service o2cb enable
Writing O2CB configuration: OK
Setting cluster stack "o2cb": OK
Registering O2CB cluster "test": Failed
o2cb: Unable to access cluster service while registering heartbeat mode 'global'
Unregistering O2CB cluster "test": OK
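In case it helps narrow things down, my next plan (based on my reading of o2cb(8) - I have not confirmed this is exactly what the init script does under "enable") is to reproduce the failing step by hand and look at configfs directly:

```
# o2cb register-cluster test
# o2cb start-heartbeat test
# ls /sys/kernel/config/cluster/test/heartbeat/
# o2cb cluster-status test
```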
I have searched for the error string and have come up with a huge ZERO on help - and the local OS log messages are equally unhelpful:
# tail /var/log/messages
Nov 12 21:54:53 clusterhost1 o2cb.init: online test
Nov 13 00:58:38 clusterhost1 o2cb.init: online test
Nov 13 01:00:06 clusterhost1 o2cb.init: offline test 0
Nov 13 01:00:06 clusterhost1 kernel: ocfs2: Unregistered cluster interface o2cb
Nov 13 01:01:14 clusterhost1 kernel: OCFS2 Node Manager 1.6.3
Nov 13 01:01:14 clusterhost1 kernel: OCFS2 DLM 1.6.3
Nov 13 01:01:14 clusterhost1 kernel: ocfs2: Registered cluster interface o2cb
Nov 13 01:01:14 clusterhost1 kernel: OCFS2 DLMFS 1.6.3
Nov 13 01:01:14 clusterhost1 kernel: OCFS2 User DLM kernel interface loaded
Nov 13 01:03:32 clusterhost1 o2cb.init: online test
Dmesg shows the same:
# dmesg
OCFS2 Node Manager 1.6.3
OCFS2 DLM 1.6.3
ocfs2: Registered cluster interface o2cb
OCFS2 DLMFS 1.6.3
OCFS2 User DLM kernel interface loaded
Slow work thread pool: Starting up
Slow work thread pool: Ready
FS-Cache: Loaded
FS-Cache: Netfs 'nfs' registered for caching
eth0: no IPv6 routers present
eth1: no IPv6 routers present
ocfs2: Unregistered cluster interface o2cb
OCFS2 Node Manager 1.6.3
OCFS2 DLM 1.6.3
ocfs2: Registered cluster interface o2cb
OCFS2 DLMFS 1.6.3
OCFS2 User DLM kernel interface loaded
ocfs2: Unregistered cluster interface o2cb
OCFS2 Node Manager 1.6.3
OCFS2 DLM 1.6.3
ocfs2: Registered cluster interface o2cb
OCFS2 DLMFS 1.6.3
OCFS2 User DLM kernel interface loaded
The filesystem looks fine and this can be run from both hosts in the cluster:
# fsck.ocfs2 -n /dev/sdc
fsck.ocfs2 1.8.0
Checking OCFS2 filesystem in /dev/sdc:
Label: ocfs2vol1
UUID: 5620F19D43D840C7A46523019AE15A96
Number of blocks: 524288
Block size: 4096
Number of clusters: 524288
Cluster size: 4096
Number of slots: 4
# fsck.ocfs2 -n /dev/sdd
fsck.ocfs2 1.8.0
Checking OCFS2 filesystem in /dev/sdd:
Label: ocfs2vol2
UUID: 9B9182279ABD4FD99F695F91488C94C1
Number of blocks: 524288
Block size: 4096
Number of clusters: 524288
Cluster size: 4096
Number of slots: 4
What am I missing? I've redone this and re-created the devices a few too many times (thinking I may have missed something), but I am mystified. From all outward appearances I have two VMs that can see a shared OCFS2 filesystem and, in local heartbeat mode, mount and access it (I have it running that way for a cluster of rsyslog servers load-balanced by an F5 LTM virtual server with no issues). I am stumped on how to get the global HB devices set up, though I have read and re-read the user guides, troubleshooting guides, and wikis/blogs on how to make it work until my eyes hurt.
I mounted debugfs and ran the debugfs.ocfs2 utility, but I'm unfamiliar with what I should be looking for there (or whether that is even where errors about the cluster not coming online would appear).
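One thing that did seem checkable: debugfs.ocfs2 can run a single command non-interactively with -R, so the superblock can be dumped to confirm the cluster/global-heartbeat feature flags actually landed on the devices (the grep pattern is just my guess at what to look for):

```
# debugfs.ocfs2 -R "stats" /dev/sdc | grep -i -e feature -e cluster
```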
As the o2cb/ocfs2 modules are all kernel-based, I am not 100% sure how to increase debug output without digging into the source code and mucking around there.
Any guidance or lessons learned (or things to check) would be super :) and if it works, it will warrant a happy scream of joy from my frustrated cube!
Warm regards,
Jon