[Ocfs2-users] Ocfs2-users Digest, Vol 130, Issue 1

Thu Nov 13 06:16:05 PST 2014

Hi Jon,

The kernel you are using includes the ocfs2 kernel modules at version 
1.6.3. The global heartbeat feature was introduced in ocfs2 1.8.

I haven't checked whether any of the 2.6.32-based uek's include ocfs2 
1.8, but certainly the 2.6.39 and later (aka uek2, uek3) ones do.

I assume from below you have an Oracle support license - at least for 
rdbms if not Oracle Linux. When using ocfs2 for rdbms resources, your 
rdbms license entitles you to ocfs2 support via MOS, though for 
general-purpose ocfs2 issues an Oracle Linux Support contract needs to 
be in place. This would have a separate CSI from that of licensed 
products - obviously open-source products are not licensed, but if you 
require support you need a support contract.

You may also want to review MOS documents 1552519.1 and 1553162.1

Best regards,
Richard.

On 13/11/2014 02:27, ocfs2-users-request at oss.oracle.com wrote:
> Send Ocfs2-users mailing list submissions to
> 	ocfs2-users at oss.oracle.com
>
> To subscribe or unsubscribe via the World Wide Web, visit
> 	https://oss.oracle.com/mailman/listinfo/ocfs2-users
> or, via email, send a message with subject or body 'help' to
> 	ocfs2-users-request at oss.oracle.com
>
> You can reach the person managing the list at
> 	ocfs2-users-owner at oss.oracle.com
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Ocfs2-users digest..."
>
>
> Today's Topics:
>
>     1. OCFS2 v1.8 on VMware VMs global heartbeat woes (Jon Norris)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 12 Nov 2014 18:26:51 -0800
> From: Jon Norris <jon_norris at apple.com>
> Subject: [Ocfs2-users] OCFS2 v1.8 on VMware VMs global heartbeat woes
> To: ocfs2-users at oss.oracle.com
> Message-ID: <FFF084CA-12F0-4FA1-BB38-B70F5A4A6695 at apple.com>
> Content-Type: text/plain; charset="utf-8"
>
> Running two VMs on ESXi 5.1.0 and trying to get global heart beat (HB) working with no luck (on about my 20th rebuild and redo)
>
> Environment:
>
> Two VMware based VMs running
>
> # cat /etc/oracle-release
>
> Oracle Linux Server release 6.5
>
> # uname -r
>
> 2.6.32-400.36.8.el6uek.x86_64
>
> # yum list installed  | grep ocfs
>
> ocfs2-tools.x86_64               1.8.0-11.el6           @oel-latest
>
> # yum list installed | grep uek
>
> kernel-uek.x86_64                2.6.32-400.36.8.el6uek @oel-latest
> kernel-uek-firmware.noarch       2.6.32-400.36.8.el6uek @oel-latest
> kernel-uek-headers.x86_64        2.6.32-400.36.8.el6uek @oel-latest
>
> Configuration:
>
> The shared data stores (HB and mounted OCFS) are setup in a similar way as described by VMWare and Oracle for shared RAC VMWare based data stores. All blogs, wikis and VMWare KB docs show similar setup, VM shared SCSI settings [multi-writer], shared disk [independant + persistent] etc. such as:
>
> http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1034165 <http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1034165> )
>
> The devices can be seen by both VMs after in the OS. I have used the same configuration to run an OCFS2 setup with local heartbeat, and that works fine (cluster starts up and the OCFS2 file system mounts with no issues)
>
> I followed similar procedures as show in an Oracle blog + docs: https://docs.oracle.com/cd/E37670_01/E37355/html/ol_instcfg_ocfs2.html <https://docs.oracle.com/cd/E37670_01/E37355/html/ol_instcfg_ocfs2.html> and https://blogs.oracle.com/wim/entry/ocfs2_global_heartbeat <https://blogs.oracle.com/wim/entry/ocfs2_global_heartbeat> with no luck.
>
> The shared SCSI controllers are VMware paravirtual and set to ?shared none? as suggested by the VMware RAC shared disk KB (previously mentioned)
>
> After the shared Linux devices have been added to both VMs and are seen by both VMs in the OS (ls /dev/sd* shows the devices on each) I format the global HB devices in a way similar to the following from one VM:
>
> # mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 --cluster-name=test --cluster-stack=o2cb --global-heartbeat /dev/sdc
> # mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol2 --cluster-name=test --cluster-stack=o2cb --global-heartbeat /dev/sdd
>
> >From both VMs you can run the following and see:
>
> # mounted.ocfs2 -d
>
> Device    Stack  Cluster  F  UUID                              Label
> /dev/sdc  o2cb   test     G  5620F19D43D840C7A46523019AE15A96  ocfs2vol1
> /dev/sdd  o2cb   test     G  9B9182279ABD4FD99F695F91488C94C1  ocfs2vol2
>
> I then add the global HB devices to the ocfs config file with similar commands:
>
> # o2cb add-heartbeat test 5620F19D43D840C7A46523019AE15A96
> # o2cb add-heartbeat test 9B9182279ABD4FD99F695F91488C94C1
>
> Thus far looking good (heh, but then all we?ve done is format ocfs2 with options and updated a text file) - then I do the following:
>
> # o2cb heartbeat-mode test global
>
> All this being done on one node in the cluster I copy the following to the other node (with hostnames changed here, though the actual hostname = output of the hostname command on each node):
>
> # cat /etc/ocfs2/cluster.conf
>
> node:
> 	name = clusterhost1.mydomain.com
> 	cluster = test
> 	number = 0
> 	ip_address = 10.143.144.12
> 	ip_port = 7777
>
> node:
> 	name = clusterhost2.mydomain.com
> 	cluster = test
> 	number = 1
> 	ip_address = 10.143.144.13
> 	ip_port = 7777
>
> cluster:
> 	name = test
> 	heartbeat_mode = global
> 	node_count = 2
>
> heartbeat:
> 	cluster = test
> 	region = 5620F19D43D840C7A46523019AE15A96
>
> heartbeat:
> 	cluster = test
> 	region = 9B9182279ABD4FD99F695F91488C94C1
>
> The same config works fine with heartbeat_mode set to local and the global heartbeat devices removed, and I can mount a shared FS - the local HB interfaces are IPv4 on a private L2 non routed VLAN, are up and each node can ping each other.
>
> Once the config is copied to each node and have already run:
>
> # service o2cb configure
>
> Which completes in local heartbeat mode fine, so the cluster will start on boot and the params are default for timeouts etc.
>
> I check that the service on both nodes unloads and loads modules with no issues:
>
> # service o2cb unload
>
> Clean userdlm domains: OK
> Unmounting ocfs2_dlmfs filesystem: OK
> Unloading module "ocfs2_dlmfs": OK
> Unloading module "ocfs2_stack_o2cb": OK
> Unmounting configfs filesystem: OK
> Unloading module "configfs": OK
>
> # service o2cb load
>
> Loading filesystem "configfs": OK
> Mounting configfs filesystem at /sys/kernel/config: OK
> Loading stack plugin "o2cb": OK
> Loading filesystem "ocfs2_dlmfs": OK
> Mounting ocfs2_dlmfs filesystem at /dlm: OK
>
> # mount -v
> ?
> ?.
> debugfs on /sys/kernel/debug type debugfs (rw)
> ?.
> ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
>
> #  lsmod | grep ocfs
>
> ocfs2_dlmfs            18026  1
> ocfs2_stack_o2cb        3606  0
> ocfs2_dlm             196778  1 ocfs2_stack_o2cb
> ocfs2_nodemanager     202856  3 ocfs2_dlmfs,ocfs2_stack_o2cb,ocfs2_dlm
> ocfs2_stackglue        11283  2 ocfs2_dlmfs,ocfs2_stack_o2cb
> configfs               25853  2 ocfs2_nodemanager
>
> Looks good on both nodes?. then (sigh)
>
> # service o2cb enable
>
> Writing O2CB configuration: OK
> Setting cluster stack "o2cb": OK
> Registering O2CB cluster "test": Failed
> o2cb: Unable to access cluster service while registering heartbeat mode 'global'
> Unregistering O2CB cluster "test": OK
>
> I have searched for the error string and have come up with a huge ZERO on help  - and the local OS log messages are equally unhelpful:
>
> # tail /var/log/messages
>
> Nov 12 21:54:53 clusterhost1 o2cb.init: online test
> Nov 13 00:58:38 clusterhost1 o2cb.init: online test
> Nov 13 01:00:06 clusterhost1 o2cb.init: offline test 0
> Nov 13 01:00:06 clusterhost1 kernel: ocfs2: Unregistered cluster interface o2cb
> Nov 13 01:01:14 clusterhost1 kernel: OCFS2 Node Manager 1.6.3
> Nov 13 01:01:14 clusterhost1 kernel: OCFS2 DLM 1.6.3
> Nov 13 01:01:14 clusterhost1 kernel: ocfs2: Registered cluster interface o2cb
> Nov 13 01:01:14 clusterhost1 kernel: OCFS2 DLMFS 1.6.3
> Nov 13 01:01:14 clusterhost1 kernel: OCFS2 User DLM kernel interface loaded
> Nov 13 01:03:32 clusterhost1 o2cb.init: online test
>
> Dmesg shows the same:
>
> # dmesg
>
> OCFS2 Node Manager 1.6.3
> OCFS2 DLM 1.6.3
> ocfs2: Registered cluster interface o2cb
> OCFS2 DLMFS 1.6.3
> OCFS2 User DLM kernel interface loaded
> Slow work thread pool: Starting up
> Slow work thread pool: Ready
> FS-Cache: Loaded
> FS-Cache: Netfs 'nfs' registered for caching
> eth0: no IPv6 routers present
> eth1: no IPv6 routers present
> ocfs2: Unregistered cluster interface o2cb
> OCFS2 Node Manager 1.6.3
> OCFS2 DLM 1.6.3
> ocfs2: Registered cluster interface o2cb
> OCFS2 DLMFS 1.6.3
> OCFS2 User DLM kernel interface loaded
> ocfs2: Unregistered cluster interface o2cb
> OCFS2 Node Manager 1.6.3
> OCFS2 DLM 1.6.3
> ocfs2: Registered cluster interface o2cb
> OCFS2 DLMFS 1.6.3
> OCFS2 User DLM kernel interface loaded
>
> The filesystem looks fine and this can be run from both hosts in the cluster:
>
> # fsck.ocfs2 -n /dev/sdc
>
> fsck.ocfs2 1.8.0
> Checking OCFS2 filesystem in /dev/sdc:
>    Label:              ocfs2vol1
>    UUID:               5620F19D43D840C7A46523019AE15A96
>    Number of blocks:   524288
>    Block size:         4096
>    Number of clusters: 524288
>    Cluster size:       4096
>    Number of slots:    4
>
> # fsck.ocfs2 -n /dev/sdd
>
> fsck.ocfs2 1.8.0
> Checking OCFS2 filesystem in /dev/sdd:
>    Label:              ocfs2vol2
>    UUID:               9B9182279ABD4FD99F695F91488C94C1
>    Number of blocks:   524288
>    Block size:         4096
>    Number of clusters: 524288
>    Cluster size:       4096
>    Number of slots:    4
>
> What am I missing? I?ve re-done this, re-created the devices a few too many times (thinking I may have missed something) but I am mystified. From all outer appearances I have two VMs that can see and in local heartbeat mode mount a shared OCFS2 filesystem and access it (have it running in local heartbeat mode for a cluster of rsyslog servers that are being load balanced by an F5 LTM VS with no issues) I am stumped on how to get global HB devices setup, though I have read and re-read the user guides, troubleshooting guides and wikis/blogs on how to make that work until my eyes hurt.
>
> Mounted the debugfs and ran the debugfs.ocfs2 utility but am unfamiliar of what I should be looking for there (or if this is where I would look for cluster not coming online errors)
>
> As the oc2b/ocfs modules are all kernel based I am not 100% sure how to increase debug information without digging into the source code and mucking around there.
>
> Any guidance or lessons learned (or things to check) would be super :) and if works warrant a happy scream of joy from my frustrated cube!
>
>
> Warm regards,
>
> Jon
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: http://oss.oracle.com/pipermail/ocfs2-users/attachments/20141112/68b903fe/attachment.html
>
> ------------------------------
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-users
>
> End of Ocfs2-users Digest, Vol 130, Issue 1
> *******************************************