[Ocfs2-users] Node 8 doesn't mount / Wrong slot map assignment?

Michael Ulbrich mul at rentapacs.de
Thu Sep 14 11:58:09 PDT 2017


Hi again,

I made some progress with debugging the situation.

To recap:

2 ocfs2 file systems:

/dev/drbd0 -> lvm -> RAID1 from 2 x 600 GB SAS disks

/dev/drbd1 -> lvm -> RAID1 from 2 x 6 TB NL (Near-Line) SAS disks

This is configured identically on 2 DELL R 530 servers (node 1 + 2 as
hypervisors). Disks are connected via PERC H730 mini (Linux kernel
driver: megaraid_sas ver. 06.811.02.00-rc1). drbd has a private GigE
link for replication traffic. Both hypervisors run 3 virtual machines each.

/dev/drbd0 works as expected as long as it's allocated on the 600 GB
RAID 1. If it's moved to the large 6 TB RAID1 device, its behaviour becomes
identical to that of /dev/drbd1.

As described in my previous post, there's an unusual slot (?) numbering
which prevents mounting the ocfs2 file system /dev/drbd1 on node 8. As a
quick fix we could swap node numbers 1 <-> 8 in cluster.conf, but this
does not address the underlying problem, as we will soon see. Deliberately
reformatted, the list of nodes looks as follows:

node (number = 8, name = h1a) -  Hypervisor
node (number = 2, name = h1b) -  Hypervisor
node (number = 3, name = web1) - Guest 1 on h1a
node (number = 4, name = db1)  - Guest 2 on h1a
node (number = 5, name = srv1) - Guest 3 on h1a
node (number = 6, name = web2) - Guest 4 on h1b
node (number = 7, name = db2)  - Guest 5 on h1b
node (number = 1, name = srv2) - Guest 6 on h1b

Now node 8 is the first (hypervisor) node to mount /dev/drbd1, which
leads to (output of 'watch -d -n 1 "echo \"hb\" | debugfs.ocfs2 -n /dev/drbd1"'):

hb

        node: node              seq       generation checksum
          64:    8 0000000059b8d9ba 73a63eb550a33095 f4e074d1

Node 2 is the second (Hypervisor) node to mount:

hb
        node: node              seq       generation checksum
          16:    2 0000000059b8d9b9 5c7504c05637983e 07d696ec
          64:    8 0000000059b8d9ba 73a63eb550a33095 f4e074d1

Again we see the strange "* 8" or "shift left 3" relationship between columns "node:" and "node".

Now the guests are brought up and mount the file system in order 3, 5, 6, 1 (I don't have the actual seq / gen values, so from memory):

hb
        node: node              seq       generation checksum
           1:    1 xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx xxxxxxxx
           3:    3 xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx xxxxxxxx
           5:    5 xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx xxxxxxxx
           6:    6 xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx xxxxxxxx
          16:    2 0000000059b8d9b9 5c7504c05637983e 07d696ec
          64:    8 0000000059b8d9ba 73a63eb550a33095 f4e074d1

Please note that the virtual machines get assigned the corresponding "node:" = "node" values as expected.

Now we went a step further and enabled tracing: "debugfs.ocfs2 -l HEARTBEAT allow". This periodically logs messages from the heartbeat threads of the individual file systems. For the file system /dev/drbd1 we get on the hypervisors:

(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 1 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 1544
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 2 gen 0x98be08e71122efed cksum 0x33a84ac0 seq 1505346907 last 1505346907 changed 1 equal 0
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 3 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 1544
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 4 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 1544
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 5 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 1544
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 6 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 1544
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 7 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 1544
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 8 gen 0x551934cc4ba0b1bf cksum 0xf606e2be seq 1505346907 last 1505346907 changed 1 equal 0

We only see the hypervisors heartbeating in slots 2 and 8, although 4 additional guests have also mounted the same file system.

Tracing the ocfs2 heartbeat on one of the guests (web1) gives the following:

(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 1 gen 0xd1f96dee2509bc73 cksum 0x1dc10931 seq 1505371587 last 1505371587 changed 1 equal 0
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 2 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 13674
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 3 gen 0x5d8c200c0113510f cksum 0xbfc95a14 seq 1505371590 last 1505371590 changed 1 equal 0
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 4 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 13674
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 5 gen 0x39a8da3bae49161b cksum 0x49b4a110 seq 1505371588 last 1505371588 changed 1 equal 0
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 6 gen 0xc00a0ba3931ad15 cksum 0x92625e99 seq 1505371587 last 1505371587 changed 1 equal 0
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 7 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 13674
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 8 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 13674

Here we have the mirrored situation: only the guests are seen heartbeating in slots 1, 3, 5 and 6. There is no trace of the hypervisors in slots 2 (16) and 8 (64) ...

Uhm, well ... trying to wrap my head around this ... :-)

Ok, so let's dig a little deeper.

Watching the init of a mount operation with HEARTBEAT tracing enabled gives different results depending on whether the same file system is mounted on a hypervisor:

(mount.ocfs2,28264,0):o2hb_init_region_params:1729 hr_start_block = 273, hr_blocks = 255
(mount.ocfs2,28264,0):o2hb_init_region_params:1731 hr_block_bytes = 4096, *hr_block_bits = 12*
(mount.ocfs2,28264,0):o2hb_init_region_params:1732 hr_timeout_ms = 2000
(mount.ocfs2,28264,0):o2hb_init_region_params:1733 dead threshold = 31
(mount.ocfs2,28264,0):o2hb_map_slot_data:1764 Going to require 255 pages to cover 255 blocks at 1 blocks per page
(o2hb-C27AC49D2B,28265,10):o2hb_thread:1221 hb thread running

... or on a guest:

(mount.ocfs2,3505,1):o2hb_init_region_params:1729 hr_start_block = 2184, hr_blocks = 255
(mount.ocfs2,3505,1):o2hb_init_region_params:1731 hr_block_bytes = 512, *hr_block_bits = 9*
(mount.ocfs2,3505,1):o2hb_init_region_params:1732 hr_timeout_ms = 2000
(mount.ocfs2,3505,1):o2hb_init_region_params:1733 dead threshold = 31
(mount.ocfs2,3505,1):o2hb_map_slot_data:1764 Going to require 32 pages to cover 255 blocks at 8 blocks per page
(o2hb-C27AC49D2B,3506,0):o2hb_thread:1221 hb thread running
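
The page counts in the two traces are consistent with the two block sizes.
A quick sanity check of that arithmetic (plain Python, not ocfs2 code):

```python
import math

PAGE_SIZE = 4096   # x86-64 page size
HR_BLOCKS = 255    # hr_blocks reported in both traces

def pages_needed(block_bytes: int) -> int:
    """Pages required to map HR_BLOCKS heartbeat blocks, one slot per block."""
    blocks_per_page = max(1, PAGE_SIZE // block_bytes)
    return math.ceil(HR_BLOCKS / blocks_per_page)

print(pages_needed(4096))  # 255 pages on the hypervisors (1 block per page)
print(pages_needed(512))   # 32 pages on the guests (8 blocks per page)
```

Both values match the "Going to require N pages" lines above.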

So on the hypervisors we have -> hr_block_bytes = 4096, hr_block_bits = 12
and on the virtual machines ->   hr_block_bytes = 512,  hr_block_bits = 9

There, by the way, we find the factor of 8 in the differing block sizes and
the 3-bit difference in hr_block_bits. Note that hr_start_block differs by
the same factor: 2184 = 273 * 8.
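If one assumes that each node's heartbeat slot simply occupies one block,
with slot index equal to node number, the factor of 8 reproduces the
"node: = node * 8" pattern seen earlier. A sketch of that arithmetic (this
is my assumed layout model, not ocfs2 source):

```python
HYP_BS = 4096    # hr_block_bytes on the hypervisors
GUEST_BS = 512   # hr_block_bytes on the guests

def slot_byte_offset(node: int, block_bytes: int) -> int:
    """Assumed layout: one heartbeat slot per block, slot index == node number."""
    return node * block_bytes

# A hypervisor writing with 4096-byte blocks lands, measured in 512-byte
# units, on slot node * 8, i.e. node << 3:
for node in (1, 2, 8):
    print(node, slot_byte_offset(node, HYP_BS) // GUEST_BS)
    # 1 -> 8, 2 -> 16, 8 -> 64
```

That matches the "16: 2" and "64: 8" rows in the debugfs output exactly.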

Is this ok? From my limited understanding I would have expected that the nodes mounting the shared file system share the heartbeat system file and, inside it, a common data structure (the heartbeat region). But then the nodes should have a common understanding of the size of this structure, right? Here it looks as if the hypervisors are interacting with a "large" heartbeat region while the guests do the same on a "small" one. That would explain why the heartbeat threads of hypervisors and guests do not "see" each other, as described above.
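Under that same one-slot-per-block assumption, the mutual invisibility also
follows: the byte ranges written with one block size never overlap the byte
ranges scanned with the other (except around slot 0). A sketch, again of my
assumed model rather than the actual kernel logic:

```python
HYP_BS, GUEST_BS = 4096, 512  # hr_block_bytes from the two traces

def slot_span(node: int, block_bytes: int):
    """Byte range of a node's slot, assuming one slot per block."""
    start = node * block_bytes
    return (start, start + block_bytes)

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

# Does the range hypervisor node 2 writes overlap the range a guest scans
# for slot 2?
print(overlaps(slot_span(2, HYP_BS), slot_span(2, GUEST_BS)))  # False

# And the guests' writes (nodes 1, 3, 5, 6) all fall inside the first
# 4096-byte block, i.e. inside the hypervisors' slot 0:
print(all(slot_span(n, GUEST_BS)[1] <= 4096 for n in (1, 3, 5, 6)))  # True
```

So each side heartbeats into regions the other side never reads as its own
slots, which is exactly what the two HEARTBEAT traces show.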

Or are the heartbeat regions in-memory structures of different sizes which get translated into a common on-disk structure when written to the hb system file?

It would be great if you could give me a little guidance here. If this actually is a bug, I'm willing to work on it further. If I'm heading in the wrong direction, just give me a short note and perhaps a hint as to what's wrong with my setup, or which ocfs2 version might fix this problem.

Thanks a lot + Regards from Berlin ... Michael U.


On 09/13/2017 09:54 AM, Michael Ulbrich wrote:
> Hi all,
>
> we've a small (?) problem with a 2-node cluster on Debian 8:
>
> Linux h1b 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u2 (2017-06-26)
> x86_64 GNU/Linux
>
> ocfs2-tools 1.6.4-3
>
> Two ocfs2 filesystems (drbd0 600 GB w/ 8 slots and drbd1 6 TB w/ 6
> slots) are created on top of drbd w/ 4k block and cluster size,
> 'max_features' enabled.
>
> cluster.conf assigns sequential node numbers 1 - 8. Nodes 1, 2 are the
> hypervisors. Nodes 3, 4, 5 are VMs on node 1. Nodes 6, 7, 8 the
> corresponding VMs on node 2.
>
> VMs all run Debian 8 as well:
>
> Linux srv2 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64
> GNU/Linux
>
> When mounting drbd0 in order of increasing node numbers and concurrently
> watching the 'hb' output from debugfs.ocfs2 we get a clean slot map (?):
>
> hb
>         node: node              seq       generation checksum
>            1:    1 0000000059b8d94a fa60f0d8423590d9 edec9643
>            2:    2 0000000059b8d94c aca059df4670f467 994e3458
>            3:    3 0000000059b8d949 f03dc9ba8f27582c d4473fc2
>            4:    4 0000000059b8d94b df5bbdb756e757f8 12a198eb
>            5:    5 0000000059b8d94a 1af81d94a7cb681b 91fba906
>            6:    6 0000000059b8d94b 104538f30cdb35fa 8713e798
>            7:    7 0000000059b8d94b 195658c9fb8ca7f9 5e54edf6
>            8:    8 0000000059b8d949 dc6bfb46b9cf1ac3 de7a8757
>
> Device drbd1 in contrast yields the following table after mounting on
> nodes 1, 2:
>
> hb
>         node: node              seq       generation checksum
>            8:    1 0000000059b8d9ba 73a63eb550a33095 f4e074d1
>           16:    2 0000000059b8d9b9 5c7504c05637983e 07d696ec
>
> Proceeding with the drbd1 mounts on nodes 3, 5, 6 leads us to:
>
> hb
>         node: node              seq       generation checksum
>            3:    3 0000000059b8da3b 9443b4b209b16175 f2cc87ec
>            5:    5 0000000059b8da3c 4b742f709377466f 3ac41cf3
>            6:    6 0000000059b8da3b d96e2de0a55514f6 335a4d90
>            8:    1 0000000059b8da3c 73a63eb550a33095 2312c1c4
>           16:    2 0000000059b8da3d 5c7504c05637983e 659571a1
>
> The problem arises when trying to mount node 8 since its slot is already
> occupied by node 1:
>
> kern.log node 1:
>
> (o2hb-0AEE381A14,50990,4):o2hb_check_own_slot:582 ERROR: Another node is
> heartbeating on device (drbd1): expected(1:0x18acf7b0b3e5544c,
> 0x59b8445c), ondisk(8:0xb91302db72a65364, 0x59b8445b)
>
> kern.log node 8:
>
> ocfs2: Mounting device (254,16) on (node 8, slot 7) with ordered data mode.
> (o2hb-0AEE381A14,518,1):o2hb_check_own_slot:582 ERROR: Another node is
> heartbeating on device (vdc): expected(8:0x18acf7b0b3e5544c,
> 0x59b8445c), ondisk(1:0x18acf7b0b3e5544c, 0x59b8445c)
>
> This can be "fixed" by exchanging node numbers 1 and 8 in cluster.conf.
> Node 8 will then be assigned slot 8, node 2 stays in slot 16, and nodes 3
> to 7 come up as expected. There is no node 16 configured, so there's no
> conflict. But since we experience other, so far unexplainable,
> instabilities with this ocfs2 device during operation, we decided to try
> to fix this underlying issue first.
>
> Somehow the failure is reminiscent of bit-shift or masking problems:
>
> 1 << 3 = 8
> 2 << 3 = 16
>
> But then again - what do I know ...
>
> Tried so far:
>
> A. Create offending file system with 8 slots instead of 6 -> same issue.
> B. Set features to 'default' (disables feature 'extended-slotmap') ->
> same issue.
>
> We'd very much appreciate any comments on this. Has anything similar
> ever been experienced before? Are we completely missing something
> important here?
>
> If there's a fix already out for this any pointers (src files / commits)
> to where to look would be greatly appreciated.
>
> Thanks in advance + Best regards ... Michael U.



