[Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"

heming.zhao at suse.com
Sat Aug 6 15:53:28 UTC 2022


Hi Mark,

On 8/5/22 12:11, Mark Fasheh wrote:
> On Thu, Aug 4, 2022 at 4:53 PM Mark Fasheh <mark at fasheh.com> wrote:
>> 2) Should we allow the user to bypass our cluster checks?
>>
>> On this question I'm still a 'no'. I simply haven't seen enough
>> evidence to warrant such a drastic change in policy. Allowing it via
>> mount option too just feels extremely error-prone. I think we need to
>> explore alternative avenues to helping the user out here. As you
>> noted in your followup, a single node
>> config is entirely possible in pacemaker (I've run that config
>> myself). Why not provide an easy way for the user to drop down to that
>> sort of a config? I know that's kind of pushing responsibility for
>> this to the cluster stack, but that's
>> where it belongs in the first place.
>>
>> Another option might be an 'observer mode' mount, where the node
>> participates in the cluster (and the file system locking) but purely
>> in a read-only fashion.
> 
> Thinking about this some more... The only way that this works without
> potential corruptions is if we always write a periodic mmp sequence,
> even in clustered mode (which might mean each node writes to its own
> sector). That way tunefs can always check the disk for a mounted node,
> even without a cluster stack up. If tunefs sees anyone writing
> sequences to the disk, it can safely fail the operation. Tunefs also
> would have to be writing an mmp sequence once it has determined that
> the disk is not mounted. It could also write some flag alongside the
> sequence that says 'tunefs is working on this disk'. If a cluster
> mount comes up and sees a live sequence with that flag, it will know
> to fail the mount request as the disk is being modified. Local mounts
> can also use this to ensure that they are the only mounted node.
> 

The tunefs check & sequence-write steps you describe above are exactly
the MMP feature workflow.



The tunefs workflow you mention matches the idea of my patch [4/4],
which does all of this work in the kernel, inside ocfs2_find_slot().
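
For reference, a minimal user-space sketch of that check-and-wait
workflow (the helper names and offsets here are hypothetical, only to
illustrate the idea; the real layout and logic are in patch [4/4]):

#include <stdint.h>
#include <unistd.h>

#define MMP_SLOT_BASE  4096  /* assumed offset of the slot_map area */
#define MMP_SLOT_SIZE   512  /* assumed: one sector per slot */

/* read the mmp sequence stored at the start of a slot's sector */
static uint32_t read_mmp_seq(int fd, int slot)
{
        uint32_t seq = 0;

        pread(fd, &seq, sizeof(seq),
              MMP_SLOT_BASE + (off_t)slot * MMP_SLOT_SIZE);
        return seq;
}

/*
 * The probe tunefs would run: sample the sequence, wait longer than
 * one update interval, sample again.  A changed value means some
 * node is alive and writing, so tunefs must fail the operation.
 */
static int mmp_slot_is_live(int fd, int slot, unsigned int interval_secs)
{
        uint32_t s1 = read_mmp_seq(fd, slot);

        sleep(interval_secs + 1);
        return read_mmp_seq(fd, slot) != s1;
}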



Because the sequences are stored in the superblock area, updating them
requires taking ->osb_lock, which affects performance.

And for storing the seqs, to save space, we don't allocate a separate
disk block for each node. But if multiple nodes share the same disk
block (e.g. if we keep using the slot_map area to store the seqs), the
periodic update job will cause an I/O performance issue.
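
To make the contention concrete, a simplified sketch (hypothetical
names and offsets, not the patch code): with all nodes' sequences in
one shared block, every node's periodic heartbeat rewrites that same
block.

#include <stdint.h>
#include <unistd.h>

#define MAX_SLOTS          16    /* illustrative */
#define SLOT_BLOCK_OFFSET  4096  /* illustrative */

/* all nodes' sequences live in one shared disk block */
struct slot_block {
        uint32_t mmp_seq[MAX_SLOTS];
};

/*
 * Each node bumps only its own counter, but must rewrite the whole
 * shared block, so N mounted nodes generate N competing periodic
 * writes to a single block.
 */
static void update_my_seq(int fd, struct slot_block *blk, int my_slot)
{
        blk->mmp_seq[my_slot]++;
        pwrite(fd, blk, sizeof(*blk), SLOT_BLOCK_OFFSET);
}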



To avoid this performance issue, in my view we should disable MMP
sequence updating when the mount mode is clustered. So I made a rule:

** If the last mount did not unmount cleanly (e.g. after a crash),
   the next mount MUST use the same mount type. **



Another way to state this rule: a newly arriving node must mount with
the same mount type as the existing (or crashed) mounts. For example,
if a node crashed while mounted non-clustered, the next mount must
also be non-clustered.



And one piece of background: the current ocfs2 code, running under a
cluster stack, already has the ability to prevent conflicting mounts
when the mount type is clustered.


(from patch 4/4) these are the mount labels:

+#define OCFS2_MMP_SEQ_CLEAN   0xFF4D4D50U /* mmp_seq value for clean unmount */
+#define OCFS2_MMP_SEQ_FSCK    0xE24D4D50U /* mmp_seq value when being fscked */
+#define OCFS2_MMP_SEQ_MAX     0xE24D4D4FU /* maximum valid mmp_seq value */
+#define OCFS2_MMP_SEQ_INIT    0x0         /* mmp_seq init value */
+#define OCFS2_VALID_CLUSTER   0xE24D4D55U /* value for clustered mount
+                                             under MMP disabled */
+#define OCFS2_VALID_NOCLUSTER 0xE24D4D5AU /* value for noclustered mount
+                                             under MMP disabled */
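
As an illustration of how these values partition (hypothetical
helpers, not code from the patch): any nonzero value at or below
OCFS2_MMP_SEQ_MAX is a running sequence from a live kmmpd, and the
special values above it mark the other states.

#include <stdint.h>

/* does this slot show a live clustered mount? (sketch) */
static inline int slot_shows_cluster_mount(uint32_t seq)
{
        return seq == OCFS2_VALID_CLUSTER;
}

/* does this slot show a live local/non-clustered mount? (sketch) */
static inline int slot_shows_noncluster_mount(uint32_t seq)
{
        /* either the MMP-disabled marker or a running mmp_seq value */
        return seq == OCFS2_VALID_NOCLUSTER ||
               (seq != OCFS2_MMP_SEQ_INIT && seq <= OCFS2_MMP_SEQ_MAX);
}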



Whenever a mount succeeds, there are three kinds of living labels that
can appear in the slot_map area:

- OCFS2_MMP_SEQ_CLEAN, OCFS2_VALID_NOCLUSTER - for local/non-clustered mounts
- OCFS2_VALID_CLUSTER - for clustered mounts

A newly arriving node checks whether any slot contains a living label
(see the sketch below).

Whenever an unmount succeeds, there are two kinds of labels that can
be left in the slot_map area:

- OCFS2_MMP_SEQ_CLEAN or 0 (zero)

When a node unmounts, depending on its mount type, it either clears
(zeroes) its slot or writes OCFS2_MMP_SEQ_CLEAN into it.
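
A sketch of that mount-time check (hypothetical helper; the real
check lives in ocfs2_find_slot() in patch [4/4]): scan every slot and
refuse the mount if a living label of a conflicting type is found.

#include <errno.h>
#include <stdint.h>

/*
 * Sketch: 'clustered' is the type of the new mount.  Returns 0 when
 * the mount may proceed, -EBUSY when a conflicting living label
 * exists.  (Which of CLEAN/zero each unmount type writes is defined
 * by the patch, so both are treated as "free" here.)
 */
static int check_slots_for_mount(uint32_t *slots, int nr_slots,
                                 int clustered)
{
        int i;

        for (i = 0; i < nr_slots; i++) {
                uint32_t seq = slots[i];

                if (seq == 0 || seq == OCFS2_MMP_SEQ_CLEAN)
                        continue;  /* cleanly unmounted slot */
                if (clustered && seq == OCFS2_VALID_CLUSTER)
                        continue;  /* another clustered mount: allowed */
                return -EBUSY;     /* conflicting living label */
        }
        return 0;
}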

> As it turns out, we already do pretty much all of the sequence writing
> already for the o2cb cluster stack - check out cluster/heartbeat.c.
> If memory serves, tunefs.ocfs2 has code to write to this heartbeat
> area as well. For o2cb, we use the disk heartbeat to detect node
> liveness, and to kill our local node if we see disk timeouts. For
> pcmk, we shouldn't take any of these actions as it is none of our
> responsibility. Under pcmk, the heartbeating would be purely for mount
> protection checks.
> 

In my view, under a cluster stack there is already enough protection
logic. For local/non-clustered mounts, MMP provides the ability to
detect node liveness.

So I only start the kmmpd-<dev> kthread for local/non-clustered
mounts.
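
A sketch of that gating (kernel-style, not the patch verbatim; kmmpd
and the osb->mmp_task field are assumed names from patch [4/4]):

#include <linux/kthread.h>
#include <linux/err.h>

static int kmmpd(void *data);  /* assumed: periodically bumps mmp_seq */

/* start the MMP updater only for local/non-clustered mounts */
static int ocfs2_maybe_start_kmmpd(struct ocfs2_super *osb, bool clustered)
{
        struct task_struct *t;

        if (clustered)  /* clustered mounts rely on the cluster stack */
                return 0;

        t = kthread_run(kmmpd, osb, "kmmpd-%s", osb->sb->s_id);
        if (IS_ERR(t))
                return PTR_ERR(t);

        osb->mmp_task = t;  /* assumed field added by patch [4/4] */
        return 0;
}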

> The downside to this is that all nodes would be heartbeating to the
> disk on a regular interval, not just one. To be fair, this is exactly
> how o2cb works and with the correct timeout choices, we were able to
> avoid a measurable performance impact, though in any case this might
> have to be a small price the user pays for cluster aware mount
> protection.

I already considered this performance issue in patch [4/4].

> 
> Let me know what you think.
> 
> Thanks,
>    --Mark

Thanks,
Heming


