[Ocfs2-devel] + revert-ocfs2-mount-shared-volume-without-ha-stack.patch added to mm-hotfixes-unstable branch
Junxiao Bi
junxiao.bi at oracle.com
Sun Jun 26 23:18:44 UTC 2022
Hi Andrew,
Thanks for merging it to mm.
I see the “From” is “ Junxiao Bi via Ocfs2-devel <ocfs2-devel at oss.oracle.com>”, can you help fix that?
Thanks,
Junxiao
> 在 2022年6月26日,下午1:21,Andrew Morton <akpm at linux-foundation.org> 写道:
>
>
> The patch titled
> Subject: Revert "ocfs2: mount shared volume without ha stack"
> has been added to the -mm mm-hotfixes-unstable branch. Its filename is
> revert-ocfs2-mount-shared-volume-without-ha-stack.patch
>
> This patch will shortly appear at
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/revert-ocfs2-mount-shared-volume-without-ha-stack.patch
>
> This patch will later appear in the mm-hotfixes-unstable branch at
> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
>
> Before you just go and hit "reply", please:
> a) Consider who else should be cc'ed
> b) Prefer to cc a suitable mailing list as well
> c) Ideally: find the original patch on the mailing list and do a
> reply-to-all to that, adding suitable additional cc's
>
> *** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
>
> The -mm tree is included into linux-next via the mm-everything
> branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> and is updated there every 2-3 working days
>
> ------------------------------------------------------
> From: Junxiao Bi via Ocfs2-devel <ocfs2-devel at oss.oracle.com>
> Subject: Revert "ocfs2: mount shared volume without ha stack"
> Date: Fri, 3 Jun 2022 15:28:01 -0700
>
> This reverts commit 912f655d78c5d4ad05eac287f23a435924df7144.
>
> This commit introduced a regression that can cause mount hung. The
> changes in __ocfs2_find_empty_slot causes that any node with none-zero
> node number can grab the slot that was already taken by node 0, so node 1
> will access the same journal with node 0, when it try to grab journal
> cluster lock, it will hung because it was already acquired by node 0.
> It's very easy to reproduce this, in one cluster, mount node 0 first, then
> node 1, you will see the following call trace from node 1.
>
> [13148.735424] INFO: task mount.ocfs2:53045 blocked for more than 122 seconds.
> [13148.739691] Not tainted 5.15.0-2148.0.4.el8uek.mountracev2.x86_64 #2
> [13148.742560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [13148.745846] task:mount.ocfs2 state:D stack: 0 pid:53045 ppid: 53044 flags:0x00004000
> [13148.749354] Call Trace:
> [13148.750718] <TASK>
> [13148.752019] ? usleep_range+0x90/0x89
> [13148.753882] __schedule+0x210/0x567
> [13148.755684] schedule+0x44/0xa8
> [13148.757270] schedule_timeout+0x106/0x13c
> [13148.759273] ? __prepare_to_swait+0x53/0x78
> [13148.761218] __wait_for_common+0xae/0x163
> [13148.763144] __ocfs2_cluster_lock.constprop.0+0x1d6/0x870 [ocfs2]
> [13148.765780] ? ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
> [13148.768312] ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
> [13148.770968] ocfs2_journal_init+0x91/0x340 [ocfs2]
> [13148.773202] ocfs2_check_volume+0x39/0x461 [ocfs2]
> [13148.775401] ? iput+0x69/0xba
> [13148.777047] ocfs2_mount_volume.isra.0.cold+0x40/0x1f5 [ocfs2]
> [13148.779646] ocfs2_fill_super+0x54b/0x853 [ocfs2]
> [13148.781756] mount_bdev+0x190/0x1b7
> [13148.783443] ? ocfs2_remount+0x440/0x440 [ocfs2]
> [13148.785634] legacy_get_tree+0x27/0x48
> [13148.787466] vfs_get_tree+0x25/0xd0
> [13148.789270] do_new_mount+0x18c/0x2d9
> [13148.791046] __x64_sys_mount+0x10e/0x142
> [13148.792911] do_syscall_64+0x3b/0x89
> [13148.794667] entry_SYSCALL_64_after_hwframe+0x170/0x0
> [13148.797051] RIP: 0033:0x7f2309f6e26e
> [13148.798784] RSP: 002b:00007ffdcee7d408 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
> [13148.801974] RAX: ffffffffffffffda RBX: 00007ffdcee7d4a0 RCX: 00007f2309f6e26e
> [13148.804815] RDX: 0000559aa762a8ae RSI: 0000559aa939d340 RDI: 0000559aa93a22b0
> [13148.807719] RBP: 00007ffdcee7d5b0 R08: 0000559aa93a2290 R09: 00007f230a0b4820
> [13148.810659] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcee7d420
> [13148.813609] R13: 0000000000000000 R14: 0000559aa939f000 R15: 0000000000000000
> [13148.816564] </TASK>
>
> To fix it, we can just fix __ocfs2_find_empty_slot. But original commit
> introduced the feature to mount ocfs2 locally even it is cluster based,
> that is a very dangerous, it can easily cause serious data corruption,
> there is no way to stop other nodes mounting the fs and corrupting it.
> Setup ha or other cluster-aware stack is just the cost that we have to
> take for avoiding corruption, otherwise we have to do it in kernel.
>
> Link: https://lkml.kernel.org/r/20220603222801.42488-1-junxiao.bi@oracle.com
> Fixes: 912f655d78c5("ocfs2: mount shared volume without ha stack")
> Signed-off-by: Junxiao Bi <junxiao.bi at oracle.com>
> Acked-by: Joseph Qi <joseph.qi at linux.alibaba.com>
> Cc: Mark Fasheh <mark at fasheh.com>
> Cc: Joel Becker <jlbec at evilplan.org>
> Cc: Changwei Ge <gechangwei at live.cn>
> Cc: Gang He <ghe at suse.com>
> Cc: Jun Piao <piaojun at huawei.com>
> Cc: <heming.zhao at suse.com>
> Cc: <stable at vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
> ---
>
> fs/ocfs2/ocfs2.h | 4 ---
> fs/ocfs2/slot_map.c | 46 +++++++++++++++++-------------------------
> fs/ocfs2/super.c | 21 -------------------
> 3 files changed, 20 insertions(+), 51 deletions(-)
>
> --- a/fs/ocfs2/ocfs2.h~revert-ocfs2-mount-shared-volume-without-ha-stack
> +++ a/fs/ocfs2/ocfs2.h
> @@ -277,7 +277,6 @@ enum ocfs2_mount_options
> OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT = 1 << 15, /* Journal Async Commit */
> OCFS2_MOUNT_ERRORS_CONT = 1 << 16, /* Return EIO to the calling process on error */
> OCFS2_MOUNT_ERRORS_ROFS = 1 << 17, /* Change filesystem to read-only on error */
> - OCFS2_MOUNT_NOCLUSTER = 1 << 18, /* No cluster aware filesystem mount */
> };
>
> #define OCFS2_OSB_SOFT_RO 0x0001
> @@ -673,8 +672,7 @@ static inline int ocfs2_cluster_o2cb_glo
>
> static inline int ocfs2_mount_local(struct ocfs2_super *osb)
> {
> - return ((osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT)
> - || (osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER));
> + return (osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT);
> }
>
> static inline int ocfs2_uses_extended_slot_map(struct ocfs2_super *osb)
> --- a/fs/ocfs2/slot_map.c~revert-ocfs2-mount-shared-volume-without-ha-stack
> +++ a/fs/ocfs2/slot_map.c
> @@ -252,16 +252,14 @@ static int __ocfs2_find_empty_slot(struc
> int i, ret = -ENOSPC;
>
> if ((preferred >= 0) && (preferred < si->si_num_slots)) {
> - if (!si->si_slots[preferred].sl_valid ||
> - !si->si_slots[preferred].sl_node_num) {
> + if (!si->si_slots[preferred].sl_valid) {
> ret = preferred;
> goto out;
> }
> }
>
> for(i = 0; i < si->si_num_slots; i++) {
> - if (!si->si_slots[i].sl_valid ||
> - !si->si_slots[i].sl_node_num) {
> + if (!si->si_slots[i].sl_valid) {
> ret = i;
> break;
> }
> @@ -456,30 +454,24 @@ int ocfs2_find_slot(struct ocfs2_super *
> spin_lock(&osb->osb_lock);
> ocfs2_update_slot_info(si);
>
> - if (ocfs2_mount_local(osb))
> - /* use slot 0 directly in local mode */
> - slot = 0;
> - else {
> - /* search for ourselves first and take the slot if it already
> - * exists. Perhaps we need to mark this in a variable for our
> - * own journal recovery? Possibly not, though we certainly
> - * need to warn to the user */
> - slot = __ocfs2_node_num_to_slot(si, osb->node_num);
> + /* search for ourselves first and take the slot if it already
> + * exists. Perhaps we need to mark this in a variable for our
> + * own journal recovery? Possibly not, though we certainly
> + * need to warn to the user */
> + slot = __ocfs2_node_num_to_slot(si, osb->node_num);
> + if (slot < 0) {
> + /* if no slot yet, then just take 1st available
> + * one. */
> + slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
> if (slot < 0) {
> - /* if no slot yet, then just take 1st available
> - * one. */
> - slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
> - if (slot < 0) {
> - spin_unlock(&osb->osb_lock);
> - mlog(ML_ERROR, "no free slots available!\n");
> - status = -EINVAL;
> - goto bail;
> - }
> - } else
> - printk(KERN_INFO "ocfs2: Slot %d on device (%s) was "
> - "already allocated to this node!\n",
> - slot, osb->dev_str);
> - }
> + spin_unlock(&osb->osb_lock);
> + mlog(ML_ERROR, "no free slots available!\n");
> + status = -EINVAL;
> + goto bail;
> + }
> + } else
> + printk(KERN_INFO "ocfs2: Slot %d on device (%s) was already "
> + "allocated to this node!\n", slot, osb->dev_str);
>
> ocfs2_set_slot(si, slot, osb->node_num);
> osb->slot_num = slot;
> --- a/fs/ocfs2/super.c~revert-ocfs2-mount-shared-volume-without-ha-stack
> +++ a/fs/ocfs2/super.c
> @@ -172,7 +172,6 @@ enum {
> Opt_dir_resv_level,
> Opt_journal_async_commit,
> Opt_err_cont,
> - Opt_nocluster,
> Opt_err,
> };
>
> @@ -206,7 +205,6 @@ static const match_table_t tokens = {
> {Opt_dir_resv_level, "dir_resv_level=%u"},
> {Opt_journal_async_commit, "journal_async_commit"},
> {Opt_err_cont, "errors=continue"},
> - {Opt_nocluster, "nocluster"},
> {Opt_err, NULL}
> };
>
> @@ -618,13 +616,6 @@ static int ocfs2_remount(struct super_bl
> goto out;
> }
>
> - tmp = OCFS2_MOUNT_NOCLUSTER;
> - if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
> - ret = -EINVAL;
> - mlog(ML_ERROR, "Cannot change nocluster option on remount\n");
> - goto out;
> - }
> -
> tmp = OCFS2_MOUNT_HB_LOCAL | OCFS2_MOUNT_HB_GLOBAL |
> OCFS2_MOUNT_HB_NONE;
> if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
> @@ -865,7 +856,6 @@ static int ocfs2_verify_userspace_stack(
> }
>
> if (ocfs2_userspace_stack(osb) &&
> - !(osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
> strncmp(osb->osb_cluster_stack, mopt->cluster_stack,
> OCFS2_STACK_LABEL_LEN)) {
> mlog(ML_ERROR,
> @@ -1137,11 +1127,6 @@ static int ocfs2_fill_super(struct super
> osb->s_mount_opt & OCFS2_MOUNT_DATA_WRITEBACK ? "writeback" :
> "ordered");
>
> - if ((osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
> - !(osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT))
> - printk(KERN_NOTICE "ocfs2: The shared device (%s) is mounted "
> - "without cluster aware mode.\n", osb->dev_str);
> -
> atomic_set(&osb->vol_state, VOLUME_MOUNTED);
> wake_up(&osb->osb_mount_event);
>
> @@ -1452,9 +1437,6 @@ static int ocfs2_parse_options(struct su
> case Opt_journal_async_commit:
> mopt->mount_opt |= OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT;
> break;
> - case Opt_nocluster:
> - mopt->mount_opt |= OCFS2_MOUNT_NOCLUSTER;
> - break;
> default:
> mlog(ML_ERROR,
> "Unrecognized mount option \"%s\" "
> @@ -1566,9 +1548,6 @@ static int ocfs2_show_options(struct seq
> if (opts & OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT)
> seq_printf(s, ",journal_async_commit");
>
> - if (opts & OCFS2_MOUNT_NOCLUSTER)
> - seq_printf(s, ",nocluster");
> -
> return 0;
> }
>
> _
>
> Patches currently in -mm which might be from ocfs2-devel at oss.oracle.com are
>
> revert-ocfs2-mount-shared-volume-without-ha-stack.patch
>
More information about the Ocfs2-devel
mailing list