[Ocfs2-devel] + revert-ocfs2-mount-shared-volume-without-ha-stack.patch added to mm-hotfixes-unstable branch

Junxiao Bi junxiao.bi at oracle.com
Sun Jun 26 23:18:44 UTC 2022


Hi Andrew,

Thanks for merging it to mm.
I see the “From” is “ Junxiao Bi via Ocfs2-devel <ocfs2-devel at oss.oracle.com>”, can you help fix that?

Thanks,
Junxiao


> 在 2022年6月26日,下午1:21,Andrew Morton <akpm at linux-foundation.org> 写道:
> 
> 
> The patch titled
>     Subject: Revert "ocfs2: mount shared volume without ha stack"
> has been added to the -mm mm-hotfixes-unstable branch.  Its filename is
>     revert-ocfs2-mount-shared-volume-without-ha-stack.patch
> 
> This patch will shortly appear at
>     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/revert-ocfs2-mount-shared-volume-without-ha-stack.patch
> 
> This patch will later appear in the mm-hotfixes-unstable branch at
>    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> 
> Before you just go and hit "reply", please:
>   a) Consider who else should be cc'ed
>   b) Prefer to cc a suitable mailing list as well
>   c) Ideally: find the original patch on the mailing list and do a
>      reply-to-all to that, adding suitable additional cc's
> 
> *** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
> 
> The -mm tree is included into linux-next via the mm-everything
> branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> and is updated there every 2-3 working days
> 
> ------------------------------------------------------
> From: Junxiao Bi via Ocfs2-devel <ocfs2-devel at oss.oracle.com>
> Subject: Revert "ocfs2: mount shared volume without ha stack"
> Date: Fri, 3 Jun 2022 15:28:01 -0700
> 
> This reverts commit 912f655d78c5d4ad05eac287f23a435924df7144.
> 
> This commit introduced a regression that can cause mount hung.  The
> changes in __ocfs2_find_empty_slot causes that any node with none-zero
> node number can grab the slot that was already taken by node 0, so node 1
> will access the same journal with node 0, when it try to grab journal
> cluster lock, it will hung because it was already acquired by node 0. 
> It's very easy to reproduce this, in one cluster, mount node 0 first, then
> node 1, you will see the following call trace from node 1.
> 
> [13148.735424] INFO: task mount.ocfs2:53045 blocked for more than 122 seconds.
> [13148.739691]       Not tainted 5.15.0-2148.0.4.el8uek.mountracev2.x86_64 #2
> [13148.742560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [13148.745846] task:mount.ocfs2     state:D stack:    0 pid:53045 ppid: 53044 flags:0x00004000
> [13148.749354] Call Trace:
> [13148.750718]  <TASK>
> [13148.752019]  ? usleep_range+0x90/0x89
> [13148.753882]  __schedule+0x210/0x567
> [13148.755684]  schedule+0x44/0xa8
> [13148.757270]  schedule_timeout+0x106/0x13c
> [13148.759273]  ? __prepare_to_swait+0x53/0x78
> [13148.761218]  __wait_for_common+0xae/0x163
> [13148.763144]  __ocfs2_cluster_lock.constprop.0+0x1d6/0x870 [ocfs2]
> [13148.765780]  ? ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
> [13148.768312]  ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
> [13148.770968]  ocfs2_journal_init+0x91/0x340 [ocfs2]
> [13148.773202]  ocfs2_check_volume+0x39/0x461 [ocfs2]
> [13148.775401]  ? iput+0x69/0xba
> [13148.777047]  ocfs2_mount_volume.isra.0.cold+0x40/0x1f5 [ocfs2]
> [13148.779646]  ocfs2_fill_super+0x54b/0x853 [ocfs2]
> [13148.781756]  mount_bdev+0x190/0x1b7
> [13148.783443]  ? ocfs2_remount+0x440/0x440 [ocfs2]
> [13148.785634]  legacy_get_tree+0x27/0x48
> [13148.787466]  vfs_get_tree+0x25/0xd0
> [13148.789270]  do_new_mount+0x18c/0x2d9
> [13148.791046]  __x64_sys_mount+0x10e/0x142
> [13148.792911]  do_syscall_64+0x3b/0x89
> [13148.794667]  entry_SYSCALL_64_after_hwframe+0x170/0x0
> [13148.797051] RIP: 0033:0x7f2309f6e26e
> [13148.798784] RSP: 002b:00007ffdcee7d408 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
> [13148.801974] RAX: ffffffffffffffda RBX: 00007ffdcee7d4a0 RCX: 00007f2309f6e26e
> [13148.804815] RDX: 0000559aa762a8ae RSI: 0000559aa939d340 RDI: 0000559aa93a22b0
> [13148.807719] RBP: 00007ffdcee7d5b0 R08: 0000559aa93a2290 R09: 00007f230a0b4820
> [13148.810659] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcee7d420
> [13148.813609] R13: 0000000000000000 R14: 0000559aa939f000 R15: 0000000000000000
> [13148.816564]  </TASK>
> 
> To fix it, we can just fix __ocfs2_find_empty_slot.  But original commit
> introduced the feature to mount ocfs2 locally even it is cluster based,
> that is a very dangerous, it can easily cause serious data corruption,
> there is no way to stop other nodes mounting the fs and corrupting it. 
> Setup ha or other cluster-aware stack is just the cost that we have to
> take for avoiding corruption, otherwise we have to do it in kernel.
> 
> Link: https://lkml.kernel.org/r/20220603222801.42488-1-junxiao.bi@oracle.com
> Fixes: 912f655d78c5("ocfs2: mount shared volume without ha stack")
> Signed-off-by: Junxiao Bi <junxiao.bi at oracle.com>
> Acked-by: Joseph Qi <joseph.qi at linux.alibaba.com>
> Cc: Mark Fasheh <mark at fasheh.com>
> Cc: Joel Becker <jlbec at evilplan.org>
> Cc: Changwei Ge <gechangwei at live.cn>
> Cc: Gang He <ghe at suse.com>
> Cc: Jun Piao <piaojun at huawei.com>
> Cc: <heming.zhao at suse.com>
> Cc: <stable at vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
> ---
> 
> fs/ocfs2/ocfs2.h    |    4 ---
> fs/ocfs2/slot_map.c |   46 +++++++++++++++++-------------------------
> fs/ocfs2/super.c    |   21 -------------------
> 3 files changed, 20 insertions(+), 51 deletions(-)
> 
> --- a/fs/ocfs2/ocfs2.h~revert-ocfs2-mount-shared-volume-without-ha-stack
> +++ a/fs/ocfs2/ocfs2.h
> @@ -277,7 +277,6 @@ enum ocfs2_mount_options
>    OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT = 1 << 15,  /* Journal Async Commit */
>    OCFS2_MOUNT_ERRORS_CONT = 1 << 16, /* Return EIO to the calling process on error */
>    OCFS2_MOUNT_ERRORS_ROFS = 1 << 17, /* Change filesystem to read-only on error */
> -    OCFS2_MOUNT_NOCLUSTER = 1 << 18, /* No cluster aware filesystem mount */
> };
> 
> #define OCFS2_OSB_SOFT_RO    0x0001
> @@ -673,8 +672,7 @@ static inline int ocfs2_cluster_o2cb_glo
> 
> static inline int ocfs2_mount_local(struct ocfs2_super *osb)
> {
> -    return ((osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT)
> -        || (osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER));
> +    return (osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT);
> }
> 
> static inline int ocfs2_uses_extended_slot_map(struct ocfs2_super *osb)
> --- a/fs/ocfs2/slot_map.c~revert-ocfs2-mount-shared-volume-without-ha-stack
> +++ a/fs/ocfs2/slot_map.c
> @@ -252,16 +252,14 @@ static int __ocfs2_find_empty_slot(struc
>    int i, ret = -ENOSPC;
> 
>    if ((preferred >= 0) && (preferred < si->si_num_slots)) {
> -        if (!si->si_slots[preferred].sl_valid ||
> -            !si->si_slots[preferred].sl_node_num) {
> +        if (!si->si_slots[preferred].sl_valid) {
>            ret = preferred;
>            goto out;
>        }
>    }
> 
>    for(i = 0; i < si->si_num_slots; i++) {
> -        if (!si->si_slots[i].sl_valid ||
> -            !si->si_slots[i].sl_node_num) {
> +        if (!si->si_slots[i].sl_valid) {
>            ret = i;
>            break;
>        }
> @@ -456,30 +454,24 @@ int ocfs2_find_slot(struct ocfs2_super *
>    spin_lock(&osb->osb_lock);
>    ocfs2_update_slot_info(si);
> 
> -    if (ocfs2_mount_local(osb))
> -        /* use slot 0 directly in local mode */
> -        slot = 0;
> -    else {
> -        /* search for ourselves first and take the slot if it already
> -         * exists. Perhaps we need to mark this in a variable for our
> -         * own journal recovery? Possibly not, though we certainly
> -         * need to warn to the user */
> -        slot = __ocfs2_node_num_to_slot(si, osb->node_num);
> +    /* search for ourselves first and take the slot if it already
> +     * exists. Perhaps we need to mark this in a variable for our
> +     * own journal recovery? Possibly not, though we certainly
> +     * need to warn to the user */
> +    slot = __ocfs2_node_num_to_slot(si, osb->node_num);
> +    if (slot < 0) {
> +        /* if no slot yet, then just take 1st available
> +         * one. */
> +        slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
>        if (slot < 0) {
> -            /* if no slot yet, then just take 1st available
> -             * one. */
> -            slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
> -            if (slot < 0) {
> -                spin_unlock(&osb->osb_lock);
> -                mlog(ML_ERROR, "no free slots available!\n");
> -                status = -EINVAL;
> -                goto bail;
> -            }
> -        } else
> -            printk(KERN_INFO "ocfs2: Slot %d on device (%s) was "
> -                   "already allocated to this node!\n",
> -                   slot, osb->dev_str);
> -    }
> +            spin_unlock(&osb->osb_lock);
> +            mlog(ML_ERROR, "no free slots available!\n");
> +            status = -EINVAL;
> +            goto bail;
> +        }
> +    } else
> +        printk(KERN_INFO "ocfs2: Slot %d on device (%s) was already "
> +               "allocated to this node!\n", slot, osb->dev_str);
> 
>    ocfs2_set_slot(si, slot, osb->node_num);
>    osb->slot_num = slot;
> --- a/fs/ocfs2/super.c~revert-ocfs2-mount-shared-volume-without-ha-stack
> +++ a/fs/ocfs2/super.c
> @@ -172,7 +172,6 @@ enum {
>    Opt_dir_resv_level,
>    Opt_journal_async_commit,
>    Opt_err_cont,
> -    Opt_nocluster,
>    Opt_err,
> };
> 
> @@ -206,7 +205,6 @@ static const match_table_t tokens = {
>    {Opt_dir_resv_level, "dir_resv_level=%u"},
>    {Opt_journal_async_commit, "journal_async_commit"},
>    {Opt_err_cont, "errors=continue"},
> -    {Opt_nocluster, "nocluster"},
>    {Opt_err, NULL}
> };
> 
> @@ -618,13 +616,6 @@ static int ocfs2_remount(struct super_bl
>        goto out;
>    }
> 
> -    tmp = OCFS2_MOUNT_NOCLUSTER;
> -    if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
> -        ret = -EINVAL;
> -        mlog(ML_ERROR, "Cannot change nocluster option on remount\n");
> -        goto out;
> -    }
> -
>    tmp = OCFS2_MOUNT_HB_LOCAL | OCFS2_MOUNT_HB_GLOBAL |
>        OCFS2_MOUNT_HB_NONE;
>    if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
> @@ -865,7 +856,6 @@ static int ocfs2_verify_userspace_stack(
>    }
> 
>    if (ocfs2_userspace_stack(osb) &&
> -        !(osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
>        strncmp(osb->osb_cluster_stack, mopt->cluster_stack,
>            OCFS2_STACK_LABEL_LEN)) {
>        mlog(ML_ERROR,
> @@ -1137,11 +1127,6 @@ static int ocfs2_fill_super(struct super
>           osb->s_mount_opt & OCFS2_MOUNT_DATA_WRITEBACK ? "writeback" :
>           "ordered");
> 
> -    if ((osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
> -       !(osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT))
> -        printk(KERN_NOTICE "ocfs2: The shared device (%s) is mounted "
> -               "without cluster aware mode.\n", osb->dev_str);
> -
>    atomic_set(&osb->vol_state, VOLUME_MOUNTED);
>    wake_up(&osb->osb_mount_event);
> 
> @@ -1452,9 +1437,6 @@ static int ocfs2_parse_options(struct su
>        case Opt_journal_async_commit:
>            mopt->mount_opt |= OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT;
>            break;
> -        case Opt_nocluster:
> -            mopt->mount_opt |= OCFS2_MOUNT_NOCLUSTER;
> -            break;
>        default:
>            mlog(ML_ERROR,
>                 "Unrecognized mount option \"%s\" "
> @@ -1566,9 +1548,6 @@ static int ocfs2_show_options(struct seq
>    if (opts & OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT)
>        seq_printf(s, ",journal_async_commit");
> 
> -    if (opts & OCFS2_MOUNT_NOCLUSTER)
> -        seq_printf(s, ",nocluster");
> -
>    return 0;
> }
> 
> _
> 
> Patches currently in -mm which might be from ocfs2-devel at oss.oracle.com are
> 
> revert-ocfs2-mount-shared-volume-without-ha-stack.patch
> 


More information about the Ocfs2-devel mailing list