[Ocfs2-devel] [PATCH 1/1] ocfs2: fix ocfs2_find_slot repeats alloc same slot issue
Joseph Qi
joseph.qi at linux.alibaba.com
Sun Jun 12 14:16:02 UTC 2022
Hi,
Why can't we use local mount? I don't remember if we discussed this.
Thanks,
Joseph
On 6/8/22 6:48 PM, Heming Zhao wrote:
> The commit log below is copied from Junxiao's patch:
> https://oss.oracle.com/pipermail/ocfs2-devel/2022-June/000107.html
>
> Junxiao planned to revert commit 912f655d78c5 ("ocfs2: mount shared volume
> without ha stack"), but the maintainer and I preferred to keep the feature
> and fix this bug.
>
> -------------------------- snip --------------------------
> This commit introduced a regression that can cause a mount hang.
> The change in __ocfs2_find_empty_slot() allows any node with a
> non-zero node number to grab the slot that was already taken by
> node 0, so node 1 ends up using the same journal as node 0. When
> node 1 tries to grab the journal cluster lock, it hangs because the
> lock is already held by node 0.
> This is very easy to reproduce: in one cluster, mount node 0 first,
> then node 1, and you will see the following call trace on node 1.
>
> [13148.735424] INFO: task mount.ocfs2:53045 blocked for more than 122 seconds.
> [13148.739691] Not tainted 5.15.0-2148.0.4.el8uek.mountracev2.x86_64 #2
> [13148.742560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [13148.745846] task:mount.ocfs2 state:D stack: 0 pid:53045 ppid: 53044 flags:0x00004000
> [13148.749354] Call Trace:
> [13148.750718] <TASK>
> [13148.752019] ? usleep_range+0x90/0x89
> [13148.753882] __schedule+0x210/0x567
> [13148.755684] schedule+0x44/0xa8
> [13148.757270] schedule_timeout+0x106/0x13c
> [13148.759273] ? __prepare_to_swait+0x53/0x78
> [13148.761218] __wait_for_common+0xae/0x163
> [13148.763144] __ocfs2_cluster_lock.constprop.0+0x1d6/0x870 [ocfs2]
> [13148.765780] ? ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
> [13148.768312] ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
> [13148.770968] ocfs2_journal_init+0x91/0x340 [ocfs2]
> [13148.773202] ocfs2_check_volume+0x39/0x461 [ocfs2]
> [13148.775401] ? iput+0x69/0xba
> [13148.777047] ocfs2_mount_volume.isra.0.cold+0x40/0x1f5 [ocfs2]
> [13148.779646] ocfs2_fill_super+0x54b/0x853 [ocfs2]
> [13148.781756] mount_bdev+0x190/0x1b7
> [13148.783443] ? ocfs2_remount+0x440/0x440 [ocfs2]
> [13148.785634] legacy_get_tree+0x27/0x48
> [13148.787466] vfs_get_tree+0x25/0xd0
> [13148.789270] do_new_mount+0x18c/0x2d9
> [13148.791046] __x64_sys_mount+0x10e/0x142
> [13148.792911] do_syscall_64+0x3b/0x89
> [13148.794667] entry_SYSCALL_64_after_hwframe+0x170/0x0
> [13148.797051] RIP: 0033:0x7f2309f6e26e
> [13148.798784] RSP: 002b:00007ffdcee7d408 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
> [13148.801974] RAX: ffffffffffffffda RBX: 00007ffdcee7d4a0 RCX: 00007f2309f6e26e
> [13148.804815] RDX: 0000559aa762a8ae RSI: 0000559aa939d340 RDI: 0000559aa93a22b0
> [13148.807719] RBP: 00007ffdcee7d5b0 R08: 0000559aa93a2290 R09: 00007f230a0b4820
> [13148.810659] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcee7d420
> [13148.813609] R13: 0000000000000000 R14: 0000559aa939f000 R15: 0000000000000000
> [13148.816564] </TASK>
>
> To fix it, we could just fix __ocfs2_find_empty_slot(). But the original
> commit introduced the ability to mount an ocfs2 volume locally even though
> it is cluster based, which is very dangerous: it can easily cause serious
> data corruption, because there is no way to stop other nodes from mounting
> the fs and corrupting it. Setting up HA or another cluster-aware stack is
> simply the cost we have to pay to avoid corruption; otherwise we would have
> to enforce it in the kernel.
> -------------------------- snip --------------------------
>
> ** analysis **
>
> In Junxiao's call trace, the 'if' check in __ocfs2_find_empty_slot() is
> wrong: sl_node_num can legitimately be 0 in an o2cb environment, so a slot
> owned by node 0 looks empty.
>
> With the current information, the likely trigger flow is:
> 1>
> node1 mounts with 'node_num = 0'; the mount succeeds.
> At this point the slotmap extent block contains es_valid:1 &
> es_node_num:0 for node1, and then ocfs2_update_disk_slot() writes the
> slotmap info back to disk.
>
> 2>
> Then node2 mounts with 'node_num = 1':
>
> ocfs2_find_slot
> + ocfs2_update_slot_info //read slotmap info from disk
> | + set si->si_slots[0].es_valid = 1 & si->si_slots[0].sl_node_num = 0
> |
> + __ocfs2_node_num_to_slot //will return -ENOENT.
> __ocfs2_find_empty_slot
> + searching the preferred slot (node_num:1) fails
> + 'si->si_slots[0].sl_node_num' is 0 (false), which triggers the 'break' condition
> + slot 0 is returned, so node2 grabs node1's journal dlm lock and hangs
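>
> For illustration only, here is a standalone user-space sketch (not kernel
> code; the struct fields and function name merely mirror the pre-patch check
> in fs/ocfs2/slot_map.c that the diff below removes) showing why node2 is
> handed the slot that node 0 already owns:
>
> /* simulate the pre-patch __ocfs2_find_empty_slot() check */
> #include <stdio.h>
>
> struct slot { int sl_valid; unsigned int sl_node_num; };
>
> static int old_find_empty_slot(struct slot *slots, int num_slots, int preferred)
> {
>         int i, ret = -1;
>
>         if (preferred >= 0 && preferred < num_slots) {
>                 /* pre-patch: a valid slot owned by node 0 still looks empty */
>                 if (!slots[preferred].sl_valid || !slots[preferred].sl_node_num)
>                         return preferred;
>         }
>         for (i = 0; i < num_slots; i++) {
>                 if (!slots[i].sl_valid || !slots[i].sl_node_num) {
>                         ret = i;
>                         break;
>                 }
>         }
>         return ret;
> }
>
> int main(void)
> {
>         /* slot 0 is valid and owned by node1 (node_num = 0); slot 1 is free */
>         struct slot slots[2] = { { 1, 0 }, { 0, 0 } };
>
>         /* node2 has no preferred slot (-1 stands in for "none"); the scan
>          * still returns slot 0 because sl_node_num == 0 makes it look unused */
>         printf("node2 gets slot %d\n", old_find_empty_slot(slots, 2, -1));
>         return 0;
> }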
>
> ** how to fix **
>
> To simplify the code logic, we make a rule:
> if the last mount was not cleanly unmounted (e.g. a crash happened), the
> next mount MUST use the same mount type. (A small decision sketch follows
> the case list below.)
>
> All possible cases when entering ocfs2_find_slot():
>
> 1. all slots are empty: a [cluster|nocluster] mode mount will succeed.
>    This is the clean ocfs2 volume case.
>
> 2. for a nocluster mount
>    - slot 0 is empty, but another slot is not empty:
>      - mount failure (the volume is in use by a clustered env; deny the mixed mount)
>    - slot 0 is not empty, all other slots are empty:
>      - slot 0 is nocluster type: mount success (the last mount may have crashed)
>      - slot 0 is cluster type: mount failure (deny the mixed mount)
>
> 3. for a cluster mount
>    - slot 0 is empty, but another slot is not empty:
>      - mount success
>    - slot 0 is not empty, all other slots are empty:
>      - slot 0 is nocluster type: mount failure (deny the mixed mount)
>      - slot 0 is cluster type: mount success (the last mount may have crashed)
>
> The above in simplified form:
> 1.
> clean partition => nocluster/cluster at any_node - success
>
> 2.
> cluster at any_node => nocluster at any_node - failure
> nocluster at node1 => crash => nocluster at node1 - success
> nocluster at node2 => nocluster at node1 - success [*]
> cluster at any_node => crash => nocluster at any_node - failure
>
> 3.
> cluster at any_node => cluster at any_node - success
> cluster at any_node => crash => cluster at any_node - success
> nocluster at any_node => crash => cluster at any_node - failure
>
> [*]: this is the only case that risks data corruption. We allow it because:
> - node2 may have crashed or be broken, and can no longer boot up.
> - node2 may be unable to access the ocfs2 volume, which then must be
>   managed from another node.
> - mount.ocfs2 will give a special warning for this case.
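>
> A minimal sketch of the rule above (not part of the patch; the enum and
> function names are invented here for illustration, where slot_state
> describes what the on-disk slot map shows: no used slots, a clustered
> mount's slot in use, or slot 0 in use by a noclustered mount):
>
> #include <stdio.h>
> #include <stdbool.h>
>
> enum slot_state { SLOT_EMPTY, SLOT_CLUSTER, SLOT_NOCLUSTER };
>
> /* true if the mount may proceed, false if it must be denied */
> static bool mount_allowed(bool local_mount, enum slot_state map)
> {
>         if (map == SLOT_EMPTY)
>                 return true;                  /* case 1: clean volume */
>         if (local_mount)
>                 return map == SLOT_NOCLUSTER; /* case 2: deny mixed mount */
>         return map != SLOT_NOCLUSTER;         /* case 3: deny mixed mount */
> }
>
> int main(void)
> {
>         printf("cluster map, nocluster mount: %s\n",
>                mount_allowed(true, SLOT_CLUSTER) ? "allowed" : "denied");
>         printf("nocluster map, cluster mount: %s\n",
>                mount_allowed(false, SLOT_NOCLUSTER) ? "allowed" : "denied");
>         printf("empty map, any mount: %s\n",
>                mount_allowed(true, SLOT_EMPTY) ? "allowed" : "denied");
>         return 0;
> }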
>
> Fixes: 912f655d78c5 ("ocfs2: mount shared volume without ha stack")
> Reported-by: Junxiao Bi <junxiao.bi at oracle.com>
> Signed-off-by: Heming Zhao <heming.zhao at suse.com>
> ---
> fs/ocfs2/dlmglue.c | 3 ++
> fs/ocfs2/ocfs2_fs.h | 3 ++
> fs/ocfs2/slot_map.c | 70 ++++++++++++++++++++++++++++++++++++---------
> 3 files changed, 62 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 801e60bab955..6b017ae46145 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -3403,6 +3403,9 @@ void ocfs2_dlm_shutdown(struct ocfs2_super *osb,
> ocfs2_lock_res_free(&osb->osb_nfs_sync_lockres);
> ocfs2_lock_res_free(&osb->osb_orphan_scan.os_lockres);
>
> + if (ocfs2_mount_local(osb))
> + return;
> +
> ocfs2_cluster_disconnect(osb->cconn, hangup_pending);
> osb->cconn = NULL;
>
> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
> index 638d875eccc7..4fe42e638309 100644
> --- a/fs/ocfs2/ocfs2_fs.h
> +++ b/fs/ocfs2/ocfs2_fs.h
> @@ -534,6 +534,9 @@ struct ocfs2_slot_map {
> */
> };
>
> +#define OCFS2_SLOTMAP_CLUSTER 1
> +#define OCFS2_SLOTMAP_NOCLUSTER 2
> +
> struct ocfs2_extended_slot {
> /*00*/ __u8 es_valid;
> __u8 es_reserved1[3];
> diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c
> index 0b0ae3ebb0cf..a5c06d3ecb27 100644
> --- a/fs/ocfs2/slot_map.c
> +++ b/fs/ocfs2/slot_map.c
> @@ -26,7 +26,7 @@
>
>
> struct ocfs2_slot {
> - int sl_valid;
> + u8 sl_valid;
> unsigned int sl_node_num;
> };
>
> @@ -52,11 +52,11 @@ static void ocfs2_invalidate_slot(struct ocfs2_slot_info *si,
> }
>
> static void ocfs2_set_slot(struct ocfs2_slot_info *si,
> - int slot_num, unsigned int node_num)
> + int slot_num, unsigned int node_num, u8 valid)
> {
> BUG_ON((slot_num < 0) || (slot_num >= si->si_num_slots));
>
> - si->si_slots[slot_num].sl_valid = 1;
> + si->si_slots[slot_num].sl_valid = valid;
> si->si_slots[slot_num].sl_node_num = node_num;
> }
>
> @@ -75,7 +75,8 @@ static void ocfs2_update_slot_info_extended(struct ocfs2_slot_info *si)
> i++, slotno++) {
> if (se->se_slots[i].es_valid)
> ocfs2_set_slot(si, slotno,
> - le32_to_cpu(se->se_slots[i].es_node_num));
> + le32_to_cpu(se->se_slots[i].es_node_num),
> + le32_to_cpu(se->se_slots[i].es_valid));
> else
> ocfs2_invalidate_slot(si, slotno);
> }
> @@ -97,7 +98,7 @@ static void ocfs2_update_slot_info_old(struct ocfs2_slot_info *si)
> if (le16_to_cpu(sm->sm_slots[i]) == (u16)OCFS2_INVALID_SLOT)
> ocfs2_invalidate_slot(si, i);
> else
> - ocfs2_set_slot(si, i, le16_to_cpu(sm->sm_slots[i]));
> + ocfs2_set_slot(si, i, le16_to_cpu(sm->sm_slots[i]), OCFS2_SLOTMAP_CLUSTER);
> }
> }
>
> @@ -252,16 +253,14 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
> int i, ret = -ENOSPC;
>
> if ((preferred >= 0) && (preferred < si->si_num_slots)) {
> - if (!si->si_slots[preferred].sl_valid ||
> - !si->si_slots[preferred].sl_node_num) {
> + if (!si->si_slots[preferred].sl_valid) {
> ret = preferred;
> goto out;
> }
> }
>
> for(i = 0; i < si->si_num_slots; i++) {
> - if (!si->si_slots[i].sl_valid ||
> - !si->si_slots[i].sl_node_num) {
> + if (!si->si_slots[i].sl_valid) {
> ret = i;
> break;
> }
> @@ -270,6 +269,20 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
> return ret;
> }
>
> +static int __ocfs2_find_used_slot(struct ocfs2_slot_info *si)
> +{
> + int i, ret = -ENOENT;
> +
> + for (i = 0; i < si->si_num_slots; i++) {
> + if (si->si_slots[i].sl_valid) {
> + ret = i;
> + break;
> + }
> + }
> +
> + return ret;
> +}
> +
> int ocfs2_node_num_to_slot(struct ocfs2_super *osb, unsigned int node_num)
> {
> int slot;
> @@ -449,17 +462,45 @@ int ocfs2_find_slot(struct ocfs2_super *osb)
> {
> int status;
> int slot;
> + int nocluster_mnt = 0;
> struct ocfs2_slot_info *si;
>
> si = osb->slot_info;
>
> spin_lock(&osb->osb_lock);
> ocfs2_update_slot_info(si);
> + slot = __ocfs2_find_used_slot(si);
> + if (slot == 0 && (si->si_slots[0].sl_valid == OCFS2_SLOTMAP_NOCLUSTER))
> + nocluster_mnt = 1;
>
> - if (ocfs2_mount_local(osb))
> - /* use slot 0 directly in local mode */
> - slot = 0;
> - else {
> + /*
> + * We set a rule:
> + * if last mount didn't do unmount, (eg: crash happened), the next mount
> + * MUST be same mount type.
> + */
> + if (ocfs2_mount_local(osb)) {
> + /* empty slotmap, or partition didn't unmount last time */
> + if ((slot == -ENOENT) || nocluster_mnt) {
> + /* use slot 0 directly in local mode */
> + slot = 0;
> + nocluster_mnt = 1;
> + } else {
> + spin_unlock(&osb->osb_lock);
> + mlog(ML_ERROR, "found clustered mount slot in noclustered env!\n");
> + mlog(ML_ERROR, "please clean slotmap info for mount.\n");
> + mlog(ML_ERROR, "eg. remount then unmount with clustered mode\n");
> + status = -EINVAL;
> + goto bail;
> + }
> + } else {
> + if (nocluster_mnt) {
> + spin_unlock(&osb->osb_lock);
> + mlog(ML_ERROR, "found noclustered mount slot in clustered env!\n");
> + mlog(ML_ERROR, "please clean slotmap info for mount.\n");
> + mlog(ML_ERROR, "eg. remount then unmount with noclustered mode\n");
> + status = -EINVAL;
> + goto bail;
> + }
> /* search for ourselves first and take the slot if it already
> * exists. Perhaps we need to mark this in a variable for our
> * own journal recovery? Possibly not, though we certainly
> @@ -481,7 +522,8 @@ int ocfs2_find_slot(struct ocfs2_super *osb)
> slot, osb->dev_str);
> }
>
> - ocfs2_set_slot(si, slot, osb->node_num);
> + ocfs2_set_slot(si, slot, osb->node_num, nocluster_mnt ?
> + OCFS2_SLOTMAP_NOCLUSTER : OCFS2_SLOTMAP_CLUSTER);
> osb->slot_num = slot;
> spin_unlock(&osb->osb_lock);
>