[Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally

Joseph Qi joseph.qi at linux.alibaba.com
Sun Jun 12 02:57:18 UTC 2022



On 5/21/22 6:14 PM, Heming Zhao wrote:
> When la state is ENABLE, ocfs2_recalc_la_window restores la window
> unconditionally. The logic is wrong.
> 
> Let's image below path.
> 
> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
> 
> 2. About 30s (OCFS2_LA_ENABLE_INTERVAL), delayed work is triggered,
>    ocfs2_la_enable_worker set la state to ENABLED directly.
> 
> 3. a write IOs thread run:
> 
>    ```
>    ocfs2_write_begin
>     ...
>      ocfs2_lock_allocators
>       ocfs2_reserve_clusters
>        ocfs2_reserve_clusters_with_limit
>         ocfs2_reserve_local_alloc_bits
>          ocfs2_local_alloc_slide_window // [1]
>           + ocfs2_recalc_la_window(osb, OCFS2_LA_EVENT_SLIDE) // [2]
>           + ...
>           + ocfs2_local_alloc_new_window
>              ocfs2_claim_clusters // [3]
>    ```
> 
> [1]: will be called when la window bits used up.
> [2]: under la state is ENABLED (eg OCFS2_LA_ENABLE_INTERVAL delayed work
>      happened), it unconditionally restores la window to default value.
> [3]: will use default la window size to search clusters. IMO the timing
>      is O(n^4). The timing O(n^4) will cost huge time to scan global
>      bitmap. It makes write IOs (eg user space 'dd') become dramatically
>      slow.
> 
> i.e.
> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
> la window default size: 106MB.
> The partition is fragmentation by creating & deleting huge mount of
> small file.
> 
> the timing should be (the number got from real world):
> - la window size change order (size: MB):
>   106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>   only 0.8MB succeed, 0.8MB also triggers la window to disable.
>   ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
>   runs in worst case.
> - group chain number: 242
>   ocfs2_claim_suballoc_bits calls for-loop 242 times
> - each chain has 49 block group
>   ocfs2_search_chain calls while-loop 49 times
> - each bg has 32256 blocks
>   ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>   for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
>   (32256/64) for timing calucation.
> 
> So the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
> 
> In the worst case, user space writes 100MB data will trigger 42M scanning
> times, and if the write can't finish within 30s (OCFS2_LA_ENABLE_INTERVAL),
> the write IO will suffer another 42M scanning times. It makes the ocfs2
> partition keep pool performance all the time.
> 
> The fix method:
> 
> 1.  la restores double la size once.
> 
> current code logic decrease la window with half size once, but directly
> restores default_bits one time. It bounces the la window between '<1M'
> and default_bits. This patch makes restoring process more smoothly.
> eg.
> la default window is 106MB, current la window is 13MB.
> when there is a free action to release one block group space, la should
> roll back la size to 26MB (by 13*2).
> if there are many free actions to release many block group space, la
> will smoothly roll back to default window (106MB).
> 
> 2. introduced a new state: OCFS2_LA_RESTORE.
> 
> Current code uses OCFS2_LA_ENABLED to mark a new big space available.
> the state overwrite OCFS2_LA_THROTTLED, it makes la window forget
> it's already in throttled status.
> '->local_alloc_state' should keep OCFS2_LA_THROTTLED until la window
> restore to default_bits.

Since now we have enough free space, why not restore to default la
window?

This is an issue happened in a corner case, which blames current restore
window is too large. I agree with your method that restoring double la
size once instead of default directly. So why not just change the logic
of ocfs2_recalc_la_window() to do this?

Thanks,
Joseph

> 
> Signed-off-by: Heming Zhao <heming.zhao at suse.com>
> ---
>  fs/ocfs2/localalloc.c | 30 +++++++++++++++++++++---------
>  fs/ocfs2/ocfs2.h      | 18 +++++++++++-------
>  fs/ocfs2/suballoc.c   |  2 +-
>  3 files changed, 33 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
> index c4426d12a2ad..28acea717d7f 100644
> --- a/fs/ocfs2/localalloc.c
> +++ b/fs/ocfs2/localalloc.c
> @@ -205,20 +205,21 @@ void ocfs2_la_set_sizes(struct ocfs2_super *osb, int requested_mb)
>  
>  static inline int ocfs2_la_state_enabled(struct ocfs2_super *osb)
>  {
> -	return (osb->local_alloc_state == 	 ||
> -		osb->local_alloc_state == OCFS2_LA_ENABLED);
> +	return osb->local_alloc_state & OCFS2_LA_ACTIVE;
>  }
>  
>  void ocfs2_local_alloc_seen_free_bits(struct ocfs2_super *osb,
>  				      unsigned int num_clusters)
>  {
>  	spin_lock(&osb->osb_lock);
> -	if (osb->local_alloc_state == OCFS2_LA_DISABLED ||
> -	    osb->local_alloc_state == OCFS2_LA_THROTTLED)
> +	if (osb->local_alloc_state & (OCFS2_LA_DISABLED |
> +		OCFS2_LA_THROTTLED | OCFS2_LA_RESTORE)) {
>  		if (num_clusters >= osb->local_alloc_default_bits) {
>  			cancel_delayed_work(&osb->la_enable_wq);
> -			osb->local_alloc_state = OCFS2_LA_ENABLED;
> +			osb->local_alloc_state &= ~OCFS2_LA_DISABLED;
> +			osb->local_alloc_state |= OCFS2_LA_RESTORE;
>  		}
> +	}
>  	spin_unlock(&osb->osb_lock);
>  }
>  
> @@ -228,7 +229,10 @@ void ocfs2_la_enable_worker(struct work_struct *work)
>  		container_of(work, struct ocfs2_super,
>  			     la_enable_wq.work);
>  	spin_lock(&osb->osb_lock);
> -	osb->local_alloc_state = OCFS2_LA_ENABLED;
> +	if (osb->local_alloc_state & OCFS2_LA_DISABLED) {
> +		osb->local_alloc_state &= ~OCFS2_LA_DISABLED;
> +		osb->local_alloc_state |= OCFS2_LA_ENABLED;
> +	}
>  	spin_unlock(&osb->osb_lock);
>  }
>  
> @@ -1067,7 +1071,7 @@ static int ocfs2_recalc_la_window(struct ocfs2_super *osb,
>  			 * reason to assume the bitmap situation might
>  			 * have changed.
>  			 */
> -			osb->local_alloc_state = OCFS2_LA_THROTTLED;
> +			osb->local_alloc_state |= OCFS2_LA_THROTTLED;
>  			osb->local_alloc_bits = bits;
>  		} else {
>  			osb->local_alloc_state = OCFS2_LA_DISABLED;
> @@ -1083,8 +1087,16 @@ static int ocfs2_recalc_la_window(struct ocfs2_super *osb,
>  	 * risk bouncing around the global bitmap during periods of
>  	 * low space.
>  	 */
> -	if (osb->local_alloc_state != OCFS2_LA_THROTTLED)
> -		osb->local_alloc_bits = osb->local_alloc_default_bits;
> +	if (osb->local_alloc_state & OCFS2_LA_RESTORE) {
> +		bits = osb->local_alloc_bits * 2;
> +		if (bits > osb->local_alloc_default_bits) {
> +			osb->local_alloc_bits = osb->local_alloc_default_bits;
> +			osb->local_alloc_state = OCFS2_LA_ENABLED;
> +		} else {
> +			/* keep RESTORE state & set new bits */
> +			osb->local_alloc_bits = bits;
> +		}
> +	}
>  
>  out_unlock:
>  	state = osb->local_alloc_state;
> diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
> index 337527571461..1764077e3229 100644
> --- a/fs/ocfs2/ocfs2.h
> +++ b/fs/ocfs2/ocfs2.h
> @@ -245,14 +245,18 @@ struct ocfs2_alloc_stats
>  
>  enum ocfs2_local_alloc_state
>  {
> -	OCFS2_LA_UNUSED = 0,	/* Local alloc will never be used for
> -				 * this mountpoint. */
> -	OCFS2_LA_ENABLED,	/* Local alloc is in use. */
> -	OCFS2_LA_THROTTLED,	/* Local alloc is in use, but number
> -				 * of bits has been reduced. */
> -	OCFS2_LA_DISABLED	/* Local alloc has temporarily been
> -				 * disabled. */
> +	/* Local alloc will never be used for this mountpoint. */
> +	OCFS2_LA_UNUSED = 1 << 0,
> +	/* Local alloc is in use. */
> +	OCFS2_LA_ENABLED = 1 << 1,
> +	/* Local alloc is in use, but number of bits has been reduced. */
> +	OCFS2_LA_THROTTLED = 1 << 2,
> +	/* In throttle state, Local alloc meets contig big space. */
> +	OCFS2_LA_RESTORE = 1 << 3,
> +	/* Local alloc has temporarily been disabled. */
> +	OCFS2_LA_DISABLED = 1 << 4,
>  };
> +#define OCFS2_LA_ACTIVE (OCFS2_LA_ENABLED | OCFS2_LA_THROTTLED | OCFS2_LA_RESTORE)
>  
>  enum ocfs2_mount_options
>  {
> diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
> index 166c8918c825..b0df1ab2d6dd 100644
> --- a/fs/ocfs2/suballoc.c
> +++ b/fs/ocfs2/suballoc.c
> @@ -1530,7 +1530,7 @@ static int ocfs2_cluster_group_search(struct inode *inode,
>  		 * of bits. */
>  		if (min_bits <= res->sr_bits)
>  			search = 0; /* success */
> -		else if (res->sr_bits) {
> +		if (res->sr_bits) {
>  			/*
>  			 * Don't show bits which we'll be returning
>  			 * for allocation to the local alloc bitmap.



More information about the Ocfs2-devel mailing list