[Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally

Sun Jun 12 12:38:45 UTC 2022

On 6/12/22 3:45 PM, heming.zhao at suse.com wrote:
> On 6/12/22 10:57, Joseph Qi wrote:
>>
>>
>> On 5/21/22 6:14 PM, Heming Zhao wrote:
>>> When la state is ENABLE, ocfs2_recalc_la_window restores la window
>>> unconditionally. The logic is wrong.
>>>
>>> Let's image below path.
>>>
>>> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
>>>
>>> 2. About 30s (OCFS2_LA_ENABLE_INTERVAL), delayed work is triggered,
>>>     ocfs2_la_enable_worker set la state to ENABLED directly.
>>>
>>> 3. a write IOs thread run:
>>>
>>>     ```
>>>     ocfs2_write_begin
>>>      ...
>>>       ocfs2_lock_allocators
>>>        ocfs2_reserve_clusters
>>>         ocfs2_reserve_clusters_with_limit
>>>          ocfs2_reserve_local_alloc_bits
>>>           ocfs2_local_alloc_slide_window // [1]
>>>            + ocfs2_recalc_la_window(osb, OCFS2_LA_EVENT_SLIDE) // [2]
>>>            + ...
>>>            + ocfs2_local_alloc_new_window
>>>               ocfs2_claim_clusters // [3]
>>>     ```
>>>
>>> [1]: will be called when la window bits used up.
>>> [2]: under la state is ENABLED (eg OCFS2_LA_ENABLE_INTERVAL delayed work
>>>       happened), it unconditionally restores la window to default value.
>>> [3]: will use default la window size to search clusters. IMO the timing
>>>       is O(n^4). The timing O(n^4) will cost huge time to scan global
>>>       bitmap. It makes write IOs (eg user space 'dd') become dramatically
>>>       slow.
>>>
>>> i.e.
>>> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
>>> la window default size: 106MB.
>>> The partition is fragmentation by creating & deleting huge mount of
>>> small file.
>>>
>>> the timing should be (the number got from real world):
>>> - la window size change order (size: MB):
>>>    106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>>>    only 0.8MB succeed, 0.8MB also triggers la window to disable.
>>>    ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
>>>    runs in worst case.
>>> - group chain number: 242
>>>    ocfs2_claim_suballoc_bits calls for-loop 242 times
>>> - each chain has 49 block group
>>>    ocfs2_search_chain calls while-loop 49 times
>>> - each bg has 32256 blocks
>>>    ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>>>    for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
>>>    (32256/64) for timing calucation.
>>>
>>> So the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
>>>
>>> In the worst case, user space writes 100MB data will trigger 42M scanning
>>> times, and if the write can't finish within 30s (OCFS2_LA_ENABLE_INTERVAL),
>>> the write IO will suffer another 42M scanning times. It makes the ocfs2
>>> partition keep pool performance all the time.
>>>
>>> The fix method:
>>>
>>> 1.  la restores double la size once.
>>>
>>> current code logic decrease la window with half size once, but directly
>>> restores default_bits one time. It bounces the la window between '<1M'
>>> and default_bits. This patch makes restoring process more smoothly.
>>> eg.
>>> la default window is 106MB, current la window is 13MB.
>>> when there is a free action to release one block group space, la should
>>> roll back la size to 26MB (by 13*2).
>>> if there are many free actions to release many block group space, la
>>> will smoothly roll back to default window (106MB).
>>>
>>> 2. introduced a new state: OCFS2_LA_RESTORE.
>>>
>>> Current code uses OCFS2_LA_ENABLED to mark a new big space available.
>>> the state overwrite OCFS2_LA_THROTTLED, it makes la window forget
>>> it's already in throttled status.
>>> '->local_alloc_state' should keep OCFS2_LA_THROTTLED until la window
>>> restore to default_bits.
>>
>> Since now we have enough free space, why not restore to default la
>> window?
> 
> The key is: the decrease speed is not same with increase speed.
> The decrease la window happens on the current la window size is not available any more.
> But current restore action happens on there appears any one of default_bits space.
> e.g:
> the default la window is 100MB, current system only has some 20MB contiguous space.
> la window change to half size (10MB) when there is no 20MB space any more.
> but current code restore logic will restore to 100MB. and allocation path will
> suffer the O(n^4) timing.
> 
> From my understanding, most of the ocfs2 users use ocfs2 to manage big & not huge
> number of files. But my patch response scenario: ocfs2 volume contains huge number
> of small files. user case is creating/deleting/moving the small files all the time.
> It makes the fs fragmentation totally.
> 

Yes, typically scenario is vm images with cluster size 1M.
This is a corny talk and I'm afraid it still cannot resolve above case
with optimized local alloc window.

>>
>> This is an issue happened in a corner case, which blames current restore
>> window is too large. I agree with your method that restoring double la
>> size once instead of default directly. So why not just change the logic
>> of ocfs2_recalc_la_window() to do this?
> 
> This path is related with two issues:
> - restore speed more quick than decrease speed.
> - unconditionally restore la window
> 
> only change ocfs2_recalc_la_window() can't avoid unconditionally restore la window.

Seems the following will restore double each time?

if (osb->local_alloc_state != OCFS2_LA_THROTTLED)
	osb->local_alloc_bits <<= 1;

This may introduce another issue which blames restore too slow. So it's a
balance.

Thanks,
Joseph

> so I introduced new state OCFS2_LA_RESTORE, which will help ocfs2 to keep throttled
> state.
> 
> btw, during I investigating this bug, I found other two issues/works need to do:
> - current allocation algorithm is very easy to generate fragment.
> - there is O(n^4) timing. even after with this patch, the timing only becomes O(n^3):
>   <group chain number> * <block group per chain> * <bg bitmap size>
> we needs to improve the allocation algorithm to make ocfs2 more power.
> 
> Thanks,
> Heming