[Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally
Joseph Qi
joseph.qi at linux.alibaba.com
Thu Jun 2 10:02:19 UTC 2022
On 5/21/22 6:14 PM, Heming Zhao wrote:
> When the la state is ENABLED, ocfs2_recalc_la_window restores the la window
> unconditionally. This logic is wrong.
>
> Let's imagine the path below.
>
> 1. la state (->local_alloc_state) is set to THROTTLED or DISABLED.
>
> 2. After about 30s (OCFS2_LA_ENABLE_INTERVAL), the delayed work is triggered
> and ocfs2_la_enable_worker sets the la state to ENABLED directly.
>
> 3. A write IO thread runs:
>
> ```
> ocfs2_write_begin
> ...
> ocfs2_lock_allocators
> ocfs2_reserve_clusters
> ocfs2_reserve_clusters_with_limit
> ocfs2_reserve_local_alloc_bits
> ocfs2_local_alloc_slide_window // [1]
> + ocfs2_recalc_la_window(osb, OCFS2_LA_EVENT_SLIDE) // [2]
> + ...
> + ocfs2_local_alloc_new_window
> ocfs2_claim_clusters // [3]
> ```
>
> [1]: is called when the la window bits are used up.
> [2]: when the la state is ENABLED (e.g. after the OCFS2_LA_ENABLE_INTERVAL
> delayed work has run), it unconditionally restores the la window to its
> default size.
> [3]: uses the default la window size to search for clusters. IMO the time
> complexity is O(n^4), so scanning the global bitmap takes a huge amount
> of time. It makes write IOs (e.g. user space 'dd') dramatically slow.
>
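Steps 1 to 3 and note [2] can be illustrated with a toy model. This is a minimal Python sketch built only from the description above, not the kernel code: the class `ToyLocalAlloc`, its method names, and the use of MB instead of cluster bits are all assumptions made for illustration.

```python
from enum import Enum


class LaState(Enum):
    ENABLED = "enabled"
    THROTTLED = "throttled"
    DISABLED = "disabled"


class ToyLocalAlloc:
    """Toy model of the la window behaviour described in this thread.

    Hypothetical names; sizes are in MB rather than cluster bits.
    """

    def __init__(self, default_mb: float) -> None:
        self.default_mb = default_mb
        self.window_mb = default_mb
        self.state = LaState.ENABLED

    def throttle(self, window_mb: float) -> None:
        # Step 1: the window was shrunk and the state set to THROTTLED.
        self.window_mb = window_mb
        self.state = LaState.THROTTLED

    def enable_worker(self) -> None:
        # Step 2: the delayed work sets the state to ENABLED directly.
        self.state = LaState.ENABLED

    def slide_window(self) -> float:
        # Step 3, note [2]: with state ENABLED the window is restored
        # to its default size unconditionally.
        if self.state is LaState.ENABLED:
            self.window_mb = self.default_mb
        return self.window_mb


la = ToyLocalAlloc(default_mb=106)
la.throttle(window_mb=1.6)
size_while_throttled = la.slide_window()  # window stays small
la.enable_worker()                        # about 30s later
size_after_enable = la.slide_window()     # window jumps back to the default
print(size_while_throttled, size_after_enable)
```

In this model the window returns to 106MB as soon as the worker has run, even though nothing about the fragmented bitmap has changed, which is the behaviour the patch description objects to.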
> For example:
> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
> la window default size: 106MB.
> The partition is fragmented by creating and deleting a huge number of
> small files.
>
> the timing should be (the numbers come from a real-world case):
> - la window size change order (size: MB):
> 106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
> only 0.8MB succeeds; 0.8MB also triggers disabling of the la window.
> ocfs2_local_alloc_new_window retries 8 times; the first 7 attempts all
> run the worst case.
> - group chain number: 242
> ocfs2_claim_suballoc_bits runs its for-loop 242 times
> - each chain has 49 block groups
> ocfs2_search_chain runs its while-loop 49 times
> - each bg has 32256 blocks
> ocfs2_block_group_find_clear_bits runs its while-loop over 32256 bits.
> Since ocfs2_find_next_zero_bit uses ffz() to find a zero bit, let's use
> (32256/64) for the timing calculation.
>
> So the total loop count: 7*242*49*(32256/64) = 41835024 (~42 million iterations)
>
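The arithmetic above can be checked with a short script. This is a minimal Python sketch that only reproduces the quoted figures; the function and variable names are illustrative assumptions, and halving the window until it drops below 0.8MB is a simplification of the retry behaviour described above (the list in the mail rounds some of the intermediate sizes).

```python
DEFAULT_WINDOW_MB = 106   # la window default size from the example
SUCCESS_FLOOR_MB = 0.8    # approximate size at which allocation succeeded
CHAINS = 242              # group chain number
GROUPS_PER_CHAIN = 49     # block groups per chain
BITS_PER_GROUP = 32256    # bits scanned per block group
WORD_BITS = 64            # ffz() examines one 64-bit word at a time


def window_sizes(default_mb: float, floor_mb: float) -> list[float]:
    """Halve the window until it would drop below floor_mb."""
    sizes = []
    size = default_mb
    while size >= floor_mb:
        sizes.append(size)
        size /= 2
    return sizes


sizes = window_sizes(DEFAULT_WINDOW_MB, SUCCESS_FLOOR_MB)
failed_attempts = len(sizes) - 1          # every attempt except the last fails
words_per_group = BITS_PER_GROUP // WORD_BITS
total_iterations = failed_attempts * CHAINS * GROUPS_PER_CHAIN * words_per_group

print(sizes)
print(failed_attempts, words_per_group, total_iterations)
```

The script counts 8 window sizes and 7 failed attempts, and the product matches the 41835024 quoted above.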
> In the worst case, a user space write of 100MB of data triggers 42M scan
> iterations, and if the write can't finish within 30s (OCFS2_LA_ENABLE_INTERVAL),
> the write IO suffers another 42M scan iterations. This keeps the ocfs2
> partition at poor performance all the time.
>
The scenario makes sense.
I need to spend more time digging into the code, and then I will get back to you.
Thanks,
Joseph