[Ocfs2-devel] discuss about jbd2 assertion in defragment path

Heming Zhao heming.zhao at suse.com
Thu Feb 16 14:58:17 UTC 2023


On Wed, Feb 15, 2023 at 08:04:48PM +0800, Joseph Qi wrote:
> 
> 
> On 2/15/23 6:02 PM, Heming Zhao wrote:
> > On 2/15/23 2:42 PM, Joseph Qi wrote:
> >>
> >>
> >> On 2/15/23 2:20 PM, Heming Zhao wrote:
> >>> On 2/15/23 10:06 AM, Joseph Qi wrote:
> >>>>
> >>>>
> >>>> On 2/14/23 7:48 PM, Heming Zhao wrote:
> >>>>> On Tue, Feb 14, 2023 at 07:08:16PM +0800, Joseph Qi wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 2/14/23 12:33 PM, Heming Zhao wrote:
> >>>>>>> On Tue, Feb 14, 2023 at 10:52:30AM +0800, Joseph Qi wrote:
> >>>>>>>> Hi, sorry about the late reply.
> >>>>>>>> This thread is indeed from a long time ago :(
> >>>>>>>> It seems that I said the two ocfs2_journal_access_di() calls are for different
> >>>>>>>> buffer heads. Anyway, I will have to recall the earlier discussion and get back
> >>>>>>>> to you.
> >>>>>>>>
> >>>>>>> If you believe the range from [a] to [b.1] doesn't need any journal protection,
> >>>>>>> my patch makes sense in theory.
> >>>>>>>
> >>>>>>>   From the code logic, if the defrag path (ocfs2_split_extent) doesn't call
> >>>>>>> jbd2_journal_restart(), the current code is absolutely correct. Increasing the
> >>>>>>> credits can't help avoid the jbd2 assert crash; it actually makes the crash
> >>>>>>> easier to trigger, because it makes jbd2_journal_extend() more likely to return
> >>>>>>> "status > 0". In my view, the correct fix is to delete the access/dirty pair and
> >>>>>>> make sure ocfs2_split_extent() is journal self-service.
> >>>>>>>
> >>>>>>> The different-buffer-head issue could be handled by rechecking ocfs2_split_extent()
> >>>>>>> and making sure all journal handling branches are correct. I already checked
> >>>>>>> ocfs2_split_extent, but I can't guarantee my eyes didn't miss any corner case.
> >>>>>>
> >>>>>> ocfs2_split_extent is not just 'read' operations, it may grow the tree,
> >>>>>> that's why we have to do a transaction here.
> >>>>>
> >>>>> You misunderstand my meaning. What I mean is: from the caller __ocfs2_move_extent()
> >>>>> calling ocfs2_journal_access_di() ([a] below) to just before the handling of
> >>>>> ctxt.c_contig_type ([b.1] below) in ocfs2_split_extent, this area is purely read
> >>>>> operations. It's the range [ [a], [b.1]) (including [a], not including [b.1]).
> >>>>>
> >>>>> __ocfs2_move_extent
> >>>>>    + ocfs2_journal_access_di //[a]   <------+
> >>>>>    + ocfs2_split_extent      //[b]          | [a, b.1) area is read operations
> >>>>>    |  + if                   //[b.1] <------+
> >>>>>    |  |  A
> >>>>>    |  |  +---- from this line, this func starts r/w IOs & needs jbd2 operations
> >>>>>    |  |
> >>>>>    |  |   ocfs2_replace_extent_rec()/ocfs2_split_and_insert()
> >>>>>    |  + else
> >>>>>    |      ocfs2_try_to_merge_extent ()
> >>>>>    |
> >>>>>    + ocfs2_journal_dirty     //[c]
> >>>>>
> >>>>>>
> >>>>>> The whole code flow:
> >>>>>> ocfs2_ioctl_move_extents
> >>>>>>     ocfs2_move_extents
> >>>>>>       __ocfs2_move_extents_range
> >>>>>>         ocfs2_defrag_extent  [1]
> >>>>>>           ocfs2_start_trans  [a]
> >>>>>>           __ocfs2_move_extent
> >>>>>>             ocfs2_journal_access_di
> >>>>>>             ocfs2_split_extent
> >>>>>>             ocfs2_journal_dirty
> >>>>>>           ocfs2_commit_trans
> >>>>>>         ocfs2_move_extent  [2]
> >>>>>>           ocfs2_start_trans
> >>>>>>           __ocfs2_move_extent
> >>>>>>             ocfs2_journal_access_di
> >>>>>>             ocfs2_split_extent
> >>>>>>             ocfs2_journal_dirty
> >>>>>>           ocfs2_commit_trans
> >>>>>>       ocfs2_start_trans  [b]
> >>>>>>       ocfs2_journal_access_di
> >>>>>>       ocfs2_journal_dirty
> >>>>>>       ocfs2_commit_trans
> >>>>>>
> >>>>>> In the above, [a] and [b] are different transaction start/commit pairs, and
> >>>>>> each allocates different journal credits; even if they operate on the same
> >>>>>> bh, they are still different operations.
> >>>>>
> >>>>> I agree with the above, but it doesn't change my conclusion: ocfs2_split_extent()
> >>>>> is journal self-service.
> >>>>
> >>>> I don't understand what 'journal self-service' means.
> >>>
> >>> Sorry for my English skill; I invented the term 'journal self-service'.
> >>> The meaning is: this function can handle journal operations entirely by itself.
> >>> The caller doesn't need to call the access/dirty pair; the caller only needs to
> >>> call the jbd2 journal start/stop pair.
> >>>
> >>> Because ocfs2_start_trans() & ocfs2_commit_trans() are reentrant functions,
> >>> we could even put these two calls inside ocfs2_split_extent(). Then no caller
> >>> would need to take care of any journal operations when calling ocfs2_split_extent().
> >>>
> >>>>
> >>>>> It seems the author's intent was: he wanted [a] to protect the defragging actions
> >>>>> (extent changes) and [b] to protect the inode ctime changes. The
> >>>>> ocfs2_start_trans/ocfs2_commit_trans pair in the [a] area is necessary, but the
> >>>>> access/dirty pair in __ocfs2_move_extent() is a potential crash risk.
> >>>>>
> >>>> If so, what does transaction [a] protect? (no dirty buffer)
> >>>
> >>> Transaction [a] protects the dirty buffers in ocfs2_split_extent().
> >>> ocfs2_split_extent() already uses the access/dirty pair correctly to
> >>> mark/handle its dirty buffers.
> >>>
> >> Okay, I think I've gotten your idea now. Your change looks reasonable.
> >> Could you please also check whether move extents without auto defrag has
> >> the same issue?
> >>
> > 
> > I have finished checking the code. The ocfs2_move_extent() path has the same issue.
> > My patch also works fine for this code branch.
> > In theory, ocfs2_move_extent() has less chance to trigger the crash, because
> > userspace defrag.ocfs2 would very likely set 'struct ocfs2_move_extents : me_len'
> > to less than the file size, if defrag.ocfs2 supported this code branch.
> > 
> 
> If we write a simple program to do the move extents ioctl on the same ocfs2
> (without auto defrag), can we trigger this crash?

For this question: first, I modified defrag.ocfs2 to test the non-auto defrag flow,
and found three bugs. I will describe them at the end of this mail.

Second, I failed to trigger the crash with the non-auto defrag flow.

In theory, the non-auto and auto cases share the same defrag path from __ocfs2_move_extent().
Before I sent the mail that started this discussion, I had already tested for some days
and failed to trigger the crash. The crash itself happened last year; a SUSE customer
triggered it. For confidentiality reasons, I can't say too much about it.

From theoretical analysis, the crash derives from ocfs2_extend_trans() calling
jbd2_journal_restart() after jbd2_journal_extend() returns "status > 0".

In jbd2_journal_extend(), there are only two cases (<A> and <B>) that return a value > 0.

```<A>
/* Don't extend a locked-down transaction! */
if (transaction->t_state != T_RUNNING) {
    jbd2_debug(3, "denied handle %p %d blocks: "
          "transaction not running\n", handle, nblocks);
    goto error_out;
}
```

```<B>
if (wanted > journal->j_max_transaction_buffers) {
    jbd2_debug(3, "denied handle %p %d blocks: "
          "transaction too large\n", handle, nblocks);
    atomic_sub(nblocks, &transaction->t_outstanding_credits);
    goto error_out;
}
```
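
For reference, here is a simplified sketch (not verbatim kernel code; based on my
reading of ocfs2_extend_trans() in fs/ocfs2/journal.c, and jbd2 signatures differ
between kernel versions) of how ocfs2 reacts to a positive return value from
jbd2_journal_extend(), and why that breaks the caller's access/dirty pairing:

```
#include <linux/jbd2.h>

/* Simplified sketch of ocfs2_extend_trans(); names and details are illustrative. */
static int extend_trans_sketch(handle_t *handle, int nblocks)
{
	int status;
	int old_nblocks = jbd2_handle_buffer_credits(handle);

	status = jbd2_journal_extend(handle, nblocks, 0);
	if (status < 0)
		return status;	/* hard error */

	if (status > 0) {
		/*
		 * Cases <A>/<B> above: the running transaction cannot hand out
		 * more credits, so the handle is restarted on a *new*
		 * transaction. Any ocfs2_journal_access_*() the caller did
		 * against the old transaction (e.g. the inode bh accessed in
		 * __ocfs2_move_extent()) is no longer covered, so the later
		 * ocfs2_journal_dirty() can hit the jbd2 assertion.
		 */
		status = jbd2_journal_restart(handle, old_nblocks + nblocks);
	}

	return status;
}
```

This is why keeping ocfs2_split_extent() journal self-service and dropping the
access/dirty pair in __ocfs2_move_extent() closes the window: after a restart,
only buffers re-accessed under the new handle are protected.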

For case <A>: in my view, if the transaction state is not T_RUNNING, the transaction
is being committed or has already been committed by jbd2. So I did:
- generated a depth-3 B+tree
- modified defragfs.ocfs2 with/without OCFS2_MOVE_EXT_FL_PART_DEFRAG
- mounted with commit_interval=[1,3]
- enabled some mlog(0) output to slow down the ocfs2-layer defrag speed


For case <B>, I did the following:
- generated a depth-3 B+tree
- modified defragfs.ocfs2 with/without OCFS2_MOVE_EXT_FL_PART_DEFRAG
- mounted with resv_level=[0,2] and commit_interval=[1,3,5,30]
- ran defragfs and, at the same time, ran dd to issue write IOs

In both cases, many test runs failed to trigger the crash. I only hit an
ocfs2 infinite-loop issue: ocfs2 kept busy all the time and the defrag job
never finished.


Finally, let me describe the three bugs in the non-auto defrag flow.
The fix is as below:

```
diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index 192cad0662d8..19693894aeec 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -444,7 +444,7 @@ static int ocfs2_find_victim_alloc_group(struct inode *inode,
 			bg = (struct ocfs2_group_desc *)gd_bh->b_data;

 			if (vict_blkno < (le64_to_cpu(bg->bg_blkno) +
-						le16_to_cpu(bg->bg_bits))) {
+						(le16_to_cpu(bg->bg_bits) << bits_per_unit))) {

 				*ret_bh = gd_bh;
 				*vict_bit = (vict_blkno - blkno) >>
@@ -559,6 +559,7 @@ static void ocfs2_probe_alloc_group(struct inode *inode, struct buffer_head *bh,
 			last_free_bits++;

 		if (last_free_bits == move_len) {
+			i -= move_len;
 			*goal_bit = i;
 			*phys_cpos = base_cpos + i;
 			break;
@@ -1030,18 +1031,19 @@ int ocfs2_ioctl_move_extents(struct file *filp, void __user *argp)

 	context->range = &range;

+	/*
+	 * ok, the default theshold for the defragmentation
+	 * is 1M, since our maximum clustersize was 1M also.
+	 * any thought?
+	 */
+	if (!range.me_threshold)
+		range.me_threshold = 1024 * 1024;
+
+	if (range.me_threshold > i_size_read(inode))
+		range.me_threshold = i_size_read(inode);
+
 	if (range.me_flags & OCFS2_MOVE_EXT_FL_AUTO_DEFRAG) {
 		context->auto_defrag = 1;
-		/*
-		 * ok, the default theshold for the defragmentation
-		 * is 1M, since our maximum clustersize was 1M also.
-		 * any thought?
-		 */
-		if (!range.me_threshold)
-			range.me_threshold = 1024 * 1024;
-
-		if (range.me_threshold > i_size_read(inode))
-			range.me_threshold = i_size_read(inode);

 		if (range.me_flags & OCFS2_MOVE_EXT_FL_PART_DEFRAG)
 			context->partial = 1;
```
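
A short summary of the three bugs as I understand them (the numbers below are only
an illustrative example, not taken from a real setup):

1. ocfs2_find_victim_alloc_group(): for the global bitmap, each bit in bg->bg_bits
   stands for a cluster, while vict_blkno is a block number, so the range check has
   to convert bits to blocks with "<< bits_per_unit". For example, with a 64KB
   cluster size and a 4KB block size, bits_per_unit = 16 - 12 = 4, so one bit covers
   16 blocks and a group with bg_bits = 32256 spans 32256 << 4 = 516096 blocks;
   comparing vict_blkno against the bare bg_bits makes almost every victim block
   look like it lies beyond the group.
2. ocfs2_probe_alloc_group(): when a run of move_len free bits is found, the loop
   index i points at the end of the run, so goal_bit/phys_cpos must be rewound
   toward the start of the run instead of using its last bit.
3. The me_threshold defaulting/clamping was only done under
   OCFS2_MOVE_EXT_FL_AUTO_DEFRAG, so the non-auto (move extents) path ran with an
   unchecked threshold; the patch moves that code out of the auto-defrag branch so
   both paths get it.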

I will send a new patch for the above fix.

Thanks,
Heming


