[Ocfs2-devel] [PATCH 2/2] ocfs2: add error handling path when jbd2 enter ABORT status

Heming Zhao heming.zhao at suse.com
Tue May 9 15:22:39 UTC 2023


On Tue, May 09, 2023 at 09:53:06AM +0800, Joseph Qi wrote:
> 
> 
> On 5/9/23 12:40 AM, Heming Zhao wrote:
> > Sorry for the late reply, I have been a bit busy recently.
> > 
> > On Fri, May 05, 2023 at 11:42:51AM +0800, Joseph Qi wrote:
> >>
> >>
> >> On 5/5/23 12:20 AM, Heming Zhao wrote:
> >>> On Thu, May 04, 2023 at 05:41:29PM +0800, Joseph Qi wrote:
> >>>>
> >>>>
> >>>> On 5/4/23 4:02 PM, Heming Zhao wrote:
> >>>>> On Thu, May 04, 2023 at 03:34:49PM +0800, Joseph Qi wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 5/4/23 2:21 PM, Heming Zhao wrote:
> >>>>>>> On Thu, May 04, 2023 at 10:27:46AM +0800, Joseph Qi wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 4/30/23 11:13 AM, Heming Zhao wrote:
> >>>>>>>>> fstest generic cases 347 361 628 629 trigger a same issue:
> >>>>>>>>> When jbd2 enters ABORT status, ocfs2 ignores it and keeps committing the
> >>>>>>>>> journal.
> >>>>>>>>
> >>>>>>>> What's the end user impact?
> >>>>>>>
> >>>>>>> There are two impacts:
> >>>>>>>
> >>>>>>> 1. This issue makes the unmount command hang.
> >>>>>>>
> >>>>>>> No one accepts a filesystem unmount failure when the underlying device meets
> >>>>>>> an error. For comparison, other filesystems (e.g. ext4, xfs, gfs2, ...) can
> >>>>>>> unmount successfully.
> >>>>>>>
> >>>>>>
> >>>>>> So umount hang should be in the patch description, right?
> >>>>>
> >>>>> OK.
> >>>>>
> >>>>>>
> >>>>>>> 2. It lets developers verify that their patches don't introduce regressions.
> >>>>>>>
> >>>>>>> fstest is a famous & important fs test suite.
> >>>>>>> (Yes, I know ocfs2 has its own test suite, ocfs2_test.)
> >>>>>>>
> >>>>>>> At the current status, many test cases (about 9 in my env) make fstest hang
> >>>>>>> and block the rest of the run. I ran the same test for gfs2 on the latest
> >>>>>>> Tumbleweed; gfs2 only has 1 hanging case.
> >>>>>>>
> >>>>>>> In my view, ocfs2 developers or maintainers should at least make ocfs2 finish
> >>>>>>> the test suite. The long-term target is to make ocfs2 pass the whole suite.
> >>>>>>>
> >>>>>>> On kernel 6.2.9-1, the fstest 'quick' test group result:
> >>>>>>> (the result based on my two patches [1/2] & [2/2])
> >>>>>>> ```
> >>>>>>> ./check -g quick -T -b -s ocfs2 -e generic/081 -e generic/648
> >>>>>>>
> >>>>>>> Failures: generic/003 generic/013 generic/015 generic/040 generic/041
> >>>>>>> generic/062 generic/082 generic/104 generic/107 generic/228 generic/244
> >>>>>>> generic/266 generic/272 generic/277 generic/281 generic/322 generic/329
> >>>>>>> generic/331 generic/336 generic/343 generic/376 generic/379 generic/380
> >>>>>>> generic/383 generic/384 generic/385 generic/386 generic/400 generic/410
> >>>>>>> generic/424 generic/441 generic/448 generic/449 generic/471 generic/479
> >>>>>>> generic/493 generic/495 generic/510 generic/537 generic/552 generic/563
> >>>>>>> generic/578 generic/594 generic/607 generic/620 generic/628 generic/630
> >>>>>>> generic/636 generic/702
> >>>>>>> Failed 49 of 568 tests
> >>>>>>> ```
> >>>>>>>
> >>>>>>>>
> >>>>>>>> JBD2_ABORT indicates a fatal error happened, either in the journal layer or
> >>>>>>>> in the filesystem, and we should not commit any further transactions.
> >>>>>>>> It seems that we may unify the behavior like:
> >>>>>>>>
> >>>>>>>> if (is_journal_aborted(journal))
> >>>>>>>> 	return -EROFS;
> >>>>>>>>
> >>>>>>>>
> >>>>>>> IIUC, do you mean add above code in ocfs2_commit_cache or in
> >>>>>>> ocfs2_commit_thread?
> >>>>>>
> >>>>>> Yes, break the loop in ocfs2_commit_thread() in case of journal abort.
> >>>>>> Actually we've handled this case before, but only to limit the print. That
> >>>>>> seems not enough now.
> >>>>>
> >>>>> As I wrote in my previous mail, following your idea, the code should be:
> >>>>>
> >>>>> if (is_journal_aborted(journal)) {
> >>>>> 	ocfs2_error(osb->sb, "jbd2 status: ABORT.\n"); //this line is important.
> >>>>> 	return -EROFS;
> >>>>> }
> >>>>>
> >>>>> Only returning -EROFS and then breaking the loop in ocfs2_commit_thread() is
> >>>>> not enough. Without ocfs2_error(), ocfs2 could still accept new IOs from
> >>>>> userspace; those new IOs would hit IO errors and only then trigger RO status.
> >>>>> This flow is wrong: we should mark the fs RO as early as possible when the JBD
> >>>>> ABORT happens. In my view, the best place is my [2/2] patch code in
> >>>>> ocfs2_commit_cache().
> >>>>>
> >>>>
> >>>> Agree, but ocfs2_abort() is more appropriate here, see
> >>>> ocfs2_start_trans(). But ocfs2_abort() may panic system, I'm afraid it
> >>>> has to change to read-only accordingly.
> >>>
> >>> You are right. I once used ocfs2_abort() and got a panic. :)
> >>> So I changed to ocfs2_error(). I also sent two mails on 2023-04-30 (the first
> >>> mail was mangled by Thunderbird). I copy the mail content here:
> >>> ```
> >>> More info for this patch; maybe I should put the info below in the commit log.
> >>>
> >>> From the comment of __ocfs2_abort(), there is another way to handle
> >>> journal ABORT: panic.
> >>> fstest generic test No. 648 triggers ocfs2_abort and thus a kernel
> >>> panic. I don't like the panic style; ocfs2 should handle abnormal
> >>> journal cases elegantly.
> >>> ```
> >>>
> >>> fstest No. 648 can trigger ocfs2_abort from ocfs2_start_trans(). That is why,
> >>> as you can see in my previous mail, I ran fstest with the parameter
> >>> '-e generic/648', which makes fstest exclude No. 648.
> >>>
> >>> The only difference between ocfs2_abort and ocfs2_error is setting
> >>> OCFS2_MOUNT_ERRORS_PANIC, which can trigger a panic. If you think panic is
> >>> brutal, I suggest using ocfs2_error.
> >>>
> >>
> >> Looking back at the history, it is designed to panic in cluster mode:
> >> a2f2ddbf2baf ocfs2: __ocfs2_abort() should not enable panic for local mounts
> > 
> > About the panic action from ocfs2_abort(): in my view, it's too scary to
> > accept. I am not familiar with the o2cb stack; in the pcmk stack there is the
> > resource-agents RA Filesystem [1], which includes two levels of fs abnormality
> > detection:
> > - Filesystem_monitor_10 : read the device
> > - Filesystem_monitor_20 : write and read a status file
> > 
> > If ocfs2 enters abort status, the Filesystem RA will detect it and trigger the
> > HA stack to hand over the resources.
> > If the Filesystem RA doesn't apply the monitor_[10|20] level, a readonly fs
> > (rather than a panic plus coredump) gives the administrator more info to
> > locate the root cause.
> > 
> > I wrote the above based on the gfs2 fstest result: gfs2 does not panic in the
> > cases where ocfs2 panics.
> > 
> > [1]:
> > https://github.com/ClusterLabs/resource-agents/blob/main/heartbeat/Filesystem
> > 
> 
> Agree, so a more elegant way is to make the journal-abort node behave
> like a node down, e.g. set hard readonly so that heartbeat also cannot
> update normally.
> 
> In xfstests, device mapper is used to simulate disk down, so will it
> also trigger node-down recovery since heartbeat won't update normally?

I am not familiar with heartbeat. Based on the current code, in the pcmk stack,
your description is partly right. (See my answer below; a node may hang
forever.)
> 
> Also, I'm not sure pcmk supports fencing another node by only rejecting IOs.
> The built-in ocfs2 fence policy only supports fencing itself through panic
> or reset.
> 

Yes, pcmk supports fencing another node.

The SUSE HA stack pcmk (crmsh + corosync + pacemaker + stonith) is a powerful,
enterprise-class software solution, usually deployed in SAP environments.
For rejected IOs or an aborted jbd2, the Filesystem RA (see [1] above) can use
monitor_[10|20] (OCF_CHECK_LEVEL) to enable IO error detection. If the customer
doesn't configure OCF_CHECK_LEVEL and the storage stack is disk => dlm+ocfs2,
these nodes will hang forever.

But for a typical SUSE customer use case:
disks => mpath => lvm2 (dlm/lvmlockd) => ocfs2
the pcmk stack can monitor the dlm, lvm2, lvmlockd and ocfs2 layers. If any of
them enters an abnormal status, a fence action will be triggered.
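(For illustration, a crmsh configuration enabling the deeper monitor level might look like the sketch below. The resource name, device path and mount point are invented examples; only `ocf:heartbeat:Filesystem` and `OCF_CHECK_LEVEL` come from the RA discussed above. Intervals/timeouts are placeholders to adapt to the actual cluster:)

```shell
# Hypothetical example: Filesystem RA with OCF_CHECK_LEVEL=20
# (write-and-read a status file), so IO errors / jbd2 abort are
# detected by the monitor instead of hanging silently.
crm configure primitive example-ocfs2 ocf:heartbeat:Filesystem \
    params device="/dev/vg_example/lv_ocfs2" \
           directory="/mnt/ocfs2" fstype="ocfs2" \
    op monitor interval=20s timeout=40s OCF_CHECK_LEVEL=20
```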

- Heming

> >>>>
> >>>>>> BTW, the basic rule here is, we don't want to change journal to prevent
> >>>>>> other nodes corrupting the filesystem.
> >>>>>
> >>>>> If my memory is correct, every node has its own dedicated journal partition.
> >>>>> If the HA stack assigns the same (pre-fence) nodeid to the JBD-ABORTed
> >>>>> machine, no other node could corrupt the fs.
> >>>>>
> >>>> I don't think it's a good idea to modify the journal even if flush fails. Why
> >>>> not catch EROFS in the commit thread and break? Now we can only expect
> >>>> umount to continue with an error, right?
> >>>>
> >>>
> >>> Following your idea, I spent more than two hours on coding & testing. The
> >>> result is that my patch [2/2] is correct. Please believe my patch [2/2].
> >>>
> >>
> >> I am not arguing about whether your patch can pass the test cases above. I am
> >> talking about whether doing it this way is right.
> >>
> >> Now the journal flush actually fails since the journal is aborted, which means
> >> the journal data are not reflected in the filesystem. But after your change,
> >> it behaves as if we've done a normal checkpoint.
> >>
> >> Take the following scenario into consideration:
> >> N0 does some changes on file1, but now the disk goes down and the journal is
> >> aborted, yet checkpoint reports success even though it does nothing.
> >> N1 gets the access right on file1 and does other changes, and checkpoints
> >> normally.
> >> N0 crashes, and N1 begins to recover N0; then data corruption happens.
> >>
> >> I've fixed a similar issue in the past:
> >> 6f6a6fda2945 jbd2: fix ocfs2 corrupt when updating journal superblock fails
> >>
> > 
> > I agree with the above description. Thank you for pointing out my mistake; I
> > forgot about the dlm layer callback for the node-down case.
> > 
> > - Heming


