[Ocfs2-devel] [PATCH] Revert "ocfs2: mount shared volume without ha stack"

Heming Zhao heming.zhao at suse.com
Tue Jun 7 09:42:46 UTC 2022


On Tue, Jun 07, 2022 at 02:31:01PM +0800, Joseph Qi wrote:
> 
> 
> On 6/7/22 11:06 AM, Junxiao Bi wrote:
> >>>> It seems I am missing some mails in this thread.
> >>>> The 'nocluster' mount was introduced by Gang and I think it has real
> >>>> user scenarios. Since node 0 is commonly used in o2cb, I am curious
> >>>> why there has been no bug report before.
> >>>> So let's try to fix the regression first.
> >>> A real user case doesn't mean this has to be done through the kernel? This sounds like doing something in the kernel to work around an issue that can be handled from user space.
> >>> I didn't see a Reviewed-by on the patch; how did it get merged?
> >> Gang left SUSE some time ago and is busy with his new job.
> >> If I remember correctly, he said this commit was approved & merged directly by Andrew Morton.
> >> Gang contributed to the ocfs2 community for many years and proved his competence
> >> to the other maintainers & reviewers.
> >>
> >> If Junxiao dislikes this feature and doesn't want to fix it as a bug,
> >> I am willing to file a patch.
> > To fix this, it's not only the regression it causes: mount.ocfs2 also needs to check whether any node is mounting the volume before a nocluster mount, and to stop other nodes from mounting until the nocluster mount is unmounted. That would simulate what fsck.ocfs2/mkfs.ocfs2 do. Without that guard, this feature just provides a new way for customers to corrupt their data. It's only a matter of time before some customer does something bad and loses their data.
> > I will leave how to handle this to you and Joseph; we already reverted this patch.
> 
> Searched the mailing list and found the original thread for reference:
> https://lore.kernel.org/ocfs2-devel/CH2PR18MB32064CCD80FE98F03B82A816CFAC0@CH2PR18MB3206.namprd18.prod.outlook.com/
> 
> I suggest we leave nocluster mount as a special mode and make sure its logic
> doesn't impact other modes like cluster or local mount.
> I agree with Junxiao: we have to try our best to prevent data corruption,
> even if the feature is mistakenly used by a customer.
> 

Reverting this feature is bad news for Oracle people.
As I said, the hang is not a big bug.

Based on Junxiao's commit log information, commit 912f655d78c5d introduced the
buggy code in the area below.

@@ -254,14 +254,16 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
     int i, ret = -ENOSPC;
 
     if ((preferred >= 0) && (preferred < si->si_num_slots)) {
-        if (!si->si_slots[preferred].sl_valid) {
+        if (!si->si_slots[preferred].sl_valid ||
+            !si->si_slots[preferred].sl_node_num) {
             ret = preferred;
             goto out;
         }
     }
 
     for(i = 0; i < si->si_num_slots; i++) {
-        if (!si->si_slots[i].sl_valid) {
+        if (!si->si_slots[i].sl_valid ||
+            !si->si_slots[i].sl_node_num) {
             ret = i;
             break;
         }

The 'if' assessment is wrong: sl_node_num can legitimately be 0 in an o2cb environment.
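
To make the ambiguity concrete, here is a minimal userspace sketch (my own
model for illustration, not the real ocfs2 structures; only the field names
mirror the diff above) showing that the patched test cannot tell a slot
validly owned by o2cb node 0 from an empty one:

```
/* Userspace model only -- not the kernel structures. */
#include <stdbool.h>
#include <stdio.h>

struct slot {
    bool sl_valid;
    int  sl_node_num;
};

/* The patched "is this slot free?" test from commit 912f655d78c5d. */
static bool looks_empty(const struct slot *s)
{
    return !s->sl_valid || !s->sl_node_num;
}

int main(void)
{
    /* A slot legitimately owned by o2cb node 0. */
    struct slot s = { .sl_valid = true, .sl_node_num = 0 };

    printf("empty: %d\n", looks_empty(&s)); /* prints "empty: 1" */
    return 0;
}
```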

With the current information, the (likely) trigger flow is:
1>
node1 mounts with 'node_num = 0'; this will succeed.
At this point the slotmap extent block contains es_valid:1 & es_node_num:0 for node1,
then ocfs2_update_disk_slot() writes the slotmap info back to disk.

2>
Then node2 mounts with 'node_num = 1':

ocfs2_find_slot
 + ocfs2_update_slot_info //read slotmap info from disk
 |  + set si->si_slots[0].sl_valid = 1 & si->si_slots[0].sl_node_num = 0
 |
 + __ocfs2_node_num_to_slot //will return -ENOENT.
    __ocfs2_find_empty_slot
     + the search for the preferred slot (node_num:1) fails
     + 'si->si_slots[0].sl_node_num' is 0, so the new '!sl_node_num' check triggers the 'break' condition.
     + return slot 0  //node2 then grabs node1's journal dlm lock, triggering the hang.
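
For reference, a hedged userspace sketch that replays steps 1> and 2> (again
my own model; find_empty_slot() only mirrors the patched
__ocfs2_find_empty_slot() logic, and the slot map lives in a plain array):

```
#include <stdbool.h>
#include <stdio.h>

#define NUM_SLOTS 4

struct slot {
    bool sl_valid;
    int  sl_node_num;
};

static struct slot si[NUM_SLOTS]; /* stands in for the on-disk slot map */

/* Models __ocfs2_find_empty_slot() with the 912f655d78c5d check. */
static int find_empty_slot(int preferred)
{
    int i;

    if (preferred >= 0 && preferred < NUM_SLOTS) {
        if (!si[preferred].sl_valid || !si[preferred].sl_node_num)
            return preferred;
    }
    for (i = 0; i < NUM_SLOTS; i++)
        if (!si[i].sl_valid || !si[i].sl_node_num)
            return i;
    return -1; /* -ENOSPC */
}

int main(void)
{
    /* 1> node1 (o2cb node_num 0) mounts and takes slot 0. */
    int s1 = find_empty_slot(-1); /* no preferred_slot, modeled as -1 */
    si[s1].sl_valid = true;
    si[s1].sl_node_num = 0;

    /* 2> node2 (node_num 1) mounts. __ocfs2_node_num_to_slot()
     * returned -ENOENT, so it falls back to the empty-slot scan,
     * which hits '!sl_node_num' on slot 0 and hands it out. */
    int s2 = find_empty_slot(-1);

    printf("node1 -> slot %d, node2 -> slot %d\n", s1, s2); /* 0, 0 */
    return 0;
}
```

Both mounts end up on slot 0, which is exactly the journal dlm lock
collision described above.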


I copied the related comments from the URL Joseph provided:
> https://lore.kernel.org/ocfs2-devel/CH2PR18MB32064CCD80FE98F03B82A816CFAC0@CH2PR18MB3206.namprd18.prod.outlook.com/

```
> > @@ -254,14 +254,16 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
> >  	int i, ret = -ENOSPC;
> >
> >  	if ((preferred >= 0) && (preferred < si->si_num_slots)) {
> > -		if (!si->si_slots[preferred].sl_valid) {
> > +		if (!si->si_slots[preferred].sl_valid ||
> > +		    !si->si_slots[preferred].sl_node_num) {
> 
> Why specially handle node num 0 here?
> It seems breaks original logic.
Since in local (or nocluster) mode the code does not invoke any DLM/cluster related interfaces,
the node_num is set to 0 directly.
In the past, a local mount and a cluster mount would never happen on the same volume.
But after the nocluster option was introduced, a local (nocluster) mount and a cluster mount can possibly happen (not at the same time) on the same volume.
If we mount the shared volume in cluster mode after a local (nocluster mode) mount crashed,
we have to reuse that slot (slot 0, which was used by the last local mount); otherwise there are possibly no more slots available.
```

Gang's comment flow (based on the patched code):
1. do a nocluster mount, which uses slot 0
2. crash, reboot
3. do a cluster mount
4. in slot 0, sl_valid:1, sl_node_num:0
   ocfs2_find_slot
    + __ocfs2_node_num_to_slot()
    |  //fails: slot 0 is valid ('sl_valid == 1') but owned by node_num 0
    |
    + __ocfs2_find_empty_slot()
       //should reuse slot 0, which the nocluster mount occupied,
       //so 'sl_node_num == 0' should be treated as an empty slot for reuse.
       //if this slot is not reused, there won't be enough slots for other
       //nodes to mount (see the sketch below).
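
To show the other side of the tradeoff in the same style, here is a hedged
sketch of the slot map this flow leaves behind, comparing the pre-patch and
patched emptiness tests (userspace model only, not the kernel code):

```
#include <stdbool.h>
#include <stdio.h>

#define NUM_SLOTS 2 /* deliberately small: every slot provisioned */

struct slot {
    bool sl_valid;
    int  sl_node_num;
};

int main(void)
{
    /* Slot map after a nocluster mount on slot 0 crashed, plus a
     * surviving cluster mount on slot 1 (node_num 7, say). */
    struct slot si[NUM_SLOTS] = {
        { .sl_valid = true, .sl_node_num = 0 }, /* stale nocluster slot */
        { .sl_valid = true, .sl_node_num = 7 }, /* a live cluster node */
    };
    int i, old = -1, patched = -1;

    for (i = 0; i < NUM_SLOTS; i++) {
        if (old == -1 && !si[i].sl_valid)
            old = i;                            /* pre-patch test */
        if (patched == -1 &&
            (!si[i].sl_valid || !si[i].sl_node_num))
            patched = i;                        /* 912f655d78c5d test */
    }

    /* old == -1 (-ENOSPC: slot 0 is never reclaimed),
     * patched == 0 (slot 0 reused, as Gang intended). */
    printf("old -> %d, patched -> %d\n", old, patched);
    return 0;
}
```

The old test never reclaims the stale slot 0 and returns -ENOSPC on a fully
provisioned volume; the patched test reclaims it, which is Gang's intent,
but as shown earlier the same condition misfires in an o2cb cluster where
node 0 is a real node.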

Finally, I need some time to find a solution.

Thanks,
Heming



