[Ocfs2-users] Avoid node fence and fail gracefully

Srinivas Eeda srinivas.eeda at oracle.com
Fri May 31 10:09:02 PDT 2013


Nodes are fenced during network failures because we need to guarantee 
that no I/O will be issued from the fenced node. If you just change the 
fs to read-only, we still cannot guarantee that there are no in-flight 
I/Os from this node left over from previous writes.
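
To illustrate, here is a toy user-space model in C. It is not ocfs2 or 
kernel code, and every name in it (toy_node, toy_make_ro, toy_fence, and 
so on) is made up for this sketch. It only assumes that a node can hold 
writes that were submitted but have not yet reached the shared disk. 
Flipping the node to read-only refuses new writes, but the queued ones 
still land; a fence (reset) is modelled as discarding them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 8

/* Toy model only: none of these names exist in ocfs2 or the kernel. */
struct toy_node {
    bool   read_only;              /* set once the fs is flipped RO */
    size_t queued[TOY_QUEUE_MAX];  /* blocks submitted, not yet on disk */
    size_t nqueued;
};

/* Submit a write; refused if the node is read-only or the queue is full. */
static bool toy_submit_write(struct toy_node *n, size_t block)
{
    assert(n != NULL);
    if (n->read_only || n->nqueued == TOY_QUEUE_MAX)
        return false;
    n->queued[n->nqueued++] = block;
    return true;
}

/* Flip to read-only: blocks new writes, leaves queued writes untouched. */
static void toy_make_ro(struct toy_node *n)
{
    assert(n != NULL);
    n->read_only = true;
}

/* Fence: the node is reset, so nothing it had queued can reach the disk. */
static void toy_fence(struct toy_node *n)
{
    assert(n != NULL);
    n->nqueued = 0;
    n->read_only = true;
}

/* Complete whatever is still queued; returns how many writes hit the disk. */
static size_t toy_drain_to_disk(struct toy_node *n, bool *disk, size_t disk_len)
{
    size_t landed = 0;

    assert(n != NULL && disk != NULL);
    for (size_t i = 0; i < n->nqueued; i++) {
        if (n->queued[i] < disk_len) {
            disk[n->queued[i]] = true;
            landed++;
        }
    }
    n->nqueued = 0;
    return landed;
}
```

In this model, a write submitted before toy_make_ro() is still written by 
toy_drain_to_disk(), while after toy_fence() the drain writes nothing. 
That difference is the guarantee the other nodes rely on before they 
recover the dead node's locks.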


On 05/31/2013 08:33 AM, Vineeth Thampi wrote:
> Hi,
>
> I have been trying to work around node fencing in case of a 
> heartbeat failure or network timeout. I modified o2quo_fence_self() in 
> quorum.c to make all ocfs2 filesystems RO. In testing this worked and 
> the filesystems were made RO, but I am not able to umount the 
> filesystem or stop the O2CB service.
>
> Is there any way by which I could ask O2CB to abort heartbeat and 
> treat the filesystem as LOCAL instead of GLOBAL?
>
> The following is the code change that I made.
>
> **************************************************
> static void make_fs_RO(struct super_block *sb, void *arg)
> {
>     struct ocfs2_super *osb = OCFS2_SB(sb);
>
>     sb->s_flags |= MS_RDONLY;
>     ocfs2_set_osb_flag(osb, OCFS2_OSB_ERROR_FS);
>     ocfs2_set_ro_flag(osb, *(int *)arg);
> }
>
> /* this is horribly heavy-handed.  It should instead flip the file
>  * system RO and call some userspace script. */
> static void o2quo_fence_self(void)
> {
>
> *...*
>
>         case O2NM_FENCE_RESET:
>                 printk(KERN_ERR "*** Hard failure in O2CB, all ocfs2 "
>                        "filesystems made RO ***\n");
>
>                 /* Iterate through all ocfs2 super blocks and make each
>                    of them RO */
>                 fs_type = get_fs_type("ocfs2");
>                 if (fs_type)
>                         iterate_supers_type(fs_type, make_fs_RO,
>                                             &hard_reset);
>
>                 break;
> *...*
>
> }
> ***************************************************************
>
>
> The error from kern.log:
>
> =======================================
> May 31 16:08:18 localhost kernel: [ 5434.076126] (kworker/u:2,577,3):dlm_send_remote_convert_request:395 ERROR: Error -107 when sending message 504 (key 0xcfe4a084) to node 0
> May 31 16:08:18 localhost kernel: [ 5434.076178] o2dlm: Waiting on the death of node 0 in domain A4E98618A3744717A65AF04E943D035A
> =======================================
>
> Any pointers would be much appreciated.
>
> Thanks,
>
> Vineeth
>
>
