[Ocfs2-users] Avoid node fence and fail gracefully

Vineeth Thampi vineeth.thampi at gmail.com
Sun Jun 2 01:55:18 PDT 2013


Hi Shencanquan / Srini,

Thanks for the comments.

If I am willing to sacrifice the kernel I/O that is still pending, is there
a way to do this? What I need is to stop the heartbeat when the heartbeat
region is not reachable.

In my case the host also carries other types of filesystems that users
depend on, and I cannot justify a host reboot to those users.

Thanks,

Vineeth


On Sun, Jun 2, 2013 at 3:19 AM, shencanquan <shencanquan at huawei.com> wrote:

>  On 2013/6/1 1:09, Srinivas Eeda wrote:
>
> The reason nodes are fenced during network failures is that we need to
> guarantee that no I/O is going to happen from the fenced node. If you
> just change the fs to read-only, we still cannot guarantee that there is no
> in-flight I/O from this node from previous writes.
>
>  I agree.
> Setting ocfs2 to read-only only prevents I/O from user-space
> applications. Data in kernel caches (for example the page cache) or writes
> currently in progress may still be written to the SAN.
>
> The best way is to use SCSI-3 Persistent Group Reservation to fence the
> node.
>
>
>
> On 05/31/2013 08:33 AM, Vineeth Thampi wrote:
>
>  Hi,
>
>  I have been trying to work around the node fence that happens on a
> heartbeat failure / network timeout. I modified o2quo_fence_self() in
> quorum.c to make all ocfs2 filesystems RO. When tested, it worked like a
> charm and the filesystems were made RO, but I am not able to umount the
> filesystems or stop the O2CB service.
>
>  Is there any way by which I could ask O2CB to abort heartbeat and treat
> the filesystem as LOCAL instead of GLOBAL?
>
>  The following is the code change that I made.
>
> **************************************************
> static void make_fs_RO(struct super_block *sb, void *arg)
> {
>     struct ocfs2_super *osb = OCFS2_SB(sb);
>
>     sb->s_flags |= MS_RDONLY;
>     ocfs2_set_osb_flag(osb, OCFS2_OSB_ERROR_FS);
>     ocfs2_set_ro_flag(osb, *(int *)arg);
> }
>
> /* this is horribly heavy-handed.  It should instead flip the file
>  * system RO and call some userspace script. */
> static void o2quo_fence_self(void)
> {
>
> ...
>
>         case O2NM_FENCE_RESET:
>                 printk(KERN_ERR "*** Hard failure in O2CB, all ocfs2 "
>                        "filesystems made RO ***\n");
>
>                 /* Iterate through all ocfs2 super blocks and make each of
>                    them RO */
>                 fs_type = get_fs_type("ocfs2");
>                 if (fs_type)
>                         iterate_supers_type(fs_type, make_fs_RO,
>                                             &hard_reset);
>
>                 break;
> ...
>
> }
> ***************************************************************
>
>
>  The error from kern.log:
>
> =======================================
> May 31 16:08:18 localhost kernel: [ 5434.076126] (kworker/u:2,577,3):dlm_send_remote_convert_request:395 ERROR: Error -107 when sending message 504 (key 0xcfe4a084) to node 0
> May 31 16:08:18 localhost kernel: [ 5434.076178] o2dlm: Waiting on the death of node 0 in domain A4E98618A3744717A65AF04E943D035A
> =======================================
>
>  Any pointers would be much appreciated.
>
>  Thanks,
>
> Vineeth
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-users
>
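
A toy userspace sketch of the point made in the quoted replies: flipping a
filesystem to read-only blocks new writes, but does nothing about writes that
were already queued to shared storage. Every name below (toy_fs,
toy_submit_write, and so on) is hypothetical and invented for illustration;
none of it is OCFS2 or kernel API, and it models only the ordering problem,
not real block I/O.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are OCFS2 or kernel APIs. */
#define MAX_INFLIGHT 8

struct toy_fs {
    bool read_only;             /* stands in for the RO flag on the superblock */
    int inflight[MAX_INFLIGHT]; /* block numbers already queued to the device */
    size_t n_inflight;          /* how many writes are still queued */
    size_t n_on_disk;           /* writes that actually reached shared storage */
};

/* New writes are refused once the filesystem is read-only (or the queue is full). */
static bool toy_submit_write(struct toy_fs *fs, int block)
{
    if (fs->read_only || fs->n_inflight == MAX_INFLIGHT)
        return false;
    fs->inflight[fs->n_inflight++] = block;
    return true;
}

/* Flipping the flag does not touch what is already queued. */
static void toy_set_read_only(struct toy_fs *fs)
{
    fs->read_only = true;
}

/* The device drains its queue regardless of the RO flag and
 * returns how many queued writes it completed. */
static size_t toy_device_complete_all(struct toy_fs *fs)
{
    size_t done = fs->n_inflight;
    fs->n_on_disk += done;
    fs->n_inflight = 0;
    return done;
}
```

If two writes are submitted, the flag is then set, and the device is allowed
to complete its queue, both writes still land on storage even though a third
write submitted after the flag is refused. In this model that is the gap that
fencing (or a storage-level reservation) is meant to close.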