[Ocfs2-users] Avoid node fence and fail gracefully
Srinivas Eeda
srinivas.eeda at oracle.com
Fri May 31 10:09:02 PDT 2013
Nodes are fenced during network failures because we need to guarantee
that no I/O will be issued from the fenced node. If you just change the
fs to read-only, we still cannot guarantee that there is no in-flight
I/O from this node left over from previous writes.
On 05/31/2013 08:33 AM, Vineeth Thampi wrote:
> Hi,
>
> I have been trying to work around node fencing on heartbeat failure /
> network timeout. I modified o2quo_fence_self() in quorum.c to make all
> ocfs2 filesystems RO. In testing it worked like a charm and the
> filesystems were made RO, but I am not able to umount the filesystem
> or stop the O2CB service.
>
> Is there any way by which I could ask O2CB to abort heartbeat and
> treat the filesystem as LOCAL instead of GLOBAL?
>
> The following is the code change that I made.
>
> **************************************************
> static void make_fs_RO(struct super_block *sb, void *arg)
> {
>         struct ocfs2_super *osb = OCFS2_SB(sb);
>
>         sb->s_flags |= MS_RDONLY;
>         ocfs2_set_osb_flag(osb, OCFS2_OSB_ERROR_FS);
>         ocfs2_set_ro_flag(osb, *(int *)arg);
> }
>
> /* this is horribly heavy-handed. It should instead flip the file
> * system RO and call some userspace script. */
> static void o2quo_fence_self(void)
> {
>
> *...*
>
>         case O2NM_FENCE_RESET:
>                 printk(KERN_ERR "*** Hard failure in O2CB, all ocfs2 "
>                        "filesystems made RO ***\n");
>
>                 /* Iterate through all ocfs2 super blocks and make
>                  * each of them RO */
>                 fs_type = get_fs_type("ocfs2");
>                 if (fs_type)
>                         iterate_supers_type(fs_type, make_fs_RO,
>                                             &hard_reset);
>
>                 break;
> *...*
>
> }
> ***************************************************************
>
>
> The error from kern.log:
>
> =======================================
> May 31 16:08:18 localhost kernel: [ 5434.076126]
> (kworker/u:2,577,3):dlm_send_remote_convert_request:395 ERROR: Error
> -107 when sending message 504 (key 0xcfe4a084) to node 0
> May 31 16:08:18 localhost kernel: [ 5434.076178] o2dlm: Waiting on the
> death of node 0 in domain A4E98618A3744717A65AF04E943D035A
> =======================================
>
> Any pointers would be much appreciated.
>
> Thanks,
>
> Vineeth