[Ocfs2-users] any way to ignore quorum in a two node cluster with one node down?

Wed Jul 25 11:24:56 PDT 2007

A quick code scan suggests that the operation will keep failing until
the heartbeat sends the fs a node-down event. Only then is the dead node
moved from the fs mounted nodemap to the fs recovery nodemap.

So, the upper bound would be the heartbeat timeout.
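
For callers that have to ride out that window, here is a minimal sketch of a
bounded retry. It assumes the failure reaches userspace as an OSError whose
errno matches the -107 in the kernel log (on Linux, 107 is ENOTCONN). The
thread never captured the return code of mv, so that mapping is an assumption.
The helper name retry_on_errno and its parameters are hypothetical, and
max_wait_s is a placeholder to be set no lower than the cluster's configured
heartbeat timeout.

```python
import errno
import time


def retry_on_errno(op, retry_errnos=(errno.ENOTCONN,), max_wait_s=60.0,
                   interval_s=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call op() until it succeeds, fails with a non-retryable errno,
    or max_wait_s has elapsed.

    op           -- zero-argument callable performing the fs operation
    retry_errnos -- errno values treated as temporary (assumed: ENOTCONN)
    max_wait_s   -- give up after this long (placeholder value)
    interval_s   -- pause between attempts
    sleep, clock -- injectable for testing
    """
    deadline = clock() + max_wait_s
    while True:
        try:
            return op()
        except OSError as exc:
            # Re-raise anything we do not consider temporary, and stop
            # retrying once the deadline has passed.
            if exc.errno not in retry_errnos or clock() >= deadline:
                raise
            sleep(interval_s)
```

A rename could then be wrapped as
retry_on_errno(lambda: os.rename(src, dst), max_wait_s=30). This only covers
operations the application issues itself; it does not help with mount/umount.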

Andrew D. Ball wrote:
> Can you help estimate an upper bound on how long a failure like this
> could last?
>
> Thanks for your help.
> Andrew
>
> On Mon, 2007-07-23 at 13:37 -0700, Sunil Mushran wrote:
>> Yes, the failure is temporary.
>>
>> Andrew D. Ball wrote:
>>> On Tue, 2007-07-17 at 16:17 -0700, Sunil Mushran wrote:
>>>> Ahh... this is a 1.2 "feature". :)
>>>>
>>>> In 1.2, the fs does messaging (votes) for mount/umount, rename,
>>>> unlink and delete. So the said operations can fail... if a node dies
>>>> during the voting process.
>>>>
>>>> We have addressed this issue in mainline, sometime around 2.6.18/19:
>>>> rename, unlink and delete now use the dlm and thus should
>>>> no longer fail on a node death.
>>>>
>>> Should these start working again when retried after some amount of time?
>>> I can't have them fail forever, and if mount/unmount don't work, that
>>> would likely make it very hard to recover.
>>>
>>> Peace,
>>> Andrew
>>>
>>>> Andrew D. Ball wrote:
>>>>> From /var/log/messages:
>>>>>
>>>>> Jul 16 17:06:43 enva11 kernel: (29933,2):ocfs2_broadcast_vote:725 ERROR: status = -107
>>>>> Jul 16 17:06:43 enva11 kernel: (29933,2):ocfs2_do_request_vote:798 ERROR: status = -107
>>>>> Jul 16 17:06:43 enva11 kernel: (29933,2):ocfs2_rename:1196 ERROR: status = -107
>>>>>
>>>>> The userspace error is a failed invocation of the mv command.  I know
>>>>> that the return code is not 0, but didn't capture it.  I can re-run
>>>>> tomorrow if that would be helpful.
>>>>>
>>>>> Thanks for your prompt response!
>>>>>
>>>>> Peace,
>>>>> Andrew
>>>>>
>>>>> On Tue, 2007-07-17 at 15:31 -0700, Sunil Mushran wrote:
>>>>>> It should behave as you expect it to. That's the idea.
>>>>>> What are the errors when mkdir fails?
>>>>>> As in, userspace and dmesg.
>>>>>>
>>>>>> Andrew D. Ball wrote:
>>>>>>> I would really like to see the following behavior:
>>>>>>>
>>>>>>> (1) I start with a two-node cluster, both nodes online, with an ocfs2
>>>>>>> filesystem mounted on both nodes.
>>>>>>> (2) I power off one of the nodes without unmounting the filesystem.
>>>>>>> (3) The node that is still powered on continues to use the filesystem
>>>>>>> mounted read-write with no problems.
>>>>>>>
>>>>>>> I believe I'm seeing that the node that is still online fails to write
>>>>>>> data to the filesystem.  Specifically, mkdir(2) is failing.
>>>>>>>
>>>>>>> This is related to having a quorum, right?  Can the quorum requirements
>>>>>>> be disabled?  I have a file-backed database on the filesystem and my
>>>>>>> entire software stack will be broken if any surviving nodes cannot
>>>>>>> update the database.  Is there any reason why ignoring the quorum would
>>>>>>> not be a good idea?
>>>>>>>
>>>>>>> Thanks for your help,
>>>>>>> Andrew
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Ocfs2-users mailing list
>>>>>>> Ocfs2-users at oss.oracle.com
>>>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>
>