[Ocfs2-users] any way to ignore quorum in a two node cluster with one node down?

Andrew D. Ball aball at linux.vnet.ibm.com
Mon Jul 23 12:15:18 PDT 2007


Yes, I'm committed to SLES 10 SP1.

Peace,
Andrew

On Mon, 2007-07-23 at 10:33 -0700, Sunil Mushran wrote:
> What is your target kernel? As in, are you committed to
> SLES10 or is that the kernel you happen to be working on?
> 
> Andrew D. Ball wrote:
> > I'm using a flat file database that I wrote.  We use ocfs2 for other
> > things as well, so it's not as silly as it sounds... 8-)
> >
> > The renaming happens while modifying a record, which really ought not to
> > fail if we can help it.  I have a simple algorithm to make this
> > operation transactional, but really want to avoid unnecessary failures.
> >
> > In my understanding, the failures we've seen occur when a healthy node
> > is updating the database.  Of course I'll work through alternatives.
> >
> > Peace,
> > Andrew
> >
> > On Sat, 2007-07-21 at 15:44 -0700, Sunil Mushran wrote:
> >> No. Adding a node will not help. It's a node death during the said
> >> operation that is causing the temporary failure.
> >>
> >> I am curious as to why this is such a "big" issue for you. Can you elaborate
> >> on that?
> >>
> >> As this has been addressed in mainline, it will work as expected in 1.4.
> >>
> >> Andrew D. Ball wrote:
> >>> Given that the change is non-trivial, would requiring at least three
> >>> nodes to be in the ocfs2 cluster be a successful workaround?
> >>>
> >>> Thanks for your help,
> >>> Andrew
> >>>
> >>> On Fri, 2007-07-20 at 10:10 -0700, Sunil Mushran wrote:
> >>>> There are dlm bug fixes that should already be in SLES10 SP1.
> >>>>
> >>>> The change for what you are looking for was not trivial and
> >>>> merging it in the 1.2 tree will not be either.
> >>>>
> >>>> Some of the patches related to this are as follows:
> >>>>
> >>>> Date:   Fri Sep 8 11:37:32 2006 -0700
> >>>>
> >>>> commit 3384f3df5ed939a25135e1b2734fb7cdee1720a8
> >>>> ocfs2: Allow binary names in the DLM
> >>>>
> >>>> commit ea5b3a187e2724fa9d08b2fbd3898c149ed95c6b
> >>>> ocfs2: Update dlmfs for new dlmlock() API
> >>>>
> >>>> commit f0681062b8e369d9fb6f3ce10f4e3fc8cea5f910
> >>>> ocfs2: Update dlmglue for new dlmlock() API
> >>>>
> >>>> commit d680efe9d8fe0eb99d9dd063a4def6b362cdb40d
> >>>> ocfs2: Add new cluster lock type
> >>>>
> >>>> commit 80c05846f604bab6d61e9732c262420ee9f5f358
> >>>> ocfs2: Add dentry tracking API
> >>>>
> >>>> commit 379dfe9d0db99ed33fb089fcb9c07f5f92566e9e
> >>>> ocfs2: Hook rest of the file system into dentry locking API
> >>>>
> >>>> commit 1390334b4c697b7588d5661fcf6acaeec409cf4c
> >>>> ocfs2: Remove the dentry vote
> >>>>
> >>>> commit 1ba9da2ffa54b56a6346746248bfa38124d499a6
> >>>> ocfs2: manually d_move() during ocfs2_rename()
> >>>>
> >>>> Needless to add, this changes the dlm protocol.
> >>>>
> >>>> Andrew D. Ball wrote:
> >>>>> Can you help me determine which patches need to be backported to SLES 10
> >>>>> SP1 to resolve this?  Someone in the LTC saw the following that looked
> >>>>> promising:
> >>>>>
> >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2f5bf1f2d061dea5146aa283685ce2b00cea2f3d
> >>>>>   
> >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=78062cb2e54ffe0df811dce5e68b54da9b8c9025
> >>>>>   
> >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b36c3f84988eebf38acaccc756e05f6b70e333ab
> >>>>>   
> >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3fca0894a4b5e52c278421b04435b88e32b423ad
> >>>>>   
> >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0dd82141b236ce36253e3056c6068ee3d5732196
> >>>>>   
> >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e4968476a9bc5a6b30076076b4f3ce3e692e0d79
> >>>>>   
> >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f3f854648de64c4b6f13f6f13113bc9525c621e5
> >>>>>   
> >>>>>
> >>>>> Thanks!
> >>>>> Andrew
> >>>>>
> >>>>> On Tue, 2007-07-17 at 16:17 -0700, Sunil Mushran wrote:
> >>>>>> Ahh... this is a 1.2 "feature". :)
> >>>>>>
> >>>>>> In 1.2, the fs does messaging (votes) for mount/umount, rename,
> >>>>>> unlink and delete. So the said operations can fail... if a node dies
> >>>>>> during the voting process.
> >>>>>>
> >>>>>> We have addressed this issue in mainline. Sometime in 2.6.18/19.
> >>>>>> As in, rename, unlink and delete now use the dlm and thus should
> >>>>>> no longer fail on a node death.
> >>>>>>
> >>>>>> Andrew D. Ball wrote:
> >>>>>>> From /var/log/messages:
> >>>>>>>
> >>>>>>> Jul 16 17:06:43 enva11 kernel: (29933,2):ocfs2_broadcast_vote:725 ERROR: status = -107
> >>>>>>> Jul 16 17:06:43 enva11 kernel: (29933,2):ocfs2_do_request_vote:798 ERROR: status = -107
> >>>>>>> Jul 16 17:06:43 enva11 kernel: (29933,2):ocfs2_rename:1196 ERROR: status = -107
> >>>>>>>
> >>>>>>> The userspace error is a failed invocation of the mv command.  I know
> >>>>>>> that the return code is not 0, but didn't capture it.  I can re-run
> >>>>>>> tomorrow if that would be helpful.
> >>>>>>>
> >>>>>>> Thanks for your prompt response!
> >>>>>>>
> >>>>>>> Peace,
> >>>>>>> Andrew
> >>>>>>>
> >>>>>>> On Tue, 2007-07-17 at 15:31 -0700, Sunil Mushran wrote:
> >>>>>>>> It should behave as you expect it to. That's the idea.
> >>>>>>>> What are the errors when mkdir fails?
> >>>>>>>> As in, userspace and dmesg.
> >>>>>>>>
> >>>>>>>> Andrew D. Ball wrote:
> >>>>>>>>> I would really like to see the following behavior:
> >>>>>>>>>
> >>>>>>>>> (1) I start with a two-node cluster, both nodes online, with an ocfs2
> >>>>>>>>> filesystem mounted on both nodes.
> >>>>>>>>> (2) I power off one of the nodes without unmounting the filesystem.
> >>>>>>>>> (3) The node that is still powered on continues to use the filesystem
> >>>>>>>>> mounted read-write with no problems.
> >>>>>>>>>
> >>>>>>>>> I believe I'm seeing that the node that is still online fails to write
> >>>>>>>>> data to the filesystem.  Specifically, mkdir(2) is failing.
> >>>>>>>>>
> >>>>>>>>> This is related to having a quorum, right?  Can the quorum requirements
> >>>>>>>>> be disabled?  I have a file-backed database on the filesystem and my
> >>>>>>>>> entire software stack will be broken if any surviving nodes cannot
> >>>>>>>>> update the database.  Is there any reason why ignoring the quorum would
> >>>>>>>>> not be a good idea?
> >>>>>>>>>
> >>>>>>>>> Thanks for your help,
> >>>>>>>>> Andrew
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Ocfs2-users mailing list
> >>>>>>>>> Ocfs2-users at oss.oracle.com
> >>>>>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> 
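
For readers stuck on a 1.2-based kernel, one possible userspace mitigation
for the temporary rename failure discussed in this thread is to retry the
rename. The sketch below is not the flat-file database code mentioned above
and is not an ocfs2 API. It assumes that the kernel status -107 (ENOTCONN on
Linux) is what the failing rename returns to userspace, which the thread
does not confirm, since the mv return code was not captured. The names
replace_with_retry and update_record, and the attempts and delay defaults,
are hypothetical.

```python
import errno
import os
import tempfile
import time

# Assumption: "status = -107" in the kernel log surfaces in userspace as
# errno 107 (ENOTCONN on Linux). The thread does not confirm this.
TRANSIENT_ERRNOS = {errno.ENOTCONN}


def replace_with_retry(src, dst, attempts=5, delay=0.5,
                       rename=os.rename, sleep=time.sleep):
    """Rename src over dst, retrying only when the error looks transient.

    Returns the number of attempts it took. Any other OSError, or a
    transient one on the last attempt, is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            rename(src, dst)
            return attempt
        except OSError as exc:
            if exc.errno not in TRANSIENT_ERRNOS or attempt == attempts:
                raise
            sleep(delay)


def update_record(path, data, **retry_opts):
    """Write data to a temp file beside path, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the new contents to disk first
        return replace_with_retry(tmp, path, **retry_opts)
    except BaseException:
        # Leave the old record untouched and clean up the temp file.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

A call such as update_record("/mnt/ocfs2/db/record", b"...") would write the
new record next to the old one and rename it into place; the path is only an
example. Retrying does not remove the underlying failure. It only helps if
the failure is as temporary as described in the thread, so the attempts and
delay values would need tuning against the cluster's timeouts.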