[Ocfs2-users] Unstable Cluster Node

Tue Dec 4 07:17:58 PST 2007

Hi,

First of all, thank you very much for providing the patches so quickly!

On Monday, December 3, 2007 7:18:12 PM Mark Fasheh wrote:
> Attached is a pair of patches which applied more cleanly. Basically it
> includes another tcp.c fix which the -EAGAIN fix built on top of. Both would
> be good for you to have one way or the other. Fair warning though - I don't
> really have the ability to test 2.6.18 fixes right now, so you're going to
> have to be a bit of a beta tester ;) That said, they look pretty clean to me
> so I have a relatively high confidence that they should work.

I applied both patches as well as all the patches found on http://www.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/backports/2.6.18/

I also applied the current stable OpenVZ patch for 2.6.18 as well as
a small patch I wrote myself for IPVS (both were already applied to the
previous, unstable kernel).

All the patches applied cleanly and I have the kernel up and running now.
It has been stable for a few hours so far, but I can only tell you more
about stability tomorrow.

> I'm not sure why it would be always one node and not the other. We'd
> probably need more detailing information about what's going on to figure
> that out. Maybe some combination of user application + cluster stack
> conspires to put a larger messaging load on it?
> 
> Are there any other ocfs2 messages in your logs for that node?

All I found is that it sometimes says
dlm_send_remote_convert_request:395 ERROR: status = -112

instead of

dlm_send_remote_convert_request:395 ERROR: status = -107

shortly before the crash.

I also found some other messages, but they are fairly old, so I am no longer sure whether they appeared during normal operation or while I was testing some other configuration:
kernel: o2net: no longer connected to node webhost2 (num 1) at 10.2.0.71:7777
kernel: (6693,3):dlm_send_proxy_ast_msg:457 ERROR: status = -112
kernel: (6693,3):dlm_flush_asts:589 ERROR: status = -112

and

kernel: o2net: no longer connected to node webhost2 (num 1) at 10.2.0.71:7777
kernel: (27088,2):dlm_do_master_request:1331 ERROR: link to 1 went down!
kernel: (27088,2):dlm_get_lock_resource:915 ERROR: status = -112
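
For reference, the status values in these messages look like negated Linux errno codes. Here is a minimal sketch that pulls the function name and status out of such a log line and looks the code up. The small table is hard-coded from the Linux errno definitions (so the result does not depend on the machine running the script), and the helper name decode_status is only illustrative, not part of any OCFS2 tool.

```python
import re

# Linux errno values, hard-coded so the lookup is platform-independent.
# Only the codes mentioned in this thread are listed.
LINUX_ERRNO = {
    11: ("EAGAIN", "Resource temporarily unavailable"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}

# Matches e.g. "dlm_flush_asts:589 ERROR: status = -112"
LOG_RE = re.compile(r"(\w+):(\d+) ERROR: status = (-?\d+)")


def decode_status(log_line):
    """Return (function, errno, name, description), or None if no status is present."""
    match = LOG_RE.search(log_line)
    if match is None:
        return None
    func = match.group(1)
    code = abs(int(match.group(3)))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
    return func, code, name, text


if __name__ == "__main__":
    for line in (
        "dlm_send_remote_convert_request:395 ERROR: status = -107",
        "kernel: (6693,3):dlm_flush_asts:589 ERROR: status = -112",
    ):
        print(decode_status(line))
```

Read this way, -107 would be ENOTCONN and -112 would be EHOSTDOWN, which at least fits the "no longer connected to node" messages logged alongside them.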

You also asked about my cluster setup:
The base is a DRBD 8.0.4 device in primary/primary mode. It is formatted
with OCFS2 as a single partition. This partition holds the private areas of
the OpenVZ virtual environments (VPS). Inside these VPS run
mostly web servers, but also some other network services.

Between these two cluster nodes I have an Ultra Monkey heartbeat that
manages an IPVS load balancer for the web servers located
inside the VPS on both cluster nodes on the OCFS2
filesystem. The crashing machine is always the one that is the hot
standby for IPVS.

I will also test whether this changes if I make the other node the hot standby.

> If the two patches here work for you, I'll probably just add them to that
> directory for others to use.

So far your patches work well for me, but whether they really solve my stability problem I can only tell you tomorrow, when I hopefully see that both nodes survived the night ;-)

Thanks very much for your expert help,

- Rainer
