[Ocfs2-users] Unstable Cluster Node
rain c
rain_c1 at yahoo.com
Tue Dec 4 07:17:58 PST 2007
Hi,
first of all thank you very much for providing the patches to me so fast!
On Monday, December 3, 2007 7:18:12 PM Mark Fasheh wrote:
> Attached is a pair of patches which applied more cleanly. Basically it
> includes another tcp.c fix which the -EAGAIN fix built on top of. Both would
> be good for you to have one way or the other. Fair warning though - I don't
> really have the ability to test 2.6.18 fixes right now, so you're going to
> have to be a bit of a beta tester ;) That said, they look pretty clean to me
> so I have a relatively high confidence that they should work.
I applied both patches as well as all the patches found at http://www.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/backports/2.6.18/
In addition, I applied the current stable OpenVZ patch for 2.6.18 and
a small patch I wrote myself for IPVS (both were already applied to the
previously used, unstable kernel).
All the patches applied cleanly and I have the kernel up and running now.
It has been stable for a few hours so far, but I can only tell you more
about stability tomorrow.
> I'm not sure why it would be always one node and not the other. We'd
> probably need more detailing information about what's going on to figure
> that out. Maybe some combination of user application + cluster stack
> conspires to put a larger messaging load on it?
>
> Are there any other ocfs2 messages in your logs for that node?
All I found is that it sometimes says
dlm_send_remote_convert_request:395 ERROR: status = -112
instead of
dlm_send_remote_convert_request:395 ERROR: status = -107
shortly before the crash.
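For reference, these status values are negated Linux errno codes. Below is a minimal sketch for decoding them, assuming the standard Linux numbering (11 = EAGAIN, 107 = ENOTCONN, 112 = EHOSTDOWN). The lookup table and the helper name `decode_status` are only illustrative and are not part of OCFS2:

```python
# Linux errno values, hardcoded so the result does not depend on the
# platform this script runs on. Only the codes from this thread are listed.
LINUX_ERRNO = {
    11: ("EAGAIN", "Try again"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}


def decode_status(status):
    """Map a kernel-style negative status (e.g. -112) to (name, description)."""
    return LINUX_ERRNO.get(-status, ("UNKNOWN", "not in this table"))


for status in (-107, -112):
    name, text = decode_status(status)
    print(f"status = {status}: {name} ({text})")
```

Read this way, both values indicate that the network link to the other node was down or not connected when the DLM tried to send the message.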
I also found some older messages, but I am no longer sure whether they were logged during normal operation or while I was testing some other configuration:
kernel: o2net: no longer connected to node webhost2 (num 1) at 10.2.0.71:7777
kernel: (6693,3):dlm_send_proxy_ast_msg:457 ERROR: status = -112
kernel: (6693,3):dlm_flush_asts:589 ERROR: status = -112
and
kernel: o2net: no longer connected to node webhost2 (num 1) at 10.2.0.71:7777
kernel: (27088,2):dlm_do_master_request:1331 ERROR: link to 1 went down!
kernel: (27088,2):dlm_get_lock_resource:915 ERROR: status = -112
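To compare such excerpts across nodes, the messages can be grouped by the function that reported them. The following is a rough sketch, assuming the prefix in parentheses is (pid, cpu) followed by function name and source line; the regular expression and the helper name `parse_errors` are my own illustration, not an OCFS2 tool:

```python
import re
from collections import Counter

# Assumed layout: "(pid,cpu):function:line ERROR: message"
LINE_RE = re.compile(
    r"\((?P<pid>\d+),(?P<cpu>\d+)\):(?P<func>\w+):(?P<line>\d+) ERROR: (?P<msg>.*)"
)

# Log lines copied from the excerpts above.
SAMPLE = """\
kernel: o2net: no longer connected to node webhost2 (num 1) at 10.2.0.71:7777
kernel: (6693,3):dlm_send_proxy_ast_msg:457 ERROR: status = -112
kernel: (6693,3):dlm_flush_asts:589 ERROR: status = -112
kernel: (27088,2):dlm_do_master_request:1331 ERROR: link to 1 went down!
kernel: (27088,2):dlm_get_lock_resource:915 ERROR: status = -112
"""


def parse_errors(text):
    """Return one dict per ERROR line; lines in other formats are skipped."""
    errors = []
    for raw in text.splitlines():
        match = LINE_RE.search(raw)
        if match:
            errors.append(match.groupdict())
    return errors


errors = parse_errors(SAMPLE)
per_function = Counter(entry["func"] for entry in errors)
for func, count in per_function.items():
    print(f"{func}: {count}")
```

The o2net line is skipped on purpose because it has no (pid,cpu) prefix; it would need its own pattern if the disconnect times are of interest.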
You also asked about my cluster setup:
The base is a DRBD 8.0.4 device in primary/primary mode, formatted
with OCFS2 as a single partition. This partition holds the private areas of
the OpenVZ virtual environments (VPS). Inside these VPS run
mostly webservers, but also some other network services.
Between the two cluster nodes I have an Ultramonkey heartbeat that
manages an IPVS load balancer for the webservers located
inside the VPS on both cluster nodes on the OCFS2
filesystem. The crashing machine is always the one that is the hot
standby for IPVS.
I will also test whether this changes if I make the other node the hot standby.
> If the two patches here work for you, I'll probably just add them to that
> directory for others to use.
So far your patches work pretty well for me, but whether they really solve my stability problem I can only tell you tomorrow, when I hopefully see that both nodes survived the night ;-)
Thanks very much for your expert help,
- Rainer