[Ocfs2-users] Unstable Cluster Node

Mark Fasheh mark.fasheh at oracle.com
Mon Dec 3 10:18:12 PST 2007


On Mon, Dec 03, 2007 at 04:45:01AM -0800, rain c wrote:
> thanks very much for your answer.
> My problem is that I cannot really use kernel 2.6.22, because I also need
> the openVZ patch which is not available in a stable version for 2.6.22. Is
> there a way to backport ocfs2-Retry-if-it-returns-EAGAIN to 2.6.18?

Attached is a pair of patches which applied more cleanly. Basically it
includes another tcp.c fix which the -EAGAIN fix built on top of. Both would
be good for you to have one way or the other. Fair warning though - I don't
really have the ability to test 2.6.18 fixes right now, so you're going to
have to be a bit of a beta tester ;) That said, they look pretty clean to me
so I have a relatively high confidence that they should work.

Be sure to apply them in order:

$ cd linux-2.6.18
$ patch -p1 < 0001-ocfs2-Backport-message-locking-fix-to-2.6.18.patch
$ patch -p1 < 0002-ocfs2-Backport-sendpage-fix-to-2.6.18.patch
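Since these patches are untested on 2.6.18, it may be worth checking each one with GNU patch's --dry-run option before it touches the tree. The following is only a sketch: apply_in_order is a hypothetical helper name, not something shipped with the patches. It checks and applies one patch at a time because the second patch builds on the first.

```shell
# Hypothetical helper: apply patches strictly in the order given,
# stopping at the first one that would not apply cleanly.
apply_in_order() {
    for p in "$@"; do
        # --dry-run reports what would happen without modifying any file.
        if ! patch --dry-run -p1 < "$p" >/dev/null; then
            echo "would not apply cleanly, stopping: $p" >&2
            return 1
        fi
        # The dry run passed, so apply the patch for real.
        patch -p1 < "$p" || return 1
    done
}

# Usage (run from inside linux-2.6.18):
#   apply_in_order 0001-ocfs2-Backport-message-locking-fix-to-2.6.18.patch \
#                  0002-ocfs2-Backport-sendpage-fix-to-2.6.18.patch
```

If the second patch is rejected, the first one is already applied; it can be backed out with patch -R -p1 if needed.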


> Further I wonder why only one (and always the same) of my nodes is so
> unstable.

I'm not sure why it would be always one node and not the other. We'd
probably need more detailed information about what's going on to figure
that out. Maybe some combination of user application + cluster stack
conspires to put a larger messaging load on it?

Are there any other ocfs2 messages in your logs for that node?
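One way to collect those messages is to filter the kernel log for the usual ocfs2-related prefixes. This is a rough sketch: ocfs2_messages is a hypothetical helper name, and the pattern list (ocfs2, o2net, o2hb, o2cb) is an assumption about which prefixes matter here, so widen it if your logs use others.

```shell
# Hypothetical helper: keep only lines that mention ocfs2 or its
# cluster stack components. Reads log text on stdin.
ocfs2_messages() {
    grep -i -E 'ocfs2|o2net|o2hb|o2cb'
}

# Usage on the unstable node (where older kernel messages are kept on
# disk varies by distribution):
#   dmesg | ocfs2_messages
```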


> Are you sure that it cannot be any other problem?

No, not 100% sure. My first hunch was the -EAGAIN bug because your messages
looked exactly like what I saw there. Looking a bit deeper, it seems that
your value (when turned into a signed integer) is -32, which would actually
make it -EPIPE.
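The conversion can be sketched as follows, assuming the log printed the error as an unsigned 32-bit number. The exact logged value is not quoted in this message, so 4294967264 is used here only because it is the unsigned 32-bit value that corresponds to -32. On Linux, EPIPE is errno 32 and EAGAIN is errno 11. to_signed32 is a hypothetical helper name.

```shell
# Hypothetical helper: reinterpret an unsigned 32-bit number as a
# signed 32-bit integer (two's complement).
to_signed32() {
    # Keep only the low 32 bits.
    s32_v=$(( $1 & 4294967295 ))
    # Values with the top bit set represent negative numbers.
    if [ "$s32_v" -ge 2147483648 ]; then
        s32_v=$(( s32_v - 4294967296 ))
    fi
    echo "$s32_v"
}

to_signed32 4294967264    # prints -32, i.e. -EPIPE
```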

-EPIPE gets returned from several places in the tcp code, in particular
do_tcp_sendpages() and sk_stream_wait_memory(). If you look at the 1st patch
that's attached, you'll see that it fixes some races that occurred when
sending outgoing messages, including when those functions were called. While
I'm not 100% sure these patches will fix it, I definitely think it's the 1st
thing we should try.

By the way, while you're doing this it might be a good idea to also apply
some of the other patches we backported to 2.6.18 a long time ago:

http://www.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/backports/2.6.18/


If the two patches here work for you, I'll probably just add them to that
directory for others to use.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-ocfs2-Backport-message-locking-fix-to-2.6.18.patch
Type: text/x-diff
Size: 3917 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20071203/94fde8e5/0001-ocfs2-Backport-message-locking-fix-to-2.6.18.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-ocfs2-Backport-sendpage-fix-to-2.6.18.patch
Type: text/x-diff
Size: 1853 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20071203/94fde8e5/0002-ocfs2-Backport-sendpage-fix-to-2.6.18.bin

