[Ocfs2-users] Unstable Cluster Node

rain c rain_c1 at yahoo.com
Wed Dec 5 05:27:13 PST 2007


Hi,



As I wrote yesterday, I applied all the patches. Unfortunately, they did not bring the desired results. The same node crashed again today with very similar messages. I have also attached the messages of the other node, which stayed alive.

I should also mention that in the meantime I replaced all of the hardware to make sure it is not a hardware problem. At first glance it looks like a network problem to me, but as I wrote before, the two nodes are IBM blades in the same BladeCenter, directly connected by the BladeCenter's internal switch. None of the other blades in the same BladeCenter have any problems.



I have really reached the end of my knowledge and hope you can still help me.



Thanks very much,

- Rainer



+----------------------------------------------+

| These are the messages of the crashing node: |

+----------------------------------------------+



Dec  5 12:58:14 webhost2 kernel: o2net: no longer connected to node webhost1 (num 0) at 10.2.0.70:7777

Dec  5 12:58:14 webhost2 kernel: (10409,1):dlm_send_remote_convert_request:395 ERROR: status = -112

Dec  5 12:58:14 webhost2 kernel: (14860,2):dlm_send_remote_convert_request:395 ERROR: status = -112

Dec  5 12:58:14 webhost2 kernel: (14860,2):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0

Dec  5 12:58:14 webhost2 kernel: (10409,1):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0

Dec  5 12:58:14 webhost2 kernel: (8536,0):dlm_send_remote_convert_request:395 ERROR: status = -112

Dec  5 12:58:14 webhost2 kernel: (8536,0):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0

Dec  5 12:58:20 webhost2 kernel: (10409,1):dlm_send_remote_convert_request:395 ERROR: status = -107

Dec  5 12:58:20 webhost2 kernel: (14860,3):dlm_send_remote_convert_request:395 ERROR: status = -107

Dec  5 12:58:20 webhost2 kernel: (8536,0):dlm_send_remote_convert_request:395 ERROR: status = -107

Dec  5 12:58:20 webhost2 kernel: (14860,3):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0

Dec  5 12:58:20 webhost2 kernel: (8536,0):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0

Dec  5 12:58:20 webhost2 kernel: (10409,1):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0

Dec  5 12:58:25 webhost2 kernel: (10409,1):dlm_send_remote_convert_request:395 ERROR: status = -107

Dec  5 12:58:25 webhost2 kernel: (14860,3):dlm_send_remote_convert_request:395 ERROR: status = -107

Dec  5 12:58:25 webhost2 kernel: (8536,0):dlm_send_remote_convert_request:395 ERROR: status = -107

Dec  5 12:58:25 webhost2 kernel: (14860,3):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0

Dec  5 12:58:25 webhost2 kernel: (8536,0):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0

Dec  5 12:58:25 webhost2 kernel: (10409,1):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0

Dec  5 12:58:30 webhost2 kernel: (14860,2):dlm_send_remote_convert_request:395 ERROR: status = -107

Dec  5 12:58:30 webhost2 kernel: (10409,0):dlm_send_remote_convert_request:395 ERROR: status = -107

Dec  5 12:58:30 webhost2 kernel: (8536,1):dlm_send_remote_convert_request:395 ERROR: status = -107

Dec  5 12:58:30 webhost2 kernel: (10409,0):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0

Dec  5 12:58:30 webhost2 kernel: (8536,1):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0

Dec  5 12:58:30 webhost2 kernel: (14860,2):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
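
(For reference: the status values in these messages are negative Linux errno codes - -112 is -EHOSTDOWN and -107 is -ENOTCONN, i.e. o2net treats the peer as down and the socket as no longer connected. The tiny standalone C snippet below is only an illustration, not part of the cluster code; it just prints the corresponding error strings:)

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The o2net/dlm lines above report status as a negative errno. */
    int codes[] = { 112, 107 };
    int i;

    for (i = 0; i < 2; i++)
        /* On Linux: 112 = EHOSTDOWN ("Host is down"),
         * 107 = ENOTCONN ("Transport endpoint is not connected"). */
        printf("status = -%d -> %s\n", codes[i], strerror(codes[i]));

    return 0;
}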



+----------------------------------------------------------------------------------+

| During that crash you can see the following messages at the other (stable) node: |

+----------------------------------------------------------------------------------+



Dec  5 12:58:15 webhost1 kernel: o2net: connection to node webhost2 (num 1) at 10.2.0.71:7777 has been idle for 10 seconds, shutting it down.

Dec  5 12:58:15 webhost1 kernel: (0,2):o2net_idle_timer:1313 here are some times that might help debug the situation: (tmr 1196859485.13835 now 1196859495.12881 dr 1196859485.13824 adv 1196859485.13837:1196859485.13838 func (434028bd:504) 1196859485.12053:1196859485.12057)

Dec  5 12:58:15 webhost1 kernel: o2net: no longer connected to node webhost2 (num 1) at 10.2.0.71:7777

Dec  5 12:58:15 webhost1 kernel: (8511,2):dlm_send_proxy_ast_msg:457 ERROR: status = -112

Dec  5 12:58:15 webhost1 kernel: (8511,2):dlm_flush_asts:589 ERROR: status = -112

Dec  5 12:58:55 webhost1 kernel: (11011,3):ocfs2_replay_journal:1184 Recovering node 1 from slot 0 on device (147,0)





--------------------------------------------------------------------------------------------



On Monday, December 3, 2007 7:18:12 PM Mark Fasheh wrote:

 On Mon, Dec 03, 2007 at 04:45:01AM -0800, rain c wrote:

> Thanks very much for your answer.

> My problem is that I cannot really use kernel 2.6.22, because I also need the OpenVZ patch, which is not available in a stable version for 2.6.22. Is there a way to backport ocfs2-Retry-if-it-returns-EAGAIN to 2.6.18?



Attached is a pair of patches which applied more cleanly. Basically it includes another tcp.c fix which the -EAGAIN fix was built on top of. Both would be good for you to have one way or the other. Fair warning though - I don't really have the ability to test 2.6.18 fixes right now, so you're going to have to be a bit of a beta tester ;) That said, they look pretty clean to me, so I have relatively high confidence that they should work.



Be sure to apply them in order:



$ cd linux-2.6.18

$ patch -p1 < 0001-ocfs2-Backport-message-locking-fix-to-2.6.18.patch

$ patch -p1 < 0002-ocfs2-Backport-sendpage-fix-to-2.6.18.patch





> Further, I wonder why only one (and always the same) of my nodes is so unstable.



I'm not sure why it would always be one node and not the other. We'd probably need more detailed information about what's going on to figure that out. Maybe some combination of user application + cluster stack conspires to put a larger messaging load on it?



Are there any other ocfs2 messages in your logs for that node?





> Are you sure that it cannot be any other problem?



No, not 100% sure. My first hunch was the -EAGAIN bug because your messages looked exactly like what I saw there. Looking a bit deeper, it seems that your value (when turned into a signed integer) is -32, which would actually make it -EPIPE.
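
For example, reinterpreting such a logged value as a signed 32-bit integer shows the mapping (the 4294967264 below is only an illustrative stand-in, not a value taken from the logs in this thread):

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Hypothetical unsigned value as it might appear in a log. */
    uint32_t raw = 4294967264u;

    /* Reinterpret it as a signed 32-bit integer. */
    int32_t status = (int32_t)raw;

    /* On Linux, EPIPE is 32, so -32 corresponds to -EPIPE. */
    printf("raw = %u, signed = %d, EPIPE = %d (%s)\n",
           (unsigned)raw, (int)status, EPIPE, strerror(EPIPE));

    return 0;
}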



-EPIPE gets returned from several places in the tcp code, in particular do_tcp_sendpages() and sk_stream_wait_memory(). If you look at the 1st patch that's attached, you'll see that it fixes some races that occurred when sending outgoing messages, including when those functions were called. While I'm not 100% sure these patches will fix it, I definitely think it's the 1st thing we should try.



By the way, while you're doing this, it might be a good idea to also apply some of the other patches we backported to 2.6.18 a long time ago:



http://www.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/backports/2.6.18/





If the two patches here work for you, I'll probably just add them to that directory for others to use.

    --Mark



--

Mark Fasheh

Senior Software Developer, Oracle

mark.fasheh at oracle.com





-----Inline Attachment Follows-----



From 42318a6658696711baf25d8bd17e3d2827472d66 Mon Sep 17 00:00:00 2001

From: Zhen Wei <zwei at novell.com>

Date: Tue, 23 Jan 2007 17:19:59 -0800

Subject: ocfs2: Backport message locking fix to 2.6.18



Untested fix, apply at your own risk.

Original commit message follows.



ocfs2: introduce sc->sc_send_lock to protect outbound messages



When there is a lot of multithreaded I/O usage, two threads can collide while sending out a message to the other nodes. This is due to the lack of locking between threads while sending out the messages.



When a connected TCP send(), sendto(), or sendmsg() arrives in the Linux kernel, it eventually comes through tcp_sendmsg(). tcp_sendmsg() protects itself by acquiring a lock at invocation by calling lock_sock(). tcp_sendmsg() then loops over the buffers in the iovec, allocating associated sk_buff's and cache pages for use in the actual send. As it does so, it pushes the data out to tcp for actual transmission. However, if one of those allocations fails (because a large number of large sends is being processed, for example), it must wait for memory to become available. It does so by jumping to wait_for_sndbuf or wait_for_memory, both of which eventually cause a call to sk_stream_wait_memory(). sk_stream_wait_memory() contains a code path that calls sk_wait_event(). Finally, sk_wait_event() contains the call to release_sock().



The following patch adds a lock to the socket container in order to properly serialize outbound requests.



From: Zhen Wei <zwei at novell.com>

Acked-by: Jeff Mahoney <jeffm at suse.com>

Signed-off-by: Mark Fasheh <mark.fasheh at oracle.com>

---

 fs/ocfs2/cluster/tcp.c          |    8 ++++++++

 fs/ocfs2/cluster/tcp_internal.h |    2 ++

 2 files changed, 10 insertions(+), 0 deletions(-)



diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c

index b650efa..3c5bf4d 100644

--- a/fs/ocfs2/cluster/tcp.c

+++ b/fs/ocfs2/cluster/tcp.c

@@ -520,6 +520,8 @@ static void o2net_register_callbacks(struct sock *sk,

     sk->sk_data_ready = o2net_data_ready;

     sk->sk_state_change = o2net_state_change;

 

+    mutex_init(&sc->sc_send_lock);

+

     write_unlock_bh(&sk->sk_callback_lock);

 }

 

@@ -818,10 +820,12 @@ static void o2net_sendpage(struct o2net_sock_container *sc,

     ssize_t ret;

 

 

+    mutex_lock(&sc->sc_send_lock);

     ret = sc->sc_sock->ops->sendpage(sc->sc_sock,

                      virt_to_page(kmalloced_virt),

                      (long)kmalloced_virt & ~PAGE_MASK,

                      size, MSG_DONTWAIT);

+    mutex_unlock(&sc->sc_send_lock);

     if (ret != size) {

         mlog(ML_ERROR, "sendpage of size %zu to " SC_NODEF_FMT 

              " failed with %zd\n", size, SC_NODEF_ARGS(sc), ret);

@@ -936,8 +940,10 @@ int o2net_send_message_vec(u32 msg_type, u32 key, struct kvec *caller_vec,

 

     /* finally, convert the message header to network byte-order

      * and send */

+    mutex_lock(&sc->sc_send_lock);

     ret = o2net_send_tcp_msg(sc->sc_sock, vec, veclen,

                  sizeof(struct o2net_msg) + caller_bytes);

+    mutex_unlock(&sc->sc_send_lock);

     msglog(msg, "sending returned %d\n", ret);

     if (ret < 0) {

         mlog(0, "error returned from o2net_send_tcp_msg=%d\n", ret);

@@ -1068,8 +1074,10 @@ static int o2net_process_message(struct o2net_sock_container *sc,

 

 out_respond:

     /* this destroys the hdr, so don't use it after this */

+    mutex_lock(&sc->sc_send_lock);

     ret = o2net_send_status_magic(sc->sc_sock, hdr, syserr,

                       handler_status);

+    mutex_unlock(&sc->sc_send_lock);

     hdr = NULL;

     mlog(0, "sending handler status %d, syserr %d returned %d\n",

          handler_status, syserr, ret);

diff --git a/fs/ocfs2/cluster/tcp_internal.h b/fs/ocfs2/cluster/tcp_internal.h

index ff9e2e2..008fcf9 100644

--- a/fs/ocfs2/cluster/tcp_internal.h

+++ b/fs/ocfs2/cluster/tcp_internal.h

@@ -142,6 +142,8 @@ struct o2net_sock_container {

     struct timeval         sc_tv_func_stop;

     u32            sc_msg_key;

     u16            sc_msg_type;

+

+    struct mutex        sc_send_lock;

 };

 

 struct o2net_msg_handler {

-- 

1.5.3.4









-----Inline Attachment Follows-----



From 355053cdec5205ff35398d78f5c93a59eeb502ce Mon Sep 17 00:00:00 2001

From: Sunil Mushran <sunil.mushran at oracle.com>

Date: Mon, 30 Jul 2007 11:02:50 -0700

Subject: ocfs2: Backport sendpage() fix to 2.6.18



Untested fix, apply at your own risk.

Original commit message follows.



ocfs2: Retry sendpage() if it returns EAGAIN



Instead of treating EAGAIN, returned from sendpage(), as an error, this patch retries the operation.



Signed-off-by: Sunil Mushran <sunil.mushran at oracle.com>

Signed-off-by: Mark Fasheh <mark.fasheh at oracle.com>

---

 fs/ocfs2/cluster/tcp.c |   24 ++++++++++++++++--------

 1 files changed, 16 insertions(+), 8 deletions(-)



diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c

index 3c5bf4d..29554e5 100644

--- a/fs/ocfs2/cluster/tcp.c

+++ b/fs/ocfs2/cluster/tcp.c

@@ -819,17 +819,25 @@ static void o2net_sendpage(struct o2net_sock_container *sc,

     struct o2net_node *nn = o2net_nn_from_num(sc->sc_node->nd_num);

     ssize_t ret;

 

-

-    mutex_lock(&sc->sc_send_lock);

-    ret = sc->sc_sock->ops->sendpage(sc->sc_sock,

-                     virt_to_page(kmalloced_virt),

-                     (long)kmalloced_virt & ~PAGE_MASK,

-                     size, MSG_DONTWAIT);

-    mutex_unlock(&sc->sc_send_lock);

-    if (ret != size) {

+    while (1) {

+        mutex_lock(&sc->sc_send_lock);

+        ret = sc->sc_sock->ops->sendpage(sc->sc_sock,

+                         virt_to_page(kmalloced_virt),

+                         (long)kmalloced_virt & ~PAGE_MASK,

+                         size, MSG_DONTWAIT);

+        mutex_unlock(&sc->sc_send_lock);

+        if (ret == size)

+            break;

+        if (ret == (ssize_t)-EAGAIN) {

+            mlog(0, "sendpage of size %zu to " SC_NODEF_FMT

+                 " returned EAGAIN\n", size, SC_NODEF_ARGS(sc));

+            cond_resched();

+            continue;

+        }

         mlog(ML_ERROR, "sendpage of size %zu to " SC_NODEF_FMT 

              " failed with %zd\n", size, SC_NODEF_ARGS(sc), ret);

         o2net_ensure_shutdown(nn, sc, 0);

+        break;

     }

 }

 

-- 

1.5.3.4















