[Ocfs2-users] Unstable Cluster Node

Sunil Mushran Sunil.Mushran at oracle.com
Wed Dec 5 10:05:42 PST 2007


As I had mentioned earlier, 2.6.18 is missing a lot of fixes.

We provide patch fixes for all bugs for all kernels starting from
2.6.20. That's the current cut-off. If you are on 2.6.18, you should
look into switching to (rh)el5. That is 2.6.18 based. You could then
just use ocfs2 1.2.7-1 packages built for it.

If you cannot shift to (rh)el5, then see if you can build the ocfs2-1.2.7
source tarball against your kernel. Do upgrade the tools to 1.2.7 too.

Sunil

rain c wrote:
> Hi,
>
>
>
> as I wrote yesterday, I applied all the patches. Unfortunately, this did not
> bring the desired results. The same node crashed again today with very
> similar messages. I have also attached the messages of the other node that
> stayed alive.
>
> I should also mention that in the meantime I swapped out the hardware
> completely to make sure that it is not a hardware problem. At first glance
> it looks like a network problem to me, but as I already wrote before, the
> two nodes are IBM blades in the same BladeCenter, directly connected by the
> BladeCenter's internal switch. None of the other blades in the same
> BladeCenter have any problems.
>
> I am really at my wit's end and hope you can still help me.
>
> Thanks very much,
>
> - Rainer
>
>
>
> +----------------------------------------------+
> | These are the messages of the crashing node: |
> +----------------------------------------------+
>
>
>
> Dec  5 12:58:14 webhost2 kernel: o2net: no longer connected to node webhost1 (num 0) at 10.2.0.70:7777
> Dec  5 12:58:14 webhost2 kernel: (10409,1):dlm_send_remote_convert_request:395 ERROR: status = -112
> Dec  5 12:58:14 webhost2 kernel: (14860,2):dlm_send_remote_convert_request:395 ERROR: status = -112
> Dec  5 12:58:14 webhost2 kernel: (14860,2):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
> Dec  5 12:58:14 webhost2 kernel: (10409,1):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
> Dec  5 12:58:14 webhost2 kernel: (8536,0):dlm_send_remote_convert_request:395 ERROR: status = -112
> Dec  5 12:58:14 webhost2 kernel: (8536,0):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
> Dec  5 12:58:20 webhost2 kernel: (10409,1):dlm_send_remote_convert_request:395 ERROR: status = -107
> Dec  5 12:58:20 webhost2 kernel: (14860,3):dlm_send_remote_convert_request:395 ERROR: status = -107
> Dec  5 12:58:20 webhost2 kernel: (8536,0):dlm_send_remote_convert_request:395 ERROR: status = -107
> Dec  5 12:58:20 webhost2 kernel: (14860,3):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
> Dec  5 12:58:20 webhost2 kernel: (8536,0):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
> Dec  5 12:58:20 webhost2 kernel: (10409,1):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
> Dec  5 12:58:25 webhost2 kernel: (10409,1):dlm_send_remote_convert_request:395 ERROR: status = -107
> Dec  5 12:58:25 webhost2 kernel: (14860,3):dlm_send_remote_convert_request:395 ERROR: status = -107
> Dec  5 12:58:25 webhost2 kernel: (8536,0):dlm_send_remote_convert_request:395 ERROR: status = -107
> Dec  5 12:58:25 webhost2 kernel: (14860,3):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
> Dec  5 12:58:25 webhost2 kernel: (8536,0):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
> Dec  5 12:58:25 webhost2 kernel: (10409,1):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
> Dec  5 12:58:30 webhost2 kernel: (14860,2):dlm_send_remote_convert_request:395 ERROR: status = -107
> Dec  5 12:58:30 webhost2 kernel: (10409,0):dlm_send_remote_convert_request:395 ERROR: status = -107
> Dec  5 12:58:30 webhost2 kernel: (8536,1):dlm_send_remote_convert_request:395 ERROR: status = -107
> Dec  5 12:58:30 webhost2 kernel: (10409,0):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
> Dec  5 12:58:30 webhost2 kernel: (8536,1):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
> Dec  5 12:58:30 webhost2 kernel: (14860,2):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
>
>
>
> +----------------------------------------------------------------------------------+
> | During that crash you can see the following messages at the other (stable) node: |
> +----------------------------------------------------------------------------------+
>
>
>
> Dec  5 12:58:15 webhost1 kernel: o2net: connection to node webhost2 (num 1) at 10.2.0.71:7777 has been idle for 10 seconds, shutting it down.
> Dec  5 12:58:15 webhost1 kernel: (0,2):o2net_idle_timer:1313 here are some times that might help debug the situation: (tmr 1196859485.13835 now 1196859495.12881 dr 1196859485.13824 adv 1196859485.13837:1196859485.13838 func (434028bd:504) 1196859485.12053:1196859485.12057)
> Dec  5 12:58:15 webhost1 kernel: o2net: no longer connected to node webhost2 (num 1) at 10.2.0.71:7777
> Dec  5 12:58:15 webhost1 kernel: (8511,2):dlm_send_proxy_ast_msg:457 ERROR: status = -112
> Dec  5 12:58:15 webhost1 kernel: (8511,2):dlm_flush_asts:589 ERROR: status = -112
> Dec  5 12:58:55 webhost1 kernel: (11011,3):ocfs2_replay_journal:1184 Recovering node 1 from slot 0 on device (147,0)
>
>
>
>
>
> --------------------------------------------------------------------------------------------
>
>
>
> On Monday, December 3, 2007 7:18:12 PM Mark Fasheh wrote:
>
> On Mon, Dec 03, 2007 at 04:45:01AM -0800, rain c wrote:
>> thanks very much for your answer.
>>
>> My problem is that I cannot really use kernel 2.6.22, because I also need
>> the openVZ patch, which is not available in a stable version for 2.6.22. Is
>> there a way to backport ocfs2-Retry-if-it-returns-EAGAIN to 2.6.18?
>
> Attached is a pair of patches which applied more cleanly. Basically it
> includes another tcp.c fix which the -EAGAIN fix built on top of. Both would
> be good for you to have one way or the other. Fair warning though - I don't
> really have the ability to test 2.6.18 fixes right now, so you're going to
> have to be a bit of a beta tester ;) That said, they look pretty clean to me,
> so I have relatively high confidence that they should work.
>
>
>
> Be sure to apply them in order:
>
> $ cd linux-2.6.18
> $ patch -p1 < 0001-ocfs2-Backport-message-locking-fix-to-2.6.18.patch
> $ patch -p1 < 0002-ocfs2-Backport-sendpage-fix-to-2.6.18.patch
>
>
>
>
>
>> Further I wonder why only one (and always the same) of my nodes is so
>> unstable.
>
> I'm not sure why it would always be one node and not the other. We'd
> probably need more detailed information about what's going on to figure
> that out. Maybe some combination of user application + cluster stack
> conspires to put a larger messaging load on it?
>
> Are there any other ocfs2 messages in your logs for that node?
>
>
>
>
>
>> Are you sure that it cannot be any other problem?
>
>
>
> No, not 100% sure. My first hunch was the -EAGAIN bug because your messages
> looked exactly like what I saw there. Looking a bit deeper, it seems that
> your value (when turned into a signed integer) is -32, which would actually
> make it -EPIPE.
>
> -EPIPE gets returned from several places in the tcp code, in particular
> do_tcp_sendpages() and sk_stream_wait_memory(). If you look at the 1st patch
> that's attached, you'll see that it fixes some races that occurred when
> sending outgoing messages, including when those functions were called. While
> I'm not 100% sure these patches will fix it, I definitely think it's the 1st
> thing we should try.
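>
> For reference, the negative status codes in these logs are plain Linux errno
> values: -112 is -EHOSTDOWN, -107 is -ENOTCONN, and -32 is -EPIPE. Below is a
> minimal C sketch (not part of the patches in this thread) that prints the
> corresponding error strings with strerror():
>
> #include <stdio.h>
> #include <string.h>   /* strerror() */
>
> int main(void)
> {
>     /* Status values seen in this thread: -32 in the earlier logs,
>      * -107 and -112 in the logs above. */
>     int codes[] = { -32, -107, -112 };
>
>     for (unsigned i = 0; i < sizeof(codes) / sizeof(codes[0]); i++)
>         /* strerror() expects the positive errno value. */
>         printf("status = %d -> %s\n", codes[i], strerror(-codes[i]));
>
>     return 0;
> }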
>
>
>
> By the way, while you're doing this it might be a good idea to also apply
> some of the other patches we backported to 2.6.18 a long time ago:
>
> http://www.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/backports/2.6.18/
>
> If the two patches here work for you, I'll probably just add them to that
> directory for others to use.
>
>     --Mark
>
> --
> Mark Fasheh
> Senior Software Developer, Oracle
> mark.fasheh at oracle.com
>
>
>
>
>
> -----Inline Attachment Follows-----
>
>
>
> From 42318a6658696711baf25d8bd17e3d2827472d66 Mon Sep 17 00:00:00 2001
> From: Zhen Wei <zwei at novell.com>
> Date: Tue, 23 Jan 2007 17:19:59 -0800
> Subject: ocfs2: Backport message locking fix to 2.6.18
>
>
>
> Untested fix, apply at your own risk.
> Original commit message follows.
>
> ocfs2: introduce sc->sc_send_lock to protect outbound messages
>
>
>
> When there is a lot of multithreaded I/O usage, two threads can collide
> while sending out a message to the other nodes. This is due to the lack of
> locking between threads while sending out the messages.
>
> When a connected TCP send(), sendto(), or sendmsg() arrives in the Linux
> kernel, it eventually comes through tcp_sendmsg(). tcp_sendmsg() protects
> itself by acquiring a lock at invocation by calling lock_sock().
> tcp_sendmsg() then loops over the buffers in the iovec, allocating
> associated sk_buff's and cache pages for use in the actual send. As it does
> so, it pushes the data out to tcp for actual transmission. However, if one
> of those allocations fails (because a large number of large sends is being
> processed, for example), it must wait for memory to become available. It
> does so by jumping to wait_for_sndbuf or wait_for_memory, both of which
> eventually cause a call to sk_stream_wait_memory(). sk_stream_wait_memory()
> contains a code path that calls sk_wait_event(). Finally, sk_wait_event()
> contains the call to release_sock().
>
> The following patch adds a lock to the socket container in order to
> properly serialize outbound requests.
>
>
>
> From: Zhen Wei <zwei at novell.com>
> Acked-by: Jeff Mahoney <jeffm at suse.com>
> Signed-off-by: Mark Fasheh <mark.fasheh at oracle.com>
>
> ---
>  fs/ocfs2/cluster/tcp.c          |    8 ++++++++
>  fs/ocfs2/cluster/tcp_internal.h |    2 ++
>  2 files changed, 10 insertions(+), 0 deletions(-)
>
>
>
> diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
> index b650efa..3c5bf4d 100644
> --- a/fs/ocfs2/cluster/tcp.c
> +++ b/fs/ocfs2/cluster/tcp.c
> @@ -520,6 +520,8 @@ static void o2net_register_callbacks(struct sock *sk,
>      sk->sk_data_ready = o2net_data_ready;
>      sk->sk_state_change = o2net_state_change;
>  
> +    mutex_init(&sc->sc_send_lock);
> +
>      write_unlock_bh(&sk->sk_callback_lock);
>  }
>  
> @@ -818,10 +820,12 @@ static void o2net_sendpage(struct o2net_sock_container *sc,
>      ssize_t ret;
>  
>  
> +    mutex_lock(&sc->sc_send_lock);
>      ret = sc->sc_sock->ops->sendpage(sc->sc_sock,
>                       virt_to_page(kmalloced_virt),
>                       (long)kmalloced_virt & ~PAGE_MASK,
>                       size, MSG_DONTWAIT);
> +    mutex_unlock(&sc->sc_send_lock);
>      if (ret != size) {
>          mlog(ML_ERROR, "sendpage of size %zu to " SC_NODEF_FMT
>               " failed with %zd\n", size, SC_NODEF_ARGS(sc), ret);
> @@ -936,8 +940,10 @@ int o2net_send_message_vec(u32 msg_type, u32 key, struct kvec *caller_vec,
>  
>      /* finally, convert the message header to network byte-order
>       * and send */
> +    mutex_lock(&sc->sc_send_lock);
>      ret = o2net_send_tcp_msg(sc->sc_sock, vec, veclen,
>                   sizeof(struct o2net_msg) + caller_bytes);
> +    mutex_unlock(&sc->sc_send_lock);
>      msglog(msg, "sending returned %d\n", ret);
>      if (ret < 0) {
>          mlog(0, "error returned from o2net_send_tcp_msg=%d\n", ret);
> @@ -1068,8 +1074,10 @@ static int o2net_process_message(struct o2net_sock_container *sc,
>  
>  out_respond:
>      /* this destroys the hdr, so don't use it after this */
> +    mutex_lock(&sc->sc_send_lock);
>      ret = o2net_send_status_magic(sc->sc_sock, hdr, syserr,
>                        handler_status);
> +    mutex_unlock(&sc->sc_send_lock);
>      hdr = NULL;
>      mlog(0, "sending handler status %d, syserr %d returned %d\n",
>           handler_status, syserr, ret);
> diff --git a/fs/ocfs2/cluster/tcp_internal.h b/fs/ocfs2/cluster/tcp_internal.h
> index ff9e2e2..008fcf9 100644
> --- a/fs/ocfs2/cluster/tcp_internal.h
> +++ b/fs/ocfs2/cluster/tcp_internal.h
> @@ -142,6 +142,8 @@ struct o2net_sock_container {
>      struct timeval         sc_tv_func_stop;
>      u32            sc_msg_key;
>      u16            sc_msg_type;
> +
> +    struct mutex        sc_send_lock;
>  };
>  
>  struct o2net_msg_handler {
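>
> For illustration only (not part of the backport above): a minimal userspace
> sketch of the serialization idea behind sc_send_lock, where a single mutex
> around the shared send path keeps one thread's header and payload from
> interleaving with another thread's message. Here stdout stands in for the
> shared socket, and all names are made up for the example.
>
> #include <pthread.h>
> #include <stdio.h>
>
> static pthread_mutex_t send_lock = PTHREAD_MUTEX_INITIALIZER;
> static FILE *wire;   /* stand-in for the shared socket */
>
> /* Send one "message" as two writes (header, then payload). Without the
>  * lock, writes from different threads could interleave on the wire. */
> static void send_message(int thread_id, int seq)
> {
>     pthread_mutex_lock(&send_lock);
>     fprintf(wire, "hdr[t%d:%d] ", thread_id, seq);
>     fprintf(wire, "payload[t%d:%d]\n", thread_id, seq);
>     pthread_mutex_unlock(&send_lock);
> }
>
> static void *sender(void *arg)
> {
>     int id = (int)(long)arg;
>     for (int i = 0; i < 5; i++)
>         send_message(id, i);
>     return NULL;
> }
>
> int main(void)
> {
>     pthread_t t1, t2;
>
>     wire = stdout;
>     pthread_create(&t1, NULL, sender, (void *)1L);
>     pthread_create(&t2, NULL, sender, (void *)2L);
>     pthread_join(t1, NULL);
>     pthread_join(t2, NULL);
>     return 0;
> }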



