[Ocfs2-users] Disk access hang

Gabriele Alberti gabriele.alberti at pg.infn.it
Thu Mar 11 04:20:10 PST 2010


Hello,
I looked for the information you requested.

1) The eviction message was on all nodes. Playing with grep, I noticed
that on some nodes it appeared twice, with different numbers in
parentheses:

Mar  4 04:10:22 node05 kernel: (22595,1):o2dlm_eviction_cb:258 o2dlm
has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
Mar  4 04:10:23 node05 kernel: (22328,0):o2dlm_eviction_cb:258 o2dlm
has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
Mar  4 04:10:35 node07 kernel: (6900,0):o2dlm_eviction_cb:258 o2dlm
has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
Mar  4 04:10:35 node07 kernel: (6892,0):o2dlm_eviction_cb:258 o2dlm
has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
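
For reference, something along these lines should pull the same entries
out of each node's syslog (the /var/log/messages path is an assumption;
adjust it to wherever your distribution writes kernel messages):

grep o2dlm_eviction_cb /var/log/messages*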

2) The recovery master message appeared on one node; here is the log
from that time. Please note that node10 (the hostname) is Node 3 in the
ocfs2 configuration.

Mar  4 04:09:51 node10 kernel: o2net: connection to node node08 (num
9) at 192.168.1.8:7777 has been idle for 30.0 seconds, shutting it
down.
Mar  4 04:09:51 node10 kernel: (0,0):o2net_idle_timer:1498 here are
some times that might help debug the situation: (tmr 1267672161.718025
now 1267672191.723171 dr 1267672161.718019 adv 1267672161.718025:12676
72161.718026 func (a6c57cb2:502) 1267552114.706439:1267552114.706441)
Mar  4 04:09:51 node10 kernel: o2net: no longer connected to node
node08 (num 9) at 192.168.1.8:7777
Mar  4 04:10:21 node10 kernel: (30475,0):o2net_connect_expired:1659
ERROR: no connection established with node 9 after 30.0 seconds,
giving up and returning errors.
Mar  4 04:10:23 node10 kernel: (30740,1):o2dlm_eviction_cb:258 o2dlm
has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
Mar  4 04:10:23 node10 kernel: (30772,0):dlm_get_lock_resource:839
B2F5C3291557493B99AE7326AF8B7471:$RECOVERY: at least one node (9) to
recover before lock mastery can begin
Mar  4 04:10:23 node10 kernel: (30772,0):dlm_get_lock_resource:873
B2F5C3291557493B99AE7326AF8B7471: recovery map is not empty, but must
master $RECOVERY lock now
Mar  4 04:10:23 node10 kernel: (30772,0):dlm_do_recovery:524 (30772)
Node 3 is the Recovery Master for the Dead Node 9 for Domain
B2F5C3291557493B99AE7326AF8B7471

And the log doesn't contain anything else until the morning. Another
node, however, contains the following:

Mar  4 04:10:29 node05 kernel: (1861,1):ocfs2_replay_journal:1224
Recovering node 9 from slot 7 on device (152,0)

But the ocfs2 disk was unavailable anyway.
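
In case it helps to double-check the hostname-to-node-number mapping
mentioned above (node10 = Node 3), something like the following should
show it; /etc/ocfs2/cluster.conf is the default location for an o2cb
setup, so adjust the path if yours differs:

grep -E 'name|number' /etc/ocfs2/cluster.conf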

Any other hints?

Regards,

G.

On Wed, Mar 10, 2010 at 8:56 PM, Sunil Mushran <sunil.mushran at oracle.com> wrote:
> Was the first set of messages seen on all nodes? On that node at least,
> the o2hb node-down event fired. It should have fired on all nodes.
> This is the dlm eviction message.
>
> If they all fired, then look for a node to have a message that
> reads "Node x is the Recovery Master for the Dead Node y".
>
> That shows a node was elected to run the dlm recovery. That has
> to complete before the journal is replayed. "Recovering node x
> from slot y on device".
>
> I did a quick scan of the patches since 2.6.28. There are a lot
> of them. I did not see any fixes in this area.
> git log --oneline --no-merges v2.6.28..HEAD fs/ocfs2
>
> Sunil
>
> Gabriele Alberti wrote:
>>
>> Hello,
>> I am seeing weird behavior in my ocfs2 cluster. I have a few nodes
>> accessing a shared device, and everything works fine until one node
>> crashes for whatever reason. When this happens, the ocfs2 filesystem
>> hangs and it seems impossible to access it until I bring down all the
>> nodes but one. I have a (commented) log of what happened a few nights
>> ago, when a node shut itself down because of a fan failure. In order
>> to avoid uncontrolled re-joins to the cluster, my nodes stay down
>> once they go down for a reason.
>>
>> The log is available at http://pastebin.com/gDg577hH
>>
>> Is this the expected behavior? I thought that when one node fails,
>> the rest of the cluster should keep working after the timeout (I
>> used the default timeout values).
>>
>> Here are my versions
>>
>> # modinfo ocfs2
>> filename:       /lib/modules/2.6.28.9/kernel/fs/ocfs2/ocfs2.ko
>> author:         Oracle
>> license:        GPL
>> description:    OCFS2 1.5.0
>> version:        1.5.0
>> vermagic:       2.6.28.9 SMP mod_unload modversions PENTIUM4 4KSTACKS
>> depends:        jbd2,ocfs2_stackglue,ocfs2_nodemanager
>> srcversion:     FEA8BA1FCC9D61DAAF32077
>>
>> Best regards,
>>
>> G.
>


