[Ocfs2-users] Disk access hang

Sunil Mushran sunil.mushran at oracle.com
Thu Mar 11 19:01:24 PST 2010


Other than the messages, the only other persistent information that can be
used for debugging is a tcpdump capture. And I would be surprised if you
have that.
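
If you want to start capturing it going forward, something along these
lines should work (the interface name and output path are just examples;
o2net uses port 7777, as seen in your logs):

# tcpdump -i eth0 -s 0 -w /var/tmp/o2net.pcap port 7777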

Some other useful files are:
# cat /sys/kernel/debug/o2dlm/DOMAIN/dlm_state
# cat /sys/kernel/debug/ocfs2/UUID/fs_state

These are synthetic files and thus not persistent, but they allow one
to monitor the state(s). Say the recovery master is waiting for a
message from a node during recovery; this state file will indicate that.
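
If you want a rough timeline rather than a one-off snapshot, a simple
polling loop is enough (the interval and output path are just examples;
DOMAIN is the domain UUID as above):

# while true; do date; cat /sys/kernel/debug/o2dlm/DOMAIN/dlm_state; sleep 10; done >> /var/tmp/dlm_state.log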

It is interesting that you see the replay_journal on only one node. That
means the dlm recovery completed. That node was then able to take an
exclusive lock on the superblock lock and replay the journal.
The others should have followed.
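
To see how far the other nodes got, it may be worth grepping each node's
log for the replay message (the log path is a guess; adjust for your
syslog setup):

# grep ocfs2_replay_journal /var/log/messages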

Sunil

Gabriele Alberti wrote:
> Hello,
> I looked for the info you requested.
>
> 1) The eviction message was on all nodes. Playing with grep, I noticed
> that on some nodes it appeared twice, with different numbers in parentheses:
>
> Mar  4 04:10:22 node05 kernel: (22595,1):o2dlm_eviction_cb:258 o2dlm
> has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
> Mar  4 04:10:23 node05 kernel: (22328,0):o2dlm_eviction_cb:258 o2dlm
> has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
> Mar  4 04:10:35 node07 kernel: (6900,0):o2dlm_eviction_cb:258 o2dlm
> has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
> Mar  4 04:10:35 node07 kernel: (6892,0):o2dlm_eviction_cb:258 o2dlm
> has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
>
> 2) The recovery master message appeared on one node; here is the log
> from that time. Please note that node10 (hostname) is Node 3 (ocfs2
> settings).
>
> Mar  4 04:09:51 node10 kernel: o2net: connection to node node08 (num
> 9) at 192.168.1.8:7777 has been idle for 30.0 seconds, shutting it
> down.
> Mar  4 04:09:51 node10 kernel: (0,0):o2net_idle_timer:1498 here are
> some times that might help debug the situation: (tmr 1267672161.718025
> now 1267672191.723171 dr 1267672161.718019 adv 1267672161.718025:12676
> 72161.718026 func (a6c57cb2:502) 1267552114.706439:1267552114.706441)
> Mar  4 04:09:51 node10 kernel: o2net: no longer connected to node
> node08 (num 9) at 192.168.1.8:7777
> Mar  4 04:10:21 node10 kernel: (30475,0):o2net_connect_expired:1659
> ERROR: no connection established with node 9 after 30.0 seconds,
> giving up and returning errors.
> Mar  4 04:10:23 node10 kernel: (30740,1):o2dlm_eviction_cb:258 o2dlm
> has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
> Mar  4 04:10:23 node10 kernel: (30772,0):dlm_get_lock_resource:839
> B2F5C3291557493B99AE7326AF8B7471:$RECOVERY: at least one node (9) to
> recover before lock mastery can begin
> Mar  4 04:10:23 node10 kernel: (30772,0):dlm_get_lock_resource:873
> B2F5C3291557493B99AE7326AF8B7471: recovery map is not empty, but must
> master $RECOVERY lock now
> Mar  4 04:10:23 node10 kernel: (30772,0):dlm_do_recovery:524 (30772)
> Node 3 is the Recovery Master for the Dead Node 9 for Domain
> B2F5C3291557493B99AE7326AF8B7471
>
> And the log doesn't contain anything else until the morning.
> Another node's log, however, contains the following:
>
> Mar  4 04:10:29 node05 kernel: (1861,1):ocfs2_replay_journal:1224
> Recovering node 9 from slot 7 on device (152,0)
>
> But the ocfs2 disk remained unavailable anyway.
>
> Any other hints?
>
> Regards,
>
> G.
>
> On Wed, Mar 10, 2010 at 8:56 PM, Sunil Mushran <sunil.mushran at oracle.com> wrote:
>> Was the first set of messages on all nodes? On that node, at least,
>> the o2hb node-down event fired. It should have fired on all nodes.
>> This is the dlm eviction message.
>>
>> If they all fired, then look for a node to have a message that
>> reads "Node x is the Recovery Master for the Dead Node y".
>>
>> That shows a node was elected to run the dlm recovery. That has
>> to complete before the journal is replayed. "Recovering node x
>> from slot y on device".
>>
>> I did a quick scan of the patches since 2.6.28. There are a lot
>> of them. I did not see any fixes in this area.
>> git log --oneline --no-merges v2.6.28..HEAD fs/ocfs2
>>
>> Sunil
>>
>> Gabriele Alberti wrote:
>>> Hello,
>>> I am seeing some weird behavior in my ocfs2 cluster. I have a few nodes
>>> accessing a shared device, and everything works fine until one node
>>> crashes for whatever reason. When this happens, the ocfs2 filesystem
>>> hangs and it seems impossible to access it until I bring down all the
>>> nodes but one. I have a (commented) log of what happened a few nights
>>> ago, when a node shut itself down because of a fan failure. To avoid
>>> uncontrolled re-joins to the cluster, my nodes stay off when they go
>>> down for a reason.
>>>
>>> The log is available at http://pastebin.com/gDg577hH
>>>
>>> Is this the expected behavior? I thought that when one node fails, the
>>> rest of the world should go on working after the timeout (I used the
>>> default values for the timeouts).
>>>
>>> Here are my versions:
>>>
>>> # modinfo ocfs2
>>> filename:       /lib/modules/2.6.28.9/kernel/fs/ocfs2/ocfs2.ko
>>> author:         Oracle
>>> license:        GPL
>>> description:    OCFS2 1.5.0
>>> version:        1.5.0
>>> vermagic:       2.6.28.9 SMP mod_unload modversions PENTIUM4 4KSTACKS
>>> depends:        jbd2,ocfs2_stackglue,ocfs2_nodemanager
>>> srcversion:     FEA8BA1FCC9D61DAAF32077
>>>
>>> Best regards,
>>>
>>> G.



