[Ocfs2-users] issues with my ocfs2 cluster

Jim Okken jim at jokken.com
Wed Dec 27 09:21:59 PST 2017


hi list,

I have an ocfs2 filesystem set up as a shared filesystem between 12 openstack
compute nodes running Ubuntu 16.04.3.
I have serious concerns about its stability.
A month ago I lost a good deal of files; I don't know the real reason, but
things seemed to point to the ocfs2 cluster.
Last week I found many of my compute nodes with the nova service down. The
node which went down first has a "stuck" file/directory in the ocfs2
filesystem:

root@node-99:/mnt/MSA_FC_Vol1/nodes/cb5c94d0-ed4f-457d-88b0-17d49eb7006a# ls

The directory in the command above has vHD files in it. Running a simple 'ls'
command in it hangs indefinitely (I've left it hung for 5 days now; it never
completes). At the end of this email I've pasted the end of the dmesg output.
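
If more detail would help, I can gather the lock state from that node with
something like the following (the device path below is just a placeholder for
our shared volume, not the real one):

    # kernel stack of the hung ls (pid 11052 per the dmesg output below)
    cat /proc/11052/stack

    # dump the ocfs2 lock resources held/wanted on this node
    debugfs.ocfs2 -R "fs_locks" /dev/sdX

    # dump the DLM's view of the lock resources
    debugfs.ocfs2 -R "dlm_locks" /dev/sdX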

I ran fsck.ocfs2 on the filesystem and it did fix some things, but running
'ls' in that directory still hangs, and the nova service still goes down on
all nodes.
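
For reference, the fsck was along these lines (with the volume unmounted on
every node first, since fsck.ocfs2 needs exclusive access; the device path is
again just a placeholder):

    fsck.ocfs2 -fy /dev/sdX    # -f forces a full check, -y answers yes to repairs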

When I restart the nova services on all these nodes, they go down again after
some time. When I stop ocfs2 on all these nodes, they no longer go down.
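
By "stop ocfs2" I mean roughly this on each node (the exact unit/script names
may differ depending on how ocfs2-tools was set up):

    umount /mnt/MSA_FC_Vol1        # unmount the shared volume
    systemctl stop ocfs2 o2cb      # stop the ocfs2 mount service and the o2cb cluster stack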

I have other openstack compute nodes that are identical except that they use
local storage instead of ocfs2, and those have always been stable.

Maybe ocfs2 just isn't stable on Ubuntu 16.04.3? I am using version
1.6.4-3.1.
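
As far as I understand, 1.6.4-3.1 is just the ocfs2-tools userspace package;
the ocfs2 filesystem and DLM code themselves come with the kernel
(4.4.0-98-generic here, per the trace below). To double-check which versions
are actually in play I'm looking at:

    uname -r                  # running kernel, which provides the ocfs2/dlm modules
    modinfo ocfs2 | head      # in-kernel ocfs2 driver details
    dpkg -l ocfs2-tools       # userspace tools: mkfs, fsck, debugfs, o2cb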

Any advice or comments would be appreciated!

[Thu Dec 21 20:22:35 2017] INFO: task ls:11052 blocked for more than 120 seconds.
[Thu Dec 21 20:22:35 2017]       Not tainted 4.4.0-98-generic #121-Ubuntu
[Thu Dec 21 20:22:35 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Dec 21 20:22:35 2017] ls              D ffff880074c7b8d8     0 11052     1 0x00000004
[Thu Dec 21 20:22:35 2017]  ffff880074c7b8d8 ffff882016f1d400 ffff882038f70e00 ffff880035a8c600
[Thu Dec 21 20:22:35 2017]  ffff880074c7c000 ffff880074c7ba80 ffff880074c7ba78 ffff880035a8c600
[Thu Dec 21 20:22:35 2017]  0000000000000000 ffff880074c7b8f0 ffffffff81840585 7fffffffffffffff
[Thu Dec 21 20:22:35 2017] Call Trace:
[Thu Dec 21 20:22:35 2017]  [<ffffffff81840585>] schedule+0x35/0x80
[Thu Dec 21 20:22:35 2017]  [<ffffffff818436d5>] schedule_timeout+0x1b5/0x270
[Thu Dec 21 20:22:35 2017]  [<ffffffff81840fe3>] wait_for_completion+0xb3/0x140
[Thu Dec 21 20:22:35 2017]  [<ffffffff810ac630>] ? wake_up_q+0x70/0x70
[Thu Dec 21 20:22:35 2017]  [<ffffffffc0779145>] __ocfs2_cluster_lock.isra.34+0x415/0x750 [ocfs2]
[Thu Dec 21 20:22:35 2017]  [<ffffffff810f634b>] ? ktime_get+0x3b/0xb0
[Thu Dec 21 20:22:35 2017]  [<ffffffffc077a20a>] ocfs2_inode_lock_full_nested+0x16a/0x920 [ocfs2]
[Thu Dec 21 20:22:35 2017]  [<ffffffffc0786ee9>] ocfs2_iget+0x499/0x6c0 [ocfs2]
[Thu Dec 21 20:22:35 2017]  [<ffffffffc0770ab8>] ? ocfs2_free_dir_lookup_result+0x28/0x50 [ocfs2]
[Thu Dec 21 20:22:35 2017]  [<ffffffffc077264e>] ? ocfs2_lookup_ino_from_name+0x4e/0x70 [ocfs2]
[Thu Dec 21 20:22:35 2017]  [<ffffffffc0796ff5>] ocfs2_lookup+0x145/0x2f0 [ocfs2]
[Thu Dec 21 20:22:35 2017]  [<ffffffff8121a54d>] lookup_real+0x1d/0x60
[Thu Dec 21 20:22:35 2017]  [<ffffffff8121be42>] __lookup_hash+0x42/0x60
[Thu Dec 21 20:22:35 2017]  [<ffffffff8121d1d6>] walk_component+0x226/0x300
[Thu Dec 21 20:22:35 2017]  [<ffffffffc0774e33>] ? ocfs2_should_refresh_lock_res+0x113/0x160 [ocfs2]
[Thu Dec 21 20:22:35 2017]  [<ffffffff8121eb1d>] path_lookupat+0x5d/0x110
[Thu Dec 21 20:22:35 2017]  [<ffffffff81220761>] filename_lookup+0xb1/0x180
[Thu Dec 21 20:22:35 2017]  [<ffffffffc07855d3>] ? ocfs2_inode_revalidate+0x93/0x180 [ocfs2]
[Thu Dec 21 20:22:35 2017]  [<ffffffff811eebb7>] ? kmem_cache_alloc+0x187/0x1f0
[Thu Dec 21 20:22:35 2017]  [<ffffffff81220366>] ? getname_flags+0x56/0x1f0
[Thu Dec 21 20:22:35 2017]  [<ffffffff81220906>] user_path_at_empty+0x36/0x40
[Thu Dec 21 20:22:35 2017]  [<ffffffff81215616>] vfs_fstatat+0x66/0xc0
[Thu Dec 21 20:22:35 2017]  [<ffffffff81215bd1>] SYSC_newlstat+0x31/0x60
[Thu Dec 21 20:22:35 2017]  [<ffffffff81215d0e>] SyS_newlstat+0xe/0x10
[Thu Dec 21 20:22:35 2017]  [<ffffffff818446b2>] entry_SYSCALL_64_fastpath+0x16/0x71
[Thu Dec 21 20:33:10 2017] perf interrupt took too long (7066 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[Thu Dec 21 22:05:02 2017] perf interrupt took too long (10271 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
[Fri Dec 22 00:00:01 2017] Process accounting resumed
[Fri Dec 22 00:10:25 2017] perf interrupt took too long (20273 > 20000), lowering kernel.perf_event_max_sample_rate to 6250
[Fri Dec 22 07:19:10 2017] perf interrupt took too long (41761 > 38461), lowering kernel.perf_event_max_sample_rate to 3250
[Fri Dec 22 23:59:58 2017] Process accounting resumed
[Sat Dec 23 07:19:11 2017] perf interrupt took too long (76936 > 71428), lowering kernel.perf_event_max_sample_rate to 1750
[Sat Dec 23 23:59:56 2017] Process accounting resumed