[Ocfs2-users] Soft lockup preceding directory becoming inaccessible.

Frans Lawaetz frans at tributes.com
Mon Oct 27 11:52:15 PDT 2008


Hi,

I experienced an OCFS2-related OS crash and, potentially, some sort of
cluster-wide locking problem that I am having trouble diagnosing.

The symptoms first appeared as alerts about very high load across the
Rails / OCFS2 cluster (four nodes).  Further debugging showed that one
directory in particular was causing Rails processes to stop responding
(even to a kill -9).  Changing into this directory would cause my shell
to hang immediately.  Strangely, I could run a "find" down this
directory, but a "du" would lock up.  The directory itself is tiny,
comprising only 147 files and 11 MB in total.  About half of these files
get updated every hour or so as their cached contents expire.  Any of
the 40 Rails processes spread across the four OCFS2 nodes can
opportunistically update one of these cache files if it is the first to
run a query against it, meaning that there may have been
near-simultaneous attempts at an update.
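
To make that update pattern concrete, here is a rough sketch of what
each process does when it notices a stale cache file (the function and
helper names below are invented and the real code differs, but the
shape is the same):

    def refresh_cache(cache_path)
      # Cache files expire roughly hourly; whichever process notices a
      # stale file rewrites it.  Nothing coordinates writers across nodes.
      return unless File.mtime(cache_path) < Time.now - 3600

      payload = run_expensive_query   # hypothetical helper for the query
      # Plain truncate-and-rewrite: two processes on different nodes that
      # both saw a stale mtime can end up writing the same file at almost
      # the same moment.
      File.open(cache_path, "w") { |f| f.write(payload) }
    end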

I rebooted three of the nodes, two of which successfully came back
online.  The fourth node in the cluster (the one I issued the reboot
commands from) experienced a kernel panic at that point and froze.

After all nodes were cleanly rebooted, I unmounted the partition
everywhere and ran an fsck on the filesystem.  It finished with no
errors reported.  The data in the cache files, however, was somewhat
corrupted (not totally random, but looking a lot like a failed write).

The closest thing to a smoking gun is the output below, tens of pages of
which were recorded to that node's syslog facility (no other nodes
reported these lockups).  The timestamps align with when things first
started to go wrong.

I am running CentOS 5, kernel 2.6.18-53.1.21.el5 #1 SMP x86_64, with
ocfs2-tools 1.2.7-1 and OCFS2 1.2.8-2.

Oct 26 15:37:39 web2 kernel: BUG: soft lockup detected on CPU#1!
Oct 26 15:37:39 web2 kernel:
Oct 26 15:37:39 web2 kernel: Call Trace:
Oct 26 15:37:39 web2 kernel:  <IRQ>  [<ffffffff800b6475>] softlockup_tick+0xd5/0xe7
Oct 26 15:37:39 web2 kernel:  [<ffffffff80091040>] tasklet_action+0x62/0xac
Oct 26 15:37:39 web2 kernel:  [<ffffffff8009413e>] update_process_times+0x42/0x68
Oct 26 15:37:39 web2 kernel:  [<ffffffff80075724>] smp_local_timer_interrupt+0x2c/0x61
Oct 26 15:37:39 web2 kernel:  [<ffffffff8005d2fc>] call_softirq+0x1c/0x28
Oct 26 15:37:39 web2 kernel:  [<ffffffff80075dec>] smp_apic_timer_interrupt+0x41/0x47
Oct 26 15:37:39 web2 kernel:  [<ffffffff8005cc8e>] apic_timer_interrupt+0x66/0x6c
Oct 26 15:37:39 web2 kernel:  <EOI>  [<ffffffff884644fb>] :ocfs2:ocfs2_get_block+0x0/0x967
Oct 26 15:37:39 web2 kernel:  [<ffffffff800d82a5>] cont_prepare_write+0x12c/0x26e
Oct 26 15:37:39 web2 kernel:  [<ffffffff800d822c>] cont_prepare_write+0xb3/0x26e
Oct 26 15:37:39 web2 kernel:  [<ffffffff884644fb>] :ocfs2:ocfs2_get_block+0x0/0x967
Oct 26 15:37:39 web2 kernel:  [<ffffffff88465172>] :ocfs2:ocfs2_prepare_write+0x176/0x1d5
Oct 26 15:37:39 web2 kernel:  [<ffffffff8000f9cb>] generic_file_buffered_write+0x25a/0x6d8
Oct 26 15:37:39 web2 kernel:  [<ffffffff8846281c>] :ocfs2:ocfs2_insert_extent+0xb8d/0xc77
Oct 26 15:37:39 web2 kernel:  [<ffffffff8000ddd9>] current_fs_time+0x3b/0x40
Oct 26 15:37:39 web2 kernel:  [<ffffffff80015df1>] __generic_file_aio_write_nolock+0x36c/0x3b8
Oct 26 15:37:39 web2 kernel:  [<ffffffff88472d53>] :ocfs2:ocfs2_extend_file+0xc89/0xd17
Oct 26 15:37:39 web2 kernel:  [<ffffffff800bf380>] generic_file_aio_write_nolock+0x20/0x6c
Oct 26 15:37:39 web2 kernel:  [<ffffffff800bf74c>] generic_file_write_nolock+0x8f/0xa8
Oct 26 15:37:39 web2 kernel:  [<ffffffff8009c4a2>] autoremove_wake_function+0x0/0x2e
Oct 26 15:37:39 web2 kernel:  [<ffffffff800635bf>] __down_read+0x12/0x92
Oct 26 15:37:39 web2 kernel:  [<ffffffff88473435>] :ocfs2:ocfs2_file_write+0x1cb/0x28a
Oct 26 15:37:39 web2 kernel:  [<ffffffff80016203>] vfs_write+0xce/0x174
Oct 26 15:37:39 web2 kernel:  [<ffffffff80016ad0>] sys_write+0x45/0x6e
Oct 26 15:37:39 web2 kernel:  [<ffffffff8005c28d>] tracesys+0xd5/0xe0
Oct 26 15:37:39 web2 kernel:
Oct 26 15:38:10 web2 kernel: BUG: soft lockup detected on CPU#3!
Oct 26 15:38:10 web2 kernel:
Oct 26 15:38:10 web2 kernel: Call Trace:
Oct 26 15:38:10 web2 kernel:  <IRQ>  [<ffffffff800b6475>] softlockup_tick+0xd5/0xe7
Oct 26 15:38:10 web2 kernel:  [<ffffffff80091040>] tasklet_action+0x62/0xac
Oct 26 15:38:10 web2 kernel:  [<ffffffff8009413e>] update_process_times+0x42/0x68

The only other syslog messages that might be of interest were reported
on a different system, some hours later, when I attempted to reboot the
cluster nodes.

Oct 26 18:33:51 web3 kernel: o2net: connection to node web1 (num 1) at x.x.x.x:7777 has been idle for 30.0 seconds, shutting it down.
Oct 26 18:33:51 web3 kernel: (0,0):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1225046001.897649 now 1225046031.898315 dr 1225046001.897645 adv 1225046001.897651:1225046001.897652 func (efd9a54a:502) 1225046001.881165:1225046001.881184)
Oct 26 18:33:51 web3 kernel: o2net: no longer connected to node web1 (num 1) at 10.5.7.10:7777
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_drop_lockres_ref:2284 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_broadcast_vote:731 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_do_request_vote:804 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_query_inode_wipe:760 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_delete_inode:851 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):dlm_send_remote_unlock_request:360 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (30119,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (30119,0):dlm_wait_for_node_death:365 E7CF726AF2D745CC82E521F820198D83: waiting 5000ms for notification of death of node 1
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_purge_lockres:189 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_drop_lockres_ref:2284 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_purge_lockres:189 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_drop_lockres_ref:2284 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_purge_lockres:189 ERROR: status = -107

[and pages more of these status codes]

Thanks in advance,
Frans



