[Ocfs2-users] Soft lockup preceding directory becoming inaccessible.

Sunil Mushran sunil.mushran at oracle.com
Mon Oct 27 13:45:17 PDT 2008


find and du are not the same. A find may only need to read the
directory entries, and thus only takes the lock on the directory
inode. du, on the other hand, needs to read all the inodes in the
directory tree.
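
To make the difference concrete, here is a minimal sketch of the two
access patterns (plain POSIX calls, nothing ocfs2-specific, and du's
recursion into subdirectories is left out):

/*
 * list_only() is roughly what a plain find does: it only reads the
 * directory entries, so only the directory inode is involved.
 * stat_each() is roughly what du does: every entry is stat()ed, so
 * every inode in the directory has to be read (and, on ocfs2, its
 * cluster lock taken).
 */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

static void list_only(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *e;

    if (!d)
        return;
    while ((e = readdir(d)) != NULL)
        printf("%s\n", e->d_name);      /* names come from the dir inode */
    closedir(d);
}

static void stat_each(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *e;
    struct stat st;
    char path[4096];

    if (!d)
        return;
    while ((e = readdir(d)) != NULL) {
        snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
        if (stat(path, &st) == 0)       /* reads that entry's inode */
            printf("%s\t%lld blocks\n", path, (long long)st.st_blocks);
    }
    closedir(d);
}

int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : ".";

    list_only(dir);
    stat_each(dir);
    return 0;
}

Run against the problem directory, the first loop behaves like the
find that worked, while the second blocks as soon as it stat()s an
inode whose cluster lock is held elsewhere, which would match the
hang you saw with du.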

The stack shows a soft lockup that could have been triggered if the
process was extending the file, via ftruncate(), to a very large
size. As ocfs2 1.2 does not support sparse files, it would not only
need to allocate the space on disk, it would also need to clear out
the pages in memory. If this is happening frequently, look into
upgrading to ocfs2 1.4, as it supports sparse files. For more, refer
to the 1.4 user's guide.
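
For reference, a workload of this shape would be enough to trigger it
(a hypothetical sketch: the file name and the 2 GB size are made up,
and a 64-bit off_t is assumed, as on your x86_64 boxes):

/*
 * Extending a file with ftruncate().  On ocfs2 1.2, which has no
 * sparse file support, the gap cannot be left as a hole, so the
 * extend has to allocate and zero every block up to the new size.
 * That is the kind of long-running work that can trip the soft
 * lockup detector.  On ocfs2 1.4 the same call should simply leave
 * a hole.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("cachefile", O_CREAT | O_WRONLY, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* extend to 2 GB without writing any data */
    if (ftruncate(fd, 2LL * 1024 * 1024 * 1024) != 0)
        perror("ftruncate");
    close(fd);
    return 0;
}

The same extend path is taken by a buffered write past the current
end of file, which is what the sys_write -> ocfs2_extend_file frames
in the trace below show.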

The last issue looks similar to a bugzilla we fixed in 1.2.9-1.
So if not 1.4, at least upgrade to 1.2.9-1.

Sunil

Frans Lawaetz wrote:
> Hi,
>
> I experienced an OCFS related OS crash and, potentially, some sort of
> cluster-wide locking problem that I am having trouble diagnosing.
>
> The symptoms first appeared as alerts about very high load across the
> Rails / OCFS cluster (four nodes).  Further debugging showed that one
> directory in particular was causing rails processes to stop responding
> (even to a kill -9).  Changing into this directory would result in my
> shell immediately hanging.  Strangely, I could run a "find" down this
> directory but a "du" would lock.  The directory itself is tiny,
> comprising only 147 files, 11 MB in size.  About half of these files get
> updated every hour or so as their cache contents expire.  40 different
> rails processes spread out over the four different OCFS nodes can
> opportunistically update one of these cache files if they are the first
> to run a query against it (meaning that there may have been
> near-simultaneous attempts at an update).
>
> I rebooted three of the nodes, two of which successfully came back
> online.  A fourth node in the cluster (the one from which I issued the
> reboot commands) experienced a kernel panic at this juncture and froze.
>
> After all nodes were cleanly rebooted I unmounted the partition
> everywhere and ran a fsck on the filesystem.  It finished with no errors
> reported.  The data in the cache files, however, was somewhat corrupted
> (not totally random but looking a lot like a failed write).
>
> The closest thing to a smoking gun is the output below, tens of pages of
> which were recorded to the node's syslog facility (no other nodes
> reported these lockups).  The time aligns with when things first started
> to go wrong.
>
> I am running CentOS 5, kernel 2.6.18-53.1.21.el5 #1 SMP x86_64,
> ocfs2-tools 1.2.7-1, and OCFS2 1.2.8-2.
>
> Oct 26 15:37:39 web2 kernel: BUG: soft lockup detected on CPU#1!
> Oct 26 15:37:39 web2 kernel:
> Oct 26 15:37:39 web2 kernel: Call Trace:
> Oct 26 15:37:39 web2 kernel:  <IRQ>  [<ffffffff800b6475>]
> softlockup_tick+0xd5/0xe7
> Oct 26 15:37:39 web2 kernel:  [<ffffffff80091040>] tasklet_action+0x62/0xac
> Oct 26 15:37:39 web2 kernel:  [<ffffffff8009413e>]
> update_process_times+0x42/0x68
> Oct 26 15:37:39 web2 kernel:  [<ffffffff80075724>]
> smp_local_timer_interrupt+0x2c/0x61
> Oct 26 15:37:39 web2 kernel:  [<ffffffff8005d2fc>] call_softirq+0x1c/0x28
> Oct 26 15:37:39 web2 kernel:  [<ffffffff80075dec>]
> smp_apic_timer_interrupt+0x41/0x47
> Oct 26 15:37:39 web2 kernel:  [<ffffffff8005cc8e>]
> apic_timer_interrupt+0x66/0x6c
> Oct 26 15:37:39 web2 kernel:  <EOI>  [<ffffffff884644fb>]
> :ocfs2:ocfs2_get_block+0x0/0x967
> Oct 26 15:37:39 web2 kernel:  [<ffffffff800d82a5>]
> cont_prepare_write+0x12c/0x26e
> Oct 26 15:37:39 web2 kernel:  [<ffffffff800d822c>]
> cont_prepare_write+0xb3/0x26e
> Oct 26 15:37:39 web2 kernel:  [<ffffffff884644fb>]
> :ocfs2:ocfs2_get_block+0x0/0x967
> Oct 26 15:37:39 web2 kernel:  [<ffffffff88465172>]
> :ocfs2:ocfs2_prepare_write+0x176/0x1d5
> Oct 26 15:37:39 web2 kernel:  [<ffffffff8000f9cb>]
> generic_file_buffered_write+0x25a/0x6d8
> Oct 26 15:37:39 web2 kernel:  [<ffffffff8846281c>]
> :ocfs2:ocfs2_insert_extent+0xb8d/0xc77
> Oct 26 15:37:39 web2 kernel:  [<ffffffff8000ddd9>] current_fs_time+0x3b/0x40
> Oct 26 15:37:39 web2 kernel:  [<ffffffff80015df1>]
> __generic_file_aio_write_nolock+0x36c/0x3b8
> Oct 26 15:37:39 web2 kernel:  [<ffffffff88472d53>]
> :ocfs2:ocfs2_extend_file+0xc89/0xd17
> Oct 26 15:37:39 web2 kernel:  [<ffffffff800bf380>]
> generic_file_aio_write_nolock+0x20/0x6c
> Oct 26 15:37:39 web2 kernel:  [<ffffffff800bf74c>]
> generic_file_write_nolock+0x8f/0xa8
> Oct 26 15:37:39 web2 kernel:  [<ffffffff8009c4a2>]
> autoremove_wake_function+0x0/0x2e
> Oct 26 15:37:39 web2 kernel:  [<ffffffff800635bf>] __down_read+0x12/0x92
> Oct 26 15:37:39 web2 kernel:  [<ffffffff88473435>]
> :ocfs2:ocfs2_file_write+0x1cb/0x28a
> Oct 26 15:37:39 web2 kernel:  [<ffffffff80016203>] vfs_write+0xce/0x174
> Oct 26 15:37:39 web2 kernel:  [<ffffffff80016ad0>] sys_write+0x45/0x6e
> Oct 26 15:37:39 web2 kernel:  [<ffffffff8005c28d>] tracesys+0xd5/0xe0
> Oct 26 15:37:39 web2 kernel:
> Oct 26 15:38:10 web2 kernel: BUG: soft lockup detected on CPU#3!
> Oct 26 15:38:10 web2 kernel:
> Oct 26 15:38:10 web2 kernel: Call Trace:
> Oct 26 15:38:10 web2 kernel:  <IRQ>  [<ffffffff800b6475>]
> softlockup_tick+0xd5/0xe7
> Oct 26 15:38:10 web2 kernel:  [<ffffffff80091040>] tasklet_action+0x62/0xac
> Oct 26 15:38:10 web2 kernel:  [<ffffffff8009413e>]
> update_process_times+0x42/0x68
>
> The only other syslog messages that might be of interest were reported
> on a different system, some hours later, when I attempted to reboot the
> cluster nodes.
>
> Oct 26 18:33:51 web3 kernel: o2net: connection to node web1 (num 1) at
> x.x.x.x:7777 has been idle for 30.0 seconds, shutting it down.
> Oct 26 18:33:51 web3 kernel: (0,0):o2net_idle_timer:1426 here are some
> times that might help debug the situation: (tmr 1225046001.897649 now
> 1225046031.898315 dr 1225046001.897645 adv
> 1225046001.897651:1225046001.897652 func (efd9a54a:502)
> 1225046001.881165:1225046001.881184)
> Oct 26 18:33:51 web3 kernel: o2net: no longer connected to node web1
> (num 1) at 10.5.7.10:7777
> Oct 26 18:33:51 web3 kernel: (4389,3):dlm_drop_lockres_ref:2284 ERROR:
> status = -112
> Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_broadcast_vote:731 ERROR:
> status = -112
> Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_do_request_vote:804 ERROR:
> status = -112
> Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_query_inode_wipe:760 ERROR:
> status = -112
> Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_delete_inode:851 ERROR:
> status = -112
> Oct 26 18:33:51 web3 kernel:
> (30119,0):dlm_send_remote_unlock_request:360 ERROR: status = -107
> Oct 26 18:33:51 web3 kernel:
> (30119,0):dlm_send_remote_convert_request:398 ERROR: status = -107
> Oct 26 18:33:51 web3 kernel: (30119,0):dlm_wait_for_node_death:365
> E7CF726AF2D745CC82E521F820198D83: waiting 5000ms for notification of
> death of node 1
> Oct 26 18:33:51 web3 kernel: (4389,3):dlm_purge_lockres:189 ERROR:
> status = -112
> Oct 26 18:33:51 web3 kernel: (4389,3):dlm_drop_lockres_ref:2284 ERROR:
> status = -107
> Oct 26 18:33:51 web3 kernel: (4389,3):dlm_purge_lockres:189 ERROR:
> status = -107
> Oct 26 18:33:51 web3 kernel: (4389,3):dlm_drop_lockres_ref:2284 ERROR:
> status = -107
> Oct 26 18:33:51 web3 kernel: (4389,3):dlm_purge_lockres:189 ERROR:
> status = -107
>
> [and pages more of these status codes]
>
> Thanks in advance,
> Frans
>
>



