[Ocfs2-users] OCFS2 hanging on writes
Jeff Paterson
jpaterson23 at hotmail.com
Fri Oct 26 07:02:17 PDT 2012
Hello Herbert,
Thanks for your help!
Here is the process stack I get when a "dd" process is hanging:
[<ffffffff81199f56>] __find_get_block_slow+0xc6/0x150
[<ffffffffa070d8e9>] ocfs2_metadata_cache_unlock+0x19/0x30 [ocfs2]
[<ffffffffa070dcd7>] ocfs2_buffer_cached+0xa7/0x1a0 [ocfs2]
[<ffffffffa070e7ec>] ocfs2_set_buffer_uptodate+0x2c/0x100 [ocfs2]
[<ffffffffffffffff>] 0xffffffffffffffff
Regarding the full stack dump, I found the relevant lines by comparing full stack dumps taken while the system was frozen and while it was not. Here are the diffs that were identified:
kworker/12:1 S ffff881fc9af65e0 0 90 2 0x00000000
 ffff881fc9af9e40 0000000000000046 ffff881fc9af9de0 ffffffff81057602
 0000000000012200 ffff881fc9af9fd8 ffff881fc9af8010 0000000000012200
 ffff881fc9af9fd8 0000000000012200 ffff881982b4e680 ffff881fc9af6040
Call Trace:
 [<ffffffff81057602>] ? complete+0x52/0x60
 [<ffffffff81085887>] ? move_linked_works+0x67/0x90
 [<ffffffff81503eaf>] schedule+0x3f/0x60
 [<ffffffff81088a1e>] worker_thread+0x24e/0x3c0
 [<ffffffff810887d0>] ? manage_workers+0x120/0x120
 [<ffffffff8108d546>] kthread+0x96/0xa0
 [<ffffffff8150f304>] kernel_thread_helper+0x4/0x10
 [<ffffffff8108d4b0>] ? kthread_worker_fn+0x1a0/0x1a0
 [<ffffffff8150f300>] ? gs_change+0x13/0x13
dd R running task 0 25826 14680 0x00000080
 ffff881fc4de7388 ffffffff8110c65e ffff881fc4de7388 ffffffff81505e7e
 ffff881fc4de73d8 ffffffff81199f56 ffff881fc4de6010 0000000000012200
 ffff881fc4de7fd8 0000000000000000 ffff881fc4de73c8 ffffffff81505e7e
Call Trace:
 [<ffffffff8110c65e>] ? find_get_page+0x1e/0xa0
 [<ffffffff81505e7e>] ? _raw_spin_lock+0xe/0x20
 [<ffffffff81199f56>] __find_get_block_slow+0xc6/0x150
 [<ffffffff81505e7e>] ? _raw_spin_lock+0xe/0x20
 [<ffffffffa06be462>] ? ocfs2_inode_cache_unlock+0x12/0x20 [ocfs2]
 [<ffffffffa070d8e9>] ocfs2_metadata_cache_unlock+0x19/0x30 [ocfs2]
 [<ffffffffa070dcd7>] ocfs2_buffer_cached+0xa7/0x1a0 [ocfs2]
 [<ffffffffa070e7ec>] ocfs2_set_buffer_uptodate+0x2c/0x100 [ocfs2]
 [<ffffffffa06be482>] ? ocfs2_inode_cache_io_unlock+0x12/0x20 [ocfs2]
 [<ffffffffa06e9a85>] ? ocfs2_block_group_find_clear_bits+0xf5/0x180 [ocfs2]
 [<ffffffffa06ea568>] ? ocfs2_cluster_group_search+0xa8/0x230 [ocfs2]
 [<ffffffffa06eac13>] ? ocfs2_read_group_descriptor+0x73/0xb0 [ocfs2]
 [<ffffffffa06ebe20>] ? ocfs2_search_chain+0x100/0x730 [ocfs2]
 [<ffffffffa06ec7ec>] ? ocfs2_claim_suballoc_bits+0x39c/0x570 [ocfs2]
 [<ffffffffa01b16ff>] ? do_get_write_access+0x35f/0x600 [jbd2]
 [<ffffffffa06eca69>] ? __ocfs2_claim_clusters+0xa9/0x340 [ocfs2]
 [<ffffffffa070d8e9>] ? ocfs2_metadata_cache_unlock+0x19/0x30 [ocfs2]
 [<ffffffffa06ecd1d>] ? ocfs2_claim_clusters+0x1d/0x20 [ocfs2]
 [<ffffffffa06cafdf>] ? ocfs2_local_alloc_new_window+0x6f/0x340 [ocfs2]
 [<ffffffffa06cb413>] ? ocfs2_local_alloc_slide_window+0x163/0x5c0 [ocfs2]
 [<ffffffffa06cb9c7>] ? ocfs2_reserve_local_alloc_bits+0x157/0x340 [ocfs2]
 [<ffffffffa06ca2b8>] ? ocfs2_alloc_should_use_local+0x68/0xd0 [ocfs2]
 [<ffffffffa06ee475>] ? ocfs2_reserve_clusters_with_limit+0xb5/0x320 [ocfs2]
 [<ffffffffa06ef3a8>] ? ocfs2_reserve_clusters+0x18/0x20 [ocfs2]
 [<ffffffffa06efdce>] ? ocfs2_lock_allocators+0x1fe/0x2b0 [ocfs2]
 [<ffffffffa069d7a3>] ? ocfs2_write_begin_nolock+0x913/0x1100 [ocfs2]
 [<ffffffffa06be482>] ? ocfs2_inode_cache_io_unlock+0x12/0x20 [ocfs2]
 [<ffffffffa070d959>] ? ocfs2_metadata_cache_io_unlock+0x19/0x30 [ocfs2]
 [<ffffffffa069f628>] ? ocfs2_read_blocks+0x2f8/0x6c0 [ocfs2]
 [<ffffffffa06c7460>] ? ocfs2_journal_access_eb+0x20/0x20 [ocfs2]
 [<ffffffffa069e086>] ? ocfs2_write_begin+0xf6/0x220 [ocfs2]
 [<ffffffff8110be43>] ? generic_perform_write+0xc3/0x1c0
 [<ffffffffa06b94ad>] ? ocfs2_prepare_inode_for_write+0x10d/0x710 [ocfs2]
 [<ffffffff8110bf6b>] ? generic_file_buffered_write_iter+0x2b/0x60
 [<ffffffffa06ba677>] ? ocfs2_file_write_iter+0x367/0x9b0 [ocfs2]
 [<ffffffffa06bad48>] ? ocfs2_file_aio_write+0x88/0xa0 [ocfs2]
 [<ffffffff8116c7c2>] ? do_sync_write+0xe2/0x120
 [<ffffffff811feae3>] ? security_file_permission+0x23/0x90
 [<ffffffff8116cd58>] ? vfs_write+0xc8/0x190
 [<ffffffff8116cf21>] ? sys_write+0x51/0x90
 [<ffffffff810cc1bb>] ? audit_syscall_exit+0x25b/0x290
 [<ffffffff8150e1c2>] ? system_call_fastpath+0x16/0x1b
Thanks,
Jeff
Date: Thu, 25 Oct 2012 18:39:31 -0700
From: herbert.van.den.bergh at oracle.com
To: jpaterson23 at hotmail.com
CC: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 hanging on writes
Hello Jeff,
You might want to check what the writer process is waiting on when
it's frozen. The wchan column of ps might be enough, but if not,
then perhaps a kernel stack trace of the process from
/proc/<pid>/stack or from echo t > /proc/sysrq-trigger .
The latter will show other blocked processes as well, which may be
helpful in determining the cause of the freeze.
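As a concrete sketch of those steps (the PID below is illustrative; substitute the PID of the hung process):

```shell
# Show the kernel function the process is sleeping in (wchan column):
ps -o pid,state,wchan:32,cmd -p 25826

# Full kernel stack of that one process (usually needs root):
cat /proc/25826/stack

# Dump the stacks of ALL tasks to the kernel log, then read them back;
# blocked processes other than the writer will show up here too:
echo t > /proc/sysrq-trigger
dmesg | tail -n 200
```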
Thanks,
Herbert.
On 10/25/2012 06:32 PM, Jeff Paterson wrote:
Hello,
I need help with our OCFS2 (1.8.0) filesystem. We have been having problems with it for a couple of days: writes to it hang.
The "hanging pattern" is easily reproducible. If I write a 1 GB file on the filesystem, it does the following:
- writes ~200 MB of data to the disk in 1 second
- freezes for about 10 seconds
- writes ~200 MB of data to the disk in 1 second
- freezes for about 10 seconds
(and so on)
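For reference, a timing loop like the following makes the stall pattern visible per chunk (the `write_chunks` helper and the target path are illustrative, not part of any OCFS2 tooling):

```shell
# Write a ~1 GB file in ten chunks and time each one, so the ~10 s
# freezes show up as individual slow iterations.
write_chunks() {
  file=$1
  chunk_mb=${2:-100}          # chunk size in MB; ten chunks total
  for i in $(seq 0 9); do
    start=$(date +%s.%N)
    dd if=/dev/zero of="$file" bs=1M count="$chunk_mb" \
       seek=$((i * chunk_mb)) conv=notrunc 2>/dev/null
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" -v i="$i" \
        'BEGIN { printf "chunk %d: %.2f s\n", i, e - s }'
  done
}

write_chunks /mnt/ocfs2/testfile    # hypothetical mount point
```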
When the freezes occur:
- other write operations (from other processes) on the same node also freeze
- write operations on other nodes are not affected by the freezes on that node
Read operations (on any cluster node, even the one with frozen writes) don't seem to be affected by the freezes. One thing is certain: read operations alone don't cause the filesystem to freeze.
For reference, before the problem appeared we could sustain 640 MB/s writes without any freeze.
I tried mounting the filesystem on a single node to rule out issues with inter-node communication, and the problem was still there.
Filesystem details:
The filesystem is 18 TB and is currently 72% full.
Mount options:
rw,nodev,_netdev,noatime,errors=panic,data=writeback,noacl,nouser_xattr,commit=60,heartbeat=local
All features: backup-super strict-journal-super sparse extended-slotmap inline-data metaecc indexed-dirs refcount discontig-bg unwritten
There is nothing unusual in the system logs besides application errors caused by the freezes.
Would an fsck.ocfs2 help? How long would it take on 18 TB?
Is there a flag I can enable in debugfs.ocfs2 to get a better idea of what is happening and why it is freezing like this?
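In case it helps, a cautious read-only first pass might look like this (the device name is a placeholder; fsck should only be run with the volume unmounted on all nodes):

```shell
DEV=/dev/mapper/ocfs2vol   # hypothetical device name; substitute yours

# Read-only pass: -n answers "no" to every repair prompt, so it only
# reports problems without changing anything on disk.
command -v fsck.ocfs2 >/dev/null && fsck.ocfs2 -n "$DEV" || true

# debugfs.ocfs2 can dump superblock and feature state without repairs:
command -v debugfs.ocfs2 >/dev/null && debugfs.ocfs2 -R "stats" "$DEV" || true
```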
Any help would be greatly appreciated.
Thanks in advance,
Jeff
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users