[Ocfs2-users] soft lockup of process that doing I/O on an OCFS2 filesystem
Robin Lee
robinlee.sysu at gmail.com
Fri Jul 12 02:05:37 PDT 2019
On Fri, Jul 12, 2019 at 4:12 PM Gang He <GHe at suse.com> wrote:
>
> Hi Robin,
>
> If you do not use Xen based virtual machines (e.g. use KVM based VM), can you encounter this soft-lockup?
> Since we did not encounter this bug before, I am just afraid that the problem is related to Xen environment.
This issue is not simply reproducible. And we don't have the same
deployment without Xen.
> Second, what is you doing when trigger this soft-lockup? From the back-trace, the user is submitting asynchronous I/O blocks.
> But, what is the file size? This file was cloned by other files, or not?
OCFS2 fs is holding the VM images. So the files are large. And all the
I/O goes through libaio.
"This file was cloned by other files", what does this mean?
>
> Thanks
> Gang
>
> -----Original Message-----
> From: Robin Lee [mailto:robinlee.sysu at gmail.com]
> Sent: 2019年7月12日 14:05
> To: Gang He <GHe at suse.com>
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] soft lockup of process that doing I/O on an OCFS2 filesystem
>
> Just met a new occurrence. We have updated ocfs2 modules to 4.4.180.
>
> crash report with call trace:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.github.com_cheese_b002d7c8735b8e29dc4d0911516083d7&d=DwIFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wXmkJNAUtutY0U9inuQWCbzSSRji5zLpyR0a_Mek4jM&m=nSIz57E29aybHSBq5-kx4bULGmAuU2Txd60hvSbShPA&s=1h2L0Y2Th7YHlTge99IE28Qb6yU3Nr9an32JnXUvceA&e=
>
> On Mon, Jun 10, 2019 at 5:15 PM Gang He <ghe at suse.com> wrote:
> >
> > Hi Robin,
> >
> > From the back-trace, it looks certain function spends too much time before a break.
> > Could you help to summarize how to reproduce this problem? e.g. maybe write a big file (with clone operation)?
> >
> >
> > Thanks
> > Gang
> >
> > >>> On 2019/6/4 at 18:01, in message
> > <CAG8B0OyB2LCkcuyL0_b5UJ1dqiB9S1D9a7a+j0JOLjSf9JNYrg at mail.gmail.com>,
> > Robin Lee <robinlee.sysu at gmail.com> wrote:
> > > Hi,
> > >
> > > In an OCFS2 cluster of XenServer 7.1.1 hosts, we met a soft lockup
> > > and the kernel crashed.
> > > The kernel is based on Linux 4.4.27.
> > > The cluster has 9 hosts and 8 OCFS2 filesystems.
> > >
> > > dmesg from kdump console log:
> > >
> > > [ 713656.182631] EMERG: NMI watchdog: BUG: soft lockup - CPU#0
> > > stuck for 23s! [tapdisk:30306]
> > > [ 713656.182681] WARN: Modules linked in: iptable_filter ip_tables
> > > xfs usb_storage tun ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb
> > > ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue dm_service_time nbd
> > > vport_geneve(O) vport_vxlan(O) vport_gre(O) openvswitch(O)
> > > nf_conntrack_ipv6 nf_nat_ipv6 nf_nat_ipv4 nf_defrag_ipv6 nf_nat gre
> > > 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_limit
> > > nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_multiport xt_conntrack
> > > dm_multipath nls_iso8859_1 nls_cp437 vfat fat x86_pkg_temp_thermal
> > > coretemp crc32_pclmul aesni_intel aes_x86_64 ablk_helper cryptd lrw
> > > gf128mul glue_helper lpc_ich igb(O) ptp sb_edac edac_core mfd_core
> > > pps_core i2c_i801 ipmi_devintf dm_mod sg ipmi_si ipmi_msghandler
> > > shpchp hed nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc
> > > x_tables hid_generic usbhid hid
> > > [ 713656.182777] WARN: sr_mod sd_mod cdrom megaraid_sas(O) ahci
> > > libahci libata lpfc(O) ehci_pci ehci_hcd scsi_transport_fc
> > > scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ip_vs
> > > libcrc32c nf_conntrack efivarfs ipv6 autofs4 [last unloaded:
> > > ip_tables]
> > > [ 713656.182804] WARN: CPU: 0 PID: 30306 Comm: tapdisk Tainted: G
> > > O 4.4.0+2 #1
> > > [ 713656.182806] WARN: Hardware name: Inspur NF8460M3/NF8460M3, BIOS
> > > 4.1.01 11/29/2016
> > > [ 713656.182809] WARN: task: ffff88016cea9c00 ti: ffff880057190000
> > > task.ti: ffff880057190000
> > > [ 713656.182812] WARN: RIP: e030:[<ffffffff810013aa>]
> > > [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
> > > [ 713656.182828] WARN: RSP: e02b:ffff8801c6803b80 EFLAGS: 00000202
> > > [ 713656.182830] WARN: RAX: 0000000000000000 RBX: 0000000000000000
> > > RCX: ffffffff810013aa
> > > [ 713656.182831] WARN: RDX: 0000000000000000 RSI: ffff8801c6803ba0
> > > RDI: 0000000000000003
> > > [ 713656.182832] WARN: RBP: ffff8801c6803bc0 R08: 0000000000000000
> > > R09: ffff8801c5800490
> > > [ 713656.182834] WARN: R10: ffff8801c5800680 R11: 0000000000000202
> > > R12: ffff880196d7c1ec
> > > [ 713656.182836] WARN: R13: 0000000000000003 R14: ffff8801c707c500
> > > R15: 0000000000000003
> > > [ 713656.182849] WARN: FS: 00007fb8fefa8780(0000)
> > > GS:ffff8801c6800000(0000) knlGS:0000000000000000
> > > [ 713656.182851] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 713656.182853] WARN: CR2: 00007f916c44c000 CR3: 0000000059614000
> > > CR4: 0000000000042660
> > > [ 713656.182856] WARN: Stack:
> > > [ 713656.182857] WARN: 0000000000000000 00000000fffffffa
> > > ffffffff813c43fe 00000002810bf707
> > > [ 713656.182860] WARN: ffff8801c6803b9c ffffffff00000001
> > > 0000000000000000 0000000000000079
> > > [ 713656.182863] WARN: ffff8801c6803bd0 ffffffff813c5410
> > > ffff8801c6803bf8 ffffffff81014bc3
> > > [ 713656.182866] WARN: Call Trace:
> > > [ 713656.182870] WARN: <IRQ>
> > > [ 713656.182882] WARN: [<ffffffff813c43fe>] ?
> > > xen_poll_irq_timeout+0x3e/0x50
> > > [ 713656.182897] WARN: [<ffffffff813c5410>] xen_poll_irq+0x10/0x20
> > > [ 713656.182904] WARN: [<ffffffff81014bc3>] xen_qlock_wait+0x33/0x40
> > > [ 713656.182911] WARN: [<ffffffff810b1abc>]
> > > __pv_queued_spin_lock_slowpath+0x17c/0x210
> > > [ 713656.182918] WARN: [<ffffffff81156cb7>]
> > > queued_spin_lock_slowpath+0xb/0xf
> > > [ 713656.182927] WARN: [<ffffffff815a060f>] _raw_spin_lock+0x1f/0x30
> > > [ 713656.182931] WARN: [<ffffffff8159e885>]
> > > __mutex_unlock_slowpath+0x25/0x50
> > > [ 713656.182934] WARN: [<ffffffff8159e8cb>] mutex_unlock+0x1b/0x20
> > > [ 713656.183008] WARN: [<ffffffffa07286f6>]
> > > ocfs2_dio_end_io+0x46/0x80 [ocfs2]
> > > [ 713656.183014] WARN: [<ffffffff811ed1e6>] dio_complete+0x96/0x160
> > > [ 713656.183016] WARN: [<ffffffff811ed38f>] dio_bio_end_aio+0xdf/0x100
> > > [ 713656.183037] WARN: [<ffffffff812ea1a6>] bio_endio+0x46/0x70
> > > [ 713656.183041] WARN: [<ffffffff812f16c7>] blk_update_request+0x197/0x2c0
> > > [ 713656.183053] WARN: [<ffffffffa02c323e>] end_clone_bio+0x5e/0x70
> > > [dm_mod]
> > > [ 713656.183056] WARN: [<ffffffff812ea1a6>] bio_endio+0x46/0x70
> > > [ 713656.183058] WARN: [<ffffffff812f16c7>] blk_update_request+0x197/0x2c0
> > > [ 713656.183103] WARN: [<ffffffffa00d08a4>]
> > > scsi_end_request+0x34/0x1a0 [scsi_mod]
> > > [ 713656.183112] WARN: [<ffffffffa00d37c9>]
> > > scsi_io_completion+0x239/0x610 [scsi_mod]
> > > [ 713656.183116] WARN: [<ffffffff810c0021>] ?
> > > handle_irq_event_percpu+0x171/0x1e0
> > > [ 713656.183123] WARN: [<ffffffffa00caf98>]
> > > scsi_finish_command+0x108/0x120 [scsi_mod]
> > > [ 713656.183132] WARN: [<ffffffffa00d2ed8>]
> > > scsi_softirq_done+0x128/0x140 [scsi_mod]
> > > [ 713656.183136] WARN: [<ffffffff812f8948>] blk_done_softirq+0x78/0x90
> > > [ 713656.183143] WARN: [<ffffffff81076199>] __do_softirq+0x129/0x290
> > > [ 713656.183146] WARN: [<ffffffff810764d2>] irq_exit+0x42/0x90
> > > [ 713656.183148] WARN: [<ffffffff813c52b5>] xen_evtchn_do_upcall+0x35/0x50
> > > [ 713656.183154] WARN: [<ffffffff815a26ae>]
> > > xen_do_hypervisor_callback+0x1e/0x40
> > > [ 713656.183156] WARN: <EOI>
> > > [ 713656.183159] WARN: [<ffffffff810b16a2>] ?
> > > __raw_callee_save___pv_queued_spin_unlock+0x2/0x20
> > > [ 713656.183162] WARN: [<ffffffff8159e9ae>] ?
> > > __mutex_lock_slowpath+0x9e/0x100
> > > [ 713656.183165] WARN: [<ffffffff8159ea2f>] ? mutex_lock+0x1f/0x2f
> > > [ 713656.183191] WARN: [<ffffffffa0746c3e>] ?
> > > ocfs2_file_write_iter+0xa9e/0xd00 [ocfs2]
> > > [ 713656.183195] WARN: [<ffffffff8100bb6e>] ? xen_pte_val+0xe/0x10
> > > [ 713656.183218] WARN: [<ffffffffa07461a0>] ?
> > > ocfs2_check_range_for_refcount+0x130/0x130 [ocfs2]
> > > [ 713656.183238] WARN: [<ffffffff811fc099>] ? aio_run_iocb+0x229/0x270
> > > [ 713656.183243] WARN: [<ffffffff81065f94>] ?
> > > get_user_pages_fast+0x104/0x160
> > > [ 713656.183272] WARN: [<ffffffff813c1e45>] ? gnttab_batch_copy+0x25/0xe0
> > > [ 713656.183281] WARN: [<ffffffff811a3813>] ? kmem_cache_alloc+0xb3/0x140
> > > [ 713656.183289] WARN: [<ffffffff811cf7d3>] ? __fdget+0x13/0x20
> > > [ 713656.183292] WARN: [<ffffffff811fd6b9>] ? do_io_submit+0x389/0x4a0
> > > [ 713656.183296] WARN: [<ffffffff811fd7e0>] ? SyS_io_submit+0x10/0x20
> > > [ 713656.183304] WARN: [<ffffffff815a096e>] ?
> > > entry_SYSCALL_64_fastpath+0x12/0x71
> > > [ 713656.183307] WARN: Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b
> > > 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53
> > > b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc
> > > cc cc cc cc cc cc [ 713656.183346] EMERG: Kernel panic - not
> > > syncing: softlockup: hung tasks
> > > [ 713656.183356] WARN: CPU: 0 PID: 30306 Comm: tapdisk Tainted: G
> > > O L 4.4.0+2 #1
> > > [ 713656.183362] WARN: Hardware name: Inspur NF8460M3/NF8460M3, BIOS
> > > 4.1.01 11/29/2016
> > > [ 713656.183368] WARN: 0000000000000000 ffff8801c68037c0
> > > ffffffff8131ac13 ffffffff8180f35f
> > > [ 713656.183376] WARN: 00000000000a242d ffff8801c6803840
> > > ffffffff81156825 ffff8801c6803b80
> > > [ 713656.183384] WARN: 0000000000000008 ffff8801c6803850
> > > ffff8801c68037f0 00000000000000c5
> > > [ 713656.183393] WARN: Call Trace:
> > > [ 713656.183396] WARN: <IRQ> [<ffffffff8131ac13>] dump_stack+0x63/0x90
> > > [ 713656.183410] WARN: [<ffffffff81156825>] panic+0xcb/0x241
> > > [ 713656.183421] WARN: [<ffffffff8110bf94>] watchdog_timer_fn+0x1a4/0x1d0
> > > [ 713656.183427] WARN: [<ffffffff8110bdf0>] ? watchdog+0x40/0x40
> > > [ 713656.183440] WARN: [<ffffffff810d1904>]
> > > __hrtimer_run_queues+0x134/0x250
> > > [ 713656.183447] WARN: [<ffffffff810d1d36>] hrtimer_interrupt+0xa6/0x180
> > > [ 713656.183453] WARN: [<ffffffff8100c71e>] xen_timer_interrupt+0x2e/0x130
> > > [ 713656.183459] WARN: [<ffffffff810bff2f>]
> > > handle_irq_event_percpu+0x7f/0x1e0
> > > [ 713656.183467] WARN: [<ffffffff810c347a>] handle_percpu_irq+0x3a/0x50
> > > [ 713656.183491] WARN: [<ffffffff810bf732>] generic_handle_irq+0x22/0x30
> > > [ 713656.183510] WARN: [<ffffffff813c648b>]
> > > __evtchn_fifo_handle_events+0x14b/0x170
> > > [ 713656.183516] WARN: [<ffffffff813c64c0>]
> > > evtchn_fifo_handle_events+0x10/0x20
> > > [ 713656.183523] WARN: [<ffffffff813c34da>]
> > > __xen_evtchn_do_upcall+0x4a/0x80
> > > [ 713656.183529] WARN: [<ffffffff813c52b0>] xen_evtchn_do_upcall+0x30/0x50
> > > [ 713656.183534] WARN: [<ffffffff815a26ae>]
> > > xen_do_hypervisor_callback+0x1e/0x40
> > > [ 713656.183540] WARN: [<ffffffff810013aa>] ?
> > > xen_hypercall_sched_op+0xa/0x20
> > > [ 713656.183546] WARN: [<ffffffff810013aa>] ?
> > > xen_hypercall_sched_op+0xa/0x20
> > > [ 713656.183551] WARN: [<ffffffff813c43fe>] ?
> > > xen_poll_irq_timeout+0x3e/0x50
> > > [ 713656.183557] WARN: [<ffffffff813c5410>] ? xen_poll_irq+0x10/0x20
> > > [ 713656.183562] WARN: [<ffffffff81014bc3>] ? xen_qlock_wait+0x33/0x40
> > > [ 713656.183568] WARN: [<ffffffff810b1abc>] ?
> > > __pv_queued_spin_lock_slowpath+0x17c/0x210
> > > [ 713656.183575] WARN: [<ffffffff81156cb7>] ?
> > > queued_spin_lock_slowpath+0xb/0xf
> > > [ 713656.183581] WARN: [<ffffffff815a060f>] ? _raw_spin_lock+0x1f/0x30
> > > [ 713656.183586] WARN: [<ffffffff8159e885>] ?
> > > __mutex_unlock_slowpath+0x25/0x50
> > > [ 713656.183593] WARN: [<ffffffff8159e8cb>] ? mutex_unlock+0x1b/0x20
> > > [ 713656.183618] WARN: [<ffffffffa07286f6>] ?
> > > ocfs2_dio_end_io+0x46/0x80 [ocfs2]
> > > [ 713656.183624] WARN: [<ffffffff811ed1e6>] ? dio_complete+0x96/0x160
> > > [ 713656.183630] WARN: [<ffffffff811ed38f>] ? dio_bio_end_aio+0xdf/0x100
> > > [ 713656.183639] WARN: [<ffffffff812ea1a6>] ? bio_endio+0x46/0x70
> > > [ 713656.183645] WARN: [<ffffffff812f16c7>] ?
> > > blk_update_request+0x197/0x2c0
> > > [ 713656.183655] WARN: [<ffffffffa02c323e>] ?
> > > end_clone_bio+0x5e/0x70 [dm_mod]
> > > [ 713656.183660] WARN: [<ffffffff812ea1a6>] ? bio_endio+0x46/0x70
> > > [ 713656.183666] WARN: [<ffffffff812f16c7>] ?
> > > blk_update_request+0x197/0x2c0
> > > [ 713656.183679] WARN: [<ffffffffa00d08a4>] ?
> > > scsi_end_request+0x34/0x1a0 [scsi_mod]
> > > [ 713656.183691] WARN: [<ffffffffa00d37c9>] ?
> > > scsi_io_completion+0x239/0x610 [scsi_mod]
> > > [ 713656.183698] WARN: [<ffffffff810c0021>] ?
> > > handle_irq_event_percpu+0x171/0x1e0
> > > [ 713656.183709] WARN: [<ffffffffa00caf98>] ?
> > > scsi_finish_command+0x108/0x120 [scsi_mod]
> > > [ 713656.183721] WARN: [<ffffffffa00d2ed8>] ?
> > > scsi_softirq_done+0x128/0x140 [scsi_mod]
> > > [ 713656.183728] WARN: [<ffffffff812f8948>] ? blk_done_softirq+0x78/0x90
> > > [ 713656.183741] WARN: [<ffffffff81076199>] ? __do_softirq+0x129/0x290
> > > [ 713656.183753] WARN: [<ffffffff810764d2>] ? irq_exit+0x42/0x90
> > > [ 713656.183764] WARN: [<ffffffff813c52b5>] ?
> > > xen_evtchn_do_upcall+0x35/0x50
> > > [ 713656.183776] WARN: [<ffffffff815a26ae>] ?
> > > xen_do_hypervisor_callback+0x1e/0x40
> > > [ 713656.183786] WARN: <EOI> [<ffffffff810b16a2>] ?
> > > __raw_callee_save___pv_queued_spin_unlock+0x2/0x20
> > > [ 713656.183802] WARN: [<ffffffff8159e9ae>] ?
> > > __mutex_lock_slowpath+0x9e/0x100
> > > [ 713656.183814] WARN: [<ffffffff8159ea2f>] ? mutex_lock+0x1f/0x2f
> > > [ 713656.183847] WARN: [<ffffffffa0746c3e>] ?
> > > ocfs2_file_write_iter+0xa9e/0xd00 [ocfs2]
> > > [ 713656.183860] WARN: [<ffffffff8100bb6e>] ? xen_pte_val+0xe/0x10
> > > [ 713656.183892] WARN: [<ffffffffa07461a0>] ?
> > > ocfs2_check_range_for_refcount+0x130/0x130 [ocfs2]
> > > [ 713656.183905] WARN: [<ffffffff811fc099>] ? aio_run_iocb+0x229/0x270
> > > [ 713656.183916] WARN: [<ffffffff81065f94>] ?
> > > get_user_pages_fast+0x104/0x160
> > > [ 713656.183929] WARN: [<ffffffff813c1e45>] ? gnttab_batch_copy+0x25/0xe0
> > > [ 713656.183942] WARN: [<ffffffff811a3813>] ? kmem_cache_alloc+0xb3/0x140
> > > [ 713656.183953] WARN: [<ffffffff811cf7d3>] ? __fdget+0x13/0x20
> > > [ 713656.183964] WARN: [<ffffffff811fd6b9>] ? do_io_submit+0x389/0x4a0
> > > [ 713656.183976] WARN: [<ffffffff811fd7e0>] ? SyS_io_submit+0x10/0x20
> > > [ 713656.183992] WARN: [<ffffffff815a096e>] ?
> > > entry_SYSCALL_64_fastpath+0x12/0x71
> > >
> > > -robin
> > >
> > > _______________________________________________
> > > Ocfs2-users mailing list
> > > Ocfs2-users at oss.oracle.com
> > > https://oss.oracle.com/mailman/listinfo/ocfs2-users
>
More information about the Ocfs2-users
mailing list