[Ocfs2-users] soft lockup of process that doing I/O on an OCFS2 filesystem

Gang He GHe at suse.com
Fri Jul 12 00:42:16 PDT 2019


Hi Robin,

If you do not use Xen based virtual machines (e.g. use KVM based VM), can you encounter this soft-lockup?
Since we did not encounter this bug before, I am just afraid that the problem is related to Xen environment.
Second, what is you doing when trigger this soft-lockup? From the back-trace, the user is submitting asynchronous I/O blocks.
But, what is the file size? This file was cloned by other files, or not?

Thanks
Gang

-----Original Message-----
From: Robin Lee [mailto:robinlee.sysu at gmail.com] 
Sent: 2019年7月12日 14:05
To: Gang He <GHe at suse.com>
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] soft lockup of process that doing I/O on an OCFS2 filesystem

Just met a new occurrence. We have updated ocfs2 modules to 4.4.180.

crash report with call trace:
https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.github.com_cheese_b002d7c8735b8e29dc4d0911516083d7&d=DwIGaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wXmkJNAUtutY0U9inuQWCbzSSRji5zLpyR0a_Mek4jM&m=5KiogXVKwd-JqBkqrXBoJc4VSDfVyTtGGfDhwd6r8hU&s=2mEsqdlDoI0SfCdR_jDhPm2h_pV8o4M_aP6md8udmrM&e= 

On Mon, Jun 10, 2019 at 5:15 PM Gang He <ghe at suse.com> wrote:
>
> Hi Robin,
>
> From the back-trace, it looks certain function spends too much time before a break.
> Could you help to summarize how to reproduce this problem? e.g. maybe write a big file (with clone operation)?
>
>
> Thanks
> Gang
>
> >>> On 2019/6/4 at 18:01, in message
> <CAG8B0OyB2LCkcuyL0_b5UJ1dqiB9S1D9a7a+j0JOLjSf9JNYrg at mail.gmail.com>, 
> Robin Lee <robinlee.sysu at gmail.com> wrote:
> > Hi,
> >
> > In an OCFS2 cluster of XenServer 7.1.1 hosts, we met a soft lockup 
> > and the kernel crashed.
> > The kernel is based on Linux 4.4.27.
> > The cluster has 9 hosts and 8 OCFS2 filesystems.
> >
> > dmesg from kdump console log:
> >
> > [ 713656.182631]  EMERG: NMI watchdog: BUG: soft lockup - CPU#0 
> > stuck for 23s! [tapdisk:30306]
> > [ 713656.182681]   WARN: Modules linked in: iptable_filter ip_tables
> > xfs usb_storage tun ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb 
> > ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue dm_service_time nbd
> > vport_geneve(O) vport_vxlan(O) vport_gre(O) openvswitch(O)
> > nf_conntrack_ipv6 nf_nat_ipv6 nf_nat_ipv4 nf_defrag_ipv6 nf_nat gre 
> > 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_limit
> > nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_multiport xt_conntrack 
> > dm_multipath nls_iso8859_1 nls_cp437 vfat fat x86_pkg_temp_thermal 
> > coretemp crc32_pclmul aesni_intel aes_x86_64 ablk_helper cryptd lrw 
> > gf128mul glue_helper lpc_ich igb(O) ptp sb_edac edac_core mfd_core 
> > pps_core i2c_i801 ipmi_devintf dm_mod sg ipmi_si ipmi_msghandler 
> > shpchp hed nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc 
> > x_tables hid_generic usbhid hid
> > [ 713656.182777]   WARN:  sr_mod sd_mod cdrom megaraid_sas(O) ahci
> > libahci libata lpfc(O) ehci_pci ehci_hcd scsi_transport_fc 
> > scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ip_vs 
> > libcrc32c nf_conntrack efivarfs ipv6 autofs4 [last unloaded:
> > ip_tables]
> > [ 713656.182804]   WARN: CPU: 0 PID: 30306 Comm: tapdisk Tainted: G
> >        O    4.4.0+2 #1
> > [ 713656.182806]   WARN: Hardware name: Inspur NF8460M3/NF8460M3, BIOS
> > 4.1.01 11/29/2016
> > [ 713656.182809]   WARN: task: ffff88016cea9c00 ti: ffff880057190000
> > task.ti: ffff880057190000
> > [ 713656.182812]   WARN: RIP: e030:[<ffffffff810013aa>]
> > [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
> > [ 713656.182828]   WARN: RSP: e02b:ffff8801c6803b80  EFLAGS: 00000202
> > [ 713656.182830]   WARN: RAX: 0000000000000000 RBX: 0000000000000000
> > RCX: ffffffff810013aa
> > [ 713656.182831]   WARN: RDX: 0000000000000000 RSI: ffff8801c6803ba0
> > RDI: 0000000000000003
> > [ 713656.182832]   WARN: RBP: ffff8801c6803bc0 R08: 0000000000000000
> > R09: ffff8801c5800490
> > [ 713656.182834]   WARN: R10: ffff8801c5800680 R11: 0000000000000202
> > R12: ffff880196d7c1ec
> > [ 713656.182836]   WARN: R13: 0000000000000003 R14: ffff8801c707c500
> > R15: 0000000000000003
> > [ 713656.182849]   WARN: FS:  00007fb8fefa8780(0000)
> > GS:ffff8801c6800000(0000) knlGS:0000000000000000
> > [ 713656.182851]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 713656.182853]   WARN: CR2: 00007f916c44c000 CR3: 0000000059614000
> > CR4: 0000000000042660
> > [ 713656.182856]   WARN: Stack:
> > [ 713656.182857]   WARN:  0000000000000000 00000000fffffffa
> > ffffffff813c43fe 00000002810bf707
> > [ 713656.182860]   WARN:  ffff8801c6803b9c ffffffff00000001
> > 0000000000000000 0000000000000079
> > [ 713656.182863]   WARN:  ffff8801c6803bd0 ffffffff813c5410
> > ffff8801c6803bf8 ffffffff81014bc3
> > [ 713656.182866]   WARN: Call Trace:
> > [ 713656.182870]   WARN:  <IRQ>
> > [ 713656.182882]   WARN:  [<ffffffff813c43fe>] ?
> > xen_poll_irq_timeout+0x3e/0x50
> > [ 713656.182897]   WARN:  [<ffffffff813c5410>] xen_poll_irq+0x10/0x20
> > [ 713656.182904]   WARN:  [<ffffffff81014bc3>] xen_qlock_wait+0x33/0x40
> > [ 713656.182911]   WARN:  [<ffffffff810b1abc>]
> > __pv_queued_spin_lock_slowpath+0x17c/0x210
> > [ 713656.182918]   WARN:  [<ffffffff81156cb7>]
> > queued_spin_lock_slowpath+0xb/0xf
> > [ 713656.182927]   WARN:  [<ffffffff815a060f>] _raw_spin_lock+0x1f/0x30
> > [ 713656.182931]   WARN:  [<ffffffff8159e885>]
> > __mutex_unlock_slowpath+0x25/0x50
> > [ 713656.182934]   WARN:  [<ffffffff8159e8cb>] mutex_unlock+0x1b/0x20
> > [ 713656.183008]   WARN:  [<ffffffffa07286f6>]
> > ocfs2_dio_end_io+0x46/0x80 [ocfs2]
> > [ 713656.183014]   WARN:  [<ffffffff811ed1e6>] dio_complete+0x96/0x160
> > [ 713656.183016]   WARN:  [<ffffffff811ed38f>] dio_bio_end_aio+0xdf/0x100
> > [ 713656.183037]   WARN:  [<ffffffff812ea1a6>] bio_endio+0x46/0x70
> > [ 713656.183041]   WARN:  [<ffffffff812f16c7>] blk_update_request+0x197/0x2c0
> > [ 713656.183053]   WARN:  [<ffffffffa02c323e>] end_clone_bio+0x5e/0x70
> > [dm_mod]
> > [ 713656.183056]   WARN:  [<ffffffff812ea1a6>] bio_endio+0x46/0x70
> > [ 713656.183058]   WARN:  [<ffffffff812f16c7>] blk_update_request+0x197/0x2c0
> > [ 713656.183103]   WARN:  [<ffffffffa00d08a4>]
> > scsi_end_request+0x34/0x1a0 [scsi_mod]
> > [ 713656.183112]   WARN:  [<ffffffffa00d37c9>]
> > scsi_io_completion+0x239/0x610 [scsi_mod]
> > [ 713656.183116]   WARN:  [<ffffffff810c0021>] ?
> > handle_irq_event_percpu+0x171/0x1e0
> > [ 713656.183123]   WARN:  [<ffffffffa00caf98>]
> > scsi_finish_command+0x108/0x120 [scsi_mod]
> > [ 713656.183132]   WARN:  [<ffffffffa00d2ed8>]
> > scsi_softirq_done+0x128/0x140 [scsi_mod]
> > [ 713656.183136]   WARN:  [<ffffffff812f8948>] blk_done_softirq+0x78/0x90
> > [ 713656.183143]   WARN:  [<ffffffff81076199>] __do_softirq+0x129/0x290
> > [ 713656.183146]   WARN:  [<ffffffff810764d2>] irq_exit+0x42/0x90
> > [ 713656.183148]   WARN:  [<ffffffff813c52b5>] xen_evtchn_do_upcall+0x35/0x50
> > [ 713656.183154]   WARN:  [<ffffffff815a26ae>]
> > xen_do_hypervisor_callback+0x1e/0x40
> > [ 713656.183156]   WARN:  <EOI>
> > [ 713656.183159]   WARN:  [<ffffffff810b16a2>] ?
> > __raw_callee_save___pv_queued_spin_unlock+0x2/0x20
> > [ 713656.183162]   WARN:  [<ffffffff8159e9ae>] ?
> > __mutex_lock_slowpath+0x9e/0x100
> > [ 713656.183165]   WARN:  [<ffffffff8159ea2f>] ? mutex_lock+0x1f/0x2f
> > [ 713656.183191]   WARN:  [<ffffffffa0746c3e>] ?
> > ocfs2_file_write_iter+0xa9e/0xd00 [ocfs2]
> > [ 713656.183195]   WARN:  [<ffffffff8100bb6e>] ? xen_pte_val+0xe/0x10
> > [ 713656.183218]   WARN:  [<ffffffffa07461a0>] ?
> > ocfs2_check_range_for_refcount+0x130/0x130 [ocfs2]
> > [ 713656.183238]   WARN:  [<ffffffff811fc099>] ? aio_run_iocb+0x229/0x270
> > [ 713656.183243]   WARN:  [<ffffffff81065f94>] ?
> > get_user_pages_fast+0x104/0x160
> > [ 713656.183272]   WARN:  [<ffffffff813c1e45>] ? gnttab_batch_copy+0x25/0xe0
> > [ 713656.183281]   WARN:  [<ffffffff811a3813>] ? kmem_cache_alloc+0xb3/0x140
> > [ 713656.183289]   WARN:  [<ffffffff811cf7d3>] ? __fdget+0x13/0x20
> > [ 713656.183292]   WARN:  [<ffffffff811fd6b9>] ? do_io_submit+0x389/0x4a0
> > [ 713656.183296]   WARN:  [<ffffffff811fd7e0>] ? SyS_io_submit+0x10/0x20
> > [ 713656.183304]   WARN:  [<ffffffff815a096e>] ?
> > entry_SYSCALL_64_fastpath+0x12/0x71
> > [ 713656.183307]   WARN: Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b
> > 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53
> > b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc 
> > cc cc cc cc cc cc [ 713656.183346]  EMERG: Kernel panic - not 
> > syncing: softlockup: hung tasks
> > [ 713656.183356]   WARN: CPU: 0 PID: 30306 Comm: tapdisk Tainted: G
> >        O L  4.4.0+2 #1
> > [ 713656.183362]   WARN: Hardware name: Inspur NF8460M3/NF8460M3, BIOS
> > 4.1.01 11/29/2016
> > [ 713656.183368]   WARN:  0000000000000000 ffff8801c68037c0
> > ffffffff8131ac13 ffffffff8180f35f
> > [ 713656.183376]   WARN:  00000000000a242d ffff8801c6803840
> > ffffffff81156825 ffff8801c6803b80
> > [ 713656.183384]   WARN:  0000000000000008 ffff8801c6803850
> > ffff8801c68037f0 00000000000000c5
> > [ 713656.183393]   WARN: Call Trace:
> > [ 713656.183396]   WARN:  <IRQ>  [<ffffffff8131ac13>] dump_stack+0x63/0x90
> > [ 713656.183410]   WARN:  [<ffffffff81156825>] panic+0xcb/0x241
> > [ 713656.183421]   WARN:  [<ffffffff8110bf94>] watchdog_timer_fn+0x1a4/0x1d0
> > [ 713656.183427]   WARN:  [<ffffffff8110bdf0>] ? watchdog+0x40/0x40
> > [ 713656.183440]   WARN:  [<ffffffff810d1904>]
> > __hrtimer_run_queues+0x134/0x250
> > [ 713656.183447]   WARN:  [<ffffffff810d1d36>] hrtimer_interrupt+0xa6/0x180
> > [ 713656.183453]   WARN:  [<ffffffff8100c71e>] xen_timer_interrupt+0x2e/0x130
> > [ 713656.183459]   WARN:  [<ffffffff810bff2f>]
> > handle_irq_event_percpu+0x7f/0x1e0
> > [ 713656.183467]   WARN:  [<ffffffff810c347a>] handle_percpu_irq+0x3a/0x50
> > [ 713656.183491]   WARN:  [<ffffffff810bf732>] generic_handle_irq+0x22/0x30
> > [ 713656.183510]   WARN:  [<ffffffff813c648b>]
> > __evtchn_fifo_handle_events+0x14b/0x170
> > [ 713656.183516]   WARN:  [<ffffffff813c64c0>]
> > evtchn_fifo_handle_events+0x10/0x20
> > [ 713656.183523]   WARN:  [<ffffffff813c34da>]
> > __xen_evtchn_do_upcall+0x4a/0x80
> > [ 713656.183529]   WARN:  [<ffffffff813c52b0>] xen_evtchn_do_upcall+0x30/0x50
> > [ 713656.183534]   WARN:  [<ffffffff815a26ae>]
> > xen_do_hypervisor_callback+0x1e/0x40
> > [ 713656.183540]   WARN:  [<ffffffff810013aa>] ?
> > xen_hypercall_sched_op+0xa/0x20
> > [ 713656.183546]   WARN:  [<ffffffff810013aa>] ?
> > xen_hypercall_sched_op+0xa/0x20
> > [ 713656.183551]   WARN:  [<ffffffff813c43fe>] ?
> > xen_poll_irq_timeout+0x3e/0x50
> > [ 713656.183557]   WARN:  [<ffffffff813c5410>] ? xen_poll_irq+0x10/0x20
> > [ 713656.183562]   WARN:  [<ffffffff81014bc3>] ? xen_qlock_wait+0x33/0x40
> > [ 713656.183568]   WARN:  [<ffffffff810b1abc>] ?
> > __pv_queued_spin_lock_slowpath+0x17c/0x210
> > [ 713656.183575]   WARN:  [<ffffffff81156cb7>] ?
> > queued_spin_lock_slowpath+0xb/0xf
> > [ 713656.183581]   WARN:  [<ffffffff815a060f>] ? _raw_spin_lock+0x1f/0x30
> > [ 713656.183586]   WARN:  [<ffffffff8159e885>] ?
> > __mutex_unlock_slowpath+0x25/0x50
> > [ 713656.183593]   WARN:  [<ffffffff8159e8cb>] ? mutex_unlock+0x1b/0x20
> > [ 713656.183618]   WARN:  [<ffffffffa07286f6>] ?
> > ocfs2_dio_end_io+0x46/0x80 [ocfs2]
> > [ 713656.183624]   WARN:  [<ffffffff811ed1e6>] ? dio_complete+0x96/0x160
> > [ 713656.183630]   WARN:  [<ffffffff811ed38f>] ? dio_bio_end_aio+0xdf/0x100
> > [ 713656.183639]   WARN:  [<ffffffff812ea1a6>] ? bio_endio+0x46/0x70
> > [ 713656.183645]   WARN:  [<ffffffff812f16c7>] ?
> > blk_update_request+0x197/0x2c0
> > [ 713656.183655]   WARN:  [<ffffffffa02c323e>] ?
> > end_clone_bio+0x5e/0x70 [dm_mod]
> > [ 713656.183660]   WARN:  [<ffffffff812ea1a6>] ? bio_endio+0x46/0x70
> > [ 713656.183666]   WARN:  [<ffffffff812f16c7>] ?
> > blk_update_request+0x197/0x2c0
> > [ 713656.183679]   WARN:  [<ffffffffa00d08a4>] ?
> > scsi_end_request+0x34/0x1a0 [scsi_mod]
> > [ 713656.183691]   WARN:  [<ffffffffa00d37c9>] ?
> > scsi_io_completion+0x239/0x610 [scsi_mod]
> > [ 713656.183698]   WARN:  [<ffffffff810c0021>] ?
> > handle_irq_event_percpu+0x171/0x1e0
> > [ 713656.183709]   WARN:  [<ffffffffa00caf98>] ?
> > scsi_finish_command+0x108/0x120 [scsi_mod]
> > [ 713656.183721]   WARN:  [<ffffffffa00d2ed8>] ?
> > scsi_softirq_done+0x128/0x140 [scsi_mod]
> > [ 713656.183728]   WARN:  [<ffffffff812f8948>] ? blk_done_softirq+0x78/0x90
> > [ 713656.183741]   WARN:  [<ffffffff81076199>] ? __do_softirq+0x129/0x290
> > [ 713656.183753]   WARN:  [<ffffffff810764d2>] ? irq_exit+0x42/0x90
> > [ 713656.183764]   WARN:  [<ffffffff813c52b5>] ?
> > xen_evtchn_do_upcall+0x35/0x50
> > [ 713656.183776]   WARN:  [<ffffffff815a26ae>] ?
> > xen_do_hypervisor_callback+0x1e/0x40
> > [ 713656.183786]   WARN:  <EOI>  [<ffffffff810b16a2>] ?
> > __raw_callee_save___pv_queued_spin_unlock+0x2/0x20
> > [ 713656.183802]   WARN:  [<ffffffff8159e9ae>] ?
> > __mutex_lock_slowpath+0x9e/0x100
> > [ 713656.183814]   WARN:  [<ffffffff8159ea2f>] ? mutex_lock+0x1f/0x2f
> > [ 713656.183847]   WARN:  [<ffffffffa0746c3e>] ?
> > ocfs2_file_write_iter+0xa9e/0xd00 [ocfs2]
> > [ 713656.183860]   WARN:  [<ffffffff8100bb6e>] ? xen_pte_val+0xe/0x10
> > [ 713656.183892]   WARN:  [<ffffffffa07461a0>] ?
> > ocfs2_check_range_for_refcount+0x130/0x130 [ocfs2]
> > [ 713656.183905]   WARN:  [<ffffffff811fc099>] ? aio_run_iocb+0x229/0x270
> > [ 713656.183916]   WARN:  [<ffffffff81065f94>] ?
> > get_user_pages_fast+0x104/0x160
> > [ 713656.183929]   WARN:  [<ffffffff813c1e45>] ? gnttab_batch_copy+0x25/0xe0
> > [ 713656.183942]   WARN:  [<ffffffff811a3813>] ? kmem_cache_alloc+0xb3/0x140
> > [ 713656.183953]   WARN:  [<ffffffff811cf7d3>] ? __fdget+0x13/0x20
> > [ 713656.183964]   WARN:  [<ffffffff811fd6b9>] ? do_io_submit+0x389/0x4a0
> > [ 713656.183976]   WARN:  [<ffffffff811fd7e0>] ? SyS_io_submit+0x10/0x20
> > [ 713656.183992]   WARN:  [<ffffffff815a096e>] ?
> > entry_SYSCALL_64_fastpath+0x12/0x71
> >
> > -robin
> >
> > _______________________________________________
> > Ocfs2-users mailing list
> > Ocfs2-users at oss.oracle.com
> > https://oss.oracle.com/mailman/listinfo/ocfs2-users




More information about the Ocfs2-users mailing list