[Ocfs2-devel] [PATCH] ocfs2: Do not lock/unlock() inode DLM lock
Zhen Ren
zren at suse.com
Thu Jan 7 18:16:18 PST 2016
Hi!
> Hi Eric,
>
> On 01/07/2016 10:33 AM, Eric Ren wrote:
> > Hi,
> >
> > On Tue, Dec 29, 2015 at 07:31:16PM -0700, He Gang wrote:
> >> Hello Goldwyn,
> >>
> >> When the read path cannot get the DLM lock immediately (in NONBLOCK mode), it
> >> next takes the lock in BLOCK mode, and this behavior costs some time (several
> >> msecs). So it looks sensible to delete those two lines of code.
> >> But why were those two lines there in the first place? I just worry that if we
> >> delete them, then when the read path cannot get the DLM lock in NONBLOCK mode,
> >> it will retry repeatedly, and this will cost too much CPU (busy-waiting rather
> >> than sleeping in wait).
> >> I just worry about this possibility; Eric will test this case and give feedback.
> > Sorry for the late reply. After applying this patch, the performance is much
> > improved, but a "soft lockup" occurred several times (the log message is placed
> > at the end). The two lines:
>
> Did you perform this test on the upstream kernel? I performed mine on upstream
> and I did not receive any soft lockups. AFAICS, you performed it on 3.0-based
> kernels (perhaps SLE11SP3), which do not have the patch you have mentioned.
I have not tested on the upstream kernel yet, but I did test on SLES12 SP1, whose
kernel version is 3.12.49, and so far I have not seen the "soft lockup" there
either. I also checked the SLE11SP3 kernel and it really does have the patch
"ocfs2: Avoid livelock in ocfs2_readpage()".
So why the newer kernels do not show the soft lockup issue is strange to me.
>
>
> >
> > if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
> > ocfs2_inode_unlock(inode, ex);
> >
> > are similar to the lines of this commit:
> >
> > commit c7e25e6e0b0486492c5faaf6312b37413642c48e
> > Author: Jan Kara <jack at suse.cz>
> > Date: Thu Jun 23 22:51:47 2011 +0200
> >
> > ocfs2: Avoid livelock in ocfs2_readpage()
> >
> > When someone writes to an inode, readers accessing the same inode via
> > ocfs2_readpage() just busyloop trying to get ip_alloc_sem because
> > do_generic_file_read() looks up the page again and retries ->readpage()
> > when previous attempt failed with AOP_TRUNCATED_PAGE. When there are enough
> > readers, they can occupy all CPUs and in non-preempt kernel the system is
> > deadlocked because writer holding ip_alloc_sem is never run to release the
> > semaphore. Fix the problem by making reader block on ip_alloc_sem to break
> > the busy loop.
> >
> > Signed-off-by: Jan Kara <jack at suse.cz>
> > Signed-off-by: Joel Becker <jlbec at evilplan.org>
> > diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> > index 4c1ec8f..ba3ca1e 100644
> > --- a/fs/ocfs2/aops.c
> > +++ b/fs/ocfs2/aops.c
> > @@ -290,7 +290,15 @@ static int ocfs2_readpage(struct file *file, struct page *page)
> > }
> >
> > if (down_read_trylock(&oi->ip_alloc_sem) == 0) {
> > + /*
> > + * Unlock the page and cycle ip_alloc_sem so that we don't
> > + * busyloop waiting for ip_alloc_sem to unlock
> > + */
> > ret = AOP_TRUNCATED_PAGE;
> > + unlock_page(page);
> > + unlock = 0;
> > + down_read(&oi->ip_alloc_sem);
> > + up_read(&oi->ip_alloc_sem);
> > goto out_inode_unlock;
> > }
> >
> > So, this patch removing the two lines of code may potentially result in a
> > "busy wait", where the reading thread repeatedly checks to see if the cluster
> > lock is available. Please correct me if I'm wrong.
>
> Correct. In older kernels, not upstream. Right?
Right, but the SLE11SP1 kernel has already included the patch above.
Thanks,
Eric
>
> >
> > This is the "soft lookup" log:
> > Dec 30 17:30:40 n2 kernel: [ 248.084012] BUG: soft lockup - CPU#3 stuck for 23s! [iomaker:4061]
> > Dec 30 17:30:40 n2 kernel: [ 248.084015] Modules linked in: ocfs2(FN) jbd2 ocfs2_nodemanager(F) quota_tree ocfs2_stack_user(FN) ocfs2_stackglue(FN) dlm(F) sctp libcrc32c configfs edd sg sd_mod crc_t10dif crc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet mperf fuse loop dm_mod ipv6 ipv6_lib 8139too pcspkr acpiphp rtc_cmos pci_hotplug floppy 8139cp button i2c_piix4 mii virtio_balloon ext3 jbd mbcache ttm drm_kms_helper drm i2c_core sysimgblt sysfillrect syscopyarea uhci_hcd ehci_hcd processor thermal_sys hwmon usbcore usb_common intel_agp intel_gtt scsi_dh_alua scsi_dh_emc scsi_dh_hp_sw scsi_dh_rdac scsi_dh virtio_pci ata_generic ata_piix libata scsi_mod virtio_blk virtio virtio_ring
> > Dec 30 17:30:40 n2 kernel: [ 248.084049] Supported: No, Unsupported modules are loaded
> > Dec 30 17:30:40 n2 kernel: [ 248.084050] CPU 3
> > Dec 30 17:30:40 n2 kernel: [ 248.084051] Modules linked in: ocfs2(FN) jbd2 ocfs2_nodemanager(F) quota_tree ocfs2_stack_user(FN) ocfs2_stackglue(FN) dlm(F) sctp libcrc32c configfs edd sg sd_mod crc_t10dif crc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet mperf fuse loop dm_mod ipv6 ipv6_lib 8139too pcspkr acpiphp rtc_cmos pci_hotplug floppy 8139cp button i2c_piix4 mii virtio_balloon ext3 jbd mbcache ttm drm_kms_helper drm i2c_core sysimgblt sysfillrect syscopyarea uhci_hcd ehci_hcd processor thermal_sys hwmon usbcore usb_common intel_agp intel_gtt scsi_dh_alua scsi_dh_emc scsi_dh_hp_sw scsi_dh_rdac scsi_dh virtio_pci ata_generic ata_piix libata scsi_mod virtio_blk virtio virtio_ring
> > Dec 30 17:30:40 n2 kernel: [ 248.084074] Supported: No, Unsupported modules are loaded
> > Dec 30 17:30:40 n2 kernel: [ 248.084075]
> > Dec 30 17:30:40 n2 kernel: [ 248.084077] Pid: 4061, comm: iomaker Tainted: GF N 3.0.76-0.11-default #1 Bochs Bochs
> > Dec 30 17:30:40 n2 kernel: [ 248.084080] RIP: 0010:[<ffffffff8145c7a9>] [<ffffffff8145c7a9>] _raw_spin_lock+0x9/0x20
> > Dec 30 17:30:40 n2 kernel: [ 248.084087] RSP: 0018:ffff8800db3fbb10 EFLAGS: 00000206
> > Dec 30 17:30:40 n2 kernel: [ 248.084088] RAX: 0000000071967196 RBX: ffff880114c1c000 RCX: 0000000000000004
> > Dec 30 17:30:40 n2 kernel: [ 248.084090] RDX: 0000000000007195 RSI: 0000000000000000 RDI: ffff880114c1c0d0
> > Dec 30 17:30:40 n2 kernel: [ 248.084091] RBP: ffff880114c1c0d0 R08: 0000000000000000 R09: 0000000000000003
> > Dec 30 17:30:40 n2 kernel: [ 248.084092] R10: 0000000000000100 R11: 0000000000014819 R12: ffffffff81464f6e
> > Dec 30 17:30:40 n2 kernel: [ 248.084094] R13: ffff880037b34908 R14: ffff880037b342c0 R15: ffff880037b34908
> > Dec 30 17:30:40 n2 kernel: [ 248.084096] FS: 00007fdbe0f7a720(0000) GS:ffff88011fd80000(0000) knlGS:0000000000000000
> > Dec 30 17:30:40 n2 kernel: [ 248.084097] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > Dec 30 17:30:40 n2 kernel: [ 248.084099] CR2: 00007ff1df54b000 CR3: 00000000db2da000 CR4: 00000000000006e0
> > Dec 30 17:30:40 n2 kernel: [ 248.084103] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > Dec 30 17:30:40 n2 kernel: [ 248.084106] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Dec 30 17:30:40 n2 kernel: [ 248.084108] Process iomaker (pid: 4061, threadinfo ffff8800db3fa000, task ffff880037b342c0)
> > Dec 30 17:30:40 n2 kernel: [ 248.084109] Stack:
> > Dec 30 17:30:40 n2 kernel: [ 248.084112] ffffffffa04ab9b5 0000000000000000 00000039c2f58103 0000000000000000
> > Dec 30 17:30:40 n2 kernel: [ 248.084115] 0000000000000000 ffff880114c1c0d0 ffff880114c1c0d0 ffff880114c1c0d0
> > Dec 30 17:30:40 n2 kernel: [ 248.084117] 0000000000000000 ffff880114c1c000 ffff8800dba0ec38 0000000000000004
> > Dec 30 17:30:40 n2 kernel: [ 248.084120] Call Trace:
> > Dec 30 17:30:40 n2 kernel: [ 248.084145] [<ffffffffa04ab9b5>] ocfs2_wait_for_recovery+0x25/0xc0 [ocfs2]
> > Dec 30 17:30:40 n2 kernel: [ 248.084182] [<ffffffffa049aed8>] ocfs2_inode_lock_full_nested+0x298/0x510 [ocfs2]
> > Dec 30 17:30:40 n2 kernel: [ 248.084204] [<ffffffffa049b161>] ocfs2_inode_lock_with_page+0x11/0x40 [ocfs2]
> > Dec 30 17:30:40 n2 kernel: [ 248.084225] [<ffffffffa048170f>] ocfs2_readpage+0x8f/0x210 [ocfs2]
> > Dec 30 17:30:40 n2 kernel: [ 248.084234] [<ffffffff810f938e>] do_generic_file_read+0x13e/0x490
> > Dec 30 17:30:40 n2 kernel: [ 248.084238] [<ffffffff810f9d3c>] generic_file_aio_read+0xfc/0x260
> > Dec 30 17:30:40 n2 kernel: [ 248.084250] [<ffffffffa04a168b>] ocfs2_file_aio_read+0x14b/0x390 [ocfs2]
> > Dec 30 17:30:40 n2 kernel: [ 248.084265] [<ffffffff811584b8>] do_sync_read+0xc8/0x110
> > Dec 30 17:30:40 n2 kernel: [ 248.084268] [<ffffffff81158c67>] vfs_read+0xc7/0x130
> > Dec 30 17:30:40 n2 kernel: [ 248.084271] [<ffffffff81158dd3>] sys_read+0x53/0xa0
> > Dec 30 17:30:40 n2 kernel: [ 248.084275] [<ffffffff81464592>] system_call_fastpath+0x16/0x1b
> > Dec 30 17:30:40 n2 kernel: [ 248.084280] [<00007fdbe03532a0>] 0x7fdbe035329f
> > Dec 30 17:30:40 n2 kernel: [ 248.084281] Code: 00 75 04 f0 0f b1 17 0f 94 c2 0f b6 c2 85 c0 ba 01 00 00 00 75 02 31 d2 89 d0 c3 0f 1f 80 00 00 00 00 b8 00 00 01 00 f0 0f c1 07 <0f> b7 d0 c1 e8 10 39 c2 74 07 f3 90 0f b7 17 eb f5 c3 0f 1f 44
> > Dec 30 17:30:40 n2 kernel: [ 248.084307] Call Trace:
> > Dec 30 17:30:40 n2 kernel: [ 248.084319] [<ffffffffa04ab9b5>] ocfs2_wait_for_recovery+0x25/0xc0 [ocfs2]
> > Dec 30 17:30:40 n2 kernel: [ 248.084344] [<ffffffffa049aed8>] ocfs2_inode_lock_full_nested+0x298/0x510 [ocfs2]
> > Dec 30 17:30:40 n2 kernel: [ 248.084366] [<ffffffffa049b161>] ocfs2_inode_lock_with_page+0x11/0x40 [ocfs2]
> > Dec 30 17:30:40 n2 kernel: [ 248.084386] [<ffffffffa048170f>] ocfs2_readpage+0x8f/0x210 [ocfs2]
> > Dec 30 17:30:40 n2 kernel: [ 248.084394] [<ffffffff810f938e>] do_generic_file_read+0x13e/0x490
> > Dec 30 17:30:40 n2 kernel: [ 248.084397] [<ffffffff810f9d3c>] generic_file_aio_read+0xfc/0x260
> > Dec 30 17:30:40 n2 kernel: [ 248.084409] [<ffffffffa04a168b>] ocfs2_file_aio_read+0x14b/0x390 [ocfs2]
> > Dec 30 17:30:40 n2 kernel: [ 248.084423] [<ffffffff811584b8>] do_sync_read+0xc8/0x110
> > Dec 30 17:30:40 n2 kernel: [ 248.084426] [<ffffffff81158c67>] vfs_read+0xc7/0x130
> > Dec 30 17:30:40 n2 kernel: [ 248.084429] [<ffffffff81158dd3>] sys_read+0x53/0xa0
> > Dec 30 17:30:40 n2 kernel: [ 248.084431] [<ffffffff81464592>] system_call_fastpath+0x16/0x1b
> > Dec 30 17:30:40 n2 kernel: [ 248.084435] [<00007fdbe03532a0>] 0x7fdbe035329f
> >
> > Thanks,
> > Eric
> >
> >>
> >> Thanks
> >> Gang
> >>
> >>
> >>> From: Goldwyn Rodrigues <rgoldwyn at suse.com>
> >>>
> >>> DLM does not cache locks. So, blocking lock and unlock
> >>> will only make the performance worse where contention over
> >>> the locks is high.
> >>>
> >>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn at suse.com>
> >>> ---
> >>> fs/ocfs2/dlmglue.c | 8 --------
> >>> 1 file changed, 8 deletions(-)
> >>>
> >>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> >>> index 20276e3..f92612e 100644
> >>> --- a/fs/ocfs2/dlmglue.c
> >>> +++ b/fs/ocfs2/dlmglue.c
> >>> @@ -2432,12 +2432,6 @@ bail:
> >>> * done this we have to return AOP_TRUNCATED_PAGE so the aop method
> >>> * that called us can bubble that back up into the VFS who will then
> >>> * immediately retry the aop call.
> >>> - *
> >>> - * We do a blocking lock and immediate unlock before returning, though, so that
> >>> - * the lock has a great chance of being cached on this node by the time the VFS
> >>> - * calls back to retry the aop. This has a potential to livelock as nodes
> >>> - * ping locks back and forth, but that's a risk we're willing to take to avoid
> >>> - * the lock inversion simply.
> >>> */
> >>> int ocfs2_inode_lock_with_page(struct inode *inode,
> >>> struct buffer_head **ret_bh,
> >>> @@ -2449,8 +2443,6 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
> >>> ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
> >>> if (ret == -EAGAIN) {
> >>> unlock_page(page);
> >>> - if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
> >>> - ocfs2_inode_unlock(inode, ex);
> >>> ret = AOP_TRUNCATED_PAGE;
> >>> }
> >>>
> >>> --
> >>> 2.6.2
> >>
> >>
> >> _______________________________________________
> >> Ocfs2-devel mailing list
> >> Ocfs2-devel at oss.oracle.com
> >> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> >>