[Ocfs2-users] Deadlocks on OCFS2 file I/O with Ruby Application

Srinivas Eeda srinivas.eeda at oracle.com
Thu Jan 9 09:53:21 PST 2014


Can you please email a complete messages file with all process stacks 
(sysrq-t output)? I want to see how a process ended up going from the 
jbd2 layer into the o2net/ocfs2 layer. It might be due to memory 
pressure, but that should be avoided.
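For reference, a minimal sketch of capturing that sysrq-t output, assuming a standard syslog setup; root is required for the trigger, and the /tmp output path is just an example:

```shell
# Dump every task's stack ("sysrq-t") into the kernel log, then save it.
if [ -w /proc/sysrq-trigger ]; then
    echo 1 > /proc/sys/kernel/sysrq   # make sure the sysrq facility is enabled
    echo t > /proc/sysrq-trigger      # 't' asks the kernel to dump all task stacks
    dmesg > /tmp/sysrq-t.txt          # or collect it from /var/log/messages
else
    echo "need root (write access to /proc/sysrq-trigger)"
fi
```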

Was your system running low on memory when it happened?
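A quick way to check current memory headroom, and (if sysstat is collecting samples) the history around the time of the hang; the sar log file name below is hypothetical and depends on your distro's sysstat setup:

```shell
# Current memory headroom (any Linux with /proc mounted):
grep -E '^(MemTotal|MemFree|Cached|SwapTotal|SwapFree):' /proc/meminfo

# Historical view around the time of the hang, if sysstat is installed;
# the sa09 file name (keyed to the day of month) is just an example:
# sar -r -f /var/log/sa/sa09
```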


On 01/09/2014 08:34 AM, Smith, Darren (Academic) wrote:
>
> Nobody can help me out here?
>
> *From:*Smith, Darren (Academic)
> *Sent:* Thursday, December 19, 2013 4:25 PM
> *To:* 'ocfs2-users at oss.oracle.com'
> *Subject:* Deadlocks on OCFS2 file I/O with Ruby Application
>
> Hello:
>
> We are running a Ruby/Passenger app called Canvas (by Instructure).  
> It runs on three application servers that use shared storage (iSCSI, 
> EqualLogic volume, LVM2, OCFS2) for user-uploaded files.  We have been 
> experiencing strange behavior where a ruby and a jbd2 process will 
> randomly go into a 'D' (uninterruptible I/O wait) state during some 
> I/O operations on the shared storage (most commonly a file upload).  I 
> turned on some kernel debugging options on one of the application 
> servers and captured the following 'possible circular locking 
> dependency' report during one of these episodes.  Would anyone be able 
> to provide any insight into possible ways to troubleshoot this further?
>
> Our distro is Gentoo Linux, using hardened kernel sources v3.11.2 on 
> this particular app server.  We do not have GrSecurity or PaX 
> enabled.  The other app servers are running kernel 3.2.30; we were 
> testing the newer kernel on this server to see if it helped, which it 
> doesn't appear to have.
>
> Thanks for any help.
>
> [  858.570825] ======================================================
>
> [  858.570878] [ INFO: possible circular locking dependency detected ]
>
> [  858.570932] 3.11.2-hardenedcanvas-2013-12-16 #4 Not tainted
>
> [  858.570980] -------------------------------------------------------
>
> [  858.571033] ruby/3710 is trying to acquire lock:
>
> [  858.571073] (&sc->sc_send_lock){+.+...}, at: [<ffffffff81355567>] 
> o2net_send_message_vec+0x543/0x8dd
>
> [  858.571165]
>
> but task is already holding lock:
>
> [  858.571213]  (jbd2_handle){+.+...}, at: [<ffffffff8123986e>] 
> start_this_handle+0x4e8/0x57d
>
> [  858.571294]
>
> which lock already depends on the new lock.
>
> [  858.571361]
>
> the existing dependency chain (in reverse order) is:
>
> [  858.571423]
>
> -> #4 (jbd2_handle){+.+...}:
>
> [  858.571464] [<ffffffff810c761c>] lock_acquire+0xbf/0xfe
>
> [  858.571539] [<ffffffff812398e0>] start_this_handle+0x55a/0x57d
>
> [  858.571615] [<ffffffff81239b0c>] jbd2__journal_start+0x9f/0x15c
>
> [  858.571691] [<ffffffff8121feab>] __ext4_journal_start_sb+0x89/0xa6
>
> [  858.571771] [<ffffffff812005c1>] ext4_dirty_inode+0x25/0x54
>
> [  858.571844] [<ffffffff8117a338>] __mark_inode_dirty+0x54/0x210
>
> [  858.571920] [<ffffffff8116d0c7>] update_time+0x9c/0xa7
>
> [  858.571989] [<ffffffff8116d16e>] file_update_time+0x9c/0xbd
>
> [  858.572060] [<ffffffff8120077f>] ext4_page_mkwrite+0x50/0x2a1
>
> [  858.572133] [<ffffffff81130997>] __do_fault+0x125/0x2f5
>
> [  858.572203] [<ffffffff81132d9a>] handle_pte_fault+0x1d4/0x4f2
>
> [  858.572277] [<ffffffff81133459>] handle_mm_fault+0x1af/0x1c9
>
> [  858.572349] [<ffffffff81029a3c>] __do_page_fault+0x317/0x35e
>
> [  858.572423] [<ffffffff81029aab>] do_page_fault+0x9/0xb
>
> [  858.572491] [<ffffffff816e4e45>] page_fault+0x25/0x30
>
> [  858.572560]
>
> -> #3 (sb_pagefaults){.+.+..}:
>
> [  858.572643] [<ffffffff810c761c>] lock_acquire+0xbf/0xfe
>
> [  858.572712] [<ffffffff81159900>] __sb_start_write+0x111/0x161
>
> [  858.572786] [<ffffffff81200773>] ext4_page_mkwrite+0x44/0x2a1
>
> [  858.572860] [<ffffffff81130997>] __do_fault+0x125/0x2f5
>
> [  858.572928] [<ffffffff81132d9a>] handle_pte_fault+0x1d4/0x4f2
>
> [  858.573002] [<ffffffff81133459>] handle_mm_fault+0x1af/0x1c9
>
> [  858.573075] [<ffffffff81029a3c>] __do_page_fault+0x317/0x35e
>
> [  858.573148] [<ffffffff81029aab>] do_page_fault+0x9/0xb
>
> [  858.573216] [<ffffffff816e4e45>] page_fault+0x25/0x30
>
> [  858.573284]
>
> -> #2 (&mm->mmap_sem){++++++}:
>
> [  858.573361] [<ffffffff810c761c>] lock_acquire+0xbf/0xfe
>
> [  858.573430] [<ffffffff81130bc9>] might_fault+0x62/0x85
>
> [  858.573498] [<ffffffff8163e41b>] skb_do_copy_data_nocache+0x155/0x168
>
> [  858.573578] [<ffffffff8163f72a>] tcp_sendmsg+0x398/0x7e1
>
> [  858.573648] [<ffffffff8165ee8b>] inet_sendmsg+0x5e/0x67
>
> [  858.573718] [<ffffffff815f76d3>] sock_aio_write+0x101/0x111
>
> [  858.573790] [<ffffffff81157544>] do_sync_write+0x5d/0x89
>
> [  858.573860] [<ffffffff81158035>] vfs_write+0x105/0x14c
>
> [  858.573929] [<ffffffff8115832d>] SyS_write+0x57/0x83
>
> [  858.573996] [<ffffffff816e5424>] system_call_fastpath+0x1c/0x21
>
> [  858.574072]
>
> -> #1 (sk_lock-AF_INET){+.+.+.}:
>
> [  858.574150] [<ffffffff810c761c>] lock_acquire+0xbf/0xfe
>
> [  858.574219] [<ffffffff815fc22e>] lock_sock_nested+0x69/0x79
>
> [  858.574291] [<ffffffff8163d8bc>] lock_sock+0xb/0xd
>
> [  858.574356] [<ffffffff8163efe4>] tcp_sendpage+0x55/0x3da
>
> [  858.574426] [<ffffffff8165ef7a>] inet_sendpage+0x82/0x97
>
> [  858.574496] [<ffffffff81354b02>] o2net_sendpage+0xae/0x21b
>
> [  858.574568] [<ffffffff81356d71>] o2net_accept_many+0x383/0x40a
>
> [  858.574642] [<ffffffff81099dc6>] process_one_work+0x1e9/0x358
>
> [  858.574716] [<ffffffff8109a31f>] worker_thread+0x159/0x1ee
>
> [  858.574787] [<ffffffff810a02b1>] kthread+0xac/0xb4
>
> [  858.574853] [<ffffffff816e5372>] ret_from_fork+0x72/0xa0
>
> [  858.574924]
>
> -> #0 (&sc->sc_send_lock){+.+...}:
>
> [  858.575003] [<ffffffff810c6e3c>] __lock_acquire+0xb65/0xe65
>
> [  858.575077] [<ffffffff810c761c>] lock_acquire+0xbf/0xfe
>
> [  858.575146] [<ffffffff816e1732>] mutex_lock_nested+0x3f/0x353
>
> [  858.575219] [<ffffffff81355567>] o2net_send_message_vec+0x543/0x8dd
>
> [  858.575296] [<ffffffff81355925>] o2net_send_message+0x24/0x26
>
> [  858.575369] [<ffffffff81369390>] dlm_do_master_request+0xca/0x630
>
> [  858.575446] [<ffffffff8136c727>] dlm_get_lock_resource+0xa58/0xe1d
>
> [  858.575523] [<ffffffff813747d7>] dlmlock+0x865/0x11d3
>
> [  858.575593] [<ffffffff81349ad8>] o2cb_dlm_lock+0x59/0x69
>
> [  858.575665] [<ffffffff81349029>] ocfs2_dlm_lock+0x28/0x2a
>
> [  858.575736] [<ffffffff812ecef9>] ocfs2_lock_create+0xfa/0x270
>
> [  858.575811] [<ffffffff812ed68d>] ocfs2_create_new_lock+0x7a/0x89
>
> [  858.575887] [<ffffffff812ed81d>] 
> ocfs2_create_new_inode_locks+0x181/0x2ae
>
> [  858.575988] [<ffffffff813068fc>] 
> __ocfs2_mknod_locked.isra.21+0x3b1/0x4d4
>
> [  858.576088] [<ffffffff81306b39>] ocfs2_mknod_locked.isra.22+0x11a/0x12d
>
> [  858.576186] [<ffffffff813072d2>] ocfs2_mknod+0x786/0xe09
>
> [  858.576257] [<ffffffff813079d2>] ocfs2_mkdir+0x7d/0x11f
>
> [  858.576326] [<ffffffff81162ba7>] vfs_mkdir+0x5d/0x9e
>
> [  858.576393] [<ffffffff811650f9>] SyS_mkdirat+0x8f/0xe2
>
> [  858.576462] [<ffffffff81165160>] SyS_mkdir+0x14/0x16
>
> [  858.576528] [<ffffffff816e5424>] system_call_fastpath+0x1c/0x21
>
> [  858.576604]
>
> other info that might help us debug this:
>
> [  858.576721] Chain exists of:
>
>   &sc->sc_send_lock --> sb_pagefaults --> jbd2_handle
>
> [  858.576847]  Possible unsafe locking scenario:
>
> [  858.576931] CPU0                    CPU1
>
> [  858.576986] ----                    ----
>
> [  858.577042]   lock(jbd2_handle);
>
> [  858.577090]                                lock(sb_pagefaults);
>
> [  858.577161]                                lock(jbd2_handle);
>
> [  858.577229]   lock(&sc->sc_send_lock);
>
> [  858.577282]
>
> *** DEADLOCK ***
>
> [  858.577383] 11 locks held by ruby/3710:
>
> [  858.577432]  #0: (sb_writers#11){.+.+.+}, at: [<ffffffff811714f9>] 
> mnt_want_write+0x1f/0x46
>
> [  858.577551]  #1: (&type->i_mutex_dir_key#3/1){+.+.+.}, at: 
> [<ffffffff8116279e>] kern_path_create+0x7f/0x14a
>
> [  858.577681]  #2:  (Meta){++++++}, at: [<ffffffff81306c68>] 
> ocfs2_mknod+0x11c/0xe09
>
> [  858.577789]  #3: 
> (&ocfs2_sysfile_lock_key[args->fi_sysfile_type]#3){+.+.+.}, at: 
> [<ffffffff8131e853>] ocfs2_reserve_suballoc_bits+0xc3/0x12ce
>
> [  858.577942]  #4: (inode_alloc:%04d){+.+.+.}, at: 
> [<ffffffff8131e869>] ocfs2_reserve_suballoc_bits+0xd9/0x12ce
>
> [  858.578071]  #5: 
> (&ocfs2_sysfile_lock_key[args->fi_sysfile_type]#4){+.+.+.}, at: 
> [<ffffffff8131e853>] ocfs2_reserve_suballoc_bits+0xc3/0x12ce
>
> [  858.578225]  #6: (extent_alloc:%04d){+.+.+.}, at: 
> [<ffffffff8131e869>] ocfs2_reserve_suballoc_bits+0xd9/0x12ce
>
> [  858.578354]  #7: 
> (&ocfs2_sysfile_lock_key[args->fi_sysfile_type]#5){+.+...}, at: 
> [<ffffffff8130162c>] ocfs2_reserve_local_alloc_bits+0xad/0xc93
>
> [  858.578509]  #8: (sb_internal#2){.+.+..}, at: [<ffffffff812fcb30>] 
> ocfs2_start_trans+0xa7/0x1be
>
> [  858.578632]  #9: (&journal->j_trans_barrier){.+.+..}, at: 
> [<ffffffff812fcb40>] ocfs2_start_trans+0xb7/0x1be
>
> [  858.578760]  #10: (jbd2_handle){+.+...}, at: [<ffffffff8123986e>] 
> start_this_handle+0x4e8/0x57d
>
> [  858.578880]
>
> stack backtrace:
>
> [  858.578952] CPU: 3 PID: 3710 Comm: ruby Not tainted 
> 3.11.2-hardenedcanvas-2013-12-16 #4
>
> [  858.579054] Hardware name: VMware, Inc. VMware Virtual 
> Platform/440BX Desktop Reference Platform, BIOS 6.00 07/09/2012
>
> [  858.579178]  ffffffff85305fa0 ffff880516e27348 ffffffff816dd690 
> 0000000000000006
>
> [  858.579283]  ffffffff8531ace0 ffff880516e27398 ffffffff816d9328 
> 0000000000000002
>
> [  858.579387]  ffffffff856d9850 ffff8804ffa8a190 ffff8804ffa89600 
> ffff8804ffa8a190
>
> [  858.579492] Call Trace:
>
> [  858.580924]  [<ffffffff816dd690>] dump_stack+0x46/0x58
>
> [  858.580994]  [<ffffffff816d9328>] print_circular_bug+0x1f8/0x209
>
> [  858.581071]  [<ffffffff810c6e3c>] __lock_acquire+0xb65/0xe65
>
> [  858.581145]  [<ffffffff810c60e4>] ? mark_lock+0x2d/0x220
>
> [  858.581215]  [<ffffffff81355567>] ? o2net_send_message_vec+0x543/0x8dd
>
> [  858.581296]  [<ffffffff810c761c>] lock_acquire+0xbf/0xfe
>
> [  858.581366]  [<ffffffff81355567>] ? o2net_send_message_vec+0x543/0x8dd
>
> [  858.581447]  [<ffffffff810c7c21>] ? 
> debug_check_no_locks_freed+0x132/0x147
>
> [  858.581531]  [<ffffffff81355567>] ? o2net_send_message_vec+0x543/0x8dd
>
> [  858.581612]  [<ffffffff816e1732>] mutex_lock_nested+0x3f/0x353
>
> [  858.581687]  [<ffffffff81355567>] ? o2net_send_message_vec+0x543/0x8dd
>
> [  858.581770]  [<ffffffff810c45bd>] ? lockdep_init_map+0xc2/0x147
>
> [  858.581848]  [<ffffffff81009715>] ? paravirt_read_tsc+0x9/0xd
>
> [  858.581922]  [<ffffffff810097bc>] ? read_tsc+0x9/0x19
>
> [  858.581991]  [<ffffffff810bac88>] ? 
> timekeeping_get_ns.constprop.8+0x13/0x3a
>
> [  858.582077]  [<ffffffff81355567>] o2net_send_message_vec+0x543/0x8dd
>
> [  858.582163]  [<ffffffff810c60e4>] ? mark_lock+0x2d/0x220
>
> [  858.582233]  [<ffffffff810c792d>] ? mark_held_locks+0x6d/0x95
>
> [  858.582309]  [<ffffffff810f2470>] ? rcu_irq_exit+0x7c/0x90
>
> [  858.582381]  [<ffffffff81355925>] o2net_send_message+0x24/0x26
>
> [  858.582457]  [<ffffffff81369390>] dlm_do_master_request+0xca/0x630
>
> [  858.582536]  [<ffffffff81009a93>] ? sched_clock+0x9/0xd
>
> [  858.582606]  [<ffffffff8136c727>] dlm_get_lock_resource+0xa58/0xe1d
>
> [  858.582684]  [<ffffffff810c7539>] ? lock_release+0x1ad/0x1d1
>
> [  858.582758]  [<ffffffff813d0718>] ? do_raw_spin_unlock+0x8f/0x98
>
> [  858.582835]  [<ffffffff81373ee0>] ? dlm_new_lock+0xbc/0x14e
>
> [  858.582908]  [<ffffffff813495fc>] ? 
> ocfs2_cluster_connect_agnostic+0x3b/0x3b
>
> [  858.582993]  [<ffffffff813747d7>] dlmlock+0x865/0x11d3
>
> [  858.583062]  [<ffffffff8134960d>] ? o2dlm_lock_ast_wrapper+0x11/0x11
>
> [  858.583142]  [<ffffffff810c60e4>] ? mark_lock+0x2d/0x220
>
> [  858.583211]  [<ffffffff81349ad8>] o2cb_dlm_lock+0x59/0x69
>
> [  858.583282]  [<ffffffff813495fc>] ? 
> ocfs2_cluster_connect_agnostic+0x3b/0x3b
>
> [  858.583369]  [<ffffffff8134960d>] ? o2dlm_lock_ast_wrapper+0x11/0x11
>
> [  858.583448]  [<ffffffff810c7aa9>] ? 
> trace_hardirqs_on_caller+0x154/0x18b
>
> [  858.583531]  [<ffffffff81349029>] ocfs2_dlm_lock+0x28/0x2a
>
> [  858.583603]  [<ffffffff812ecef9>] ocfs2_lock_create+0xfa/0x270
>
> [  858.583679]  [<ffffffff816e47be>] ? 
> _raw_spin_unlock_irqrestore+0x40/0x4d
>
> [  858.583763]  [<ffffffff810c7aa9>] ? 
> trace_hardirqs_on_caller+0x154/0x18b
>
> [  858.583846]  [<ffffffff812ed68d>] ocfs2_create_new_lock+0x7a/0x89
>
> [  858.583924]  [<ffffffff812ed81d>] 
> ocfs2_create_new_inode_locks+0x181/0x2ae
>
> [  858.584010]  [<ffffffff813068fc>] 
> __ocfs2_mknod_locked.isra.21+0x3b1/0x4d4
>
> [  858.584095]  [<ffffffff81306b39>] 
> ocfs2_mknod_locked.isra.22+0x11a/0x12d
>
> [  858.584178]  [<ffffffff813072d2>] ocfs2_mknod+0x786/0xe09
>
> [  858.584249]  [<ffffffff813079d2>] ocfs2_mkdir+0x7d/0x11f
>
> [  858.584319]  [<ffffffff81162ba7>] vfs_mkdir+0x5d/0x9e
>
> [  858.584386]  [<ffffffff811650f9>] SyS_mkdirat+0x8f/0xe2
>
> [  858.584456]  [<ffffffff81165160>] SyS_mkdir+0x14/0x16
>
> [  858.584524]  [<ffffffff816e5424>] system_call_fastpath+0x1c/0x21
>
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-users
