[Ocfs2-users] Hardware error or ocfs2 error?

Sunil Mushran sunil.mushran at oracle.com
Thu Apr 29 10:21:27 PDT 2010


Cannot say for sure. It could be a deadlock (bug) too. As in, I don't
want to blame any one entity without knowing more.

If it were up to me, I'd start with the dlm. See which node holds the lock
that the others are waiting on. Then see why that node is unable to downconvert
that lock. As in, if the lock has holders, try to determine the pids holding
that lock and see where they are stuck. On a mainline kernel you can do
"cat /proc/PID/stack" to look at the stack of a PID.

Marco wrote:
> Hello,
>
>  today I noticed the following on *only* one node: 
>
> ----- cut here -----
> Apr 29 11:01:18 node06 kernel: [2569440.616036] INFO: task ocfs2_wq:5214 blocked for more than 120 seconds.
> Apr 29 11:01:18 node06 kernel: [2569440.616056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Apr 29 11:01:18 node06 kernel: [2569440.616080] ocfs2_wq      D 0000000000000002     0  5214      2 0x00000000
> Apr 29 11:01:18 node06 kernel: [2569440.616101]  ffff88014fa63880 0000000000000046 ffffffffa01878a5 ffffffffa020f0fc
> Apr 29 11:01:18 node06 kernel: [2569440.616131]  0000000000000000 000000000000f8a0 ffff88014baebfd8 00000000000155c0
> Apr 29 11:01:18 node06 kernel: [2569440.616161]  00000000000155c0 ffff88014ca38e20 ffff88014ca39118 00000001a0187b86
> Apr 29 11:01:18 node06 kernel: [2569440.616192] Call Trace:
> Apr 29 11:01:18 node06 kernel: [2569440.616223]  [<ffffffffa01878a5>] ? scsi_done+0x0/0xc [scsi_mod]
> Apr 29 11:01:18 node06 kernel: [2569440.616245]  [<ffffffffa020f0fc>] ? qla2xxx_queuecommand+0x171/0x1de [qla2xxx]
> Apr 29 11:01:18 node06 kernel: [2569440.616273]  [<ffffffffa018d290>] ? scsi_request_fn+0x429/0x506 [scsi_mod]
> Apr 29 11:01:18 node06 kernel: [2569440.616291]  [<ffffffffa02ab0a7>] ? o2dlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_o2cb]
> Apr 29 11:01:18 node06 kernel: [2569440.616317]  [<ffffffffa02ab090>] ? o2dlm_lock_ast_wrapper+0x0/0x17 [ocfs2_stack_o2cb]
> Apr 29 11:01:18 node06 kernel: [2569440.616345]  [<ffffffff812ee253>] ? schedule_timeout+0x2e/0xdd
> Apr 29 11:01:18 node06 kernel: [2569440.616362]  [<ffffffff8118d99a>] ? vsnprintf+0x40a/0x449
> Apr 29 11:01:18 node06 kernel: [2569440.616378]  [<ffffffff812ee118>] ? wait_for_common+0xde/0x14f
> Apr 29 11:01:18 node06 kernel: [2569440.616396]  [<ffffffff8104a188>] ? default_wake_function+0x0/0x9
> Apr 29 11:01:18 node06 kernel: [2569440.616421]  [<ffffffffa0fbac46>] ? __ocfs2_cluster_lock+0x8a4/0x8c5 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616445]  [<ffffffff812ee517>] ? out_of_line_wait_on_bit+0x6b/0x77
> Apr 29 11:01:18 node06 kernel: [2569440.616468]  [<ffffffffa0fbe8ff>] ? ocfs2_inode_lock_full_nested+0x1a3/0xb2c [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616497]  [<ffffffffa0ffacc1>] ? ocfs2_lock_global_qf+0x28/0x81 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616519]  [<ffffffffa0ffacc1>] ? ocfs2_lock_global_qf+0x28/0x81 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616540]  [<ffffffffa0ffb3a3>] ? ocfs2_acquire_dquot+0x8d/0x105 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616557]  [<ffffffff812ee7b5>] ? mutex_lock+0xd/0x31
> Apr 29 11:01:18 node06 kernel: [2569440.616574]  [<ffffffff8112c2b2>] ? dqget+0x2ce/0x318
> Apr 29 11:01:18 node06 kernel: [2569440.616589]  [<ffffffff8112cbad>] ? dquot_initialize+0x51/0x115
> Apr 29 11:01:18 node06 kernel: [2569440.616611]  [<ffffffffa0fcaab8>] ? ocfs2_delete_inode+0x0/0x1640 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616630]  [<ffffffff810fee1f>] ? generic_delete_inode+0xd7/0x168
> Apr 29 11:01:18 node06 kernel: [2569440.616652]  [<ffffffffa0fca061>] ? ocfs2_drop_inode+0xc0/0x123 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616669]  [<ffffffff810fdfa8>] ? iput+0x27/0x60
> Apr 29 11:01:18 node06 kernel: [2569440.616689]  [<ffffffffa0fd0a8f>] ? ocfs2_complete_recovery+0x82b/0xa3f [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616715]  [<ffffffff8106144b>] ? worker_thread+0x188/0x21d
> Apr 29 11:01:18 node06 kernel: [2569440.616736]  [<ffffffffa0fd0264>] ? ocfs2_complete_recovery+0x0/0xa3f [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616761]  [<ffffffff81064a36>] ? autoremove_wake_function+0x0/0x2e
> Apr 29 11:01:18 node06 kernel: [2569440.616778]  [<ffffffff810612c3>] ? worker_thread+0x0/0x21d
> Apr 29 11:01:18 node06 kernel: [2569440.616793]  [<ffffffff81064769>] ? kthread+0x79/0x81
> Apr 29 11:01:18 node06 kernel: [2569440.616810]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
> Apr 29 11:01:18 node06 kernel: [2569440.616825]  [<ffffffff810646f0>] ? kthread+0x0/0x81
> Apr 29 11:01:18 node06 kernel: [2569440.616840]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
> ----- cut here -----
>
>  On all the others I had the following:
>
> ----- cut here -----
> Apr 29 11:00:23 node01 kernel: [2570880.752038] INFO: task o2quot/0:2971 blocked for more than 120 seconds.
> Apr 29 11:00:23 node01 kernel: [2570880.752059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Apr 29 11:00:23 node01 kernel: [2570880.752083] o2quot/0      D 0000000000000000     0  2971      2 0x00000000
> Apr 29 11:00:23 node01 kernel: [2570880.752104]  ffffffff814451f0 0000000000000046 0000000000000000 0000000000000002
> Apr 29 11:00:23 node01 kernel: [2570880.752134]  ffff880249e28d20 000000000000f8a0 ffff88024cda3fd8 00000000000155c0
> Apr 29 11:00:23 node01 kernel: [2570880.752164]  00000000000155c0 ffff88024ce4e9f0 ffff88024ce4ece8 000000004cda3a60
> Apr 29 11:00:23 node01 kernel: [2570880.752195] Call Trace:
> Apr 29 11:00:23 node01 kernel: [2570880.752214]  [<ffffffff812ee253>] ? schedule_timeout+0x2e/0xdd
> Apr 29 11:00:23 node01 kernel: [2570880.752233]  [<ffffffff8110baff>] ? __find_get_block+0x176/0x186
> Apr 29 11:00:23 node01 kernel: [2570880.752261]  [<ffffffffa04fd29c>] ? ocfs2_validate_quota_block+0x0/0x88 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752286]  [<ffffffff812ee118>] ? wait_for_common+0xde/0x14f
> Apr 29 11:00:23 node01 kernel: [2570880.752304]  [<ffffffff8104a188>] ? default_wake_function+0x0/0x9
> Apr 29 11:00:23 node01 kernel: [2570880.752326]  [<ffffffffa04bbc46>] ? __ocfs2_cluster_lock+0x8a4/0x8c5 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752351]  [<ffffffff81044e0e>] ? find_busiest_group+0x3af/0x874
> Apr 29 11:00:23 node01 kernel: [2570880.752373]  [<ffffffffa04bf8ff>] ? ocfs2_inode_lock_full_nested+0x1a3/0xb2c [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752402]  [<ffffffffa04fbcc1>] ? ocfs2_lock_global_qf+0x28/0x81 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752424]  [<ffffffffa04fbcc1>] ? ocfs2_lock_global_qf+0x28/0x81 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752446]  [<ffffffffa04fc8f8>] ? ocfs2_sync_dquot_helper+0xca/0x300 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752474]  [<ffffffffa04fc82e>] ? ocfs2_sync_dquot_helper+0x0/0x300 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752500]  [<ffffffff8112ce8e>] ? dquot_scan_active+0x78/0xd0
> Apr 29 11:00:23 node01 kernel: [2570880.752521]  [<ffffffffa04fbc2b>] ? qsync_work_fn+0x24/0x42 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752539]  [<ffffffff8106144b>] ? worker_thread+0x188/0x21d
> Apr 29 11:00:23 node01 kernel: [2570880.752559]  [<ffffffffa04fbc07>] ? qsync_work_fn+0x0/0x42 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752576]  [<ffffffff81064a36>] ? autoremove_wake_function+0x0/0x2e
> Apr 29 11:00:23 node01 kernel: [2570880.752593]  [<ffffffff810612c3>] ? worker_thread+0x0/0x21d
> Apr 29 11:00:23 node01 kernel: [2570880.752608]  [<ffffffff81064769>] ? kthread+0x79/0x81
> Apr 29 11:00:23 node01 kernel: [2570880.752625]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
> Apr 29 11:00:23 node01 kernel: [2570880.752640]  [<ffffffff810646f0>] ? kthread+0x0/0x81
> Apr 29 11:00:23 node01 kernel: [2570880.752655]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
> ----- cut here -----
>
>  Looking at the timestamps, it seems that o2quot got stuck before
> ocfs2_wq, but right now I can't guarantee that the timestamps are 100% exact...
>
> Am I right in thinking this was a hardware failure?
>
> Best regards,
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>   



