[Ocfs2-users] Hardware error or ocfs2 error?

Sérgio Surkamp sergio at gruposinternet.com.br
Thu Apr 29 14:56:29 PDT 2010


At my job we use QLA FC boards, and we had some problems when we
migrated from FreeBSD to SLES (in mid-2007). We found two problems at
that time:

1. Old board firmware - fixed by updating it to the latest version;
2. The stock SLES10 qla2xxx driver was outdated - fixed by downloading
the driver from the vendor, compiling and installing it*.

* SLES10SP3 already ships the newer version as its stock driver. SLES10,
  SLES10SP1 and SLES10SP2 have outdated drivers.

I can't tell if you have the same problem, as I can't remember the
kernel error messages, but I remember the behavior was very weird:
sometimes everything worked perfectly and sometimes the board wasn't
even detected.

Can you tell us more about your environment?

Regards,
Sérgio

On Thu, 29 Apr 2010 12:56:38 +0200
Marco <bozzolan at gmail.com> wrote:

> Hello,
> 
>  today I noticed the following on *only* one node: 
> 
> ----- cut here -----
> Apr 29 11:01:18 node06 kernel: [2569440.616036] INFO: task ocfs2_wq:5214 blocked for more than 120 seconds.
> Apr 29 11:01:18 node06 kernel: [2569440.616056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Apr 29 11:01:18 node06 kernel: [2569440.616080] ocfs2_wq      D 0000000000000002     0  5214      2 0x00000000
> Apr 29 11:01:18 node06 kernel: [2569440.616101]  ffff88014fa63880 0000000000000046 ffffffffa01878a5 ffffffffa020f0fc
> Apr 29 11:01:18 node06 kernel: [2569440.616131]  0000000000000000 000000000000f8a0 ffff88014baebfd8 00000000000155c0
> Apr 29 11:01:18 node06 kernel: [2569440.616161]  00000000000155c0 ffff88014ca38e20 ffff88014ca39118 00000001a0187b86
> Apr 29 11:01:18 node06 kernel: [2569440.616192] Call Trace:
> Apr 29 11:01:18 node06 kernel: [2569440.616223]  [<ffffffffa01878a5>] ? scsi_done+0x0/0xc [scsi_mod]
> Apr 29 11:01:18 node06 kernel: [2569440.616245]  [<ffffffffa020f0fc>] ? qla2xxx_queuecommand+0x171/0x1de [qla2xxx]
> Apr 29 11:01:18 node06 kernel: [2569440.616273]  [<ffffffffa018d290>] ? scsi_request_fn+0x429/0x506 [scsi_mod]
> Apr 29 11:01:18 node06 kernel: [2569440.616291]  [<ffffffffa02ab0a7>] ? o2dlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_o2cb]
> Apr 29 11:01:18 node06 kernel: [2569440.616317]  [<ffffffffa02ab090>] ? o2dlm_lock_ast_wrapper+0x0/0x17 [ocfs2_stack_o2cb]
> Apr 29 11:01:18 node06 kernel: [2569440.616345]  [<ffffffff812ee253>] ? schedule_timeout+0x2e/0xdd
> Apr 29 11:01:18 node06 kernel: [2569440.616362]  [<ffffffff8118d99a>] ? vsnprintf+0x40a/0x449
> Apr 29 11:01:18 node06 kernel: [2569440.616378]  [<ffffffff812ee118>] ? wait_for_common+0xde/0x14f
> Apr 29 11:01:18 node06 kernel: [2569440.616396]  [<ffffffff8104a188>] ? default_wake_function+0x0/0x9
> Apr 29 11:01:18 node06 kernel: [2569440.616421]  [<ffffffffa0fbac46>] ? __ocfs2_cluster_lock+0x8a4/0x8c5 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616445]  [<ffffffff812ee517>] ? out_of_line_wait_on_bit+0x6b/0x77
> Apr 29 11:01:18 node06 kernel: [2569440.616468]  [<ffffffffa0fbe8ff>] ? ocfs2_inode_lock_full_nested+0x1a3/0xb2c [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616497]  [<ffffffffa0ffacc1>] ? ocfs2_lock_global_qf+0x28/0x81 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616519]  [<ffffffffa0ffacc1>] ? ocfs2_lock_global_qf+0x28/0x81 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616540]  [<ffffffffa0ffb3a3>] ? ocfs2_acquire_dquot+0x8d/0x105 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616557]  [<ffffffff812ee7b5>] ? mutex_lock+0xd/0x31
> Apr 29 11:01:18 node06 kernel: [2569440.616574]  [<ffffffff8112c2b2>] ? dqget+0x2ce/0x318
> Apr 29 11:01:18 node06 kernel: [2569440.616589]  [<ffffffff8112cbad>] ? dquot_initialize+0x51/0x115
> Apr 29 11:01:18 node06 kernel: [2569440.616611]  [<ffffffffa0fcaab8>] ? ocfs2_delete_inode+0x0/0x1640 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616630]  [<ffffffff810fee1f>] ? generic_delete_inode+0xd7/0x168
> Apr 29 11:01:18 node06 kernel: [2569440.616652]  [<ffffffffa0fca061>] ? ocfs2_drop_inode+0xc0/0x123 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616669]  [<ffffffff810fdfa8>] ? iput+0x27/0x60
> Apr 29 11:01:18 node06 kernel: [2569440.616689]  [<ffffffffa0fd0a8f>] ? ocfs2_complete_recovery+0x82b/0xa3f [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616715]  [<ffffffff8106144b>] ? worker_thread+0x188/0x21d
> Apr 29 11:01:18 node06 kernel: [2569440.616736]  [<ffffffffa0fd0264>] ? ocfs2_complete_recovery+0x0/0xa3f [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616761]  [<ffffffff81064a36>] ? autoremove_wake_function+0x0/0x2e
> Apr 29 11:01:18 node06 kernel: [2569440.616778]  [<ffffffff810612c3>] ? worker_thread+0x0/0x21d
> Apr 29 11:01:18 node06 kernel: [2569440.616793]  [<ffffffff81064769>] ? kthread+0x79/0x81
> Apr 29 11:01:18 node06 kernel: [2569440.616810]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
> Apr 29 11:01:18 node06 kernel: [2569440.616825]  [<ffffffff810646f0>] ? kthread+0x0/0x81
> Apr 29 11:01:18 node06 kernel: [2569440.616840]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
> ----- cut here -----
> 
>  On all the others I had the following:
> 
> ----- cut here -----
> Apr 29 11:00:23 node01 kernel: [2570880.752038] INFO: task o2quot/0:2971 blocked for more than 120 seconds.
> Apr 29 11:00:23 node01 kernel: [2570880.752059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Apr 29 11:00:23 node01 kernel: [2570880.752083] o2quot/0      D 0000000000000000     0  2971      2 0x00000000
> Apr 29 11:00:23 node01 kernel: [2570880.752104]  ffffffff814451f0 0000000000000046 0000000000000000 0000000000000002
> Apr 29 11:00:23 node01 kernel: [2570880.752134]  ffff880249e28d20 000000000000f8a0 ffff88024cda3fd8 00000000000155c0
> Apr 29 11:00:23 node01 kernel: [2570880.752164]  00000000000155c0 ffff88024ce4e9f0 ffff88024ce4ece8 000000004cda3a60
> Apr 29 11:00:23 node01 kernel: [2570880.752195] Call Trace:
> Apr 29 11:00:23 node01 kernel: [2570880.752214]  [<ffffffff812ee253>] ? schedule_timeout+0x2e/0xdd
> Apr 29 11:00:23 node01 kernel: [2570880.752233]  [<ffffffff8110baff>] ? __find_get_block+0x176/0x186
> Apr 29 11:00:23 node01 kernel: [2570880.752261]  [<ffffffffa04fd29c>] ? ocfs2_validate_quota_block+0x0/0x88 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752286]  [<ffffffff812ee118>] ? wait_for_common+0xde/0x14f
> Apr 29 11:00:23 node01 kernel: [2570880.752304]  [<ffffffff8104a188>] ? default_wake_function+0x0/0x9
> Apr 29 11:00:23 node01 kernel: [2570880.752326]  [<ffffffffa04bbc46>] ? __ocfs2_cluster_lock+0x8a4/0x8c5 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752351]  [<ffffffff81044e0e>] ? find_busiest_group+0x3af/0x874
> Apr 29 11:00:23 node01 kernel: [2570880.752373]  [<ffffffffa04bf8ff>] ? ocfs2_inode_lock_full_nested+0x1a3/0xb2c [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752402]  [<ffffffffa04fbcc1>] ? ocfs2_lock_global_qf+0x28/0x81 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752424]  [<ffffffffa04fbcc1>] ? ocfs2_lock_global_qf+0x28/0x81 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752446]  [<ffffffffa04fc8f8>] ? ocfs2_sync_dquot_helper+0xca/0x300 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752474]  [<ffffffffa04fc82e>] ? ocfs2_sync_dquot_helper+0x0/0x300 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752500]  [<ffffffff8112ce8e>] ? dquot_scan_active+0x78/0xd0
> Apr 29 11:00:23 node01 kernel: [2570880.752521]  [<ffffffffa04fbc2b>] ? qsync_work_fn+0x24/0x42 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752539]  [<ffffffff8106144b>] ? worker_thread+0x188/0x21d
> Apr 29 11:00:23 node01 kernel: [2570880.752559]  [<ffffffffa04fbc07>] ? qsync_work_fn+0x0/0x42 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752576]  [<ffffffff81064a36>] ? autoremove_wake_function+0x0/0x2e
> Apr 29 11:00:23 node01 kernel: [2570880.752593]  [<ffffffff810612c3>] ? worker_thread+0x0/0x21d
> Apr 29 11:00:23 node01 kernel: [2570880.752608]  [<ffffffff81064769>] ? kthread+0x79/0x81
> Apr 29 11:00:23 node01 kernel: [2570880.752625]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
> Apr 29 11:00:23 node01 kernel: [2570880.752640]  [<ffffffff810646f0>] ? kthread+0x0/0x81
> Apr 29 11:00:23 node01 kernel: [2570880.752655]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
> ----- cut here -----
> 
>  By looking at the timestamps it seems that o2quot got stuck before
> ocfs2_wq, but right now I can't guarantee that they are 100% exact...
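
To compare when each node reported a stuck task, the "blocked for more
than N seconds" lines can be pulled out of the logs and sorted. The
sketch below is one possible way to do that, not a definitive tool: the
function name, the year default and the field names are invented for
the example, and it assumes each log entry sits on a single line in the
usual syslog layout. Since the message only appears once a task has
been blocked for at least the timeout, "blocked_by" is the latest
moment the task can have started waiting, and it is only as good as the
clock on each node.

```python
import re
from datetime import datetime, timedelta

# Matches e.g.:
# Apr 29 11:00:23 node01 kernel: [2570880.752038] INFO: task o2quot/0:2971 blocked for more than 120 seconds.
HUNG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO: task (?P<task>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(lines, year=2010):
    """Return hung-task reports sorted by the wall-clock time they were logged.

    Syslog timestamps carry no year, so it has to be supplied (assumption).
    Leading '>' quote markers from mail clients are stripped before matching.
    """
    events = []
    for line in lines:
        m = HUNG_RE.match(line.lstrip("> ").strip())
        if m is None:
            continue
        reported = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        secs = int(m["secs"])
        events.append({
            "host": m["host"],
            "task": m["task"],
            "pid": int(m["pid"]),
            "uptime": float(m["uptime"]),
            "reported": reported,
            # Latest possible start of the wait: report time minus the timeout.
            "blocked_by": reported - timedelta(seconds=secs),
        })
    return sorted(events, key=lambda e: e["reported"])
```

Feeding it the two "INFO: task" lines from the excerpts above lists the
node01 report before the node06 one, which matches the ordering
described here, subject to the clock caveat.
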
> 
> Am I right in thinking that it was a hardware failure?
> 
> Best regards,
> 


-- 
  .:''''':.
.:'        `     Sérgio Surkamp | Network Manager
::    ........   sergio at gruposinternet.com.br
`:.        .:'
  `:,   ,.:'     *Grupos Internet S.A.*
    `: :'        R. Lauro Linhares, 2123 Torre B - Sala 201
     : :         Trindade - Florianópolis - SC
     :.'
     ::          +55 48 3234-4109
     :
     '           http://www.gruposinternet.com.br


