[Ocfs2-users] huge "something" problem

Alexandre Racine Alexandre.Racine at mhicc.org
Mon Jun 2 11:24:55 PDT 2008


Hi Sunil,

> What about the other node?

I also found the trace below in the /var/log/messages file on the SRV1
server, which is the master SGE (Sun Grid Engine) server of the
cluster. That server rebooted when the error occurred; I don't know if
it is related.

> Do this by hand on both nodes.
> 1. debugfs.ocfs2 -R "fs_locks" /dev/sdX >/tmp/locks.out
> 2. grep -i busy /tmp/locks.out

That gives no results, although the /tmp/locks.out file itself has
200,000+ lines...
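
(For anyone who wants to repeat the check, the two steps are easy to
script; a rough sketch only -- the node names srv1/srv2 and the device
/dev/sdb1 are placeholders for the actual setup:)

    #!/bin/sh
    # Dump the fs lock state on each node and flag any "Busy" lockres.
    # Assumes passwordless ssh to both nodes; node names and the
    # device path are placeholders -- adjust to your cluster.
    for node in srv1 srv2; do
        ssh "$node" 'debugfs.ocfs2 -R "fs_locks" /dev/sdb1' > "/tmp/locks.$node.out"
        if grep -qi busy "/tmp/locks.$node.out"; then
            echo "$node: Busy lockres found"
        else
            echo "$node: no Busy lockres"
        fi
    done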



-------- /var/log/messages
May 30 21:01:44 SRV1 Assertion failure in journal_dirty_metadata() at
fs/jbd/transaction.c:1161: "jh->b_next_transaction == transaction"
May 30 21:01:44 SRV1 ------------[ cut here ]------------
May 30 21:01:44 SRV1 kernel BUG at fs/jbd/transaction.c:1161!
May 30 21:01:44 SRV1 invalid opcode: 0000 [1] SMP
May 30 21:01:44 SRV1 CPU 6
May 30 21:01:44 SRV1 Modules linked in: ocfs2_dlmfs ocfs2 ocfs2_dlm
ocfs2_nodemanager configfs iscsi_tcp libiscsi scsi_transport_iscsi
May 30 21:01:44 SRV1 Pid: 29383, comm: alohomora_m Not tainted
2.6.23-gentoo-r8 #3
May 30 21:01:44 SRV1 RIP: 0010:[<ffffffff802fc2cf>]
[<ffffffff802fc2cf>] journal_dirty_metadata+0x143/0x1be
May 30 21:01:44 SRV1 RSP: 0018:ffff81069b3f3918  EFLAGS: 00010292
May 30 21:01:44 SRV1 RAX: 0000000000000077 RBX: ffff8102c85a98e8 RCX:
0000000000000046
May 30 21:01:44 SRV1 RDX: 0000000000000000 RSI: 0000000000000096 RDI:
ffffffff807acd60
May 30 21:01:44 SRV1 RBP: ffff8102bf8c4b50 R08: ffffffff807acd68 R09:
0000000000000001
May 30 21:01:44 SRV1 R10: ffff81069b3f388e R11: ffffffff80264dff R12:
ffff81054401f5a0
May 30 21:01:44 SRV1 R13: ffff8107e2645a00 R14: ffff810700c21540 R15:
ffff81069b3f3a48
May 30 21:01:44 SRV1 FS:  0000000000000000(0000)
GS:ffff8107e2570140(0063) knlGS:00000000f7d468d0
May 30 21:01:44 SRV1 CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
May 30 21:01:44 SRV1 CR2: 000000000a29c004 CR3: 000000034a521000 CR4:
00000000000006e0
May 30 21:01:44 SRV1 DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
May 30 21:01:44 SRV1 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
May 30 21:01:44 SRV1 Process alohomora_m (pid: 29383, threadinfo
ffff81069b3f2000, task ffff81047b0b27d0)
May 30 21:01:44 SRV1 Stack:  ffff8107c6582630 ffff8102c85a98e8
ffff81054401f5a0 ffff81020b335000
May 30 21:01:44 SRV1 ffff8107c6582630 ffffffff880aa2f3 ffff81020b335000
ffff8107bd23be40
May 30 21:01:44 SRV1 ffff810700c21540 ffffffff8808a73b ffff8102521e8780
ffff81069b3f3a28
May 30 21:01:44 SRV1 Call Trace:
May 30 21:01:44 SRV1 [<ffffffff880aa2f3>]
:ocfs2:ocfs2_journal_dirty+0x6a/0x11d
May 30 21:01:44 SRV1 [<ffffffff8808a73b>]
:ocfs2:ocfs2_do_insert_extent+0x905/0xb43
May 30 21:01:44 SRV1 [<ffffffff8808ecd3>]
:ocfs2:ocfs2_insert_extent+0x600/0x70d
May 30 21:01:44 SRV1 [<ffffffff8809ec09>]
:ocfs2:ocfs2_do_extend_allocation+0x370/0x54a
May 30 21:01:44 SRV1 [<ffffffff880907eb>]
:ocfs2:ocfs2_write_begin_nolock+0x96b/0x1222
May 30 21:01:44 SRV1 [<ffffffff880928a9>]
:ocfs2:ocfs2_write_begin+0x1a9/0x274
May 30 21:01:44 SRV1 [<ffffffff880a41a4>]
:ocfs2:ocfs2_file_aio_write+0x65f/0xa2f
May 30 21:01:44 SRV1 [<ffffffff880968f9>]
:ocfs2:ocfs2_should_refresh_lock_res+0x84/0x1ba
May 30 21:01:44 SRV1 [<ffffffff80264dff>] zone_statistics+0x3f/0x60
May 30 21:01:44 SRV1 [<ffffffff8025f5eb>] __alloc_pages+0x60/0x2c2
May 30 21:01:44 SRV1 [<ffffffff8027f908>] do_sync_write+0xc9/0x10c
May 30 21:01:44 SRV1 [<ffffffff80247a26>]
autoremove_wake_function+0x0/0x2e
May 30 21:01:44 SRV1 [<ffffffff80280048>] vfs_write+0xad/0x136
May 30 21:01:44 SRV1 [<ffffffff80280585>] sys_write+0x45/0x6e
May 30 21:01:44 SRV1 [<ffffffff80224e70>] sysenter_do_call+0x1b/0x67
May 30 21:01:44 SRV1 [<ffffffff880aa3a6>]
:ocfs2:ocfs2_journal_dirty_data+0x0/0x7e

May 30 21:01:44 SRV1 Code: 0f 0b eb fe 48 83 7d 18 00 74 2c 49 c7 c0 eb
8e 71 80 b9 90
May 30 21:01:44 SRV1 RIP  [<ffffffff802fc2cf>]
journal_dirty_metadata+0x143/0x1be
May 30 21:01:44 SRV1 RSP <ffff81069b3f3918>
May 30 21:01:55 SRV1 BUG: soft lockup - CPU#1 stuck for 11s!
[kjournald:6953]
May 30 21:01:55 SRV1 CPU 1:
May 30 21:01:55 SRV1 Modules linked in: ocfs2_dlmfs ocfs2 ocfs2_dlm
ocfs2_nodemanager configfs iscsi_tcp libiscsi scsi_transport_iscsi
May 30 21:01:55 SRV1 Pid: 6953, comm: kjournald Tainted: G      D
2.6.23-gentoo-r8 #3
May 30 21:01:55 SRV1 RIP: 0010:[<ffffffff80300f63>]
[<ffffffff80300f63>] journal_write_metadata_buffer+0x68/0x319
May 30 21:01:55 SRV1 RSP: 0018:ffff8107d8fbdde0  EFLAGS: 00000206
May 30 21:01:55 SRV1 RAX: 0000000000398029 RBX: ffff8107e2645a00 RCX:
00000000000007e0
May 30 21:01:55 SRV1 RDX: 00000000000007e1 RSI: ffffffff808c3480 RDI:
ffff8101fd2afa88
May 30 21:01:55 SRV1 RBP: 0000000000000000 R08: ffff8107da27c518 R09:
ffff8107da27c520
May 30 21:01:55 SRV1 R10: ffff8101fd2af9b8 R11: ffffffff88092288 R12:
000000000000af45
May 30 21:01:55 SRV1 R13: 000000000000af45 R14: ffff8107da27c530 R15:
0000000000000000
May 30 21:01:55 SRV1 FS:  0000000000000000(0000)
GS:ffff8107e24d3a40(0000) knlGS:0000000000000000
May 30 21:01:55 SRV1 CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
May 30 21:01:55 SRV1 CR2: 000000000a2e202c CR3: 000000034c7d6000 CR4:
00000000000006e0
May 30 21:01:55 SRV1 DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
May 30 21:01:55 SRV1 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
May 30 21:01:55 SRV1
May 30 21:01:55 SRV1 Call Trace:
May 30 21:01:55 SRV1 [<ffffffff8030010d>] journal_bmap+0x22/0x7a
May 30 21:01:55 SRV1 [<ffffffff802fcf83>]
journal_commit_transaction+0x795/0x1016
May 30 21:01:55 SRV1 [<ffffffff80247a26>]
autoremove_wake_function+0x0/0x2e
May 30 21:01:55 SRV1 [<ffffffff8030066f>] kjournald+0xb9/0x212
May 30 21:01:55 SRV1 [<ffffffff80247a26>]
autoremove_wake_function+0x0/0x2e
May 30 21:01:55 SRV1 [<ffffffff803005b6>] kjournald+0x0/0x212
May 30 21:01:55 SRV1 [<ffffffff80247908>] kthread+0x47/0x73
May 30 21:01:55 SRV1 [<ffffffff8020c188>] child_rip+0xa/0x12
May 30 21:01:55 SRV1 [<ffffffff802478c1>] kthread+0x0/0x73
May 30 21:01:55 SRV1 [<ffffffff8020c17e>] child_rip+0x0/0x12
May 30 21:01:55 SRV1
May 30 21:02:07 SRV1 BUG: soft lockup - CPU#1 stuck for 11s!
[kjournald:6953]
May 30 21:02:07 SRV1 CPU 1:
May 30 21:02:07 SRV1 Modules linked in: ocfs2_dlmfs ocfs2 ocfs2_dlm
ocfs2_nodemanager configfs iscsi_tcp libiscsi scsi_transport_iscsi
May 30 21:02:07 SRV1 Pid: 6953, comm: kjournald Tainted: G      D
2.6.23-gentoo-r8 #3
May 30 21:02:07 SRV1 RIP: 0010:[<ffffffff80300f63>]
[<ffffffff80300f63>] journal_write_metadata_buffer+0x68/0x319
May 30 21:02:07 SRV1 RSP: 0018:ffff8107d8fbdde0  EFLAGS: 00000206
May 30 21:02:07 SRV1 RAX: 0000000000398029 RBX: ffff8107e2645a00 RCX:
00000000000007e0
May 30 21:02:07 SRV1 RDX: 00000000000007e1 RSI: ffffffff808c3480 RDI:
ffff8101fd2afa88
May 30 21:02:07 SRV1 RBP: 0000000000000000 R08: ffff8107da27c518 R09:
ffff8107da27c520
May 30 21:02:07 SRV1 R10: ffff8101fd2af9b8 R11: ffffffff88092288 R12:
000000000000af45
May 30 21:02:07 SRV1 R13: 000000000000af45 R14: ffff8107da27c530 R15:
0000000000000000
May 30 21:02:07 SRV1 FS:  0000000000000000(0000)
GS:ffff8107e24d3a40(0000) knlGS:0000000000000000
May 30 21:02:07 SRV1 CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
May 30 21:02:07 SRV1 CR2: 000000000a2e202c CR3: 000000034c7d6000 CR4:
00000000000006e0
May 30 21:02:07 SRV1 DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
May 30 21:02:07 SRV1 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400

May 30 21:02:07 SRV1 Call Trace:
May 30 21:02:07 SRV1 [<ffffffff8030010d>] journal_bmap+0x22/0x7a
May 30 21:02:07 SRV1 [<ffffffff802fcf83>]
journal_commit_transaction+0x795/0x1016
May 30 21:02:07 SRV1 [<ffffffff80247a26>]
autoremove_wake_function+0x0/0x2e
May 30 21:02:07 SRV1 [<ffffffff8030066f>] kjournald+0xb9/0x212
May 30 21:02:07 SRV1 [<ffffffff80247a26>]
autoremove_wake_function+0x0/0x2e
May 30 21:02:07 SRV1 [<ffffffff803005b6>] kjournald+0x0/0x212
May 30 21:02:07 SRV1 [<ffffffff80247908>] kthread+0x47/0x73
May 30 21:02:07 SRV1 [<ffffffff8020c188>] child_rip+0xa/0x12
May 30 21:02:07 SRV1 [<ffffffff802478c1>] kthread+0x0/0x73
May 30 21:02:07 SRV1 [<ffffffff8020c17e>] child_rip+0x0/0x12
May 30 21:02:07 SRV1
May 30 21:02:19 SRV1 BUG: soft lockup - CPU#1 stuck for 11s!
[kjournald:6953]
May 30 21:02:19 SRV1 CPU 1:
May 30 21:02:19 SRV1 Modules linked in: ocfs2_dlmfs ocfs2 ocfs2_dlm
ocfs2_nodemanager configfs iscsi_tcp libiscsi scsi_transport_iscsi
May 30 21:02:19 SRV1 Pid: 6953, comm: kjournald Tainted: G      D
2.6.23-gentoo-r8 #3
May 30 21:02:19 SRV1 RIP: 0010:[<ffffffff80300f63>]
[<ffffffff80300f63>] journal_write_metadata_buffer+0x68/0x319
May 30 21:02:19 SRV1 RSP: 0018:ffff8107d8fbdde0  EFLAGS: 00000206
May 30 21:02:19 SRV1 RAX: 0000000000398029 RBX: ffff8107e2645a00 RCX:
00000000000007e0
May 30 21:02:19 SRV1 RDX: 00000000000007e1 RSI: ffffffff808c3480 RDI:
ffff8101fd2afa88
May 30 21:02:19 SRV1 RBP: 0000000000000000 R08: ffff8107da27c518 R09:
ffff8107da27c520
May 30 21:02:19 SRV1 R10: ffff8101fd2af9b8 R11: ffffffff88092288 R12:
000000000000af45
May 30 21:02:19 SRV1 R13: 000000000000af45 R14: ffff8107da27c530 R15:
0000000000000000
May 30 21:02:19 SRV1 FS:  0000000000000000(0000)
GS:ffff8107e24d3a40(0000) knlGS:0000000000000000
May 30 21:02:19 SRV1 CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
May 30 21:02:19 SRV1 CR2: 000000000a2e202c CR3: 000000034c7d6000 CR4:
00000000000006e0
May 30 21:02:19 SRV1 DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
May 30 21:02:19 SRV1 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
May 30 21:02:19 SRV1
May 30 21:02:19 SRV1 Call Trace:
May 30 21:02:19 SRV1 [<ffffffff8030010d>] journal_bmap+0x22/0x7a
May 30 21:02:19 SRV1 [<ffffffff802fcf83>]
journal_commit_transaction+0x795/0x1016
May 30 21:02:19 SRV1 [<ffffffff80247a26>]
autoremove_wake_function+0x0/0x2e
May 30 21:02:19 SRV1 [<ffffffff8030066f>] kjournald+0xb9/0x212
May 30 21:02:19 SRV1 [<ffffffff80247a26>]
autoremove_wake_function+0x0/0x2e
May 30 21:02:19 SRV1 [<ffffffff803005b6>] kjournald+0x0/0x212
May 30 21:02:19 SRV1 [<ffffffff80247908>] kthread+0x47/0x73
May 30 21:02:19 SRV1 [<ffffffff8020c188>] child_rip+0xa/0x12
May 30 21:02:19 SRV1 [<ffffffff802478c1>] kthread+0x0/0x73
May 30 21:02:19 SRV1 [<ffffffff8020c17e>] child_rip+0x0/0x12


Alexandre Racine
alexandre.racine at mhicc.org
514-461-1300, ext. 3303


> -----Original Message-----
> From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
> Sent: June 2, 2008 13:51
> To: Alexandre Racine
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] huge "something" problem
> 
> What about the other node?
> 
> The function indicates that it is waiting on the o2dlm. If so,
> scanlocks should show a "Busy" lockres.
> 
> Do this by hand on both nodes.
> 1. debugfs.ocfs2 -R "fs_locks" /dev/sdX >/tmp/locks.out
> 2. grep -i busy /tmp/locks.out
> 
> Best log a bugzilla with all this information. Even if we cannot
> resolve this right now, the information will help us in the future.
> 
> Repeat the same on both nodes.
> 
> The one so-called bummer is that 2.6.23 does not have any dlm
> debugging. It will be available from 2.6.26 onwards. Meaning,
> even if we get a lockres, we will not be able to dump the state
> in 2.6.23.
> 
> Alexandre Racine wrote:
> > Hi Sunil,
> >
> > Below you will find the result of the command "$ ps -e -o
> > pid,stat,comm,wchan=WIDE-WCHAN-COLUMN", run 6 times with roughly a
> > 10 sec. delay between runs.
> >
> > Now, from my point of view, I have a load of 8.00 and nfsd is
> > running 8 instances, which are all in the "D" state. I am kind of
> > making connections here ;) What does "ocfs2_wait_for_mask" mean?
> > Are there known problems with OCFS2 and NFS? (NFS is not exactly a
> > perfect FS.)
> >
> > Thanks
> >
> >
> >
> >   PID STAT COMMAND         WIDE-WCHAN-COLUMN
> >     1 Ss   init            -
> >     2 S<   kthreadd        kthreadd
> >     3 S<   migration/0     migration_thread
> >     4 S<   ksoftirqd/0     ksoftirqd
> >     5 S<   watchdog/0      watchdog
> >     6 S<   migration/1     migration_thread
> >     7 S<   ksoftirqd/1     ksoftirqd
> >     8 S<   watchdog/1      watchdog
> >     9 S<   migration/2     migration_thread
> >    10 S<   ksoftirqd/2     ksoftirqd
> >    11 S<   watchdog/2      watchdog
> >    12 S<   migration/3     migration_thread
> >    13 S<   ksoftirqd/3     ksoftirqd
> >    14 S<   watchdog/3      watchdog
> >    15 S<   events/0        worker_thread
> >    16 S<   events/1        worker_thread
> >    17 S<   events/2        worker_thread
> >    18 S<   events/3        worker_thread
> >    19 S<   khelper         worker_thread
> >   105 S<   kblockd/0       worker_thread
> >   106 S<   kblockd/1       worker_thread
> >   107 S<   kblockd/2       worker_thread
> >   108 S<   kblockd/3       worker_thread
> >   112 S<   kacpid          worker_thread
> >   113 S<   kacpi_notify    worker_thread
> >   240 S<   ata/0           worker_thread
> >   241 S<   ata/1           worker_thread
> >   242 S<   ata/2           worker_thread
> >   243 S<   ata/3           worker_thread
> >   244 S<   ata_aux         worker_thread
> >   245 S<   ksuspend_usbd   worker_thread
> >   251 S<   khubd           hub_thread
> >   254 S<   kseriod         serio_thread
> >   318 S    pdflush         pdflush
> >   319 S    pdflush         pdflush
> >   320 S<   kswapd0         kswapd
> >   321 S<   aio/0           worker_thread
> >   322 S<   aio/1           worker_thread
> >   323 S<   aio/2           worker_thread
> >   324 S<   aio/3           worker_thread
> >   325 S<   cifsoplockd     -
> >   326 S<   cifsdnotifyd    -
> >   327 S<   jfsIO           jfsIOWait
> >   328 S<   jfsCommit       jfs_lazycommit
> >   329 S<   jfsCommit       jfs_lazycommit
> >   330 S<   jfsCommit       jfs_lazycommit
> >   331 S<   jfsCommit       jfs_lazycommit
> >   332 S<   jfsSync         jfs_sync
> >   333 S<   xfslogd/0       worker_thread
> >   334 S<   xfslogd/1       worker_thread
> >   335 S<   xfslogd/2       worker_thread
> >   336 S<   xfslogd/3       worker_thread
> >   337 S<   xfsdatad/0      worker_thread
> >   338 S<   xfsdatad/1      worker_thread
> >   339 S<   xfsdatad/2      worker_thread
> >   340 S<   xfsdatad/3      worker_thread
> >   341 S<   xfs_mru_cache   worker_thread
> >   697 Ss   sshd            -
> >   700 S    sshd            -
> >   701 Ss+  bash            -
> >  1035 S<   scsi_eh_0       scsi_error_handler
> >  1066 S<   scsi_eh_1       scsi_error_handler
> >  1068 S<   scsi_eh_2       scsi_error_handler
> >  1078 S<   khpsbpkt        hpsbpkt_thread
> >  1157 S<   scsi_eh_3       scsi_error_handler
> >  1158 S<   usb-storage     -
> >  1161 S<   scsi_eh_4       scsi_error_handler
> >  1162 S<   usb-storage     -
> >  1177 S<   kpsmoused       worker_thread
> >  1180 S<   kondemand/0     worker_thread
> >  1181 S<   kondemand/1     worker_thread
> >  1182 S<   kondemand/2     worker_thread
> >  1183 S<   kondemand/3     worker_thread
> >  1199 S<   rpciod/0        worker_thread
> >  1200 S<   rpciod/1        worker_thread
> >  1201 S<   rpciod/2        worker_thread
> >  1202 S<   rpciod/3        worker_thread
> >  1203 S<   reiserfs/0      worker_thread
> >  1204 S<   reiserfs/1      worker_thread
> >  1205 S<   reiserfs/2      worker_thread
> >  1206 S<   reiserfs/3      worker_thread
> >  1298 S<s  udevd           -
> >  2988 Ss   sshd            -
> >  2990 S    sshd            -
> >  2991 Ss   bash            wait
> >  3118 S<   ocfs2_wq        worker_thread
> >  3143 S<   xfsbufd         -
> >  3144 S<   xfssyncd        -
> >  3151 S<   user_dlm        worker_thread
> >  5321 R+   ps              -
> >  5437 Ss   syslog-ng       631360160003

(rest deleted since this email is long!)
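
(For the record, taking the six samples above is easy to do with a
small loop; a sketch, assuming roughly 10 seconds between samples:)

    # Take 6 snapshots of process states and wait channels,
    # about 10 seconds apart, as in the listing above.
    for i in 1 2 3 4 5 6; do
        ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
        sleep 10
    done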


