[Ocfs2-users] huge "something" problem
Alexandre Racine
Alexandre.Racine at mhicc.org
Mon Jun 2 11:24:55 PDT 2008
Hi Sunil,
> What about the other node?
I also found the messages below (from /var/log/messages) on the SRV1
server, which is the master SGE (Sun Grid Engine) server of the cluster.
This server rebooted when the error occurred; I don't know if that is
related.
> Do this by hand on both nodes.
> 1. debugfs.ocfs2 -R "fs_locks" /dev/sdX >/tmp/locks.out
> 2. grep -i busy /tmp/locks.out
This gives no results, although the /tmp/locks.out file itself has
200,000+ lines...
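For reference, here is a sketch of the lock-dump triage Sunil suggested
(/dev/sdX stands in for the shared OCFS2 device; the grep step is shown
against a simulated dump line, since the real dump needs ocfs2-tools and
a mounted OCFS2 volume):

```shell
# On each node, dump the file-system lock state to a file
# (requires debugfs.ocfs2 from ocfs2-tools; /dev/sdX is a placeholder):
#   debugfs.ocfs2 -R "fs_locks" /dev/sdX > /tmp/locks.out
#
# Then scan the dump for stuck lock resources. A lockres waiting on the
# o2dlm is flagged "Busy"; simulated here with one sample line so the
# triage step can be shown end to end:
printf 'Lockres: <sample>  Mode: Protected Read  Flags: Initialized\n' > /tmp/locks.out
if grep -qi busy /tmp/locks.out; then
    echo "busy lockres found - likely waiting on o2dlm"
else
    echo "no busy locks"    # the result I got on both nodes
fi
```

Run the same two steps on every node, since the lock may be held (or
stuck) on either side.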
-------- /var/log/messages
May 30 21:01:44 SRV1 Assertion failure in journal_dirty_metadata() at
fs/jbd/transaction.c:1161: "jh->b_next_transaction == transaction"
May 30 21:01:44 SRV1 ------------[ cut here ]------------
May 30 21:01:44 SRV1 kernel BUG at fs/jbd/transaction.c:1161!
May 30 21:01:44 SRV1 invalid opcode: 0000 [1] SMP
May 30 21:01:44 SRV1 CPU 6
May 30 21:01:44 SRV1 Modules linked in: ocfs2_dlmfs ocfs2 ocfs2_dlm
ocfs2_nodemanager configfs iscsi_tcp libiscsi scsi_transport_iscsi
May 30 21:01:44 SRV1 Pid: 29383, comm: alohomora_m Not tainted
2.6.23-gentoo-r8 #3
May 30 21:01:44 SRV1 RIP: 0010:[<ffffffff802fc2cf>]
[<ffffffff802fc2cf>] journal_dirty_metadata+0x143/0x1be
May 30 21:01:44 SRV1 RSP: 0018:ffff81069b3f3918 EFLAGS: 00010292
May 30 21:01:44 SRV1 RAX: 0000000000000077 RBX: ffff8102c85a98e8 RCX:
0000000000000046
May 30 21:01:44 SRV1 RDX: 0000000000000000 RSI: 0000000000000096 RDI:
ffffffff807acd60
May 30 21:01:44 SRV1 RBP: ffff8102bf8c4b50 R08: ffffffff807acd68 R09:
0000000000000001
May 30 21:01:44 SRV1 R10: ffff81069b3f388e R11: ffffffff80264dff R12:
ffff81054401f5a0
May 30 21:01:44 SRV1 R13: ffff8107e2645a00 R14: ffff810700c21540 R15:
ffff81069b3f3a48
May 30 21:01:44 SRV1 FS: 0000000000000000(0000)
GS:ffff8107e2570140(0063) knlGS:00000000f7d468d0
May 30 21:01:44 SRV1 CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
May 30 21:01:44 SRV1 CR2: 000000000a29c004 CR3: 000000034a521000 CR4:
00000000000006e0
May 30 21:01:44 SRV1 DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
May 30 21:01:44 SRV1 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
May 30 21:01:44 SRV1 Process alohomora_m (pid: 29383, threadinfo
ffff81069b3f2000, task ffff81047b0b27d0)
May 30 21:01:44 SRV1 Stack: ffff8107c6582630 ffff8102c85a98e8
ffff81054401f5a0 ffff81020b335000
May 30 21:01:44 SRV1 ffff8107c6582630 ffffffff880aa2f3 ffff81020b335000
ffff8107bd23be40
May 30 21:01:44 SRV1 ffff810700c21540 ffffffff8808a73b ffff8102521e8780
ffff81069b3f3a28
May 30 21:01:44 SRV1 Call Trace:
May 30 21:01:44 SRV1 [<ffffffff880aa2f3>]
:ocfs2:ocfs2_journal_dirty+0x6a/0x11d
May 30 21:01:44 SRV1 [<ffffffff8808a73b>]
:ocfs2:ocfs2_do_insert_extent+0x905/0xb43
May 30 21:01:44 SRV1 [<ffffffff8808ecd3>]
:ocfs2:ocfs2_insert_extent+0x600/0x70d
May 30 21:01:44 SRV1 [<ffffffff8809ec09>]
:ocfs2:ocfs2_do_extend_allocation+0x370/0x54a
May 30 21:01:44 SRV1 [<ffffffff880907eb>]
:ocfs2:ocfs2_write_begin_nolock+0x96b/0x1222
May 30 21:01:44 SRV1 [<ffffffff880928a9>]
:ocfs2:ocfs2_write_begin+0x1a9/0x274
May 30 21:01:44 SRV1 [<ffffffff880a41a4>]
:ocfs2:ocfs2_file_aio_write+0x65f/0xa2f
May 30 21:01:44 SRV1 [<ffffffff880968f9>]
:ocfs2:ocfs2_should_refresh_lock_res+0x84/0x1ba
May 30 21:01:44 SRV1 [<ffffffff80264dff>] zone_statistics+0x3f/0x60
May 30 21:01:44 SRV1 [<ffffffff8025f5eb>] __alloc_pages+0x60/0x2c2
May 30 21:01:44 SRV1 [<ffffffff8027f908>] do_sync_write+0xc9/0x10c
May 30 21:01:44 SRV1 [<ffffffff80247a26>]
autoremove_wake_function+0x0/0x2e
May 30 21:01:44 SRV1 [<ffffffff80280048>] vfs_write+0xad/0x136
May 30 21:01:44 SRV1 [<ffffffff80280585>] sys_write+0x45/0x6e
May 30 21:01:44 SRV1 [<ffffffff80224e70>] sysenter_do_call+0x1b/0x67
May 30 21:01:44 SRV1 [<ffffffff880aa3a6>]
:ocfs2:ocfs2_journal_dirty_data+0x0/0x7e
May 30 21:01:44 SRV1 Code: 0f 0b eb fe 48 83 7d 18 00 74 2c 49 c7 c0 eb
8e 71 80 b9 90
May 30 21:01:44 SRV1 RIP [<ffffffff802fc2cf>]
journal_dirty_metadata+0x143/0x1be
May 30 21:01:44 SRV1 RSP <ffff81069b3f3918>
May 30 21:01:55 SRV1 BUG: soft lockup - CPU#1 stuck for 11s!
[kjournald:6953]
May 30 21:01:55 SRV1 CPU 1:
May 30 21:01:55 SRV1 Modules linked in: ocfs2_dlmfs ocfs2 ocfs2_dlm
ocfs2_nodemanager configfs iscsi_tcp libiscsi scsi_transport_iscsi
May 30 21:01:55 SRV1 Pid: 6953, comm: kjournald Tainted: G D
2.6.23-gentoo-r8 #3
May 30 21:01:55 SRV1 RIP: 0010:[<ffffffff80300f63>]
[<ffffffff80300f63>] journal_write_metadata_buffer+0x68/0x319
May 30 21:01:55 SRV1 RSP: 0018:ffff8107d8fbdde0 EFLAGS: 00000206
May 30 21:01:55 SRV1 RAX: 0000000000398029 RBX: ffff8107e2645a00 RCX:
00000000000007e0
May 30 21:01:55 SRV1 RDX: 00000000000007e1 RSI: ffffffff808c3480 RDI:
ffff8101fd2afa88
May 30 21:01:55 SRV1 RBP: 0000000000000000 R08: ffff8107da27c518 R09:
ffff8107da27c520
May 30 21:01:55 SRV1 R10: ffff8101fd2af9b8 R11: ffffffff88092288 R12:
000000000000af45
May 30 21:01:55 SRV1 R13: 000000000000af45 R14: ffff8107da27c530 R15:
0000000000000000
May 30 21:01:55 SRV1 FS: 0000000000000000(0000)
GS:ffff8107e24d3a40(0000) knlGS:0000000000000000
May 30 21:01:55 SRV1 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
May 30 21:01:55 SRV1 CR2: 000000000a2e202c CR3: 000000034c7d6000 CR4:
00000000000006e0
May 30 21:01:55 SRV1 DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
May 30 21:01:55 SRV1 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
May 30 21:01:55 SRV1
May 30 21:01:55 SRV1 Call Trace:
May 30 21:01:55 SRV1 [<ffffffff8030010d>] journal_bmap+0x22/0x7a
May 30 21:01:55 SRV1 [<ffffffff802fcf83>]
journal_commit_transaction+0x795/0x1016
May 30 21:01:55 SRV1 [<ffffffff80247a26>]
autoremove_wake_function+0x0/0x2e
May 30 21:01:55 SRV1 [<ffffffff8030066f>] kjournald+0xb9/0x212
May 30 21:01:55 SRV1 [<ffffffff80247a26>]
autoremove_wake_function+0x0/0x2e
May 30 21:01:55 SRV1 [<ffffffff803005b6>] kjournald+0x0/0x212
May 30 21:01:55 SRV1 [<ffffffff80247908>] kthread+0x47/0x73
May 30 21:01:55 SRV1 [<ffffffff8020c188>] child_rip+0xa/0x12
May 30 21:01:55 SRV1 [<ffffffff802478c1>] kthread+0x0/0x73
May 30 21:01:55 SRV1 [<ffffffff8020c17e>] child_rip+0x0/0x12
May 30 21:01:55 SRV1
May 30 21:02:07 SRV1 BUG: soft lockup - CPU#1 stuck for 11s!
[kjournald:6953]
May 30 21:02:19 SRV1 BUG: soft lockup - CPU#1 stuck for 11s!
[kjournald:6953]
(The register dumps and call traces for these two lockups are identical
to the 21:01:55 one above, so I trimmed them to keep this email shorter.)
Alexandre Racine
alexandre.racine at mhicc.org
514-461-1300 poste 3303
> -----Original Message-----
> From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
> Sent: 2 juin 2008 13:51
> To: Alexandre Racine
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] huge "something" problem
>
> What about the other node?
>
> The function indicates that it is waiting on the o2dlm. If so,
> scanlocks should show a "Busy" lockres.
>
> Do this by hand on both nodes.
> 1. debugfs.ocfs2 -R "fs_locks" /dev/sdX >/tmp/locks.out
> 2. grep -i busy /tmp/locks.out
>
> Best log a bugzilla with all this information. Even if we cannot
> resolve this right now, the information will help us in the future.
>
> Repeat the same on both nodes.
>
> The one so-called bummer is that 2.6.23 does not have any dlm
> debugging. It will be available from 2.6.26 onwards. Meaning,
> even if we get a lockres, we will not be able to dump the state
> in 2.6.23.
>
> Alexandre Racine wrote:
> > Hi Sunil,
> >
> > Below you will find the result of the command "$ ps -e -o
> > pid,stat,comm,wchan=WIDE-WCHAN-COLUMN", run 6 times with roughly
> > 10 seconds' delay between runs.
> >
> > Now, from my point of view, I have a load of 8.00 and nfsd is
> > running 8 instances which are all in the "D" state. I am kind of
> > making the links here ;) What does "ocfs2_wait_for_mask" mean? Are
> > there known problems with OCFS2 and NFS? (NFS is somehow not a
> > perfect FS)
> >
> > Thanks
> >
> >
> >
> > PID STAT COMMAND WIDE-WCHAN-COLUMN
> > 1 Ss init -
> > 2 S< kthreadd kthreadd
> > 3 S< migration/0 migration_thread
> > 4 S< ksoftirqd/0 ksoftirqd
> > 5 S< watchdog/0 watchdog
> > 6 S< migration/1 migration_thread
> > 7 S< ksoftirqd/1 ksoftirqd
> > 8 S< watchdog/1 watchdog
> > 9 S< migration/2 migration_thread
> > 10 S< ksoftirqd/2 ksoftirqd
> > 11 S< watchdog/2 watchdog
> > 12 S< migration/3 migration_thread
> > 13 S< ksoftirqd/3 ksoftirqd
> > 14 S< watchdog/3 watchdog
> > 15 S< events/0 worker_thread
> > 16 S< events/1 worker_thread
> > 17 S< events/2 worker_thread
> > 18 S< events/3 worker_thread
> > 19 S< khelper worker_thread
> > 105 S< kblockd/0 worker_thread
> > 106 S< kblockd/1 worker_thread
> > 107 S< kblockd/2 worker_thread
> > 108 S< kblockd/3 worker_thread
> > 112 S< kacpid worker_thread
> > 113 S< kacpi_notify worker_thread
> > 240 S< ata/0 worker_thread
> > 241 S< ata/1 worker_thread
> > 242 S< ata/2 worker_thread
> > 243 S< ata/3 worker_thread
> > 244 S< ata_aux worker_thread
> > 245 S< ksuspend_usbd worker_thread
> > 251 S< khubd hub_thread
> > 254 S< kseriod serio_thread
> > 318 S pdflush pdflush
> > 319 S pdflush pdflush
> > 320 S< kswapd0 kswapd
> > 321 S< aio/0 worker_thread
> > 322 S< aio/1 worker_thread
> > 323 S< aio/2 worker_thread
> > 324 S< aio/3 worker_thread
> > 325 S< cifsoplockd -
> > 326 S< cifsdnotifyd -
> > 327 S< jfsIO jfsIOWait
> > 328 S< jfsCommit jfs_lazycommit
> > 329 S< jfsCommit jfs_lazycommit
> > 330 S< jfsCommit jfs_lazycommit
> > 331 S< jfsCommit jfs_lazycommit
> > 332 S< jfsSync jfs_sync
> > 333 S< xfslogd/0 worker_thread
> > 334 S< xfslogd/1 worker_thread
> > 335 S< xfslogd/2 worker_thread
> > 336 S< xfslogd/3 worker_thread
> > 337 S< xfsdatad/0 worker_thread
> > 338 S< xfsdatad/1 worker_thread
> > 339 S< xfsdatad/2 worker_thread
> > 340 S< xfsdatad/3 worker_thread
> > 341 S< xfs_mru_cache worker_thread
> > 697 Ss sshd -
> > 700 S sshd -
> > 701 Ss+ bash -
> > 1035 S< scsi_eh_0 scsi_error_handler
> > 1066 S< scsi_eh_1 scsi_error_handler
> > 1068 S< scsi_eh_2 scsi_error_handler
> > 1078 S< khpsbpkt hpsbpkt_thread
> > 1157 S< scsi_eh_3 scsi_error_handler
> > 1158 S< usb-storage -
> > 1161 S< scsi_eh_4 scsi_error_handler
> > 1162 S< usb-storage -
> > 1177 S< kpsmoused worker_thread
> > 1180 S< kondemand/0 worker_thread
> > 1181 S< kondemand/1 worker_thread
> > 1182 S< kondemand/2 worker_thread
> > 1183 S< kondemand/3 worker_thread
> > 1199 S< rpciod/0 worker_thread
> > 1200 S< rpciod/1 worker_thread
> > 1201 S< rpciod/2 worker_thread
> > 1202 S< rpciod/3 worker_thread
> > 1203 S< reiserfs/0 worker_thread
> > 1204 S< reiserfs/1 worker_thread
> > 1205 S< reiserfs/2 worker_thread
> > 1206 S< reiserfs/3 worker_thread
> > 1298 S<s udevd -
> > 2988 Ss sshd -
> > 2990 S sshd -
> > 2991 Ss bash wait
> > 3118 S< ocfs2_wq worker_thread
> > 3143 S< xfsbufd -
> > 3144 S< xfssyncd -
> > 3151 S< user_dlm worker_thread
> > 5321 R+ ps -
> > 5437 Ss syslog-ng 631360160003
(rest deleted since this email is long!)