[Ocfs2-users] Weird issues with ocfs2

Ed McLain emclain at teklinks.com
Fri Jan 27 10:21:30 CST 2006


Got a weird issue here that I hope someone can help me with.  I have 6
boxes connected to an iSCSI SAN, 5 via 100Mb and 1 via 1Gb, all on the
same switch; the 5 on 100Mb are my MX servers, while the one on gigabit
is my load balancer.  The 5 MX boxes are PXE-booted and mount their root
filesystem read-only from the iSCSI SAN, which is an OCFS2 volume.  The
load balancer mounts the same volume read-write and runs periodic
virus/spam updates on the volume throughout the day.  My problem,
however, came this morning at around 4 am, when the system runs nightly
maintenance on itself and on the iSCSI volume.  It looks as though the
system froze shortly after the maintenance ran, but I can't tell why or
how.  To make matters more confusing, all of the MX servers could no
longer read from the iSCSI volume after this happened; it was as if
everything got fenced at once and nothing had access.  Another odd thing
is the logs on the MX servers: they all seem to be different, with
different events happening.  I hope somebody out there can help shed
some light on this for me.
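
For reference, the setup looks roughly like the sketch below.  This is
illustrative, not my exact config: node 0's name/address/port come from
the o2net messages in the logs further down, but the device path and the
load balancer's mount point are placeholders.

# /etc/ocfs2/cluster.conf (excerpt) -- node 0 as it appears in the logs;
# the other five node stanzas follow the same pattern
cluster:
        node_count = 6
        name = ocfs2

node:
        ip_port = 7777
        ip_address = 10.20.0.1
        number = 0
        name = mx.highspd.net
        cluster = ocfs2

# Effective mounts (device and mount point are placeholders):
# on mx1-mx5, root comes from the shared volume, read-only:
mount -t ocfs2 -o ro /dev/sdb1 /
# on the load balancer, same volume, read-write:
mount -t ocfs2 -o rw /dev/sdb1 /mnt/shared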

Load Balancer: 
Jan 27 04:06:12 mx kernel: (26880,2):dlm_create_lock_handler:441 ERROR:
dlm status = DLM_IVLOCKID
Jan 27 04:06:12 mx last message repeated 2 times

MX Servers:

**mx1**
Jan 27 04:06:11 mx1 kernel: (9582,1):dlm_do_assert_master:1505 ERROR:
during assert master of D000000000000000005320be20df6c3 to 2, got -22.
Jan 27 04:06:11 mx1 kernel: (9582,1):dlm_dump_lock_resources:297 struct
dlm_ctxt: B125E362AD5B43109B40EA0D3DBB7D33, node=1, key=73257850
Jan 27 04:06:11 mx1 kernel: (9582,1):dlm_print_one_lock_resource:223
lockres: M000000000000000009e5212792ca0a, owner=1, state=0
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:238
lockres: M000000000000000009e5212792ca0a, owner=1, state=0
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:240
last used: 0, on purge list: no
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:241
granted queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251
type=3, conv=-1, node=0, cookie=3822148708753997824,
ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:254
converting queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:267
blocked queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):dlm_print_one_lock_resource:223
lockres: M000000000000000009e5252792ca0e, owner=0, state=0
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:238
lockres: M000000000000000009e5252792ca0e, owner=0, state=0
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:240
last used: 0, on purge list: no
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:241
granted queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251
type=3, conv=-1, node=1, cookie=726205439913492481,
ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:254
converting queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:267
blocked queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):dlm_print_one_lock_resource:223
lockres: M00000000000000000a724f2792f1c0, owner=1, state=0
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:238
lockres: M00000000000000000a724f2792f1c0, owner=1, state=0
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:240
last used: 0, on purge list: no
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:241
granted queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251
type=3, conv=-1, node=1, cookie=1446781380292771841,
ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251
type=3, conv=-1, node=5, cookie=1590896568368627717,
ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251
type=3, conv=-1, node=2, cookie=1807069350482411522,
ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251
type=3, conv=-1, node=3, cookie=7139331309289078787,
ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251
type=3, conv=-1, node=4, cookie=2384093052739256324,
ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:254
converting queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:267
blocked queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):dlm_print_one_lock_resource:223
lockres: D000000000000000009e52f2792ca18, owner=0, state=0
*** the above repeats until ***
Jan 27 04:06:26 mx1 kernel: ------------[ cut here ]------------
Jan 27 04:06:26 mx1 kernel: kernel BUG
at /usr/src/redhat/BUILD/ocfs2-1.1.8/fs/ocfs2/dlm/dlmmaster.c:1507!
Jan 27 04:06:26 mx1 kernel: invalid operand: 0000 [#3]
Jan 27 04:06:26 mx1 kernel: SMP
Jan 27 04:06:26 mx1 kernel: Modules linked in: ipv6 autofs4 microcode
uhci_hcd ehci_hcd video button battery ac ocfs2 debugfs ocfs2_dlmfs
ocfs2_dlm ocfs2_nodemanager configfs crc32c libcrc32c md5 iscsi_sfnet
scsi_transport_iscsi 8021q ext3 jbd aic79xx aic7xxx ata_piix libata
sd_mod scsi_mod tg3 e1000
Jan 27 04:06:26 mx1 kernel: CPU:    1
Jan 27 04:06:26 mx1 kernel: EIP:    0060:[<f89abef0>]    Not tainted VLI
Jan 27 04:06:26 mx1 kernel: EFLAGS: 00010296   (2.6.10-1.771_FC2tek-mx)
Jan 27 04:06:26 mx1 kernel: EIP is at dlm_do_assert_master+0x2d1/0x324
[ocfs2_dlm]
Jan 27 04:06:26 mx1 kernel: eax: f65ed030   ebx: 00000000   ecx:
d788cbe0   edx: 00000400
Jan 27 04:06:26 mx1 kernel: esi: e9012f7f   edi: d788ccb7   ebp:
0000001f   esp: d788cc4c
Jan 27 04:06:26 mx1 kernel: ds: 007b   es: 007b   ss: 0068
Jan 27 04:06:26 mx1 kernel: Process md5sum (pid: 9600,
threadinfo=d788c000 task=cdd82020)
Jan 27 04:06:26 mx1 kernel: Stack: 00000002 00000000 e9012f60 f65ed000
ffffffea 0000003d 00000000 00000000
Jan 27 04:06:26 mx1 kernel:        00000000 00000000 00000000 00000000
00000000 00000002 00058893 43d9f094
Jan 27 04:06:26 mx1 kernel:        00058898 00001f01 00000000 3030304d
30303030 30303030 30303030 33353030
Jan 27 04:06:26 mx1 kernel: Call Trace:
Jan 27 04:06:26 mx1 kernel:  [<f89aaaae>] dlm_wait_for_lock_mastery
+0x465/0x514 [ocfs2_dlm]
Jan 27 04:06:26 mx1 kernel:  [<f89a6534>] __dlm_insert_lockres+0x78/0x8b
[ocfs2_dlm]
Jan 27 04:06:26 mx1 kernel:  [<f89aa45e>] dlm_get_lock_resource
+0x513/0x6fe [ocfs2_dlm]
Jan 27 04:06:26 mx1 kernel:  [<f89b26af>] dlmlock+0x7b1/0xaaa
[ocfs2_dlm]
Jan 27 04:06:26 mx1 kernel:  [<f89d544f>] ocfs2_lock_create+0xfd/0x2d7
[ocfs2]
Jan 27 04:06:26 mx1 kernel:  [<f89d43ab>] ocfs2_inode_ast_func+0x0/0x53c
[ocfs2]
Jan 27 04:06:26 mx1 kernel:  [<f89d4b16>] ocfs2_inode_bast_func
+0x0/0x130 [ocfs2]
Jan 27 04:06:26 mx1 kernel:  [<f89d5901>] ocfs2_cluster_lock+0x1fa/0x843
[ocfs2]
Jan 27 04:06:26 mx1 kernel:  [<f89d569b>] ocfs2_status_completion_cb
+0x0/0xa [ocfs2]
Jan 27 04:06:26 mx1 kernel:  [<f89d75af>] ocfs2_meta_lock_full
+0x1e8/0x563 [ocfs2]
Jan 27 04:06:26 mx1 kernel:  [<c01452d3>] handle_mm_fault+0xbe/0x176
Jan 27 04:06:26 mx1 kernel:  [<c019d5a2>] selinux_inode_getattr
+0x4a/0x52
Jan 27 04:06:26 mx1 kernel:  [<c0116f44>] do_page_fault+0x191/0x57a
Jan 27 04:06:26 mx1 kernel:  [<f89e3da7>] ocfs2_inode_revalidate
+0x11a/0x1e9 [ocfs2]
Jan 27 04:06:26 mx1 kernel:  [<f89e0099>] ocfs2_getattr+0x0/0x141
[ocfs2]
Jan 27 04:06:26 mx1 kernel:  [<f89e00fc>] ocfs2_getattr+0x63/0x141
[ocfs2]
Jan 27 04:06:26 mx1 kernel:  [<f89e0099>] ocfs2_getattr+0x0/0x141
[ocfs2]
Jan 27 04:06:26 mx1 kernel:  [<c0158a6e>] vfs_getattr+0x35/0x88
Jan 27 04:06:26 mx1 kernel:  [<c0158b54>] vfs_fstat+0x22/0x31
Jan 27 04:06:26 mx1 kernel:  [<c0159094>] sys_fstat64+0xf/0x23
Jan 27 04:06:26 mx1 kernel:  [<c0179e1f>] dnotify_parent+0x1b/0x6c
Jan 27 04:06:26 mx1 kernel:  [<c0116db3>] do_page_fault+0x0/0x57a
Jan 27 04:06:26 mx1 kernel:  [<c0103c57>] syscall_call+0x7/0xb
Jan 27 04:06:26 mx1 kernel: Code: 10 55 68 e1 05 00 00 68 bb cc 9b f8 ff
70 10 ff b2 94 00 00 00 68 66 cd 9b f8 e8 aa 21 77 c7 83 c4 24 8b 44 24
0c e8 8d 9c ff ff <0f> 0b e3 05 46 c0 9b f8 8b 4c 24 34 ba ff 00 00 00
8d 44 24 14
Jan 27 04:07:05 mx1 kernel:  <3>(782,0):dlm_create_lock_handler:441
ERROR: dlm status = DLM_IVLOCKID
Jan 27 04:24:44 mx1 kernel: (0,0):o2net_idle_timer:1310 connection to
node mx.highspd.net (num 0) at 10.20.0.1:7777 has been idle for 10
seconds, shutting it down.
Jan 27 04:24:44 mx1 kernel: (0,0):o2net_idle_timer:1321 here are some
times that might help debug the situation: (tmr 1138357474.529323 now
1138357484.527980 dr 1138357474.529310 adv
1138357474.529336:1138357474.529339 func (045dd37a:503)
1138357014.669140:1138357014.669145)
Jan 27 04:24:44 mx1 kernel: (782,0):o2net_set_nn_state:411 no longer
connected to node mx.highspd.net (num 0) at 10.20.0.1:7777
Jan 27 04:24:49 mx1 kernel: (782,0):ocfs2_dlm_eviction_cb:118 device
(8,17): dlm has evicted node 0
Jan 27 04:24:49 mx1 kernel: (782,0):dlm_begin_reco_handler:2057 ERROR:
recovery master 3 sees 0 as dead, but this node has not yet.  marking 0
as dead
Jan 27 04:24:54 mx1 kernel: (10814,1):ocfs2_replay_journal:1167
Recovering node 0 from slot 0 on device (8,17)


** mx2 **
Same as MX1 but no kernel oops

** mx3 **
Jan 27 04:24:42 mx3 kernel: (0,0):o2net_idle_timer:1310 connection to
node mx.highspd.net (num 0) at 10.20.0.1:7777 has been idle for 10
seconds, shutting it down.
Jan 27 04:24:42 mx3 kernel: (0,0):o2net_idle_timer:1321 here are some
times that might help debug the situation: (tmr 1138357472.621478 now
1138357482.620252 dr 1138357472.621465 adv
1138357472.621490:1138357472.621493 func (045dd37a:502)
1138356977.791621:1138356977.791625)
Jan 27 04:24:42 mx3 kernel: (778,0):o2net_set_nn_state:411 no longer
connected to node mx.highspd.net (num 0) at 10.20.0.1:7777
Jan 27 04:24:51 mx3 kernel: (32525,0):dlm_restart_lock_mastery:1064
ERROR: node down! 0

** mx4 **
Same as MX3

** mx5 ** 
Same as MX1 including kernel oops


-- 
Thanks,
Ed McLain
Sr. Colocation Engineer
TekLinks / High Speed Solutions

