[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

Marek Królikowski admin at wset.edu.pl
Thu Dec 22 08:08:33 PST 2011


Hello,
After 24 hours I see that TEST-MAIL2 rebooted (possibly a kernel panic), and
TEST-MAIL1 shows this in dmesg:
TEST-MAIL1 ~ # dmesg
[cut]
o2net: accepted connection from node TEST-MAIL2 (num 1) at 172.17.1.252:7777
o2dlm: Node 1 joins domain B24C4493BBC74FEAA3371E2534BB3611
o2dlm: Nodes in domain B24C4493BBC74FEAA3371E2534BB3611: 0 1
o2net: connection to node TEST-MAIL2 (num 1) at 172.17.1.252:7777 has been 
idle for 60.0 seconds, shutting it down.
(swapper,0,0):o2net_idle_timer:1562 Here are some times that might help 
debug the situation: (Timer: 33127732045, Now 33187808090, DataReady 
33127732039, Advance 33127732051-33127732051, Key 0xebb9cd47, Func 506, 
FuncTime 33127732045-33127732048)
o2net: no longer connected to node TEST-MAIL2 (num 1) at 172.17.1.252:7777
(du,5099,12):dlm_do_master_request:1324 ERROR: link to 1 went down!
(du,5099,12):dlm_get_lock_resource:907 ERROR: status = -112
(dlm_thread,14321,1):dlm_send_proxy_ast_msg:484 ERROR: 
B24C4493BBC74FEAA3371E2534BB3611: res M000000000000000000000cf023ef70, 
error -112 send AST to node 1
(dlm_thread,14321,1):dlm_flush_asts:605 ERROR: status = -112
(dlm_thread,14321,1):dlm_send_proxy_ast_msg:484 ERROR: 
B24C4493BBC74FEAA3371E2534BB3611: res P000000000000000000000000000000, 
error -107 send AST to node 1
(dlm_thread,14321,1):dlm_flush_asts:605 ERROR: status = -107
(kworker/u:3,5071,0):o2net_connect_expired:1724 ERROR: no connection 
established with node 1 after 60.0 seconds, giving up and returning errors.
(o2hb-B24C4493BB,14310,0):o2dlm_eviction_cb:267 o2dlm has evicted node 1 
from group B24C4493BBC74FEAA3371E2534BB3611
(ocfs2rec,5504,6):dlm_get_lock_resource:834 
B24C4493BBC74FEAA3371E2534BB3611:M0000000000000000000015f023ef70: at least 
one node (1) to recover before lock mastery can begin
(ocfs2rec,5504,6):dlm_get_lock_resource:888 
B24C4493BBC74FEAA3371E2534BB3611:M0000000000000000000015f023ef70: at least 
one node (1) to recover before lock mastery can begin
(du,5099,12):dlm_restart_lock_mastery:1213 ERROR: node down! 1
(du,5099,12):dlm_wait_for_lock_mastery:1030 ERROR: status = -11
(du,5099,12):dlm_get_lock_resource:888 
B24C4493BBC74FEAA3371E2534BB3611:N000000000020924f: at least one node (1) to 
recover before lock mastery can begin
(dlm_reco_thread,14322,0):dlm_get_lock_resource:834 
B24C4493BBC74FEAA3371E2534BB3611:$RECOVERY: at least one node (1) to recover 
before lock mastery can begin
(dlm_reco_thread,14322,0):dlm_get_lock_resource:868 
B24C4493BBC74FEAA3371E2534BB3611: recovery map is not empty, but must master 
$RECOVERY lock now
(dlm_reco_thread,14322,0):dlm_do_recovery:523 (14322) Node 0 is the Recovery 
Master for the Dead Node 1 for Domain B24C4493BBC74FEAA3371E2534BB3611
(ocfs2rec,5504,6):ocfs2_replay_journal:1549 Recovering node 1 from slot 1 on 
device (253,0)
(ocfs2rec,5504,6):ocfs2_begin_quota_recovery:407 Beginning quota recovery in 
slot 1
(kworker/u:0,2909,0):ocfs2_finish_quota_recovery:599 Finishing quota 
recovery in slot 1
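
For anyone reading the log above: the negative status values look like negated Linux errno codes. Here is a minimal sketch of a lookup for just the three codes that appear in this log. The helper name errno_name is made up for illustration, and the numbers assume the usual Linux errno numbering (other systems number them differently).

```shell
#!/bin/sh
# Hypothetical helper: map the status codes seen in the log above to
# their Linux errno names. Only the three codes from this log are
# covered; the numbering assumes Linux.
errno_name() {
    code=${1#-}    # strip a leading minus sign, so -112 becomes 112
    case $code in
        11)  echo "EAGAIN: Resource temporarily unavailable" ;;
        107) echo "ENOTCONN: Transport endpoint is not connected" ;;
        112) echo "EHOSTDOWN: Host is down" ;;
        *)   echo "unknown: $code" ;;
    esac
}
```

Read this way, the -112 and -107 errors would fit the o2net connection to node 1 going away, and -11 would be the lock mastery being retried once the node is marked down.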

Then I tried to run these commands:
debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP allow
debugfs.ocfs2: Unable to write log mask "ENTRY": No such file or directory
debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP off
debugfs.ocfs2: Unable to write log mask "ENTRY": No such file or directory

But they do not work.
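
One possible reading of the "No such file or directory" error is that the running kernel does not expose a log mask called ENTRY. The sketch below checks which of the requested mask names exist as entries in a given directory. The function name available_masks is hypothetical, and the directory /sys/fs/o2cb/logmask in the usage note is an assumption about where the masks live, so check the path on your own system first.

```shell
#!/bin/sh
# Hypothetical helper: print the requested mask names that exist as
# entries in the given directory, and report the others on stderr.
# usage: available_masks DIR MASK...
available_masks() {
    dir=$1
    shift
    for mask in "$@"; do
        if [ -e "$dir/$mask" ]; then
            printf '%s\n' "$mask"
        else
            printf 'missing: %s\n' "$mask" >&2
        fi
    done
}
```

For example, available_masks /sys/fs/o2cb/logmask ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP would list the names that are present under that (assumed) path, and only those could then be passed to debugfs.ocfs2 -l.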


-----Original Message-----
From: Srinivas Eeda
Sent: Wednesday, December 21, 2011 8:43 PM
To: Marek Królikowski
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

Those numbers look good. Basically with the fixes backed out and another
fix I gave, you are not seeing that many orphans hanging around and
hence not seeing the process stuck kernel stacks. You can run the test
longer or if you are satisfied, please enable quotas and re-run the test
with the modified kernel. You might see a deadlock which needs to be
fixed (I was not able to reproduce this yet). If the system hangs, please
capture the following and send me the output:

1. echo t > /proc/sysrq-trigger
2. debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP allow
3. wait for 10 minutes
4. debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP off
5. echo t > /proc/sysrq-trigger
5. echo t > /proc/sysrq-trigger
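
The five steps above could be wrapped in a small script, roughly as sketched here. The commands and the mask list are taken from the steps. The names run, capture_hang, MASKS and DRY_RUN are made up for illustration. With DRY_RUN=1 the commands are only printed, which is a safe way to check the script before running it as root on a hung node.

```shell
#!/bin/sh
# Sketch of the five capture steps. The helper names are hypothetical.
MASKS="ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP"

# Print the command when DRY_RUN=1, otherwise execute it via sh -c.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        sh -c "$*"
    fi
}

# usage: capture_hang [WAIT_SECONDS]   (default 600 = 10 minutes)
capture_hang() {
    wait_seconds=${1:-600}
    run "echo t > /proc/sysrq-trigger"      # step 1: dump task stacks
    run "debugfs.ocfs2 -l $MASKS allow"     # step 2: enable log masks
    run "sleep $wait_seconds"               # step 3: wait
    run "debugfs.ocfs2 -l $MASKS off"       # step 4: disable log masks
    run "echo t > /proc/sysrq-trigger"      # step 5: dump task stacks again
}
```

Running capture_hang as root executes the steps with the 10 minute wait. The sysrq task dump goes to the kernel log, so saving the output of dmesg afterwards should collect it.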



