[Ocfs2-users] One dead node takes out the other
    Jonathan Steinert 
    jsteinert at sixapart.com
       
    Thu Apr  6 04:28:20 CDT 2006
    
    
  
Is anyone aware of this problem, and if so, is there a fix available?
I have two nodes, alice and bob. On both I have a shared ocfs2 mount at
/ocfs2. The FS appears to mount and work perfectly fine.
Now on alice I take out an exclusive lock on /dlm/foo/bar and block the
process forever. Next I start a loop on bob that tries to take out the
same lock (trylock exclusive mode) once each second, which fails properly.
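
For anyone wanting to reproduce this, the polling loop on bob can be sketched roughly as below. This is not the exact program used; it assumes the usual dlmfs convention that opening a lock file with O_RDWR requests an exclusive lock, that O_NONBLOCK turns the request into a trylock, and that a busy lock is reported as ETXTBSY. The helper names try_exclusive and poll_lock are made up for illustration.

```python
import errno
import os
import time


def try_exclusive(path):
    """Try once to take an exclusive lock; return an fd, or None if busy."""
    try:
        # Assumption: O_RDWR asks for exclusive mode and O_NONBLOCK makes
        # the open a trylock instead of a blocking request.
        return os.open(path, os.O_RDWR | os.O_CREAT | os.O_NONBLOCK, 0o600)
    except OSError as exc:
        if exc.errno == errno.ETXTBSY:  # assumed "lock is held elsewhere"
            return None
        raise


def poll_lock(path, attempts, delay=1.0):
    """Retry the trylock, returning one True/False result per attempt."""
    results = []
    for _ in range(attempts):
        fd = try_exclusive(path)
        if fd is not None:
            os.close(fd)  # closing the file descriptor releases the lock
        results.append(fd is not None)
        time.sleep(delay)
    return results
```

On bob this would be run as something like poll_lock("/dlm/foo/bar", attempts=60), which should return False for every attempt while alice holds the lock.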
Now I unplug alice completely, so the machine is off. The trylock process
on bob hangs permanently. After ten seconds pass, the following appears
on bob's console:
                                                                                                 
(0,0):o2net_idle_timer:1293 connection to node kano (num 0) at
10.10.0.2:7777 has been idle for 10 seconds, shutting it down.
(0,0):o2net_idle_timer:1304 here are some times that might help debug
the situation: (tmr 1144294337.323052 now 1144294347.317365 dr
1144294337.323045 adv 1144294337.323053:1144294337.323053 func
(7b10fddd:505) 1144294324.934836:1144294324.934838)
(2179,0):o2net_set_nn_state:409 no longer connected to node kano (num 0)
at 10.10.0.2:7777
(2492,0):dlm_send_remote_lock_request:264 ERROR: status = -112
(2492,0):dlm_send_remote_lock_request:264 ERROR: status = -107
(2492,0):dlm_send_remote_lock_request:264 ERROR: status = -107
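
As a sanity check on the "idle for 10 seconds" message, the two timestamps labelled tmr and now in the log above can be subtracted. This is only a small arithmetic sketch; reading tmr as the moment the idle timer was last armed is my assumption, not something the log states.

```python
from decimal import Decimal

# Values copied from the o2net_idle_timer line; Decimal avoids float rounding.
tmr = Decimal("1144294337.323052")  # assumed: when the idle timer was armed
now = Decimal("1144294347.317365")  # when the timer fired

idle = now - tmr
print(idle)  # 9.994313
```

That is just under the 10 second idle timeout, which matches the message.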
                                                                                                 
The status = -107 message now prints roughly once every 100 ms, indefinitely.
A few seconds after it starts scrolling I get:
(2493,0):ocfs2_replay_journal:1180 Recovering node 0 from slot 0 on
device (8,2)
(2492,0):dlm_send_remote_lock_request:264 ERROR: status = -107
kjournald starting.  Commit interval 5 seconds
This appears in the middle of all the scrolling. The trylock process on bob
stays permanently hung and the -107 message continues to scroll.
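
The status values in the log look like negated Linux errno numbers, which is the usual kernel convention, though I have not confirmed that the DLM code follows it here. Under that assumption they decode as follows; the table is hard-coded from the Linux errno definitions and the helper name decode_status is made up for illustration.

```python
# Linux errno numbers relevant to the log, with the standard glibc messages.
LINUX_ERRNO = {
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}


def decode_status(status):
    """Map a negative kernel status to (name, message), or None if unknown."""
    return LINUX_ERRNO.get(-status)


for status in (-112, -107):
    print(status, decode_status(status))
```

Read this way, the first failure reports the peer as down (-112) and every retry after that fails because the connection to the dead node no longer exists (-107).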
I have tried the subversion ocfs2/trunk modules under 2.6.16 (changed to
use mutexes), the modules that come with mainline 2.6.16, and those in
mainline 2.6.16.1. All of these behave the same way.
OCFS2 Node Manager, DLM, DLMFS all v 1.3.3
OCFS2-Tools v 1.2.0
The bug reports I've found related to this problem say I need to upgrade
to OCFS2-Tools version 1.0.3, which I think I'm already past (I could be
wrong).
Thanks,
Jonathan Steinert
    
    