[Ocfs2-users] One dead node takes out the other

Sunil Mushran Sunil.Mushran at oracle.com
Fri Apr 7 18:44:44 CDT 2006


Please file a bug on oss.oracle.com/bugzilla. We've made
many fixes in mastery/recovery since 1.2.0. We can add
a test to check this issue too.

Jonathan Steinert wrote:
> Is anyone aware of this problem, and if so, is there a fix available?
>
> I have two nodes, alice and bob. On both I have a shared ocfs2 mount at
> /ocfs2. The FS appears to mount and work perfectly fine.
>
> On alice I take out an exclusive lock on /dlm/foo/bar and block the
> process forever. Next I start a loop on bob that tries to take out the
> same lock (trylock, exclusive mode) once per second, which fails as expected.
>
> Now I unplug alice completely, so the machine is off. The trylock process
> on bob hangs permanently. After ten seconds, the following appears on
> bob's console:
>
> (0,0):o2net_idle_timer:1293 connection to node kano (num 0) at
> 10.10.0.2:7777 has been idle for 10 seconds, shutting it down.
> (0,0):o2net_idle_timer:1304 here are some times that might help debug
> the situation: (tmr 1144294337.323052 now 1144294347.317365 dr
> 1144294337.323045 adv 1144294337.323053:1144294337.323053 func
> (7b10fddd:505) 1144294324.934836:1144294324.934838)
> (2179,0):o2net_set_nn_state:409 no longer connected to node kano (num 0)
> at 10.10.0.2:7777
> (2492,0):dlm_send_remote_lock_request:264 ERROR: status = -112
> (2492,0):dlm_send_remote_lock_request:264 ERROR: status = -107
> (2492,0):dlm_send_remote_lock_request:264 ERROR: status = -107
>
> The status = -107 message then repeats indefinitely, roughly once every
> 100 ms. A few seconds after it starts scrolling, I also get:
>
> (2493,0):ocfs2_replay_journal:1180 Recovering node 0 from slot 0 on
> device (8,2)
> (2492,0):dlm_send_remote_lock_request:264 ERROR: status = -107
> kjournald starting.  Commit interval 5 seconds
>
> This appears in the middle of all the scrolling. The trylock process on
> bob stays permanently hung and the -107 message continues to scroll.
>
> I have tried the Subversion ocfs2/trunk modules under 2.6.16
> (changed to use mutexes), the modules that come with mainline 2.6.16, and
> those in mainline 2.6.16.1. All of these seem to behave the same.
>
> OCFS2 Node Manager, DLM, DLMFS all v 1.3.3
> OCFS2-Tools v 1.2.0
>
> The bug reports I've found related to this problem say I need to upgrade
> to OCFS2-Tools version 1.0.3, which I think I'm already past (I could be
> wrong).
>
> Thanks,
> Jonathan Steinert
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>   
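
For anyone reading the quoted log: the negative status values look like
negated Linux errno numbers, and the tmr/now pair on the o2net_idle_timer
line appears to bracket the idle interval. Below is a minimal sketch for
decoding both. It assumes Linux errno numbering (107 = ENOTCONN,
112 = EHOSTDOWN) and assumes that "tmr" and "now" are the timer-armed and
current timestamps. The table and helper names are illustrative and are not
part of OCFS2.

```python
# Linux errno values (asm-generic numbering), hard-coded so the result
# does not depend on the platform running this script.
LINUX_ERRNO = {
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}


def decode_status(status):
    """Map a negative kernel status such as -107 to its errno name and text."""
    name, text = LINUX_ERRNO.get(-status, ("UNKNOWN", "not in this table"))
    return f"{status}: {name} ({text})"


def idle_seconds(tmr, now):
    """Seconds between the 'tmr' and 'now' stamps (assumed meaning, see above)."""
    return now - tmr


if __name__ == "__main__":
    # Status values and timestamps copied from the quoted console output.
    for status in (-112, -107):
        print(decode_status(status))
    idle = idle_seconds(1144294337.323052, 1144294347.317365)
    print(f"idle for about {idle:.3f} s")
```

Read this way, the log would say that bob first saw the host as down and
then kept retrying the lock request over a connection that no longer
exists, which fits the endless -107 messages.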

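The trylock loop on bob can be approximated with plain file opens. This is
only a sketch of the reproduction, not the reporter's actual test program.
It assumes the dlmfs convention that opening a lock file with O_RDWR
requests an exclusive lock, that O_NONBLOCK turns the request into a trylock
failing with ETXTBSY when the lock is held elsewhere, and that closing the
descriptor drops the lock. The function names and the one-second delay are
illustrative.

```python
import errno
import os
import time


def try_exclusive(path):
    """Attempt a non-blocking exclusive lock by opening a dlmfs lock file.

    Assumption: under dlmfs, O_RDWR asks for an exclusive lock and
    O_NONBLOCK makes the open fail with ETXTBSY instead of waiting.
    Returns an open file descriptor, or None if the lock is busy.
    """
    try:
        return os.open(path, os.O_CREAT | os.O_RDWR | os.O_NONBLOCK, 0o600)
    except OSError as exc:
        if exc.errno == errno.ETXTBSY:
            return None
        raise


def poll_lock(path, attempts, delay=1.0):
    """Try the lock once per `delay` seconds; return how many attempts succeeded."""
    granted = 0
    for _ in range(attempts):
        fd = try_exclusive(path)
        if fd is not None:
            granted += 1
            os.close(fd)  # assumed to release the lock under dlmfs
        time.sleep(delay)
    return granted


if __name__ == "__main__":
    # Path taken from the report; requires dlmfs mounted at /dlm.
    print(poll_lock("/dlm/foo/bar", attempts=5))
```

While alice holds the lock, every attempt should come back busy. The
reported bug is that after alice dies, the open never returns at all.
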

