[Ocfs2-users] another fencing question

Sunil Mushran sunil.mushran at oracle.com
Thu Jan 14 12:13:15 PST 2010


Mailing List SVR wrote:
> Hi, 
>
> periodically one of the nodes in my two-node cluster is fenced; here are the logs:
>
> Jan 14 07:01:44 nvr1-rc kernel: o2net: no longer connected to node nvr2-
> rc.minint.it (num 0) at 1.1.1.6:7777
> Jan 14 07:01:44 nvr1-rc kernel: (21534,1):dlm_do_master_request:1334 ERROR: 
> link to 0 went down!
> Jan 14 07:01:44 nvr1-rc kernel: (4007,4):dlm_send_proxy_ast_msg:458 ERROR: 
> status = -112
> Jan 14 07:01:44 nvr1-rc kernel: (4007,4):dlm_flush_asts:600 ERROR: status = 
> -112
> Jan 14 07:01:44 nvr1-rc kernel: (21534,1):dlm_get_lock_resource:917 ERROR: 
> status = -112
> Jan 14 07:02:19 nvr1-rc kernel: (3950,5):o2net_connect_expired:1664 ERROR: no 
> connection established with node 0 after 35.0 seconds, giving up and returning 
> errors.
> Jan 14 07:02:54 nvr1-rc kernel: (3950,5):o2net_connect_expired:1664 ERROR: no 
> connection established with node 0 after 35.0 seconds, giving up and returning 
> errors.
> Jan 14 07:03:10 nvr1-rc kernel: (4007,4):dlm_send_proxy_ast_msg:458 ERROR: 
> status = -107
> Jan 14 07:03:10 nvr1-rc kernel: (4007,4):dlm_flush_asts:600 ERROR: status = 
> -107
> Jan 14 07:03:29 nvr1-rc kernel: (3950,5):o2net_connect_expired:1664 ERROR: no 
> connection established with node 0 after 35.0 seconds, giving up and returning 
> errors.
> Jan 14 07:03:50 nvr1-rc kernel: (31,5):o2quo_make_decision:146 ERROR: fencing 
> this node because it is connected to a half-quorum of 1 out of 2 nodes which 
> doesn't include the lowest active node 0
> Jan 14 07:03:50 nvr1-rc kernel: (31,5):o2hb_stop_all_regions:1967 ERROR: 
> stopping heartbeat on all active regions.
>
> I'm sure there are no network connectivity problems, but it is possible that 
> there are heavy I/O loads. Is this the intended behaviour? Why is the loaded 
> node fenced under heavy load?
>
> I'm using ocfs2-1.4.4 on rhel5 kernel-2.6.18-164.6.1.el5

So the network connection snapped. That means the nodes could not ping
each other for 35 seconds. In fact node 1 (this one) kept trying to
reconnect to node 0 but got no reply back. So the network issue lasted
for over 2 minutes.
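
For what it's worth, the "half-quorum" line in your log is the key part: with
an even number of nodes, the group that can still see the lowest-numbered live
node (node 0 here) wins the tie-break, and everyone else self-fences. The
sketch below is only a simplified model of that rule as the log describes it,
not the kernel's actual o2quo_make_decision() code (which lives in
fs/ocfs2/cluster/quorum.c):

# Simplified model of the tie-break reported in the log line
# "connected to a half-quorum ... which doesn't include the lowest
# active node 0".  Illustration only, not the real o2quo code.

def should_fence(connected_nodes, all_nodes):
    """connected_nodes is the set this node can still reach, itself included."""
    half = len(all_nodes) // 2
    lowest = min(all_nodes)
    if len(connected_nodes) > half:
        return False   # clear majority: stay up
    if len(connected_nodes) == half and lowest in connected_nodes:
        return False   # exact half, but it holds the tie-breaker node
    return True        # minority, or a half without the lowest node: fence

# The split from your log: node 1 only sees itself in a 2-node cluster.
print(should_fence(connected_nodes={1}, all_nodes={0, 1}))  # True  -> node 1 fences
print(should_fence(connected_nodes={0}, all_nodes={0, 1}))  # False -> node 0 survives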

The switch could be one culprit. See if the switch logs say something. Another
possibility is that node 0 was paging heavily, or that kswapd was pegged at
100%. That is hard to determine after the fact, but it is something to keep in
mind the next time you see the same issue. If that turns out to be the case,
it needs to be fixed: maybe add more memory, or, if you are running the
database, make sure you are using hugepages, etc.
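
If memory pressure is the suspect, a quick sanity check the next time this
happens is to look at /proc/meminfo on node 0. The snippet below is just a
convenience sketch that prints the standard counters; "grep -E 'Mem|Swap|Huge'
/proc/meminfo" tells you the same thing:

#!/usr/bin/env python
# Print the memory-pressure and hugepage counters from /proc/meminfo.
fields = ("MemTotal", "MemFree", "SwapTotal", "SwapFree",
          "HugePages_Total", "HugePages_Free")

with open("/proc/meminfo") as f:
    for line in f:
        if line.split(":")[0] in fields:
            print(line.rstrip())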


