[Ocfs2-users] cluster rebooting

Sat Mar 14 07:34:17 PDT 2009

Thanks. I will gather the information and file a bugzilla.
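
While collecting the messages files, something like the following might help
summarize the ocfs2_dlm lines per node. It is only a rough sketch: it assumes
the log lines quoted below have been rejoined to one message per line, and the
function name summarize_dlm is made up for illustration, not part of any
OCFS2 tool.

```python
import re

# Patterns for the two ocfs2_dlm message shapes seen in the quoted log.
LEAVE_RE = re.compile(r"ocfs2_dlm: Node (\d+) leaves domain ([0-9A-F]+)")
MEMBERS_RE = re.compile(r'ocfs2_dlm: Nodes in domain \("([0-9A-F]+)"\): ([\d ]+)')


def summarize_dlm(lines):
    """Parse unwrapped syslog lines (hypothetical helper).

    Returns (departures, members):
      departures: list of (node, domain) tuples in log order
      members:    dict mapping domain -> node numbers last reported
    """
    departures = []
    members = {}
    for line in lines:
        m = LEAVE_RE.search(line)
        if m:
            departures.append((int(m.group(1)), m.group(2)))
            continue
        m = MEMBERS_RE.search(line)
        if m:
            members[m.group(1)] = [int(n) for n in m.group(2).split()]
    return departures, members
```

Running it over each node's messages file and comparing the results should
show whether every node saw the same departures and the same remaining
membership for each domain.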

On Fri, 13 Mar 2009, Sunil Mushran wrote:

> Impossible to determine the cause with what you have provided. File a
> bugzilla and attach messages from all nodes. No exceptions. If you have
> netconsole set up (you should), attach those logs. That way we'll know if
> the nodes oopsed and, if so, what the stack was.
>
> Sunil
>
> On Fri, Mar 13, 2009 at 03:01:57PM -0400, andrew at temporalspaces.com wrote:
>> Hi-
>>
>> I have a 16-node cluster that has been rebooting all nodes in the cluster.
>> I received a segfault from multipathd on one node, and then all nodes in
>> the cluster rebooted. Here is the error message that appeared on all nodes:
>>
>> Mar 13 13:30:27 bws01 kernel: ocfs2_dlm: Node 8 leaves domain B24F4E67EBB34CAA99690B112FA6D50E
>> Mar 13 13:30:27 bws01 kernel: ocfs2_dlm: Nodes in domain ("B24F4E67EBB34CAA99690B112FA6D50E"): 0 1 2 3 5 6 7 9 10 13 15 17
>> Mar 13 13:30:33 bws01 kernel: ocfs2_dlm: Node 8 leaves domain F575B164F63E4E888004C70D9F84D779
>> Mar 13 13:30:33 bws01 kernel: ocfs2_dlm: Nodes in domain ("F575B164F63E4E888004C70D9F84D779"): 0 1 2 3 5 6 7 9 10 13 15 16 17
>> Mar 13 13:30:39 bws01 kernel: ocfs2_dlm: Node 8 leaves domain A70D0DC186724FF388CDE65EC540C444
>> Mar 13 13:30:39 bws01 kernel: ocfs2_dlm: Nodes in domain ("A70D0DC186724FF388CDE65EC540C444"): 0 1 2 3 5 6 7 9 10
>> Mar 13 13:30:45 bws01 kernel: ocfs2_dlm: Node 8 leaves domain B31B07823153433C948F63199CE4A31C
>> Mar 13 13:30:45 bws01 kernel: ocfs2_dlm: Nodes in domain ("B31B07823153433C948F63199CE4A31C"): 0 1 2 3 5 6 7 9 10
>> Mar 13 13:31:11 bws01 xinetd[4934]: START: nrpe pid=1065 from=10.10.8.20
>> Mar 13 13:31:11 bws01 xinetd[4934]: EXIT: nrpe status=0 pid=1065 duration=0(sec)
>> Mar 13 13:32:27 bws01 kernel: o2net: connection to node bapp05 (num 8) at 10.10.16.15:7777 has been idle for 30.0 seconds, shutting it down.
>> Mar 13 13:32:27 bws01 kernel: (0,0):o2net_idle_timer:1476 here are some times that might help debug the situation: (tmr 1236965517.208305 now 1236965547.207461 dr 1236965517.208295 adv 1236965517.208311:1236965517.208312 func (ee9d109e:513) 1236965445.298207:1236965445.298219)
>> Mar 13 13:32:27 bws01 kernel: o2net: no longer connected to node bapp05 (num 8) at 10.10.16.15:7777
>> Mar 13 13:32:55 bws01 xinetd[4934]: START: nrpe pid=1068 from=10.10.8.20
>> Mar 13 13:32:55 bws01 xinetd[4934]: EXIT: nrpe status=0 pid=1068 duration=0(sec)
>> Mar 13 13:32:57 bws01 kernel: (4586,0):o2net_connect_expired:1637 ERROR: no connection established with node 8 after 30.0 seconds, giving up and returning errors.
>> Mar 13 13:33:00 bws01 kernel: (4586,0):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 8
>>
>> Why would this cause all nodes in the cluster to reboot? Seems to me that
>> it should have kicked out node 8 only...
>>
>> thanks
>> Andrew
>>
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> Ocfs2-users at oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users