[Ocfs2-users] o2net: connect to node has been idle for 10 secs

Sunil Mushran Sunil.Mushran at oracle.com
Mon Aug 7 10:45:31 PDT 2006


I am assuming that the other node was alive and that you are using a
private interface. So, the only case that is left is that o2net is
timing out for no apparent reason.

If that is the case, then before we go down the more intrusive
module-update route, it may be better to start with a tcpdump. Run the
following on both nodes:

# tcpdump -i <DEVICE> -C 50 -W 3 -s 10000 -Sw /tmp/tcpdump.log -ttt 'port 7777' &

This will create three 50M files and use them as a rotating buffer.
The next time the problem happens, email me the location of the last
log file for both nodes.
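
To find which of the rotated files was written last, something like the
following may help. It is only a sketch: latest_capture is a made-up
helper name, and it assumes tcpdump appends a number to the -w name
when -C and -W are used, so it matches every file starting with
/tmp/tcpdump.log and picks the newest by modification time.

```shell
# Hypothetical helper (not part of tcpdump or ocfs2-tools):
# print the most recently modified capture file.
# The prefix defaults to the -w path used in the tcpdump command above.
latest_capture() {
    prefix="${1:-/tmp/tcpdump.log}"
    # Assumption: rotated files are named <prefix><number>, so match
    # everything that starts with the prefix. ls -t sorts newest first.
    ls -t "$prefix"* 2>/dev/null | head -n 1
}
```

Running latest_capture with no argument on each node should print the
file to report. It prints nothing if no capture file exists yet.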

Andy Phillips wrote:
> Hello,
>
>      Well we had the same problem again;
>
> o2net: connection to node barney (num 0) at 172.16.6.10:7777
> has been idle for 10 seconds, shutting it down.
>
> kernel: (0,0):o2net_idle_timer:1309 here are some times that might help
> debug the situation: (tmr 1154932284.14757 now 1154932294.13147 dr
> 1154932284.14717 adv 1154932284.14767:1154932284.14768 func (06aac8a1:1)
> 1154932279.15062:1154932279.15068)
>
>     We upgraded to 1.2.3. And it almost immediately died again with the
> same error. Our cron job that touches a file every 3 seconds did not
> seem to make much difference. This is now quite a serious problem for
> us.
>
>     Any suggestions as to how to take this forward? 
>
>     Sunil, what do you need from us to roll a custom debugging build? 
> Can we run the custom build on node 2 and leave the existing build on
> node 1, which is now production?
>
>     Andy
>
>
>   
>>>>> Aug  2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
>>>>> 172.16.6.10:7777 has been idle for 10 seconds, shutting it down.
>>>>> Aug  2 19:06:27 fred kernel: (0,7):o2net_idle_timer:1309 here are some
>>>>> times that might help debug the situation: (tmr 1154545576.798263 now
>>>>> 1154545586.796978 dr 1154545576.798238 adv
>>>>> 1154545576.798291:1154545576.798293 func (06aac8a1:1)
>>>>> 1154545566.800782:1154545566.800787)
>>>>> Aug  2 19:06:27 fred kernel: o2net: no longer connected to node barney
>>>>> (num 0) at 172.16.6.10:7777
>>>>> Aug  2 19:08:33 fred kernel: (25,7):o2quo_make_decision:143 ERROR:
>>>>> fencing this node because it is connected to
>>>>> a half-quorum of 1 out of 2 nodes which doesn't include the lowest
>>>>> active node 0
>>>>> Aug  2 19:08:33 fred kernel: (25,7):o2hb_stop_all_regions:1908 ERROR:
>>>>> stopping heartbeat on all active regions.


