[Ocfs2-users] Unexplained reboots in DRBD82 + OCFS2 setup

Thu Jun 25 10:21:11 PDT 2009

Use netconsole. I have always had success with it. If o2cb is
fencing, you will see the message in the netconsole logs.
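
If no syslog daemon is available on the receiving side, a small UDP
listener is enough to catch the messages. What follows is only a rough
sketch in Python, not something from the ocfs2 docs: it assumes
netconsole on the crashing nodes is already configured to send to this
host on UDP port 6666 (netconsole's default target port), and the names
open_listener, format_record and capture are made up for illustration.

```python
import socket
from datetime import datetime


def open_listener(port=6666, host="0.0.0.0"):
    """Bind a UDP socket on the port netconsole is assumed to target."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def format_record(data, sender, now):
    """Prefix one netconsole packet with a receive time and the sender IP."""
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    return f"{now.isoformat(timespec='seconds')} {sender} {text}"


def capture(sock, logfile, max_packets=None):
    """Append every received packet to logfile, flushing after each one."""
    received = 0
    while max_packets is None or received < max_packets:
        data, (sender, _port) = sock.recvfrom(65535)
        logfile.write(format_record(data, sender, datetime.now()) + "\n")
        logfile.flush()
        received += 1
```

Run it on a third machine that is not part of the cluster, for example
with capture(open_listener(), open("netconsole.log", "a")), so the log
survives when both nodes reset.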

If the timeout change did not help, then it could be some other
issue. It may not be an ocfs2 issue at all. But the starting point
for any diagnosis would have to be the netconsole logs.
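
As a sanity check on the timeout change: O2CB_HEARTBEAT_THRESHOLD is
a count of 2-second heartbeat iterations, not a number of seconds, and
the ocfs2 faq gives the relation as threshold = (timeout in secs / 2) + 1.
A small Python sketch of that arithmetic (the function names are mine,
for illustration only):

```python
HEARTBEAT_INTERVAL_SECS = 2  # o2cb disk heartbeat period, per the ocfs2 faq


def threshold_to_timeout(threshold):
    """Seconds without a disk heartbeat before the node is declared dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def timeout_to_threshold(timeout_secs):
    """O2CB_HEARTBEAT_THRESHOLD value needed for a given timeout in seconds."""
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


# 31 is the value that corresponds to the 60 sec default; 120 and 240
# are the values tried in this thread.
for threshold in (31, 120, 240):
    print(threshold, "->", threshold_to_timeout(threshold), "secs")
```

So a threshold of 120 or 240 already allows several minutes without a
heartbeat, which is one more reason to look at the netconsole output
rather than raising the value further.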

Kris Buytaert wrote:
> On Wed, 2009-06-24 at 12:02 -0700, Sunil Mushran wrote:
>   
>> Do you have a separate network path for drbd traffic? If you do
>> not, then you are probably overloading the network. In this case,
>> I believe drbd is unable to replicate the I/Os fast enough and thus
>> is blocking the o2cb disk heartbeat. One workaround is to raise
>> O2CB_HEARTBEAT_THRESHOLD so that the heartbeat timeout exceeds the
>> default of 60 secs. Refer to the ocfs2 faq or ocfs2 1.4 user's
>> guide for more on this.
>>
>>     
> I've already modified the O2CB_HEARTBEAT_THRESHOLD to different values
> (120, 240 etc.), with no change.
>
>
>   
>> And if you want to capture the logs, set up netconsole.
>>
>>     
> /dev/console is a serial device connected to a terminal server. So far
> the best I got was a partial timestamp before I saw the output of the
> reboot again.
>
> It tries to log, but doesn't finish writing it :( But mostly there is
> no activity at all on the serial console :(
>
> Any other ideas?
>
> greetings
>
>
> Kris 
>
>
>
>
>   
>> Kris Buytaert wrote:
>>     
>>> We're trying to set up a dual-primary DRBD environment, with a shared
>>> disk with either OCFS2 or GFS. The environment is CentOS 5.3 with
>>> DRBD82 (but also tried with DRBD83 from testing).
>>>
>>> Setting up a single-primary disk and running bonnie++ on it works.
>>> Setting up a dual-primary disk, only mounting it on one node (ext3) and
>>> running bonnie++ works.
>>>
>>> When setting up ocfs2 on the /dev/drbd0 disk and mounting it on both
>>> nodes, basic functionality seems in place, but usually less than 5-10
>>> minutes after I start bonnie++ as a test on one of the nodes, both
>>> nodes power cycle with no errors in the logfiles, just a crash.
>>>
>>> When at the console at the time of the crash, it looks like disk I/O
>>> blocks (you can type, but no actions happen), then a reboot: no panics,
>>> no oops, nothing. (sysctl panic values set to timeouts etc.)
>>> Setting up a dual-primary disk with ocfs2, only mounting it on one node
>>> and starting bonnie++, causes only that node to crash.
>>>
>>> On DRBD level I get the following error when that node disappears:
>>>
>>> drbd0: PingAck did not arrive in time.
>>> drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure )
>>> pdsk(UpToDate -> DUnknown )
>>> drbd0: asender terminated
>>> drbd0: Terminating asender thread
>>>
>>> That however is an expected error because of the reboot.
>>>
>>> At first I assumed OCFS2 to be the root of this problem, so I moved
>>> forward and set up an iSCSI target on a 3rd node, and used that device
>>> with the same OCFS2 setup. There no crashes occurred and bonnie++
>>> flawlessly completed its test run.
>>>
>>> So my attention went back to the combination of DRBD and OCFS2.
>>>
>>> I tried both DRBD 8.2 (drbd82-8.2.6-1.el5.centos, kmod-drbd82-8.2.6-2) and
>>> the 83 variant from CentOS Testing.
>>>
>>> At first I was trying with the ocfs2 1.4.1-1.el5.i386.rpm version, but
>>> upgrading to 1.4.2-1.el5.i386.rpm didn't change the behaviour.
>>>
>>>
>>> Does anyone have an idea on this?
>>> How can we get more debug info from OCFS2, apart from heartbeat
>>> tracing, which hasn't taught me anything yet, in order to potentially
>>> file a valuable bug report?
>>>   
>>>       
>
>