[Ocfs2-users] self fencing and system panic problem after forced reboot
Sunil Mushran
Sunil.Mushran at oracle.com
Fri Sep 15 09:26:02 PDT 2006
It depends where the media error occurs. If it is in the 1MB heartbeat file,
it will fence.
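A quick way to sanity-check that hypothesis is to read the start of the device yourself and watch for I/O errors. A minimal sketch, assuming the /dev/sda4 from the thread below; a scratch file stands in for the device so the commands are safely runnable:

```shell
# Stand-in "device" so this sketch is runnable without shared storage;
# on a real cluster, point DEV at the unmounted OCFS2 partition
# (here that would be /dev/sda4).
DEV=${DEV:-/tmp/fake_sda4}
dd if=/dev/zero of="$DEV" bs=4096 count=512 2>/dev/null

# Read the first 2 MB, which covers a 1 MB heartbeat area with margin.
# On a bad disk, dd exits non-zero and the kernel logs media-sense errors.
if dd if="$DEV" of=/dev/null bs=4096 count=512 2>/dev/null; then
    echo "heartbeat region readable"
else
    echo "read error - check dmesg for media errors"
fi
```

If the read stalls or errors here, o2hb will eventually miss its write deadline and fence, as described below.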
Eckenfels, Bernd wrote:
> Did you get read error (media sense or something like that) messages in
> the kernel log (dmesg) while using the debug tool? OCFS2 should really
> not kill the cluster in that case.
>
> Bernd
>
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com
> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Holger
> Brueckner
> Sent: Friday, September 15, 2006 10:21 AM
> To: Sunil Mushran
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] self fencing and system panic problem
> after forced reboot
>
> i guess i found the solution. while dumping some files with debugfs, it
> suddenly stopped working and could not be killed. and guess what: a
> media error on the drive :-/. funny that a filesystem check succeeds.
>
> anyway thx a lot to those who responded.
>
> holger
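Holger's find, a drive media error that only surfaces under real reads, is worth checking for explicitly. A minimal sketch; the dmesg lines below are invented examples, and the rdump paths are purely illustrative:

```shell
# Scan a saved kernel log for media errors. The two sample lines are
# invented stand-ins for what a failing drive typically logs.
cat > /tmp/dmesg.sample <<'EOF'
sd 0:0:0:0: [sda] Medium Error [current]
end_request: I/O error, dev sda, sector 591542
EOF

if grep -Eq 'Medium Error|I/O error' /tmp/dmesg.sample; then
    echo "media errors found"
    # Salvage what is still readable from the unmounted volume with the
    # rdump command mentioned in this thread (paths are hypothetical):
    #   debugfs.ocfs2 -R 'rdump /some/dir /root/salvage' /dev/sda4
fi
```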
>
> On Thu, 2006-09-14 at 11:03 -0700, Sunil Mushran wrote:
>
>> Not sure why a power outage should cause this.
>>
>> Do you have the full stack of the oops? It will show the times taken
>> in the last 24 operations in the hb thread. That should tell us as to
>> what is up.
>>
>> Holger Brueckner wrote:
>>
>>> i just discovered the ls, cd, dump and rdump commands in
>>> debugfs.ocfs2. they work fine :-). nevertheless i would really like
>>> to know why mounting and accessing the volume is not possible
>>> anymore.
>>>
>>> but thanks for the hint pieter
>>>
>>> holger brueckner
>>>
>>> On Thu, 2006-09-14 at 14:30 +0200, Pieter Viljoen - MWEB wrote:
>>>
>>>
>>>> Hi Holger
>>>>
>>>> Maybe you should try the fscat tools
>>>> (http://oss.oracle.com/projects/fscat/), which provide fsls (to
>>>> list) and fscp (to copy) directly from the device.
>>>>
>>>> I have not tried them yet, so good luck!
>>>>
>>>>
>>>> Pieter Viljoen
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: ocfs2-users-bounces at oss.oracle.com
>>>> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Holger
>>>> Brueckner
>>>> Sent: Thursday, September 14, 2006 14:17
>>>> To: ocfs2-users at oss.oracle.com
>>>> Subject: Re: [Ocfs2-users] self fencing and system panic problem
>>>> after forced reboot
>>>>
>>>> side note: setting HEARTBEAT_THRESHOLD to 30 did not help either.
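For context on that knob: with the default 2-second heartbeat interval, the o2hb write timeout works out to (threshold - 1) * 2 s, so the 12000 ms in the panic log further down corresponds to the default threshold of 7. A minimal sketch of the arithmetic; the /proc path and sysconfig variable are from the OCFS2 1.2-era documentation, so verify them on your kernel:

```shell
# o2hb dead threshold -> write timeout, assuming the default 2 s
# heartbeat interval: timeout = (threshold - 1) * 2000 ms.
threshold=7
echo "threshold=$threshold -> $(( (threshold - 1) * 2000 )) ms"   # 12000 ms

threshold=31
echo "threshold=$threshold -> $(( (threshold - 1) * 2000 )) ms"   # 60000 ms

# To persist a higher threshold (1.2-era mechanism, verify for your setup):
#   /etc/sysconfig/o2cb:  O2CB_HEARTBEAT_THRESHOLD=31
#   runtime:              echo 31 > /proc/fs/ocfs2_nodemanager/hb_dead_threshold
```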
>>>>
>>>> could it be that the synchronization between the daemons does not
>>>> work? (e.g. daemons think the fs is mounted on some nodes and try
>>>> to synchronize, but actually the fs isn't mounted on any node?)
>>>>
>>>> i'm rather clueless now. finding a way to access the data and copy
>>>> it to the non shared partitions would help me a lot.
>>>>
>>>> thx
>>>>
>>>> holger brueckner
>>>>
>>>>
>>>> On Thu, 2006-09-14 at 13:47 +0200, Holger Brueckner wrote:
>>>>
>>>>
>>>>>
>>>>> hello,
>>>>>
>>>>> i'm running ocfs2 to provide a shared disk throughout a xen
>>>>> cluster. this setup was working fine until today, when there was a
>>>>> power outage and all xen nodes were forcefully shut down. whenever
>>>>> i try to mount/access the ocfs2 partition the system panics and
>>>>> reboots:
>>>>>
>>>>> darks:~# fsck.ocfs2 -y -f /dev/sda4
>>>>> (617,0):__dlm_print_nodes:377 Nodes in my domain
>>>>> ("5BA3969FC2714FFEAD66033486242B58"):
>>>>> (617,0):__dlm_print_nodes:381 node 0
>>>>> Checking OCFS2 filesystem in /dev/sda4:
>>>>>   label:              <NONE>
>>>>>   uuid:               5b a3 96 9f c2 71 4f fe ad 66 03 34 86 24 2b 58
>>>>>   number of blocks:   35983584
>>>>>   bytes per block:    4096
>>>>>   number of clusters: 4497948
>>>>>   bytes per cluster:  32768
>>>>>   max slots:          4
>>>>>
>>>>> /dev/sda4 was run with -f, check forced.
>>>>> Pass 0a: Checking cluster allocation chains
>>>>> Pass 0b: Checking inode allocation chains
>>>>> Pass 0c: Checking extent block allocation chains
>>>>> Pass 1: Checking inodes and blocks.
>>>>> [CLUSTER_ALLOC_BIT] Cluster 295771 is marked in the global cluster
>>>>> bitmap but it isn't in use. Clear its bit in the bitmap? y
>>>>> [CLUSTER_ALLOC_BIT] Cluster 2456870 is marked in the global cluster
>>>>> bitmap but it isn't in use. Clear its bit in the bitmap? y
>>>>> [CLUSTER_ALLOC_BIT] Cluster 2683096 is marked in the global cluster
>>>>> bitmap but it isn't in use. Clear its bit in the bitmap? y
>>>>> Pass 2: Checking directory entries.
>>>>> Pass 3: Checking directory connectivity.
>>>>> Pass 4a: checking for orphaned inodes
>>>>> Pass 4b: Checking inodes link counts.
>>>>> All passes succeeded.
>>>>> darks:~# mount /data
>>>>> (622,0):ocfs2_initialize_super:1326 max_slots for this device: 4
>>>>> (622,0):ocfs2_fill_local_node_info:1019 I am node 0
>>>>> (622,0):__dlm_print_nodes:377 Nodes in my domain
>>>>> ("5BA3969FC2714FFEAD66033486242B58"):
>>>>> (622,0):__dlm_print_nodes:381 node 0
>>>>> (622,0):ocfs2_find_slot:261 slot 2 is already allocated to this node!
>>>>> (622,0):ocfs2_find_slot:267 taking node slot 2
>>>>> (622,0):ocfs2_check_volume:1586 File system was not unmounted
>>>>> cleanly, recovering volume.
>>>>> kjournald starting. Commit interval 5 seconds
>>>>> ocfs2: Mounting device (8,4) on (node 0, slot 2) with ordered data
>>>>> mode.
>>>>> (630,0):ocfs2_replay_journal:1181 Recovering node 2 from slot 0 on
>>>>> device (8,4)
>>>>> darks:~# (4,0):o2hb_write_timeout:164 ERROR: Heartbeat write timeout
>>>>> to device sda4 after 12000 milliseconds
>>>>> (4,0):o2hb_stop_all_regions:1789 ERROR: stopping heartbeat on all
>>>>> active regions.
>>>>> Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
>>>>> system by panicing
>>>>>
>>>>> ocfs2-tools 1.2.1-1
>>>>> kernel 2.6.16-xen (with the corresponding ocfs2 compiled into the
>>>>> kernel)
>>>>>
>>>>> i already tried the elevator=deadline scheduler option with no
>>>>> effect. any further help debugging this issue is greatly
>>>>> appreciated. are there any other possibilities to get access to the
>>>>> data from outside the cluster (obviously while the partition isn't
>>>>> mounted)?
>
>>>>> thanks for your help
>>>>>
>>>>> holger brueckner
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Ocfs2-users mailing list
>>>>> Ocfs2-users at oss.oracle.com
>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>>
>>>>>