[Ocfs2-users] self fencing and system panic problem after forced reboot
Sunil Mushran
Sunil.Mushran at oracle.com
Fri Sep 15 09:26:02 PDT 2006
It depends where the media error occurs. If it is in the 1MB heartbeat file,
it will fence.
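A quick way to sanity-check that hypothesis is to read the start of the device yourself and watch for I/O errors. A minimal sketch, assuming the /dev/sda4 from the thread below; a scratch file stands in for the device so the commands are safely runnable:

```shell
# Stand-in "device" so this sketch is runnable without shared storage;
# on a real cluster, point DEV at the unmounted OCFS2 partition
# (here that would be /dev/sda4).
DEV=${DEV:-/tmp/fake_sda4}
dd if=/dev/zero of="$DEV" bs=4096 count=512 2>/dev/null

# Read the first 2 MB, which covers a 1 MB heartbeat area with margin.
# On a bad disk, dd exits non-zero and the kernel logs media-sense errors.
if dd if="$DEV" of=/dev/null bs=4096 count=512 2>/dev/null; then
    echo "heartbeat region readable"
else
    echo "read error - check dmesg for media errors"
fi
```

If the read stalls or errors here, o2hb will eventually miss its write deadline and fence, as described below.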
Eckenfels, Bernd wrote:
> Did you get read error (media sense or something like that) messages in
> the kernel log (dmesg) while using the debug tool? OCFS2 should really
> not kill the cluster in that case.
>
> Bernd
>
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com
> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Holger
> Brueckner
> Sent: Friday, September 15, 2006 10:21 AM
> To: Sunil Mushran
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] self fencing and system panic problem
> after forced reboot
>
> i guess i found the solution. while dumping some files with debugfs, it
> suddenly stopped working and could not be killed. and guess what: a
> media error on the drive :-/. funny that a filesystem check succeeds.
>
> anyway thx a lot to those who responded.
>
> holger
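Holger's find, a drive media error that only surfaces under real reads, is worth checking for explicitly. A minimal sketch; the dmesg lines below are invented examples, and the rdump paths are purely illustrative:

```shell
# Scan a saved kernel log for media errors. The two sample lines are
# invented stand-ins for what a failing drive typically logs.
cat > /tmp/dmesg.sample <<'EOF'
sd 0:0:0:0: [sda] Medium Error [current]
end_request: I/O error, dev sda, sector 591542
EOF

if grep -Eq 'Medium Error|I/O error' /tmp/dmesg.sample; then
    echo "media errors found"
    # Salvage what is still readable from the unmounted volume with the
    # rdump command mentioned in this thread (paths are hypothetical):
    #   debugfs.ocfs2 -R 'rdump /some/dir /root/salvage' /dev/sda4
fi
```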
>
> On Thu, 2006-09-14 at 11:03 -0700, Sunil Mushran wrote:
>
>> Not sure why a power outage should cause this.
>>
>> Do you have the full stack of the oops? It will show the times taken
>> in the last 24 operations in the hb thread. That should tell us as to
>> what is up.
>>
>> Holger Brueckner wrote:
>>
>>> i just discovered the ls, cd, dump and rdump commands in
>>> debugfs.ocfs2. they work fine :-). nevertheless i would really like
>>> to know why mounting and accessing the volume is not possible
>>> anymore.
>>>
>>> but thanks for the hint pieter
>>>
>>> holger brueckner
>>>
>>> On Thu, 2006-09-14 at 14:30 +0200, Pieter Viljoen - MWEB wrote:
>>>
>>>
>>>> Hi Holger
>>>>
>>>> Maybe you should try the fscat tools
>>>> (http://oss.oracle.com/projects/fscat/), which provide fsls (to
>>>> list) and fscp (to copy) directly from the device.
>>>>
>>>> I have not tried them yet, so good luck!
>>>>
>>>>
>>>> Pieter Viljoen
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: ocfs2-users-bounces at oss.oracle.com
>>>> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Holger
>>>> Brueckner
>>>> Sent: Thursday, September 14, 2006 14:17
>>>> To: ocfs2-users at oss.oracle.com
>>>> Subject: Re: [Ocfs2-users] self fencing and system panic problem
>>>> after forced reboot
>>>>
>>>> side note: setting HEARTBEAT_THRESHOLD to 30 did not help either.
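For context on that knob: with the default 2-second heartbeat interval, the o2hb write timeout works out to (threshold - 1) * 2 s, so the 12000 ms in the panic log further down corresponds to the default threshold of 7. A minimal sketch of the arithmetic; the /proc path and sysconfig variable are from the OCFS2 1.2-era documentation, so verify them on your kernel:

```shell
# o2hb dead threshold -> write timeout, assuming the default 2 s
# heartbeat interval: timeout = (threshold - 1) * 2000 ms.
threshold=7
echo "threshold=$threshold -> $(( (threshold - 1) * 2000 )) ms"   # 12000 ms

threshold=31
echo "threshold=$threshold -> $(( (threshold - 1) * 2000 )) ms"   # 60000 ms

# To persist a higher threshold (1.2-era mechanism, verify for your setup):
#   /etc/sysconfig/o2cb:  O2CB_HEARTBEAT_THRESHOLD=31
#   runtime:              echo 31 > /proc/fs/ocfs2_nodemanager/hb_dead_threshold
```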
>>>>
>>>> could it be that the synchronization between the daemons does not
>>>> work? (e.g. daemons think the fs is mounted on some nodes and try
>>>> to synchronize, but actually the fs isn't mounted on any node?)
>>>>
>>>> i'm rather clueless now. finding a way to access the data and copy
>>>> it to the non shared partitions would help me a lot.
>>>>
>>>> thx
>>>>
>>>> holger brueckner
>>>>
>>>>
>>>> On Thu, 2006-09-14 at 13:47 +0200, Holger Brueckner wrote:
>>>>
>>>>
>>>>>
>>>>> hello,
>>>>>
>>>>> i'm running ocfs2 to provide a shared disk throughout a xen
>>>>> cluster. this setup was working fine until today, when there was a
>>>>> power outage and all xen nodes were forcefully shut down. whenever
>>>>> i try to mount/access the ocfs2 partition the system panics and
>>>>> reboots:
>>>>>
>>>>> darks:~# fsck.ocfs2 -y -f /dev/sda4
>>>>> (617,0):__dlm_print_nodes:377 Nodes in my domain
>>>>> ("5BA3969FC2714FFEAD66033486242B58"):
>>>>> (617,0):__dlm_print_nodes:381 node 0
>>>>> Checking OCFS2 filesystem in /dev/sda4:
>>>>>   label:              <NONE>
>>>>>   uuid:               5b a3 96 9f c2 71 4f fe ad 66 03 34 86 24 2b 58
>>>>>   number of blocks:   35983584
>>>>>   bytes per block:    4096
>>>>>   number of clusters: 4497948
>>>>>   bytes per cluster:  32768
>>>>>   max slots:          4
>>>>>
>>>>> /dev/sda4 was run with -f, check forced.
>>>>> Pass 0a: Checking cluster allocation chains
>>>>> Pass 0b: Checking inode allocation chains
>>>>> Pass 0c: Checking extent block allocation chains
>>>>> Pass 1: Checking inodes and blocks.
>>>>> [CLUSTER_ALLOC_BIT] Cluster 295771 is marked in the global cluster
>>>>> bitmap but it isn't in use. Clear its bit in the bitmap? y
>>>>> [CLUSTER_ALLOC_BIT] Cluster 2456870 is marked in the global cluster
>>>>> bitmap but it isn't in use. Clear its bit in the bitmap? y
>>>>> [CLUSTER_ALLOC_BIT] Cluster 2683096 is marked in the global cluster
>>>>> bitmap but it isn't in use. Clear its bit in the bitmap? y
>>>>> Pass 2: Checking directory entries.
>>>>> Pass 3: Checking directory connectivity.
>>>>> Pass 4a: checking for orphaned inodes
>>>>> Pass 4b: Checking inodes link counts.
>>>>> All passes succeeded.
>>>>> darks:~# mount /data
>>>>> (622,0):ocfs2_initialize_super:1326 max_slots for this device: 4
>>>>> (622,0):ocfs2_fill_local_node_info:1019 I am node 0
>>>>> (622,0):__dlm_print_nodes:377 Nodes in my domain
>>>>> ("5BA3969FC2714FFEAD66033486242B58"):
>>>>> (622,0):__dlm_print_nodes:381 node 0
>>>>> (622,0):ocfs2_find_slot:261 slot 2 is already allocated to this node!
>>>>> (622,0):ocfs2_find_slot:267 taking node slot 2
>>>>> (622,0):ocfs2_check_volume:1586 File system was not unmounted
>>>>> cleanly, recovering volume.
>>>>> kjournald starting. Commit interval 5 seconds
>>>>> ocfs2: Mounting device (8,4) on (node 0, slot 2) with ordered data
>>>>> mode.
>>>>> (630,0):ocfs2_replay_journal:1181 Recovering node 2 from slot 0 on
>>>>> device (8,4)
>>>>> darks:~# (4,0):o2hb_write_timeout:164 ERROR: Heartbeat write timeout
>>>>> to device sda4 after 12000 milliseconds
>>>>> (4,0):o2hb_stop_all_regions:1789 ERROR: stopping heartbeat on all
>>>>> active regions.
>>>>> Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
>>>>> system by panicing
>>>>>
>>>>> ocfs2-tools 1.2.1-1
>>>>> kernel 2.6.16-xen (with the corresponding ocfs2 compiled into the
>>>>> kernel)
>>>>>
>>>>> i already tried the elevator=deadline scheduler option with no
>>>>> effect. any further help debugging this issue is greatly
>>>>> appreciated. are there any other possibilities to get access to the
>>>>> data from outside the cluster (obviously while the partition isn't
>>>>> mounted)?
>
>>>>> thanks for your help
>>>>>
>>>>> holger brueckner
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Ocfs2-users mailing list
>>>>> Ocfs2-users at oss.oracle.com
>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>>
>>>>>