[Ocfs2-users] self fencing and system panicproblem afterforced reboot

Fri Sep 15 01:47:01 PDT 2006

sd 1:0:0:0: SCSI error: return code = 0x8000002
sdb: Current: sense key: Medium Error
    Additional sense: Address mark not found for data field
end_request: I/O error, dev sdb, sector 157164961
ata2: translated ATA stat/err 0x51/01 to SCSI SK/ASC/ASCQ 0x3/13/00
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x01 { AddrMarkNotFound }
ata2: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x40 { UncorrectableError }

and so on, some media sense errors, too

On Fri, 2006-09-15 at 10:32 +0200, Eckenfels. Bernd wrote:
> Did you get read error (media sense or something like that) messages in
> the kernel log (dmesg) while using the debug tool. Ocfs2 should really
> not kill the cluster in that case.
> 
> Bernd 
> 
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com
> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Holger
> Brueckner
> Sent: Friday, September 15, 2006 10:21 AM
> To: Sunil Mushran
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] self fencing and system panicproblem
> afterforced reboot
> 
> i guess i found the solution. while dumping some files with debugfs, it
> suddenly stopped working and could not be killed. and guess what, media
> error on the drive :-/. funny that a filesystem check succeeds.
> 
> anyway thx a lot to those who responded.
> 
> holger
> 
> On Thu, 2006-09-14 at 11:03 -0700, Sunil Mushran wrote:
> > Not sure why a power outage should cause this.
> > 
> > Do you have the full stack of the oops? It will show the times taken 
> > in the last 24 operations in the hb thread. That should tell us as to 
> > what is up.
> > 
> > Holger Brueckner wrote:
> > > i just discovered the ls, cd, dump and rdump commands in
> debugfs.ocfs2.
> > > they work fine :-). neverless i would really like to know why 
> > > mounting and accessing the volume is not possible anymore.
> > >
> > > but thanks for the hint pieter
> > >
> > > holger brueckner
> > >
> > > On Thu, 2006-09-14 at 14:30 +0200, Pieter Viljoen - MWEB wrote:
> > >   
> > >> Hi Holger
> > >>
> > >> Maybe you should try the fscat tools
> > >> (http://oss.oracle.com/projects/fscat/) - which has a fsls (to 
> > >> list) and fscp (to copy) directly from the device.
> > >>
> > >> I have not tried it yet, so good luck!
> > >>
> > >>
> > >> Pieter Viljoen
> > >>  
> > >>
> > >> -----Original Message-----
> > >> From: ocfs2-users-bounces at oss.oracle.com
> > >> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Holger 
> > >> Brueckner
> > >> Sent: Thursday, September 14, 2006 14:17
> > >> To: ocfs2-users at oss.oracle.com
> > >> Subject: Re: [Ocfs2-users] self fencing and system panic problem 
> > >> afterforced reboot
> > >>
> > >> side note: setting HEARBEAT_THRESHOLD to 30 did not help either.
> > >>
> > >> could it be that the syncronization between the daemons does not
> work?
> > >> (e.g daemons think fs is mounted on some nodes and try to 
> > >> synchonize but actually the fs isn't mounted on any node?)
> > >>
> > >> i'm rather clueless now. finding a way to access the data and copy 
> > >> it to the non shared partitions would help me a lot.
> > >>
> > >> thx
> > >>
> > >> holger brueckner
> > >>
> > >>
> > >> On Thu, 2006-09-14 at 13:47 +0200, Holger Brueckner wrote:
> > >>     
> > >>> X-CS-3-Report: plain
> > >>>
> > >>>
> > >>> hello,
> > >>>
> > >>> i'm running ocfs2 to provide a shared disk thoughout a xen
> cluster.
> > >>> this setup was working fine until today where there was an power
> > >>>       
> > >> outage
> > >>     
> > >>> and all xen nodes where forcefully shut down. whenever i try to 
> > >>> mount/access the ocfs2 partition the system panics and reboots:
> > >>>
> > >>> darks:~# fsck.ocfs2 -y -f /dev/sda4
> > >>> (617,0):__dlm_print_nodes:377 Nodes in my domain
> > >>> ("5BA3969FC2714FFEAD66033486242B58"):
> > >>> (617,0):__dlm_print_nodes:381  node 0 Checking OCFS2 filesystem in
> 
> > >>> /dev/sda4:
> > >>>   label:              <NONE>
> > >>>   uuid:               5b a3 96 9f c2 71 4f fe ad 66 03 34 86 24 2b
> 58
> > >>>   number of blocks:   35983584
> > >>>   bytes per block:    4096
> > >>>   number of clusters: 4497948
> > >>>   bytes per cluster:  32768
> > >>>   max slots:          4
> > >>>
> > >>> /dev/sda4 was run with -f, check forced.
> > >>> Pass 0a: Checking cluster allocation chains Pass 0b: Checking 
> > >>> inode allocation chains Pass 0c: Checking extent block allocation 
> > >>> chains Pass 1: Checking inodes and blocks.
> > >>> [CLUSTER_ALLOC_BIT] Cluster 295771 is marked in the global cluster
> 
> > >>> bitmap but it isn't in use.  Clear its bit in the bitmap? y 
> > >>> [CLUSTER_ALLOC_BIT] Cluster 2456870 is marked in the global 
> > >>> cluster bitmap but it isn't in use.  Clear its bit in the bitmap? 
> > >>> y [CLUSTER_ALLOC_BIT] Cluster 2683096 is marked in the global 
> > >>> cluster bitmap but it isn't in use.  Clear its bit in the bitmap? 
> > >>> y Pass 2: Checking directory entries.
> > >>> Pass 3: Checking directory connectivity.
> > >>> Pass 4a: checking for orphaned inodes Pass 4b: Checking inodes 
> > >>> link counts.
> > >>> All passes succeeded.
> > >>> darks:~# mount /data
> > >>> (622,0):ocfs2_initialize_super:1326 max_slots for this device: 4
> > >>> (622,0):ocfs2_fill_local_node_info:1019 I am node 0
> > >>> (622,0):__dlm_print_nodes:377 Nodes in my domain
> > >>> ("5BA3969FC2714FFEAD66033486242B58"):
> > >>> (622,0):__dlm_print_nodes:381  node 0
> > >>> (622,0):ocfs2_find_slot:261 slot 2 is already allocated to this
> node!
> > >>> (622,0):ocfs2_find_slot:267 taking node slot 2
> > >>> (622,0):ocfs2_check_volume:1586 File system was not unmounted 
> > >>> cleanly, recovering volume.
> > >>> kjournald starting.  Commit interval 5 seconds
> > >>> ocfs2: Mounting device (8,4) on (node 0, slot 2) with ordered data
> > >>>       
> > >> mode.
> > >>     
> > >>> (630,0):ocfs2_replay_journal:1181 Recovering node 2 from slot 0 on
> 
> > >>> device (8,4) darks:~# (4,0):o2hb_write_timeout:164 ERROR: 
> > >>> Heartbeat write timeout
> > >>>       
> > >> to
> > >>     
> > >>> device sda4 after 12000 milliseconds
> > >>> (4,0):o2hb_stop_all_regions:1789 ERROR: stopping heartbeat on all
> > >>>       
> > >> active
> > >>     
> > >>> regions.
> > >>> Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
> 
> > >>> system by panicing
> > >>>
> > >>> ocfs2-tools    1.2.1-1
> > >>> kernel         2.6.16-xen (with corresponding ocfs2 compiled into
> the
> > >>>                kernel)
> > >>>
> > >>> i already tried the elevator=deadline scheduler option with no
> effect.
> > >>> any further help debugging this issue is greatly appreciated. are 
> > >>> ther any other possibilities to get access to the data from 
> > >>> outside the cluster (obviously while the partition isn't mounted)
> ?
> > >>>
> > >>> thanks for your help
> > >>>
> > >>> holger brueckner
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> _______________________________________________
> > >>> Ocfs2-users mailing list
> > >>> Ocfs2-users at oss.oracle.com
> > >>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> > >>>       
> > >>
> > >> _______________________________________________
> > >> Ocfs2-users mailing list
> > >> Ocfs2-users at oss.oracle.com
> > >> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> > >>     
> > >
> > >
> > > _______________________________________________
> > > Ocfs2-users mailing list
> > > Ocfs2-users at oss.oracle.com
> > > http://oss.oracle.com/mailman/listinfo/ocfs2-users
> > >   
> 
> 
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> 
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users