[Ocfs2-users] ocfs2_search_chain: Group Descriptor has bad signature

Mon Jul 31 19:29:04 PDT 2006

Which version of OCFS2 is on the nodes? Run "modinfo ocfs2" on all of them.

The version of OCFS2 shipped with SLES9 SP3 varies with the kernel.
Are you using the modules shipped by SUSE or building them yourself?

Vladan Gunjic wrote:
> I've got a strange issue with the following configuration:
>
> We are using Oracle 10gR2 with an EMC CX500 with FC drives and two LUNs
> configured (one RAID5, one RAID1/0). We have a five-node OCFS2 cluster
> (four nodes are SLES9 SP3 64-bit, kernel 2.6.5-7.252-smp; one node is
> SLES9 SP3 32-bit, kernel 2.6.5-7.257-bigsmp). The latest available OCFS2
> is installed on all machines (RPMs: ocfs2console-1.2.1-4.2,
> ocfs2-tools-1.2.1-4.2).
> Since we currently run Oracle 10gR2 on other 32-bit machines, we wanted
> to migrate two of them into Oracle RAC, with our new SAN as the backing
> storage. Therefore I created OCFS2 filesystems on the two LUNs (from the
> 64-bit machines) and connected all five machines in the OCFS2 cluster:
> - The 32-bit machine mounts both LUNs (and acts as a standby for our
> other existing production Oracle databases, unrelated to the five
> machines described here).
> - Two 64-bit machines mount one of the LUNs (RAID5) and form one of the
> two Oracle RACs.
> - The other two 64-bit machines mount the other LUN (RAID1/0) and form
> the second Oracle RAC.
>
> To avoid a long downtime for the switchover, the idea is to use the
> 32-bit standbys, convert them to 64-bit, and run them under the 64-bit
> Oracle RACs. We tested this scenario and it worked well.
> Now we have made the final layout of the SAN (more disks in the LUNs,
> etc.), and while building the standby, one of the LUNs was suddenly
> remounted read-only and I got the following in dmesg:
>
> OCFS2: ERROR (device emcpowere1): ocfs2_search_chain: Group Descriptor #
> 0 has bad signature File system is now read-only due to the potential of
> on-disk corruption. Please run fsck.ocfs2 once the file system is
> unmounted.
> (9727,3):ocfs2_claim_suballoc_bits:1157 ERROR: status = -5
> (9727,3):ocfs2_claim_clusters:1392 ERROR: status = -5
> (9727,3):ocfs2_local_alloc_new_window:852 ERROR: status = -5
> (9727,3):ocfs2_local_alloc_slide_window:959 ERROR: status = -5
> (9727,3):ocfs2_reserve_local_alloc_bits:515 ERROR: status = -5
> (9727,3):ocfs2_reserve_clusters:592 ERROR: status = -5
> (9727,3):ocfs2_extend_file:836 ERROR: status = -5
> (9727,3):ocfs2_write_lock_maybe_extend:689 ERROR: status = -5
> (9727,3):ocfs2_write_lock_maybe_extend:693 ERROR: Failed to extend inode
> 262690 from 0 to 512
>
> After unmounting and running fsck.ocfs2, I found a lot of errors:
>
> Checking OCFS2 filesystem in /dev/emcpowere1:
>   label:              <NONE>
>   uuid:               19 a2 94 f5 91 5d 4c ca be 2f c2 51 21 65 6e 2c
>   number of blocks:   175172744
>   bytes per block:    4096
>   number of clusters: 21896593
>   bytes per cluster:  32768
>   max slots:          4
> Pass 0a: Checking cluster allocation chains
> [CHAIN_LINK_MAGIC] Chain 85 in allocator at inode 23 contains a
> reference at depth 1 to block 84639744 which doesn't have a valid
> checksum.  Truncate this chain? <y>
> [CHAIN_BITS] Chain 85 in allocator inode 23 has 64716 bits marked free
> out of 96768 total bits but the block groups in the chain have 206 free
> out of 32256 total.  Fix this by updating the chain record? <y>
> [CHAIN_LINK_MAGIC] Chain 113 in allocator at inode 23 contains a
> reference at depth 2 to block 154570752 which doesn't have a valid
> checksum.  Truncate this chain? <y>
> [CHAIN_BITS] Chain 113 in allocator inode 23 has 64509 bits marked free
> out of 96768 total bits but the block groups in the chain have 32254
> free out of 64512 total.  Fix this by updating the chain record? <y>
> [CHAIN_LINK_MAGIC] Chain 241 in allocator at inode 23 contains a
> reference at depth 0 to block 62189568 which doesn't have a valid
> checksum.  Truncate this chain? <y>
> [CHAIN_BITS] Chain 241 in allocator inode 23 has 64510 bits marked free
> out of 64512 total bits but the block groups in the chain have 0 free
> out of 0 total.  Fix this by updating the chain record? <y>
> [CHAIN_GROUP_BITS] Allocator inode 23 has 6215157 bits marked used out
> of 21896593 total bits but the chains have 6215152 used out of 21735313
> total.  Fix this by updating the inode counts? <y>
> [CHAIN_I_CLUSTERS] Allocator inode 23 has 21735313 clusters represented
> in its allocator chains but has an i_clusters value of 21896593. Fix
> this by updating i_clusters? <y>
> [CHAIN_I_SIZE] Allocator inode 23 has 21735313 clusters represented in
> its allocator chain which accounts for 712222736384 total bytes, but its
> i_size is 717507559424. Fix this by updating i_size? <y>
> [GROUP_EXPECTED_DESC] Block 62189568 should be a group descriptor for
> the bitmap chain allocator but it wasn't found in any chains.
> Reinitialize it as a group desc and link it into the bitmap allocator?
> <y>
> [GROUP_EXPECTED_DESC] Block 84639744 should be a group descriptor for
> the bitmap chain allocator but it wasn't found in any chains.
> Reinitialize it as a group desc and link it into the bitmap allocator?
> <y>
> [GROUP_EXPECTED_DESC] Block 124895232 should be a group descriptor for
> the bitmap chain allocator but it wasn't found in any chains.
> Reinitialize it as a group desc and link it into the bitmap allocator?
> <y> y
> [GROUP_EXPECTED_DESC] Block 147345408 should be a group descriptor for
> the bitmap chain allocator but it wasn't found in any chains.
> Reinitialize it as a group desc and link it into the bitmap allocator?
> <y> y
> [GROUP_EXPECTED_DESC] Block 154570752 should be a group descriptor for
> the bitmap chain allocator but it wasn't found in any chains.
> Reinitialize it as a group desc and link it into the bitmap allocator?
> <y> y
> Pass 0b: Checking inode allocation chains
> Pass 0c: Checking extent block allocation chains
> Pass 1: Checking inodes and blocks.
> [CLUSTER_ALLOC_BIT] Cluster 2774016 is in use but isn't set in the
> global cluster bitmap. Set its bit in the bitmap? <y> y
> pass1: Bit does not exist in bitmap range while trying to set bit
> 2774016 in the cluster bitmap
> [CLUSTER_ALLOC_BIT] Cluster 2774017 is in use but isn't set in the
> global cluster bitmap. Set its bit in the bitmap? <y> y
> .....
>
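The fsck.ocfs2 figures quoted above are internally consistent, which a little arithmetic can confirm. The sketch below copies the numbers from the output; the 32256-clusters-per-group figure is an assumption inferred from the CHAIN_BITS totals (32256, 64512 = 2 x 32256, 96768 = 3 x 32256), not something fsck prints. Under that assumption, the 161280 clusters missing from the allocator chains are exactly five block groups, matching the five GROUP_EXPECTED_DESC blocks, and each of those five block numbers falls on a group boundary.

```python
# Values copied from the fsck.ocfs2 output quoted above.
BLOCK_SIZE = 4096
CLUSTER_SIZE = 32768
NUM_BLOCKS = 175172744
HEADER_CLUSTERS = 21896593   # "number of clusters" and the old i_clusters
CHAIN_CLUSTERS = 21735313    # clusters still reachable through the chains
HEADER_USED = 6215157        # bits marked used in allocator inode 23
CHAIN_USED = 6215152         # bits used according to the chains
ORPHANED_GROUP_BLOCKS = [62189568, 84639744, 124895232, 147345408, 154570752]

# Assumption: every block group holds 32256 clusters (inferred from the
# CHAIN_BITS totals, not printed by fsck.ocfs2 itself).
CLUSTERS_PER_GROUP = 32256


def check_fsck_figures():
    blocks_per_cluster = CLUSTER_SIZE // BLOCK_SIZE  # 8
    assert HEADER_CLUSTERS * blocks_per_cluster == NUM_BLOCKS

    # CHAIN_I_SIZE: both i_size values are just clusters * cluster size.
    assert CHAIN_CLUSTERS * CLUSTER_SIZE == 712222736384
    assert HEADER_CLUSTERS * CLUSTER_SIZE == 717507559424

    # Clusters that dropped out of the chains, expressed in whole groups.
    missing = HEADER_CLUSTERS - CHAIN_CLUSTERS
    missing_groups, remainder = divmod(missing, CLUSTERS_PER_GROUP)

    # Each orphaned descriptor block should sit on a group boundary.
    blocks_per_group = CLUSTERS_PER_GROUP * blocks_per_cluster
    group_indexes = []
    for blk in ORPHANED_GROUP_BLOCKS:
        index, offset = divmod(blk, blocks_per_group)
        assert offset == 0, blk
        group_indexes.append(index)

    return {
        "missing_clusters": missing,
        "missing_groups": missing_groups,
        "remainder": remainder,
        "used_bit_delta": HEADER_USED - CHAIN_USED,
        "group_indexes": group_indexes,
    }


if __name__ == "__main__":
    print(check_fsck_figures())
```

The used-bit difference also comes out as 5, which is consistent with (but does not prove) one used bit per lost group. The arithmetic only shows that whole groups were cut off from the chains; it says nothing about why their descriptors lost their signatures.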
> I couldn't detect any hardware errors: no PowerPath (SAN path-failover
> software), fibre, or SAN FC drive errors. I used a cluster size of only
> 32K; could that be the problem, given that the device holds a few
> hundred GB of large Oracle files? I used the following mount options:
> _netdev,datavolume on the 32-bit machine and _netdev,datavolume,nointr
> on the RAC machines, as recommended. Interestingly, the other LUN on the
> same 32-bit machine has no issues, even though it is bigger and contains
> 150GB more data. Could some 32-bit limit have been reached?
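As a rough sanity check on the 32-bit question, the sketch below compares the sizes reported by fsck.ocfs2 with the signed and unsigned 32-bit ranges. Block and cluster counts on this volume fit comfortably in 32 bits; the volume size in bytes does not. This is only arithmetic: it cannot tell whether any code path on the 32-bit node actually stores these values in 32-bit variables, so it neither confirms nor rules out a 32-bit limit.

```python
# Sizes taken from the fsck.ocfs2 header quoted above.
NUM_BLOCKS = 175172744
NUM_CLUSTERS = 21896593
CLUSTER_SIZE = 32768
VOLUME_BYTES = NUM_CLUSTERS * CLUSTER_SIZE  # 717507559424 bytes

INT32_MAX = 2**31 - 1
UINT32_MAX = 2**32 - 1


def fits(value):
    """Report whether value fits in a signed/unsigned 32-bit integer."""
    return {"int32": value <= INT32_MAX, "uint32": value <= UINT32_MAX}


def report():
    return {
        "blocks": fits(NUM_BLOCKS),
        "clusters": fits(NUM_CLUSTERS),
        "bytes": fits(VOLUME_BYTES),
    }


if __name__ == "__main__":
    for name, result in report().items():
        print(name, result)
```

So any 32-bit problem would have to involve byte offsets (or some other derived quantity), not the block or cluster numbers themselves.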
>
>
> Vladan
>  
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>