[Ocfs2-users] ocfs2_search_chain: Group Descriptor has bad signature

Mon Jul 31 04:17:04 PDT 2006

I've got a strange issue with the following configuration:

We use Oracle 10gR2 with an EMC CX500 with FC drives and 2 LUNs
configured (one RAID5, one RAID1/0). We have a 5-node ocfs2 cluster (4
nodes are SLES9 SP3 64-bit, kernel 2.6.5-7.252-smp; one node is SLES9
SP3 32-bit, 2.6.5-7.257-bigsmp). The latest available OCFS2 is installed
on all machines (RPMs: ocfs2console-1.2.1-4.2, ocfs2-tools-1.2.1-4.2).
As we currently run Oracle 10gR2 on other 32-bit machines, we wanted to
migrate two such machines into Oracle RAC, with our new SAN as the
storage behind it. Therefore I created ocfs2 filesystems on the two LUNs
(from the 64-bit machines) and connected all five machines in an OCFS2
cluster:
- The 32-bit machine mounts both LUNs (and acts as a standby for our
other existing production Oracle instances, unrelated to the 5 machines
described here).
- 2 64-bit machines mount one of the LUNs (RAID5) and form one of the
two Oracle RACs.
- 2 more 64-bit machines mount the other LUN (RAID1/0) and form the
second of the two Oracle RACs.

As we want to avoid a long downtime for the switch, the idea is to use
the 32-bit standbys, convert them to 64-bit and use them under the
64-bit Oracle RACs. We tested this scenario and it worked well.
Now we have made the final layout of the SAN (more disks in the LUNs,
etc.), and while building the standby one of the LUNs was suddenly
mounted read-only and I got the following in dmesg:

OCFS2: ERROR (device emcpowere1): ocfs2_search_chain: Group Descriptor #
0 has bad signature File system is now read-only due to the potential of
on-disk corruption. Please run fsck.ocfs2 once the file system is
unmounted.
(9727,3):ocfs2_claim_suballoc_bits:1157 ERROR: status = -5
(9727,3):ocfs2_claim_clusters:1392 ERROR: status = -5
(9727,3):ocfs2_local_alloc_new_window:852 ERROR: status = -5
(9727,3):ocfs2_local_alloc_slide_window:959 ERROR: status = -5
(9727,3):ocfs2_reserve_local_alloc_bits:515 ERROR: status = -5
(9727,3):ocfs2_reserve_clusters:592 ERROR: status = -5
(9727,3):ocfs2_extend_file:836 ERROR: status = -5
(9727,3):ocfs2_write_lock_maybe_extend:689 ERROR: status = -5
(9727,3):ocfs2_write_lock_maybe_extend:693 ERROR: Failed to extend inode
262690 from 0 to 512
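
For anyone reading the trace: kernel functions report failure as a negated errno value, so status = -5 corresponds to errno 5, which is EIO on Linux. A minimal Python sketch to decode such a value, assuming it is run on a Linux host (errno numbering is platform-specific; decode_status is just an illustrative helper name):

```python
import errno
import os


def decode_status(status: int) -> str:
    """Map a negative kernel-style status (e.g. -5) to its errno name and text."""
    code = -status
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# Every line in the trace above carries the same status value.
print(decode_status(-5))
```

The same value at every level of the call chain suggests one I/O-style failure being passed back up unchanged rather than several independent errors.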

After unmounting and running fsck I found a lot of errors:

Checking OCFS2 filesystem in /dev/emcpowere1:
  label:              <NONE>
  uuid:               19 a2 94 f5 91 5d 4c ca be 2f c2 51 21 65 6e 2c
  number of blocks:   175172744
  bytes per block:    4096
  number of clusters: 21896593
  bytes per cluster:  32768
  max slots:          4
Pass 0a: Checking cluster allocation chains
[CHAIN_LINK_MAGIC] Chain 85 in allocator at inode 23 contains a
reference at depth 1 to block 84639744 which doesn't have a valid
checksum.  Truncate this chain? <y>
[CHAIN_BITS] Chain 85 in allocator inode 23 has 64716 bits marked free
out of 96768 total bits but the block groups in the chain have 206 free
out of 32256 total.  Fix this by updating the chain record? <y>
[CHAIN_LINK_MAGIC] Chain 113 in allocator at inode 23 contains a
reference at depth 2 to block 154570752 which doesn't have a valid
checksum.  Truncate this chain? <y>
[CHAIN_BITS] Chain 113 in allocator inode 23 has 64509 bits marked free
out of 96768 total bits but the block groups in the chain have 32254
free out of 64512 total.  Fix this by updating the chain record? <y>
[CHAIN_LINK_MAGIC] Chain 241 in allocator at inode 23 contains a
reference at depth 0 to block 62189568 which doesn't have a valid
checksum.  Truncate this chain? <y>
[CHAIN_BITS] Chain 241 in allocator inode 23 has 64510 bits marked free
out of 64512 total bits but the block groups in the chain have 0 free
out of 0 total.  Fix this by updating the chain record? <y>
[CHAIN_GROUP_BITS] Allocator inode 23 has 6215157 bits marked used out
of 21896593 total bits but the chains have 6215152 used out of 21735313
total.  Fix this by updating the inode counts? <y>
[CHAIN_I_CLUSTERS] Allocator inode 23 has 21735313 clusters represented
in its allocator chains but has an i_clusters value of 21896593. Fix
this by updating i_clusters? <y>
[CHAIN_I_SIZE] Allocator inode 23 has 21735313 clusters represented in
its allocator chain which accounts for 712222736384 total bytes, but its
i_size is 717507559424. Fix this by updating i_size? <y>
[GROUP_EXPECTED_DESC] Block 62189568 should be a group descriptor for
the bitmap chain allocator but it wasn't found in any chains.
Reinitialize it as a group desc and link it into the bitmap allocator?
<y>
[GROUP_EXPECTED_DESC] Block 84639744 should be a group descriptor for
the bitmap chain allocator but it wasn't found in any chains.
Reinitialize it as a group desc and link it into the bitmap allocator?
<y>
[GROUP_EXPECTED_DESC] Block 124895232 should be a group descriptor for
the bitmap chain allocator but it wasn't found in any chains.
Reinitialize it as a group desc and link it into the bitmap allocator?
<y> y
[GROUP_EXPECTED_DESC] Block 147345408 should be a group descriptor for
the bitmap chain allocator but it wasn't found in any chains.
Reinitialize it as a group desc and link it into the bitmap allocator?
<y> y
[GROUP_EXPECTED_DESC] Block 154570752 should be a group descriptor for
the bitmap chain allocator but it wasn't found in any chains.
Reinitialize it as a group desc and link it into the bitmap allocator?
<y> y
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
[CLUSTER_ALLOC_BIT] Cluster 2774016 is in use but isn't set in the
global cluster bitmap. Set its bit in the bitmap? <y> y
pass1: Bit does not exist in bitmap range while trying to set bit
2774016 in the cluster bitmap
[CLUSTER_ALLOC_BIT] Cluster 2774017 is in use but isn't set in the
global cluster bitmap. Set its bit in the bitmap? <y> y
.....
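
The figures in the fsck output above can be cross-checked with a little arithmetic. The sketch below uses only numbers printed by fsck.ocfs2; the variable names are illustrative, and the reading that each group covers 32256 clusters is an assumption taken from the "total bits" values in the CHAIN_BITS messages.

```python
# Values copied from the fsck.ocfs2 output above.
BLOCK_SIZE = 4096
CLUSTER_SIZE = 32768
TOTAL_BLOCKS = 175172744
TOTAL_CLUSTERS = 21896593        # i_clusters of allocator inode 23
CLUSTERS_IN_CHAINS = 21735313    # clusters actually found in the chains
EXPECTED_DESC_BLOCKS = [62189568, 84639744, 124895232, 147345408, 154570752]

# Assumption: one block group covers 32256 clusters (from CHAIN_BITS totals).
BITS_PER_GROUP = 32256

blocks_per_cluster = CLUSTER_SIZE // BLOCK_SIZE
blocks_per_group = BITS_PER_GROUP * blocks_per_cluster

# Sizes reported in the CHAIN_I_SIZE message.
i_size = TOTAL_CLUSTERS * CLUSTER_SIZE
chain_bytes = CLUSTERS_IN_CHAINS * CLUSTER_SIZE

# How many whole groups are missing from the chains?
missing_clusters = TOTAL_CLUSTERS - CLUSTERS_IN_CHAINS
missing_groups, leftover = divmod(missing_clusters, BITS_PER_GROUP)

# Which group does each GROUP_EXPECTED_DESC block start?
group_numbers = [blk // blocks_per_group for blk in EXPECTED_DESC_BLOCKS]
on_boundary = all(blk % blocks_per_group == 0 for blk in EXPECTED_DESC_BLOCKS)

print("blocks per cluster :", blocks_per_cluster)
print("i_size             :", i_size)
print("bytes in chains    :", chain_bytes)
print("missing clusters   :", missing_clusters)
print("missing groups     :", missing_groups, "leftover:", leftover)
print("group numbers      :", group_numbers, "on boundary:", on_boundary)
```

Under that assumption the 161280 clusters missing from the chains are exactly 5 groups of 32256, which matches the five GROUP_EXPECTED_DESC blocks, and each of those blocks falls exactly on a group boundary.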

I couldn't detect any hardware errors: no PowerPath (SAN path failover
software), fibre or SAN FC drive errors. I used a cluster size of only
32k; can that be the problem, given that my device holds a couple of
hundred GB of big Oracle files? I had the following mount options:
_netdev,datavolume on the 32-bit machine and _netdev,datavolume,nointr
on the RAC machines, as recommended. To make it more interesting, the
other LUN on the same 32-bit machine has no issues, even though it's
bigger and contains 150GB more data.
Maybe some 32-bit limit was reached?
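
As a rough sanity check on the 32-bit question, the raw counts from the fsck header can be compared against 2^31 and 2^32. This is only a sketch: the 512-byte sector size is an assumption, the helper name is illustrative, and the check says nothing about overflow in intermediate arithmetic inside the driver, so it cannot rule a 32-bit problem in or out.

```python
SECTOR_SIZE = 512  # assumption: classic 512-byte sectors

# Counts taken from the fsck.ocfs2 header above.
counts = {
    "blocks": 175172744,
    "clusters": 21896593,
    "bytes": 175172744 * 4096,
    "sectors": 175172744 * 4096 // SECTOR_SIZE,
}


def fits(value: int, bits: int, signed: bool = False) -> bool:
    """Return True if value is representable in a bits-wide integer."""
    limit = 2 ** (bits - 1) if signed else 2 ** bits
    return value < limit


for name, value in counts.items():
    print(
        f"{name:9s} {value:>15d}  "
        f"fits signed 32: {fits(value, 32, signed=True)}  "
        f"fits unsigned 32: {fits(value, 32)}"
    )
```

Only the byte offset exceeds 32 bits here; the block, cluster and sector counts all fit even in a signed 32-bit integer.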

Vladan