[Ocfs2-users] How long for an fsck?

Sunil Mushran sunil.mushran at oracle.com
Thu Apr 21 09:50:31 PDT 2011


On 04/21/2011 06:43 AM, Josep Guerrero wrote:
> I have a cluster with 8 nodes, all running Debian Lenny (plus some additions
> so multipath and Infiniband work), which share an array of 48 1TB disks.
> Those disks form 22 hardware RAID1 pairs, plus 4 spares. The first 21 pairs
> are organized into two striped LVM logical volumes, of 16 and 3 TB, both
> formatted with ocfs2. The kernel is the version supplied with the
> distribution (2.6.26-2-amd64).
>
> I wanted to run an fsck on both volumes because of some errors I was getting
> (probably unrelated to the filesystems, but I wanted to check). On the 3TB
> volume (around 10% full) the check worked perfectly and finished in less
> than an hour (this was run with the fsck.ocfs2 from Lenny's ocfs2-tools,
> version 1.4.1):
>
<snip>

> but the check for the second filesystem (around 40% full) did this:
>
> ============
> hidra0:/usr/local/src# fsck.ocfs2 -f /dev/hidrahome/lvol0
> Checking OCFS2 filesystem in /dev/hidrahome/lvol0:
>    label:<NONE>
>    uuid:               6a a9 0e aa cf 33 45 4c b4 72 3a b6 7c 3b 8d 57
>    number of blocks:   4168098816
>    bytes per block:    4096
>    number of clusters: 4168098816
>    bytes per cluster:  4096
>    max slots:          8
>
> /dev/hidrahome/lvol0 was run with -f, check forced.
> Pass 0a: Checking cluster allocation chains
> =============
>
> and stayed there for 8 hours (keeping one core at around 100% CPU usage the
> whole time, with only a light load on the disks; this matched the same step
> in the previous run, except that run didn't take nearly as long). I thought
> I might have run into some bug, so I interrupted the process, downloaded the
> ocfs2-tools 1.4.4 sources, compiled them, and tried that fsck instead, with
> similar results: it has now been running for almost 7 hours like this:
>
> =============
> hidra0:/usr/local/src/ocfs2-tools-1.4.4/fsck.ocfs2# ./fsck.ocfs2 -f
> /dev/hidrahome/lvol0
> fsck.ocfs2 1.4.4
> Checking OCFS2 filesystem in /dev/hidrahome/lvol0:
>    Label:<NONE>
>    UUID:               6AA90EAACF33454CB4723AB67C3B8D57
>    Number of blocks:   4168098816
>    Block size:         4096
>    Number of clusters: 4168098816
>    Cluster size:       4096
>    Number of slots:    8
>
> /dev/hidrahome/lvol0 was run with -f, check forced.
> Pass 0a: Checking cluster allocation chains
>
> =============
>
> and again with one CPU core at 100%.
>
> Could someone tell me if this is normal? I've been searching the web and
> checking manuals for information on how long these checks should take, and
> apart from one message on this list mentioning that 3 days for an 8 TB
> filesystem with 300 GB used was too long, I haven't been able to find
> anything.
>
> If this is normal, is there any way to estimate how long the check should
> take on this filesystem, given that the first filesystem uses exactly the
> same disks and took less than an hour to check?

Do:
# debugfs.ocfs2 -R "stat //global_bitmap" /dev/hidrahome/lvol0

Does this hang too? Redirect the output to a file. That will give us some clues.
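
For example, to capture it (a sketch; the output filename is arbitrary):

# debugfs.ocfs2 -R "stat //global_bitmap" /dev/hidrahome/lvol0 > bitmap.out 2>&1

If it completes, bitmap.out should show the bitmap inode's chain records and
group counts, which is the structure Pass 0a is iterating over.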



