[Ocfs2-tools-devel] Patch of backup superblocks for ocfs2-tools.

Thu Nov 30 17:33:06 PST 2006

Hi Tao,

Firstly a patch protocol comment. It is very difficult and time consuming to
read one massive patch. It is much easier if things are broken up into
smaller, logical units. Even though the total number of lines changed might
be the same, it's much much easier to review and verify changes in a logical
order, rather than jumping around a ~1000 line patch. This works better for
all of us - you will get better and faster review, and I'll keep more of my
hair :)

There are several tools available that can help you manage a stack of
patches. 'quilt' seems to be a very popular choice these days.

For something like this, breaking the patch up into one core libocfs2
change, and then individual program updates would be desirable.

http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt

Has more details on how to break up patches.

On Wed, Nov 29, 2006 at 04:50:56PM +0800, tao.ma wrote:
> --- fsck.ocfs2/fsck.c	(revision 1267)
> +++ fsck.ocfs2/fsck.c	(working copy)
> @@ -66,6 +66,8 @@
>  #include "problem.h"
>  #include "util.h"
>  
> +#define ARRAY_LEN(arr) (sizeof(arr) / sizeof(arr[0]))
> +
>  int verbose = 0;

Instead of us defining our own version of this functionality and keeping it
local to fsck, can you just copy ARRAY_SIZE out of a kernel tree (should be
in include/linux/kernel.h) and put it in libocfs2/include/ocfs2.h?

The kernel one in particular gets the parens around the last reference of
'arr' correct.

> @@ -410,12 +441,60 @@ out:
>  	return ret;
>  }
>  
> +static errcode_t recover_backup_sb(o2fsck_state *ost, char* device, int sb_num)
> +{
> +	errcode_t ret;
> +	uint64_t offsets[OCFS2_MAX_BACKUP_SUPERBLOCKS], blksize, sb;
> +	ocfs2_filesys *fs = NULL;
> +
> +	if (sb_num < 1 || sb_num > OCFS2_MAX_BACKUP_SUPERBLOCKS)
> +		return -1;
> + 	
> +	ret = ocfs2_get_backup_sb_offset(NULL, offsets, ARRAY_LEN(offsets));
> +	if (ret)
> +		goto bail;
> +
> +	/* iterate all the blocksize to get the right one. */
> +	for (blksize = OCFS2_MIN_BLOCKSIZE;
> +		blksize <= OCFS2_MAX_BLOCKSIZE;	blksize <<= 1) {
> +		sb = offsets[sb_num - 1] / blksize;
> +		ret = ocfs2_open(device, OCFS2_FLAG_RW, sb, blksize, &fs);

Ok, I assume then that you're expecting the i_blkno and s_blocksize checks
in ocfs2_open() to tell us whether we've hit upon the correct blocksize or
not. This wasn't super-duper obvious to me at first, so if you don't mind, a
comment indicating that to be the case would help future fsck.ocfs2 hackers.

> --- fsck.ocfs2/pass1.c	(revision 1267)
> +++ fsck.ocfs2/pass1.c	(working copy)
> @@ -1005,6 +1005,8 @@ static void write_cluster_alloc(o2fsck_s
>  	errcode_t ret;
>  	uint64_t blkno, last_cbit, cbit, cbit_found;
>  	struct ocfs2_cluster_group_sizes cgs;
> +	uint64_t blocks[OCFS2_MAX_BACKUP_SUPERBLOCKS], *backup = NULL, bpc;
> +	struct ocfs2_super_block *super = OCFS2_RAW_SB(ost->ost_fs->fs_super);
>  
>  	ocfs2_calc_cluster_groups(ost->ost_fs->fs_clusters,
>  				  ost->ost_fs->fs_blocksize, &cgs);
> @@ -1033,6 +1035,21 @@ static void write_cluster_alloc(o2fsck_s
>  		goto out;
>  	}
>  
> +	/* handle the condition of backup superblock. */
> +	memset(&blocks, 0, sizeof(blocks));
> +	bpc = ost->ost_fs->fs_clustersize / ost->ost_fs->fs_blocksize;
> +	if (super->s_feature_compat & OCFS2_FEATURE_COMPAT_BACKUP_SB) {
> +		ret = ocfs2_get_backup_sb_offset(ost->ost_fs, blocks, 
> +					 sizeof(blocks) / sizeof(blocks[0]));

This appears to be an open coded ARRAY_SIZE()...

> +		if (ret) {
> +			/* Here we just output a warning and continue. */
> +			com_err(whoami, ret, "while get backup superblock "
> +				"offset");

What happens if you ignore an error like this? Won't fsck then reclaim the
space allocated to the backup super blocks? That's definitly not something
we should ignore IMHO.

Hmm, it seems that ocfs2_get_backup_sb_offset() can just be a void function.
Noted below.

> +		}
> +		else
> +			backup = blocks;
> +	}
> +
>  	/* we walk our found blocks bitmap to find clusters that we think
>  	 * are in use.  each time we find a block in a cluster we skip ahead
>  	 * to the first block of the next cluster when looking for the next.
> @@ -1044,6 +1061,10 @@ static void write_cluster_alloc(o2fsck_s
>  	 * we special case the number of clusters as the cluster offset which
>  	 * indicates that the rest of the bits to the end of the bitmap
>  	 * should be clear.
> +	 *
> +	 * we should take backup superblock as a special case since it doesn't
> +	 * belong to any inode. So it shouldn't be exist in
> +	 * ost->ost_allocated_clusters.
>  	 */
>  	for (last_cbit = 0, cbit = 0;
>  	     cbit < ost->ost_fs->fs_clusters; 
> @@ -1069,7 +1090,12 @@ static void write_cluster_alloc(o2fsck_s
>  
>  		/* clear set bits that should have been clear up to cbit */
>  		while (cbit_found < cbit) {
> -			force_cluster_bit(ost, ci, cbit_found, 0);
> +			/* check the backup superblock */
> +			if (backup && *backup &&
> +				cbit_found == *backup / bpc)
> +					backup++;
> +			else
> +				force_cluster_bit(ost, ci, cbit_found, 0);

Hmm, I'm not sure I completely understand this section of code... What
happens if the 1st value of cbit_found is past the 1st backup superblock
location pointed to by 'backup', but matches to one later in the array?

> --- libocfs2/backup_sb.c	(revision 0)
> +++ libocfs2/backup_sb.c	(revision 0)
> @@ -0,0 +1,192 @@
> +/* -*- mode: c; c-basic-offset: 8; -*-
> + * vim: noexpandtab sw=8 ts=8 sts=0:
> + *
> + * backup_sb.c
> + *
> + * Backup superblocks for an OCFS2 volume.
> + *
> + * Copyright (C) 2006 Oracle.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License, version 2,  as published by the Free Software Foundation.
> + * 
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + * 
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + *
> + */
> +
> +#include <errno.h>
> +#include "ocfs2.h"
> +
> +/* The byte offset of the first backup block will be 1G.
> + * The following will be 4G, 16G, 256G and 1T.
> + */
> +#define START_OFFSET	1 << 30

This should go in ocfs2_fs.h - the kernel code will likely need it when
updating super blocks during an online resize.

Rename it to something more descriptive, like OCFS2_BACKUP_SB_START.

I like the comment there btw - very descriptive, but quite concise.

> +
> +#undef min
> +#define min(a,b)	((a) < (b) ? (a) : (b))

We already have ocfs2_min() and ocfs2_max() in ocfs2.h, which also do type
checking. Please use those instead.

> +/* In case we don't have fs_blocksize, we will return
> + * byte offsets and let the caller calculate them by itself.
> + */
> +int ocfs2_get_backup_sb_offset(ocfs2_filesys *fs, 
> +			       uint64_t *offsets, size_t len)

This function never returns error, so it should be void.

> +{
> +	size_t i;
> +	uint64_t blkno, byteoff = START_OFFSET;
> +	uint32_t blocksize;
> +	
> +	memset(offsets, 0, sizeof(uint64_t) * len);
> +	len = min(len, OCFS2_MAX_BACKUP_SUPERBLOCKS);
> +
> +	if (fs)
> +		blocksize = fs->fs_blocksize;
> +	else
> +		blocksize = 1;
> +
> +	for (i = 0; i < len; i++) {
> +		blkno = byteoff / blocksize;
> +		if (fs && fs->fs_blocks <= blkno)
> +			break;
> +	
> +		offsets[i] = blkno;
> +		byteoff = byteoff << 2;

Can we take the calculations here and put them into ocfs2_fs.h as well?
Follow the conventions of functions like ocfs2_group_bitmap_size() where we
have a userspace and a kernel version. This way we keep the code to handle
this in one place, and the kernel can just use it when the time comes.

I'm specifically thinking that you pass in a struct super_block or an int
blocksize and an index and the function returns the block number based on
that index.

Maybe something like this for kernel:

static inline u64 ocfs2_backup_sb_blkno(struct super_block *sb, int index)
{
	u64 off = OCFS2_BACKUP_SB_START << (2 * index);

	return off >> sb->s_blocksize_bits;
}

Please double check my bit math :) I'll let you do the userspace version -
it looks like it might have the additional worry of an unkown blocksize,
which you could probably handle by just passing '1' like you do above.

> +/* 
> + * This function will backup superblock in the given blocks.
> + * And add the compat flag to superblock if needed.
> + */
> +errcode_t ocfs2_set_backup_sb(ocfs2_filesys *fs, uint64_t *blocks, size_t len)
> +{
> +	size_t i;
> +	errcode_t ret = 0;
> +	char *buf = NULL;
> +	uint64_t bm_blk, sb_blk, *blkno = blocks;
> +	int val;
> +	uint32_t compat;
> +	struct ocfs2_dinode *di = NULL;
> +	struct ocfs2_super_block *sb = NULL;
> +	uint32_t bpc = fs->fs_clustersize / fs->fs_blocksize;
> +
> +	if (!len || !blocks || !*blocks)
> +		goto bail;
> +	len = min(len, OCFS2_MAX_BACKUP_SUPERBLOCKS);
> +	if (!fs->fs_cluster_alloc) {
> +		ret = ocfs2_lookup_system_inode(fs, GLOBAL_BITMAP_SYSTEM_INODE,
> +						0, &bm_blk);
> +		if (ret)
> +			goto bail;
> +
> +		ret = ocfs2_read_cached_inode(fs, bm_blk, &fs->fs_cluster_alloc);
> +		if (ret)
> +			goto bail;
> +
> +		ret = ocfs2_load_chain_allocator(fs, fs->fs_cluster_alloc);
> +		if (ret)
> +			goto bail;
> +	}
> +
> +	if (!(OCFS2_RAW_SB(fs->fs_super)->s_feature_compat &
> +				OCFS2_FEATURE_COMPAT_BACKUP_SB)) {
> +		ret = verify_block_allocation(fs->fs_cluster_alloc->ci_chains,
> +						    bpc, blocks, &len);
> +		if (ret)
> +			goto bail;
> +	}
> +
> +	/* we get the superblock information from the disk. */
> +	sb_blk = fs->fs_super->i_blkno;
> +
> +	ret = ocfs2_malloc_block(fs->fs_io, &buf);
> +	if (ret)
> +		goto bail;
> +
> +	ret = io_read_block(fs->fs_io, sb_blk, 1, buf);
> +	if (ret)
> +		goto bail;	
> +
> +	di = (struct ocfs2_dinode *)buf;
> +	sb = OCFS2_RAW_SB(di);
> +
> +	compat = le32_to_cpu(sb->s_feature_compat);
> +
> +	if (!(compat & OCFS2_FEATURE_COMPAT_BACKUP_SB))
> +		sb->s_feature_compat |=
> +				 cpu_to_le32(OCFS2_FEATURE_COMPAT_BACKUP_SB);
> +
> +	for (i = 0; i < len; i++, blkno++) {
> +		di->i_blkno = cpu_to_le32(*blkno);

That needs to be a cpu_to_le64()...

> --- libocfs2/include/ocfs2.h	(revision 1267)
> +++ libocfs2/include/ocfs2.h	(working copy)
> @@ -79,7 +79,8 @@
>  
>  #define OCFS2_LIB_FEATURE_RO_COMPAT_SUPP	OCFS2_FEATURE_RO_COMPAT_SUPP
>  
> -#define OCFS2_LIB_FEATURE_COMPAT_SUPP		OCFS2_FEATURE_COMPAT_SUPP
> +#define OCFS2_LIB_FEATURE_COMPAT_SUPP		(OCFS2_FEATURE_COMPAT_SUPP |	\
> +						 OCFS2_FEATURE_COMPAT_BACKUP_SB)	
>  
>  /* Flags for the ocfs2_filesys structure */
>  #define OCFS2_FLAG_RO			0x00
> @@ -178,6 +179,9 @@
>  #define OCFS2_CHB_WAITING	2
>  #define OCFS2_CHB_COMPLETE	3
>  
> +/* the max backup superblock nums */
> +#define OCFS2_MAX_BACKUP_SUPERBLOCKS	6

Please also put this in ocfs2_fs.h.

>  typedef void (*ocfs2_chb_notify)(int state, char *progress, void *data);
>  
>  typedef struct _ocfs2_filesys ocfs2_filesys;
> @@ -608,6 +612,20 @@ errcode_t ocfs2_encode_lockres(enum ocfs
>  errcode_t ocfs2_decode_lockres(char *lockres, int len, enum ocfs2_lock_type *type,
>  			       uint64_t *blkno, uint32_t *generation);
>  
> +/* get the blkno according to the file system info.
> + * The unused ones, depending on the volume size, are zeroed. 
> + */
> +int ocfs2_get_backup_sb_offset(ocfs2_filesys *fs, 
> +			       uint64_t *blocks, size_t len);
> +
> +/* This function will get the superblock pointed to by fs and copy it to
> + * the blocks. But first it will ensure all the appropriate clusters are free.
> + * If not, it will error out with ENOSPC. If free, it will set bits for all 
> + * the clusters, zero the full cluster and write the backup sb. 
> + * In case of updating, it will override the backup blocks with the newest
> + * superblock information.
> + */
> +errcode_t ocfs2_set_backup_sb(ocfs2_filesys *fs, uint64_t *blocks, size_t len);

Again, good job on commenting things :)

One question though - you say there that ocfs2_set_backup_sb() will be
zeroing the clusters as they are allocated, but I don't see where it's doing
that.

> --- libocfs2/include/ocfs2_fs.h	(revision 1267)
> +++ libocfs2/include/ocfs2_fs.h	(working copy)
> @@ -103,6 +103,12 @@
>   */
>  #define OCFS2_FEATURE_INCOMPAT_RESIZE_INPROG    0x0004
>  
> +/* 
> + * backup superblock flag is used to indicate that this volume
> + * has backup superblocks.
> + */
> +#define OCFS2_FEATURE_COMPAT_BACKUP_SB		0x0008
> +

You don't need to start at 0x8, you can start at 0x1 considering it's our
1st compat flag (and they go in a seperate disk field from incompat). 

Also, that should be added to "OCFS2_FEATURE_COMPAT_SUPP"

> --- mkfs.ocfs2/mkfs.c	(revision 1267)
> +++ mkfs.ocfs2/mkfs.c	(working copy)
> @@ -84,6 +84,7 @@ static AllocGroup * initialize_alloc_gro
>  					   uint16_t bpc);
>  static void create_lost_found_dir(State *s);
>  static void format_journals(State *s);
> +static void format_backup_sb(State *s);
>  
>  extern char *optarg;
>  extern int optind, opterr, optopt;
> @@ -438,6 +439,14 @@ main(int argc, char **argv)
>  	if (!s->quiet)
>  		printf("done\n");
>  
> +	if (!s->quiet)
> +		printf("Writing backup superblock: ");
> +
> +	format_backup_sb(s);
> +
> +	if (!s->quiet)
> +		printf("done\n");
> +
>  	if (!s->hb_dev) {
>  		/* These routines use libocfs2 to do their work. We
>  		 * don't share an ocfs2_filesys context between the
> @@ -503,6 +512,7 @@ get_state(int argc, char **argv)
>  	State *s;
>  	int c;
>  	int verbose = 0, quiet = 0, force = 0, xtool = 0, hb_dev = 0;
> +	int backup = 0;
>  	int show_version = 0;
>  	char *device_name;
>  	int ret;
> @@ -521,6 +531,7 @@ get_state(int argc, char **argv)
>  		{ "journal-options", 0, 0, 'J'},
>  		{ "heartbeat-device", 0, 0, 'H'},
>  		{ "force", 0, 0, 'F'},
> +		{ "num-backup", 0, 0, 'n'},

Can you please just remove this option - it's confusingly named, and we're
not honoring it completely. The mke2fs manpage states:

       -n     causes  mke2fs  to not actually create a filesystem, but display
              what it would do if it were to create a filesystem.  This can be
              used  to  determine the location of the backup superblocks for a
              particular filesystem, so long as  the  mke2fs  parameters that
              were  passed when the filesystem was originally created are used
              again.  (With the -n option added, of course!)

What you seem to do if '-n' is passed is simply print backup sb locations
while formatting. This means that someone who expects '-n' to not actually
change the block device will get a very nasty surprise when using it on
mkfs.ocfs2.

If we want to print the backup super block locations, we can just always
print them. The way ours work however, it shouldn't ever be necessary for a
user to remember block offsets anyway, so I wonder how useful such a feature
would be.

> --- debugfs.ocfs2/commands.c	(revision 1267)
> +++ debugfs.ocfs2/commands.c	(working copy)
> @@ -293,6 +293,70 @@ static int process_inodestr_args(char **
>  }
>  
>  /*
> + * process_open_args
> + *
> + */
> +static int process_open_args(char **args, 
> +			     uint64_t *superblock, uint64_t *blocksize)
> +{
> +	errcode_t ret;
> +	long s;
> +	char *ptr;
> +	int ind = 2;
> +	ocfs2_filesys *fs = NULL;
> +	uint64_t byte_off[OCFS2_MAX_BACKUP_SUPERBLOCKS], sb, blksize;
> +
> +	*superblock = 0;
> +	*blocksize = 0;
> +
> +	if (!args[ind])
> +		return 0;
> +
> +	if (args[ind] && !strcmp(args[ind], "-s"))
> +		ind++;
> +	else
> +		return -1;
> +
> +	if(!args[ind])
> +		return -1;
> +		
> +	s = strtol(args[ind], &ptr, 0);
> +
> +	if (s < 1 || s > OCFS2_MAX_BACKUP_SUPERBLOCKS) {
> +		fprintf (stderr, "num exceeds limit\n");

Maybe we want to print to the user what those limits are? Something like 

"backup super block is outside of valid range (between 1 and 6)"

We could probably do even better than that.

One last nit. There's a bunch of lines added with trailing whitespace.
tpp.txt (referenced above) has a script which you should run to remove
those.

Other than the comments above, I'd say your approach to this task seems quite
good. Thanks so far!
	--mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com