[Ocfs2-devel] [PATCH 1/4] ocfs2: allocation reservations

Mark Fasheh mfasheh at suse.com
Tue Mar 16 20:13:14 PDT 2010


On Tue, Mar 16, 2010 at 05:17:35PM +0800, Tao Ma wrote:
> Hi Mark,
> 	The idea is cool. Some comments inlined.

Thanks, and thanks for the detailed review.


> > +static void ocfs2_check_resmap(struct ocfs2_reservation_map *resmap)
> > +{
> > +	unsigned int off = 0;
> > +	int i = 0;
> > +	struct rb_node *node;
> > +	struct ocfs2_alloc_reservation *resv;
> > +
> > +	node = rb_first(&resmap->m_reservations);
> > +	while (node) {
> > +		resv = rb_entry(node, struct ocfs2_alloc_reservation, r_node);
> > +
> > +		if (i > 0 && resv->r_start <= off) {
> > +			mlog(ML_ERROR, "reservation %d has bad start off!\n",
> > +			     i);
> > +			goto bad;
> > +		}
> > +
> > +		if (resv->r_len == 0) {
> > +			mlog(ML_ERROR, "reservation %d has no length!\n",
> > +			     i);
> > +			goto bad;
> > +		}
> > +
> > +		if (resv->r_start > ocfs2_resv_end(resv)) {
> > +			mlog(ML_ERROR, "reservation %d has invalid range!\n",
> > +			     i);
> > +			goto bad;
> > +		}
> > +
> > +		if (ocfs2_resv_end(resv) > resmap->m_bitmap_len) {
> I guess here should be ocfs2_resv_end(resv) >= resmap->m_bitmap_len?

Yes, good catch. I'll fix that up.


> > +static void ocfs2_resv_insert(struct ocfs2_reservation_map *resmap,
> > +			      struct ocfs2_alloc_reservation *new)
> > +{
> > +	struct rb_root *root = &resmap->m_reservations;
> > +	struct rb_node *parent = NULL;
> > +	struct rb_node **p = &root->rb_node;
> > +	struct ocfs2_alloc_reservation *tmp;
> > +
> > +	assert_spin_locked(&resv_lock);
> > +
> > +	mlog(0, "Insert reservation start: %u len: %u\n", new->r_start,
> > +	     new->r_len);
> > +
> > +	while(*p) {
> > +		parent = *p;
> > +
> > +		tmp = rb_entry(parent, struct ocfs2_alloc_reservation, r_node);
> > +
> > +		if (new->r_start < tmp->r_start)
> > +			p = &(*p)->rb_left;
> > +		else if (new->r_start > ocfs2_resv_end(tmp))
> > +			p = &(*p)->rb_right;
> > +		else {
> > +			/* This should never happen! */
> > +			mlog(ML_ERROR, "Duplicate reservation window!\n");
> > +			BUG();
> We bug out here in case r_start is in [tmp->r_start, 
> ocfs2_resv_end(tmp)]. Do we need to check ocfs2_resv_end(new) somewhere?

We don't really care about ocfs2_resv_end(new) for tree traversal purposes.
However, I think you have a good idea - reservations should never overlap,
and this would be a good place to check for code bugs. I'll add some extra
checks.


> > +{
> > +	/* Now we do a linear search for a window, starting at 'prev_rsv' */
> > +	while (1) {
> > +		next = rb_next(prev);
> > +		if (next) {
> > +			mlog(0, "One more resv found in linear search\n");
> > +			next_resv = rb_entry(next,
> > +					     struct ocfs2_alloc_reservation,
> > +					     r_node);
> > +
> > +			gap_start = ocfs2_resv_end(prev_resv) + 1;
> > +			gap_end = next_resv->r_start - 1;
> > +			gap_len = gap_end - gap_start + 1;
> > +		} else {
> > +			mlog(0, "No next node\n");
> > +			/*
> > +			 * We're at the rightmost edge of the
> > +			 * tree. See if a reservation between this
> > +			 * window and the end of the bitmap will work.
> > +			 */
> > +			gap_start = ocfs2_resv_end(prev_resv) + 1;
> > +			gap_len = resmap->m_bitmap_len - gap_start;
> > +			gap_end = resmap->m_bitmap_len - 1;
> > +		}
> > +
> > +		clen = ocfs2_resmap_find_free_bits(resmap, wanted, gap_start,
> > +						   gap_len, &cstart, &clen);
> I guess we can have some improvement here? In case gap_len <= best_len, 
> we don't need to call this function actually and just loop to the next 
> round.

Yep, fixed. That should save us some CPU :)


> > +		if (clen == wanted) {
> > +			best_len = clen;
> > +			best_start = cstart;
> > +			goto out_insert;
> > +		} else if (clen > best_len) {
> > +			best_len = clen;
> > +			best_start = cstart;
> > +		}
> > +
> > +		if (!next)
> > +			break;
> > +
> > +		prev = next;
> > +		prev_resv = rb_entry(prev, struct ocfs2_alloc_reservation,
> > +				     r_node);
> > +	}
> > +
> > +out_insert:
> > +	if (best_len) {
> > +		resv->r_start = best_start;
> > +		resv->r_len = best_len;
> > +		ocfs2_resv_insert(resmap, resv);
> > +	}
> > +}
> > +static void ocfs2_cannibalize_resv(struct ocfs2_reservation_map *resmap,
> > +				   struct ocfs2_alloc_reservation *resv,
> > +				   unsigned int wanted)
> > +{
> <snip>
> > +	} else {
> > +		unsigned int shrink = lru_resv->r_len / 2;
> > +
> > +		mlog(0, "shrink lru resv by %u\n", shrink);
> > +
> > +		lru_resv->r_len -= shrink;
> Do we need to change lru_resv->r_last_len here so that the next search 
> can start right after the shrinked zone? I am not sure.

No, I left them unchanged on purpose - they still record the last sucessful
allocation. In theory we could try adjusting them, but that'd be speculative
and not based on any actual allocation requests.


> > +
> > +		resv->r_start = ocfs2_resv_end(lru_resv) + 1;
> > +		resv->r_len = shrink;
> > +	}
> > +
> <snip>
> > +void ocfs2_resmap_claimed_bits(struct ocfs2_reservation_map *resmap,
> > +			       struct ocfs2_alloc_reservation *resv,
> > +			       u32 cstart, u32 clen)
> > +{
> > +	unsigned int cend = cstart + clen - 1;
> > +	int search, used_resv = 0;
> > +
> > +	if (resmap == NULL || ocfs2_resmap_disabled(resmap))
> > +		return;
> > +
> > +	if (resv == NULL)
> > +		return;
> > +
> > +	spin_lock(&resv_lock);
> > +
> > +	mlog(0, "claim bits: cstart: %u cend: %u clen: %u r_start: %u "
> > +	     "r_end: %u r_len: %u, r_last_start: %u r_last_len: %u\n",
> > +	     cstart, cend, clen, resv->r_start, ocfs2_resv_end(resv),
> > +	     resv->r_len, resv->r_last_start, resv->r_last_len);
> > +
> > +	/*
> > +	 * Find all reservations which contain (cstart-clen). This
> > +	 * could potentially be multiple ones, depending on the
> > +	 * allocation.
> I am just curious how this could happen with multiple reservations. ;)

Yeah, I realized after I sent the series -- this is old code that I'm going
to remove in my next round. It's back from when local alloc didn't always
honor reservations (which as I discussed, didn't work well). The only parts
of this whole loop that we need to keep is a call to
ocfs2_adjust_resv_from_alloc() for the passed in reservation.


> > +	 *
> > +	 * Each one needs to be adjusted. If the allocation overwrites
> > +	 * the whole window, we should just discard that reservation.
> > +	 *
> > +	 * If we hit the one passed in ('resv'), we treat it slightly
> > +	 * special by trying to extend it after adjustments have been
> > +	 * made.
> > +	 */
> > +
> > +	search = cstart;
> > +	while (search <= cend) {
> > +		struct ocfs2_alloc_reservation *r;
> > +		int inc;
> > +
> > +		/*
> > +		 * This will find us the leftmost resv containing any
> > +		 * bits between 'search' and 'cend'
> > +		 */
> > +		r = ocfs2_find_resv_containing(resmap, search, cend);
> > +		if (!r)
> > +			break;
> > +
> > +		inc = ocfs2_adjust_resv_from_alloc(resmap, r, search, cend);
> > +
> > +		/*
> > +		 * This will signal the next set of code that 'resv'
> > +		 * was used at least partially by the allocation.
> > +		 */
> > +		if (r == resv) {
> > +			used_resv = 1;
> I just found that used_resv isn't used anymore later.

Yep, this will go away as mentioned.


> > +struct ocfs2_reservation_map {
> > +	struct rb_root		m_reservations;
> > +	char			*m_disk_bitmap;
> > +
> > +	struct ocfs2_super	*m_osb;
> > +
> > +	/* The following are not initialized to meaningful values until a disk
> > +	 * bitmap is provided. */
> > +	u32			m_bitmap_len;	/* Number of valid
> > +						 * bits available */
> > +
> > +	unsigned int		m_search_start;	/* Records the end
> > +						 * location of our
> > +						 * most recent
> > +						 * allocation. */
> oh, this member isn't used actually.

Yeah another bit of old code. I'll remove.



> > +
> > +	struct list_head	m_lru;		/* LRU of reservations
> > +						 * structures. */
> > +
> > +};
> <snip>
> > @@ -1429,6 +1434,17 @@ static int ocfs2_parse_options(struct super_block *sb,
> >  			mopt->mount_opt |= OCFS2_MOUNT_NO_POSIX_ACL;
> >  			mopt->mount_opt &= ~OCFS2_MOUNT_POSIX_ACL;
> >  			break;
> > +		case Opt_resv_level:
> > +			if (is_remount)
> > +				break;
> > +			if (match_int(&args[0], &option)) {
> > +				status = 0;
> > +				goto bail;
> > +			}
> > +			if (option >= OCFS2_MIN_RESV_LEVEL &&
> > +			    option < OCFS2_MAX_RESV_LEVEL)
> > +				mopt->resv_level = option;
> Do we need to check cluster size here? I guess with a large cluster 
> size, say 1M, we need a very small window or even no window.
> ok, that's all for the codes.
> 
> Besides the comments, I have some other thoughts.
> 1. Do we need to some additional check in 
> ocfs2_reserve_local_alloc_bits? The old schema just check the condition 
> bits_wanted <= free_bits since local_alloc is contiguous enough. But 
> with reservation window, we can have a fragmented local alloc actually.
> 2. Current codes don't handle with space reservations(unwritten 
> extents). Maybe it is because unwritten will allocate contiguous clusters?
> 
> With these 2 problems, I corrupted the system somehow by the following 
> scripts(Sorry, Mark, ;) ).
> 
> echo 'y'|mkfs.ocfs2 -b 4K -C 4K --fs-features=local $DEVICE
> mount -t ocfs2 $DEVICE $MNT_DIR
> for((j=0;j<32;j++))
> do
>          FILE_NAME=$MNT_DIR/$RANDOM
>          for((i=0;i<63;i++))
>          do
>          cat /mnt/4096 >> $FILE_NAME #4096 is a file with 4096 bytes.
>          done
> done
> FILE_NAME=$MNT_DIR/$RANDOM
> ./rw_test -f $FILE_NAME -l 0 -u 8192 #this will create a file with an 
> unwritten extent of 2 clusters length at offset 0.
> umount $MNT_DIR
> 
> Now run fsck.ocfs2, you will find duplicated clusters.
> This is caused by the code here.
>  >	start = ocfs2_local_alloc_find_clear_bits(osb, alloc, bits_wanted,
>  >                                                  ac->ac_resv);
>  >        if (start == -1) {
>  >                /* TODO: Shouldn't we just BUG here? */
>  >                status = -ENOSPC;
>  >                mlog_errno(status);
>  >                goto bail;
>  >        }
>  >        printk("end of start %d\n", start);
>  >
>  >        bitmap = la->la_bitmap;
>  >        *bit_off = le32_to_cpu(la->la_bm_off) + start;
>  >        /* local alloc is always contiguous by nature -- we never
>  >         * delete bits from it! */
>  >        *num_bits = bits_wanted;
> 
> See here? bit_wanted is 2, while with the script, the local alloc is 
> divided into 32 parts and every part only have 1 clusters left. But this 
> code deem that we can always get bits_wanted. :)

Ahh, yeah we need to pass back the actual allocation from
ocfs2_local_alloc_find_clear_bits(). That should be simple enough. I
probably never hit this because all other allocations that went to the local
alloc were 1 bit at a time (for file write, etc).


Btw, I realized that we don't really need ocfs2_local_alloc_in_range() since
we don't allocate inode groups from it any more. Or at least, I can simplify
it quite a bit.
	--Mark

--
Mark Fasheh



More information about the Ocfs2-devel mailing list