[Ocfs2-devel] [PATCH 1/4] ocfs2: allocation reservations
Mark Fasheh
mfasheh at suse.com
Tue Mar 16 20:13:14 PDT 2010
On Tue, Mar 16, 2010 at 05:17:35PM +0800, Tao Ma wrote:
> Hi Mark,
> The idea is cool. Some comments inlined.
Thanks, and thanks for the detailed review.
> > +static void ocfs2_check_resmap(struct ocfs2_reservation_map *resmap)
> > +{
> > + unsigned int off = 0;
> > + int i = 0;
> > + struct rb_node *node;
> > + struct ocfs2_alloc_reservation *resv;
> > +
> > + node = rb_first(&resmap->m_reservations);
> > + while (node) {
> > + resv = rb_entry(node, struct ocfs2_alloc_reservation, r_node);
> > +
> > + if (i > 0 && resv->r_start <= off) {
> > + mlog(ML_ERROR, "reservation %d has bad start off!\n",
> > + i);
> > + goto bad;
> > + }
> > +
> > + if (resv->r_len == 0) {
> > + mlog(ML_ERROR, "reservation %d has no length!\n",
> > + i);
> > + goto bad;
> > + }
> > +
> > + if (resv->r_start > ocfs2_resv_end(resv)) {
> > + mlog(ML_ERROR, "reservation %d has invalid range!\n",
> > + i);
> > + goto bad;
> > + }
> > +
> > + if (ocfs2_resv_end(resv) > resmap->m_bitmap_len) {
> I guess here should be ocfs2_resv_end(resv) >= resmap->m_bitmap_len?
Yes, good catch. I'll fix that up.
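The off-by-one is worth spelling out: with m_bitmap_len valid bits, legal offsets run 0 through m_bitmap_len - 1, so an end equal to the length is already out of range. A standalone illustration (helper name hypothetical, not patch code):

```c
/* Illustration of the corrected bounds check: with bitmap_len valid
 * bits, the last legal offset is bitmap_len - 1, so any reservation
 * end >= bitmap_len is invalid. */
static int resv_end_in_range(unsigned int resv_end, unsigned int bitmap_len)
{
	return resv_end < bitmap_len;
}
```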
> > +static void ocfs2_resv_insert(struct ocfs2_reservation_map *resmap,
> > + struct ocfs2_alloc_reservation *new)
> > +{
> > + struct rb_root *root = &resmap->m_reservations;
> > + struct rb_node *parent = NULL;
> > + struct rb_node **p = &root->rb_node;
> > + struct ocfs2_alloc_reservation *tmp;
> > +
> > + assert_spin_locked(&resv_lock);
> > +
> > + mlog(0, "Insert reservation start: %u len: %u\n", new->r_start,
> > + new->r_len);
> > +
> > + while(*p) {
> > + parent = *p;
> > +
> > + tmp = rb_entry(parent, struct ocfs2_alloc_reservation, r_node);
> > +
> > + if (new->r_start < tmp->r_start)
> > + p = &(*p)->rb_left;
> > + else if (new->r_start > ocfs2_resv_end(tmp))
> > + p = &(*p)->rb_right;
> > + else {
> > + /* This should never happen! */
> > + mlog(ML_ERROR, "Duplicate reservation window!\n");
> > + BUG();
> We bug out here in case r_start is in [tmp->r_start,
> ocfs2_resv_end(tmp)]. Do we need to check ocfs2_resv_end(new) somewhere?
We don't really care about ocfs2_resv_end(new) for tree traversal purposes.
However, I think you have a good idea - reservations should never overlap,
and this would be a good place to check for code bugs. I'll add some extra
checks.
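For what it's worth, an overlap check at insert time can be as simple as comparing closed intervals; a sketch (helper name hypothetical, not from the patch):

```c
/* Hypothetical helper: return nonzero if the closed intervals
 * [start, end] and [r_start, r_end] share any bits. Two closed
 * intervals overlap unless one ends entirely before the other
 * begins. */
static int resv_windows_overlap(unsigned int start, unsigned int end,
				unsigned int r_start, unsigned int r_end)
{
	return !(end < r_start || start > r_end);
}
```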
> > +{
> > + /* Now we do a linear search for a window, starting at 'prev_rsv' */
> > + while (1) {
> > + next = rb_next(prev);
> > + if (next) {
> > + mlog(0, "One more resv found in linear search\n");
> > + next_resv = rb_entry(next,
> > + struct ocfs2_alloc_reservation,
> > + r_node);
> > +
> > + gap_start = ocfs2_resv_end(prev_resv) + 1;
> > + gap_end = next_resv->r_start - 1;
> > + gap_len = gap_end - gap_start + 1;
> > + } else {
> > + mlog(0, "No next node\n");
> > + /*
> > + * We're at the rightmost edge of the
> > + * tree. See if a reservation between this
> > + * window and the end of the bitmap will work.
> > + */
> > + gap_start = ocfs2_resv_end(prev_resv) + 1;
> > + gap_len = resmap->m_bitmap_len - gap_start;
> > + gap_end = resmap->m_bitmap_len - 1;
> > + }
> > +
> > + clen = ocfs2_resmap_find_free_bits(resmap, wanted, gap_start,
> > + gap_len, &cstart, &clen);
> I guess we can make an improvement here? If gap_len <= best_len,
> we don't actually need to call this function and can just loop to the
> next round.
Yep, fixed. That should save us some CPU :)
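The short-circuit Tao suggests is just a length comparison ahead of the bitmap scan; sketched outside the kernel (name hypothetical):

```c
/* Hypothetical sketch of the suggested short-circuit: a gap no longer
 * than the best candidate found so far can never improve the result,
 * so the (comparatively expensive) bitmap scan can be skipped. */
static int gap_worth_searching(unsigned int gap_len, unsigned int best_len)
{
	return gap_len > best_len;
}
```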
> > + if (clen == wanted) {
> > + best_len = clen;
> > + best_start = cstart;
> > + goto out_insert;
> > + } else if (clen > best_len) {
> > + best_len = clen;
> > + best_start = cstart;
> > + }
> > +
> > + if (!next)
> > + break;
> > +
> > + prev = next;
> > + prev_resv = rb_entry(prev, struct ocfs2_alloc_reservation,
> > + r_node);
> > + }
> > +
> > +out_insert:
> > + if (best_len) {
> > + resv->r_start = best_start;
> > + resv->r_len = best_len;
> > + ocfs2_resv_insert(resmap, resv);
> > + }
> > +}
> > +static void ocfs2_cannibalize_resv(struct ocfs2_reservation_map *resmap,
> > + struct ocfs2_alloc_reservation *resv,
> > + unsigned int wanted)
> > +{
> <snip>
> > + } else {
> > + unsigned int shrink = lru_resv->r_len / 2;
> > +
> > + mlog(0, "shrink lru resv by %u\n", shrink);
> > +
> > + lru_resv->r_len -= shrink;
> Do we need to change lru_resv->r_last_len here so that the next search
> can start right after the shrunken zone? I am not sure.
No, I left them unchanged on purpose - they still record the last successful
allocation. In theory we could try adjusting them, but that'd be speculative
and not based on any actual allocation requests.
> > +
> > + resv->r_start = ocfs2_resv_end(lru_resv) + 1;
> > + resv->r_len = shrink;
> > + }
> > +
> <snip>
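The split in the hunk above gives the newcomer the back half of the victim's window; as a standalone sketch (struct and names illustrative only, not the patch code):

```c
struct window {
	unsigned int start;
	unsigned int len;
};

/* Illustrative sketch of cannibalizing an LRU reservation: shrink the
 * victim by half and hand the freed back half to the newcomer. */
static void split_lru_window(struct window *lru, struct window *newcomer)
{
	unsigned int shrink = lru->len / 2;

	lru->len -= shrink;
	/* the newcomer begins one past the shrunken victim's last bit */
	newcomer->start = lru->start + lru->len;
	newcomer->len = shrink;
}
```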
> > +void ocfs2_resmap_claimed_bits(struct ocfs2_reservation_map *resmap,
> > + struct ocfs2_alloc_reservation *resv,
> > + u32 cstart, u32 clen)
> > +{
> > + unsigned int cend = cstart + clen - 1;
> > + int search, used_resv = 0;
> > +
> > + if (resmap == NULL || ocfs2_resmap_disabled(resmap))
> > + return;
> > +
> > + if (resv == NULL)
> > + return;
> > +
> > + spin_lock(&resv_lock);
> > +
> > + mlog(0, "claim bits: cstart: %u cend: %u clen: %u r_start: %u "
> > + "r_end: %u r_len: %u, r_last_start: %u r_last_len: %u\n",
> > + cstart, cend, clen, resv->r_start, ocfs2_resv_end(resv),
> > + resv->r_len, resv->r_last_start, resv->r_last_len);
> > +
> > + /*
> > + * Find all reservations which contain (cstart-clen). This
> > + * could potentially be multiple ones, depending on the
> > + * allocation.
> I am just curious how this could happen with multiple reservations. ;)
Yeah, I realized after I sent the series -- this is old code that I'm going
to remove in my next round. It's from back when the local alloc didn't always
honor reservations (which, as I discussed, didn't work well). The only part
of this whole loop that we need to keep is the call to
ocfs2_adjust_resv_from_alloc() for the passed-in reservation.
> > + *
> > + * Each one needs to be adjusted. If the allocation overwrites
> > + * the whole window, we should just discard that reservation.
> > + *
> > + * If we hit the one passed in ('resv'), we treat it slightly
> > + * special by trying to extend it after adjustments have been
> > + * made.
> > + */
> > +
> > + search = cstart;
> > + while (search <= cend) {
> > + struct ocfs2_alloc_reservation *r;
> > + int inc;
> > +
> > + /*
> > + * This will find us the leftmost resv containing any
> > + * bits between 'search' and 'cend'
> > + */
> > + r = ocfs2_find_resv_containing(resmap, search, cend);
> > + if (!r)
> > + break;
> > +
> > + inc = ocfs2_adjust_resv_from_alloc(resmap, r, search, cend);
> > +
> > + /*
> > + * This will signal the next set of code that 'resv'
> > + * was used at least partially by the allocation.
> > + */
> > + if (r == resv) {
> > + used_resv = 1;
> I just found that used_resv isn't used anymore later.
Yep, this will go away as mentioned.
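Concretely, the per-reservation adjustment that survives the cleanup amounts to trimming the claimed range out of one window; a hypothetical userspace sketch (struct and names illustrative, not the patch code):

```c
struct resv {
	unsigned int r_start;
	unsigned int r_len;
};

/* Hypothetical sketch: trim reservation 'r' after an allocation claimed
 * the closed bit range [cstart, cend]. Returns 1 if the allocation
 * consumed the whole window, meaning it should be discarded. */
static int adjust_resv_after_alloc(struct resv *r, unsigned int cstart,
				   unsigned int cend)
{
	unsigned int r_end = r->r_start + r->r_len - 1;

	if (cstart <= r->r_start && cend >= r_end)
		return 1;	/* whole window consumed */

	if (cend >= r->r_start && cend < r_end) {
		/* allocation ate the front of the window */
		r->r_len = r_end - cend;
		r->r_start = cend + 1;
	} else if (cstart > r->r_start && cstart <= r_end) {
		/* allocation ate the tail of the window */
		r->r_len = cstart - r->r_start;
	}
	return 0;
}
```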
> > +struct ocfs2_reservation_map {
> > + struct rb_root m_reservations;
> > + char *m_disk_bitmap;
> > +
> > + struct ocfs2_super *m_osb;
> > +
> > + /* The following are not initialized to meaningful values until a disk
> > + * bitmap is provided. */
> > + u32 m_bitmap_len; /* Number of valid
> > + * bits available */
> > +
> > + unsigned int m_search_start; /* Records the end
> > + * location of our
> > + * most recent
> > + * allocation. */
> oh, this member isn't used actually.
Yeah another bit of old code. I'll remove.
> > +
> > + struct list_head m_lru; /* LRU of reservations
> > + * structures. */
> > +
> > +};
> <snip>
> > @@ -1429,6 +1434,17 @@ static int ocfs2_parse_options(struct super_block *sb,
> > mopt->mount_opt |= OCFS2_MOUNT_NO_POSIX_ACL;
> > mopt->mount_opt &= ~OCFS2_MOUNT_POSIX_ACL;
> > break;
> > + case Opt_resv_level:
> > + if (is_remount)
> > + break;
> > + if (match_int(&args[0], &option)) {
> > + status = 0;
> > + goto bail;
> > + }
> > + if (option >= OCFS2_MIN_RESV_LEVEL &&
> > + option < OCFS2_MAX_RESV_LEVEL)
> > + mopt->resv_level = option;
> Do we need to check cluster size here? I guess with a large cluster
> size, say 1M, we need a very small window or even no window.
> ok, that's all for the code.
>
> Besides the comments, I have some other thoughts.
> 1. Do we need some additional checks in
> ocfs2_reserve_local_alloc_bits? The old scheme just checks the condition
> bits_wanted <= free_bits since the local alloc is contiguous enough. But
> with reservation windows, we can actually have a fragmented local alloc.
> 2. The current code doesn't handle space reservations (unwritten
> extents). Maybe that is because unwritten extents allocate contiguous clusters?
>
> With these 2 problems, I corrupted the system somehow by the following
> scripts(Sorry, Mark, ;) ).
>
> echo 'y'|mkfs.ocfs2 -b 4K -C 4K --fs-features=local $DEVICE
> mount -t ocfs2 $DEVICE $MNT_DIR
> for((j=0;j<32;j++))
> do
> FILE_NAME=$MNT_DIR/$RANDOM
> for((i=0;i<63;i++))
> do
> cat /mnt/4096 >> $FILE_NAME #4096 is a file with 4096 bytes.
> done
> done
> FILE_NAME=$MNT_DIR/$RANDOM
> ./rw_test -f $FILE_NAME -l 0 -u 8192 #this will create a file with an
> unwritten extent of 2 clusters length at offset 0.
> umount $MNT_DIR
>
> Now run fsck.ocfs2, you will find duplicated clusters.
> This is caused by the code here.
> > start = ocfs2_local_alloc_find_clear_bits(osb, alloc, bits_wanted,
> > ac->ac_resv);
> > if (start == -1) {
> > /* TODO: Shouldn't we just BUG here? */
> > status = -ENOSPC;
> > mlog_errno(status);
> > goto bail;
> > }
> > printk("end of start %d\n", start);
> >
> > bitmap = la->la_bitmap;
> > *bit_off = le32_to_cpu(la->la_bm_off) + start;
> > /* local alloc is always contiguous by nature -- we never
> > * delete bits from it! */
> > *num_bits = bits_wanted;
>
> See here? bits_wanted is 2, while with the script, the local alloc is
> divided into 32 parts and every part only has 1 cluster left. But this
> code deems that we can always get bits_wanted. :)
Ahh, yeah we need to pass back the actual allocation from
ocfs2_local_alloc_find_clear_bits(). That should be simple enough. I
probably never hit this because all other allocations that went to the local
alloc were 1 bit at a time (for file write, etc).
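A sketch of that fix in userspace terms (byte-per-bit array, names hypothetical): the search returns the start offset as before, but also reports how many contiguous clear bits it actually found, so the caller never assumes it got all of bits_wanted:

```c
/* Hypothetical sketch: search 'bitmap' (one byte per bit, nonzero =
 * used) for the best run of clear bits, preferring the first run of at
 * least 'wanted' bits. Returns the start offset and stores the number
 * of bits actually found (possibly fewer than wanted) in *found, or
 * returns -1 if no bits are free at all. */
static int find_clear_bits(const unsigned char *bitmap, unsigned int len,
			   unsigned int wanted, unsigned int *found)
{
	unsigned int i, run = 0, run_start = 0, best = 0, best_start = 0;

	for (i = 0; i < len; i++) {
		if (!bitmap[i]) {
			if (run == 0)
				run_start = i;
			run++;
			if (run > best) {
				best = run;
				best_start = run_start;
			}
			if (best >= wanted)
				break;	/* good enough, stop early */
		} else {
			run = 0;
		}
	}

	if (best == 0)
		return -1;	/* nothing free at all */

	*found = best < wanted ? best : wanted;
	return (int)best_start;
}
```

With a fragmented bitmap this reports a partial run rather than pretending bits_wanted contiguous bits exist, which is exactly the assumption Tao's script violated.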
Btw, I realized that we don't really need ocfs2_local_alloc_in_range() since
we don't allocate inode groups from it any more. Or at least, I can simplify
it quite a bit.
--Mark
--
Mark Fasheh