[Ocfs2-tools-devel] [PATCH 07/11] libocfs2: Add aio read support

Goldwyn Rodrigues rgoldwyn at gmail.com
Fri Sep 23 12:52:49 PDT 2011


On Fri, Sep 23, 2011 at 12:40 PM, Sunil Mushran
<sunil.mushran at oracle.com> wrote:
> On 09/23/2011 09:49 AM, Goldwyn Rodrigues wrote:
>>
>> On Thu, Sep 22, 2011 at 9:04 PM, Sunil Mushran <sunil.mushran at oracle.com>
>> wrote:
>>>
>>> Added public function io_aio_read_blocks() that performs aio reads on the
>>> provided set of blocks. It is io cache friendly. One use case is to use
>>> this to warm the cache, which has proven to be very useful in fsck.
>>>
>>> Signed-off-by: Sunil Mushran <sunil.mushran at oracle.com>
>>> ---
>
> <snip>
>>>
>>> +static errcode_t unix_aio_read_blocks(io_channel *channel,
>>> +                                      struct io_aio_unit *aios, int count)
>>> +{
>>> +       int i;
>>> +       int ret;
>>> +       io_context_t io_ctx;
>>> +       struct iocb *iocb = NULL, **iocbs = NULL;
>>> +       struct io_event *events = NULL;
>>> +       int64_t offset;
>>> +       int submitted, completed = 0;
>>> +
>>> +       ret = OCFS2_ET_NO_MEMORY;
>>> +       iocb = malloc((sizeof(struct iocb) * count));
>>> +       iocbs = malloc((sizeof(struct iocb *) * count));
>>> +       events = malloc((sizeof(struct io_event) * count));
>>> +       if (!iocb || !iocbs || !events)
>>> +               goto out;
>>> +
>>> +       memset(&io_ctx, 0, sizeof(io_ctx));
>>> +       ret = io_queue_init(count, &io_ctx);
>>> +       if (ret)
>>> +               return ret;
>>> +
>>> +       for (i = 0; i < count; ++i) {
>>> +               offset = aios[i].aio_blkno * channel->io_blksize;
>>> +               io_prep_pread(&(iocb[i]), channel->io_fd,
>>> +                             aios[i].aio_buf,
>>> +                             channel->io_blksize, offset);
>>> +               iocbs[i] = &iocb[i];
>>> +       }
>>> +
>>> +resubmit:
>>> +       ret = io_submit(io_ctx, count - completed, &iocbs[completed]);
>>> +       if (!ret && (count - completed))
>>> +               ret = OCFS2_ET_SHORT_READ;
>>> +       if (ret < 0)
>>> +               goto out;
>>> +       submitted = ret;
>>> +
>>> +       ret = io_getevents(io_ctx, submitted, submitted, events, NULL);
>>> +       if (ret < 0)
>>> +               goto out;
>>> +
>>> +       completed += submitted;
>>> +       if (completed < count)
>>> +               goto resubmit;
>>
>> You can replace resubmit with a while loop.
>>
>> However, are you using the full potential of asynchronous reads
>> here? This looks like a function for performing bulk I/O on a set of
>> different blocks. By collecting the results immediately, you block
>> until all of the submitted I/O has completed. How about breaking the
>> whole thing into two phases, i.e. submission and collection:
>> submitting the I/O when you know what is going to be read, and
>> collecting the events only when you actually need it.
>>
>> Taking the example of inodes, you can io_submit inode blocks when you
>> read the inode_alloc file in pass 0, and io_getevents in pass 1 when
>> you actually need it.
>
> In effect that's what the earlier pre-caching was doing: read inodes in
> Pass 0. The problem is that it works only if the cache size is large
> enough. We knew that when we wrote it 2 years ago. But we did not know
> how expensive it would be. The stats patch showed us that it is very
> expensive.

No, there is a difference from what pre-caching does. With pre-caching,
you read block by block, though over the whole rb tree. With your
current patch you are doing bulk I/O, albeit in parts. A wholesale
market is usually cheaper than a retail store ;) Effectively, though,
it is still synced I/O, not aio.

>
> For now, in patch 6, we read the inodes in larger (4MB) chunks. Still
> sync reads though. And that gives us good numbers. Adding aio reads
> in the inode scan interface is a non-trivial exercise.

I am not saying your method is bad. Your numbers prove it is good. I
am just pointing out what could be a better implementation.

>
> Breaking up submit and getevents will be tricky because libocfs2 is
> single threaded. Overhauling it will be an enormous task. So the gain
> better be worth it.

Okay, pass 0 might be a bit too early. I am not talking about a
multi-threaded application either. For example, you could abstract it
all inside inode_scan itself: keep two buffers instead of one.
io_submit into the first buffer while the second buffer is being
consumed. Once the second is fully consumed, io_getevents on the first
and io_submit into the second, and vice versa once the first is
consumed.

You are keeping tracking information already, so keeping the
additional io context data structures should not hurt much.
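
Something along these lines, strictly as a rough sketch. None of the
names below exist in libocfs2: submit_batch(), reap_batch(),
scan_blocks(), BATCH and the hardcoded 4K block size are all made up
for illustration, and error handling is mostly skipped.

#include <libaio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BATCH	64
#define BLKSIZE	4096

struct batch {
	struct iocb	iocb[BATCH];
	struct iocb	*iocbs[BATCH];
	char		*buf[BATCH];
	int		nr;		/* iocbs in flight in this batch */
};

/* Queue up the next BATCH block reads without waiting for them. */
static int submit_batch(io_context_t ctx, int fd, struct batch *b,
			int64_t *next, int64_t last)
{
	int i, ret;

	b->nr = 0;
	for (i = 0; i < BATCH && *next <= last; i++, (*next)++) {
		io_prep_pread(&b->iocb[i], fd, b->buf[i], BLKSIZE,
			      *next * BLKSIZE);
		b->iocbs[i] = &b->iocb[i];
		b->nr++;
	}
	if (!b->nr)
		return 0;

	ret = io_submit(ctx, b->nr, b->iocbs);
	if (ret >= 0)
		b->nr = ret;	/* short-submit handling skipped */
	return ret;
}

/* Block for a previously submitted batch, only when we need the data. */
static int reap_batch(io_context_t ctx, struct batch *b,
		      struct io_event *events)
{
	if (!b->nr)
		return 0;
	return io_getevents(ctx, b->nr, b->nr, events, NULL);
}

/* Ping-pong scan: while the caller chews on one batch, the other one
 * is already on its way from the disk. */
static int scan_blocks(int fd, int64_t first, int64_t last)
{
	io_context_t ctx;
	struct io_event events[BATCH];
	struct batch b[2];
	int64_t next = first;
	int cur = 0, i, j, ret;

	memset(&ctx, 0, sizeof(ctx));
	ret = io_queue_init(2 * BATCH, &ctx);
	if (ret)
		return ret;

	for (i = 0; i < 2; i++)
		for (j = 0; j < BATCH; j++)
			b[i].buf[j] = malloc(BLKSIZE);	/* checks skipped */

	submit_batch(ctx, fd, &b[cur], &next, last);	/* prime the pipe */

	while (b[cur].nr) {
		/* keep the disk busy before we start consuming */
		submit_batch(ctx, fd, &b[1 - cur], &next, last);

		/* wait only for the batch we are about to consume */
		reap_batch(ctx, &b[cur], events);

		/* ... hand b[cur].buf[0..nr-1] to the caller here ... */

		cur = 1 - cur;
	}

	for (i = 0; i < 2; i++)
		for (j = 0; j < BATCH; j++)
			free(b[i].buf[j]);
	io_queue_release(ctx);
	return 0;
}

The point is only that the io_getevents() for one batch always overlaps
with the io_submit() of the next, so the wait happens while the disk is
doing real work.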

>
> The places we use aio reads are with allocators and dirblocks. Allocator
> performance will not improve at all because fsck processes them on a
> column-basis and we aio read the groups row-wise.
>
> But it may help with dirblocks.
>
> Pass 2: Checking directory entries
>  I/O read disk/cache: 3902MB / 3938MB, write: 0MB, rate: 25.89MB/s
>  Times real: 302.723s, user: 95.698s, sys: 13.322s
>
> The I/O rate looks more gaudy because we add up the disk and cached reads.
> If we remove the cache reads and also remove half the user time, the I/O
> rate drops to 15MB/s.
>
> It may be interesting to track the time between io_submit and io_getevents.
>

I suppose you meant the window right after the resubmit, between the
io_submit and the io_getevents. That is the time you could use to do
something in parallel :) A crude way to measure it is sketched below.
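
Something like this would do; timed_read() and elapsed_ms() are made-up
helpers, not anything in the tree. It just wraps clock_gettime() around
the two libaio calls.

#include <libaio.h>
#include <stdio.h>
#include <time.h>

static double elapsed_ms(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000.0 +
	       (b->tv_nsec - a->tv_nsec) / 1000000.0;
}

/* Time the submit itself and the wait that follows it. The wait is the
 * window where useful work could be overlapped. */
static int timed_read(io_context_t ctx, long nr, struct iocb **iocbs,
		      struct io_event *events)
{
	struct timespec t0, t1, t2;
	int submitted, ret;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	submitted = io_submit(ctx, nr, iocbs);
	if (submitted < 0)
		return submitted;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* anything done here runs while the disk is busy */

	ret = io_getevents(ctx, submitted, submitted, events, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t2);

	fprintf(stderr, "submit %.3fms, wait %.3fms for %d iocbs\n",
		elapsed_ms(&t0, &t1), elapsed_ms(&t1, &t2), submitted);
	return ret;
}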


-- 
Goldwyn


