[Ocfs2-devel] fully support O_DIRECT for ocfs2
ryding
ryan.ding at oracle.com
Fri Aug 7 01:39:06 PDT 2015
Hi,
I want to redesign the direct io code path for ocfs2. Here is some
analysis and a proposed design. Any suggestions are appreciated.
1. Reason why ocfs2 falls back to buffer io.
ocfs2 uses the cluster inode lock to protect buffer io, although the lock
is acquired and released quickly in write_begin and write_end. When it
gets a bast, it does a checkpoint and waits for the page flush. This makes
buffer io fully covered by the lock.
I think that's why direct io can not act like buffer io to allocate
extents and do append writes: direct io is beyond the control of the
cluster inode lock. That is, the cluster inode lock can not flush direct
io data, which breaks cache coherence.
There are 2 ways to resolve it:
* add the user pages to the page cache, so the jbd2 checkpoint can wait
on those pages.
* wait for all direct io to return when downconverting the inode lock.
The former would require hacking a lot of code. The latter is simpler
and easier to implement.
2. More things to be considered when doing direct io.
* sparse file support:
we can use __blockdev_direct_IO() to deal with a hole inside a block. But
for a hole that lies outside the block, we have to handle it ourselves.
* data consistency:
This means that when a crash happens during a write, the file system
should ensure that the file does not contain stale data. There are 2
methods to ensure data consistency:
a) allocate the extent with the UNWRITE flag before the write. After the
data is written to disk, clear the UNWRITE flag. This is a 2 phase commit,
and the 2 phases are not required to be in the same transaction. But the
file size can not be protected by this method.
b) use the journal to ensure metadata is committed after the data is
written to disk. In ordered journal mode, jbd2 waits for the data to be
flushed and then commits the metadata.
Buffer io can use these 2 methods together, while direct io can only use
a), since the journal can not monitor direct io pages. So ext4 introduces
the orphan list to protect the file size when doing a direct append.
Unfortunately, ocfs2 direct io can not use a). ocfs2 can not clear the
UNWRITE flag after the io is done, because it needs to acquire the
cluster inode lock to change that flag, and this will cause a deadlock
when the inode lock is downconverting.
So data consistency can not be guaranteed in direct io mode.
Further, we do not need to adopt the ext4 way of adding the inode to the
orphan list when doing a direct append.
3. Detailed design
Use pseudo code to explain:
ocfs2_direct_IO() {
    ...
    ocfs2_inode_lock()
    increase inode direct_io_count
    ...
    __blockdev_direct_IO(ocfs2_direct_IO_get_blocks, ocfs2_dio_end_io,
                         ocfs2_dio_submit)
    ...
    ocfs2_inode_unlock()
    fill file holes: add some empty pages to the bio list
    submit the bio list
    ...
}

ocfs2_direct_IO_get_blocks() {
    ...
    create a new extent when writing to a hole
    increase the file size when appending
    ...
}

ocfs2_dio_submit() {
    add the bio to a list; do not actually submit it
}

ocfs2_dio_end_io() {
    ...
    decrease inode direct_io_count
    ...
}

/* This is the inode lock downconvert callback. */
ocfs2_data_convert_worker() {
    ...
    wait for inode direct_io_count to drop to 0
    ...
}
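
The counting part of the pseudo code above can be tried out in user space.
The following is a minimal sketch, assuming a pthread mutex and condition
variable in place of whatever kernel primitive the real code would use.
struct dio_inode and the dio_* functions are hypothetical names for this
example only. dio_begin() corresponds to the increase in
ocfs2_direct_IO(), dio_end() to ocfs2_dio_end_io(), and dio_wait_idle() to
the wait in ocfs2_data_convert_worker().

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space stand-in for the per-inode state. */
struct dio_inode {
    pthread_mutex_t lock;
    pthread_cond_t  idle;            /* signalled when the count hits 0 */
    int             direct_io_count; /* direct io requests in flight */
};

void dio_inode_init(struct dio_inode *ino)
{
    pthread_mutex_init(&ino->lock, NULL);
    pthread_cond_init(&ino->idle, NULL);
    ino->direct_io_count = 0;
}

/* Submit path: called while the cluster inode lock is held. */
void dio_begin(struct dio_inode *ino)
{
    pthread_mutex_lock(&ino->lock);
    ino->direct_io_count++;
    pthread_mutex_unlock(&ino->lock);
}

/* Completion path: the last finisher wakes up any waiter. */
void dio_end(struct dio_inode *ino)
{
    pthread_mutex_lock(&ino->lock);
    assert(ino->direct_io_count > 0);
    if (--ino->direct_io_count == 0)
        pthread_cond_broadcast(&ino->idle);
    pthread_mutex_unlock(&ino->lock);
}

/* Downconvert path: block until no direct io is in flight. */
void dio_wait_idle(struct dio_inode *ino)
{
    pthread_mutex_lock(&ino->lock);
    while (ino->direct_io_count > 0)
        pthread_cond_wait(&ino->idle, &ino->lock);
    pthread_mutex_unlock(&ino->lock);
}

/* Helper for inspection: current number of in-flight requests. */
int dio_in_flight(struct dio_inode *ino)
{
    pthread_mutex_lock(&ino->lock);
    int n = ino->direct_io_count;
    pthread_mutex_unlock(&ino->lock);
    return n;
}
```

Because the count is only increased while the inode lock is held, a
downconvert that has started waiting should not see new direct io being
added behind its back. That property comes from the ordering in
ocfs2_direct_IO() above, not from this sketch.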