[Ocfs2-devel] fully support O_DIRECT for ocfs2

Fri Aug 7 01:39:06 PDT 2015

Hi,

I want to redesign the direct io code path for ocfs2. Here is some
analysis and a proposed design. Any suggestions are appreciated.

1. Why ocfs2 falls back to buffered io.
ocfs2 uses the cluster inode lock to protect buffered io, although the
lock is acquired and released quickly in write_begin and write_end. When
it gets a bast (blocking AST), it does a checkpoint and waits for the
page flush. This makes buffered io fully covered by the lock.
I think that's why direct io can not act like buffered io to allocate
extents and do append writes: direct io is beyond the control of the
cluster inode lock. That is, the cluster inode lock can not flush direct
io data, which breaks cache coherence.
There are 2 ways to resolve it:
* add the user pages to the page cache, so the jbd2 checkpoint can wait
  on those pages.
* wait for all direct io to return when downconverting the inode lock.
The former would mean hacking a lot of code. The latter is simpler and
easier to implement.

2. More things to be considered when doing direct io.

* sparse file support:
we can use __blockdev_direct_IO() to deal with a hole inside a block. But
for a hole that lies outside the block, we have to handle it ourselves.
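
As a rough illustration of the "outside the block" case, here is a small
userspace sketch. It assumes the allocation unit is larger than the range
being written, so the bytes before and after the write inside a freshly
allocated unit have to be zero-filled by the file system. The names
byte_range and hole_zero_ranges are made up for this example and are not
ocfs2 functions.

```c
#include <assert.h>
#include <stdint.h>

/* A byte range in the file; len == 0 means "nothing to zero". */
struct byte_range {
    uint64_t start;
    uint64_t len;
};

/*
 * Hypothetical helper: for a write [pos, pos + len) that lands in a
 * freshly allocated unit of unit_size bytes, compute the head and tail
 * ranges inside that unit which the write does not cover and which
 * therefore need zero-filling.
 * Assumption: the write does not cross the end of the unit.
 */
void hole_zero_ranges(uint64_t pos, uint64_t len, uint64_t unit_size,
                      struct byte_range *head, struct byte_range *tail)
{
    uint64_t unit_start = pos - (pos % unit_size);
    uint64_t unit_end = unit_start + unit_size;
    uint64_t end = pos + len;

    assert(len > 0 && end <= unit_end);

    /* Bytes between the start of the unit and the start of the write. */
    head->start = unit_start;
    head->len = pos - unit_start;

    /* Bytes between the end of the write and the end of the unit. */
    tail->start = end;
    tail->len = unit_end - end;
}
```

For example, a 4096 byte write at offset 4096 into a new 16384 byte unit
leaves a 4096 byte head and an 8192 byte tail to be zeroed.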

* data consistency:
This means that when a crash happens during a write, the file system
should ensure that the file does not contain stale data. There are 2
methods to ensure data consistency:
a) allocate the extent with the UNWRITE flag before the write. After the
data is written to disk, clear the UNWRITE flag. This is a 2 phase
commit, and the 2 phases are not required to be in the same transaction.
But file size can not be protected by this method.
b) use the journal to ensure meta data is committed after the data is
written to disk. In ordered journal mode, jbd2 waits for the data to be
flushed and then commits the meta data.
Buffered io can use these 2 methods together, while direct io can only
use a), since the journal can not monitor direct io pages. So ext4 adds
the inode to the orphan list to protect file size when doing direct
append.
Unfortunately, ocfs2 direct io can not use a). ocfs2 can not clear the
UNWRITE flag after the io is done, because it needs to acquire the
cluster inode lock to change that flag, and this would cause a dead lock
when the inode lock is downconverting.
So data consistency can not be guaranteed in direct io mode. Further, we
do not need to adopt the ext4 way of adding the inode to the orphan list
when doing direct append.
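
To make method a) concrete, here is a minimal in-memory sketch of the 2
phase commit. It is only a model: struct extent, EXTENT_SIZE and the
extent_* helpers are invented for illustration, and it assumes that a
read of an extent still marked UNWRITE returns zeros instead of the disk
contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define EXTENT_SIZE 8

/* Hypothetical model: one extent backed by one region of "disk". */
struct extent {
    bool unwritten;          /* models the UNWRITE flag */
    char disk[EXTENT_SIZE];  /* may hold stale data from a previous file */
};

/* Phase 1: allocate with the flag set. Stale disk contents remain. */
void extent_alloc(struct extent *e)
{
    e->unwritten = true;
}

/* The data write itself. A crash may happen right after this returns. */
void extent_write_data(struct extent *e, const char *buf)
{
    memcpy(e->disk, buf, EXTENT_SIZE);
}

/* Phase 2: clear the flag once the data is on disk. */
void extent_mark_written(struct extent *e)
{
    e->unwritten = false;
}

/* Read path: an unwritten extent reads as zeros, never as disk data. */
void extent_read(const struct extent *e, char *out)
{
    if (e->unwritten)
        memset(out, 0, EXTENT_SIZE);
    else
        memcpy(out, e->disk, EXTENT_SIZE);
}
```

If the system crashes between the data write and extent_mark_written(),
a reader still sees zeros rather than stale data. The second phase is
exactly the step that needs the cluster inode lock in ocfs2.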

3. Detailed design
Pseudo code to explain:

ocfs2_direct_IO() {
...
     ocfs2_inode_lock()
     increase inode direct_io_count
...
     __blockdev_direct_IO(ocfs2_direct_IO_get_blocks, ocfs2_dio_end_io, ocfs2_dio_submit)
...
     ocfs2_inode_unlock()
     fill file holes: add some empty pages to the bio list.
     submit the bio list.
...
}

ocfs2_direct_IO_get_blocks() {
...
     create new extent when write to a hole
     increase file size when append.
...
}

ocfs2_dio_submit() {
     add bio to a list. do not actually submit it.
}

ocfs2_dio_end_io(){
...
     decrease inode direct_io_count.
...
}

/* This is the inode lock downconvert callback. */
ocfs2_data_convert_worker() {
...
     wait for inode direct_io_count to drop to 0.
...
}
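
The counting part of the pseudo code above can be modelled in userspace
roughly like this. It is a sketch using pthreads, not kernel code:
struct dio_tracker and the dio_* functions are hypothetical stand-ins for
the per-inode direct_io_count and the places that touch it, and the
kernel version would presumably use its own primitives (an atomic counter
plus a wait queue, for instance).

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the per-inode direct_io_count. */
struct dio_tracker {
    pthread_mutex_t lock;
    pthread_cond_t idle;     /* signalled when inflight drops to 0 */
    unsigned int inflight;   /* direct io requests not yet completed */
};

void dio_tracker_init(struct dio_tracker *t)
{
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->idle, NULL);
    t->inflight = 0;
}

/* Models "increase inode direct_io_count" in ocfs2_direct_IO(). */
void dio_begin(struct dio_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    t->inflight++;
    pthread_mutex_unlock(&t->lock);
}

/* Models "decrease inode direct_io_count" in ocfs2_dio_end_io(). */
void dio_end(struct dio_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    assert(t->inflight > 0);
    if (--t->inflight == 0)
        pthread_cond_broadcast(&t->idle);
    pthread_mutex_unlock(&t->lock);
}

/* Models the wait in ocfs2_data_convert_worker(): block until idle. */
void dio_wait_idle(struct dio_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    while (t->inflight > 0)
        pthread_cond_wait(&t->idle, &t->lock);
    pthread_mutex_unlock(&t->lock);
}

/* Snapshot of the counter, for inspection only. */
unsigned int dio_inflight(struct dio_tracker *t)
{
    unsigned int n;

    pthread_mutex_lock(&t->lock);
    n = t->inflight;
    pthread_mutex_unlock(&t->lock);
    return n;
}
```

In this model the downconvert thread calls dio_wait_idle() and only
returns once every dio_begin() has been matched by a dio_end(), which is
the property the design relies on.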