[Ocfs2-devel] [PATCH] ocfs2: return error when we attempt to access a dirty bh in jbd2
Changwei Ge
ge.changwei at h3c.com
Thu Jan 25 18:03:48 PST 2018
On 2018/1/26 9:45, piaojun wrote:
> Hi Changwei,
>
> On 2018/1/26 9:00, Changwei Ge wrote:
>> Hi Jun,
>> Good morning:-)
>>
>> On 2018/1/25 20:45, piaojun wrote:
>>> Hi Changwei,
>>>
>>> On 2018/1/25 20:30, Changwei Ge wrote:
>>>> Hi Jun,
>>>>
>>>> On 2018/1/25 20:18, piaojun wrote:
>>>>> Hi Changwei,
>>>>>
>>>>> On 2018/1/25 19:59, Changwei Ge wrote:
>>>>>> Hi Jun,
>>>>>>
>>>>>> On 2018/1/25 10:41, piaojun wrote:
>>>>>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>>>>>> situation:
>>>>>>>
>>>>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>>>>> Quick questions:
>>>>>> Do you mean the current code puts modifying the extent records belonging to a certain file and modifying the truncate log into the same transaction?
>>>>>> If so, forgive me, since I didn't figure that out. Could you point it out in your sequence diagram below?
>>>>>>
>>>>>> That would help me understand the issue your changelog describes.
>>>>>>
>>>>>> Thanks,
>>>>>> Changwei
>>>>>>
>>>>> Yes, I mean they are in the same transaction as below:
>>>>>
>>>>> ocfs2_remove_btree_range
>>>>>   ocfs2_remove_extent          // modify extent records
>>>>>   ocfs2_truncate_log_append    // modify truncate log
>>>>
>>>> If so, I think the transaction including the operations on the extent and the truncate log won't be committed.
>>>> And the journal should already have been aborted once the periodic transaction commit thread has been woken.
>>>> So no metadata will be changed.
>>>> And later, ocfs2_truncate_log_worker shouldn't see any inode on the truncate log.
>>>> Are you sure this is the root cause of your problem?
>>>> It seems a little strange to me.
>>>>
>>>> Thanks,
>>>> Changwei
>>>>
>>> As you said, the transaction was not committed, but after a while the
>>> bh of the truncate log was committed in another transaction. I'm sure
>>> about the cause, and after applying this patch the duplicate cluster
>>> problem is gone. I have been testing it for a few months.
>>
>> I think we are talking about two jbd2 transactions, right?
> yes, two transactions involved.
>
>> One is for moving clusters from the extent to the truncate log. Let's name it T1.
>> Another is for reclaiming clusters from the truncate log and returning them back to the global bitmap. Let's name it T2.
>>
>> If jbd2 fails to commit T1 due to an IO error, the whole jbd2 journal will be aborted, which means it can't work any more.
>> All subsequent transaction starts and commits will fail.
>>
>> So, how can T2 be committed while T1 fails?
> From my testing, jbd2 is not aborted when it encounters the IO error, and I
> printed bh->b_state = 0x44828 (1000100100000101000 in binary). That means
> the bh has been submitted but the write IO failed, and it is still tracked
> by jbd2 according to 'bh_state_bits' and 'jbd_state_bits'.
Um... Strange :(
T1 fails to be committed but the journal is still normal?
And T2 is even committed successfully?
I have added Jan to our discussion.
Hopefully, he can help clear up our doubts. :)
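
Side note, in case it helps the discussion: below is a minimal sketch of how I
would expect an aborted journal to show up on the truncate-log flush path. It is
not part of the patch; 'osb', 'handle', 'status' and the OCFS2_TRUNCATE_LOG_UPDATE
credit value are only illustrative, and it assumes the mainline jbd2 behaviour
where start_this_handle() refuses new handles with -EROFS once the journal has
been aborted.

	handle_t *handle;
	int status;

	/*
	 * Illustrative only: ocfs2_start_trans() ends up in jbd2's
	 * start_this_handle(), which refuses to start a new handle on an
	 * aborted journal and fails with -EROFS.  So if T1 really aborted
	 * the journal, T2 should fail right here and never touch the
	 * truncate log on disk.
	 */
	handle = ocfs2_start_trans(osb, OCFS2_TRUNCATE_LOG_UPDATE);
	if (IS_ERR(handle)) {
		status = PTR_ERR(handle);	/* expect -EROFS after an abort */
		mlog_errno(status);
		return status;
	}

That is why the behaviour you observed (T2 committing after T1 failed) puzzles me.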
>
>>
>> Otherwise, did you ever try to recover jbd2/journal? If so, I think your patch here is not fit for mainline yet.
>>
> Currently this problem requires the user to umount and mount again to
> recover, and I'm glad to hear your advice.
>
I think it's better to do so for now.
Moreover, ocfs2 will fence the problematic node out.
Thanks,
Changwei
> thanks,
> Jun
>
>> Thanks,
>> Changwei
>>
>
>>>
>>> thanks,
>>> Jun
>>>
>>>>>
>>>>> thanks,
>>>>> Jun
>>>>>
>>>>>>> truncate log at the same time, and hand them over to jbd2.
>>>>>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>>>>>> 3. After a while the storage link becomes normal again. The truncate
>>>>>>> log flush worker, triggered by the next space reclaim, finds the dirty
>>>>>>> bh of the truncate log, clears its 'BH_Write_EIO' and then sets it
>>>>>>> uptodate in __ocfs2_journal_access():
>>>>>>>
>>>>>>> ocfs2_truncate_log_worker
>>>>>>>   ocfs2_flush_truncate_log
>>>>>>>     __ocfs2_flush_truncate_log
>>>>>>>       ocfs2_replay_truncate_records
>>>>>>>         ocfs2_journal_access_di
>>>>>>>           __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodate.
>>>>>>>
>>>>>>> 4. Then jbd2 will flush the bh of the truncate log to disk, but the
>>>>>>> bh of the extent rec is still in an error state, and unfortunately
>>>>>>> nobody will take care of it.
>>>>>>> 5. At last the space of the extent rec is not reduced, but the
>>>>>>> truncate log flush worker has given it back to the global allocator.
>>>>>>> That will cause the duplicate cluster problem, which can be
>>>>>>> identified by fsck.ocfs2.
>>>>>>>
>>>>>>> So we should return -EIO here so as not to ruin the atomicity and
>>>>>>> consistency of space reclaim.
>>>>>>>
>>>>>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
>>>>>>>
>>>>>>> Signed-off-by: Jun Piao <piaojun at huawei.com>
>>>>>>> Reviewed-by: Yiwen Jiang <jiangyiwen at huawei.com>
>>>>>>> ---
>>>>>>> fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>>>>>> 1 file changed, 35 insertions(+), 10 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>>>>> index 3630443..d769ca2 100644
>>>>>>> --- a/fs/ocfs2/journal.c
>>>>>>> +++ b/fs/ocfs2/journal.c
>>>>>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>>>>> /* we can safely remove this assertion after testing. */
>>>>>>> if (!buffer_uptodate(bh)) {
>>>>>>> mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>>>>>>> - mlog(ML_ERROR, "b_blocknr=%llu\n",
>>>>>>> - (unsigned long long)bh->b_blocknr);
>>>>>>> + mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>>>>>> + (unsigned long long)bh->b_blocknr, bh->b_state);
>>>>>>>
>>>>>>> lock_buffer(bh);
>>>>>>> /*
>>>>>>> - * A previous attempt to write this buffer head failed.
>>>>>>> - * Nothing we can do but to retry the write and hope for
>>>>>>> - * the best.
>>>>>>> + * We should not reuse the dirty bh directly due to the
>>>>>>> + * following situation:
>>>>>>> + *
>>>>>>> + * 1. When removing extent rec, we will dirty the bhs of
>>>>>>> + * extent rec and truncate log at the same time, and
>>>>>>> + * hand them over to jbd2.
>>>>>>> + * 2. The bhs are not flushed to disk due to abnormal
>>>>>>> + * storage link.
>>>>>>> + * 3. After a while the storage link become normal.
>>>>>>> + * Truncate log flush worker triggered by the next
>>>>>>> + * space reclaiming found the dirty bh of truncate log
>>>>>>> + * and clear its 'BH_Write_EIO' and then set it uptodate
>>>>>>> + * in __ocfs2_journal_access():
>>>>>>> + *
>>>>>>> + * ocfs2_truncate_log_worker
>>>>>>> + * ocfs2_flush_truncate_log
>>>>>>> + * __ocfs2_flush_truncate_log
>>>>>>> + * ocfs2_replay_truncate_records
>>>>>>> + * ocfs2_journal_access_di
>>>>>>> + * __ocfs2_journal_access
>>>>>>> + *
>>>>>>> + * 4. Then jbd2 will flush the bh of truncate log to disk,
>>>>>>> + * but the bh of extent rec is still in error state, and
>>>>>>> + * unfortunately nobody will take care of it.
>>>>>>> + * 5. At last the space of extent rec was not reduced,
>>>>>>> + * but truncate log flush worker have given it back to
>>>>>>> + * globalalloc. That will cause duplicate cluster problem
>>>>>>> + * which could be identified by fsck.ocfs2.
>>>>>>> + *
>>>>>>> + * So we should return -EIO in case of ruining atomicity
>>>>>>> + * and consistency of space reclaim.
>>>>>>> */
>>>>>>> if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>>>>>>> - clear_buffer_write_io_error(bh);
>>>>>>> - set_buffer_uptodate(bh);
>>>>>>> - }
>>>>>>> -
>>>>>>> - if (!buffer_uptodate(bh)) {
>>>>>>> + mlog(ML_ERROR, "A previous attempt to write this "
>>>>>>> + "buffer head failed\n");
>>>>>>> unlock_buffer(bh);
>>>>>>> return -EIO;
>>>>>>> }
>>>>>>>
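
For readers skimming the thread, this is roughly how the tail of
__ocfs2_journal_access() reads with the patch applied. It is assembled from the
hunk above; the long comment is elided, and the final unlock_buffer() plus
closing brace are my guess at the context outside the hunk (the success path
obviously has to drop the buffer lock):

	/* we can safely remove this assertion after testing. */
	if (!buffer_uptodate(bh)) {
		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
		     (unsigned long long)bh->b_blocknr, bh->b_state);

		lock_buffer(bh);
		/* see the changelog above for the duplicate-cluster scenario */
		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
			mlog(ML_ERROR, "A previous attempt to write this "
					"buffer head failed\n");
			unlock_buffer(bh);
			return -EIO;
		}
		unlock_buffer(bh);	/* outside the hunk: success path drops the lock */
	}

With the -EIO propagated, the truncate log is left alone, and any duplicate
clusters that already made it to disk can still be found afterwards with
fsck.ocfs2 (e.g. 'fsck.ocfs2 -f <device>'), as the changelog notes.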