[Ocfs2-devel] [PATCH] ocfs2: checkpoint appending truncate log transaction before flushing

Fri Feb 15 01:21:07 PST 2019

Hi Changwei,

I just need more time to review this.

Thanks,
Jun

On 2019/2/15 16:27, Changwei Ge wrote:
> Hi Jun,
> 
> Do you have any other questions, advice or concerns?
> I would appreciate explicit feedback (ack/nack) if you already understand the problem and my way of fixing it.
> 
> Thanks,
> Changwei
> 
> On 2019/2/14 18:25, Changwei Ge wrote:
>> On 2019/2/14 18:06, piaojun wrote:
>>> Hi Changwei,
>>>
>>> On 2019/2/14 16:53, Changwei Ge wrote:
>>>> Hi Jun,
>>>>
>>>> Thanks for looking into this :-)
>>>>
>>>> On 2019/2/14 16:24, piaojun wrote:
>>>>> Hi Changwei,
>>>>>
>>>>> On 2019/2/14 12:03, Changwei Ge wrote:
>>>>>> Appending to the truncate log (TA) and flushing the truncate log (TF)
>>>>>> are two separate transactions. Both can be committed but not yet
>>>>>> checkpointed. If a crash occurs then, both transactions will be replayed
>>>>>> even though several clusters were already released to the global bitmap.
>>>>>
>>>>> Do you mean that both transactions will release clusters to the
>>>>> global bitmap? But I think TA won't give clusters back to the global
>>>>> bitmap.
>>>>>
>>>>
>>>> No, I don't mean that both TA and TF are releasing clusters to global bitmap.
>>>>
>>>> But consider how clusters are reclaimed: clusters are first recorded in the truncate
>>>> log and then returned to the global bitmap, which involves the TA and TF jbd2 transactions.
>>>>
>>>> TA's job is to append cluster records to truncate log, by which we can overcome a potential space leak problem.
>>>> TF's job is to return clusters to global bitmap.
>>>>
>>>> It's possible that TA and TF are both committed to JBD but sadly neither of them is checkpointed.
>>>> So journal replay needs to replay both TA and TF during the next mount.
>>>> Then there is a record residing in the truncate log representing the already released cluster,
>>>> which has been returned to the global bitmap by replaying TF.
>>>>
>>>> Now the double free shows up.
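
To make the sequence above concrete, here is a minimal userspace sketch in C. It is not ocfs2 code: `global_bitmap`, `struct tl_model`, `model_free_cluster`, `model_replay_truncate_log` and `model_recover` are hypothetical names invented for illustration. The post-crash state (bit already cleared by the replayed TF, TA's record still present in the truncate log) is simply taken from the description above as a starting point; the sketch does not model jbd2 and does not show how that state comes about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CLUSTERS   16	/* size of the toy bitmap (assumption) */
#define TL_MAX        4		/* capacity of the toy truncate log (assumption) */
#define MODEL_CLUSTER 5		/* the cluster being reclaimed in this sketch */

/* Hypothetical stand-in for the global bitmap: true means allocated. */
static bool global_bitmap[NR_CLUSTERS];

/* Hypothetical stand-in for the on-disk truncate log. */
struct tl_model {
	unsigned int used;		/* number of valid records */
	unsigned int recs[TL_MAX];	/* cluster numbers waiting to be freed */
};

/* Free one cluster; returns -1 if its bit is already clear (double free). */
static int model_free_cluster(unsigned int cluster)
{
	assert(cluster < NR_CLUSTERS);
	if (!global_bitmap[cluster]) {
		fprintf(stderr, "trying to clear an already clear bit at offset %u\n",
			cluster);
		return -1;
	}
	global_bitmap[cluster] = false;
	return 0;
}

/* Replay every record in the log; returns the number of failed frees. */
static int model_replay_truncate_log(struct tl_model *tl)
{
	int errors = 0;

	while (tl->used > 0) {
		tl->used--;
		if (model_free_cluster(tl->recs[tl->used]) < 0)
			errors++;
	}
	return errors;
}

/*
 * Truncate log recovery at mount time. When tf_already_replayed is true,
 * the bitmap bit was cleared by the replayed TF while TA's record is
 * still in the log, which is the state described in the mail above.
 */
static int model_recover(bool tf_already_replayed)
{
	struct tl_model tl = { .used = 1, .recs = { MODEL_CLUSTER } };

	global_bitmap[MODEL_CLUSTER] = !tf_already_replayed;
	return model_replay_truncate_log(&tl);
}
```

Under these assumptions `model_recover(false)` returns 0 (the record is freed exactly once), while `model_recover(true)` returns 1 and prints the error, which is the toy analogue of the double free.
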
>>>
>>> Do you mean that on the next mount, truncate log recovery will find a
>>> record in the truncate log for a cluster that was already released? But
>>> after the TF transaction is replayed during mount, the truncate log won't
>>> be recovered, as tl->tl_used is less than tl->tl_count.
>>
>> Um, it is not just truncate log replay; it also involves a jbd2 transaction recording the last append operation.
>> That operation may meet the flush condition (ocfs2_truncate_log_needs_flush).
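
For readers following the tl_used/tl_count argument, the flush condition can be sketched in userspace C as below. This assumes the condition is "the log is full", that is `tl_used == tl_count`; that is a reading of the discussion, not a copy of the kernel function, so check fs/ocfs2/alloc.c before relying on it. `struct tl_sketch`, `tl_sketch_needs_flush` and `tl_sketch_append` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, host-endian stand-in for the truncate log header fields. */
struct tl_sketch {
	uint16_t tl_count;	/* capacity in records */
	uint16_t tl_used;	/* records currently in use */
};

/* Assumed flush condition: the log has no free record slots left. */
static bool tl_sketch_needs_flush(const struct tl_sketch *tl)
{
	assert(tl->tl_used <= tl->tl_count);
	return tl->tl_used == tl->tl_count;
}

/*
 * Append one record (the TA step in this thread). Returns true when this
 * append filled the log, meaning the next step would have to be a flush.
 */
static bool tl_sketch_append(struct tl_sketch *tl)
{
	assert(!tl_sketch_needs_flush(tl));
	tl->tl_used++;
	return tl_sketch_needs_flush(tl);
}
```

With a capacity of 2, the first append leaves the log below the threshold and the second one is the "last append" that meets the flush condition.
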
>>
>> Thanks,
>> Changwei
>>
>>>
>>> Thanks,
>>> Jun
>>>
>>>>
>>>>
>>>>>> Then the truncate log will be replayed, resulting in a cluster double free.
>>>>>
>>>>> Does this problem only cause an error log message, as below?
>>>>>
>>>>> ocfs2_replay_truncate_records
>>>>>      ocfs2_free_clusters
>>>>>        _ocfs2_free_clusters
>>>>>          _ocfs2_free_suballoc_bits
>>>>>            ocfs2_block_group_clear_bits
>>>>>              "Trying to clear %u bits at offset %u in group descriptor"
>>>>>
>>>>
>>>> Exactly, when the issue occurs, it will be printed as above.
>>>>
>>>> Thanks,
>>>> Changwei
>>>>
>>>>> Thanks,
>>>>> Jun
>>>>>
>>>>>>
>>>>>> To reproduce this issue, just crash the host while punching holes in files.
>>>>>>
>>>>>> Signed-off-by: Changwei Ge <ge.changwei at h3c.com>
>>>>>> ---
>>>>>>     fs/ocfs2/alloc.c | 15 +++++++++++++++
>>>>>>     1 file changed, 15 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
>>>>>> index d1cbb27..29bc777 100644
>>>>>> --- a/fs/ocfs2/alloc.c
>>>>>> +++ b/fs/ocfs2/alloc.c
>>>>>> @@ -6007,6 +6007,7 @@ int __ocfs2_flush_truncate_log(struct ocfs2_super *osb)
>>>>>>     	struct buffer_head *data_alloc_bh = NULL;
>>>>>>     	struct ocfs2_dinode *di;
>>>>>>     	struct ocfs2_truncate_log *tl;
>>>>>> +	struct ocfs2_journal *journal = osb->journal;
>>>>>>     
>>>>>>     	BUG_ON(inode_trylock(tl_inode));
>>>>>>     
>>>>>> @@ -6027,6 +6028,20 @@ int __ocfs2_flush_truncate_log(struct ocfs2_super *osb)
>>>>>>     		goto out;
>>>>>>     	}
>>>>>>     
>>>>>> +	/* Appending to the truncate log (TA) and flushing the truncate log
>>>>>> +	 * (TF) are two separate transactions. Both can be committed but not
>>>>>> +	 * checkpointed. If a crash occurs then, both transactions will be
>>>>>> +	 * replayed even though clusters were already released to the global
>>>>>> +	 * bitmap. The truncate log replay then causes a cluster double free.
>>>>>> +	 */
>>>>>> +	jbd2_journal_lock_updates(journal->j_journal);
>>>>>> +	status = jbd2_journal_flush(journal->j_journal);
>>>>>> +	jbd2_journal_unlock_updates(journal->j_journal);
>>>>>> +	if (status < 0) {
>>>>>> +		mlog_errno(status);
>>>>>> +		goto out;
>>>>>> +	}
>>>>>> +
>>>>>>     	data_alloc_inode = ocfs2_get_system_file_inode(osb,
>>>>>>     						       GLOBAL_BITMAP_SYSTEM_INODE,
>>>>>>     						       OCFS2_INVALID_SLOT);
>>>>>>
>>>>>