[Ocfs2-devel] [PATCH] ocfs2: checkpoint appending truncate log transaction before flushing

Thu Feb 14 02:23:19 PST 2019

On 2019/2/14 18:06, piaojun wrote:
> Hi Changwei,
> 
> On 2019/2/14 16:53, Changwei Ge wrote:
>> Hi Jun,
>>
>> Thanks for looking into this :-)
>>
>> On 2019/2/14 16:24, piaojun wrote:
>>> Hi Changwei,
>>>
>>> On 2019/2/14 12:03, Changwei Ge wrote:
>>>> Appending truncate log(TA) and and flushing truncate log(TF) are
>>>> two separated transactions. They can be both committed but not
>>>> checkpointed. If crash occurs then, both two transaction will be
>>>> replayed with several already released to global bitmap clusters.
>>>
>>> Do you mean that both the two transactions will release cluster to
>>> global bitmap? But I think the TA won't give back clusters to global
>>> bitmap.
>>>
>>
>> No, I don't mean that both TA and TF are releasing clusters to global bitmap.
>>
>> But consideration into clusters reclaim , clusters will first be recorded in truncate
>> log and then be returned to global bitmap, which involves TA and TF jdb2/transactions.
>>
>> TA's job is to append cluster records to truncate log, by which we can overcome a potential space leak problem.
>> TF's job is to return clusters to global bitmap.
>>
>> It's possible that TA and TF are both committed to JBD but sadly none of them is check-pointed.
>> So journal replaying need to replay both TA and TF during next mount.
>> Then there is a record residing in truncate log representing the already released cluster
>> which has been returned to global bitmap by replaying TF.
>>
>> Now the double free shows up.
> 
> Do you mean that when mount again, truncate log recovery will find
> record residing in truncate log which already released? But after the
> TF transaction replayed during mount, truncate log won't be recovered
> as tl->tl_used is less than tl->tl_count.

Um, not just truncate log relaying but also involves a jbd2 transaction recording its last append operation.
That operation may meet the flush condition (ocfs2_truncate_log_needs_flush)

Thanks,
Changwei

> 
> Thanks,
> Jun
> 
>>
>>
>>>> Then truncate log will be replayed resulting in cluster double free.
>>>
>>> Does this problem only cause some error log? As below:
>>>
>>> ocfs2_replay_truncate_records
>>>     ocfs2_free_clusters
>>>       _ocfs2_free_clusters
>>>         _ocfs2_free_suballoc_bits
>>>           ocfs2_block_group_clear_bits
>>>             "Trying to clear %u bits at offset %u in group descriptor"
>>>
>>
>> Exactly, when the issue occurs, it will be printed as above.
>>
>> Thanks,
>> Changwei
>>
>>> Thanks,
>>> Jun
>>>
>>>>
>>>> To reproduce this issue, just crash the host while punching hole to files.
>>>>
>>>> Signed-off-by: Changwei Ge <ge.changwei at h3c.com>
>>>> ---
>>>>    fs/ocfs2/alloc.c | 15 +++++++++++++++
>>>>    1 file changed, 15 insertions(+)
>>>>
>>>> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
>>>> index d1cbb27..29bc777 100644
>>>> --- a/fs/ocfs2/alloc.c
>>>> +++ b/fs/ocfs2/alloc.c
>>>> @@ -6007,6 +6007,7 @@ int __ocfs2_flush_truncate_log(struct ocfs2_super *osb)
>>>>    	struct buffer_head *data_alloc_bh = NULL;
>>>>    	struct ocfs2_dinode *di;
>>>>    	struct ocfs2_truncate_log *tl;
>>>> +	struct ocfs2_journal *journal = osb->journal;
>>>>    
>>>>    	BUG_ON(inode_trylock(tl_inode));
>>>>    
>>>> @@ -6027,6 +6028,20 @@ int __ocfs2_flush_truncate_log(struct ocfs2_super *osb)
>>>>    		goto out;
>>>>    	}
>>>>    
>>>> +	/* Appending truncate log(TA) and and flushing truncate log(TF) are
>>>> +	 * two separated transactions. They can be both committed but not
>>>> +	 * checkpointed. If crash occurs then, both two transaction will be
>>>> +	 * replayed with several already released to global bitmap clusters.
>>>> +	 * Then truncate log will be replayed resulting in cluster double free.
>>>> +	 */
>>>> +	jbd2_journal_lock_updates(journal->j_journal);
>>>> +	status = jbd2_journal_flush(journal->j_journal);
>>>> +	jbd2_journal_unlock_updates(journal->j_journal);
>>>> +	if (status < 0) {
>>>> +		mlog_errno(status);
>>>> +		goto out;
>>>> +	}
>>>> +
>>>>    	data_alloc_inode = ocfs2_get_system_file_inode(osb,
>>>>    						       GLOBAL_BITMAP_SYSTEM_INODE,
>>>>    						       OCFS2_INVALID_SLOT);
>>>>
>>>
>> .
>>
>