[Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
Srinivas Eeda
srinivas.eeda at oracle.com
Mon Feb 24 15:30:17 PST 2014
Junxiao, thanks for looking into this issue. Please see my comments below.
On 02/24/2014 01:07 AM, Junxiao Bi wrote:
> Hi,
>
> On 07/19/2012 09:59 AM, Sunil Mushran wrote:
>> Different issues.
>>
>> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi
>> <junxiao.bi at oracle.com> wrote:
>>
>> On 07/19/2012 12:36 AM, Sunil Mushran wrote:
>>> This bug was detected during code audit. Never seen a crash. If it
>>> does hit, then we have bigger problems. So no point posting to stable.
>>
> I read a lot of the dlm recovery code recently, and I found that this
> bug can happen in the following scenario.
>
> node 1 (migration source):
>
>   dlm_unregister_domain()
>     dlm_migrate_all_locks()
>       dlm_empty_lockres()
>         /* selects node x as the migration target, since there is
>          * a node x lock on the granted list */
>         dlm_migrate_lockres()
>           dlm_mark_lockres_migrating() {
>             wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
>             <<< a node x unlock may happen here, so the res->granted
>                 list can become empty.
If the unlock request got sent at this point, and if the request was
*processed*, the lock must have been removed from the granted list. If
the request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING
flag set in dlm_lockres_release_ast would make the dlm_unlock handler
return DLM_MIGRATING to the caller (node x in this case). So I don't
see how the granted list could hold a stale lock. Am I missing
something?

I do think the race you point out below exists, but I am not sure it
is caused by the scenario described above.
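For reference, the bail-out I mean has roughly this shape (paraphrased
from dlmunlock.c from memory, so the exact surrounding guards may
differ; it is a sketch, not the verbatim source):

    /* sketch: once the lockres is marked MIGRATING, the unlock
     * handler backs off without touching the lock lists, and the
     * caller sees DLM_MIGRATING */
    spin_lock(&res->spinlock);
    if (res->state & DLM_LOCK_RES_MIGRATING) {
            status = DLM_MIGRATING;
            goto leave;
    }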
>             dlm_lockres_release_ast(dlm, res);
>           }
>           dlm_send_one_lockres()
>
> migrate target node x:
>
>   dlm_process_recovery_data() {
>       /* tmpq is the res->granted list, and it is empty */
>       list_for_each_entry(lock, tmpq, list) {
>           if (lock->ml.cookie != ml->cookie)
>               lock = NULL;
>           else
>               break;
>       }
>       /* lock is invalid here */
>       if (lock->ml.node != ml->node)
>           BUG();    <<< crash here
>   }
>
> Thanks,
> Junxiao.
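To make that failure mode concrete: on an empty list,
list_for_each_entry() still assigns the cursor; it just ends up
pointing at container_of(head), i.e. into the middle of whatever
structure holds the list head. A minimal userspace sketch (my own
illustration with a reimplemented list helper, not kernel code; struct
fake_lock is a stand-in for struct dlm_lock):

    #include <stddef.h>
    #include <stdio.h>

    /* just enough of the kernel's list machinery to show the bug */
    struct list_head { struct list_head *next, *prev; };

    #define container_of(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

    #define list_for_each_entry(pos, head, member)                         \
            for (pos = container_of((head)->next, __typeof__(*pos), member); \
                 &pos->member != (head);                                   \
                 pos = container_of(pos->member.next, __typeof__(*pos), member))

    struct fake_lock {
            unsigned long long cookie;
            struct list_head list;
    };

    int main(void)
    {
            struct list_head granted = { &granted, &granted }; /* empty */
            struct fake_lock *lock = NULL;

            /* The body never runs, but the cursor is still assigned:
             * it becomes container_of(&granted), a bogus non-NULL
             * pointer into the stack frame, not a real fake_lock. */
            list_for_each_entry(lock, &granted, list)
                    ;

            printf("lock = %p (bogus, not NULL)\n", (void *)lock);
            /* Reading lock->cookie here, as the old code effectively
             * did with lock->ml.node, would read stack garbage. */
            return 0;
    }

So the old loop exits with a garbage, non-NULL "lock", sails past the
!lock check, and the later lock->ml.node test (or the BUG()) blows up,
which matches the crash in the scenario above.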
>>
>> Our customer can reproduce it. Also, I saw you were assigned a
>> similar bug before (see
>> https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220). Is it the
>> same bug?
>>>
>>> On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi
>>> <junxiao.bi at oracle.com> wrote:
>>>
>>> Hi Sunil,
>>>
>>> On 07/18/2012 03:49 AM, Sunil Mushran wrote:
>>>> On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi
>>>> <junxiao.bi at oracle.com> wrote:
>>>>
>>>> On the target node of a dlm lock migration, the logic that finds
>>>> the local dlm lock is wrong: it must not modify the loop cursor
>>>> "lock" inside the list_for_each_entry loop. Once the cursor is set
>>>> to NULL, the loop's own increment step dereferences it, causing a
>>>> NULL-pointer crash.
>>>>
>>>> Signed-off-by: Junxiao Bi <junxiao.bi at oracle.com>
>>>> Cc: stable at vger.kernel.org
>>>> ---
>>>> fs/ocfs2/dlm/dlmrecovery.c | 12 +++++++-----
>>>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>>>> index 01ebfd0..0b9cc88 100644
>>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>>> @@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
>>>>          u8 from = O2NM_MAX_NODES;
>>>>          unsigned int added = 0;
>>>>          __be64 c;
>>>> +        int found;
>>>>
>>>>          mlog(0, "running %d locks for this lockres\n", mres->num_locks);
>>>>          for (i=0; i<mres->num_locks; i++) {
>>>> @@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
>>>>                          /* MIGRATION ONLY! */
>>>>                          BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));
>>>>
>>>> +                        found = 0;
>>>>                          spin_lock(&res->spinlock);
>>>>                          for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
>>>>                                  tmpq = dlm_list_idx_to_ptr(res, j);
>>>>                                  list_for_each_entry(lock, tmpq, list) {
>>>> -                                        if (lock->ml.cookie != ml->cookie)
>>>> -                                                lock = NULL;
>>>> -                                        else
>>>> +                                        if (lock->ml.cookie == ml->cookie) {
>>>> +                                                found = 1;
>>>>                                                  break;
>>>> +                                        }
>>>>                                  }
>>>> -                                if (lock)
>>>> +                                if (found)
>>>>                                          break;
>>>>                          }
>>>>
>>>>                          /* lock is always created locally first, and
>>>>                           * destroyed locally last. it must be on the list */
>>>> -                        if (!lock) {
>>>> +                        if (!found) {
>>>>                                  c = ml->cookie;
>>>>                                  mlog(ML_ERROR, "Could not find local lock "
>>>>                                         "with cookie %u:%llu, node %u, "
>>>>
>>>> https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47
>>>>
>>>>
>>>> We had decided to go back to list_for_each().
>>>
>>> OK, thank you. It's fine to revert it to fix an introduced bug,
>>> but I think you'd better cc the stable branch.
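A note on going back to list_for_each(), mentioned above: with
list_for_each() a separate struct list_head pointer drives the loop,
so "lock" can be reset freely and also stays NULL when a list is
empty. A sketch of that shape, reconstructed by me under the
assumption that the surrounding code stays as in the diff above, not
quoted from a tested patch:

    struct list_head *iter;
    struct dlm_lock *lock = NULL;

    for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
            tmpq = dlm_list_idx_to_ptr(res, j);
            list_for_each(iter, tmpq) {
                    lock = list_entry(iter, struct dlm_lock, list);
                    if (lock->ml.cookie == ml->cookie)
                            break;
                    /* safe: iter, not lock, drives the iteration */
                    lock = NULL;
            }
            if (lock)
                    break;
    }
    /* lock is either the matching dlm_lock or NULL here, even if
     * every list was empty */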