[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

Marek Krolikowski admin at wset.edu.pl
Tue Jan 3 22:22:58 PST 2012


Hello
This is a test cluster, so I can create a new partition again and run the
tests for you. But you need to tell me exactly what I should do.
For now I am running kernel 3.1.6 with all 10 patches from:
https://wizja2.tktelekom.pl/ocfs2/2011.12.28-3.1.6/
Status:
With quota enabled I got: https://wizja2.tktelekom.pl/ocfs2/2012.01.03-3.1.6/
Without quota: no problems, ocfs2 looks stable.
Cheers


-----Original Message-----
From: Srinivas Eeda
Sent: Tuesday, January 03, 2012 6:35 PM
To: Marek Krolikowski
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

Thanks, and Happy New Year to you as well. Thanks for the kernel stacks.
This looks like the deadlock I was expecting, and it needs to be fixed. Can
you please file a bugzilla bug for this? The bug you already filed,
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1339, is for different
kernel stacks that were reported for a different reason.

With the changes I gave you, you shouldn't be running into bugzilla
1339, but you will run into the current deadlock.

What you will need is a new fix for the deadlock on top of the current
changes, but that will take some time. Meanwhile I propose the following
workarounds.

1) If you can afford not to do too many deletes, you can run a kernel
without my changes. Schedule all deletes for a particular time, run them,
and then umount and remount the volume. This will create a lot of orphans,
but they will get cleared during the umount (see the sketch below).

2) Or you can run with my changes but disable quotas.

These are only workarounds till we fix the deadlock.
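
A minimal sketch of workaround 1, assuming it runs from a nightly cron job;
the mount point, device and delete job below are only placeholders, not
values from your setup:

#!/bin/sh
# run the scheduled deletes, then cycle the mount so the orphans get reaped
find /cluster/mail -name '*.tmp' -type f -delete    # placeholder delete job
umount /cluster/mail
mount -t ocfs2 /dev/sdX /cluster/mail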

Thanks,
--Srini

Marek Krolikowski wrote:
> Hello and happy new year!
> I enabled quota and got an oops on both servers, and I can't log in - the
> console freezes after I give the correct login and password.
> I did sysrq t, s, b and this is what I got:
> https://wizja2.tktelekom.pl/ocfs2/2012.01.03-3.1.6/
> Do you need anything else?
> Cheers!
>
>
> -----Original Message----- From: srinivas eeda
> Sent: Friday, December 23, 2011 10:52 PM
> To: Marek Królikowski
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from 
> both
>
> Please press the sysrq key and t to dump the kernel stacks on both nodes,
> and please email me the messages files.
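>
> If the keyboard is not reachable, the same dump can be triggered from a
> shell (a rough sketch; where the messages end up depends on your syslog
> configuration):
>
> echo 1 > /proc/sys/kernel/sysrq       # make sure sysrq is enabled
> echo t > /proc/sysrq-trigger          # dump all task stacks to the kernel log
> dmesg > /tmp/stacks-$(hostname).txt   # or pull them from /var/log/messages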
>
> On 12/23/2011 1:19 PM, Marek Królikowski wrote:
>> Hello
>> I got an oops on TEST-MAIL2:
>>
>> INFO: task ocfs2dc:15430 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> ocfs2dc         D ffff88107f232c40     0 15430      2 0x00000000
>> ffff881014889080 0000000000000046 ffff881000000000 ffff88102060c080
>> 0000000000012c40 ffff88101eefbfd8 0000000000012c40 ffff88101eefa010
>> ffff88101eefbfd8 0000000000012c40 0000000000000001 00000001130a4380
>> Call Trace:
>> [<ffffffff8148db41>] ? __mutex_lock_slowpath+0xd1/0x140
>> [<ffffffff8148da53>] ? mutex_lock+0x23/0x40
>> [<ffffffff81181eb6>] ? dqget+0x246/0x3a0
>> [<ffffffff81182281>] ? __dquot_initialize+0x121/0x210
>> [<ffffffff8114c90d>] ? d_kill+0x9d/0x100
>> [<ffffffffa0a601c3>] ? ocfs2_find_local_alias+0x23/0x100 [ocfs2]
>> [<ffffffffa0a7fca8>] ? ocfs2_delete_inode+0x98/0x3e0 [ocfs2]
>> [<ffffffffa0a7106c>] ? ocfs2_unblock_lock+0x10c/0x770 [ocfs2]
>> [<ffffffffa0a80969>] ? ocfs2_evict_inode+0x19/0x40 [ocfs2]
>> [<ffffffff8114e9cc>] ? evict+0x8c/0x170
>> [<ffffffffa0a5fccd>] ? ocfs2_dentry_lock_put+0x5d/0x90 [ocfs2]
>> [<ffffffffa0a7177a>] ? ocfs2_process_blocked_lock+0xaa/0x280 [ocfs2]
>> [<ffffffff8107beb2>] ? prepare_to_wait+0x82/0x90
>> [<ffffffff8107bceb>] ? finish_wait+0x4b/0xa0
>> [<ffffffffa0a71aa0>] ? ocfs2_downconvert_thread+0x150/0x270 [ocfs2]
>> [<ffffffff8107bb60>] ? wake_up_bit+0x40/0x40
>> [<ffffffffa0a71950>] ? ocfs2_process_blocked_lock+0x280/0x280 [ocfs2]
>> [<ffffffffa0a71950>] ? ocfs2_process_blocked_lock+0x280/0x280 [ocfs2]
>> [<ffffffff8107b686>] ? kthread+0x96/0xa0
>> [<ffffffff81498a74>] ? kernel_thread_helper+0x4/0x10
>> [<ffffffff8107b5f0>] ? kthread_worker_fn+0x190/0x190
>> [<ffffffff81498a70>] ? gs_change+0x13/0x13
>> INFO: task kworker/0:1:30806 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> kworker/0:1     D ffff88107f212c40     0 30806      2 0x00000000
>> ffff8810152f4080 0000000000000046 0000000000000000 ffffffff81a0d020
>> 0000000000012c40 ffff880c28a57fd8 0000000000012c40 ffff880c28a56010
>> ffff880c28a57fd8 0000000000012c40 ffff880c28a57a08 00000001152f4080
>> Call Trace:
>> [<ffffffff8148d45d>] ? schedule_timeout+0x1ed/0x2d0
>> [<ffffffffa05244c0>] ? __jbd2_journal_file_buffer+0xd0/0x230 [jbd2]
>> [<ffffffff8148ce5c>] ? wait_for_common+0x12c/0x1a0
>> [<ffffffff81052230>] ? try_to_wake_up+0x280/0x280
>> [<ffffffff81085e21>] ? ktime_get+0x61/0xf0
>> [<ffffffffa0a6e850>] ? __ocfs2_cluster_lock+0x1f0/0x780 [ocfs2]
>> [<ffffffff81046fa7>] ? find_busiest_group+0x1f7/0xb00
>> [<ffffffffa0a73a56>] ? ocfs2_inode_lock_full_nested+0x126/0x540 [ocfs2]
>> [<ffffffffa0ad4da9>] ? ocfs2_lock_global_qf+0x29/0xd0 [ocfs2]
>> [<ffffffffa0ad4da9>] ? ocfs2_lock_global_qf+0x29/0xd0 [ocfs2]
>> [<ffffffffa0ad71df>] ? ocfs2_sync_dquot_helper+0xbf/0x330 [ocfs2]
>> [<ffffffffa0ad7120>] ? ocfs2_acquire_dquot+0x390/0x390 [ocfs2]
>> [<ffffffff81181c3a>] ? dquot_scan_active+0xda/0x110
>> [<ffffffffa0ad4ca0>] ? ocfs2_global_is_id+0x60/0x60 [ocfs2]
>> [<ffffffffa0ad4cc1>] ? qsync_work_fn+0x21/0x40 [ocfs2]
>> [<ffffffff810753f3>] ? process_one_work+0x123/0x450
>> [<ffffffff8107690b>] ? worker_thread+0x15b/0x370
>> [<ffffffff810767b0>] ? manage_workers+0x110/0x110
>> [<ffffffff810767b0>] ? manage_workers+0x110/0x110
>> [<ffffffff8107b686>] ? kthread+0x96/0xa0
>> [<ffffffff81498a74>] ? kernel_thread_helper+0x4/0x10
>> [<ffffffff8107b5f0>] ? kthread_worker_fn+0x190/0x190
>> [<ffffffff81498a70>] ? gs_change+0x13/0x13
>>
>> And I can't log in to TEST-MAIL1: after I give the login and password the
>> console prints the lastlog line, but I never get a bash prompt - the console
>> does not respond... and there is no OOPS or anything like that on the screen.
>> I have not restarted either server; tell me what to do now.
>> Thanks
>>
>>
>> -----Original Message----- From: srinivas eeda
>> Sent: Thursday, December 22, 2011 9:12 PM
>> To: Marek Królikowski
>> Cc: ocfs2-users at oss.oracle.com
>> Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from 
>> both
>>
>> We need to know what happened to node 2. Was the node rebooted because
>> of a network timeout or a kernel panic? Can you please configure
>> netconsole and a serial console and rerun the test?
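>>
>> A netconsole sketch, in case it helps (the IP addresses, interface and MAC
>> below are placeholders for a separate log-collecting machine, not values
>> from your cluster):
>>
>> # on the node under test: forward kernel messages over UDP
>> modprobe netconsole netconsole=6665@192.0.2.11/eth0,6666@192.0.2.99/00:11:22:33:44:55
>> # on the collecting machine: capture whatever arrives on that port
>> nc -u -l -p 6666 | tee netconsole-TEST-MAIL2.log   # some netcats want 'nc -u -l 6666'
>>
>> For the serial console, booting with something like
>> "console=ttyS0,115200 console=tty0" on the kernel command line and watching
>> ttyS0 from another machine should catch a panic that never makes it to the
>> disk logs.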
>>
>> On 12/22/2011 8:08 AM, Marek Królikowski wrote:
>>> Hello
>>> After 24 hours I saw TEST-MAIL2 reboot (possibly a kernel panic), but
>>> TEST-MAIL1 has this in dmesg:
>>> TEST-MAIL1 ~ #dmesg
>>> [cut]
>>> o2net: accepted connection from node TEST-MAIL2 (num 1) at 
>>> 172.17.1.252:7777
>>> o2dlm: Node 1 joins domain B24C4493BBC74FEAA3371E2534BB3611
>>> o2dlm: Nodes in domain B24C4493BBC74FEAA3371E2534BB3611: 0 1
>>> o2net: connection to node TEST-MAIL2 (num 1) at 172.17.1.252:7777 has 
>>> been idle for 60.0 seconds, shutting it down.
>>> (swapper,0,0):o2net_idle_timer:1562 Here are some times that might help 
>>> debug the situation: (Timer: 33127732045, Now 33187808090, DataReady 
>>> 33127732039, Advance 33127732051-33127732051, Key 0xebb9cd47, Func 506, 
>>> FuncTime 33127732045-33127732048)
>>> o2net: no longer connected to node TEST-MAIL2 (num 1) at 
>>> 172.17.1.252:7777
>>> (du,5099,12):dlm_do_master_request:1324 ERROR: link to 1 went down!
>>> (du,5099,12):dlm_get_lock_resource:907 ERROR: status = -112
>>> (dlm_thread,14321,1):dlm_send_proxy_ast_msg:484 ERROR: 
>>> B24C4493BBC74FEAA3371E2534BB3611: res M000000000000000000000cf023ef70, 
>>> error -112 send AST to node 1
>>> (dlm_thread,14321,1):dlm_flush_asts:605 ERROR: status = -112
>>> (dlm_thread,14321,1):dlm_send_proxy_ast_msg:484 ERROR: 
>>> B24C4493BBC74FEAA3371E2534BB3611: res P000000000000000000000000000000, 
>>> error -107 send AST to node 1
>>> (dlm_thread,14321,1):dlm_flush_asts:605 ERROR: status = -107
>>> (kworker/u:3,5071,0):o2net_connect_expired:1724 ERROR: no connection 
>>> established with node 1 after 60.0 seconds, giving up and returning 
>>> errors.
>>> (o2hb-B24C4493BB,14310,0):o2dlm_eviction_cb:267 o2dlm has evicted node 1 
>>> from group B24C4493BBC74FEAA3371E2534BB3611
>>> (ocfs2rec,5504,6):dlm_get_lock_resource:834 
>>> B24C4493BBC74FEAA3371E2534BB3611:M0000000000000000000015f023ef70: at 
>>> least one node (1) to recover before lock mastery can begin
>>> (ocfs2rec,5504,6):dlm_get_lock_resource:888 
>>> B24C4493BBC74FEAA3371E2534BB3611:M0000000000000000000015f023ef70: at 
>>> least one node (1) to recover before lock mastery can begin
>>> (du,5099,12):dlm_restart_lock_mastery:1213 ERROR: node down! 1
>>> (du,5099,12):dlm_wait_for_lock_mastery:1030 ERROR: status = -11
>>> (du,5099,12):dlm_get_lock_resource:888 
>>> B24C4493BBC74FEAA3371E2534BB3611:N000000000020924f: at least one node 
>>> (1) to recover before lock mastery can begin
>>> (dlm_reco_thread,14322,0):dlm_get_lock_resource:834 
>>> B24C4493BBC74FEAA3371E2534BB3611:$RECOVERY: at least one node (1) to 
>>> recover before lock mastery can begin
>>> (dlm_reco_thread,14322,0):dlm_get_lock_resource:868 
>>> B24C4493BBC74FEAA3371E2534BB3611: recovery map is not empty, but must 
>>> master $RECOVERY lock now
>>> (dlm_reco_thread,14322,0):dlm_do_recovery:523 (14322) Node 0 is the 
>>> Recovery Master for the Dead Node 1 for Domain 
>>> B24C4493BBC74FEAA3371E2534BB3611
>>> (ocfs2rec,5504,6):ocfs2_replay_journal:1549 Recovering node 1 from slot 
>>> 1 on device (253,0)
>>> (ocfs2rec,5504,6):ocfs2_begin_quota_recovery:407 Beginning quota 
>>> recovery in slot 1
>>> (kworker/u:0,2909,0):ocfs2_finish_quota_recovery:599 Finishing quota 
>>> recovery in slot 1
>>>
>>> And I tried these commands:
>>> debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP 
>>> allow
>>> debugfs.ocfs2: Unable to write log mask "ENTRY": No such file or 
>>> directory
>>> debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP 
>>> off
>>> debugfs.ocfs2: Unable to write log mask "ENTRY": No such file or 
>>> directory
>>>
>>> But they are not working....
>>>
>>>
>>> -----Original Message----- From: Srinivas Eeda
>>> Sent: Wednesday, December 21, 2011 8:43 PM
>>> To: Marek Królikowski
>>> Cc: ocfs2-users at oss.oracle.com
>>> Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from 
>>> both
>>>
>>> Those numbers look good. Basically, with the fixes backed out and the other
>>> fix I gave you, you are not seeing that many orphans hanging around, and
>>> hence not seeing the stuck-process kernel stacks. You can run the test
>>> longer, or if you are satisfied, please enable quotas and re-run the test
>>> with the modified kernel. You might see a deadlock that needs to be
>>> fixed (I have not been able to reproduce it yet). If the system hangs,
>>> please capture the following and send me the output:
>>>
>>> 1. echo t > /proc/sysrq-trigger
>>> 2. debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC 
>>> EXTENT_MAP allow
>>> 3. wait for 10 minutes
>>> 4. debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC 
>>> EXTENT_MAP off
>>> 5. echo t > /proc/sysrq-trigger
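>>>
>>> If it is easier, the five steps can be wrapped in one small script and run
>>> on the hung node (a sketch; the output file name is only an example):
>>>
>>> #!/bin/sh
>>> echo t > /proc/sysrq-trigger
>>> debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP allow
>>> sleep 600    # wait for 10 minutes
>>> debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP off
>>> echo t > /proc/sysrq-trigger
>>> dmesg > /tmp/ocfs2-hang-stacks.txt   # the stacks also land in the messages file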
>>>
>>
> 



