[Ocfs2-users] Linux kernel crash due to ocfs2

Fri Sep 16 11:00:19 PDT 2011

I got it. But I still don't see the symbols. Maybe we are corrupting the stack.
Maybe this is ppc specific. Do you have an x86/x86_64 box that can access
the same volume? If so, I could give you a drop of the same for that arch.
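
For reference, here is a minimal sketch of how the raw frame addresses in the glibc backtrace could be turned into function names, assuming the binary really carries usable debug info and that binutils (addr2line) on the box understands ppc objects. The file name backtrace.txt and the helper extract_frames are made up for illustration; only ./o2image.ppc.dbg comes from this thread.

```shell
#!/bin/sh
# Sketch only: pull the frame addresses that belong to one binary out of a
# saved glibc backtrace and hand them to addr2line.

# Read backtrace text on stdin and print one hex address per line for the
# frames of binary $1. Mail quote prefixes ("> ") in front of a frame are
# tolerated because only the bracketed address is kept.
extract_frames() {
    grep -F "$1[0x" | sed -e 's/.*\[\(0x[0-9a-fA-F]*\)\].*/\1/'
}

BIN=./o2image.ppc.dbg     # debug build named earlier in this thread
TRACE=backtrace.txt       # hypothetical file holding the pasted backtrace

# Resolve the addresses only when all inputs are actually present.
if [ -r "$TRACE" ] && [ -r "$BIN" ] && command -v addr2line >/dev/null 2>&1; then
    # -f prints the function name, -e selects the executable to inspect.
    extract_frames "$BIN" < "$TRACE" | xargs addr2line -f -e "$BIN"
fi
```

If every frame comes back as "??", that would be consistent with the debug info being missing or with the stack being corrupted, but it would not tell the two apart.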

Also, have you run fsck on this volume before? One reason o2image could
fail is a bad block pointer. While it is supposed to handle all such
cases, it is known to miss some.
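
A read-only check could look roughly like the sketch below. It assumes the device path from the earlier o2image command and is best run with the volume unmounted on all nodes. The log file name, the RUN_FSCK switch and the fsck_cmd helper are made up for illustration.

```shell
#!/bin/sh
# Sketch only: read-only consistency check of an OCFS2 volume.

# Build the command line without running it, so it can be reviewed first.
# -f forces a check even if the volume is marked clean; -n answers "no" to
# every repair prompt, so the volume is not modified.
fsck_cmd() {
    printf 'fsck.ocfs2 -fn %s\n' "$1"
}

DEV=${DEV:-/dev/mapper/mpath0}   # device path used earlier in this thread
LOG=${LOG:-fsck.u02.log}         # hypothetical log file name

echo "Would run: $(fsck_cmd "$DEV")"

# Run the check only when explicitly requested and when the tool and the
# block device are really there.
if [ "${RUN_FSCK:-0}" = 1 ] && [ -b "$DEV" ] && command -v fsck.ocfs2 >/dev/null 2>&1; then
    fsck.ocfs2 -fn "$DEV" 2>&1 | tee "$LOG"
fi
```

Calling it as RUN_FSCK=1 sh check.sh would perform the check and keep a copy of the output to send along. Any bad block pointer it reports would be a candidate for what trips up o2image.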

On 09/16/2011 12:06 AM, Betzos Giorgos wrote:
> Please try http://portal-md.glk.gr/ocfs2/core.32578.bz2
>
> Please let me know, in case you have any problem downloading it.
>
> Thanks,
>
> George
>
> On Thu, 2011-09-15 at 09:45 -0700, Sunil Mushran wrote:
>> I was hoping to get a readable stack. Please could you provide a link to
>> the coredump.
>>
>> On 09/15/2011 02:51 AM, Betzos Giorgos wrote:
>>> Hello,
>>>
>>> I am sorry for the delay in responding. Unfortunately, it faulted again.
>>>
>>> Here is the log, although my email client folds the Memory Map lines.
>>> The core file is available.
>>>
>>> Thanks,
>>>
>>> George
>>>
>>> # ./o2image.ppc.dbg /dev/mapper/mpath0 /files_shared/u02.o2image
>>> *** glibc detected *** ./o2image.ppc.dbg: corrupted double-linked list: 0x10075000 ***
>>> ======= Backtrace: =========
>>> /lib/libc.so.6[0xfeb1ab4]
>>> /lib/libc.so.6(cfree+0xc8)[0xfeb5b68]
>>> ./o2image.ppc.dbg[0x1000d098]
>>> ./o2image.ppc.dbg[0x1000297c]
>>> ./o2image.ppc.dbg[0x10001eb8]
>>> ./o2image.ppc.dbg[0x1000228c]
>>> ./o2image.ppc.dbg[0x10002804]
>>> ./o2image.ppc.dbg[0x10001eb8]
>>> ./o2image.ppc.dbg[0x1000228c]
>>> ./o2image.ppc.dbg[0x10002804]
>>> ./o2image.ppc.dbg[0x10003bbc]
>>> ./o2image.ppc.dbg[0x10004480]
>>> /lib/libc.so.6[0xfe4dc60]
>>> /lib/libc.so.6[0xfe4dea0]
>>> ======= Memory map: ========
>>> 00100000-00120000 r-xp 00100000 00:00 0
>>> [vdso]
>>> 0f430000-0f440000 r-xp 00000000 08:13
>>> 180307                             /lib/libcom_err.so.2.1
>>> 0f440000-0f450000 rw-p 00000000 08:13
>>> 180307                             /lib/libcom_err.so.2.1
>>> 0f900000-0f9c0000 r-xp 00000000 08:13
>>> 180293                             /lib/libglib-2.0.so.0.1200.3
>>> 0f9c0000-0f9d0000 rw-p 000b0000 08:13
>>> 180293                             /lib/libglib-2.0.so.0.1200.3
>>> 0fa40000-0fa50000 r-xp 00000000 08:13
>>> 180292                             /lib/librt-2.5.so
>>> 0fa50000-0fa60000 r--p 00000000 08:13
>>> 180292                             /lib/librt-2.5.so
>>> 0fa60000-0fa70000 rw-p 00010000 08:13
>>> 180292                             /lib/librt-2.5.so
>>> 0fce0000-0fd00000 r-xp 00000000 08:13
>>> 180291                             /lib/libpthread-2.5.so
>>> 0fd00000-0fd10000 r--p 00010000 08:13
>>> 180291                             /lib/libpthread-2.5.so
>>> 0fd10000-0fd20000 rw-p 00020000 08:13
>>> 180291                             /lib/libpthread-2.5.so
>>> 0fe30000-0ffa0000 r-xp 00000000 08:13
>>> 180288                             /lib/libc-2.5.so
>>> 0ffa0000-0ffb0000 r--p 00160000 08:13
>>> 180288                             /lib/libc-2.5.so
>>> 0ffb0000-0ffc0000 rw-p 00170000 08:13
>>> 180288                             /lib/libc-2.5.so
>>> 0ffc0000-0ffe0000 r-xp 00000000 08:13
>>> 180287                             /lib/ld-2.5.so
>>> 0ffe0000-0fff0000 r--p 00010000 08:13
>>> 180287                             /lib/ld-2.5.so
>>> 0fff0000-10000000 rw-p 00020000 08:13
>>> 180287                             /lib/ld-2.5.so
>>> 10000000-10050000 r-xp 00000000 08:13
>>> 7487795                            /root/o2image.ppc.dbg
>>> 10050000-10060000 rw-p 00040000 08:13
>>> 7487795                            /root/o2image.ppc.dbg
>>> 10060000-10090000 rwxp 10060000 00:00 0
>>> [heap]
>>> f7680000-f7ff0000 rw-p f7680000 00:00 0
>>> ff9a0000-ffaf0000 rw-p ff9a0000 00:00 0
>>> [stack]
>>> Aborted (core dumped)
>>>
>>>
>>> On Thu, 2011-09-08 at 12:10 -0700, Sunil Mushran wrote:
>>>> http://oss.oracle.com/~smushran/o2image.ppc.dbg
>>>>
>>>> Use the above executable. Hopefully it won't fault, but if it does,
>>>> email me the backtrace. That trace will be readable as the exec
>>>> has debugging symbols enabled.
>>>>
>>>> On 09/07/2011 11:24 PM, Betzos Giorgos wrote:
>>>>> # rpm -q ocfs2-tools
>>>>> ocfs2-tools-1.4.4-1.el5.ppc
>>>>>
>>>>> On Wed, 2011-09-07 at 09:13 -0700, Sunil Mushran wrote:
>>>>>> version of ocfs2-tools?
>>>>>>
>>>>>> On 09/07/2011 09:10 AM, Betzos Giorgos wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I tried what you suggested but here is what I got:
>>>>>>>
>>>>>>> # o2image /dev/mapper/mpath0 /files_shared/u02.o2image
>>>>>>> *** glibc detected *** o2image: corrupted double-linked list: 0x10045000 ***
>>>>>>> ======= Backtrace: =========
>>>>>>> /lib/libc.so.6[0xfeb1ab4]
>>>>>>> /lib/libc.so.6(cfree+0xc8)[0xfeb5b68]
>>>>>>> o2image[0x10007bb0]
>>>>>>> o2image[0x10002748]
>>>>>>> o2image[0x10001f50]
>>>>>>> o2image[0x10002334]
>>>>>>> o2image[0x100026a0]
>>>>>>> o2image[0x10001f50]
>>>>>>> o2image[0x10002334]
>>>>>>> o2image[0x100026a0]
>>>>>>> o2image[0x1000358c]
>>>>>>> o2image[0x10003e28]
>>>>>>> /lib/libc.so.6[0xfe4dc60]
>>>>>>> /lib/libc.so.6[0xfe4dea0]
>>>>>>> ======= Memory map: ========
>>>>>>> 00100000-00120000 r-xp 00100000 00:00 0                                  [vdso]
>>>>>>> 0f550000-0f560000 r-xp 00000000 08:13 2881590                            /lib/libcom_err.so.2.1
>>>>>>> 0f560000-0f570000 rw-p 00000000 08:13 2881590                            /lib/libcom_err.so.2.1
>>>>>>> 0f900000-0f9c0000 r-xp 00000000 08:13 2881576                            /lib/libglib-2.0.so.0.1200.3
>>>>>>> 0f9c0000-0f9d0000 rw-p 000b0000 08:13 2881576                            /lib/libglib-2.0.so.0.1200.3
>>>>>>> 0fa40000-0fa50000 r-xp 00000000 08:13 2881575                            /lib/librt-2.5.so
>>>>>>> 0fa50000-0fa60000 r--p 00000000 08:13 2881575                            /lib/librt-2.5.so
>>>>>>> 0fa60000-0fa70000 rw-p 00010000 08:13 2881575                            /lib/librt-2.5.so
>>>>>>> 0fce0000-0fd00000 r-xp 00000000 08:13 2881574                            /lib/libpthread-2.5.so
>>>>>>> 0fd00000-0fd10000 r--p 00010000 08:13 2881574                            /lib/libpthread-2.5.so
>>>>>>> 0fd10000-0fd20000 rw-p 00020000 08:13 2881574                            /lib/libpthread-2.5.so
>>>>>>> 0fe30000-0ffa0000 r-xp 00000000 08:13 2881571                            /lib/libc-2.5.so
>>>>>>> 0ffa0000-0ffb0000 r--p 00160000 08:13 2881571                            /lib/libc-2.5.so
>>>>>>> 0ffb0000-0ffc0000 rw-p 00170000 08:13 2881571                            /lib/libc-2.5.so
>>>>>>> 0ffc0000-0ffe0000 r-xp 00000000 08:13 2881570                            /lib/ld-2.5.so
>>>>>>> 0ffe0000-0fff0000 r--p 00010000 08:13 2881570                            /lib/ld-2.5.so
>>>>>>> 0fff0000-10000000 rw-p 00020000 08:13 2881570                            /lib/ld-2.5.so
>>>>>>> 10000000-10020000 r-xp 00000000 08:13 15058799                           /sbin/o2image
>>>>>>> 10020000-10030000 rw-p 00010000 08:13 15058799                           /sbin/o2image
>>>>>>> 10030000-10060000 rwxp 10030000 00:00 0                                  [heap]
>>>>>>> f7680000-f7ff0000 rw-p f7680000 00:00 0
>>>>>>> ffc60000-ffdb0000 rw-p ffc60000 00:00 0                                  [stack]
>>>>>>> Aborted (core dumped)
>>>>>>>
>>>>>>> I have the core file, if you need it.
>>>>>>>
>>>>>>> Here is some information about the fs in question.
>>>>>>> It is used to store Oracle Archive Logs and also to store the rman backup of the DB.
>>>>>>> In the last crash case the fs became full while rman was running. Maybe we can estimate from
>>>>>>> this the size of the write in that particular case. Oracle DB rman backup files are from 7 to 11 GB.
>>>>>>> Oracle DataGuard may also have been using the same fs.
>>>>>>> After the crash, when we rebooted the servers, they would crash again. We then noticed that
>>>>>>> the fs was full and we removed some unneeded files.
>>>>>>>
>>>>>>> The system has crashed a couple more times, possibly under conditions different from the above.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> George
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Sunil Mushran
>>>>>>> Sent: Friday, September 02, 2011 8:24 PM
>>>>>>> To: Betzos Giorgos
>>>>>>> Cc: ocfs2-users at oss.oracle.com
>>>>>>> Subject: Re: [Ocfs2-users] Linux kernel crash due to ocfs2
>>>>>>>
>>>>>>> Can you provide me with the o2image? It includes the entire fs metadata.
>>>>>>> The size of the image file depends on the number of files/dirs.
>>>>>>>
>>>>>>> # o2image /dev/sdX  /path/to/image/file
>>>>>>>
>>>>>>> So the error is clear. We have underestimated the amount of credits
>>>>>>> (num of blocks that need to be dirtied in that transaction). This is the most
>>>>>>> common write path in the fs and thus hit heavily. So I am surprised by this.
>>>>>>>
>>>>>>> One way to fix it is by reproducing it in-house, and having the image will allow
>>>>>>> us to mount the fs and reproduce the issue. Do you know the size of the write?
>>>>>>>
>>>>>>> On 09/02/2011 07:23 AM, Betzos Giorgos wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> we have a pair of IBM P570 servers running RHEL5.2
>>>>>>>> kernel 2.6.18-92.el5.ppc64
>>>>>>>> We have Oracle RAC on ocfs2 storage
>>>>>>>> ocfs2 is 1.4.7-1 for the above kernel (downloaded from oracle oss site)
>>>>>>>>
>>>>>>>> Recently both servers have been crashing with the following error:
>>>>>>>>
>>>>>>>> Assertion failure in journal_dirty_metadata() at
>>>>>>>> fs/jbd/transaction.c:1130: "handle->h_buffer_credits > 0"
>>>>>>>> kernel BUG in journal_dirty_metadata at fs/jbd/transaction.c:1130!
>>>>>>>>
>>>>>>>> We get some kind of kernel debug prompt.
>>>>>>>>
>>>>>>>> the stack is as follows:
>>>>>>>>
>>>>>>>> .ocfs2_journal_dirty+0x78/0x13c [ocfs2]
>>>>>>>> .ocfs2_search_chain+0x131c/0x165c [ocfs2]
>>>>>>>> .ocfs2_claim_suballoc_bits+0xadc/0xd94 [ocfs2]
>>>>>>>> .__ocfs2_claim_clusters+0x1b0/0x348 [ocfs2]
>>>>>>>> .ocfs2_do_extend_allocation+0x1f8/0x5b4 [ocfs2]
>>>>>>>> .ocfs2_write_cluster_by_desc+0x128/0x850 [ocfs2]
>>>>>>>> .ocfs2_write_begin_nolock+0xdc0/0xfbc [ocfs2]
>>>>>>>> .ocfs2_write_begin+0x124/0x224 [ocfs2]
>>>>>>>> .ocfs2_file_aio_write+0x6a4/0xb40 [ocfs2]
>>>>>>>> .aio_pwrite+0x50/0xb4
>>>>>>>> .aio_run_iocb+0x140/0x214
>>>>>>>> .io_submit_one+0x2fc/0x3a8
>>>>>>>> .sys_io_submit+0xd0/0x17c
>>>>>>>> syscall_exit+0x0/0x40
>>>>>>>>
>>>>>>>> In the last crash case, the file system was full.
>>>>>>>>
>>>>>>>> Any clues?
>>>>>>>>
>>>>>>>> There seems to have been an ocfs2 kernel patch some time ago for the 2.6.20.2
>>>>>>>> kernel that fixed some journal credit updates.
>>>>>>>>
>>>>>>>> Is this another bug?
>>>>>>>>
>>>>>>>> Any help will be greatly appreciated, because this is a production
>>>>>>>> system.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> George