[Ocfs2-users] Linux kernel crash due to ocfs2

Thu Sep 8 12:10:43 PDT 2011

http://oss.oracle.com/~smushran/o2image.ppc.dbg

Use the above executable. Hopefully it won't fault, but if it does,
email me the backtrace. That trace will be readable because the
executable has debugging symbols enabled.

On 09/07/2011 11:24 PM, Betzos Giorgos wrote:
> # rpm -q ocfs2-tools
> ocfs2-tools-1.4.4-1.el5.ppc
>
> On Wed, 2011-09-07 at 09:13 -0700, Sunil Mushran wrote:
>> Which version of ocfs2-tools?
>>
>> On 09/07/2011 09:10 AM, Betzos Giorgos wrote:
>>> Hello,
>>>
>>> I tried what you suggested but here is what I got:
>>>
>>> # o2image /dev/mapper/mpath0 /files_shared/u02.o2image
>>> *** glibc detected *** o2image: corrupted double-linked list: 0x10045000 ***
>>> ======= Backtrace: =========
>>> /lib/libc.so.6[0xfeb1ab4]
>>> /lib/libc.so.6(cfree+0xc8)[0xfeb5b68]
>>> o2image[0x10007bb0]
>>> o2image[0x10002748]
>>> o2image[0x10001f50]
>>> o2image[0x10002334]
>>> o2image[0x100026a0]
>>> o2image[0x10001f50]
>>> o2image[0x10002334]
>>> o2image[0x100026a0]
>>> o2image[0x1000358c]
>>> o2image[0x10003e28]
>>> /lib/libc.so.6[0xfe4dc60]
>>> /lib/libc.so.6[0xfe4dea0]
>>> ======= Memory map: ========
>>> 00100000-00120000 r-xp 00100000 00:00 0                                  [vdso]
>>> 0f550000-0f560000 r-xp 00000000 08:13 2881590                            /lib/libcom_err.so.2.1
>>> 0f560000-0f570000 rw-p 00000000 08:13 2881590                            /lib/libcom_err.so.2.1
>>> 0f900000-0f9c0000 r-xp 00000000 08:13 2881576                            /lib/libglib-2.0.so.0.1200.3
>>> 0f9c0000-0f9d0000 rw-p 000b0000 08:13 2881576                            /lib/libglib-2.0.so.0.1200.3
>>> 0fa40000-0fa50000 r-xp 00000000 08:13 2881575                            /lib/librt-2.5.so
>>> 0fa50000-0fa60000 r--p 00000000 08:13 2881575                            /lib/librt-2.5.so
>>> 0fa60000-0fa70000 rw-p 00010000 08:13 2881575                            /lib/librt-2.5.so
>>> 0fce0000-0fd00000 r-xp 00000000 08:13 2881574                            /lib/libpthread-2.5.so
>>> 0fd00000-0fd10000 r--p 00010000 08:13 2881574                            /lib/libpthread-2.5.so
>>> 0fd10000-0fd20000 rw-p 00020000 08:13 2881574                            /lib/libpthread-2.5.so
>>> 0fe30000-0ffa0000 r-xp 00000000 08:13 2881571                            /lib/libc-2.5.so
>>> 0ffa0000-0ffb0000 r--p 00160000 08:13 2881571                            /lib/libc-2.5.so
>>> 0ffb0000-0ffc0000 rw-p 00170000 08:13 2881571                            /lib/libc-2.5.so
>>> 0ffc0000-0ffe0000 r-xp 00000000 08:13 2881570                            /lib/ld-2.5.so
>>> 0ffe0000-0fff0000 r--p 00010000 08:13 2881570                            /lib/ld-2.5.so
>>> 0fff0000-10000000 rw-p 00020000 08:13 2881570                            /lib/ld-2.5.so
>>> 10000000-10020000 r-xp 00000000 08:13 15058799                           /sbin/o2image
>>> 10020000-10030000 rw-p 00010000 08:13 15058799                           /sbin/o2image
>>> 10030000-10060000 rwxp 10030000 00:00 0                                  [heap]
>>> f7680000-f7ff0000 rw-p f7680000 00:00 0
>>> ffc60000-ffdb0000 rw-p ffc60000 00:00 0                                  [stack]
>>> Aborted (core dumped)
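
A glibc abort report like the one above can be cross-checked by looking up each bracketed address in the memory map. The following is a minimal sketch, not part of ocfs2-tools: the helper names parse_report and locate are hypothetical, and REPORT holds only a few lines copied from the report. Note that frame addresses are only meaningful against the exact binary that produced them.

```python
import re

# A few lines copied from the report above (two frames, three map entries).
REPORT = """\
/lib/libc.so.6(cfree+0xc8)[0xfeb5b68]
o2image[0x10007bb0]
0fe30000-0ffa0000 r-xp 00000000 08:13 2881571                            /lib/libc-2.5.so
10000000-10020000 r-xp 00000000 08:13 15058799                           /sbin/o2image
10030000-10060000 rwxp 10030000 00:00 0                                  [heap]
"""

# Backtrace frame: object name, optional (symbol+offset), then [0xADDR].
FRAME_RE = re.compile(r"^(\S+?)(?:\([^)]*\))?\[0x([0-9a-f]+)\]$")
# Memory map entry: start-end perms offset dev inode [name].
MAP_RE = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+\S+\s+\S+\s+\S+\s+\d+\s*(.*)$")


def parse_report(text):
    """Split a report into (object, address) frames and (start, end, name) regions."""
    frames, regions = [], []
    for line in text.splitlines():
        line = line.strip()
        m = FRAME_RE.match(line)
        if m:
            frames.append((m.group(1), int(m.group(2), 16)))
            continue
        m = MAP_RE.match(line)
        if m:
            name = m.group(3) or "[anonymous]"
            regions.append((int(m.group(1), 16), int(m.group(2), 16), name))
    return frames, regions


def locate(addr, regions):
    """Return (region name, offset from region start), or (None, None) if unmapped."""
    for start, end, name in regions:
        if start <= addr < end:
            return name, addr - start
    return None, None


if __name__ == "__main__":
    frames, regions = parse_report(REPORT)
    for obj, addr in frames:
        print(obj, hex(addr), locate(addr, regions))
    # The pointer named in the "corrupted double-linked list" message.
    print(locate(0x10045000, regions))
```

Applied to the report above, the pointer named by glibc, 0x10045000, lies inside the [heap] range 10030000-10060000, which is consistent with heap corruption inside o2image rather than a stray pointer into a library.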
>>>
>>> I have the core file, if you need it.
>>>
>>> Here is some information about the fs in question.
>>> It is used to store Oracle archive logs and also the rman backup of the DB.
>>> In the last crash case the fs became full while rman was running. Maybe we can estimate
>>> the size of the write from this in that particular case: Oracle DB rman backup files are 7 to 11 GB.
>>> Oracle DataGuard may also have been using the same fs.
>>> After the crash, when we rebooted the servers, they would crash again. We then noticed that
>>> the fs was full and we removed some unneeded files.
>>>
>>> The system has also crashed a couple more times, when the above conditions may not have applied.
>>>
>>> Thanks,
>>>
>>> George
>>>
>>> ________________________________________
>>> From: Sunil Mushran
>>> Sent: Friday, September 02, 2011 8:24 PM
>>> To: Betzos Giorgos
>>> Cc: ocfs2-users at oss.oracle.com
>>> Subject: Re: [Ocfs2-users] Linux kernel crash due to ocfs2
>>>
>>> Can you provide me with the o2image? It includes the entire fs metadata.
>>> The size of the image file depends on the number of files/dirs.
>>>
>>> # o2image /dev/sdX  /path/to/image/file
>>>
>>> So the error is clear. We have underestimated the number of credits
>>> (the number of blocks that need to be dirtied in that transaction). This is the
>>> most common write path in the fs and is thus hit heavily, so I am surprised by this.
>>>
>>> One way to fix it is by reproducing it in-house. Having the image will allow
>>> us to mount the fs and reproduce the issue. Do you know the size of the write?
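
The credit accounting described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the jbd or ocfs2 code: ToyHandle, OutOfCredits and run_transaction are hypothetical names, and the model assumes that each distinct metadata block dirtied in a transaction consumes exactly one credit.

```python
class OutOfCredits(Exception):
    """Raised where the kernel would hit the h_buffer_credits > 0 assertion."""


class ToyHandle:
    """Toy stand-in for a journal handle; only the credit counter is modelled."""

    def __init__(self, credits):
        self.buffer_credits = credits  # credits reserved when the handle starts
        self.dirtied = set()           # blocks already dirtied under this handle

    def dirty_metadata(self, block):
        if block in self.dirtied:
            return  # assumption: re-dirtying the same block costs nothing extra
        if self.buffer_credits <= 0:
            raise OutOfCredits("no credits left for block %r" % (block,))
        self.buffer_credits -= 1
        self.dirtied.add(block)


def run_transaction(estimated_credits, blocks_to_dirty):
    """Dirty each block in turn; return the credits left over at the end."""
    handle = ToyHandle(estimated_credits)
    for block in blocks_to_dirty:
        handle.dirty_metadata(block)
    return handle.buffer_credits


if __name__ == "__main__":
    # Hypothetical block names, chosen only for illustration.
    blocks = ["inode", "group descriptor", "bitmap"]
    print(run_transaction(3, blocks))  # estimate matches the work done
    try:
        run_transaction(2, blocks)     # estimate is one block short
    except OutOfCredits as exc:
        print("would BUG:", exc)
```

In this model an estimate that is even one block short fails only when the last block is dirtied, which is why the failure surfaces deep in the write path rather than where the estimate was made.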
>>>
>>> On 09/02/2011 07:23 AM, Betzos Giorgos wrote:
>>>> Hello,
>>>>
>>>> we have a pair of IBM P570 servers running RHEL5.2
>>>> kernel 2.6.18-92.el5.ppc64
>>>> We have Oracle RAC on ocfs2 storage
>>>> ocfs2 is 1.4.7-1 for the above kernel (downloaded from oracle oss site)
>>>>
>>>> Recently both servers have been crashing with the following error:
>>>>
>>>> Assertion failure in journal_dirty_metadata() at
>>>> fs/jbd/transaction.c:1130: "handle->h_buffer_credits > 0"
>>>> kernel BUG in journal_dirty_metadata at fs/jbd/transaction.c:1130!
>>>>
>>>> We get some kind of kernel debug prompt.
>>>>
>>>> the stack is as follows:
>>>>
>>>> .ocfs2_journal_dirty+0x78/0x13c [ocfs2]
>>>> .ocfs2_search_chain+0x131c/0x165c [ocfs2]
>>>> .ocfs2_claim_suballoc_bits+0xadc/0xd94 [ocfs2]
>>>> .__ocfs2_claim_clusters+0x1b0/0x348 [ocfs2]
>>>> .ocfs2_do_extend_allocation+0x1f8/0x5b4 [ocfs2]
>>>> .ocfs2_write_cluster_by_desc+0x128/0x850 [ocfs2]
>>>> .ocfs2_write_begin_nolock+0xdc0/0xfbc [ocfs2]
>>>> .ocfs2_write_begin+0x124/0x224 [ocfs2]
>>>> .ocfs2_file_aio_write+0x6a4/0xb40 [ocfs2]
>>>> .aio_pwrite+0x50/0xb4
>>>> .aio_run_iocb+0x140/0x214
>>>> .io_submit_one+0x2fc/0x3a8
>>>> .sys_io_submit+0xd0/0x17c
>>>> syscall_exit+0x0/0x40
>>>>
>>>> In the last crash case, the file system was full.
>>>>
>>>> Any clues?
>>>>
>>>> There seems to have been an ocfs2 kernel patch some time ago for the
>>>> 2.6.20.2 kernel that fixed some journal credit updates.
>>>>
>>>> Is this another bug?
>>>>
>>>> Any help will be greatly appreciated, because this is a production
>>>> system.
>>>>
>>>> Thanks,
>>>>
>>>> George