[Ocfs2-users] OCFS2 Trace

Sunil Mushran sunil.mushran at oracle.com
Thu Sep 24 11:36:54 PDT 2009


I would always recommend an enterprise kernel: OEL, RHEL or SLES. If you are
using a non-enterprise kernel, then use whatever the stable-kernel team is
supporting. Currently that is 2.6.27 and 2.6.30, I believe.

The reason is simple. We continue to test ocfs2 on the enterprise and the
current mainline kernels. All bugs are fixed in enterprise and mainline.
Critical bugs are fixed in the stable kernel(s).
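(As an aside for readers following the panic_on_oops steps quoted further
down in this thread: the `echo 1` only lasts until the next boot. A minimal
sysctl fragment to make the setting persistent might look like the sketch
below; the kernel.panic delay value is an illustrative assumption, not taken
from this thread.)

```
# /etc/sysctl.conf additions (a sketch, not from the user's guide):
kernel.panic_on_oops = 1   # turn any oops into a panic, so the node fences
kernel.panic = 30          # assumed value: auto-reset 30s after a panic
```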

Laurence Mayer wrote:
> ok thank you.
>  
> You mentioned the kernel being old, which kernel would you recommend 
> at this point?
>
> On Thu, Sep 24, 2009 at 8:42 PM, Sunil Mushran
> <sunil.mushran at oracle.com> wrote:
>
>     Then remove (temporarily) the node from the cluster. You don't
>     want one
>     node to negatively affect the functioning of the rest.
>
>     The reason we recommend forcing a reset on oops is that we cannot
>     predict its effect on the cluster: the oops could be in any
>     component in the kernel. Sticking to ocfs2, say dlm_thread oopses.
>     The node would then be unable to respond to dlm messages, and the
>     cluster would grind to a halt. If reset were enabled, the other 9
>     nodes would pause, recover the dead node and continue working. The
>     dead node would reset and then rejoin the cluster.
>
>     In your specific case, it could be harmless. But I wouldn't bet on it.
>
>     Laurence Mayer wrote:
>
>         ok will do.
>         Just a little background:
>         We are doing reads of up to 220MB/s for 20min (aggregated
>         across all 10 nodes), and towards the end of the 20min we
>         write ~45 x 2k files to the OCFS2 volume. During the read, I
>         notice that the cache buffers on all the nodes are exhausted.
>         This oops currently happens on only one of the nodes. I am
>         reluctant to force a reboot on oops.
>         Is this a must?
>         Thanks
>         Laurence
>
>          On Thu, Sep 24, 2009 at 8:06 PM, Sunil Mushran
>         <sunil.mushran at oracle.com> wrote:
>
>            So a read of some file on an xfs volume triggered a mem
>            alloc, which in turn triggered the kernel to free up some
>            memory. The oops happens when it is trying to free up an
>            ocfs2 inode.
>
>            Do:
>            # cat /proc/sys/kernel/panic_on_oops
>
>            If this returns 0, do:
>            # echo 1 > /proc/sys/kernel/panic_on_oops
>            This is documented in the user's guide.
>
>            File a bugzilla at http://oss.oracle.com/bugzilla.
>            _Attach_ this oops report.
>
>            Do not cut-paste. It is hard to read. Also _attach_ the objdump
>            output.
>            # objdump -DSl /lib/modules/`uname -r`/kernel/fs/ocfs2/ocfs2.ko
>            >/tmp/ocfs2.out
>
>            Bottom line: that it is working now just means that you
>            will encounter the problem later. The problem in this case
>            will most likely be another oops, or a hang.
>
>            Upload the outputs. I'll try to see if we have already
>         addressed
>            this issue.
>            This kernel is fairly old, btw.
>
>            Sunil
>
>            Laurence Mayer wrote:
>
>                OS: Ubuntu 8.04 x64
>                Kern: Linux n1 2.6.24-24-server #1 SMP Tue Jul 7
>         19:39:36 UTC
>                2009 x86_64 GNU/Linux
>                10 Node Cluster
>                OCFS2 Version:
>                ocfs2-tools             1.3.9-0ubuntu1
>                ocfs2-tools-static-dev  1.3.9-0ubuntu1
>                ocfs2console            1.3.9-0ubuntu1
>
>                root at n1:~# cat /proc/meminfo
>                MemTotal:     16533296 kB
>                MemFree:         47992 kB
>                Buffers:        179240 kB
>                Cached:       13185084 kB
>                SwapCached:         72 kB
>                Active:        4079712 kB
>                Inactive:     12088860 kB
>                SwapTotal:    31246416 kB
>                SwapFree:     31246344 kB
>                Dirty:            2772 kB
>                Writeback:           4 kB
>                AnonPages:     2804460 kB
>                Mapped:          51556 kB
>                Slab:           223976 kB
>                SReclaimable:    61192 kB
>                SUnreclaim:     162784 kB
>                PageTables:      12148 kB
>                NFS_Unstable:        8 kB
>                Bounce:              0 kB
>                CommitLimit:  39513064 kB
>                Committed_AS:  3698728 kB
>                VmallocTotal: 34359738367 kB
>                VmallocUsed:     53888 kB
>                VmallocChunk: 34359684419 kB
>                HugePages_Total:     0
>                HugePages_Free:      0
>                HugePages_Rsvd:      0
>                HugePages_Surp:      0
>                Hugepagesize:     2048 kB
>
>                I have started seeing the oops below on one of the
>                nodes. The node does not reboot; it continues to
>                function "normally".
>
>                Is this a memory issue?
>
>                Please can you provide direction.
>
>
>                Sep 24 16:31:46 n1 kernel: [75206.689992] CPU 0
>                Sep 24 16:31:46 n1 kernel: [75206.690018] Modules
>         linked in:
>                ocfs2 crc32c libcrc32c nfsd auth_rpcgss exportfs
>         ipmi_devintf
>                ipmi_si ipmi_msghandler ipv6 ocfs2_dlmfs ocfs2_dlm
>                ocfs2_nodemanager configfs iptable_filter ip_tables
>         x_tables
>                xfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core
>         ib_addr
>                iscsi_tcp libiscsi scsi_transport_iscsi nfs lockd nfs_acl
>                sunrpc parport_pc lp parport loop serio_raw psmouse
>         i2c_piix4
>                i2c_core dcdbas evdev button k8temp shpchp pci_hotplug
>         pcspkr
>                ext3 jbd mbcache sg sr_mod cdrom sd_mod ata_generic
>         pata_acpi
>                usbhid hid ehci_hcd tg3 sata_svw pata_serverworks ohci_hcd
>                libata scsi_mod usbcore thermal processor fan fbcon
>         tileblit
>                font bitblit softcursor fuse
>                Sep 24 16:31:46 n1 kernel: [75206.690455] Pid: 15931, comm:
>                read_query Tainted: G      D 2.6.24-24-server #1
>                Sep 24 16:31:46 n1 kernel: [75206.690509] RIP:
>                0010:[<ffffffff8856c404>]  [<ffffffff8856c404>]
>                :ocfs2:ocfs2_meta_lock_full+0x6a4/0xec0
>                Sep 24 16:31:46 n1 kernel: [75206.690591] RSP:
>                0018:ffff8101c64c9848  EFLAGS: 00010292
>                Sep 24 16:31:46 n1 kernel: [75206.690623] RAX:
>                0000000000000092 RBX: ffff81034ba74000 RCX:
>         00000000ffffffff
>                Sep 24 16:31:46 n1 kernel: [75206.690659] RDX:
>                00000000ffffffff RSI: 0000000000000000 RDI:
>         ffffffff8058ffa4
>                Sep 24 16:31:46 n1 kernel: [75206.690695] RBP:
>                0000000100080000 R08: 0000000000000000 R09:
>         00000000ffffffff
>                Sep 24 16:31:46 n1 kernel: [75206.690730] R10:
>                0000000000000000 R11: 0000000000000000 R12:
>         ffff81033fca4e00
>                Sep 24 16:31:46 n1 kernel: [75206.690766] R13:
>                ffff81033fca4f08 R14: ffff81033fca52b8 R15:
>         ffff81033fca4f08
>                Sep 24 16:31:46 n1 kernel: [75206.690802] FS:
>                 00002b312f0119f0(0000) GS:ffffffff805c5000(0000)
>                knlGS:00000000f546bb90
>                Sep 24 16:31:46 n1 kernel: [75206.690857] CS:  0010 DS:
>         0000
>                ES: 0000 CR0: 000000008005003b
>                Sep 24 16:31:46 n1 kernel: [75206.690890] CR2:
>                00002b89f1e81000 CR3: 0000000168971000 CR4:
>         00000000000006e0
>                Sep 24 16:31:46 n1 kernel: [75206.690925] DR0:
>                0000000000000000 DR1: 0000000000000000 DR2:
>         0000000000000000
>                Sep 24 16:31:46 n1 kernel: [75206.690961] DR3:
>                0000000000000000 DR6: 00000000ffff0ff0 DR7:
>         0000000000000400
>                Sep 24 16:31:46 n1 kernel: [75206.690998] Process
>         read_query
>                (pid: 15931, threadinfo ffff8101c64c8000, task
>         ffff81021543f7d0)
>                Sep 24 16:31:46 n1 kernel: [75206.691054] Stack:
>                 ffff810243c402af ffff810243c40299 ffff81021b462408
>                000000011b462440
>                Sep 24 16:31:46 n1 kernel: [75206.691116]  ffff8101c64c9910
>                0000000100000000 ffff810217564e00 ffffffff8029018a
>                Sep 24 16:31:46 n1 kernel: [75206.691176]  0000000000000296
>                0000000000000001 ffffffffffffffff ffff81004c052f70
>                Sep 24 16:31:46 n1 kernel: [75206.691217] Call Trace:
>                Sep 24 16:31:46 n1 kernel: [75206.691273]
>                 [isolate_lru_pages+0x8a/0x210]
>         isolate_lru_pages+0x8a/0x210
>                Sep 24 16:31:46 n1 kernel: [75206.691323]
>                 [<ffffffff8857d4db>] :ocfs2:ocfs2_delete_inode+0x16b/0x7e0
>                Sep 24 16:31:46 n1 kernel: [75206.691362]
>                 [shrink_inactive_list+0x202/0x3c0]
>                shrink_inactive_list+0x202/0x3c0
>                Sep 24 16:31:46 n1 kernel: [75206.691409]
>                 [<ffffffff8857d370>] :ocfs2:ocfs2_delete_inode+0x0/0x7e0
>                Sep 24 16:31:46 n1 kernel: [75206.691449]
>                 [fuse:generic_delete_inode+0xa8/0x450]
>                generic_delete_inode+0xa8/0x140
>                Sep 24 16:31:46 n1 kernel: [75206.691495]
>                 [<ffffffff8857cd6d>] :ocfs2:ocfs2_drop_inode+0x7d/0x160
>                Sep 24 16:31:46 n1 kernel: [75206.691533]
>          [d_kill+0x3c/0x70]
>                d_kill+0x3c/0x70
>                Sep 24 16:31:46 n1 kernel: [75206.691566]
>                 [prune_one_dentry+0xc1/0xe0] prune_one_dentry+0xc1/0xe0
>                Sep 24 16:31:46 n1 kernel: [75206.691600]
>                 [prune_dcache+0x166/0x1c0] prune_dcache+0x166/0x1c0
>                Sep 24 16:31:46 n1 kernel: [75206.691635]
>                 [shrink_dcache_memory+0x3e/0x50]
>         shrink_dcache_memory+0x3e/0x50
>                Sep 24 16:31:46 n1 kernel: [75206.691670]
>                 [shrink_slab+0x124/0x180] shrink_slab+0x124/0x180
>                Sep 24 16:31:46 n1 kernel: [75206.691707]
>                 [try_to_free_pages+0x1e4/0x2f0]
>         try_to_free_pages+0x1e4/0x2f0
>                Sep 24 16:31:46 n1 kernel: [75206.691749]
>                 [__alloc_pages+0x196/0x3d0] __alloc_pages+0x196/0x3d0
>                Sep 24 16:31:46 n1 kernel: [75206.691790]
>                 [__do_page_cache_readahead+0xe0/0x210]
>                __do_page_cache_readahead+0xe0/0x210
>                Sep 24 16:31:46 n1 kernel: [75206.691834]
>                 [ondemand_readahead+0x117/0x1c0]
>         ondemand_readahead+0x117/0x1c0
>                Sep 24 16:31:46 n1 kernel: [75206.691871]
>                 [do_generic_mapping_read+0x13d/0x3c0]
>                do_generic_mapping_read+0x13d/0x3c0
>                Sep 24 16:31:46 n1 kernel: [75206.691908]
>                 [file_read_actor+0x0/0x160] file_read_actor+0x0/0x160
>                Sep 24 16:31:46 n1 kernel: [75206.691949]
>                 [xfs:generic_file_aio_read+0xff/0x1b0]
>                generic_file_aio_read+0xff/0x1b0
>                Sep 24 16:31:46 n1 kernel: [75206.692026]
>                 [xfs:xfs_read+0x11c/0x250] :xfs:xfs_read+0x11c/0x250
>                Sep 24 16:31:46 n1 kernel: [75206.692067]
>                 [xfs:do_sync_read+0xd9/0xbb0] do_sync_read+0xd9/0x120
>                Sep 24 16:31:46 n1 kernel: [75206.692101]
>                 [getname+0x1a9/0x220] getname+0x1a9/0x220
>                Sep 24 16:31:46 n1 kernel: [75206.692140]
>                 [<ffffffff80254530>] autoremove_wake_function+0x0/0x30
>                Sep 24 16:31:46 n1 kernel: [75206.692185]
>                 [vfs_read+0xed/0x190] vfs_read+0xed/0x190
>                Sep 24 16:31:46 n1 kernel: [75206.692220]
>                 [sys_read+0x53/0x90] sys_read+0x53/0x90
>                Sep 24 16:31:46 n1 kernel: [75206.692256]
>                 [system_call+0x7e/0x83] system_call+0x7e/0x83
>                Sep 24 16:31:46 n1 kernel: [75206.692293]
>                Sep 24 16:31:46 n1 kernel: [75206.692316]
>                Sep 24 16:31:46 n1 kernel: [75206.692317] Code: 0f 0b
>         eb fe 83
>                fd fe 0f 84 73 fc ff ff 81 fd 00 fe ff ff 0f
>                Sep 24 16:31:46 n1 kernel: [75206.692483]  RSP
>         <ffff8101c64c9848>
>
>
>                Thanks
>                Laurence
>
>                _______________________________________________
>                Ocfs2-users mailing list
>                Ocfs2-users at oss.oracle.com
>                http://oss.oracle.com/mailman/listinfo/ocfs2-users