ok, thank you.

You mentioned the kernel being old; which kernel would you recommend at this point?

On Thu, Sep 24, 2009 at 8:42 PM, Sunil Mushran <sunil.mushran@oracle.com> wrote:

Then remove (temporarily) the node from the cluster. You don't want one node to negatively affect the functioning of the rest.
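
As a minimal sketch, temporarily removing a node from an o2cb-based cluster amounts to unmounting the OCFS2 volume on that node and taking its cluster stack offline; the mount point below is an assumption:

# umount /ocfs2                  # unmount the shared volume on this node only
# /etc/init.d/o2cb offline       # take the o2cb cluster stack offline
# /etc/init.d/o2cb stop          # optionally unload the cluster modules as well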

The reason we recommend forcing a reset on oops is that we cannot predict its effect on the cluster, because the oops could be in any component in the kernel. Sticking to ocfs2, say dlm_thread oopses: the node would then be unable to respond to dlm messages, and the cluster would grind to a halt. If reset were enabled, the other 9 nodes would pause, recover the dead node, and continue working. The dead node would reset and then rejoin the cluster.
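
As a rough sketch, this reset-on-oops behaviour is what you get from two sysctls: escalate any oops to a panic, and give the panic a reboot timeout so the node actually resets instead of sitting dead. The 30 seconds is only an illustrative value:

# echo 1 > /proc/sys/kernel/panic_on_oops    # turn any oops into a panic
# echo 30 > /proc/sys/kernel/panic           # reboot automatically 30s after a panic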

In your specific case, it could be harmless. But I wouldn't bet on it.

Laurence Mayer wrote:

ok, will do.
Just a little background:
We are doing reads of up to 220MB/s for 20min (aggregated across all 10 nodes), and towards the end of the 20min we are writing ~45 x 2k files to the OCFS2 volume. During the read, I notice that the cache buffers on all the nodes are exhausted.
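
The exhaustion is easy to watch while the read is running, for example (the 5-second interval is arbitrary):

# watch -n 5 'grep -E "MemFree|Buffers|Cached" /proc/meminfo'

or simply vmstat 5 to follow the free/buff/cache columns over time.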

This oops currently happens on only one of the nodes. I am reluctant to force a reboot on oops. Is this a must?

Thanks
Laurence

On Thu, Sep 24, 2009 at 8:06 PM, Sunil Mushran <sunil.mushran@oracle.com> wrote:

So a read of some file on an xfs volume triggered a memory allocation, which in turn triggered the kernel to free up some memory. The oops happens when it is trying to free up an ocfs2 inode.

Do:
# cat /proc/sys/kernel/panic_on_oops

If this returns 0, do:
# echo 1 > /proc/sys/kernel/panic_on_oops

This is documented in the user's guide.
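
The echo only lasts until the next reboot. To make the setting persistent, one option is to add something like the following to /etc/sysctl.conf and load it with sysctl -p; the 30-second panic timeout is an illustrative value that makes the node reboot on its own after the panic:

kernel.panic_on_oops = 1
kernel.panic = 30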

File a bugzilla at http://oss.oracle.com/bugzilla. _Attach_ this oops report; do not cut-and-paste it, it is hard to read. Also _attach_ the objdump output:

# objdump -DSl /lib/modules/`uname -r`/kernel/fs/ocfs2/ocfs2.ko > /tmp/ocfs2.out
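
For the oops report itself, attaching the raw text from the kernel log should be enough; something along these lines, assuming Ubuntu's default syslog layout:

# dmesg > /tmp/oops-dmesg.txt
# grep -B20 -A80 'ocfs2_meta_lock_full' /var/log/kern.log > /tmp/oops-kern.log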

Bottomline, that it is working just means that you will encounter the problem later. The problem in this case will most likely be another oops, or a hang.

Upload the outputs. I'll try to see if we have already addressed this issue. This kernel is fairly old, btw.

Sunil

Laurence Mayer wrote:

OS: Ubuntu 8.04 x64
Kern: Linux n1 2.6.24-24-server #1 SMP Tue Jul 7 19:39:36 UTC 2009 x86_64 GNU/Linux
10 Node Cluster
OCFS2 Version:
ocfs2-tools 1.3.9-0ubuntu1
ocfs2-tools-static-dev 1.3.9-0ubuntu1
ocfs2console 1.3.9-0ubuntu1

root@n1:~# cat /proc/meminfo
MemTotal:     16533296 kB
MemFree:         47992 kB
Buffers:        179240 kB
Cached:       13185084 kB
SwapCached:         72 kB
Active:        4079712 kB
Inactive:     12088860 kB
SwapTotal:    31246416 kB
SwapFree:     31246344 kB
Dirty:            2772 kB
Writeback:           4 kB
AnonPages:     2804460 kB
Mapped:          51556 kB
Slab:           223976 kB
SReclaimable:    61192 kB
SUnreclaim:     162784 kB
PageTables:      12148 kB
NFS_Unstable:        8 kB
Bounce:              0 kB
CommitLimit:  39513064 kB
Committed_AS:  3698728 kB
VmallocTotal: 34359738367 kB
VmallocUsed:     53888 kB
VmallocChunk: 34359684419 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
HugePages_Surp:      0
Hugepagesize:     2048 kB

I have started seeing the below on one of the nodes. The node does not reboot; it continues to function "normally".

Is this a memory issue?

Please can you provide direction.

Sep 24 16:31:46 n1 kernel: [75206.689992] CPU 0
Sep 24 16:31:46 n1 kernel: [75206.690018] Modules linked in: ocfs2 crc32c libcrc32c nfsd auth_rpcgss exportfs ipmi_devintf ipmi_si ipmi_msghandler ipv6 ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs iptable_filter ip_tables x_tables xfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi nfs lockd nfs_acl sunrpc parport_pc lp parport loop serio_raw psmouse i2c_piix4 i2c_core dcdbas evdev button k8temp shpchp pci_hotplug pcspkr ext3 jbd mbcache sg sr_mod cdrom sd_mod ata_generic pata_acpi usbhid hid ehci_hcd tg3 sata_svw pata_serverworks ohci_hcd libata scsi_mod usbcore thermal processor fan fbcon tileblit font bitblit softcursor fuse
Sep 24 16:31:46 n1 kernel: [75206.690455] Pid: 15931, comm: read_query Tainted: G D 2.6.24-24-server #1
Sep 24 16:31:46 n1 kernel: [75206.690509] RIP: 0010:[<ffffffff8856c404>] [<ffffffff8856c404>] :ocfs2:ocfs2_meta_lock_full+0x6a4/0xec0
Sep 24 16:31:46 n1 kernel: [75206.690591] RSP: 0018:ffff8101c64c9848 EFLAGS: 00010292
Sep 24 16:31:46 n1 kernel: [75206.690623] RAX: 0000000000000092 RBX: ffff81034ba74000 RCX: 00000000ffffffff
Sep 24 16:31:46 n1 kernel: [75206.690659] RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff8058ffa4
Sep 24 16:31:46 n1 kernel: [75206.690695] RBP: 0000000100080000 R08: 0000000000000000 R09: 00000000ffffffff
Sep 24 16:31:46 n1 kernel: [75206.690730] R10: 0000000000000000 R11: 0000000000000000 R12: ffff81033fca4e00
Sep 24 16:31:46 n1 kernel: [75206.690766] R13: ffff81033fca4f08 R14: ffff81033fca52b8 R15: ffff81033fca4f08
Sep 24 16:31:46 n1 kernel: [75206.690802] FS: 00002b312f0119f0(0000) GS:ffffffff805c5000(0000) knlGS:00000000f546bb90
Sep 24 16:31:46 n1 kernel: [75206.690857] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep 24 16:31:46 n1 kernel: [75206.690890] CR2: 00002b89f1e81000 CR3: 0000000168971000 CR4: 00000000000006e0
Sep 24 16:31:46 n1 kernel: [75206.690925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 24 16:31:46 n1 kernel: [75206.690961] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 24 16:31:46 n1 kernel: [75206.690998] Process read_query (pid: 15931, threadinfo ffff8101c64c8000, task ffff81021543f7d0)
Sep 24 16:31:46 n1 kernel: [75206.691054] Stack: ffff810243c402af ffff810243c40299 ffff81021b462408 000000011b462440
Sep 24 16:31:46 n1 kernel: [75206.691116] ffff8101c64c9910 0000000100000000 ffff810217564e00 ffffffff8029018a
Sep 24 16:31:46 n1 kernel: [75206.691176] 0000000000000296 0000000000000001 ffffffffffffffff ffff81004c052f70
Sep 24 16:31:46 n1 kernel: [75206.691217] Call Trace:
Sep 24 16:31:46 n1 kernel: [75206.691273] [isolate_lru_pages+0x8a/0x210] isolate_lru_pages+0x8a/0x210
Sep 24 16:31:46 n1 kernel: [75206.691323] [<ffffffff8857d4db>] :ocfs2:ocfs2_delete_inode+0x16b/0x7e0
Sep 24 16:31:46 n1 kernel: [75206.691362] [shrink_inactive_list+0x202/0x3c0] shrink_inactive_list+0x202/0x3c0
Sep 24 16:31:46 n1 kernel: [75206.691409] [<ffffffff8857d370>] :ocfs2:ocfs2_delete_inode+0x0/0x7e0
Sep 24 16:31:46 n1 kernel: [75206.691449] [fuse:generic_delete_inode+0xa8/0x450] generic_delete_inode+0xa8/0x140
Sep 24 16:31:46 n1 kernel: [75206.691495] [<ffffffff8857cd6d>] :ocfs2:ocfs2_drop_inode+0x7d/0x160
Sep 24 16:31:46 n1 kernel: [75206.691533] [d_kill+0x3c/0x70] d_kill+0x3c/0x70
Sep 24 16:31:46 n1 kernel: [75206.691566] [prune_one_dentry+0xc1/0xe0] prune_one_dentry+0xc1/0xe0
Sep 24 16:31:46 n1 kernel: [75206.691600] [prune_dcache+0x166/0x1c0] prune_dcache+0x166/0x1c0
Sep 24 16:31:46 n1 kernel: [75206.691635] [shrink_dcache_memory+0x3e/0x50] shrink_dcache_memory+0x3e/0x50
Sep 24 16:31:46 n1 kernel: [75206.691670] [shrink_slab+0x124/0x180] shrink_slab+0x124/0x180
Sep 24 16:31:46 n1 kernel: [75206.691707] [try_to_free_pages+0x1e4/0x2f0] try_to_free_pages+0x1e4/0x2f0
Sep 24 16:31:46 n1 kernel: [75206.691749] [__alloc_pages+0x196/0x3d0] __alloc_pages+0x196/0x3d0
Sep 24 16:31:46 n1 kernel: [75206.691790] [__do_page_cache_readahead+0xe0/0x210] __do_page_cache_readahead+0xe0/0x210
Sep 24 16:31:46 n1 kernel: [75206.691834] [ondemand_readahead+0x117/0x1c0] ondemand_readahead+0x117/0x1c0
Sep 24 16:31:46 n1 kernel: [75206.691871] [do_generic_mapping_read+0x13d/0x3c0] do_generic_mapping_read+0x13d/0x3c0
Sep 24 16:31:46 n1 kernel: [75206.691908] [file_read_actor+0x0/0x160] file_read_actor+0x0/0x160
Sep 24 16:31:46 n1 kernel: [75206.691949] [xfs:generic_file_aio_read+0xff/0x1b0] generic_file_aio_read+0xff/0x1b0
Sep 24 16:31:46 n1 kernel: [75206.692026] [xfs:xfs_read+0x11c/0x250] :xfs:xfs_read+0x11c/0x250
Sep 24 16:31:46 n1 kernel: [75206.692067] [xfs:do_sync_read+0xd9/0xbb0] do_sync_read+0xd9/0x120
Sep 24 16:31:46 n1 kernel: [75206.692101] [getname+0x1a9/0x220] getname+0x1a9/0x220
Sep 24 16:31:46 n1 kernel: [75206.692140] [<ffffffff80254530>] autoremove_wake_function+0x0/0x30
Sep 24 16:31:46 n1 kernel: [75206.692185] [vfs_read+0xed/0x190] vfs_read+0xed/0x190
Sep 24 16:31:46 n1 kernel: [75206.692220] [sys_read+0x53/0x90] sys_read+0x53/0x90
Sep 24 16:31:46 n1 kernel: [75206.692256] [system_call+0x7e/0x83] system_call+0x7e/0x83
Sep 24 16:31:46 n1 kernel: [75206.692293]
Sep 24 16:31:46 n1 kernel: [75206.692316]
Sep 24 16:31:46 n1 kernel: [75206.692317] Code: 0f 0b eb fe 83 fd fe 0f 84 73 fc ff ff 81 fd 00 fe ff ff 0f
Sep 24 16:31:46 n1 kernel: [75206.692483] RSP <ffff8101c64c9848>

Thanks
Laurence

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users