[Ocfs-devel] Re: URGENT: OCFS2 hang - 32 node cluster POC

Wed Aug 9 19:24:29 PDT 2006

alt-sysrq-t should still work w/ netdump configured
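
The key combination only does anything if magic SysRq was enabled before the hang, so it may be worth checking each node now. A minimal sketch, assuming the standard /proc/sys/kernel/sysrq interface (the function name sysrq_enabled is just for illustration):

```shell
#!/bin/sh
# Report whether magic SysRq is enabled on this node.
# Optional argument: file to read instead of /proc/sys/kernel/sysrq.
sysrq_enabled() {
    f=${1:-/proc/sys/kernel/sysrq}
    # Any non-zero value means at least some SysRq functions are allowed.
    [ -r "$f" ] && [ "$(cat "$f")" != "0" ]
}

if sysrq_enabled; then
    echo "sysrq is enabled"
else
    # Enabling needs root; add "kernel.sysrq = 1" to /etc/sysctl.conf
    # to keep the setting across reboots.
    echo "sysrq is disabled; as root run: sysctl -w kernel.sysrq=1"
fi
```

With that in place, alt-sysrq-t on the console keyboard should dump the task list during the next hang.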

On Thu, Aug 10, 2006 at 12:22:39PM +1000, Colin Laird wrote:
> The problem is that during the hang you can't get on to the box; it's 
> completely dead.
> 
> Something we have found is that the heartbeat is set to 7. On the test 
> cluster, which has worked fine, it is 61.  We are setting this value to 
> 61 across the cluster.
> 
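
For a sense of scale, here is a rough sketch of what those two values would mean. It assumes "heartbeat" refers to O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb and that the commonly documented relation for a 2-second disk heartbeat applies, timeout = (threshold - 1) * 2 seconds. Neither assumption is confirmed in this thread, and hb_timeout_secs is a made-up helper name.

```shell
#!/bin/sh
# Assumption: a node is declared dead after (threshold - 1) missed
# heartbeats, with one heartbeat every 2 seconds.
hb_timeout_secs() {
    echo $(( ($1 - 1) * 2 ))
}

for threshold in 7 61; do
    echo "threshold=$threshold -> $(hb_timeout_secs "$threshold") seconds"
done
# prints:
#   threshold=7 -> 12 seconds
#   threshold=61 -> 120 seconds
```

If that relation holds, a node whose I/O stalls for more than about 12 seconds under a heavy copy would be treated as dead with the setting of 7, whereas 61 would allow about two minutes.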
> Sunil Mushran wrote:
> >Run:
> ># top
> ># vmstat 1
> ># iostat -x /dev/emcpowerb 1
> >
> >The latter two you can save to a file. For top, just monitor cpu usage
> >and see if any process is hogging all of it.
> >
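
One way to save the latter two to files is to start them in the background before kicking off the copy. This is only a sketch: start_capture and the log directory are hypothetical names, and the logs should go to a local filesystem rather than the OCFS2 mount.

```shell
#!/bin/sh
# Start vmstat and iostat in the background, one sample per second,
# writing to files so the samples taken before a hang are kept.
start_capture() {
    logdir=$1   # directory on a local (non-OCFS2) filesystem
    device=$2   # block device to watch, e.g. /dev/emcpowerb
    count=$3    # optional sample count; empty means run until killed

    mkdir -p "$logdir" || return 1
    # $count is left unquoted on purpose: when empty it expands to nothing.
    vmstat 1 $count > "$logdir/vmstat.log" 2>&1 &
    iostat -x "$device" 1 $count > "$logdir/iostat.log" 2>&1 &
}

# Example, run before starting the large copy:
#   start_capture /var/tmp/ocfs2-hang /dev/emcpowerb
```

Output redirected to a file is usually buffered, so the last few samples before a hard freeze may be missing from the logs.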
> >Colin Laird wrote:
> >>and the fstab settings:
> >>
> >># This file is edited by fstab-sync - see 'man fstab-sync' for details
> >>/dev/VolGroup00/LogVol01 /                       ext3    defaults        1 1
> >>LABEL=/boot             /boot                   ext3    defaults        1 2
> >>none                    /dev/pts                devpts  gid=5,mode=620  0 0
> >>none                    /dev/shm                tmpfs   defaults        0 0
> >>/dev/VolGroup00/LogVol02 /home                   ext3    defaults        1 2
> >>none                    /proc                   proc    defaults        0 0
> >>none                    /sys                    sysfs   defaults        0 0
> >>/dev/VolGroup00/LogVol00 swap                    swap    defaults        0 0
> >>/dev/emcpowerb          /ocfs2                  ocfs2   _netdev         0 0
> >>/dev/hda                /media/cdrom            auto    pamconsole,exec,noauto,managed 0 0
> >>/dev/fd0                /media/floppy           auto    pamconsole,exec,noauto,managed 0 0
> >>
> >>We are not storing the voting disk and cluster reg for RAC in here.
> >>
> >>Thanks
> >>
> >>
> >>Colin Laird wrote:
> >>>Hi,
> >>>
> >>>We are in the middle of a very large bid (Centrelink, Australia) 
> >>>with time at a premium, so PLEASE HELP.  We have been experiencing 
> >>>machine hangs whenever we do large copies (5-18G) into OCFS2, 
> >>>either from ftp or from local disk.  The whole machine just freezes 
> >>>and we have to power it off and on.  We now cannot get the data for 
> >>>the POC available across the nodes!
> >>>
> >>>The setup is:
> >>>
> >>>32 clustered Dell 6850 nodes running RHEL4 U3 - Linux 
> >>>c2.au.oracle.com 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 
> >>>x86_64 x86_64 x86_64 GNU/Linux
> >>>
> >>>We have the following ocfs2 packages installed:
> >>>ocfs2-2.6.9-34.ELsmp-1.2.3-1
> >>>ocfs2-2.6.9-34.EL-1.2.3-1
> >>>ocfs2-tools-debuginfo-1.2.1-1
> >>>ocfs2-2.6.9-34.ELlargesmp-1.2.3-1
> >>>ocfs2console-1.2.1-1
> >>>ocfs2-tools-1.2.1-1
> >>>
> >>>We have elevator=deadline set as per instructions too.
> >>>
> >>>We are currently looking for a log to see if we can find anything.  
> >>>The system and ftp logs show nothing.
> >>>
> >>>Can anyone provide any pointers?  Have we missed applying anything?
> >>>
> >>>Thanks,
> >>>
> >>>-- 
> >>>Colin Laird
> >>>Principal Solutions Consultant
> >>>
> >>>Oracle New Zealand Ltd
> >>>Level 10
> >>>Todd Building
> >>>93-97 Customhouse Quay
> >>>Wellington
> >>>New Zealand
> >>>
> >>>main: +64 4 978 5400
> >>>ddi:  +64 4 978 5423
> >>>mob:  +64 21 617 025
> >>>fax:  +64 4 978 5401 
> >>
> 
>