[Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
mike
mike503 at gmail.com
Mon Apr 21 15:23:04 PDT 2008
Thanks.

If I have the opportunity to run the (buggy) new kernel again I will
try this. That is definitely a problem, and I think I need to set the
panic behavior to stay down and not auto-reboot for this to be
effective, right?
That is just one issue:

1) 2.6.24-16 with load completely crashes the node producing the largest i/o
2) 2.6.22-19 utilization drops to 0% and causes a hiccup randomly (I
don't see a pattern, and no batch jobs or other things are running at
the time it happens) - this is more important, as it is still happening
even though I'm running the more "stable" kernel.
On 4/21/08, Sunil Mushran <Sunil.Mushran at oracle.com> wrote:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/networking/netconsole.txt;h=3c2f2b3286385337ce5ec24afebd4699dd1e6e0a;hb=HEAD
>
> netconsole is a facility for capturing oops traces. It is not a console
> per se and does not require a head/GTK/X11 etc. to work. The link above
> explains the usage.
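Per the Documentation link above, a netconsole setup is roughly two steps. This is a sketch; the addresses, ports, interface name, and MAC below are placeholders, not values from this thread:

```shell
# On the node that oopses: log kernel messages over UDP to a remote host.
# Format: netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
modprobe netconsole netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc

# On the receiving host: capture whatever arrives, including oops traces.
netcat -u -l -p 9353 | tee netconsole.log
```

No reboot is needed for the modprobe form; the `netconsole=` kernel command-line form also works if you want it active from boot.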
>
>
> mike wrote:
> > Well, these are headless production servers, CLI only: no GTK, no X11.
> > Also, I am not running the newer kernels (and I can't...). It looks like
> > I cannot run a hybrid of 2.6.24-16 and 2.6.22-19; whichever one has
> > mounted the drive first is the winner.
> >
> > If I mix them, I can get the 2.6.24 nodes to mount, but then the older
> > ones give the "number too large" error or whatever. So I can't currently
> > use one server on my cluster to test, because it would require
> > upgrading all of them just for this test.
> >
> > On 4/21/08, Sunil Mushran <Sunil.Mushran at oracle.com> wrote:
> >
> > > Setting up netconsole does not require a reboot. The idea is to
> > > catch the oops trace when the oops happens. Without that trace,
> > > we are flying blind.
> > >
> > > mike wrote:
> > > > Since these are production I can't do much.
> > > >
> > > > But I did get an error (it's not happening as much but it still blips
> > > > here and there)
> > > >
> > > > Notice that /dev/sdb (my iscsi target using ocfs2) hits 0.00%
> > > > utilization, 3 seconds before my proxy says "hey, timeout" - every
> > > > other second there is -always- some utilization going on.
> > > >
> > > > What could be steps to figure out this issue? Using debugfs.ocfs2 or
> > > > something?
> > > >
> > > > It's mounted as:
> > > > /dev/sdb1 on /home type ocfs2
> > > > (rw,_netdev,noatime,data=writeback,heartbeat=local)
> > > >
> > > > I know I'm not being much help, but I'm willing to try almost anything
> > > > as long as it doesn't cause downtime or require cluster-wide changes
> > > > (since those require downtime...). I want to try going back to
> > > > 2.6.24-16 with data=writeback to see if that fixes the crashing
> > > > issue, but if I'm already having issues like this, perhaps I should
> > > > resolve this before moving up.
> > > >
> > > >
> > > >
> > > > [root at web03 ~]# cat /root/web03-iostat.txt
> > > >
> > > > Time: 02:11:46 PM
> > > > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
> > > >            3.71   0.00    27.23     8.91    0.00  60.15
> > > >
> > > > Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> > > > sda        0.00   54.46   0.00  309.90    0.00  2914.85      9.41     23.08   74.47   0.93  28.71
> > > > sdb       12.87    0.00  17.82    0.00  245.54     0.00     13.78      0.33   17.78  18.33  32.67
> > > >
> > > > Time: 02:11:47 PM
> > > > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
> > > >            0.25   0.00    26.24     2.23    0.00  71.29
> > > >
> > > > Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> > > > sda        0.00    0.00   0.00    0.00    0.00     0.00      0.00      0.00    0.00   0.00   0.00
> > > > sdb        5.94    0.00  22.77    0.99  228.71     0.99      9.67      0.42   17.92  17.08  40.59
> > > >
> > > > Time: 02:11:48 PM  <- THIS HAS THE ISSUE
> > > > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
> > > >            0.00   0.00    25.99     0.00    0.00  74.01
> > > >
> > > > Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> > > > sda        0.00   10.89   0.00    2.97    0.00   110.89     37.33      0.00    0.00   0.00   0.00
> > > > sdb        0.00    0.00   0.00    0.00    0.00     0.00      0.00      0.00    0.00   0.00   0.00
> > > >
> > > > Time: 02:11:49 PM
> > > > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
> > > >            0.25   0.00    14.85     0.99    0.00  83.91
> > > >
> > > > Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> > > > sda        0.00    0.00   0.00    0.00    0.00     0.00      0.00      0.00    0.00   0.00   0.00
> > > > sdb        0.99    0.00   2.97    0.99   30.69     0.99      8.00      0.07   17.50  17.50   6.93
> > > >
> > > > Time: 02:11:50 PM
> > > > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
> > > >            0.74   0.00     1.24     1.73    0.00  96.29
> > > >
> > > > Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> > > > sda        0.00    0.00   0.00    0.00    0.00     0.00      0.00      0.00    0.00   0.00   0.00
> > > > sdb        0.99    0.00   5.94    0.00   55.45     0.00      9.33      0.07   11.67  11.67   6.93
> > > >
> > > > Time: 02:11:51 PM
> > > > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
> > > >            0.00   0.00     1.24    16.34    0.00  82.43
> > > >
> > > > Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> > > > sda        0.00  153.47   0.00  494.06    0.00  5156.44     10.44     55.62  107.23   1.16  57.43
> > > > sdb        2.97    0.00  11.88    0.99  117.82     0.99      9.23      0.26   13.08  20.00  25.74
> > > >
> > > > Time: 02:11:52 PM
> > > > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
> > > >            0.00   0.00     0.25     3.22    0.00  96.53
> > > >
> > > > Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> > > > sda        0.00    0.00   0.00   16.83    0.00   158.42      9.41      0.13  164.71   1.18   1.98
> > > > sdb        1.98    0.00   2.97    0.00   39.60     0.00     13.33      0.13   73.33  43.33  12.87
> > > >
> > > > Time: 02:11:53 PM
> > > > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
> > > >            0.50   0.00     0.25     4.70    0.00  94.55
> > > >
> > > > Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> > > > sda        0.00    0.00   0.00    0.00    0.00     0.00      0.00      0.00    0.00   0.00   0.00
> > > > sdb        5.94    0.00  11.88    0.99  141.58     0.99     11.08      0.20   15.38  15.38  19.80
> > > >
> > > > Time: 02:11:54 PM
> > > > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
> > > >            3.96   0.00    10.15     0.74    0.00  85.15
> > > >
> > > > Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> > > > sda        0.00   20.79   0.00    4.95    0.00   205.94     41.60      0.00    0.00   0.00   0.00
> > > > sdb        4.95    0.00   5.94    0.00   87.13     0.00     14.67      0.07   11.67  11.67   6.93
> > > >
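A quick way to pull the idle seconds out of a saved capture like the one above. This is a hypothetical helper, assuming the `iostat -x -t 1` output was saved unwrapped so that %util is the last field on each device line:

```shell
# List the timestamps at which sdb reported 0.00 %util in a saved
# "iostat -x -t 1" capture. Assumes unwrapped output where %util is
# the last field of each device line.
awk '
  /^Time:/                      { t = $2 " " $3 }   # remember the sample time
  $1 == "sdb" && $NF == "0.00"  { print t }         # device sat fully idle
' /root/web03-iostat.txt
```

Correlating these timestamps with the proxy's timeout log should confirm whether the 0.00% utilization gaps line up with the "hey, timeout" blips.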
> > > > On 4/21/08, Sunil Mushran <Sunil.Mushran at oracle.com> wrote:
> > > >
> > > > > Do you have the panic output... kernel stack trace. We'll need
> > > > > that to figure this out. Without that, we can only speculate.
> > > > >
> > > > > mike wrote:
> > > > > > On 4/21/08, Tao Ma <tao.ma at oracle.com> wrote:
> > > > > > > mike wrote:
> > > > > > > > I have changed my kernel back to 2.6.22-14-server, and now I don't
> > > > > > > > get the kernel panics. It seems like an issue with 2.6.24-16 and
> > > > > > > > some i/o made it crash...
> > > > > > >
> > > > > > > OK, so it seems that it is a bug in the ocfs2 kernel module, not in
> > > > > > > ocfs2-tools. :)
> > > > > > > Then could you please describe in more detail how the kernel panic
> > > > > > > happens?
> > > > > >
> > > > > > Yeah, this specific issue seems like a kernel issue.
> > > > > >
> > > > > > I don't know; these are production systems and I am already getting
> > > > > > angry customers. I can't really test anymore. Both are standard Ubuntu
> > > > > > kernels.
> > > > > >
> > > > > > Okay: 2.6.22-14-server (I think still minor file access issues)
> > > > > > Breaks under load: 2.6.24-16-server
> > > > > >
> > > > > > > > However I am still getting file access timeouts once in a while. I
> > > > > > > > am nervous about putting more load on the setup.
> > > > > > >
> > > > > > > Also please provide more details about it.
> > > > > >
> > > > > > I am using nginx for a frontend load balancer, and nginx for a
> > > > > > webserver as well. This doesn't seem to be related to the webserver at
> > > > > > all though; it was happening before this.
> > > > > >
> > > > > > lvs01 proxies traffic in to web01, web02, and web03 (currently using
> > > > > > nginx; before, I was using LVS/ipvsadm)
> > > > > >
> > > > > > Every so often, one of the webservers sends me back
> > > > > >
> > > > > > > > [root at raid01 .batch]# cat /etc/default/o2cb
> > > > > > > >
> > > > > > > > # O2CB_ENABLED: 'true' means to load the driver on boot.
> > > > > > > > O2CB_ENABLED=true
> > > > > > > >
> > > > > > > > # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
> > > > > > > > O2CB_BOOTCLUSTER=mycluster
> > > > > > > >
> > > > > > > > # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered
> > > > > > > > dead.
> > > > > > > > O2CB_HEARTBEAT_THRESHOLD=7
> > > > > > >
> > > > > > > This value is a little small, so how did you build up your shared
> > > > > > > disk (iSCSI or ...)? The most common value I have heard of is 61,
> > > > > > > which is about 120 secs. I don't know the reason and maybe Sunil can
> > > > > > > tell you. ;)
> > > > > > > You can also refer to
> > > > > > > http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT.
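On the arithmetic behind those numbers: per the OCFS2 FAQ linked above, a node is declared dead after (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds of missed disk heartbeats, so a threshold of 7 is only a 12-second fence, while 61 gives the ~120 seconds mentioned. A quick check:

```shell
# Fence-off time implied by a given O2CB_HEARTBEAT_THRESHOLD:
# each heartbeat iteration is 2 seconds, so dead time = (threshold - 1) * 2.
for threshold in 7 31 61; do
    echo "$threshold -> $(( (threshold - 1) * 2 )) seconds"
done
# prints:
#   7 -> 12 seconds
#   31 -> 60 seconds
#   61 -> 120 seconds
```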
> > > > > > >
> > > > > > > > # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is
> > > > > > > > considered dead.
> > > > > > > > O2CB_IDLE_TIMEOUT_MS=10000
> > > > > > > >
> > > > > > > > # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet
> > > > > > > > is sent
> > > > > > > > O2CB_KEEPALIVE_DELAY_MS=5000
> > > > > > > >
> > > > > > > > # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection
> > > > > > > > attempts
> > > > > > > > O2CB_RECONNECT_DELAY_MS=2000
> > > > > > > >
> > > > > > > > On 4/21/08, Tao Ma <tao.ma at oracle.com> wrote:
> > > > > > > > > Hi Mike,
> > > > > > > > > Are you sure it is caused by the update of ocfs2-tools?
> > > > > > > > > AFAIK, the ocfs2-tools only include tools like mkfs, fsck and
> > > > > > > > > tunefs etc. So if you don't make any change to the disk (by using
> > > > > > > > > these new tools), it shouldn't cause the problem of kernel panic,
> > > > > > > > > since they are all user space tools.
> > > > > > > > > Then there is only one thing maybe. Have you modified
> > > > > > > > > /etc/sysconfig/o2cb (this is the place for RHEL; not sure of the
> > > > > > > > > place in ubuntu)? I have checked the rpm package for RHEL; it
> > > > > > > > > will update /etc/sysconfig/o2cb, and this file has some timeouts
> > > > > > > > > defined in it.
> > > > > > > > > So do you have some backups of this file? If yes, please restore
> > > > > > > > > it to see whether it helps (I can't say it for sure).
> > > > > > > > > If not, do you remember the old values of some timeouts you set
> > > > > > > > > for ocfs2? If yes, you can use o2cb configure to set them by
> > > > > > > > > yourself.
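Tao's closing suggestion, re-entering the timeouts with o2cb configure, looks roughly like this. A sketch only: the prompt wording and defaults vary by o2cb version, and the init-script path on Ubuntu is an assumption:

```shell
# Re-enter the o2cb cluster timeouts interactively (path is approximate).
/etc/init.d/o2cb configure
#   Load O2CB driver on boot (y/n) [y]:
#   Cluster to start on boot (Enter "none" to clear) [ocfs2]: mycluster
#   Specify heartbeat dead threshold (>=7) [31]:
#   Specify network idle timeout in ms (>=5000) [30000]:
#   ...
# Then restart the cluster stack so the new values take effect:
/etc/init.d/o2cb restart
```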
> > > > > > _______________________________________________
> > > > > > Ocfs2-users mailing list
> > > > > > Ocfs2-users at oss.oracle.com
> > > > > > http://oss.oracle.com/mailman/listinfo/ocfs2-users