[Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?

Mon Apr 21 15:01:26 PDT 2008

Well, these are headless production servers, CLI only: no GTK, no X11.
I am also not running the newer kernels (and I can't). It looks like
I cannot run a hybrid of 2.6.24-16 and 2.6.22-19; whichever one
mounts the drive first is the winner.

If I mix them, I can get the 2.6.24 nodes to mount, but then the older
ones give the "number too large" error or whatever. So I can't
currently use one server in my cluster to test, because it would
require upgrading all of them just for this test.

On 4/21/08, Sunil Mushran <Sunil.Mushran at oracle.com> wrote:
> Setting up netconsole does not require a reboot. The idea is to
> catch the oops trace when the oops happens. Without that trace,
> we are flying blind.
>
>
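For reference, a minimal sketch of the receiving side of netconsole,
assuming Python 3 is available on a spare machine. netconsole sends
kernel log messages as UDP datagrams (6666 is its default target
port). The function names here are made up for illustration and are
not part of any ocfs2 or kernel tooling.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket; 6666 is netconsole's default target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def datagrams(sock, max_packets=None, timeout=None):
    """Yield the text of each received datagram.

    Stops after max_packets datagrams (if given) or when no datagram
    arrives within `timeout` seconds (if given).
    """
    sock.settimeout(timeout)
    count = 0
    while max_packets is None or count < max_packets:
        try:
            data, _sender = sock.recvfrom(65535)
        except socket.timeout:
            return
        count += 1
        # Kernel messages are normally ASCII; never fail on odd bytes.
        yield data.decode("utf-8", errors="replace")
```

To keep a trace on disk, loop over `datagrams(open_listener())` and
append each chunk to a log file, flushing after every write so that
an oops is not lost if the sender dies mid-trace.
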
> mike wrote:
> > Since these are production I can't do much.
> >
> > But I did get an error (it's not happening as much but it still blips
> > here and there)
> >
> > Notice that /dev/sdb (my iscsi target using ocfs2) hits 0.00%
> > utilization, 3 seconds before my proxy says "hey, timeout" - every
> > other second there is -always- some utilization going on.
> >
> > What steps could I take to figure out this issue? Using debugfs.ocfs2
> > or something?
> >
> > It's mounted as:
> > /dev/sdb1 on /home type ocfs2
> > (rw,_netdev,noatime,data=writeback,heartbeat=local)
> >
> > I know I'm not being much help, but I'm willing to try almost anything
> > as long as it doesn't cause downtime or require cluster-wide changes
> > (since those require downtime...). I want to try going back to
> > 2.6.24-16 with data=writeback to see if that fixes the crashing
> > issue, but if I'm already having issues like this, perhaps I should
> > resolve them before moving up.
> >
> >
> >
> > [root at web03 ~]# cat /root/web03-iostat.txt
> >
> > Time: 02:11:46 PM
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           3.71    0.00   27.23    8.91    0.00   60.15
> >
> > Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
> > sda          0.00    54.46     0.00   309.90     0.00  2914.85     9.41    23.08    74.47     0.93    28.71
> > sdb         12.87     0.00    17.82     0.00   245.54     0.00    13.78     0.33    17.78    18.33    32.67
> >
> > Time: 02:11:47 PM
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           0.25    0.00   26.24    2.23    0.00   71.29
> >
> > Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
> > sda          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> > sdb          5.94     0.00    22.77     0.99   228.71     0.99     9.67     0.42    17.92    17.08    40.59
> >
> > Time: 02:11:48 PM   <- THIS HAS THE ISSUE
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           0.00    0.00   25.99    0.00    0.00   74.01
> >
> > Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
> > sda          0.00    10.89     0.00     2.97     0.00   110.89    37.33     0.00     0.00     0.00     0.00
> > sdb          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> >
> >
> > Time: 02:11:49 PM
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           0.25    0.00   14.85    0.99    0.00   83.91
> >
> > Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
> > sda          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> > sdb          0.99     0.00     2.97     0.99    30.69     0.99     8.00     0.07    17.50    17.50     6.93
> >
> > Time: 02:11:50 PM
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           0.74    0.00    1.24    1.73    0.00   96.29
> >
> > Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
> > sda          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> > sdb          0.99     0.00     5.94     0.00    55.45     0.00     9.33     0.07    11.67    11.67     6.93
> >
> > Time: 02:11:51 PM
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           0.00    0.00    1.24   16.34    0.00   82.43
> >
> > Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
> > sda          0.00   153.47     0.00   494.06     0.00  5156.44    10.44    55.62   107.23     1.16    57.43
> > sdb          2.97     0.00    11.88     0.99   117.82     0.99     9.23     0.26    13.08    20.00    25.74
> >
> > Time: 02:11:52 PM
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           0.00    0.00    0.25    3.22    0.00   96.53
> >
> > Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
> > sda          0.00     0.00     0.00    16.83     0.00   158.42     9.41     0.13   164.71     1.18     1.98
> > sdb          1.98     0.00     2.97     0.00    39.60     0.00    13.33     0.13    73.33    43.33    12.87
> >
> > Time: 02:11:53 PM
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           0.50    0.00    0.25    4.70    0.00   94.55
> >
> > Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
> > sda          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> > sdb          5.94     0.00    11.88     0.99   141.58     0.99    11.08     0.20    15.38    15.38    19.80
> >
> > Time: 02:11:54 PM
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           3.96    0.00   10.15    0.74    0.00   85.15
> >
> > Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
> > sda          0.00    20.79     0.00     4.95     0.00   205.94    41.60     0.00     0.00     0.00     0.00
> > sdb          4.95     0.00     5.94     0.00    87.13     0.00    14.67     0.07    11.67    11.67     6.93
> >
> >
> >
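To find stalls like the 02:11:48 one without eyeballing the log, the
capture can be scanned for samples where the device reports 0.00
%util. This is only a rough sketch: it assumes the capture file has a
"Time:" line before each sample, one line per device, and %util as
the last column. The function name `idle_samples` is made up for
illustration.

```python
def idle_samples(lines, device="sdb"):
    """Return the timestamps of samples where `device` shows 0.00 %util."""
    stamp = None
    idle = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Time:":
            # e.g. "Time: 02:11:48 PM" -> "02:11:48 PM"
            stamp = " ".join(fields[1:3])
        elif fields[0] == device:
            # The last column of the extended report is %util.
            if float(fields[-1]) == 0.0:
                idle.append(stamp)
    return idle
```

Something like `idle_samples(open("/root/web03-iostat.txt"))` would
then list the seconds to compare against the proxy's timeout log. A
single idle second is not proof of a hang by itself, so treat the
result as a list of candidates to correlate, not as a diagnosis.
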
> > On 4/21/08, Sunil Mushran <Sunil.Mushran at oracle.com> wrote:
> >
> >
> > > Do you have the panic output... kernel stack trace. We'll need
> > > that to figure this out. Without that, we can only speculate.
> > >
> > > mike wrote:
> > >
> > >
> > > > On 4/21/08, Tao Ma <tao.ma at oracle.com> wrote:
> > > >
> > > >
> > > >
> > > >
> > > > > mike wrote:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > > I have changed my kernel back to 2.6.22-14-server, and now I
> > > > > > > don't get the kernel panics. It seems like an issue with
> > > > > > > 2.6.24-16 and some i/o made it crash...
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > OK, so it seems that it is a bug in the ocfs2 kernel code, not in
> > > > > > ocfs2-tools. :)
> > > > > > Then could you please describe in more detail how the kernel
> > > > > > panic happens?
> > > > >
> > > > >
> > > > >
> > > > >
> > > > Yeah, this specific issue seems like a kernel issue.
> > > >
> > > > I don't know, these are production systems and I am already getting
> > > > angry customers. I can't really test anymore. Both are standard Ubuntu
> > > > kernels.
> > > >
> > > > Okay: 2.6.22-14-server (I think still minor file access issues)
> > > > Breaks under load: 2.6.24-16-server
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > >
> > > > > > > However I am still getting file access timeouts once in a
> > > > > > > while. I am nervous about putting more load on the setup.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > Also please provide more details about it.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > I am using nginx for a frontend load balancer, and nginx for a
> > > > webserver as well. This doesn't seem to be related to the webserver at
> > > > all though, it was happening before this.
> > > >
> > > > lvs01 proxies traffic in to web01, web02, and web03 (currently using
> > > > nginx, before I was using LVS/ipvsadm)
> > > >
> > > > Every so often, one of the webservers sends me back
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > >
> > > > > > [root at raid01 .batch]# cat /etc/default/o2cb
> > > > > >
> > > > > > # O2CB_ENABLED: 'true' means to load the driver on boot.
> > > > > > O2CB_ENABLED=true
> > > > > >
> > > > > > # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
> > > > > > O2CB_BOOTCLUSTER=mycluster
> > > > > >
> > > > > > > # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
> > > > > > > O2CB_HEARTBEAT_THRESHOLD=7
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > This value is a little small, so how did you set up your shared
> > > > > > disk (iSCSI or ...)? The most common value I have heard of is 61,
> > > > > > which is about 120 secs. I don't know the reason; maybe Sunil can
> > > > > > tell you. ;)
> > > > > > You can also refer to
> > > > > > http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT.
> > >
> > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > > > # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
> > > > > > > O2CB_IDLE_TIMEOUT_MS=10000
> > > > > > >
> > > > > > > # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
> > > > > > > O2CB_KEEPALIVE_DELAY_MS=5000
> > > > > > >
> > > > > > > # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
> > > > > > > O2CB_RECONNECT_DELAY_MS=2000
> > > > > >
> > > > > >
> > > > > > On 4/21/08, Tao Ma <tao.ma at oracle.com> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > > Hi Mike,
> > > > > > > >     Are you sure it is caused by the update of ocfs2-tools?
> > > > > > > > AFAIK, ocfs2-tools only includes tools like mkfs, fsck and
> > > > > > > > tunefs etc. So if you don't make any change to the disk (by
> > > > > > > > using the new tools), it shouldn't cause a kernel panic,
> > > > > > > > since they are all user-space tools.
> > > > > > > > Then there is only one possibility. Have you modified
> > > > > > > > /etc/sysconfig/o2cb (this is the location on RHEL; not sure
> > > > > > > > about the location on Ubuntu)? I have checked the rpm
> > > > > > > > package for RHEL: it updates /etc/sysconfig/o2cb, and this
> > > > > > > > file has some timeouts defined in it.
> > > > > > > > So do you have a backup of this file? If yes, please restore
> > > > > > > > it to see whether it helps (I can't say for sure).
> > > > > > > > If not, do you remember the old values of the timeouts you
> > > > > > > > set for ocfs2? If yes, you can use o2cb configure to set
> > > > > > > > them yourself.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > _______________________________________________
> > > > Ocfs2-users mailing list
> > > > Ocfs2-users at oss.oracle.com
> > > > http://oss.oracle.com/mailman/listinfo/ocfs2-users
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> >