[Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?

Mon Apr 21 17:02:15 PDT 2008

Mike,

Are you sure it's not possible for sdb to be idle for just 1 second?  If 
you look at the interval right after the one you pointed out (02:11:49), 
you'll see r/s is 2.97 and w/s is 0.99, so it did about 3 reads and 1 
write in that one-second interval.  The device appears to be used very 
little.  I think it's quite possible that some 1-second intervals have 
no reads or writes at all, don't you think?
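
One way to check how often that happens is to count the idle intervals 
over a longer capture. A minimal sketch, assuming unwrapped iostat 
output with "Time:" and "Device:" header lines like yours; the function 
name and the two-interval sample are only illustrative:

```python
def idle_intervals(report, device="sdb"):
    """Return the timestamps of intervals where `device` did no reads or writes."""
    idle = []
    timestamp = None
    columns = None
    for line in report.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Time:":
            # e.g. "Time: 02:11:48 PM" -> "02:11:48 PM"
            timestamp = " ".join(fields[1:])
        elif fields[0] == "Device:":
            # Column names are taken from the header rather than hard-coded.
            columns = fields[1:]
        elif fields[0] == device and columns is not None:
            values = dict(zip(columns, map(float, fields[1:])))
            if values["r/s"] == 0.0 and values["w/s"] == 0.0:
                idle.append(timestamp)
    return idle


# Two sdb rows copied from the capture, joined onto single lines.
SAMPLE = """\
Time: 02:11:48 PM
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Time: 02:11:49 PM
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0.99 0.00 2.97 0.99 30.69 0.99 8.00 0.07 17.50 17.50 6.93
"""

print(idle_intervals(SAMPLE))
```

If idle seconds show up regularly and most of them do not line up with 
a proxy timeout, the 0.00% utilization is probably not the cause.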

Thanks,
Herbert.

mike wrote:
> Thanks.
>
> If I have the opportunity to run the (buggy) new kernel again I will
> try this. That is definitely a problem, and I think I need to set the
> oracle behavior to crash and not auto reboot for this to be effective,
> right?
>
> That is just one issue.
> 1) 2.6.24-16 under load completely crashes the node producing the largest i/o
> 2) 2.6.22-19 utilization drops to 0% and causes a hiccup randomly (I
> don't see a pattern, and no batch jobs or other things are running at
> the time it happens) - this is more important as it is still happening
> even though I'm running the more "stable" kernel.
>
>
> On 4/21/08, Sunil Mushran <Sunil.Mushran at oracle.com> wrote:
>   
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/networking/netconsole.txt;h=3c2f2b3286385337ce5ec24afebd4699dd1e6e0a;hb=HEAD
>>
>> netconsole is a facility to capture oops traces. It is not a console
>> per se and does not require a head/gtk/x11 etc to work. The link above
>> explains the usage, etc.
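
For reference, the document linked above gives the parameter format as
netconsole=[src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr].
A rough sketch of building that string and of a simple UDP receiver for
the logging host follows. All addresses, ports and function names here
are hypothetical placeholders, not values from this cluster:

```python
import socket


def netconsole_param(src_port, src_ip, dev, tgt_port, tgt_ip, tgt_mac=""):
    """Build the value of the netconsole= module/boot parameter."""
    return f"{src_port}@{src_ip}/{dev},{tgt_port}@{tgt_ip}/{tgt_mac}"


def receive_messages(port, max_datagrams):
    """Collect up to `max_datagrams` UDP datagrams on the logging host (blocks)."""
    messages = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", port))
        for _ in range(max_datagrams):
            data, _sender = sock.recvfrom(65535)
            messages.append(data.decode("utf-8", errors="replace"))
    return messages


# Hypothetical example: the node at 10.0.0.1 sends via eth0 to a log host at 10.0.0.2.
param = netconsole_param(6665, "10.0.0.1", "eth0", 6666, "10.0.0.2",
                         "00:11:22:33:44:55")
print("modprobe netconsole netconsole=" + param)
```

The receiver only has to be listening on the target port before the
oops happens; any UDP listener or a syslog daemon would do as well.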
>>
>>
>> mike wrote:
>>     
>>> Well these are headless production servers, CLI only. no GTK, no X11.
>>> also I am not running the newer kernels (and I can't...) it looks like
>>> I cannot run a hybrid of 2.6.24-16 and 2.6.22-19, whichever one has
>>> mounted the drive first is the winner.
>>>
>>> If I mix them, I can get the 2.6.24's to mount, then the older ones
>>> give the "number too large" error or whatever. So I can't currently
>>> use one server on my cluster to test because it would require
>>> upgrading all of them just for this test.
>>>
>>> On 4/21/08, Sunil Mushran <Sunil.Mushran at oracle.com> wrote:
>>>
>>>
>>>       
>>>> Setting up netconsole does not require a reboot. The idea is to
>>>> catch the oops trace when the oops happens. Without that trace,
>>>> we are flying blind.
>>>>
>>>>
>>>> mike wrote:
>>>>
>>>>
>>>>         
>>>>> Since these are production I can't do much.
>>>>>
>>>>> But I did get an error (it's not happening as much but it still blips
>>>>> here and there)
>>>>>
>>>>> Notice that /dev/sdb (my iscsi target using ocfs2) hits 0.00%
>>>>> utilization, 3 seconds before my proxy says "hey, timeout" - every
>>>>> other second there is -always- some utilization going on.
>>>>>
>>>>> What could be steps to figure out this issue? Using debugfs.ocfs2 or
>>>>> something?
>>>>>
>>>>> It's mounted as:
>>>>> /dev/sdb1 on /home type ocfs2
>>>>> (rw,_netdev,noatime,data=writeback,heartbeat=local)
>>>>>
>>>>> I know I'm not being much help, but I'm willing to try almost anything
>>>>> as long as it doesn't cause downtime or require cluster-wide changes
>>>>> (since those require downtime...) - I want to try to go back to
>>>>> 2.6.24-16 with data=writeback and see if that fixes the crashing
>>>>> issue, but if I'm having issues already like this perhaps I should
>>>>> resolve this before moving up.
>>>>>
>>>>>
>>>>>
>>>>> [root at web03 ~]# cat /root/web03-iostat.txt
>>>>>
>>>>> Time: 02:11:46 PM
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>          3.71    0.00   27.23    8.91    0.00   60.15
>>>>>
>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>>> sda               0.00    54.46    0.00  309.90     0.00  2914.85     9.41    23.08   74.47   0.93  28.71
>>>>> sdb              12.87     0.00   17.82    0.00   245.54     0.00    13.78     0.33   17.78  18.33  32.67
>>>>>
>>>>> Time: 02:11:47 PM
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>          0.25    0.00   26.24    2.23    0.00   71.29
>>>>>
>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>>> sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>>>>> sdb               5.94     0.00   22.77    0.99   228.71     0.99     9.67     0.42   17.92  17.08  40.59
>>>>>
>>>>> Time: 02:11:48 PM   <- THIS HAS THE ISSUE
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>          0.00    0.00   25.99    0.00    0.00   74.01
>>>>>
>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>>> sda               0.00    10.89    0.00    2.97     0.00   110.89    37.33     0.00    0.00   0.00   0.00
>>>>> sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>>>>>
>>>>> Time: 02:11:49 PM
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>          0.25    0.00   14.85    0.99    0.00   83.91
>>>>>
>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>>> sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>>>>> sdb               0.99     0.00    2.97    0.99    30.69     0.99     8.00     0.07   17.50  17.50   6.93
>>>>>
>>>>> Time: 02:11:50 PM
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>          0.74    0.00    1.24    1.73    0.00   96.29
>>>>>
>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>>> sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>>>>> sdb               0.99     0.00    5.94    0.00    55.45     0.00     9.33     0.07   11.67  11.67   6.93
>>>>>
>>>>> Time: 02:11:51 PM
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>          0.00    0.00    1.24   16.34    0.00   82.43
>>>>>
>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>>> sda               0.00   153.47    0.00  494.06     0.00  5156.44    10.44    55.62  107.23   1.16  57.43
>>>>> sdb               2.97     0.00   11.88    0.99   117.82     0.99     9.23     0.26   13.08  20.00  25.74
>>>>>
>>>>> Time: 02:11:52 PM
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>          0.00    0.00    0.25    3.22    0.00   96.53
>>>>>
>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>>> sda               0.00     0.00    0.00   16.83     0.00   158.42     9.41     0.13  164.71   1.18   1.98
>>>>> sdb               1.98     0.00    2.97    0.00    39.60     0.00    13.33     0.13   73.33  43.33  12.87
>>>>>
>>>>> Time: 02:11:53 PM
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>          0.50    0.00    0.25    4.70    0.00   94.55
>>>>>
>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>>> sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>>>>> sdb               5.94     0.00   11.88    0.99   141.58     0.99    11.08     0.20   15.38  15.38  19.80
>>>>>
>>>>> Time: 02:11:54 PM
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>          3.96    0.00   10.15    0.74    0.00   85.15
>>>>>
>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>>> sda               0.00    20.79    0.00    4.95     0.00   205.94    41.60     0.00    0.00   0.00   0.00
>>>>> sdb               4.95     0.00    5.94    0.00    87.13     0.00    14.67     0.07   11.67  11.67   6.93
>>>>>
>>>>>
>>>>>
>>>>> On 4/21/08, Sunil Mushran <Sunil.Mushran at oracle.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>           
>>>>>> Do you have the panic output... kernel stack trace. We'll need
>>>>>> that to figure this out. Without that, we can only speculate.
>>>>>>
>>>>>> mike wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> On 4/21/08, Tao Ma <tao.ma at oracle.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>>> mike wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> I have changed my kernel back to 2.6.22-14-server, and now I don't get
>>>>>>>>> the kernel panics. It seems like an issue with 2.6.24-16 and some i/o
>>>>>>>>> made it crash...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>> OK, so it seems that it is a bug for ocfs2 kernel, not the ocfs2-tools. :)
>>>>>>>> Then could you please describe it in more detail about how the kernel
>>>>>>>> panic happens?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>> Yeah, this specific issue seems like a kernel issue.
>>>>>>>
>>>>>>> I don't know, these are production systems and I am already getting
>>>>>>> angry customers. I can't really test anymore. Both are standard Ubuntu
>>>>>>> kernels.
>>>>>>>
>>>>>>> Okay: 2.6.22-14-server (I think still minor file access issues)
>>>>>>> Breaks under load: 2.6.24-16-server
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>>>> However I am still getting file access timeouts once in a while. I am
>>>>>>>>> nervous about putting more load on the setup.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>> Also please provide more details about it.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>> I am using nginx for a frontend load balancer, and nginx for a
>>>>>>> webserver as well. This doesn't seem to be related to the webserver at
>>>>>>> all though, it was happening before this.
>>>>>>>
>>>>>>> lvs01 proxies traffic in to web01, web02, and web03 (currently using
>>>>>>> nginx, before I was using LVS/ipvsadm)
>>>>>>>
>>>>>>> Every so often, one of the webservers sends me back
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>>>> [root at raid01 .batch]# cat /etc/default/o2cb
>>>>>>>>>
>>>>>>>>> # O2CB_ENABLED: 'true' means to load the driver on boot.
>>>>>>>>> O2CB_ENABLED=true
>>>>>>>>>
>>>>>>>>> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
>>>>>>>>> O2CB_BOOTCLUSTER=mycluster
>>>>>>>>>
>>>>>>>>> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
>>>>>>>>> O2CB_HEARTBEAT_THRESHOLD=7
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>> This value is a little smaller, so how did you build up your shared
>>>>>>>> disk(iSCSI or ...)? The most common value I heard of is 61. It is
>>>>>>>> about 120 secs. I don't know the reason and maybe Sunil can tell you. ;)
>>>>>>>> You can also refer to
>>>>>>>> http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT.
>>>>>>>>
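
As far as I understand the FAQ linked above, the threshold counts
2-second heartbeat iterations, so the timeout in seconds is
(threshold - 1) * 2. That matches 61 being about 120 secs and would put
the configured value of 7 at roughly 12 secs. A small sketch of the
conversion, with illustrative function names:

```python
HEARTBEAT_INTERVAL_SECS = 2  # assumed interval between disk heartbeats


def threshold_to_timeout_secs(threshold):
    """Seconds without a disk heartbeat before a node is considered dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def timeout_secs_to_threshold(timeout_secs):
    """O2CB_HEARTBEAT_THRESHOLD value for a desired timeout in seconds."""
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


for threshold in (7, 61):
    print(threshold, "->", threshold_to_timeout_secs(threshold), "secs")
```
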
>>>>>>>>> # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
>>>>>>>>> O2CB_IDLE_TIMEOUT_MS=10000
>>>>>>>>>
>>>>>>>>> # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
>>>>>>>>> O2CB_KEEPALIVE_DELAY_MS=5000
>>>>>>>>>
>>>>>>>>> # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
>>>>>>>>> O2CB_RECONNECT_DELAY_MS=2000
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 4/21/08, Tao Ma <tao.ma at oracle.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>>>> Hi Mike,
>>>>>>>>>>    Are you sure it is caused by the update of ocfs2-tools?
>>>>>>>>>> AFAIK, the ocfs2-tools only include tools like mkfs, fsck and tunefs
>>>>>>>>>> etc. So if you don't make any change to the disk (by using these new
>>>>>>>>>> tools), it shouldn't cause the problem of kernel panic since they are
>>>>>>>>>> all user space tools.
>>>>>>>>>> Then there is only one thing maybe. Have you modified
>>>>>>>>>> /etc/sysconfig/o2cb (this is the place for RHEL, not sure of the place
>>>>>>>>>> in ubuntu)? I have checked the rpm package for RHEL, it will update
>>>>>>>>>> /etc/sysconfig/o2cb and this file has some timeouts defined in it.
>>>>>>>>>> So do you have some backups for this file? If yes, please restore it
>>>>>>>>>> to see whether it helps (I can't say it for sure).
>>>>>>>>>> If not, do you remember the old value of some timeouts you set for
>>>>>>>>>> ocfs2? If yes, you can use o2cb configure to set them by yourself.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                     
>>>>>>>>                 
>>>>>>> _______________________________________________
>>>>>>> Ocfs2-users mailing list
>>>>>>> Ocfs2-users at oss.oracle.com
>>>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>             
>