[Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Sunil Mushran
Sunil.Mushran at oracle.com
Mon Apr 21 14:41:14 PDT 2008
Setting up netconsole does not require a reboot. The idea is to
catch the oops trace when the oops happens. Without that trace,
we are flying blind.
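For reference, netconsole can usually be loaded on a running kernel with no reboot. A sketch with placeholder addresses (substitute your own IPs, interface name, and the receiver's MAC; flags on the receiving side vary by netcat version):

```shell
# Load netconsole at runtime; oops traces are sent as UDP syslog packets.
# Parameter format: <local-port>@<local-ip>/<interface>,<remote-port>@<remote-ip>/<remote-mac>
modprobe netconsole netconsole=6665@192.168.0.5/eth0,6666@192.168.0.6/00:11:22:33:44:55

# On the receiving machine, capture whatever arrives on that port:
nc -l -u 6666 | tee netconsole.log
```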
mike wrote:
> Since these are production I can't do much.
>
> But I did get an error (it's not happening as often, but it still blips
> here and there).
>
> Notice that /dev/sdb (my iSCSI target running OCFS2) hits 0.00%
> utilization 3 seconds before my proxy says "hey, timeout" - every
> other second there is -always- some utilization going on.
>
> What steps could I take to figure out this issue? Using debugfs.ocfs2 or something?
>
> It's mounted as:
> /dev/sdb1 on /home type ocfs2
> (rw,_netdev,noatime,data=writeback,heartbeat=local)
>
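On the debugfs.ocfs2 question: read-only queries are generally safe on a live volume. A minimal sketch (device name taken from the mount output above; the exact report commands available depend on your tools version):

```shell
# -R runs a single read-only command against the device and exits.
debugfs.ocfs2 -R "stats" /dev/sdb1      # superblock, features, cluster/block size
debugfs.ocfs2 -R "ls -l //" /dev/sdb1   # listing of the system directory
```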
> I know I'm not being much help, but I'm willing to try almost anything
> as long as it doesn't cause downtime or require cluster-wide changes
> (since those require downtime...). I want to try going back to
> 2.6.24-16 with data=writeback to see if that fixes the crashing
> issue, but if I'm already having issues like this, perhaps I should
> resolve this before moving up.
>
>
>
> [root at web03 ~]# cat /root/web03-iostat.txt
>
> Time: 02:11:46 PM
> avg-cpu: %user %nice %system %iowait %steal %idle
> 3.71 0.00 27.23 8.91 0.00 60.15
>
> Device:  rrqm/s  wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda        0.00   54.46    0.00  309.90     0.00  2914.85      9.41     23.08   74.47   0.93  28.71
> sdb       12.87    0.00   17.82    0.00   245.54     0.00     13.78      0.33   17.78  18.33  32.67
>
> Time: 02:11:47 PM
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.25 0.00 26.24 2.23 0.00 71.29
>
> Device:  rrqm/s  wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda        0.00    0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00   0.00   0.00
> sdb        5.94    0.00   22.77    0.99   228.71     0.99      9.67      0.42   17.92  17.08  40.59
>
> Time: 02:11:48 PM <- THIS HAS THE ISSUE
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.00 0.00 25.99 0.00 0.00 74.01
>
> Device:  rrqm/s  wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda        0.00   10.89    0.00    2.97     0.00   110.89     37.33      0.00    0.00   0.00   0.00
> sdb        0.00    0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00   0.00   0.00
>
>
> Time: 02:11:49 PM
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.25 0.00 14.85 0.99 0.00 83.91
>
> Device:  rrqm/s  wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda        0.00    0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00   0.00   0.00
> sdb        0.99    0.00    2.97    0.99    30.69     0.99      8.00      0.07   17.50  17.50   6.93
>
> Time: 02:11:50 PM
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.74 0.00 1.24 1.73 0.00 96.29
>
> Device:  rrqm/s  wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda        0.00    0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00   0.00   0.00
> sdb        0.99    0.00    5.94    0.00    55.45     0.00      9.33      0.07   11.67  11.67   6.93
>
> Time: 02:11:51 PM
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.00 0.00 1.24 16.34 0.00 82.43
>
> Device:  rrqm/s  wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda        0.00  153.47    0.00  494.06     0.00  5156.44     10.44     55.62  107.23   1.16  57.43
> sdb        2.97    0.00   11.88    0.99   117.82     0.99      9.23      0.26   13.08  20.00  25.74
>
> Time: 02:11:52 PM
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.00 0.00 0.25 3.22 0.00 96.53
>
> Device:  rrqm/s  wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda        0.00    0.00    0.00   16.83     0.00   158.42      9.41      0.13  164.71   1.18   1.98
> sdb        1.98    0.00    2.97    0.00    39.60     0.00     13.33      0.13   73.33  43.33  12.87
>
> Time: 02:11:53 PM
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.50 0.00 0.25 4.70 0.00 94.55
>
> Device:  rrqm/s  wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda        0.00    0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00   0.00   0.00
> sdb        5.94    0.00   11.88    0.99   141.58     0.99     11.08      0.20   15.38  15.38  19.80
>
> Time: 02:11:54 PM
> avg-cpu: %user %nice %system %iowait %steal %idle
> 3.96 0.00 10.15 0.74 0.00 85.15
>
> Device:  rrqm/s  wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda        0.00   20.79    0.00    4.95     0.00   205.94     41.60      0.00    0.00   0.00   0.00
> sdb        4.95    0.00    5.94    0.00    87.13     0.00     14.67      0.07   11.67  11.67   6.93
>
>
>
> On 4/21/08, Sunil Mushran <Sunil.Mushran at oracle.com> wrote:
>
>> Do you have the panic output, i.e. the kernel stack trace? We'll need
>> that to figure this out. Without it, we can only speculate.
>>
>> mike wrote:
>>
>>> On 4/21/08, Tao Ma <tao.ma at oracle.com> wrote:
>>>
>>>
>>>
>>>> mike wrote:
>>>>
>>>>
>>>>
>>>>> I have changed my kernel back to 2.6.22-14-server, and now I don't get
>>>>> the kernel panics. It seems like an issue with 2.6.24-16: some I/O
>>>>> made it crash...
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>> OK, so it seems it is a bug in the ocfs2 kernel module, not in ocfs2-tools.
>>>>
>> :)
>>
>>>> Then could you please describe in more detail how the kernel panic
>>>> happens?
>>>>
>>>>
>>>>
>>> Yeah, this specific issue seems like a kernel issue.
>>>
>>> I don't know, these are production systems and I am already getting
>>> angry customers. I can't really test anymore. Both are standard Ubuntu
>>> kernels.
>>>
>>> Works okay: 2.6.22-14-server (though I think there are still minor file access issues)
>>> Breaks under load: 2.6.24-16-server
>>>
>>>
>>>
>>>
>>>
>>>>> However I am still getting file access timeouts once in a while. I am
>>>>> nervous about putting more load on the setup.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>> Also please provide more details about it.
>>>>
>>>>
>>>>
>>> I am using nginx as the frontend load balancer, and nginx as the
>>> webserver as well. This doesn't seem to be related to the webserver
>>> at all, though; it was happening before this.
>>>
>>> lvs01 proxies traffic in to web01, web02, and web03 (currently using
>>> nginx, before I was using LVS/ipvsadm)
>>>
>>> Every so often, one of the webservers sends me back
>>>
>>>
>>>
>>>
>>>>> [root at raid01 .batch]# cat /etc/default/o2cb
>>>>>
>>>>> # O2CB_ENABLED: 'true' means to load the driver on boot.
>>>>> O2CB_ENABLED=true
>>>>>
>>>>> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
>>>>> O2CB_BOOTCLUSTER=mycluster
>>>>>
>>>>> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
>>>>> O2CB_HEARTBEAT_THRESHOLD=7
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>> This value is a little small. How did you set up your shared
>>>> disk (iSCSI or something else)? The most common value I have heard of
>>>> is 61, which is about 120 secs. I don't know the reason; maybe Sunil
>>>> can tell you. ;)
>>>> You can also refer to
>>>> http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT.
>>
>>>>
>>>>
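A note on the threshold arithmetic, based on the formula in the OCFS2 FAQ (verify against your tools version): the disk heartbeat timeout is roughly (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, so 7 gives about 12 seconds while the commonly cited 61 gives about 120:

```shell
# Approximate disk heartbeat timeout for a given O2CB_HEARTBEAT_THRESHOLD.
for threshold in 7 31 61; do
    echo "threshold=$threshold -> ~$(( (threshold - 1) * 2 )) seconds"
done
```

With iSCSI storage, a 12-second window can be too tight to ride out transient target hiccups, which is one reason larger thresholds are common.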
>>>>> # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is
>>>>> considered dead.
>>>>> O2CB_IDLE_TIMEOUT_MS=10000
>>>>>
>>>>> # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
>>>>> O2CB_KEEPALIVE_DELAY_MS=5000
>>>>>
>>>>> # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
>>>>> O2CB_RECONNECT_DELAY_MS=2000
>>>>>
>>>>>
>>>>> On 4/21/08, Tao Ma <tao.ma at oracle.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Hi Mike,
>>>>>> Are you sure it is caused by the update of ocfs2-tools?
>>>>>> AFAIK, ocfs2-tools only includes tools like mkfs, fsck, tunefs,
>>>>>> etc. So if you haven't made any changes to the disk (using these
>>>>>> new tools), it shouldn't cause a kernel panic, since they are all
>>>>>> user space tools.
>>>>>> Then there is only one other possibility. Have you modified
>>>>>> /etc/sysconfig/o2cb? (This is the location on RHEL; I'm not sure of
>>>>>> the location on Ubuntu.) I have checked the rpm package for RHEL:
>>>>>> it updates /etc/sysconfig/o2cb, and this file has some timeouts
>>>>>> defined in it.
>>>>>> So do you have a backup of this file? If yes, please restore it to
>>>>>> see whether that helps (I can't say for sure). If not, do you
>>>>>> remember the old timeout values you set for ocfs2? If yes, you can
>>>>>> use o2cb configure to set them yourself.
>>>>
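The interactive reconfiguration Tao mentions looks roughly like this (a sketch; paths differ by distro, and on Ubuntu the settings land in /etc/default/o2cb rather than /etc/sysconfig/o2cb):

```shell
# Re-enter cluster timeouts interactively; the init script prompts for the
# heartbeat threshold and the network timeout values, then persists them.
/etc/init.d/o2cb configure

# Afterwards, check what the cluster stack is actually running with:
/etc/init.d/o2cb status
```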
>>> _______________________________________________
>>> Ocfs2-users mailing list
>>> Ocfs2-users at oss.oracle.com
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>
>>>
>>>
>>
>
>