[Ocfs2-users] Problem with OCFS2 disk on some moments (slow until stalls)

Tariq Saeed tariq.x.saeed at oracle.com
Tue Sep 15 19:04:08 PDT 2015


Hi Area,
data=writeback improves things greatly. In ordered mode, the default,
data is written before a transaction (which logs only metadata changes)
is committed. This is very conservative: it ensures that by the time the
journal log buffer is written to the on-disk journal area, the data has
hit the disk, so the transaction can be safely replayed in case of a
crash -- only complete transactions are replayed, and by complete I mean:
begin-trans, changebuf1, changebuf2, ..., changebufN, end-trans. Replay
means the buffers are dispatched from the journal area on disk to their
ultimate home locations on disk. You can see now why ordered mode
generates so much I/O. In writeback mode, a transaction can hit the disk
while its data is written whenever the kernel wants, asynchronously and
with no ordering relationship to the metadata it belongs to. The danger
is that after a crash we can replay a transaction whose associated data
never made it to disk. For example, if you truncate a file up to a new,
bigger size and then write something to a page beyond the old size, that
page can hang around in core for a long time after the transaction is
written to the journal area on disk. If there is a crash while the data
page is still in core, then after replay the file will have the new
size, but the page will read back all zeros instead of what you wrote.
At any rate, this is a digression, just for your info.
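
For concreteness, here is a sketch of that hazard (paths and sizes are
made up for illustration):

    # file is, say, 1 MiB; grow it, then write a page past the old end
    truncate -s 10M /mnt/ocfs2/f                                # metadata-only size change
    dd if=data.bin of=/mnt/ocfs2/f bs=4K seek=512 conv=notrunc  # data page beyond old EOF
    # With data=writeback the size-change transaction can reach the
    # on-disk journal while that data page is still only in memory; a
    # crash at that point replays the new size, but the page reads back
    # as zeros.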

The commit interval is the interval at which data is synced to disk. I
think it may also be the interval after which the journal log buffer is
written to disk. So increasing it reduces the number of unnecessary
writes.
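
As a concrete sketch (device and mount point are hypothetical; adjust to
your setup), both options could go into /etc/fstab so they survive
remounts:

    # /etc/fstab entry for the OCFS2 volume
    /dev/mapper/ocfs2vol  /srv/ocfs2  ocfs2  _netdev,rw,noatime,data=writeback,commit=20  0 0

    # or, if your kernel accepts changing these on a live remount:
    mount -o remount,data=writeback,commit=20 /srv/ocfs2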

Now for the threads blocked for more than 120 sec in /var/log/messages.
There are two types.
The first type is blocked on a mutex on an OCFS2 system file, mostly the
global bitmap file shared by all nodes. All writes to system files are
done under transactions, and that may require flushing the journal
buffer to disk, depending upon your journal file size. The smaller the
journal, the fewer transactions it can hold, so the more frequently the
on-disk journal log needs to be reclaimed by dispatching the metadata
blocks from the journal space to their home locations, thus freeing up
on-disk journal space. This requires reading metadata blocks from the
journal area on disk and writing them to their home locations. So again,
a lot of I/O. I think the threads are waiting on the mutex because the
journal code must do this reclaiming to free up space. The other kind of
blocked threads are NOT in OCFS2 code, but they are all blocked on a
mutex. I don't know why. That would require getting a vmcore, chasing
down the mutex owner, and finding out why it is taking so long. I don't
think that is warranted at this time.
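
If you want to rule out an undersized journal, one way to check and grow
it (sketched from memory -- double-check the man pages, and note that
tunefs.ocfs2 needs the volume unmounted on all nodes; the device name is
illustrative):

    # list the system files; the per-node journals appear as journal:0000, ...
    debugfs.ocfs2 -R "ls -l //" /dev/sdb1

    # grow the journals so on-disk journal space is reclaimed less often
    tunefs.ocfs2 -J size=256M /dev/sdb1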
Let me know if you have any further questions.
Thanks
-Tariq
On 09/15/2015 01:55 AM, Area de Sistemas wrote:
> Hi Tariq,
>
> Yesterday one node was under load, though not as high as last week, and
> iostat showed:
> - 10% of samples with %util >90% (some peaks of 100%) and an average 
> value of 18%
> - %iowait peaks of 37% with an average value of 4%
>
> BUT:
> - none of the indicated error messages appeared in /var/log/messages
> - we have mounted the OCFS2 filesystem with TWO extra options:
>      data=writeback
>      commit=20
> * Question about these extra options:
>     Perhaps they help to mitigate the problem in some way?
>     I've read about using them (usually commit=60) but I don't know if
> they really help, and/or whether there are other useful options to use.
>     Before, the volume was mounted using only the options
> "_netdev,rw,noatime"
>
> NOTE:
> - we have left only one node active (not the three nodes of the 
> cluster) to "force" overloads
> - although only one node is serving the app, all three nodes have
> the OCFS2 volume mounted
>
>
> About the EACCES/ENOENT errors... we don't know whether they originate
> from:
> - abnormal behavior of the application
> - the OCFS2 problem (a user tries to unlink/rename something and, if
> the system is slow due to OCFS2, retries the operation again and
> again, so the first operation completes successfully but the following
> ones fail)
> - a possible concurrency problem: now, with only one node servicing
> the application, the errors don't appear, but with all three nodes in
> service they appeared (several nodes trying to do the same operation)
>
> As for the messages about blocked processes in /var/log/messages,
> I'll send the file directly to you (instead of to the list).
>
> Regards.
>
> ------------------------------------------------------------------------
>
> Area de Sistemas
> Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
> Universidad de Valladolid
> Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
> Telefono: 983 18-6410, Fax: 983 423271
> E-mail: sistemas at uva.es
>
> ------------------------------------------------------------------------
> On 14/09/15 at 20:29, Tariq Saeed wrote:
>>
>> On 09/14/2015 01:20 AM, Area de Sistemas wrote:
>>> Hello everyone,
>>>
>>> We have a problem in a 3-member OCFS2 cluster used to serve a
>>> web/PHP application that accesses (reads and/or writes) files located
>>> in the OCFS2 volume.
>>> The problem appears only sometimes (apparently during high-load
>>> periods).
>>>
>>> SYMPTOMS:
>>> - access to OCFS2 content becomes slower and slower until it stalls
>>>     * an "ls" command that normally takes <=1s takes 30s, 40s, 1m,...
>>> - load average of the system grows to 150, 200 or even more
>>>
>>> - high iowait values: 70-90%
>>>
>>          This is a hint that the disk is under pressure. Run iostat
>>          (see the man page) when this happens, producing a report
>>          every 3 seconds or so, and look at the %util column:
>>                        %util
>>                      Percentage of CPU time during which I/O requests
>> were issued to the device (bandwidth
>>                      utilization for the device). Device saturation
>> occurs when this value is close to 100%.
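>> For example (the device name is just illustrative):
>>
>>     iostat -dx 3        # extended device stats, report every 3 seconds
>>     iostat -dx sdb 3    # same, limited to the shared OCFS2 disk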
>>
>>> * but CPU usage is low
>>>
>>> - in the syslog, a lot of messages appear like:
>>>     (httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status = -13
>> EACCES: Permission denied. Find the filename and check its permissions
>> with ls -l.
>>>   or
>>>     (httpd,XXXXX,Y):ocfs2_unlink:951 ERROR: status = -2
>> ENOENT: No such file or directory. All we can say is that this was an
>> attempt to delete a file from a directory that has already been
>> deleted. This requires some knowledge of the environment. Is there an
>> application log?
>>>
>>>   and the more "worrying":
>>>      kernel: INFO: task httpd:3488 blocked for more than 120 seconds.
>>>      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>> disables this message.
>>>      kernel: httpd           D c6fe5d74     0  3488   1616 0x00000080
>>>      kernel: c6fe5e04 00000082 00000000 c6fe5d74 c6fe5d74 000041fd 
>>> c6fe5d88 c0439b18
>>>      kernel: c0b976c0 c0b976c0 c0b976c0 c0b976c0 ed0f0ac0 c6fe5de8 
>>> c0b976c0 f75ac6c0
>>>      kernel: f2f0cd60 c0a95060 00000001 c6fe5dbc c0874b8d c6fe5de8 
>>> f8fd9a86 00000001
>>>      kernel: Call Trace:
>>>      kernel: [<c0439b18>] ? default_spin_lock_flags+0x8/0x10
>>>      kernel: [<c0874b8d>] ? _raw_spin_lock+0xd/0x10
>>>      kernel: [<f8fd9a86>] ? ocfs2_dentry_revalidate+0xc6/0x2d0 [ocfs2]
>>>      kernel: [<f8ff17be>] ? ocfs2_permission+0xfe/0x110 [ocfs2]
>>>      kernel: [<f905b6f0>] ? ocfs2_acl_chmod+0xd0/0xd0 [ocfs2]
>>>      kernel: [<c0873105>] schedule+0x35/0x50
>>>      kernel: [<c0873b2e>] __mutex_lock_slowpath+0xbe/0x120
>>>      ....
>>>
>> The important part of the backtrace is cut off. Where is the rest of
>> it? The entries starting with "?" are junk. You can attach
>> /var/log/messages to give us a complete picture. My guess is that the
>> threads block on the mutex for so long because the thread holding the
>> mutex is blocked on I/O.
>> Run "ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN" and look at
>> 'D'-state (uninterruptible sleep) processes. These are processes
>> usually blocked on I/O.
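>> For example, to pull out just the D-state processes (a sketch):
>>
>>     ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | awk '$2 ~ /^D/'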
>>>
>>> (UNACCEPTABLE) WORKAROUND:
>>>    stop httpd (really slow)
>>>    stop ocfs2 service (really slow)
>>>    start ocfs2 and httpd
>>>
>>> MORE INFO:
>>> - OS information:
>>>     Oracle Linux 6.4 32bit
>>>     4GB RAM
>>>     uname -a: 2.6.39-400.109.6.el6uek.i686 #1 SMP Wed Aug 28 
>>> 09:55:10 PDT 2013 i686 i686 i386 GNU/Linux
>>>     * anyway: we have another 5-node cluster with Oracle Linux 7.1
>>> (so a 64-bit OS) serving a newer version of the same application and
>>> the problems are similar, so it appears not to be an OS problem but a
>>> more specific OCFS2 problem (bug? some tuning? other?)
>>>
>>> - standard configuration
>>>     * if you want I can show the cluster.conf configuration, but it
>>> is the "expected" configuration
>>>
>>> - standard configuration in o2cb:
>>>     Driver for "configfs": Loaded
>>>     Filesystem "configfs": Mounted
>>>     Stack glue driver: Loaded
>>>     Stack plugin "o2cb": Loaded
>>>     Driver for "ocfs2_dlmfs": Loaded
>>>     Filesystem "ocfs2_dlmfs": Mounted
>>>     Checking O2CB cluster "MoodleOCFS2": Online
>>>       Heartbeat dead threshold: 31
>>>       Network idle timeout: 30000
>>>       Network keepalive delay: 2000
>>>       Network reconnect delay: 2000
>>>       Heartbeat mode: Local
>>>     Checking O2CB heartbeat: Active
>>>
>>> - mount options: _netdev,rw,noatime
>>>     * so other options (commit, data, ...) have their default values
>>>
>>>
>>> Any ideas/suggestion?
>>>
>>> Regards.
>>>
>>> -- 
>>> ------------------------------------------------------------------------
>>>
>>> Area de Sistemas
>>> Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
>>> Universidad de Valladolid
>>> Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
>>> Telefono: 983 18-6410, Fax: 983 423271
>>> E-mail: sistemas at uva.es
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>> _______________________________________________
>>> Ocfs2-users mailing list
>>> Ocfs2-users at oss.oracle.com
>>> https://oss.oracle.com/mailman/listinfo/ocfs2-users
>>
>
