[Ocfs2-users] Problem with OCFS2 disk on some moments (slow until stalls)

Area de Sistemas sistemas at uva.es
Thu Sep 17 00:39:14 PDT 2015


Hi Tariq,


------------------------------------------------------------------------

Area de Sistemas
Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
Universidad de Valladolid
Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
Telefono: 983 18-6410, Fax: 983 423271
E-mail: sistemas at uva.es

------------------------------------------------------------------------
On 16/09/15 at 20:51, Tariq Saeed wrote:
>
> On 09/16/2015 01:19 AM, Area de Sistemas wrote:
>> Hi Tariq,
>>
>> DATA=WRITEBACK:
>> From your words I understand that IF WE CAN ASSUME/TOLERATE POSSIBLE
>> "CORRUPTION" OF FILE CONTENTS IN CASE OF FAILURE, this option CLEARLY
>> IMPROVES PERFORMANCE... right?
>> * I ask this because, according to the mount.ocfs2 man page, this
>> option "is rumored to be the highest-throughput option"... which is a
>> very vague/unclear description of its benefits
> I don't understand what exactly is unclear? Can you be more specific? 
> Is this a question or
> a comment?

It's a question: does data=writeback REALLY improve performance?
That is: is the improvement noticeable?

>>
>> COMMIT=XX (higher than 5s default):
>> I am a bit confused: after searching on the internet, many people
>> recommend using a value higher than the 5s default (typically 30s or
>> 60s) in case of performance issues, but you suggested that higher
>> values can increase the number of unnecessary writes, which sounds
>> like the opposite, so... can you clarify if/when commit="higher value
>> than default" can be useful?
> I stand corrected. I said the opposite of what I meant to say. Yes,
> the higher the commit interval, the longer the time lapse between
> syncing to disc, hence comparatively fewer I/Os due to commit.
>>

OK, so increasing the commit interval also helps improve the
performance/responsiveness of the filesystem... right?
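
(For our own reference, this is roughly how both options could be
applied on a test node. It is only a sketch: /dev/sdb1 and /data are
placeholder names, not our real device/mount point.)

```shell
# Remount the OCFS2 volume with writeback journaling and a 20 s commit
# interval (assumption: volume is /dev/sdb1 mounted on /data).
mount -o remount,data=writeback,commit=20 /data

# Verify the active mount options:
grep ' /data ' /proc/mounts

# Persistent form in /etc/fstab (one line):
#   /dev/sdb1  /data  ocfs2  _netdev,rw,noatime,data=writeback,commit=20  0 0
```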

>>
>> THREADS BLOCKED ON MUTEX ON OCFS2 FILESYSTEM:
>> From your words I understand that perhaps the log area size is too
>> small, which can cause extra I/Os in order to free/empty/clear the
>> log during high write loads...
>> - Is there some method to monitor log usage?
> I never had a need to do that and therefore don't know. If the tunefs
> man page does not show anything, then there is none. You will have to
> google. ocfs2 uses jbd2, which is used by other filesystems (ext3,
> ext4, ...) in Linux. The only control ocfs2 has over jbd2 is to call
> 'checkpoint', which means all blocks up to the latest transaction need
> to be written to their home locations on disc from the log area. This
> is I/O intensive and happens, for example, when a cluster-wide
> metadata lock held in exclusive mode is given up (the on-disk metadata
> protected by the lock must be up to date before releasing it).
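
As a note for the list: since ocfs2 uses jbd2, on kernels where jbd2
exports statistics there may be a way to watch journal activity. This
is an assumption to verify; availability and path layout depend on the
kernel version.

```shell
# On kernels where jbd2 exports per-journal statistics, a directory
# such as /proc/fs/jbd2/<device>-<journal-inode>/ may exist
# (assumption: not present on all kernel versions).
ls /proc/fs/jbd2/ 2>/dev/null

# The per-journal "info" file, when present, reports transaction
# counts and average commit times:
cat /proc/fs/jbd2/*/info 2>/dev/null
```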
>> * I've seen the tunefs.ocfs2 -J option and I suppose it doesn't harm
>> the filesystem, but I'd prefer to be sure about this. Anyway, if I
>> don't know the actual size of the log, I can't set an "acceptable"
>> higher value
> Again, I have not done it myself, so we are in the same boat.
> tunefs.ocfs2 should do it. You should not lose data (unless there is
> a bug in tunefs.ocfs2). Google is your best source for customer
> experiences.
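
(Noting here what we plan to try, as a sketch to verify against the
debugfs.ocfs2/tunefs.ocfs2 man pages; /dev/sdb1 is a placeholder for
the real OCFS2 device.)

```shell
# Read-only query of the system directory; the journal:NNNN entries
# show the per-node journal sizes (placeholder device name).
debugfs.ocfs2 -R "ls -l //" /dev/sdb1

# With the volume unmounted on all nodes (check the tunefs.ocfs2 man
# page for the exact requirements), grow the journals, e.g. to 64 MB:
tunefs.ocfs2 -J size=64M /dev/sdb1
```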
>>
>> LOGS/ERRORS (ocfs2_unlink, ocfs2_rename, task blocked for more than 120s):
>> Apparently the load from usage of the app continues to be high BUT
>> ALL THESE errors have disappeared...
>> * The usage pattern of the app these days is:
>>     - a high number of users generating "backups" of partial contents
>> of the OCFS disk (so: high read+write) <-- this is specific to the last weeks
>>     - "normal/low" read access to contents
>>
>> CHANGES MADE AND ERRORS EVOLUTION:
>> 1) First change:
>>     * two nodes disabled (httpd stopped but the ocfs2 volume remains
>> mounted), so ONLY ONE NODE IS SERVICING THE APP
>>     * After that, errors disappeared... although %util was high
>> 2) Three days later:
>>     * added commit=20 and data=writeback mount options to the OCFS2
>> volume (maintaining only one node servicing the app)
>>     * Situation persists: NO errors, although %util is high
>>
>> So... is it possible that concurrent use of OCFS2 (2-3 nodes
>> servicing the app simultaneously) generates too much overhead
>> (caused by OCFS2 operation)?
> You are spot on. If you get high %util with one node, then with more
> nodes there will be even more apps served simultaneously, burning
> more disc bandwidth, which seems to be the bottleneck here. And then
> there is the overhead of internode communication through disk, even
> if you don't service more apps. Adding nodes gives you HA at the cost
> of consuming some I/O bandwidth and won't help you, since you already
> are using plenty of bandwidth. You should consider, if you have not
> already, striping: making a volume out of many discs with the logical
> volume manager, etc.
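
For anyone following the thread, a striped volume along these lines
could be built roughly as below. Device names are placeholders and the
stripe count/size must be sized to the real hardware and workload.

```shell
# Sketch of a logical volume striped across three discs with LVM
# (placeholder device and volume names).
pvcreate /dev/sdb /dev/sdc /dev/sdd
vgcreate vg_ocfs2 /dev/sdb /dev/sdc /dev/sdd
lvcreate --stripes 3 --stripesize 64 --extents 100%FREE \
         --name lv_data vg_ocfs2

# Then format the striped volume as OCFS2 as usual, e.g.:
#   mkfs.ocfs2 -L MoodleData /dev/vg_ocfs2/lv_data
```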
>
> Question: Are you using a separate disc for global heart beat?
No: we use "local" heartbeat.
* We have another, different OCFS2 cluster (different disks, nodes,
etc.) servicing a newer version of the same app. In that cluster we use
a "global" heartbeat BUT on the same OCFS2 disk (not on a separate disk)


>> * Obviously, OCFS2 operation generates some "extra" load, but...
>> perhaps under some circumstances (like these days' usage of the app)
>> the extra load becomes REALLY HIGH?
>>
>> Regards.
>>
>> On 16/09/15 at 4:04, Tariq Saeed wrote:
>>> Hi Area,
>>> data=writeback improves things greatly. In ordered mode, the
>>> default, data is written before writing a transaction (which only
>>> logs metadata changes). This is very conservative, to ensure that
>>> before the journal log buffer is written to the disk journal area,
>>> the data has hit the disk and the transaction can be safely replayed
>>> in case of a crash -- only complete transactions are replayed; by
>>> complete I mean: begin-trans, changebuf1, changebuf2, ...,
>>> changebufN, end-trans. Replay means buffers are dispatched from the
>>> journal area on disk to their ultimate home locations on disk. You
>>> can see now why ordered mode generates so much I/O. In writeback
>>> mode, the transaction can hit the disk but the data will be written
>>> whenever the kernel wants, asynchronously and without regard to its
>>> related transaction. The danger is that in case of a crash we can
>>> replay a transaction whose associated data is not on disk. For
>>> example, if you truncate a file up to a new, bigger size and then
>>> write something to a page beyond the old size, the page could hang
>>> around in core for a long time after the transaction is written to
>>> the journal area on disk. If there is a crash while the data page is
>>> still in core, then after replay the file will have the new size but
>>> the page with data will show all zeros instead of what you wrote. At
>>> any rate, this is a digression, just for your info.
>>>
>>> The commit is the interval at which data is synced to disc. I think
>>> it may also be the interval after which the journal log buffer is
>>> written to disk. So decreasing it reduces the number of unnecessary
>>> writes.
>>>
>>> Now for the threads blocked for more than 120 sec in
>>> /var/log/messages. There are two types. The first type is blocked
>>> on a mutex on an ocfs2 system file, mostly the global bitmap file
>>> shared by all nodes. All writes to system files are done under
>>> transactions, and that may require flushing the journal buffer to
>>> disk, depending upon your journal file size. The smaller the size,
>>> the fewer transactions it can hold, so the more frequently the
>>> journal log on disk needs to be reclaimed by dispatching the
>>> metadata blocks from the journal space to their home locations,
>>> thus freeing up on-disk journal space. This requires reading
>>> metadata blocks from the journal area on disk and writing them to
>>> their home locations. So again, a lot of I/O. I think the threads
>>> are waiting on the mutex because the journal code must do this
>>> reclaiming to free up space. The other kind of blocked threads are
>>> NOT in ocfs2 code, but they are all blocked on a mutex. I don't
>>> know why. That would require getting a vmcore, chasing the mutex
>>> owner and finding out why it is taking a long time. I don't think
>>> that is warranted at this time.
>>> Let me know if you have any further questions.
>>> Thanks
>>> -Tariq
>>> On 09/15/2015 01:55 AM, Area de Sistemas wrote:
>>>> Hi Tariq,
>>>>
>>>> Yesterday one node was under load, though not as high as last
>>>> week, and iostat showed:
>>>> - 10% of samples with %util >90% (some peaks of 100%) and an
>>>> average value of 18%
>>>> - %iowait peaks of 37% with an average value of 4%
>>>>
>>>> BUT:
>>>> - none of the indicated error messages appeared in /var/log/messages
>>>> - we have mounted the OCFS2 filesystem with TWO extra options:
>>>>      data=writeback
>>>>      commit=20
>>>> * Question about these extra options:
>>>>     Perhaps they help mitigate the problem in some way?
>>>>     I've read about using them (usually commit=60) but I don't know
>>>> if they really help and/or whether there are other useful options
>>>> to use
>>>>     Before, the volume was mounted using only the options
>>>> "_netdev,rw,noatime"
>>>>
>>>> NOTE:
>>>> - we have left only one node active (not the three nodes of the 
>>>> cluster) to "force" overloads
>>>> - although only one node is serving the app, all the three nodes 
>>>> have the OCFS volume mounted
>>>>
>>>>
>>>> About the EACCES/ENOENT errors... we don't know if they are
>>>> caused by:
>>>> - abnormal behavior of the application
>>>> - the OCFS2 problem (a user tries to unlink/rename something and,
>>>> if the system is slow due to OCFS, the user retries this operation
>>>> again and again, causing the first operation to complete
>>>> successfully but the following ones to fail)
>>>> - a possible concurrency problem: now, with only one node servicing
>>>> the application, errors don't appear, but with the three nodes in
>>>> service errors appeared (several nodes trying to do the same
>>>> operation)
>>>>
>>>> And about the messages concerning blocked processes in
>>>> /var/log/messages, I'll send the file directly to you (instead of
>>>> to the list).
>>>>
>>>> Regards.
>>>>
>>>> On 14/09/15 at 20:29, Tariq Saeed wrote:
>>>>>
>>>>> On 09/14/2015 01:20 AM, Area de Sistemas wrote:
>>>>>> Hello everyone,
>>>>>>
>>>>>> We have a problem in a 3-member OCFS2 cluster used to serve a
>>>>>> web/php application that accesses (reads and/or writes) files
>>>>>> located in the OCFS2 volume.
>>>>>> The problem appears only sometimes (apparently during high load
>>>>>> periods).
>>>>>>
>>>>>> SYMPTOMS:
>>>>>> - access to OCFS2 content becomes slower and slower until it stalls
>>>>>>     * an "ls" command that normally takes <=1s takes 30s, 40s, 1m, ...
>>>>>> - the load average of the system grows to 150, 200 or even more
>>>>>>
>>>>>> - high iowait values: 70-90%
>>>>>>
>>>>>          This is a hint that the disk is under pressure. Run
>>>>> iostat (see the man page) when this happens, producing a report
>>>>> every 3 seconds, and look at the %util column:
>>>>>                        %util
>>>>>                      Percentage of CPU time during which I/O
>>>>> requests were issued to the device (bandwidth utilization for the
>>>>> device). Device saturation occurs when this value is close to 100%.
>>>>>
>>>>>> * but CPU usage is low
>>>>>>
>>>>>> - in the syslog appears a lot of messages like:
>>>>>>     (httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status = -13
>>>>> EACCES: Permission denied. Find the filename and check its perms with ls -l.
>>>>>>   or
>>>>>>     (httpd,XXXXX,Y):ocfs2_unlink:951 ERROR: status = -2
>>>>> ENOENT: All we can say is that it was an attempt to delete a file
>>>>> from a directory that has already been deleted.
>>>>>                         This requires some knowledge of the
>>>>> environment. Is there an application log?
>>>>>>
>>>>>>   and the more "worrying":
>>>>>>      kernel: INFO: task httpd:3488 blocked for more than 120 seconds.
>>>>>>      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>>>>> disables this message.
>>>>>>      kernel: httpd           D c6fe5d74     0  3488 1616 0x00000080
>>>>>>      kernel: c6fe5e04 00000082 00000000 c6fe5d74 c6fe5d74 
>>>>>> 000041fd c6fe5d88 c0439b18
>>>>>>      kernel: c0b976c0 c0b976c0 c0b976c0 c0b976c0 ed0f0ac0 
>>>>>> c6fe5de8 c0b976c0 f75ac6c0
>>>>>>      kernel: f2f0cd60 c0a95060 00000001 c6fe5dbc c0874b8d 
>>>>>> c6fe5de8 f8fd9a86 00000001
>>>>>>      kernel: Call Trace:
>>>>>>      kernel: [<c0439b18>] ? default_spin_lock_flags+0x8/0x10
>>>>>>      kernel: [<c0874b8d>] ? _raw_spin_lock+0xd/0x10
>>>>>>      kernel: [<f8fd9a86>] ? ocfs2_dentry_revalidate+0xc6/0x2d0 
>>>>>> [ocfs2]
>>>>>>      kernel: [<f8ff17be>] ? ocfs2_permission+0xfe/0x110 [ocfs2]
>>>>>>      kernel: [<f905b6f0>] ? ocfs2_acl_chmod+0xd0/0xd0 [ocfs2]
>>>>>>      kernel: [<c0873105>] schedule+0x35/0x50
>>>>>>      kernel: [<c0873b2e>] __mutex_lock_slowpath+0xbe/0x120
>>>>>>      ....
>>>>>>
>>>>> the important part of the backtrace is cut off. Where is the rest
>>>>> of it? The entries starting with "?" are junk. You can attach
>>>>> /var/log/messages to give us a complete picture. My guess is that
>>>>> blocking on a mutex for so long means the thread holding the mutex
>>>>> is blocked on I/O.
>>>>> Run "ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN" and look at
>>>>> 'D' state (uninterruptible sleep) processes. These are processes
>>>>> usually blocked on I/O.
>>>>>>
>>>>>> (UNACCEPTABLE) WORKAROUND:
>>>>>>    stop httpd (really slow)
>>>>>>    stop the ocfs2 service (really slow)
>>>>>>    start ocfs2 and httpd
>>>>>>
>>>>>> MORE INFO:
>>>>>> - OS information:
>>>>>>     Oracle Linux 6.4 32bit
>>>>>>     4GB RAM
>>>>>>     uname -a: 2.6.39-400.109.6.el6uek.i686 #1 SMP Wed Aug 28
>>>>>> 09:55:10 PDT 2013 i686 i686 i386 GNU/Linux
>>>>>>     * anyway: we have another 5-node cluster with Oracle Linux
>>>>>> 7.1 (so a 64-bit OS) serving a newer version of the same
>>>>>> application, and the problems are similar, so it appears not to
>>>>>> be an OS problem but a more specific OCFS2 problem (bug? some
>>>>>> tuning? other?)
>>>>>>
>>>>>> - standard configuration
>>>>>>     * if you want I can show the cluster.conf configuration, but
>>>>>> it is the "expected" configuration
>>>>>>
>>>>>> - standard configuration in o2cb:
>>>>>>     Driver for "configfs": Loaded
>>>>>>     Filesystem "configfs": Mounted
>>>>>>     Stack glue driver: Loaded
>>>>>>     Stack plugin "o2cb": Loaded
>>>>>>     Driver for "ocfs2_dlmfs": Loaded
>>>>>>     Filesystem "ocfs2_dlmfs": Mounted
>>>>>>     Checking O2CB cluster "MoodleOCFS2": Online
>>>>>>       Heartbeat dead threshold: 31
>>>>>>       Network idle timeout: 30000
>>>>>>       Network keepalive delay: 2000
>>>>>>       Network reconnect delay: 2000
>>>>>>       Heartbeat mode: Local
>>>>>>     Checking O2CB heartbeat: Active
>>>>>>
>>>>>> - mount options: _netdev,rw,noatime
>>>>>>     * so other options (commit, data, ...) have their default values
>>>>>>
>>>>>>
>>>>>> Any ideas/suggestion?
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Ocfs2-users mailing list
>>>>>> Ocfs2-users at oss.oracle.com
>>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>>
>>>>
>>>
>>
>



