[Ocfs2-users] Problem with OCFS2 disk on some moments (slow until stalls)
Area de Sistemas
sistemas at uva.es
Thu Sep 17 00:39:14 PDT 2015
Hi Tariq,
------------------------------------------------------------------------
Area de Sistemas
Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
Universidad de Valladolid
Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
Telefono: 983 18-6410, Fax: 983 423271
E-mail: sistemas at uva.es
------------------------------------------------------------------------
On 16/09/15 at 20:51, Tariq Saeed wrote:
>
> On 09/16/2015 01:19 AM, Area de Sistemas wrote:
>> Hi Tariq,
>>
>> DATA=WRITEBACK:
>> From your words I understand that IF WE CAN ACCEPT/TOLERATE POSSIBLE
>> "CORRUPTION" OF FILE CONTENTS IN CASE OF FAILURE, this option CLEARLY
>> IMPROVES PERFORMANCE... right?
>> * I ask you this because, according to the mount.ocfs2 man page, this
>> option "is rumored to be the highest-throughput option"... which is a
>> very vague/unclear description of its benefits
> I don't understand what exactly is unclear. Can you be more specific?
> Is this a question or a comment?
It's a question: does data=writeback REALLY IMPROVE PERFORMANCE?
That is: is the improvement noticeable?
>>
>> COMMIT=XX (higher than 5s default):
>> I am a bit confused: after searching on the internet, many people
>> recommend using a value higher than the 5s default (typically 30s or
>> 60s) in case of performance issues, but you suggest that higher
>> values can increase the number of unnecessary writes, which sounds like
>> the opposite, so... can you clarify if/when commit="higher value than
>> default" can be useful?
> I stand corrected. I said the opposite of what I meant to say. Yes,
> the higher the commit interval, the longer the time lapse between
> syncing to disc, hence comparatively fewer i/os due to commit.
>>
OK, so increasing the commit interval also helps to improve the
performance/responsiveness of the filesystem... right?
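(Aside: a minimal sketch of how we combine these options in /etc/fstab; the device and mount point below are placeholders, and mount.ocfs2(8) is the authority on the exact option names:)

    # /etc/fstab entry (hypothetical device and mount point)
    /dev/mapper/ocfs2vol  /srv/appdata  ocfs2  _netdev,rw,noatime,data=writeback,commit=20  0 0

    # commit= can usually be changed with a remount; data= normally needs a full umount/mount
    mount -o remount,commit=20 /srv/appdata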
>>
>> THREADS BLOCKED ON MUTEX ON OCFS2 FILESYSTEM:
>> From your words I understand that perhaps the log area size is too small,
>> which can cause "extra" I/Os in order to free/empty/clear the log
>> during high write loads...
>> - Is there some method to monitor the log usage?
> I never had a need to do that and therefore don't know. If the tunefs
> man page does not show anything, then there is none. You will have to
> google. ocfs2 uses jbd2, which is used by other filesystems (ext3, ext4,
> ...) in Linux. The only control ocfs2 has over jbd2 is to call
> 'checkpoint', which means all blocks up to the latest transaction need
> to be written to their home locations on disc from the log area. This
> is i/o intensive and happens, for example, when a metadata cluster-wide
> lock held in exclusive mode is given up (the on-disk metadata protected
> by the lock must be up to date before releasing it).
>> * I've seen the tunefs.ocfs2 -J option and I suppose it doesn't harm the
>> filesystem, but I prefer to be sure about this. Anyway, if I don't know
>> the actual size of the log, I can't set an "acceptable" higher value
> Again, I have not done it myself, so we are in the same boat.
> tunefs.ocfs2 should do it. You should not lose data (unless there is a
> bug in tunefs.ocfs2). Google is your best source for customer experiences.
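(Aside: in case it is useful, a hedged sketch of how the journal could be inspected and grown with the stock tools; the device path is a placeholder, the //journal:NNNN naming is from memory, and the exact syntax should be checked against debugfs.ocfs2(8) and tunefs.ocfs2(8) before trusting it -- resizing needs the volume unmounted on all nodes:)

    # show the slot-0 journal system file; its size appears in the stat output
    debugfs.ocfs2 -R "stat //journal:0000" /dev/mapper/ocfs2vol

    # grow the journal, for example to 128MB, with the volume unmounted everywhere
    tunefs.ocfs2 -J size=128M /dev/mapper/ocfs2vol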
>>
>> LOGS/ERRORS (ocfs2_unlink, ocfs2_rename, task blocked for more than 120s):
>> Apparently the app usage load remains high, BUT ALL THESE
>> errors have disappeared...
>> * The usage pattern of the app these days is:
>> - a high number of users generating "backups" of partial contents
>> of the OCFS disk (so: high read+write) <-- this is specific to the last few weeks
>> - "normal/low" read access to contents
>>
>> CHANGES MADE AND ERROR EVOLUTION:
>> 1) First change:
>> * two nodes disabled (httpd stopped but the ocfs2 volume remains
>> mounted), so ONLY ONE NODE IS SERVICING THE APP
>> * After that, the errors disappeared... although %util was high
>> 2) Three days later:
>> * added the commit=20 and data=writeback mount options to the OCFS2
>> volume (maintaining only one node servicing the app)
>> * Situation persists: NO errors, although %util is high
>>
>> So... is it possible that concurrent use of OCFS2 (2-3 nodes
>> servicing the app simultaneously) generates too much overhead (caused
>> by OCFS2 operation)?
> You are spot on. If you get high %util with one node, then with more
> nodes there will be even more apps served simultaneously, burning more
> disc bandwidth, which seems to be the bottleneck here. And then there
> is the overhead of internode communication through disk, even if you
> don't serve more apps. Adding nodes gives you HA at the cost of
> consuming some i/o bandwidth, and it won't help you since you already
> are using plenty of bandwidth. You should consider, if you have not
> already, striping: making a volume out of many discs with the logical
> volume manager, etc.
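(Aside: regarding the striping suggestion, a purely illustrative LVM layout with made-up disc names and sizes could look like the following; on shared storage this would need clustered LVM or careful coordination, so it is only a sketch:)

    # build one striped logical volume across three discs (hypothetical devices)
    pvcreate /dev/sdb /dev/sdc /dev/sdd
    vgcreate vg_ocfs2 /dev/sdb /dev/sdc /dev/sdd
    # -i = number of stripes, -I = stripe size in KiB
    lvcreate -i 3 -I 64 -L 500G -n lv_data vg_ocfs2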
>
> Question: Are you using a separate disc for global heartbeat?
No: we use "local" heartbeat.
* We have another, different OCFS2 cluster (different disks, nodes, etc.)
servicing a newer version of the same app. In that cluster we use a
"global" heartbeat, BUT on the same OCFS2 disk (not on a separate disk).
(A rough sketch of a dedicated heartbeat device is included a bit
further below.)
>> * Obviously, the OCFS2 operation generates an "extra" load,
>> but... perhaps under some circumstances (like the app usage of these
>> days) the extra load becomes REALLY HIGH?
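(Aside, on the heartbeat question above: if we ever moved the heartbeat to a small dedicated device, a rough, untested sketch with the ocfs2-tools 1.8 "o2cb" command might look like the following; the device path is a placeholder and the exact syntax should be confirmed against mkfs.ocfs2(8) and o2cb(8):)

    # format a small dedicated LUN as a global-heartbeat device for the cluster
    mkfs.ocfs2 --cluster-stack=o2cb --cluster-name=MoodleOCFS2 --global-heartbeat /dev/mapper/hb_disk

    # register the device and switch the cluster to global heartbeat mode
    o2cb add-heartbeat MoodleOCFS2 /dev/mapper/hb_disk
    o2cb heartbeat-mode MoodleOCFS2 global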
>>
>> Regards.
>>
>> On 16/09/15 at 4:04, Tariq Saeed wrote:
>>> Hi Area,
>>> data=writeback improves things greatly. In ordered mode, the
>>> default, the data is written before writing a transaction (which only
>>> logs metadata changes). This is very conservative: it ensures that
>>> before the journal log buffer is written to the disk journal area,
>>> the data has hit the disk, so the transaction can be safely replayed
>>> in case of a crash -- only complete transactions are replayed, and by
>>> complete I mean: begin-trans, changebuf1, changebuf2, ..., changebufN,
>>> end-trans. Replay means the buffers are dispatched from the journal
>>> area on disk to their ultimate home locations on disk. You can see
>>> now why ordered mode generates so much i/o. In writeback mode, the
>>> transaction can hit the disk but the data will be written whenever
>>> the kernel wants, asynchronously and without knowing any relationship
>>> to its related transaction. The danger is that, in case of a crash,
>>> we can replay a transaction but its associated data is not on disk.
>>> For example, if you truncate a file up to a new, bigger size and then
>>> write something to a page beyond the old size, the page could hang
>>> around in core for a long time after the transaction is written to
>>> the journal area on disk. If there is a crash while the data page is
>>> still in core, then after replay the file will have the new size but
>>> the page with data will show all zeros instead of what you wrote. At
>>> any rate, this is a digression, just for your info.
>>>
>>> The commit interval is the interval at which data is synced to disc.
>>> I think it may also be the interval after which the journal log
>>> buffer is written to disk. So decreasing it reduces the number of
>>> unnecessary writes.
>>>
>>> Now for the threads blocked for more than 120 sec in
>>> /var/log/messages. There are two types.
>>> The first type is blocked on a mutex on an ocfs2 system file, mostly
>>> the global bitmap file shared by all nodes. All writes to system
>>> files are done under transactions, and that may require flushing the
>>> journal buffer to disk, depending upon your journal file size. The
>>> smaller the size, the fewer transactions it can hold, so the more
>>> frequently the journal log on disk needs to be reclaimed by
>>> dispatching the metadata blocks from the journal space to their home
>>> locations, thus freeing up on-disk journal space. This requires
>>> reading metadata blocks from the journal area on disk and writing
>>> them to their home locations. So again, a lot of i/o. I think the
>>> threads are waiting on the mutex because the journal code must do
>>> this reclaiming to free up space. The other kind of blocked threads
>>> are NOT in ocfs2 code, but they are all blocked on a mutex. I don't
>>> know why. That would require getting a vmcore, chasing the mutex
>>> owner, and finding out why it is taking a long time. I don't think
>>> that is warranted at this time.
>>> Let me know if you have any further questions.
>>> Thanks
>>> -Tariq
>>> On 09/15/2015 01:55 AM, Area de Sistemas wrote:
>>>> Hi Tariq,
>>>>
>>>> Yesterday one node was under load, but not as high as last week, and
>>>> iostat showed:
>>>> - 10% of samples with %util >90% (some peaks of 100%) and an
>>>> average value of 18%
>>>> - %iowait peaks of 37% with an average value of 4%
>>>>
>>>> BUT:
>>>> - none of the indicated error messages appeared in /var/log/messages
>>>> - we have mounted the OCFS2 filesystem with TWO extra options:
>>>> data=writeback
>>>> commit=20
>>>> * Question about these extra options:
>>>> Perhaps they help to mitigate the problem in some way?
>>>> I've read about using them (usually commit=60) but I don't know
>>>> if they really help and/or whether there are other useful options
>>>> to use
>>>> Before, the volume was mounted using only the options
>>>> "_netdev,rw,noatime"
>>>>
>>>> NOTE:
>>>> - we have left only one node active (not the three nodes of the
>>>> cluster) in order to "force" overloads
>>>> - although only one node is serving the app, all three nodes
>>>> have the OCFS volume mounted
>>>>
>>>>
>>>> About the EACCES/ENOENT errors... we don't know if they are
>>>> caused by:
>>>> - abnormal behavior of the application
>>>> - the OCFS2 problem (a user tries to unlink/rename something and, if
>>>> the system is slow due to OCFS, the user retries the operation again
>>>> and again, causing the first operation to complete successfully but
>>>> the following ones to fail)
>>>> - a possible concurrency problem: now, with only one node
>>>> servicing the application, the errors don't appear, but with the
>>>> three nodes in service the errors appeared (several nodes trying to
>>>> do the same operation)
>>>>
>>>> And about the messages about blocked processes in /var/log/messages:
>>>> I'll send the file directly to you (instead of to the list).
>>>>
>>>> Regards.
>>>>
>>>> On 14/09/15 at 20:29, Tariq Saeed wrote:
>>>>>
>>>>> On 09/14/2015 01:20 AM, Area de Sistemas wrote:
>>>>>> Hello everyone,
>>>>>>
>>>>>> We have a problem in a 3-member OCFS2 cluster used to serve a
>>>>>> web/php application that accesses (reads and/or writes) files
>>>>>> located on the OCFS2 volume.
>>>>>> The problem appears only sometimes (apparently during high load
>>>>>> periods).
>>>>>>
>>>>>> SYMPTOMS:
>>>>>> - access to the OCFS2 content becomes slower and slower until it stalls
>>>>>> * an "ls" command that normally takes <=1s takes 30s, 40s, 1m, ...
>>>>>> - the load average of the system grows to 150, 200 or even more
>>>>>>
>>>>>> - high iowait values: 70-90%
>>>>>>
>>>>> This is a hint that the disk is under pressure. Run iostat (see the
>>>>> man page) when this happens, producing a report every 3 seconds or
>>>>> so, and look at the %util column:
>>>>> %util
>>>>> Percentage of CPU time during which I/O
>>>>> requests were issued to the device (bandwidth
>>>>> utilization for the device). Device
>>>>> saturation occurs when this value is close to 100%.
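(Aside: a concrete invocation along those lines; -x adds the extended columns, including %util, and the trailing device name is just an example:)

    # extended per-device statistics, refreshed every 3 seconds
    iostat -x 3

    # or limited to the device backing the OCFS2 volume ("sdb" is hypothetical)
    iostat -x sdb 3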
>>>>>
>>>>>> * but CPU usage is low
>>>>>>
>>>>>> - a lot of messages like these appear in the syslog:
>>>>>> (httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status = -13
>>>>> EACCES: Permission denied. Find the filename and check its permissions with ls -l.
>>>>>> or
>>>>>> (httpd,XXXXX,Y):ocfs2_unlink:951 ERROR: status = -2
>>>>> ENOENT: No such file or directory. All we can say is that this was
>>>>> an attempt to delete a file that had already been deleted from the
>>>>> directory. This requires some knowledge of the environment. Is
>>>>> there an application log?
>>>>>>
>>>>>> and the more "worrying":
>>>>>> kernel: INFO: task httpd:3488 blocked for more than 120 seconds.
>>>>>> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this message.
>>>>>> kernel: httpd D c6fe5d74 0 3488 1616 0x00000080
>>>>>> kernel: c6fe5e04 00000082 00000000 c6fe5d74 c6fe5d74
>>>>>> 000041fd c6fe5d88 c0439b18
>>>>>> kernel: c0b976c0 c0b976c0 c0b976c0 c0b976c0 ed0f0ac0
>>>>>> c6fe5de8 c0b976c0 f75ac6c0
>>>>>> kernel: f2f0cd60 c0a95060 00000001 c6fe5dbc c0874b8d
>>>>>> c6fe5de8 f8fd9a86 00000001
>>>>>> kernel: Call Trace:
>>>>>> kernel: [<c0439b18>] ? default_spin_lock_flags+0x8/0x10
>>>>>> kernel: [<c0874b8d>] ? _raw_spin_lock+0xd/0x10
>>>>>> kernel: [<f8fd9a86>] ? ocfs2_dentry_revalidate+0xc6/0x2d0
>>>>>> [ocfs2]
>>>>>> kernel: [<f8ff17be>] ? ocfs2_permission+0xfe/0x110 [ocfs2]
>>>>>> kernel: [<f905b6f0>] ? ocfs2_acl_chmod+0xd0/0xd0 [ocfs2]
>>>>>> kernel: [<c0873105>] schedule+0x35/0x50
>>>>>> kernel: [<c0873b2e>] __mutex_lock_slowpath+0xbe/0x120
>>>>>> ....
>>>>>>
>>>>> The important part of the bt is cut off. Where is the rest of it?
>>>>> The entries starting with "?" are junk. You can attach
>>>>> /var/log/messages to give us a complete picture. My guess for why
>>>>> the threads block on the mutex for so long is that the thread
>>>>> holding the mutex is blocked on i/o.
>>>>> Run "ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN" and look at
>>>>> 'D'-state (uninterruptible sleep) processes. These are processes
>>>>> usually blocked on i/o.
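(Aside: one way to run that and keep only the 'D'-state processes; the WCHAN column header text is free-form, so the exact header does not matter:)

    # list all processes with their wait channel, then keep the header plus
    # anything in uninterruptible sleep (state beginning with D)
    ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | awk 'NR==1 || $2 ~ /^D/'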
>>>>>>
>>>>>> (UNACCEPTABLE) WORKAROUND:
>>>>>> stop httpd (really slow)
>>>>>> stop ocfs2 service (really slow)
>>>>>> start ocfs2 and httpd
>>>>>>
>>>>>> MORE INFO:
>>>>>> - OS information:
>>>>>> Oracle Linux 6.4 32bit
>>>>>> 4GB RAM
>>>>>> uname -a: 2.6.39-400.109.6.el6uek.i686 #1 SMP Wed Aug 28
>>>>>> 09:55:10 PDT 2013 i686 i686 i386 GNU/Linux
>>>>>> * anyway: we have another 5-node cluster with Oracle Linux
>>>>>> 7.1 (so a 64-bit OS) serving a newer version of the same
>>>>>> application, and the problems are similar, so it appears not to be
>>>>>> an OS problem but a more specific OCFS2 problem (bug? some tuning?
>>>>>> other?)
>>>>>>
>>>>>> - standard configuration
>>>>>> * if you want, I can show the cluster.conf configuration, but it
>>>>>> is the "expected" configuration
>>>>>>
>>>>>> - standard configuration in o2cb:
>>>>>> Driver for "configfs": Loaded
>>>>>> Filesystem "configfs": Mounted
>>>>>> Stack glue driver: Loaded
>>>>>> Stack plugin "o2cb": Loaded
>>>>>> Driver for "ocfs2_dlmfs": Loaded
>>>>>> Filesystem "ocfs2_dlmfs": Mounted
>>>>>> Checking O2CB cluster "MoodleOCFS2": Online
>>>>>> Heartbeat dead threshold: 31
>>>>>> Network idle timeout: 30000
>>>>>> Network keepalive delay: 2000
>>>>>> Network reconnect delay: 2000
>>>>>> Heartbeat mode: Local
>>>>>> Checking O2CB heartbeat: Active
>>>>>>
>>>>>> - mount options: _netdev,rw,noatime
>>>>>> * so other options (commit, data, ...) have their default values
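(Aside: the options actually in effect on each node can be double-checked with, for example:)

    # show the ocfs2 mounts with the options the kernel is really using
    grep ocfs2 /proc/mounts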
>>>>>>
>>>>>>
>>>>>> Any ideas/suggestion?
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Ocfs2-users mailing list
>>>>>> Ocfs2-users at oss.oracle.com
>>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>>
>>>>
>>>
>>
>