<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <br>

    <div class="moz-cite-prefix">On 09/16/2015 01:19 AM, Area de

      Sistemas wrote:<br>

    </div>

    <blockquote cite="mid:55F925FF.60102@uva.es" type="cite">


      <tt>Hi Tariq,<br>

        <br>

        DATA=WRITEBACK:<br>

        From your words I understand that IF WE CAN ASSUME/TOLERATE
        POSSIBLE "CORRUPTION" OF FILE CONTENTS IN CASE OF FAILURE, this
        option CLEARLY IMPROVES PERFORMANCE...right?<br>
        * I ask because, according to the mount.ocfs2 man page, this
        option "is rumored to be the highest-throughput option"...which
        gives only a very vague idea of its benefits<br>

      </tt></blockquote>

    <tt>I don't understand what exactly is unclear.</tt> Can you be more
    specific? Is this a question or<br>
    a comment?<br>

    <blockquote cite="mid:55F925FF.60102@uva.es" type="cite"><tt> <br>

        COMMIT=XX (higher than 5s default):<br>

        I am a bit confused: after searching on the internet, many people
        recommend using a value higher than the 5s default (typically 30s
        or 60s) in case of performance issues, but you suggest that
        higher values can increase the number of unnecessary writes, which
        sounds like the opposite, so...can you clarify if/when commit="higher
        value than default" can be useful?<br>

      </tt></blockquote>

    <tt>I stand corrected. I said the opposite of what I meant to say.
      Yes, the higher the commit interval,<br>
      the longer the time between syncs to disk, and hence the fewer
      I/Os caused by <br>
      commits. <br>

    </tt>
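    <p>As a rough illustration of the point above (a back-of-the-envelope
      sketch, not a measurement): assuming the journal is committed exactly
      once per commit interval, and ignoring commits forced by fsync, lock
      handling or a full journal, the number of timer-driven commits per hour
      drops as the interval grows. The function name is made up for this
      example.</p>
<pre wrap="">
```python
def periodic_commits_per_hour(commit_interval_s):
    """Timer-driven journal commits in one hour.

    Assumption: exactly one commit per interval, and no commits
    forced by fsync or by the journal filling up.
    """
    return 3600 // commit_interval_s


for interval in (5, 20, 60):
    count = periodic_commits_per_hour(interval)
    print(f"commit={interval}: {count} periodic commits per hour")
```
</pre>
    <p>Under these assumptions, commit=20 means a quarter of the periodic
      commits of the 5s default. The trade-off is that more recent changes
      can be lost in a crash.</p>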

    <blockquote cite="mid:55F925FF.60102@uva.es" type="cite"><tt> <br>

        <br>

        THREADS BLOCKED ON MUTEX ON OCFS2 FILESYSTEM:<br>

        From your words I understand that perhaps the log area size is too
        small, which can cause many "extra" I/Os in order to
        free/empty/clear the log during high write loads...<br>
        - Is there some method to monitor the log usage?<br>

      </tt></blockquote>

    <tt>I never had a need to do that and therefore don't know. If
      the tunefs.ocfs2 man page does<br>
      not show anything, then there is none. You will have to search for it.
      ocfs2 uses jbd2, which is<br>
      also used by other filesystems (ext3, ext4, ...) in Linux. The only control
      ocfs2 has over jbd2 is to call 'checkpoint', which means all
      blocks up to the latest transaction need to be written<br>
      from the log area to their home locations on disk. This is I/O
      intensive and happens,<br>
      for example, when a cluster-wide metadata lock held in exclusive
      mode is given up<br>
      (the on-disk metadata protected by the lock must be up to date</tt>
    before releasing it). <br>

    <blockquote cite="mid:55F925FF.60102@uva.es" type="cite"><tt>     *

        I've seen the tunefs.ocfs2 -J option and I suppose it doesn't harm
        the filesystem, but I prefer to be sure about this. Anyway, if I
        don't know the current size of the log, I can't set an
        "acceptable" higher value<br>

      </tt></blockquote>

    <tt>Again, I have not done it myself, so we are in the same
      boat. tunefs.ocfs2 should do it. You<br>
      should not lose data (unless there is a bug in tunefs.ocfs2).
      Google is your best source <br>
      for customer experiences.<br>

    </tt>

    <blockquote cite="mid:55F925FF.60102@uva.es" type="cite"><tt> <br>

        LOGS/ERRORS (ocfs2_unlink, ocfs2_rename, task blocked for more
        than 120s):<br>
        Apparently the load from app usage continues to be high
        BUT ALL THESE errors have disappeared...<br>

        * The usage pattern of the app these days is:<br>

            - high number of users generating "backups" of partial

        contents of OCFS disk (so: high read+write) &lt;--this is

        specific to the last weeks<br>

            - "normal/low" reading access to contents<br>

        <br>

        CHANGES MADE AND ERRORS EVOLUTION:<br>

        1) First change:<br>

            * two nodes disabled (httpd stopped but ocfs2 volume

        continues mounted) so ONLY ONE NODE IS SERVICING THE APP<br>

            * After that, errors dissapeared...although %util was high<br>

        2) Three days after:<br>

            * added commit=20 and data=writeback mount options to OCFS2

        volume (maintaining only one node servicing app)<br>

            * Situation persists: NO errors, although %util high<br>

        <br>

        So...is it possible that the concurrent use of OCFS2 (2-3
        nodes servicing the app simultaneously) generates too much overhead
        (caused by OCFS2 operation)?<br>

      </tt></blockquote>

    <tt>You are spot on. If you get high %util with one node, then with
      more nodes there will be even<br>
      more apps served simultaneously, burning more disk bandwidth,
      which seems to be the bottleneck here.<br>
      And then there is the overhead of internode communication through
      the disk, even if you don't service more<br>
      apps. Adding nodes gives you HA at the cost of consuming some I/O
      bandwidth, and won't help you since<br>
      you are already using plenty of bandwidth. If you have not already,
      you should consider striping, i.e. making a volume out of many disks
      with the logical volume manager, etc.<br>
      <br>
      Question: Are you using a separate disk for the global heartbeat? <br>

    </tt>

    <blockquote cite="mid:55F925FF.60102@uva.es" type="cite"><tt> *

        Obviously, the OCFS2 operation generates an "extra" load

        but...perhaps under some circumstances (like these days usage of

        the app) the extra load becomes REALLY HIGH?<br>

        <br>

        Regards.<br>

        <br>

      </tt>

      <div class="moz-signature">

        <hr> <img src="cid:part1.00030708.01040606@oracle.com">

        <p class="MsoNormal"><b><font color="gray" face="Franklin Gothic

              Book" size="1"><span

                style="font-size:8.0pt;font-family:&quot;Franklin Gothic

                Book&quot;;color:gray; font-weight:bold"> Area de

                Sistemas<br>

                Servicio de las Tecnologias de la Informacion y

                Comunicaciones (STIC)<br>

                Universidad de Valladolid<br>

                Edificio Alfonso VIII, C/Real de Burgos s/n. 47011,

                Valladolid - ESPAÑA<br>

                Telefono: 983 18-6410, Fax: 983 423271<br>

                E-mail: <a moz-do-not-send="true"

                  class="moz-txt-link-abbreviated"

                  href="mailto:sistemas@uva.es">sistemas@uva.es</a><br>

              </span></font></b></p>

        <b><font color="gray" face="Franklin Gothic Book" size="1">

            <hr> </font></b></div>

      <div class="moz-cite-prefix">On 16/09/15 at 4:04, Tariq Saeed
        wrote:<br>

      </div>

      <blockquote cite="mid:55F8CE18.80703@oracle.com" type="cite">


        Hi Area,<br>

        data=writeback improves things greatly. In ordered mode, the
        default, data is written before<br>
        the transaction (which only logs metadata changes) is
        written. This is very conservative:<br>
        it ensures that before the journal log buffer is written to the on-disk
        journal area, the data has hit the disk and<br>
        the transaction can be safely replayed in case of a crash. Only
        complete transactions are replayed;<br>
        by complete I mean: begin-trans changebuf1, changebuf2, ... ,
        changebufnn end-trans. Replay means the buffers<br>
        are dispatched from the journal area on disk to their ultimate
        home location on disk. You can see now why<br>
        ordered mode generates so much I/O. In writeback mode, the
        transaction can hit the disk but the data<br>
        will be written whenever the kernel wants, asynchronously and
        without regard to its relationship<br>
        to the transaction. The danger is that, after a crash, we can
        replay a transaction whose associated<br>
        data is not on disk. For example, if you truncate a file up to a
        new bigger size and then<br>
        write something to a page beyond the old size, the page could
        hang around in core for a long time<br>
        after the transaction is written to the journal area on disk. If
        there is a crash while the data page is still<br>
        in core, then after replay the file will have the new size but the page
        with data will show all zeros instead of<br>
        what you wrote. At any rate, this is a digression, just for your
        info. <br>

        <br>

        The commit is the interval at which data is synced to disc. I

        think it may also be the interval<br>

        after which the journal log buffer is written to disk. So decreasing
        it reduces the number of unnecessary <br>
        writes.<br>

        <br>

        Now for the threads blocked for more than 120 sec in
        /var/log/messages. There are two types. <br>
        The first type is blocked on a mutex on an ocfs2 system file, mostly the
        global bitmap file shared by<br>
        all nodes. All writes to system files are done under
        transactions, and that may require<br>
        flushing the journal buffer to disk, depending upon your journal
        file size. The smaller the size,<br>
        the fewer transactions it can hold, so the more frequently the
        journal log on disk needs to be<br>
        reclaimed by dispatching the metadata blocks from the journal
        space to their home locations,<br>
        thus freeing up on-disk journal space. This requires reading
        metadata blocks from the journal area<br>
        on disk and writing them to their home locations. So again, a lot
        of I/O. I think the threads are <br>
        waiting on the mutex because the journal code must do this reclaiming to
        free up space. The other kind<br>
        of blocked threads are NOT in ocfs2 code, but they<br>
        are all blocked on a mutex. I don't know why. That would require
        getting a vmcore, chasing<br>
        the mutex owner, and finding out why it is taking so long. I
        don't think that is warranted at<br>
        this time. <br>

        Let me know if you have any further questions. <br>

        Thanks<br>

        -Tariq <br>

        <div class="moz-cite-prefix">On 09/15/2015 01:55 AM, Area de

          Sistemas wrote:<br>

        </div>

        <blockquote cite="mid:55F7DCE5.7090601@uva.es" type="cite">


          <tt>Hi Tariq,<br>

            <br>

            Yesterday one node was under load but not as high as past

            week, and iostat showed:<br>

            - 10% of samples with %util &gt;90% (some peaks of 100%) and

            an average value of 18%<br>

            - %iowait peaks of 37% with an average value of 4%<br>

            <br>

            BUT:<br>

            - none of the indicated error messages appeared in

            /var/log/messages<br>

            - we have mounted the OCFS2 filesystem with TWO extra

            options:<br>

                 data=writeback<br>

                 commit=20<br>

            * Question about these extra options:<br>

                Perhaps they help to mitigate in some way the problem?<br>

            &nbsp;&nbsp;&nbsp; I've read about using them (usually commit=60) but I
            don't know if they really help and/or whether there are
            other useful options to use<br>
            &nbsp;&nbsp;&nbsp; Before, the volume was mounted using only the options
            "_netdev,rw,noatime"<br>

            <br>

            NOTE:<br>

            - we have left only one node active (not the three nodes of

            the cluster) to "force" overloads<br>

            - although only one node is serving the app, all the three

            nodes have the OCFS volume mounted<br>

            <br>

            <br>

            About the EACCES/ENOENT errors...we don't know if they are
            caused by:<br>
            - abnormal behavior of the application<br>
            - the OCFS2 problem (a user tries to unlink/rename something
            and, if the system is slow due to OCFS2, the user retries
            the operation again and again, so the first operation
            completes successfully but the following ones fail)<br>
            - a possible concurrency problem: now, with only one
            node servicing the application, the errors don't appear, but
            with the three nodes in service the errors appeared (several
            nodes trying to do the same operation)<br>

            <br>

            As for the messages about blocked processes in
            /var/log/messages, I'll send the file directly to you (instead of to the
            list).<br>

            <br>

            Regards.<br>

            <br>

          </tt>


          <div class="moz-cite-prefix">On 14/09/15 at 20:29, Tariq
            Saeed wrote:<br>

          </div>

          <blockquote cite="mid:55F71218.6060906@oracle.com" type="cite">


            <br>

            <div class="moz-cite-prefix">On 09/14/2015 01:20 AM, Area de

              Sistemas wrote:<br>

            </div>

            <blockquote cite="mid:55F68341.5030303@uva.es" type="cite">


              <tt>Hello everyone,<br>

                <br>

                We have a problem in a 3-member OCFS2 cluster used to
                serve a web/PHP application that accesses (reads and/or
                writes) files located in the OCFS2 volume.<br>
                The problem appears only sometimes (apparently during
                high load periods).<br>
                <br>
                SYMPTOMS:<br>
                - access to OCFS2 content becomes slower and slower
                until it stalls<br>

                    * a "ls" command that normally takes &lt;=1s takes

                30s, 40s, 1m,...<br>

                - load average of the system grows to 150, 200 or even

                more<br>

                <br>

                - high iowait values: 70-90%<br>

                  <br>

              </tt></blockquote>

            <tt>This is a hint that the disk is under pressure. Run
              iostat (see man page)<br>
              when this happens, producing a report every 3
              seconds or so, and look at the<br>
              %util column:<br>

                                     %util<br>

                                   Percentage of CPU time during which

              I/O requests were issued to the  device  (bandwidth<br>

                                   utilization for the device). Device

              saturation occurs when this value is close to 100%.<br>

              <br>

            </tt>
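            <p>To make the %util check above concrete, here is a small
              sketch (not a polished tool) that collects the %util values
              for one device from saved "iostat -x" output and reports the
              average and the share of samples at or above 90%. It assumes
              that %util is the last column of each device line. The device
              name "sdb", the function names and the sample lines are all
              made up for illustration.</p>
<pre wrap="">
```python
def util_samples(iostat_text, device):
    """Collect %util values for one device from `iostat -x` style text.

    Assumption: %util is the last column of each device line.
    """
    samples = []
    for line in iostat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == device:
            # Some locales print a decimal comma instead of a point.
            samples.append(float(fields[-1].replace(",", ".")))
    return samples


def summarize(samples, threshold=90.0):
    """Return (average %util, share of samples at or above threshold)."""
    busy = sum(1 for s in samples if s >= threshold)
    return sum(samples) / len(samples), busy / len(samples)


# Made-up sample lines, shortened to a few columns for illustration only.
example = """\
Device r/s w/s %util
sdb 10.0 250.0 97.5
sdb 2.0 30.0 12.5
sdb 1.0 5.0 4.0
sdb 0.0 1.0 2.0
"""

avg, share = summarize(util_samples(example, "sdb"))
print(f"average %util {avg:.1f}, samples at or above 90%: {share:.0%}")
```
</pre>
            <p>A low average with a noticeable share of samples near 100%
              would suggest that the device saturates in bursts rather than
              continuously.</p>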

            <blockquote cite="mid:55F68341.5030303@uva.es" type="cite"><tt>  

                * but CPU usage is low<br>

                <br>

                - in the syslog appears a lot of messages like:<br>

                    (httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status =

                -13<br>

              </tt></blockquote>

            <tt>    </tt>EACCES: Permission denied. Find the file name
            and check its permissions with ls -l.<br>

            <blockquote cite="mid:55F68341.5030303@uva.es" type="cite"><tt>

                  or<br>

                    (httpd,XXXXX,Y):ocfs2_unlink:951 ERROR: status = -2<br>

              </tt></blockquote>

            <tt>    </tt>ENOENT: All we can say is that there was an attempt to
            delete a file that no longer exists in its directory (it has
            already been deleted). <br>
            This requires some knowledge of the
            environment. Is there an application log? <br>

            <blockquote cite="mid:55F68341.5030303@uva.es" type="cite"><tt>

                <br>

                  and the more "worrying":<br>

                     kernel: INFO: task httpd:3488 blocked for more than

                120 seconds.<br>

                     kernel: "echo 0 &gt;

                /proc/sys/kernel/hung_task_timeout_secs" disables this

                message.<br>

                     kernel: httpd           D c6fe5d74     0  3488  

                1616 0x00000080    <br>

                     kernel: c6fe5e04 00000082 00000000 c6fe5d74

                c6fe5d74 000041fd c6fe5d88 c0439b18<br>

                     kernel: c0b976c0 c0b976c0 c0b976c0 c0b976c0

                ed0f0ac0 c6fe5de8 c0b976c0 f75ac6c0<br>

                     kernel: f2f0cd60 c0a95060 00000001 c6fe5dbc

                c0874b8d c6fe5de8 f8fd9a86 00000001<br>

                     kernel: Call Trace:<br>

                     kernel: [&lt;c0439b18&gt;] ?

                default_spin_lock_flags+0x8/0x10<br>

                     kernel: [&lt;c0874b8d&gt;] ?

                _raw_spin_lock+0xd/0x10<br>

                     kernel: [&lt;f8fd9a86&gt;] ?

                ocfs2_dentry_revalidate+0xc6/0x2d0 [ocfs2]<br>

                     kernel: [&lt;f8ff17be&gt;] ?

                ocfs2_permission+0xfe/0x110 [ocfs2]<br>

                     kernel: [&lt;f905b6f0&gt;] ?

                ocfs2_acl_chmod+0xd0/0xd0 [ocfs2]<br>

                     kernel: [&lt;c0873105&gt;] schedule+0x35/0x50<br>

                     kernel: [&lt;c0873b2e&gt;]

                __mutex_lock_slowpath+0xbe/0x120<br>

                     ....<br>

                <br>

              </tt></blockquote>

            <tt>The important part of the backtrace is cut off. Where is the rest
              of it? The entries starting with "?"<br>
              are junk. You can attach /var/log/messages to give us a
              complete picture. My guess is that the blocking on the <br>
              mutex lasts so long because the thread holding the mutex is
              itself blocked on I/O. <br>
              Run "ps -e -o pid,stat,comm,wchan=WIDE_WCHAN-COLUMN" and
              look for processes in 'D' state (uninterruptible sleep).<br>
              These are usually processes blocked on I/O. <br>

            </tt>
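            <p>If the ps output is long, a short filter can pick out the
              'D' state processes. This is only a sketch: it expects the
              output of "ps -e -o pid,stat,comm,wchan" including the header
              line, the function name is made up, and the sample output
              (including the wchan values) is invented for illustration.</p>
<pre wrap="">
```python
def blocked_processes(ps_text):
    """Return (pid, comm, wchan) for processes whose STAT starts with 'D'.

    Expects the output of `ps -e -o pid,stat,comm,wchan`, header included.
    """
    blocked = []
    for line in ps_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) >= 4 and fields[1].startswith("D"):
            # comm may contain spaces, so take everything between
            # the STAT column and the final WCHAN column.
            comm = " ".join(fields[2:-1])
            blocked.append((int(fields[0]), comm, fields[-1]))
    return blocked


# Made-up sample output; the wchan values are invented.
example = """\
  PID STAT COMMAND         WCHAN
    1 Ss   init            poll_schedule_timeout
 3488 D    httpd           mutex_lock
 3490 S    httpd           poll_schedule_timeout
 3511 D+   ls              sleep_on_page
"""

for pid, comm, wchan in blocked_processes(example):
    print(pid, comm, wchan)
```
</pre>
            <p>On a live system the text could come from
              subprocess.run(["ps", "-e", "-o", "pid,stat,comm,wchan"],
              capture_output=True, text=True).stdout. Many processes
              sharing the same wchan is a hint about where they are all
              waiting.</p>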

            <blockquote cite="mid:55F68341.5030303@uva.es" type="cite"><tt>

                <br>

                (UNACCEPTABLE) WORKAROUND:<br>

                   stop httpd (really slow)<br>

                   stop ocfs2 service (really slow)<br>

                   start ocfs2 an httpd<br>

                <br>

                MORE INFO:<br>

                - OS information:<br>

                    Oracle Linux 6.4 32bit<br>

                    4GB RAM<br>

                    uname -a: 2.6.39-400.109.6.el6uek.i686 #1 SMP Wed

                Aug 28 09:55:10 PDT 2013 i686 i686 i386 GNU/Linux<br>

                    * anyway: we have another 5 nodes cluster with

                Oracle Linux 7.1 (so 64bit OS) serving a newer version

                of the same application and the problems are similar, so

                it appears not to be a OS problem but a more specific

                OCFS2 problem (bug? some tuning? other?)<br>

                <br>

                - standard configuration<br>

                    * if you want I can show the cluster.conf

                configuration but is the "expected configuration"<br>

                <br>

                - standard configuration in o2cb:<br>

                    Driver for "configfs": Loaded<br>

                    Filesystem "configfs": Mounted<br>

                    Stack glue driver: Loaded<br>

                    Stack plugin "o2cb": Loaded<br>

                    Driver for "ocfs2_dlmfs": Loaded<br>

                    Filesystem "ocfs2_dlmfs": Mounted<br>

                    Checking O2CB cluster "MoodleOCFS2": Online<br>

                      Heartbeat dead threshold: 31<br>

                      Network idle timeout: 30000<br>

                      Network keepalive delay: 2000<br>

                      Network reconnect delay: 2000<br>

                      Heartbeat mode: Local<br>

                    Checking O2CB heartbeat: Active<br>

                <br>

                - mount options: _netdev,rw,noatime<br>

                    * so other options (commit, data, ...) have their

                default values<br>

                <br>

                <br>

                Any ideas/suggestion?<br>

                <br>

                Regards.<br>

                <br>

              </tt>


              <br>


              <br>

              <pre wrap="">_______________________________________________

Ocfs2-users mailing list

<a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:Ocfs2-users@oss.oracle.com">Ocfs2-users@oss.oracle.com</a>

<a moz-do-not-send="true" class="moz-txt-link-freetext" href="https://oss.oracle.com/mailman/listinfo/ocfs2-users">https://oss.oracle.com/mailman/listinfo/ocfs2-users</a></pre>

            </blockquote>

            <br>

          </blockquote>

          <br>

        </blockquote>

        <br>

      </blockquote>

      <br>

    </blockquote>

    <br>

  </body>

</html>