<p class="MsoNormal">Hi everyone,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">We’re running a two-node cluster w/ocfs2 v.1.4.4 under RHEL 5.8 and we’re having major stability issues:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">[root@splwww02 ~]# lsb_release -a ; uname -a ; rpm -qa |grep -i ocfs<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">LSB Version: :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Distributor ID: RedHatEnterpriseServer<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Description: Red Hat Enterprise Linux Server release 5.8 (Tikanga)<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Release: 5.8<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Codename: Tikanga<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Linux splwww02.llbean.com 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17 16:51:01 EST 2012 x86_64 x86_64 x86_64 GNU/Linux<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">ocfs2-2.6.18-308.1.1.el5-1.4.7-1.el5<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">ocfs2-tools-1.4.4-1.el5<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">ocfs2console-1.4.4-1.el5<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">These rpms were downloaded and installed from oracle.com.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The node are ESX VMs and the storage is shared to both VMs through vSphere.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">We’re using ocfs2 because we needed shared back end storage for a splunk implementation (splunk web search heads can used pooled storage and we didn’t want to just use NFS for the obvious reasons – where shared storage seemed to be a better
solution all the way around).<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">/etc/ocfs2/cluster.conf contains the following on each node:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">cluster:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> node_count = 2<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> name = splwww<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">node:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> ip_port = 7777<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> ip_address = 10.130.245.40<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> number = 0<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> name = splwww01<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> cluster = splwww<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">node:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> ip_port = 7777<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> ip_address = 10.130.245.41<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> number = 1<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> name = splwww02<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> cluster = splwww<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Note that each node uses eth0 for clustering. We do NOT currently have a private network setup in ESX for this purpose but CAN if this is recommended by people here.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">o2cb status on both nodes:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Driver for "configfs": Loaded<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Filesystem "configfs": Mounted<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Driver for "ocfs2_dlmfs": Loaded<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Filesystem "ocfs2_dlmfs": Mounted<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Checking O2CB cluster splwww: Online<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Heartbeat dead threshold = 3<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> Network idle timeout: 90000<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> Network keepalive delay: 5000<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"> Network reconnect delay: 5000<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Checking O2CB heartbeat: Active<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">/etc/sysconfig/o2cb:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">O2CB_ENABLED=true<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">O2CB_STACK=o2cb<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">O2CB_BOOTCLUSTER=splwww<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">O2CB_HEARTBEAT_THRESHOLD=3<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">O2CB_IDLE_TIMEOUT_MS=90000<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">O2CB_KEEPALIVE_DELAY_MS=5000<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">O2CB_RECONNECT_DELAY_MS=5000<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Each host mounts the (ocfs2) file system then turns around and exports the file system where it mounts it on localhost as shown below (yes I know this is awful but splunk does not support ocfs2 and as such is unable to lock files properly
no matter what options I give to mount an ocfs2 file system):<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"># From /etc/fstab<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">/dev/mapper/splwwwshared-splwwwshared--lv /opt/splunk/pooled ocfs2 _netdev,datavolume,errors=remount-ro,rw,noatime 0 0<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"># From /etc/exports<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">/opt/splunk/pooled 127.0.0.1(rw,no_root_squash)<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"># From /bin/mount<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">/dev/mapper/splwwwshared-splwwwshared--lv on /opt/splunk/pooled type ocfs2 (rw,_netdev,noatime,datavolume,errors=remount-ro,heartbeat=local)<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">localhost:/opt/splunk/pooled on /opt/splunk/mnt type nfs (rw,nordirplus,addr=127.0.0.1)<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The above should supposedly allow us to enjoy the benefits of an ocfs2 shared/clustered file system while functionally supporting what splunk needs. Unfortunately we have experienced a number of problems with stability, summarized here:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoListParagraph" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><![if !supportLists]><span style="mso-list:Ignore">1.<span style="font:7.0pt "Times New Roman"">
</span></span><![endif]>Cluster timeouts, node(s) leaving cluster.<o:p></o:p></p>
<p class="MsoListParagraph"><o:p> </o:p></p>
<p class="MsoListParagraph">Apr 21 21:03:50 splwww01 kernel: (o2hb-DA9199FC9F,3090,0):o2hb_do_disk_heartbeat:768 ERROR: status = -52<o:p></o:p></p>
<p class="MsoListParagraph">Apr 21 21:03:52 splwww01 kernel: (o2hb-DA9199FC9F,3090,0):o2hb_do_disk_heartbeat:777 ERROR: Device "dm-5": another node is heartbeating in our slot!<o:p></o:p></p>
<p class="MsoListParagraph"><o:p> </o:p></p>
<p class="MsoListParagraph" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><![if !supportLists]><span style="mso-list:Ignore">2.<span style="font:7.0pt "Times New Roman"">
</span></span><![endif]>Ocfs2 crashes, nfs mount becoming unavailable, /opt/splunk/pooled becoming inaccessible, splunk crashing.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Apr 24 06:53:43 splwww01 kernel: (o2hb-DA9199FC9F,3089,0):ocfs2_dlm_eviction_cb:98 device (253,5): dlm has evicted node 1<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoListParagraph" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><![if !supportLists]><span style="mso-list:Ignore">3.<span style="font:7.0pt "Times New Roman"">
</span></span><![endif]>inode irregularities on either/both node(s).<o:p></o:p></p>
<p class="MsoListParagraph"><o:p> </o:p></p>
<p class="MsoListParagraph">Apr 24 06:40:47 splwww01 kernel: (nfsd,3330,1):ocfs2_read_locked_inode:472 ERROR: status = -5<o:p></o:p></p>
<p class="MsoListParagraph"><o:p> </o:p></p>
<p class="MsoListParagraph">Apr 21 10:09:31 splwww01 kernel: (ocfs2rec,7676,0):ocfs2_clear_journal_error:690 ERROR: File system error -5 recorded in journal 0.<o:p></o:p></p>
<p class="MsoListParagraph">Apr 21 10:09:31 splwww01 kernel: (ocfs2rec,7676,0):ocfs2_clear_journal_error:692 ERROR: File system on device dm-5 needs checking.<o:p></o:p></p>
<p class="MsoListParagraph"><o:p> </o:p></p>
<p class="MsoListParagraph" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><![if !supportLists]><span style="mso-list:Ignore">4.<span style="font:7.0pt "Times New Roman"">
</span></span><![endif]>Full system crashes on both node(s).<o:p></o:p></p>
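
Given the "needs checking" message in item 3, my plan has been an offline check, roughly like this (unmounting on BOTH nodes first so nothing is heartbeating on the volume; please correct me if this is wrong):

    # on both nodes:
    umount /opt/splunk/mnt        # the localhost nfs mount
    umount /opt/splunk/pooled     # the ocfs2 mount

    # then, from one node only, force a full check:
    fsck.ocfs2 -f /dev/mapper/splwwwshared-splwwwshared--lv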
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Ideally, I’d love to continue using ocfs2 because it solves an age-old problem for us, but since we can’t keep either system stable for more than an hour or so, I’m looking for some more insight as to what we’re seeing and how we might
resolve some of these issues.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Note also that I’m using monit to check a .state file I created under /opt/splunk/pool and re-mount all the fs, restart splunk, etc, but this only takes us so far once the system(s) crash and reboot.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">--<o:p></o:p></p>
<p class="MsoNormal">Nathan Patwardhan, Sr. System Engineer<o:p></o:p></p>
<p class="MsoNormal"><a href="mailto:npatwardhan@llbean.com">npatwardhan@llbean.com</a><o:p></o:p></p>
<p class="MsoNormal">x26662<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>