[Ocfs2-users] RHEL 5.8, ocfs2 v1.4.4, stability issues

Kushnir, Michael (NIH/NLM/LHC) [C] michael.kushnir at nih.gov
Wed May 2 12:51:14 PDT 2012


Glad to hear that your stability is getting better. 

DRBD works best on dedicated hardware RAIDs with dedicated 1GbE replication links. 

Some might disagree with me, but I really recommend dedicated external storage for OCFS2. You can use any old server with lots of drive space. Something as old as a Dell PE2950 will work if it meets your needs. Throw in some drives, make a RAID, install CentOS, set up an iSCSI target (iet, the iSCSI Enterprise Target), and you're good to go.
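
For reference, a minimal /etc/ietd.conf entry for exporting a block device looks something like this (the IQN and device path are placeholders, not a tested config):

    Target iqn.2012-05.com.example:storage.ocfs2vol
        # export the RAID volume as LUN 0 using block I/O
        Lun 0 Path=/dev/sdb,Type=blockio

Point the initiators (open-iscsi / iscsiadm on your OCFS2 nodes) at that target and put the filesystem on the LUN they log in to.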

Then, if you want to improve your availability, add DRBD and heartbeat to the mix.
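
In that setup the storage boxes run DRBD active/passive and heartbeat just fails the target over. A rough sketch of an /etc/ha.d/haresources line (the node name, DRBD resource name, virtual IP, and IET init script name are placeholders for whatever your distro uses):

    # promote DRBD, bring up the service IP, then start the iSCSI target
    storage1 drbddisk::r0 IPaddr::192.168.1.100/24/eth0 iscsi-target

The initiators only ever talk to the virtual IP, so a failover should look like a short iSCSI hiccup rather than an outage.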


Best,
Michael








-----Original Message-----
From: Nathan Patwardhan [mailto:npatwardhan at llbean.com] 
Sent: Wednesday, May 02, 2012 3:30 PM
To: Kushnir, Michael (NIH/NLM/LHC) [C]; ocfs2-users at oss.oracle.com
Subject: RE: RHEL 5.8, ocfs2 v1.4.4, stability issues

>  -----Original Message-----
>  From: Kushnir, Michael (NIH/NLM/LHC) [C]  
> [mailto:michael.kushnir at nih.gov]
>  Sent: Friday, April 27, 2012 11:59 AM
>  To: Nathan Patwardhan; ocfs2-users at oss.oracle.com
>  Subject: RE: RHEL 5.8, ocfs2 v1.4.4, stability issues
>  
>  If memory serves me correctly, shared VMDK for clustering only works
>  if both cluster nodes sit on the same ESX server. This pretty much
>  defeats the purpose of clustering because your ESX server becomes a
>  single point of failure. ESX servers crash and fail sometimes... Not
>  frequently, but they do.
>  
>  When setting up RDM make sure you configure it for physical mode
>  (pass thru). If you use virtual mode, your nodes will once again be
>  limited to a single ESX server and your max RDM size will be limited
>  to ~2TB.

We haven't gotten to implementing RDM yet, but we will. We're still discussing it internally, so I don't have anything to report on that yet.

As a transitional step, I decided to see if I could stabilize ocfs2 by introducing drbd and REMOVING the shared vmdks from both ESX guests.  I gave each ESX guest a single additional 10GB vmdk, added each vmdk (/dev/sdb1 on each guest) to drbd, and brought drbd up.  I then ran mkfs.ocfs2, got everything mounted as it should be, and started splunk.
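
(For anyone following along, the moving parts boil down to something like the sketch below. Since OCFS2 is mounted on both guests, the DRBD resource has to run dual-primary; the resource name, hostnames, addresses, label, and mount point here are illustrative placeholders, not our exact config.)

    # /etc/drbd.conf (or drbd.d/r0.res) -- dual-primary so both guests can mount
    resource r0 {
        protocol C;                   # synchronous replication
        net { allow-two-primaries; }  # both nodes primary at once for OCFS2
        on guest1 {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.1:7789;
            meta-disk internal;
        }
        on guest2 {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.2:7789;
            meta-disk internal;
        }
    }

    # then, with o2cb online on both guests:
    mkfs.ocfs2 -L splunkvol -N 2 /dev/drbd0   # run on one node only
    mount -t ocfs2 /dev/drbd0 /mnt/splunk     # run on both nodes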

We've been stable for almost 24 hours, and most importantly we have NOT seen any of the errors in syslog, nor have we experienced any of the ocfs2 outages or system crashes we saw when the systems were using a shared vmdk.  drbd performance isn't the greatest (actually pretty poor in the context of ESX), but proving out ocfs2's reliability definitely seems to be taking shape.  I am looking forward to seeing how we do performance-wise once we introduce RDM.

--
Nathan Patwardhan, Sr. System Engineer
npatwardhan at llbean.com
x26662



