[Ocfs2-users] Ocfs2-users Digest, Vol 98, Issue 9

David Johle djohle at industrialinfo.com
Wed Feb 29 16:10:23 PST 2012


At 02:00 PM 2/29/2012, you wrote:
>Message: 1
>Date: Tue, 28 Feb 2012 14:37:28 -0800
>From: Sunil Mushran <sunil.mushran at oracle.com>
>Subject: Re: [Ocfs2-users] Concurrent write performance issues with
>         OCFS2
>To: ocfs2-users at oss.oracle.com
>Message-ID: <4F4D5728.5080305 at oracle.com>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>In 1.4, the local allocator window is small. 8MB. Meaning the node
>has to hit the global bitmap after every 8MB. In later releases, the
>window is much larger.
>
>Second, a single node is not a good baseline. A better baseline is
>multiple nodes writing concurrently to the block device. Not fs.
>Use dd. Set different write offsets. This should help figure out how
>the shared device works with multiple nodes.


I too have seen some serious performance issues under 1.4, especially 
with writes.  I'll share some info I've gathered on this topic, take 
it however you wish...

In the past I never really thought about running benchmarks against 
the shared block device as a baseline to compare with the 
filesystem.  So today I ran several dd tests of my own (both read 
and write) against a shared block device (a different LUN, but using 
the exact same storage hardware, including the same physical disks, 
as the one with OCFS2).
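
For anyone who wants to reproduce this, the raw-device runs were 
along the lines of the following sketch (device name, sizes, and 
offsets are illustrative, not my exact invocations).  Per Sunil's 
suggestion, each node uses a different offset so they don't stomp on 
each other, and direct I/O keeps the page cache out of the numbers:

  # sequential write from node 1 (node 2 would use a different seek= offset)
  dd if=/dev/zero of=/dev/mapper/mpathX bs=1M count=4096 seek=0 oflag=direct

  # sequential read, again with a per-node skip= offset
  dd if=/dev/mapper/mpathX of=/dev/null bs=1M count=4096 skip=0 iflag=direct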

My results were not in line with those of Erik Schwartz, as I 
determined the performance degradation to be OCFS2-related.

I have an fs shared by 2 nodes; both are dual quad-core Xeon systems 
with 2 dedicated storage NICs per box.
Storage is a Dell/EqualLogic iSCSI SAN with 3 gigE NICs, dedicated 
gigE switches, using jumbo frames.
I'm using dm-multipath as well.

RHEL5 (2.6.18-194.3.1.el5 kernel)
ocfs2-2.6.18-194.11.4.el5-1.4.7-1.el5
ocfs2-tools-1.4.4-1.el5

Using the individual /dev/sdX devices vs. the /dev/mapper/mpathX 
device indicates that multipath is working properly, as the mpath 
numbers are close to double what the separate paths each give.
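
A quick way to sanity-check that, if anyone wants to (device names 
below are just examples, not my exact setup):

  # confirm both paths are up and active
  multipath -ll

  # then compare a single path against the multipath device
  dd if=/dev/sdb of=/dev/null bs=1M count=2048 iflag=direct
  dd if=/dev/mapper/mpath0 of=/dev/null bs=1M count=2048 iflag=direct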

Given the hardware, I'd consider 200MB/s a limit for a single box and 
300MB/s the limit for the SAN.

Block device:
Sequential reads tend to be in the 180-190MB/s range with just one 
node reading.
Both nodes simultaneously reading gives about 260-270MB/s total throughput.
Sequential writes tend to be in the 115-140MB/s range with just one 
node writing.
Both nodes simultaneously writing gives about 200-230MB/s total throughput.

OCFS2:
Sequential reads tend to be in the 80-95MB/s range with just one node reading.
Both nodes simultaneously reading gives about 125-135MB/s total throughput.
Sequential writes tend to be in the 5-20MB/s range with just one node writing.
Both nodes simultaneously writing (different files) gives unbearably 
slow performance of less than 1MB/s total throughput.
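
The fs-level runs were the same sort of dd, just against files on 
the OCFS2 mount, with one distinct file per node (paths here are 
made up):

  # node 1
  dd if=/dev/zero of=/ocfs2/ddtest.node1 bs=1M count=4096 oflag=direct
  # node 2, started at the same time for the concurrent case
  dd if=/dev/zero of=/ocfs2/ddtest.node2 bs=1M count=4096 oflag=direct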

Now one thing I will say is that I was testing on a "mature" 
filesystem that has been in use for quite some time: tons of file & 
directory creation, reading, updating, and deleting over the course 
of a couple of years.

So to see how that might affect things, I then created a new 
filesystem on that same block device I used above (with same options 
as the "mature" one) and ran the set of dd-based fs tests on that.

Create params: -b 4K -C 4K --fs-features=backup-super,sparse,unwritten,inline-data
Mount params:  -o noatime,data=writeback
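
Spelled out as full commands, that's roughly the following (the 
device, label, and node-slot count are placeholders, not my exact 
invocation):

  mkfs.ocfs2 -b 4K -C 4K \
      --fs-features=backup-super,sparse,unwritten,inline-data \
      -N 2 -L testvol /dev/mapper/mpathY
  mount -t ocfs2 -o noatime,data=writeback /dev/mapper/mpathY /mnt/ocfs2-test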

Fresh OCFS2:
Sequential reads tend to be in the 100-125MB/s range with just one 
node reading.
Both nodes simultaneously reading gives about 165-180MB/s total throughput.
Sequential writes tend to be in the 120-140MB/s range with just one 
node writing.
Both nodes simultaneously writing (different files) gives reasonable 
performance of around 100MB/s total throughput.


Wow, what a difference!  I will say that the "mature" filesystem 
above, which is performing poorly, has definitely gotten worse over 
time.  It seems to me that the filesystem itself has some time- or 
usage-based performance degradation.

I'm actually thinking it would be to the benefit of my cluster to 
create a new volume, shut down all applications, copy the contents 
over, shuffle mount points, and start it all back up.  The only 
problem is that this will make for some highly unappreciated 
downtime!  Also, I'm concerned that all that copying and loading it 
up with contents may just result in the same performance losses, 
making the whole process just wasted effort.
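
For what it's worth, if I do go that route the migration would look 
roughly like this (mount points are hypothetical):

  # 1. stop the applications on all nodes
  # 2. copy everything to the freshly formatted volume, preserving attributes
  rsync -aHX /ocfs2-old/ /ocfs2-new/
  # 3. swap the mount points (or /etc/fstab entries), remount, restart the apps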




