[Ocfs2-users] High load average on Apache Cluster with drbd + ocfs2

Wed Mar 3 02:47:08 PST 2010

Hi Andreas,

I saw almost exactly what you described when using ocfs2 on web servers. Some time late at night, the load would go through the roof on one web server because lots of apache processes were stuck in the uninterruptible "D" state. If I stopped apache on the problem server, the load dropped, but it went back up as soon as I started apache again.
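
If you want to check for the same symptom, a rough Python sketch like the one below lists the processes currently in the "D" state by reading /proc. It assumes a Linux /proc layout, and the helper names (parse_state, d_state_processes) are made up for illustration, not part of any ocfs2 tool.

```python
import os


def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so take everything after the last ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def d_state_processes(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                line = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        if parse_state(line) == "D":
            comm = line[line.index("(") + 1:line.rindex(")")]
            found.append((int(entry), comm))
    return found


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

A long list of apache processes here, on only one of the nodes, would match what I saw.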

Turns out I'd hit a free space fragmentation problem. While df reported I had heaps of free space (>50% from memory!), I couldn't write (echo >>) to the log files on the problem web server. Note that you can still create small files and append to small files, but you cannot append to the larger apache log files.
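
The same append test can be scripted. The sketch below is a rough Python equivalent of the "echo >>" check: it appends one newline to a file and reports the error name if the write fails. The function name try_append and the log path in the usage line are assumptions for illustration, and I am not claiming which error code ocfs2 actually returns in this situation. Be aware that it really does append a byte to the file you point it at.

```python
import errno
import os


def try_append(path, data=b"\n"):
    """Append data to path; return None on success or the errno name on failure."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)  # force the write out so an allocation failure surfaces here
        finally:
            os.close(fd)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None


if __name__ == "__main__":
    # Hypothetical path; substitute one of your own large log files.
    print(try_append("/var/log/apache2/access.log"))
```

Running it against both a small file and a large log file on the same mount shows whether only the large file refuses to grow.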

The fact that it happened late at night was very confusing, but eventually made sense. As the day goes on, the log files get bigger, and bigger pieces of contiguous free space are required to extend them. Eventually, a large enough contiguous piece of free space cannot be found and your writes start to fail.
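
To make the difference between "free" and "contiguous free" concrete, here is a toy model in Python. It is not how the ocfs2 allocator works internally; it just treats the disk as a list of used/free flags (an assumption made purely for illustration) and shows that half the space can be free while the largest contiguous run is a single unit.

```python
def free_fraction(bitmap):
    """Fraction of units that are free (True means used, False means free)."""
    return bitmap.count(False) / len(bitmap)


def largest_free_run(bitmap):
    """Length of the longest run of consecutive free units."""
    longest = current = 0
    for used in bitmap:
        current = 0 if used else current + 1
        longest = max(longest, current)
    return longest


def can_extend(bitmap, units_needed):
    """True if a single contiguous run of units_needed free units exists."""
    return largest_free_run(bitmap) >= units_needed


if __name__ == "__main__":
    # Worst case: every other unit is in use.
    fragmented = [True, False] * 50
    print(free_fraction(fragmented))     # half of the units are free
    print(largest_free_run(fragmented))  # but no two free units are adjacent
    print(can_extend(fragmented, 1), can_extend(fragmented, 4))
```

In this model a request for one unit still succeeds while a request for four fails, which mirrors small files working while the big log files do not.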

A *partial* fix went into kernel 2.6.33. It's partial because it doesn't fix the free space fragmentation issue but rather allows the problem node to steal some free space from the node that is still ok. All it does is postpone the problem a little, so that writes eventually start to fail on both nodes at the same time.

Another thing you can do that doesn't require a kernel upgrade is to reduce the number of node slots. The default is 8 (-N to mkfs.ocfs2) so reducing this will free up some *contiguous* free space. Unfortunately this is an offline operation.

This may not be your issue, but it certainly sounds familiar. I recall it was very frustrating trying to diagnose the issue.

Cheers,

Brad

On Wed, 3 Mar 2010 11:04:48 +0100
"Andreas Kossmann" <kossmann.andreas at gmx.de> wrote:

> Hello all,
> 
> I have an environment with 2 Debian 5.0 servers.
> Kernel is 2.6.26-2-amd64. I have installed drbd-8.0.14 and ocfs2-tools 1.4.1.
> It is an Active/Active WebCluster with Apache.
> The 2 servers write to the same log files.
> 
> In my test environment everything works fine. In the production environment I have the problem that after a few weeks the Apache servers go crazy and get a very high load (>100).
> 
> First I thought the problem might be drbd, but I have read about many problems with ocfs2 and Apache load average.
> 
> The curious thing is that the load is often very high at times when there are very few requests (e.g. 11:00 PM).
> 
> I disconnected the second webserver from the network and checked the filesystem. A few bitmap errors occurred and I repaired them. Then I changed the drbd config so that only webserver 1 is primary and webserver 2 is secondary, so webserver 2 cannot write to the device.
> 
> After I connected webserver 2 to the network again and the sync from the primary started, the load on webserver 1 went above 100.
> 
> I also tested the connection to webserver 2 with drbd disconnected, and found that the load on webserver 1 goes a little higher as well.
> 
> Is there any solution for the ocfs2 load problem with apache?
> If there is no solution I have to change from active/active to active/passive with ext3 as filesystem.
> 
> Please, help me.
> 
> Thanks a lot
> 
> Andreas