[Ocfs2-users] OCFS2 tuning, fragmentation and localalloc option. Cluster hanging during mix read+write workloads

Gavin Jones gjones at where2getit.com
Mon Jul 15 13:33:00 PDT 2013


Hello,

We have a 16-node OCFS2 cluster used for web serving duties.  Each
node mounts the same 6 OCFS2 volumes.  Shared data includes client
files, application files for our webapp, log files, and configuration
files.  Storage is provided by 2x EqualLogic PS400E iSCSI SANs, each
having 12 drives in a RAID50; the units are in a 'Group'.

The problem we are having is that periodically, maybe once a week or
so, several Apache processes on a handful of nodes get stuck in D
state and are unable to recover.  This greatly increases server load
and causes more Apache processes to back up; OCFS2 then starts
complaining about unresponsive nodes and, before you know it, the
cluster is down.

This seems to occur most often when we are doing writes + reads; if it
is just reads the cluster hums along.  However, when we need to update
many files or remove lots of files (think temporary images) in
addition to normal read activity, we have the above-mentioned problem.

We have done some searching and found
http://www.mail-archive.com/ocfs2-users@oss.oracle.com/msg05525.html
which describes a similar problem with write activity.  In that case,
the problem was allocating contiguous space on a fragmented filesystem
and the solution was to adjust the mount option 'localalloc'.  We are
wondering if we are in a similar position.

Below is the output from the stat_sysdir_analyze.sh script mentioned
in the link above, which analyzes stat_sysdir.sh output; I've included
the two volumes that seem to be our 'problem' volumes.

Volume 1:
bash stat_sysdir_analyze.sh sde1-client-20130715.txt
Number of clust. | Contiguous cluster size
-----------------+------------------------
            4549 | 510 and smaller
            1825 | 511

Volume 2:
bash stat_sysdir_analyze.sh sdd1-data-20130715.txt
Number of clust. | Contiguous cluster size
-----------------+------------------------
             175 | 510 and smaller
              23 | 511

Any evidence here of excessive fragmentation that tuning localalloc
would help with?
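
To put the two tables side by side, here is a rough sketch (Python) that
computes what share of the counted entries falls in the largest (511)
bucket.  It assumes the first column is simply a count of entries per
bucket; the function name and the volume labels are made up for
illustration and are not part of the analyze script.

```python
def full_bucket_share(smaller, full):
    """Fraction of counted entries that fall in the '511' bucket.

    Assumption: both arguments are plain counts taken from the first
    column of the stat_sysdir_analyze.sh output.
    """
    total = smaller + full
    return full / total if total else 0.0


# Counts copied from the two tables above: (510 and smaller, 511).
volumes = {
    "sde1-client": (4549, 1825),
    "sdd1-data": (175, 23),
}

for name, (smaller, full) in volumes.items():
    share = full_bucket_share(smaller, full)
    print(f"{name}: {share:.1%} of entries in the 511 bucket")
```

This only restates the numbers as a ratio; whether a given ratio counts
as excessive fragmentation is exactly what I am unsure about.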

Also regarding localalloc, I notice it is different for the above two
volumes on many of the nodes; I find this interesting as the cluster
is supposed to make an educated guess on this value.  For instance:

/dev/sda1 on /u/client type ocfs2
(rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=6,coherency=full,user_xattr,noacl)
/dev/sde1 on /u/data type ocfs2
(rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=5,coherency=full,user_xattr,noacl)


/dev/sdd1 on /u/client type ocfs2
(rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=9,coherency=full,user_xattr,noacl)
/dev/sdb1 on /u/data type ocfs2
(rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=5,coherency=full,user_xattr,noacl)


/dev/sda1 on /u/client type ocfs2
(rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=11,coherency=full,user_xattr,noacl)
/dev/sdc1 on /u/data type ocfs2
(rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=5,coherency=full,user_xattr,noacl)


/dev/sda1 on /u/client type ocfs2
(rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=6,coherency=full,user_xattr,noacl)
/dev/sdc1 on /u/data type ocfs2
(rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=7,coherency=full,user_xattr,noacl)

I'm not sure why the cluster would be picking different values
depending on the node.
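
For comparing these values across all 16 nodes, something like the
following sketch (Python) could pull the localalloc value out of each
node's mount output.  It assumes the two-line layout shown in the
listing above (device line, then the options in parentheses) and also
accepts both on one line; the function name and the shortened sample
input are illustrative only.

```python
import re

# Header line as it appears in the listing above:
#   <device> on <mountpoint> type ocfs2 [optional "(options)"]
HEADER = re.compile(r"^(\S+) on (\S+) type ocfs2\s*(.*)$")
LOCALALLOC = re.compile(r"[(,]localalloc=(\d+)")


def localalloc_by_mount(mount_text):
    """Return {mountpoint: localalloc value or None} for ocfs2 mounts."""
    result = {}
    current = None  # mount point still waiting for its options line
    for raw in mount_text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = header.group(2)
            result[current] = None
            line = header.group(3)  # options may follow on the same line
        if current and line.startswith("("):
            found = LOCALALLOC.search(line)
            if found:
                result[current] = int(found.group(1))
            current = None
    return result


# Shortened sample modelled on the first node's output above.
sample = """/dev/sda1 on /u/client type ocfs2
(rw,relatime,_netdev,heartbeat=local,localalloc=6,coherency=full)
/dev/sde1 on /u/data type ocfs2
(rw,relatime,_netdev,heartbeat=local,localalloc=5,coherency=full)
"""

print(localalloc_by_mount(sample))
```

Running this against the saved mount output of every node would give a
quick per-node table of the values the cluster picked.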

Anyway, any opinions, advice, or tuning suggestions would be greatly
appreciated.  This business of the cluster hanging is turning into
quite a problem.

I'll provide any other information upon request.

Thanks,

Gavin W. Jones
Where 2 Get It, Inc.

--
"There has grown up in the minds of certain groups in this country the
notion that because a man or corporation has made a profit out of the
public for a number of years, the government and the courts are
charged with the duty of guaranteeing such profit in the future, even
in the face of changing circumstances and contrary to public interest.
This strange doctrine is not supported by statute nor common law."

~Robert Heinlein


