PURPOSE

The purpose of this article is to provide best practice guidelines for installing and using Oracle Cluster File System (OCFS) on Linux.
 

SCOPE & APPLICATION

This article is intended for anyone (System Administrators (SAs), Database Administrators (DBAs) or users) who is planning to install and use an OCFS partition. It applies to users of both Red Hat Linux Advanced Server 2.1 and United Linux 1.0, currently the only two Linux distributions on which OCFS is supported.

Oracle Cluster File System (OCFS) Best Practices

1. Why Use OCFS?

OCFS was designed as an alternative to using raw devices for Oracle9i Real Application Clusters (RAC). Management of raw devices is usually a difficult task and many Database Administrators (DBAs) and System Administrators (SAs) are more familiar with filesystems. Another issue with raw devices on Linux is the maximum of 255 raw partitions, as there can be no more than 255 /dev/raw device files.

2. Configuring Linux For OCFS

Other than those required to run Oracle on Linux, OCFS requires no specific kernel configuration. In fact, other than memory, OCFS does not rely upon or utilise file-related kernel parameters such as /proc/sys/fs/file-max.

Red Hat Linux Advanced Server 2.1
Due to improvements in Virtual Memory page cache performance, kernel errata e.24 or higher is strongly recommended for Red Hat Linux Advanced Server 2.1. If using kernel errata e.12 or higher, the default kernel page cache settings should be used. Non-default page cache settings, such as those configured in /etc/sysctl.conf [vm.pagecache] or echoed to /proc/sys/vm/pagecache, should be removed or reset to default values.

Manual configuration of the Virtual Memory page cache is likely to be necessary only when using a kernel errata earlier than e.12, in order to avoid excessive page cache retention. In that case, add the following parameter to the /etc/sysctl.conf file, then reboot for the change to take effect:
[/etc/sysctl.conf]
vm.pagecache = 10 20 30
Alternatively, run the following command as root for the change to take immediate effect without a reboot. Note that a change made this way is lost on the next reboot:
[root@ca-test2 root]# echo 10 20 30 > /proc/sys/vm/pagecache
Note: If upgrading from an earlier version (particularly a pre-e.12 kernel) to a current kernel, be sure to check for and remove any Virtual Memory settings such as those described above.
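
A leftover override is easy to miss after a kernel upgrade. The following is a minimal sketch of a check for one, assuming the setting appears as an uncommented vm.pagecache line in a sysctl-style file. The function name has_pagecache_override is illustrative only and is not part of any OCFS or Red Hat tool.

```shell
#!/bin/sh
# Print any active (uncommented) vm.pagecache setting found in a sysctl file.
# Returns 0 if an override is present, non-zero otherwise.
# has_pagecache_override is a hypothetical helper name.
has_pagecache_override() {
    conf="${1:-/etc/sysctl.conf}"
    [ -r "$conf" ] || return 1
    # Allow leading whitespace and optional spaces around the '=' sign.
    grep -E '^[[:space:]]*vm\.pagecache[[:space:]]*=' "$conf"
}

if has_pagecache_override /etc/sysctl.conf; then
    echo "vm.pagecache override found: remove it on kernel errata e.12 or higher"
fi
```

The check only covers the configuration file; a value echoed into /proc/sys/vm/pagecache at runtime would need to be inspected separately.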

3. File Types Supported by OCFS

At this time (version 1.0.9), OCFS only supports Oracle data files - this includes redo log files, archive log files, controlfiles and datafiles. OCFS also supports the Oracle Cluster Manager (OCM) shared quorum disk file and the shared Server Configuration file (for srvctl). A shared Oracle Home installation is not currently supported, but support is expected in the latter part of 2003.

4. Selecting an OCFS Blocksize

Selecting an appropriate block size requires an understanding that OCFS was specifically designed for (and favours) large, contiguous files, such as Oracle datafiles. The types of files that will reside in an OCFS partition must therefore be known before the partition is formatted. Block sizes between 4Kb (min.) and 1Mb (max.) are available, and must be a multiple of 4096 bytes [4Kb]. The larger the block size, the lower the maximum number of possible files; conversely, the smaller the block size, the higher the maximum number of possible files. However, the smaller the block size, the greater the performance penalty. Small block sizes should be useful in future OCFS versions that are likely to support regular files, such as a shared Oracle Home installation between RAC nodes.

The maximum number of possible files for an OCFS partition is calculated as follows:
(<partition size> - 8Mb) / <block size>  = number of bits [max. possible files]
For example:
(113246208 [108Mb] - 8388608 [8Mb]) / 131072 = 800 [max. possible files]
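
The formula above can be sketched as a small shell function. This is only an illustration of the arithmetic; the name ocfs_max_files is hypothetical and all sizes are assumed to be given in bytes.

```shell
#!/bin/sh
# Maximum number of files an OCFS partition can hold:
#   (partition size - 8Mb) / block size
# ocfs_max_files is a hypothetical helper name; sizes are in bytes.
ocfs_max_files() {
    partition_bytes=$1
    block_bytes=$2
    metadata_bytes=8388608   # the 8Mb subtracted in the formula above
    echo $(( ($partition_bytes - $metadata_bytes) / $block_bytes ))
}

# The 108Mb partition with a 128Kb block size from the example: prints 800
ocfs_max_files 113246208 131072
```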

Given the above, the OCFS block size required to allow for a given maximum number of files is calculated as follows:
 (<partition size> - 8Mb) /  <number of bits> = <block size>
For example:
(113246208 [108Mb] - 8388608 [8Mb]) / 100 = 1048576 [1Mb] block size  
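
The result of this formula is not always a valid block size. The sketch below applies the formula and then, as an assumption drawn from the constraints stated earlier in this section, rounds the result down to a multiple of 4096 bytes and caps it at 1Mb. Rounding down never reduces the number of files the partition can hold. The name ocfs_block_size_for is hypothetical.

```shell
#!/bin/sh
# Largest valid OCFS block size (bytes) that still allows the wanted number
# of files on a partition of the given size (bytes).
# ocfs_block_size_for is a hypothetical helper name.
ocfs_block_size_for() {
    partition_bytes=$1
    wanted_files=$2
    # (partition size - 8Mb) / number of bits
    raw=$(( ($partition_bytes - 8388608) / $wanted_files ))
    # Assumption: round down to a multiple of 4096 so the size is valid.
    block=$(( $raw / 4096 * 4096 ))
    if [ "$block" -lt 4096 ]; then
        echo "partition too small for $wanted_files files" >&2
        return 1
    fi
    # Cap at the 1Mb maximum block size.
    if [ "$block" -gt 1048576 ]; then
        block=1048576
    fi
    echo "$block"
}

# The example above: a 108Mb partition holding 100 files prints 1048576
ocfs_block_size_for 113246208 100
```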

The default OCFS block size is 128Kb (131,072 bytes). Formatting an OCFS partition with a 128Kb block size provides a good balance between the maximum number of files and file I/O performance for medium to large sized files. A block size of 128Kb means that for every file created (with content), a minimum of 128Kb of disk space is allocated, even if the file only contains 1 byte of data.
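
To get a feel for the space cost of small files, the sketch below estimates the disk space allocated to a file. It assumes that space is always allocated in whole blocks, so a file's size is rounded up to the next multiple of the block size, and that an empty file consumes no data blocks; only the one-block minimum for files with content is stated above. The name ocfs_allocated_bytes is hypothetical.

```shell
#!/bin/sh
# Estimate the disk space (bytes) allocated to a file of the given size.
# Assumption: allocation is in whole blocks, rounded up.
# ocfs_allocated_bytes is a hypothetical helper name.
ocfs_allocated_bytes() {
    file_bytes=$1
    block_bytes=${2:-131072}   # default 128Kb block size
    if [ "$file_bytes" -eq 0 ]; then
        echo 0
        return 0
    fi
    echo $(( ($file_bytes + $block_bytes - 1) / $block_bytes * $block_bytes ))
}

# A 1 byte file on a default 128Kb block size partition: prints 131072
ocfs_allocated_bytes 1
```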

Following are guidelines for selecting an appropriate OCFS block size:

File Types                             Block Size     Suitability                    Max File Size
Datafiles                              128Kb - 1Mb    Few, large, contiguous files   1Mb = 1Tb
Archivelogs, Redo logs, Controlfiles   128Kb          Several medium sized files     128Kb = 1Tb
Smaller files                          4Kb - 128Kb    Many, small files              4Kb = 32Gb

1Mb block size: 1Mb * 8M bits = 8,796,093,022,208 [8Tb] maximum file size, but limited by Linux addressability to only 1Tb
128Kb block size: 128Kb * 8M bits = 1,099,511,627,776 [1Tb] maximum file size
4Kb block size: 4Kb * 8M bits = 34,359,738,368 [32Gb] maximum file size

Note: 8M bits refers to the fixed 1Mb size of the global bitmap per OCFS partition, i.e. 1,048,576 bytes * 8 bits per byte = 8,388,608 bits.
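
The maximum file size figures above can be reproduced with a short sketch. It multiplies the block size by the number of bits in the 1Mb global bitmap and applies the 1Tb limit mentioned above. The name ocfs_max_file_bytes is hypothetical, and 64-bit shell arithmetic is assumed.

```shell
#!/bin/sh
# Maximum file size (bytes) for a given OCFS block size (bytes).
# ocfs_max_file_bytes is a hypothetical helper name.
ocfs_max_file_bytes() {
    block_bytes=$1
    bitmap_bits=$(( 1048576 * 8 ))   # 1Mb global bitmap = 8M bits
    limit=1099511627776              # 1Tb limit
    size=$(( $block_bytes * $bitmap_bits ))
    if [ "$size" -gt "$limit" ]; then
        size=$limit
    fi
    echo "$size"
}

# 4Kb block size: prints 34359738368 (32Gb)
ocfs_max_file_bytes 4096
```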

5. Calculating an OCFS Partition Size

OCFS supports partitions of up to 1Tb (tested). Since no volume management is built into OCFS, Oracle recommends enabling hardware RAID support to create logical disk volumes of sufficient size. If hardware RAID support is not available, a Logical Volume Manager (LVM) or Multi-Disk (MD) disk configuration can be employed, depending on the Linux distribution being used. The creation of too many OCFS partitions (i.e. 50 or more) per clustered system is likely to create a performance (process) bottleneck - this is not specifically related to OCFS. Ideally, there should be no more than 20 OCFS partitions per system.

Calculating the exact required partition size for an OCFS volume is complex, as the minimum partition size depends on several factors.
An OCFS partition should be at least 100Mb in size - this allows for volume structures and file metadata (for up to 32 nodes), but minimal user file space. Unlike general purpose filesystems, e.g. ext2/3, OCFS was purpose built for few, very large, contiguous files. Like other filesystems, OCFS does not account for disk space exhaustion, so initial partition sizing is critical and must allow for sufficient file growth - this includes space for user files as well as volume metadata.

The minimum partition size for a pre-determined block size and a known maximum number of required files is calculated as follows:
(<max number files> * <block size>) + 8388608 [8Mb] = <min. partition size>
Note: The 8Mb pertains to the volume metadata file space. Also, this does not factor in user file space requirements.
For example:
(800 * 131072) + 8388608 [8Mb] = 113246208 [108Mb] min. partition size
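
The same calculation as a minimal shell sketch; the name ocfs_min_partition_bytes is hypothetical and all sizes are assumed to be in bytes.

```shell
#!/bin/sh
# Minimum partition size (bytes) for a given maximum number of files
# and block size (bytes):
#   (max number files * block size) + 8Mb
# ocfs_min_partition_bytes is a hypothetical helper name.
ocfs_min_partition_bytes() {
    max_files=$1
    block_bytes=$2
    echo $(( $max_files * $block_bytes + 8388608 ))
}

# 800 files at the default 128Kb block size: prints 113246208 (108Mb)
ocfs_min_partition_bytes 800 131072
```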


6. Separate Files Across Partitions

Oracle recommends placing archive log files on separate partitions from other file types, e.g. redo log files and datafiles. For optimal performance, each node should have its own, separate OCFS archive log partition. Although RAC requires each node to mount each other's partitions, good design will ensure that only one node ever writes its archive logs to its assigned partition. This way, contention for space allocation (i.e. repeated locking/unlocking of the global bitmap by multiple nodes for a shared volume) is reduced, particularly if the database is heavily used and many archive log files are generated by each node.

All datafiles may reside in the same OCFS partition. An Oracle block size of 8Kb is usually recommended - this need not change when OCFS is used. Performance of 8Kb Oracle blocks is somewhat better than smaller block sizes. Again, the larger the Oracle block size, the better the performance.

7. Defragmentation

Given the extent-based allocation of volume metadata and user file data clusters, it is possible for disk fragmentation to occur. The following guidelines list measures to prevent volume fragmentation:

OCFS requires contiguous space on disk for initial datafile creation e.g. if you create a 1Gb datafile (in one command), it requires 1Gb of contiguous space on disk. If you then extend the datafile by 100Mb, it requires another 100Mb chunk of contiguous disk space. However, the 100Mb chunk need not fall right behind the 1Gb chunk.

By default, restoration of files using RMAN requires contiguous space to be available on disk. Should insufficient contiguous space be available, RMAN restoration will fail with an insufficient disk space error [ENOSPC]. The OCFS utility /usr/bin/extfinder (contiguous extent find utility) can be used to identify the largest available contiguous extents within a volume.

At this time, no OCFS defragmentation tool exists. The only method to defragment an OCFS volume is to copy the files off, then restore them back to the partition. Copying files (i.e. cp --o_direct ..., dd o_direct=yes ..., tar --o_direct ...) from an OCFS partition requires the application of fileutils and tar patches that provide direct_io write capability. o_direct enabled versions of these utilities are available as patches from MetaLink (http://metalink.oracle.com).

8. Performance

Ensure that /usr/bin/updatedb (aka /usr/bin/locate, slocate) does not run against OCFS partitions. Updatedb, a file indexer, will reduce OCFS file I/O performance. To prevent updatedb from indexing OCFS partitions, add 'ocfs' to the PRUNEFS= list in /etc/updatedb.conf, e.g.
[/etc/updatedb.conf]
PRUNEFS="devpts NFS nfs afs proc smbfs autofs auto iso9660 ocfs"
PRUNEPATHS="/tmp /usr/tmp /var/tmp /afs /net /ocfs-data /ocfs-index /quorum"
export PRUNEFS
export PRUNEPATHS
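
Whether the setting is in place can be verified with a short check. The sketch below assumes PRUNEFS is defined on a single line as a space-separated, optionally quoted list, as in the example above; the name prunefs_has_ocfs is hypothetical.

```shell
#!/bin/sh
# Succeed (return 0) if 'ocfs' appears as an entry in the PRUNEFS= list.
# Assumption: PRUNEFS is set on one line, entries separated by spaces.
# prunefs_has_ocfs is a hypothetical helper name.
prunefs_has_ocfs() {
    conf="${1:-/etc/updatedb.conf}"
    [ -r "$conf" ] || return 1
    # Strip quotes, put each entry on its own line, then match 'ocfs' exactly.
    grep -E '^[[:space:]]*PRUNEFS=' "$conf" \
        | tr -d '"' \
        | tr ' =' '\n\n' \
        | grep -qx 'ocfs'
}

if prunefs_has_ocfs /etc/updatedb.conf; then
    echo "updatedb already skips ocfs filesystems"
else
    echo "add ocfs to PRUNEFS in /etc/updatedb.conf"
fi
```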

Some OCFS directory structures may increase the time required to find files. Limit the number of files in a directory, particularly for volumes where the time required to list files is critical, and use a directory tree rather than a flat structure. The flatter the structure for a given number of files, the worse the look-up time. The deeper a file sits within a directory structure, the more expensive the initial search is; however, once a file is found, its file entry is cached by OCFS.

9. Support for Mount by Label

If OCFS volumes are created with unique labels (e.g. mkfs.ocfs -L mylabel ...), mount supports the mounting of volumes by label. The Red Hat Linux Advanced Server 2.1 mount errata provides the required updates for OCFS. To mount an OCFS volume by label, run:
# mount -t ocfs -L mylabel /ocfs

10. Sample Configuration and Layout

Install the appropriate ocfs-kernel module - the correct module depends on your kernel. The command 'uname -a' identifies your currently running kernel, e.g.
[root@ca-test2 root]# uname -a
Linux ca-test2 2.4.9-e.25enterprise #1 SMP Tue Jun 16 15:49:20 PST 2003 i686 unknown
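
The kernel release string carries both the errata number and the kernel type. The sketch below splits them out, assuming the Red Hat Linux Advanced Server 2.1 release format seen above (for example 2.4.9-e.25enterprise); both function names are hypothetical, and on kernels with a different naming scheme both functions print nothing.

```shell
#!/bin/sh
# Split a kernel release string such as "2.4.9-e.25enterprise" into its
# errata number (25) and kernel type (enterprise).
# kernel_errata and kernel_type are hypothetical helper names.
kernel_errata() {
    echo "$1" | sed -n 's/^.*-e\.\([0-9][0-9]*\).*$/\1/p'
}

kernel_type() {
    # Everything after the errata number; empty when there is no suffix.
    echo "$1" | sed -n 's/^.*-e\.[0-9][0-9]*//p'
}

release=$(uname -r)
echo "errata: $(kernel_errata "$release") type: $(kernel_type "$release")"
```

The errata number can then be compared against the e.12 and e.24 levels discussed in section 2.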

In the example above, only the following OCFS packages, i.e. for kernel type enterprise, should be downloaded and installed:

During installation, OCFS automatically creates the necessary init/rc script (/etc/init.d/ocfs) - this installs (loads) the ocfs.o module upon server startup. If you wish to automatically mount OCFS volumes upon server startup/reboot, add a corresponding line to the /etc/fstab file specifying a filesystem type of 'ocfs' per OCFS partition.

For Red Hat Linux Advanced Server 2.1 only, be sure to specify _netdev in the 4th field (which usually says defaults), e.g.

/dev/sda1    /ocfs-data     ocfs    _netdev    0 0
/dev/sda2    /ocfs-index    ocfs    _netdev    0 0
Note: _netdev instructs mount to exclude these volumes on first pass mount i.e. only mount after all network services are started.
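
A quick way to catch a forgotten _netdev option is to scan /etc/fstab for ocfs entries that lack it. The following is a minimal sketch under the assumption that fields are whitespace-separated and options are comma-separated in the 4th field; the name ocfs_fstab_missing_netdev is hypothetical.

```shell
#!/bin/sh
# Print the device and mount point of every ocfs entry in an fstab file
# whose options (4th field) do not include _netdev.
# ocfs_fstab_missing_netdev is a hypothetical helper name.
ocfs_fstab_missing_netdev() {
    fstab="${1:-/etc/fstab}"
    awk '$1 !~ /^#/ && $3 == "ocfs" && $4 !~ /(^|,)_netdev(,|$)/ { print $1, $2 }' "$fstab"
}

# No output means every ocfs entry already carries _netdev.
if [ -r /etc/fstab ]; then
    ocfs_fstab_missing_netdev /etc/fstab
fi
```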

The following is a sample filesystem layout (using the same mount point on every node in the cluster):
/ocfs-data
/ocfs-index
/ocfs-ctrl-redo1
/ocfs-ctrl-redo2
/archive1a
/archive1b
/archive2a
/archive2b
Note: the example above assumes a two node RAC database, where each instance writes archive logs to its LOG_ARCHIVE_DEST and LOG_ARCHIVE_DUPLEX_DEST. To avoid performance and fragmentation issues, the archive destinations for each instance are written to separate partitions.

11. Troubleshooting

OCFS writes [printk()] debug and error messages into the system log (/var/log/messages), and dmesg also reports any OCFS errors. These are the first places to check for issues suspected to be OCFS-related. An OCFS filesystem check utility (/sbin/fsck.ocfs) is available from version 1.0.9 and can be run against unmounted volumes.
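
As a first step, the system log can be filtered for OCFS messages. The sketch below assumes that such messages contain the string 'ocfs' in some letter case, which is an assumption rather than a documented message format; the name ocfs_log_messages is hypothetical.

```shell
#!/bin/sh
# Print lines mentioning OCFS from a system log file.
# Assumption: OCFS messages contain the string "ocfs" (any letter case).
# ocfs_log_messages is a hypothetical helper name.
ocfs_log_messages() {
    log="${1:-/var/log/messages}"
    [ -r "$log" ] || return 1
    grep -i 'ocfs' "$log"
}

# Show the 20 most recent matching lines (usually requires root).
ocfs_log_messages /var/log/messages | tail -n 20
```

The same filter can be applied to the kernel ring buffer by piping the output of dmesg through grep -i ocfs.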

12. How to Obtain OCFS for Linux

The only two Linux distributions on which OCFS is supported are Red Hat Linux Advanced Server 2.1 and United Linux 1.0.

At time of writing, the latest Production version of OCFS is 1.0.9.
OCFS for Linux is available for download from:
OCFS is available for the following distribution/kernel types:
Red Hat Linux Advanced Server 2.1 with kernel 2.4.9-e.12 or higher

OCFS kernel 2.4.9-e.12 or higher

Packages for United Linux 1.0 with Service pack 2A or higher

OCFS 1.0.9 kernel module RPM for United Linux

RELATED DOCUMENTS

<Note:224586.1> Oracle Cluster File System (OCFS) on Red Hat AS - FAQ.
<Note:240575.1> RAC on Linux Best Practices.
<Note:184821.1> Step-By-Step Installation of RAC on Linux.
<Note:241114.1> Step-By-Step Installation of RAC on Linux - Single Node (Oracle9i 9.2.0 with OCFS).