Oracle Cluster File System (OCFS) Best Practices
PURPOSE
The purpose of this article is to provide best practice guidelines
for installing and using Oracle Cluster File System (OCFS) on Linux.
SCOPE & APPLICATION
This article is intended for anyone (System Administrators (SAs), Database
Administrators (DBAs) or users) who is planning to install and use an
OCFS partition. It applies to users of both Red Hat Linux Advanced
Server 2.1 and United Linux 1.0, currently the only two Linux
distributions on which OCFS is supported.
1. Why Use OCFS?
OCFS was designed as an alternative to using raw devices for Oracle9i
Real Application Clusters (RAC). Management of raw devices is usually
a difficult task and many Database Administrators (DBAs) and System Administrators
(SAs) are more familiar with filesystems. Another issue with raw devices
on Linux is the maximum of 255 raw partitions, as there can be no more
than 255 /dev/raw device files.
2. Configuring Linux For OCFS
Other than the kernel parameters required to run Oracle on Linux, OCFS
requires no specific kernel configuration. In fact, other than memory, OCFS
does not rely upon or utilise file-related kernel parameters such as
/proc/sys/fs/file-max.
Red Hat Linux Advanced Server 2.1
Due to improvements in Virtual Memory page cache performance,
a minimum kernel errata of e.24 or higher is strongly recommended for Red
Hat Linux Advanced Server 2.1. If using kernel errata e.12 or higher, the
default kernel page cache settings should be used. Non-default page cache
settings, such as those configured in /etc/sysctl.conf [vm.pagecache] or
echoed to /proc/sys/vm/pagecache should be removed or reset to default values.
If using a kernel errata earlier than e.12, manual configuration of the
Virtual Memory page cache is likely to be necessary to avoid excessive page
cache retention. In that case, add the following parameter to the
/etc/sysctl.conf file, then reboot for changes to take effect:
vm.pagecache = 10 20 30
Alternatively, run the following command as root for changes to take
immediate effect (without the need to reboot); note, however, that changes
made this way are lost on reboot:
[root@ca-test2 root]# echo 10 20 30 > /proc/sys/vm/pagecache
Note: If upgrading from an earlier version (particularly a pre-e.12
kernel) to a current kernel, be sure to check for and remove any Virtual
Memory settings such as those described above.
3. File Types Supported by OCFS
At this time (version 1.0.9), OCFS only supports Oracle data files
- this includes redo log files, archive log files, controlfiles and datafiles.
OCFS also supports the Oracle Cluster Manager (OCM) shared quorum disk
file and the shared Server Configuration file (for srvctl). Shared
Oracle Home installations are not currently supported, but support is
expected in the latter part of 2003.
4. Selecting an OCFS Blocksize
Selecting an appropriate block size requires an understanding that
OCFS was specifically designed for (and favours) large, contiguous files,
such as Oracle datafiles. Advance knowledge of the types of files that
will reside in an OCFS partition is required before formatting it. Block
sizes between 4Kb (min.) and 1Mb (max.) are available. The larger the block
size, the fewer the maximum number of possible files. Conversely, the smaller
the block size, the greater the maximum number of possible files. However,
the smaller the block size, the greater the performance penalty. Block sizes must be a multiple
of 4096 bytes [4Kb]. Small block sizes
should be useful in future OCFS versions that are likely to support regular
files, such as a shared Oracle Home installation between RAC nodes.
The maximum number of possible files for an OCFS partition is calculated
as follows:
(<partition size> - 8Mb) / <block size> = <number of bits> [max. possible files]
e.g. (113246208 [108Mb] - 8388608 [8Mb]) / 131072 = 800 [max. possible files]
Given the above, the OCFS block size required to allow for a given
maximum number of files is calculated as follows:
(<partition size> - 8Mb) / <number of bits> = <block size>
(113246208 [108Mb] - 8388608 [8Mb]) / 100 = 1048576 [1Mb] block size
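A quick way to sanity-check the two formulas above is with shell
arithmetic; the sketch below reproduces the 108Mb example figures:

```shell
# Reproduce the worked example above: 108Mb partition, 128Kb block size.
PART_SIZE=113246208   # partition size in bytes (108Mb)
METADATA=8388608      # fixed volume metadata overhead in bytes (8Mb)
BLOCK_SIZE=131072     # OCFS block size in bytes (128Kb)

# (partition size - 8Mb) / block size = max. possible files
MAX_FILES=$(( (PART_SIZE - METADATA) / BLOCK_SIZE ))
echo "max files: $MAX_FILES"                 # 800

# (partition size - 8Mb) / desired file count = required block size
REQ_BS=$(( (PART_SIZE - METADATA) / 100 ))
echo "block size for 100 files: $REQ_BS"     # 1048576 (1Mb)
```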
The default OCFS block size is 128Kb (131,072 bytes). Formatting an
OCFS partition with a 128Kb block size provides a good balance between
the maximum number of files and file I/O performance for medium to large
sized files. A block size of 128Kb means that for every file created (with
content), a minimum of 128Kb of disk space is
allocated, even if the file only contains 1 byte of data.
Following are guidelines for selecting an appropriate OCFS block size:

Block Size   | File Types                                | Max File Size
-------------|-------------------------------------------|--------------
128Kb - 1Mb  | Few, large, contiguous files              | 1Mb   = 1Tb
128Kb        | Several medium sized files                | 128Kb = 1Tb
             | (archive logs, redo logs, controlfiles)   |
4Kb - 128Kb  | Many, small files                         | 4Kb   = 2Gb
1Mb block size: 1Mb * 8M bits = 8,796,093,022,208
[ 8Tb] maximum file size but limited by Linux addressability to only 1Tb.
128Kb block size: 128Kb * 8M bits = 1,099,511,627,776 [ 1Tb] maximum
4Kb block size: 4Kb * 8M bits = 34,359,738,368 [32Gb] maximum
Note: 8M bits refers to the fixed 1Mb size of the global bitmap per OCFS
partition, i.e. 1,048,576 bytes * 8 bits = 8,388,608 bits.
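The three maximum file sizes quoted above follow directly from multiplying
the block size by the 8,388,608 bits of the 1Mb global bitmap:

```shell
# Max file size = block size * number of bits in the 1Mb global bitmap.
BITMAP_BITS=$(( 1048576 * 8 ))   # 1Mb bitmap = 8,388,608 bits

for BS in 4096 131072 1048576; do
    echo "block size $BS -> max file size $(( BS * BITMAP_BITS )) bytes"
done
# 4096    -> 34359738368     (32Gb)
# 131072  -> 1099511627776   ( 1Tb)
# 1048576 -> 8796093022208   ( 8Tb, capped at 1Tb by Linux addressability)
```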
5. Calculating an OCFS Partition Size
OCFS supports partitions of up to 1Tb (tested). Since
no volume management is built into OCFS, Oracle recommends enabling hardware
raid support to create logical disk volumes of sufficient size. If hardware
raid support is not available, a Logical Volume Manager (LVM)
or Multi-Disk (MD) disk configuration can be employed, depending on the
Linux distribution being used. Creating too many OCFS partitions
(i.e. 50 or more) per clustered system is likely to create a performance
(process) bottleneck - this is not specific to OCFS. Ideally, there should
be no more than 20 OCFS partitions per system.
Calculating the exact required partition size for an OCFS volume is
complex. The minimum partition size depends on several factors, including:
- the OCFS block size (see above)
- the maximum number of (not necessarily the size of) intended user files
  e.g. datafiles
- the intended size of user data/files
The absolute minimum OCFS partition size should be at least 100Mb - this
allows for volume structures and file metadata (for up to 32 nodes), but
minimal user file space. Unlike other general purpose filesystems e.g.
ext2/3, OCFS was purpose built for few, very large, contiguous files.
Like other filesystems, OCFS does not guard against disk space exhaustion,
so initial partition sizing is critical and must allow for sufficient file
growth - this includes space for user files as well as volume metadata.
Calculating a required partition size, given a pre-determined block
size and a known maximum number of required files, is done as follows:
(<max number files> * <block size>) + 8388608 [8Mb] = <min. partition size>
e.g. (800 * 131072) + 8388608 [8Mb] = 113246208 [108Mb] min. partition size
Note: the 8Mb pertains to the volume metadata. Also, this does not
factor in user file space requirements.
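Because the formula above excludes user data, the metadata-only minimum
should in practice be padded with the expected data volume. A sketch,
where the 50Gb figure is purely illustrative:

```shell
# Metadata-only minimum partition size for 800 files at a 128Kb block size,
# then a practical minimum that also budgets for user data (50Gb here is an
# illustrative assumption, not a recommendation).
MAX_FILES=800
BLOCK_SIZE=131072
USER_DATA=$(( 50 * 1024 * 1024 * 1024 ))

MIN_PART=$(( MAX_FILES * BLOCK_SIZE + 8388608 ))
echo "metadata-only minimum: $MIN_PART bytes"          # 113246208 (108Mb)
echo "practical minimum: $(( MIN_PART + USER_DATA )) bytes"
```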
Oracle recommends placing archive log files on partitions separate
from other file types e.g. redo log files, datafiles. For optimal
performance, each node should have its own, separate OCFS archive log
partition. Although RAC requires each node to mount the others' partitions,
good design will ensure that only one node ever writes its archive logs
to its assigned partition. This way, contention for space allocation
(i.e. repeated locking/unlocking of the global bitmap by multiple nodes
for a shared volume) is reduced, particularly if the database is heavily
used and many archive log files are generated by each node.
6. Separate Files
All datafiles may reside in the same OCFS partition. An Oracle block
size of 8Kb is usually recommended - this need not change when OCFS is
used. Performance with 8Kb Oracle blocks is somewhat better than with
smaller block sizes; in general, the larger the Oracle block size, the
better the performance.
Given the extent-based allocation of volume metadata
and user file data clusters, it is possible for disk fragmentation to occur.
The following guidelines list measures to prevent volume fragmentation:
OCFS requires contiguous space on disk for initial datafile creation e.g.
if you create a 1Gb datafile (in one command), it requires 1Gb of contiguous
space on disk. If you then extend the datafile by 100Mb, it requires another
100Mb chunk of contiguous disk space. However, the 100Mb chunk need not
immediately follow the 1Gb chunk.
- Avoid heavy, concurrent new file creation, deletion
or extension, particularly from multiple nodes to the same partition.
- Attempt to correctly size Oracle datafiles before creation (including
when adding new datafiles to a tablespace), ensuring to allow more than
adequate room for growth.
- Use a consistent initial/next extent size for all tablespaces
in order to prevent partition fragmentation.
- Separate data and index datafiles across separate OCFS partitions.
- Separate archive logs and redo log files across
separate OCFS partitions.
- Where possible, avoid enabling datafile autoextensibility.
Statically sized datafiles are ideal to avoid defragmentation. Autoextensibility
is acceptable as long as a large next extent is configured.
- Where possible, use Recovery Manager (RMAN), particularly for
restoration - RMAN writes in o_direct mode by default. Restoration of
files using RMAN requires contiguous space to be available on disk.
Should insufficient contiguous space be available, RMAN restoration
will fail with an insufficient disk space error [ENOSPC]. The OCFS
/usr/bin/extfinder utility (contiguous extent finder) can be used to
identify the largest available, contiguous extents within a volume.
At this time, no OCFS defragmentation tool exists. The
only method to defragment an OCFS volume is to copy off files, then restore
them back to the partition. Copying files (i.e. cp --o_direct ..., dd o_direct=yes
..., tar --o_direct ...) from an OCFS partition requires the application of
fileutils and tar patches that provide direct_io write capability. o_direct
enabled versions of these utilities are available from MetaLink (http://metalink.oracle.com)
- Patch 2883583: fileutils-4.1-4.2.i386.rpm
- Patch 2913284: tar-1.13.25-9.i386.rpm
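As a sketch of the copy-off/copy-back defragmentation approach, assuming
the patched, o_direct-enabled fileutils above are installed and the
affected files are not in use (the paths and file names below are
hypothetical):

```shell
# Hypothetical example: defragment a datafile by copying it off the OCFS
# volume and back. Requires the o_direct-enabled cp from the fileutils
# patch; the database (or at least this tablespace) must be offline.
cp --o_direct /ocfs-data/users01.dbf /backup/users01.dbf
rm /ocfs-data/users01.dbf
cp --o_direct /backup/users01.dbf /ocfs-data/users01.dbf
```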
7. Prevent updatedb Indexing of OCFS Partitions
Ensure that /usr/bin/updatedb (aka /usr/bin/locate, slocate) does not run
against OCFS partitions. updatedb, a file indexer, will reduce OCFS file
I/O performance. To prevent updatedb from indexing OCFS partitions, add
'ocfs' to the PRUNEFS= list in /etc/updatedb.conf e.g.
PRUNEFS="devpts NFS nfs afs proc smbfs autofs auto iso9660 ocfs"
PRUNEPATHS="/tmp /usr/tmp /var/tmp /afs /net /ocfs-data /ocfs-index /quorum"
8. Directory Structure
Some OCFS directory structures may increase the time required to find
files. Limit the number of files in a directory, particularly for volumes
where the time required to list files is critical: use a directory tree
rather than a flat structure. The flatter the tree for a given number of
files, the worse the look-up time. The deeper a file sits within a
directory structure, the more expensive the initial search; however, once
a file is found, its file entry is cached by OCFS.
9. Support for Mount by Label
If OCFS volumes are created with unique labels (e.g.
mkfs.ocfs -L mylabel ...), mount supports the mounting of volumes by label.
The Red Hat Linux Advanced Server 2.1 mount errata provides the necessary
updates for OCFS.
To mount OCFS by label run:
# mount -t ocfs -L mylabel /ocfs
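For example, a labelled volume might be created and then mounted by label
as follows (the device, label, block size, and ownership values are
illustrative; verify the mkfs.ocfs options against your OCFS release):

```shell
# Illustrative only: format /dev/sda1 with a 128Kb block size and the
# label 'ocfs-data', owned by oracle:dba, then mount it by label.
mkfs.ocfs -b 128 -L ocfs-data -m /ocfs-data -u oracle -g dba /dev/sda1
mount -t ocfs -L ocfs-data /ocfs-data
```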
10. Sample Configuration and Layout
Install the appropriate ocfs-kernel module - the correct
module depends on your kernel. Command 'uname -a' will identify your current
running kernel e.g.
[root@ca-test2 root]# uname -a
Linux ca-test2 2.4.9-e.25enterprise #1 SMP Tue Jun 16 15:49:20 PST 2003 i686 unknown
In the example above, only the following OCFS packages, i.e. for kernel
type enterprise, should be downloaded and installed:
During installation, OCFS automatically creates the necessary init/rc script
(/etc/init.d/ocfs) - this installs (loads) the ocfs.o module upon server
startup. If you wish to automatically mount OCFS volumes upon server
startup/reboot, add a corresponding line for each volume to the /etc/fstab
file, specifying a filesystem type of 'ocfs'.
For Red Hat Linux Advanced Server 2.1 only, ensure to add _netdev to the
4th field (which usually says 'defaults') e.g.
/dev/sda1 /ocfs-data ocfs _netdev 0 0
/dev/sda2 /ocfs-index ocfs _netdev 0 0
Note: _netdev instructs mount to exclude these volumes on first pass
mount i.e. only mount after all network services are started.
The following is a sample filesystem layout (using the same mount
point on every node in the cluster):
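As an illustration of such a layout, a two-node cluster might use the
following (all device names and mount points here are hypothetical):

```shell
# Hypothetical two-node layout; the same mount points are used on
# every node in the cluster.
/dev/sda1   /ocfs-data    # datafiles (shared by both instances)
/dev/sda2   /ocfs-index   # index datafiles
/dev/sdb1   /ocfs-arch1   # archive logs, written only by instance 1
/dev/sdb2   /ocfs-arch2   # archive logs, written only by instance 2
/dev/sdc1   /quorum       # OCM quorum and shared configuration files
```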
Note: the example above assumes a two node RAC database, where each
instance writes archive logs to its own, dedicated partition. To avoid
performance and defragmentation issues, the archive destinations for each
instance are written to separate partitions.
11. Troubleshooting
OCFS writes [printk()] debug and error messages into the system log
(/var/log/messages). For issues suspected to be OCFS-related, a few
things can be checked quickly:
- Type 'dmesg' to see if there are any OCFS-related messages.
- Run 'ps aux' to identify any hanging processes, i.e. those in a D
(uninterruptible sleep) state. If the state does not change, running
'/usr/bin/strace -p <pid>' will trace any process activity.
- The OCFS filesystem check utility (/sbin/fsck.ocfs) is available from
version 1.0.9 and can be run against unmounted volumes.
- Ensure to provide this information to Oracle Support Services (OSS)
when logging a Service Request (SR).
12. How to Obtain OCFS for Linux
The only two OCFS supported Linux distributions are Red Hat Linux
Advanced Server 2.1 and United Linux 1.0.
At time of writing, the latest Production version of OCFS is 1.0.9.
OCFS for Linux is available for download from:
OCFS is available for the following distribution/kernel types:
- Red Hat Linux Advanced Server 2.1 with kernel 2.4.9-e.12 or higher
- United Linux 1.0 with Service Pack 2A or higher (OCFS 1.0.9 kernel
module RPM for United Linux)
RELATED DOCUMENTS
<Note:224586.1> Oracle Cluster File System (OCFS) on Red Hat AS - FAQ.
<Note:240575.1> RAC on Linux Best Practices.
<Note:184821.1> Step-By-Step Installation of RAC on Linux.
<Note:241114.1> Step-By-Step Installation of RAC on Linux - Single
Node (Oracle9i 9.2.0 with OCFS).