[Ocfs2-tools-devel] [PATCH 27/28] ocfs2: Add manpage

Sunil Mushran sunil.mushran at oracle.com
Fri Aug 19 15:16:24 PDT 2011


The ocfs2 user's guide has been converted to a manpage.

Signed-off-by: Sunil Mushran <sunil.mushran at oracle.com>
---
 configure.in                              |    1 +
 debian/ocfs2-tools.manpages               |    1 +
 libocfs2/Makefile                         |    4 +-
 libocfs2/ocfs2.7.in                       | 1446 +++++++++++++++++++++++++++++
 vendor/common/ocfs2-tools.spec-generic.in |    1 +
 5 files changed, 1452 insertions(+), 1 deletions(-)
 create mode 100644 libocfs2/ocfs2.7.in

diff --git a/configure.in b/configure.in
index 7079d2d..7a007a3 100644
--- a/configure.in
+++ b/configure.in
@@ -445,6 +445,7 @@ libo2cb/o2cb.7
 o2monitor/o2hbmonitor.8
 o2cb_ctl/ocfs2.cluster.conf.5
 vendor/common/o2cb.sysconfig.5
+libocfs2/ocfs2.7
 vendor/common/ocfs2-tools.spec-generic
 ])
 
diff --git a/debian/ocfs2-tools.manpages b/debian/ocfs2-tools.manpages
index cbd1567..6444737 100644
--- a/debian/ocfs2-tools.manpages
+++ b/debian/ocfs2-tools.manpages
@@ -12,3 +12,4 @@ debian/tmp/usr/share/man/man1/o2info.1
 debian/tmp/usr/share/man/man8/o2hbmonitor.8
 debian/tmp/usr/share/man/man5/ocfs2.cluster.conf.5
 debian/tmp/usr/share/man/man5/o2cb.sysconfig.5
+debian/tmp/usr/share/man/man7/ocfs2.7
diff --git a/libocfs2/Makefile b/libocfs2/Makefile
index d360bbf..a4027c2 100644
--- a/libocfs2/Makefile
+++ b/libocfs2/Makefile
@@ -105,7 +105,9 @@ libocfs2.a: $(OBJS)
 	$(AR) r $@ $^
 	$(RANLIB) $@
 
-DIST_FILES = $(CFILES) $(HFILES) ocfs2_err.et
+MANS = ocfs2.7
+
+DIST_FILES = $(CFILES) $(HFILES) ocfs2_err.et ocfs2.7.in
 
 CLEAN_RULES = clean-err
 
diff --git a/libocfs2/ocfs2.7.in b/libocfs2/ocfs2.7.in
new file mode 100644
index 0000000..5f399ae
--- /dev/null
+++ b/libocfs2/ocfs2.7.in
@@ -0,0 +1,1446 @@
+.TH "OCFS2" "7" "August 2011" "Version @VERSION@" "OCFS2 Manual Pages"
+.SH "NAME"
+OCFS2 \- A cluster file system for Linux
+
+.SH "INTRODUCTION"
+.PP
+\fBOCFS2\fR is a \fBfile system\fR. It allows users to store and retrieve data. The data
+is stored in files that are organized in a hierarchical directory tree. It is a \fBPOSIX compliant\fR
+file system that supports the standard interfaces and the behavioral semantics as spelled out
+by that specification.
+
+It is also a \fBshared disk cluster\fR file system, one that allows multiple nodes to access the
+same disk at the same time. This is where the fun begins as allowing a file system to be
+accessible on multiple nodes opens a can of worms. What if the nodes are of different
+architectures? What if a node dies while writing to the file system? What data consistency
+can one expect if processes on two nodes are reading and writing concurrently? What if
+one node removes a file while it is still being used on another node?
+
+Unlike most shared file systems where the answer is fuzzy, the answer in OCFS2 is very
+well defined. It behaves on all nodes exactly like a \fBlocal\fR file system. If a file is
+removed, the directory entry is removed but the inode is kept as long as it is in use across
+the cluster. When the last user closes the descriptor, the inode is marked for deletion.
+
+The data consistency model follows the same principle. It works as if the two processes
+that are running on two different nodes are running on the same node. A read on a node
+gets the last write irrespective of the IO mode used. The modes can be \fIbuffered\fR, \fIdirect\fR,
+\fIasynchronous\fR, \fIsplice\fR or \fImemory mapped\fR IOs. It is fully \fBcache coherent\fR.
+
+Take for example the REFLINK feature that allows a user to create multiple writeable
+snapshots of a file. This feature, like all others, is fully cluster-aware. A file
+being written to on multiple nodes can be safely reflinked on another node. The snapshot
+created is a point-in-time image of the file that includes both the file data and all its
+attributes (including extended attributes).
+
+It is a \fBjournaling\fR file system. When a node dies, a surviving node transparently replays
+the journal of the dead node. This ensures that the file system metadata is always
+consistent. It also defaults to ordered data journaling to ensure the file data is flushed
+to disk before the journal commit, to remove the small possibility of stale data appearing
+in files after a crash.
+
+It is \fBarchitecture\fR and \fBendian neutral\fR. It allows concurrent mounts on nodes with
+different processors like x86, x86_64, IA64 and PPC64. It handles little and big endian,
+32-bit and 64-bit architectures.
+
+It is \fBfeature rich\fR. It supports \fIindexed directories\fR, \fImetadata checksums\fR,
+\fIextended attributes\fR, \fIPOSIX ACLs\fR, \fIquotas\fR, \fIREFLINKs\fR, \fIsparse files\fR,
+\fIunwritten extents\fR and \fIinline-data\fR.
+
+It is \fBfully integrated\fR with the mainline Linux kernel. The file system was merged
+into Linux kernel 2.6.16 in early 2006.
+
+It is \fBquickly installed\fR. It is available with almost all Linux distributions.
+The file system is \fBon-disk compatible\fR across all of them.
+
+It is \fBmodular\fR. The file system can be configured to operate with other cluster
+stacks like \fIPacemaker\fR and \fICMAN\fR along with its own stack, \fIO2CB\fR.
+
+It is \fBeasily configured\fR. The O2CB cluster stack configuration involves editing two
+files, one for cluster layout and the other for cluster timeouts.
+
+It is \fBvery efficient\fR. The file system consumes very little resources. It is used
+to store virtual machine images in limited memory environments like Xen and KVM.
+
+In summary, OCFS2 is an efficient, easily configured, modular, quickly installed, fully
+integrated and compatible, feature-rich, architecture and endian neutral, cache coherent,
+ordered data journaling, POSIX-compliant, shared disk cluster file system.
+
+.SH "OVERVIEW"
+.PP
+OCFS2 is a general-purpose shared-disk cluster file system for Linux capable of providing
+both high performance and high availability.
+
+As it provides local file system semantics, it can be used with almost all applications.
+Cluster-aware applications can make use of cache-coherent parallel I/Os from multiple nodes
+to scale out applications easily. Other applications can make use of the clustering
+facilities to fail over running applications in the event of a node failure.
+
+The notable features of the file system are:
+.TP
+\fBTunable Block size\fR
+The file system supports block sizes of 512, 1K, 2K and 4K bytes. 4KB is almost always
+recommended. This feature is available in all releases of the file system.
+
+.TP
+\fBTunable Cluster size\fR
+A cluster size is also referred to as an allocation unit. The file system supports
+cluster sizes of 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K and 1M bytes. For most use
+cases, 4KB is recommended. However, a larger value is recommended for volumes hosting
+mostly very large files like database files, virtual machine images, etc. A large
+cluster size allows the file system to store large files more efficiently. This feature
+is available in all releases of the file system.
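+
+For example, a volume meant to host virtual machine images could be formatted with a
+4KB block size and a 1MB cluster size as follows. This is a sketch; the label and the
+device name are placeholders.
+
+.in +4n
+.nf
+# \fBmkfs.ocfs2 -b 4K -C 1M -L "vmstore" /dev/sda1\fR
+.fi
+.in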
+
+.TP
+\fBEndian and Architecture neutral\fR
+The file system can be mounted concurrently on nodes having different architectures:
+32-bit, 64-bit, little-endian (x86, x86_64, ia64) and big-endian (ppc64, s390x).
+This feature is available in all releases of the file system.
+
+.TP
+\fBBuffered, Direct, Asynchronous, Splice and Memory Mapped I/O modes\fR
+The file system supports all modes of I/O for maximum flexibility and performance.
+It also supports cluster-wide \fBshared writeable mmap(2)\fR. The support for buffered,
+direct and asynchronous I/O is available in all releases. The support for splice I/O
+was added in Linux kernel \fB2.6.20\fR and for shared writeable mmap(2) in \fB2.6.23\fR.
+
+.TP
+\fBMultiple Cluster Stacks\fR
+The file system includes a flexible framework to allow it to function with userspace
+cluster stacks like Pacemaker (\fBpcmk\fR) and CMAN (\fBcman\fR), its own in-kernel
+cluster stack \fBo2cb\fR and \fIno\fR cluster stack.
+
+The support for \fBo2cb\fR cluster stack is available in all releases.
+
+The support for \fIno\fR cluster stack, or \fBlocal\fR mount, was added in Linux
+kernel \fB2.6.20\fR.
+
+The support for userspace cluster stack was added in Linux kernel \fB2.6.26\fR.
+
+.TP
+\fBJournaling\fR
+The file system supports both \fBordered\fR (default) and \fBwriteback\fR data journaling
+modes to provide file system consistency in the event of power failure or system crash.
+It uses \fBJBD2\fR in Linux kernel \fB2.6.28\fR and later. It used \fBJBD\fR in earlier
+kernels.
+
+.TP
+\fBExtent-based Allocations\fR
+The file system allocates and tracks space in ranges of clusters. This is unlike
+block-based file systems that have to track each and every block. This feature allows the
+file system to be very efficient when dealing with both large volumes and large files.
+This feature is available in all releases of the file system.
+
+.TP
+\fBSparse files\fR
+Sparse files are files with holes. With this feature, the file system delays allocating
+space until a write is issued to a cluster. This feature was added in Linux kernel \fB2.6.22\fR
+and requires enabling on-disk feature \fBsparse\fR.
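+
+The behavior is easy to observe with standard tools. The sketch below creates a file
+with a 1GB hole by seeking past the end of the file before writing; \fBls(1)\fR reports
+the apparent size while \fBdu(1)\fR reports the much smaller allocated size. The file
+name is a placeholder.
+
+.in +4n
+.nf
+$ \fBdd if=/dev/zero of=holeyfile bs=1M count=1 seek=1024\fR
+$ \fBls -lh holeyfile; du -h holeyfile\fR
+.fi
+.in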
+
+.TP
+\fBUnwritten Extents\fR
+An unwritten extent is also referred to as user pre-allocation. It allows an application
+to request a range of clusters to be allocated, but not initialized, within a file.
+Pre-allocation allows the file system to optimize the data layout with fewer, larger
+extents. It also provides a performance boost, delaying initialization until the user
+writes to the clusters. This feature was added in Linux kernel \fB2.6.23\fR and requires
+enabling on-disk feature \fBunwritten\fR.
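+
+Applications request pre-allocation with the \fBfallocate(2)\fR system call. Assuming
+a kernel new enough to support \fBfallocate(2)\fR on OCFS2, the \fBfallocate(1)\fR
+utility from util-linux provides a command-line wrapper; the file name and size below
+are placeholders.
+
+.in +4n
+.nf
+$ \fBfallocate -l 1G prealloc-file\fR
+.fi
+.in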
+
+.TP
+\fBHole Punching\fR
+Hole punching allows an application to remove arbitrary allocated regions within a
+file. Creating holes, essentially. This is more efficient than zeroing the same extents.
+This feature is especially useful in virtualized environments as it allows a block discard
+in a guest file system to be converted to a hole punch in the host file system thus
+allowing users to reduce disk space usage. This feature was added in Linux kernel \fB2.6.23\fR
+and requires enabling on-disk features \fBsparse\fR and \fBunwritten\fR.
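+
+Assuming the same \fBfallocate(1)\fR utility from util-linux, a hole can be punched
+from the command line as follows; the offset, length and file name are placeholders.
+
+.in +4n
+.nf
+$ \fBfallocate --punch-hole --offset 512M --length 256M somefile\fR
+.fi
+.in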
+
+.TP
+\fBInline-data\fR
+Inline data is also referred to as data-in-inode as it allows storing small files
+and directories in the inode block. This not only saves space but also has a positive
+impact on cold-cache directory and file operations. The data is transparently moved
+out to an extent when it no longer fits inside the inode block. This feature was added
+in Linux kernel \fB2.6.24\fR and requires enabling on-disk feature \fBinline-data\fR.
+
+.TP
+\fBREFLINK\fR
+REFLINK is also referred to as fast copy. It allows users to atomically (and instantly)
+copy regular files. In other words, create multiple writeable snapshots of regular files.
+It is called REFLINK because it looks and feels more like a (hard) \fBlink(2)\fR than a
+traditional snapshot. Like a link, it is a regular user operation, subject to the security
+attributes of the inode being reflinked and not to the super user privileges typically
+required to create a snapshot. Like a link, it operates within a file system. But unlike
+a link, it links the inodes at the data extent level allowing each reflinked inode to grow
+independently as and when written to. Up to four billion inodes can share a data extent.
+This feature was added in Linux kernel \fB2.6.32\fR and requires enabling on-disk feature
+\fBrefcount\fR.
+
+.TP
+\fBAllocation Reservation\fR
+File contiguity plays an important role in file system performance. When a file is
+fragmented on disk, reading and writing to the file involves many seeks, leading to
+lower throughput. Contiguous files, on the other hand, minimize seeks, allowing the
+disks to perform IO at the maximum rate.
+
+With allocation reservation, the file system reserves a window in the bitmap for all
+extending files allowing each to grow as contiguously as possible. As this extra space
+is not actually allocated, it is available for use by other files if the need arises.
+This feature was added in Linux kernel \fB2.6.35\fR and can be tuned using the mount
+option \fBresv_level\fR.
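+
+A sketch of mounting with an explicit reservation level; the level shown is
+illustrative, and the device and mount point are placeholders.
+
+.in +4n
+.nf
+# \fBmount -t ocfs2 -o resv_level=4 /dev/sda1 /ocfs2\fR
+.fi
+.in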
+
+.TP
+\fBIndexed Directories\fR
+An indexed directory allows users to perform quick lookups of a file in very large
+directories. It also results in faster creates and unlinks and thus provides better
+overall performance. This feature was added in Linux kernel \fB2.6.30\fR and requires
+enabling on-disk feature \fBindexed-dirs\fR.
+
+.TP
+\fBFile Attributes\fR
+This refers to EXT2-style file attributes, such as immutable, modified using
+\fBchattr(1)\fR and queried using \fBlsattr(1)\fR. This feature was added in Linux
+kernel \fB2.6.19\fR.
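+
+For example, to mark a file immutable and later clear the flag using the utilities
+named above; the file name is a placeholder.
+
+.in +4n
+.nf
+# \fBchattr +i somefile\fR
+# \fBlsattr somefile\fR
+# \fBchattr -i somefile\fR
+.fi
+.in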
+
+.TP
+\fBExtended Attributes\fR
+An extended attribute refers to a name:value pair that can be associated with file
+system objects like regular files, directories, symbolic links, etc. \fIOCFS2\fR allows
+associating an \fIunlimited\fR number of attributes per object. The attribute names can be
+up to 255 bytes in length, terminated by the first NUL character. While it is not
+required, printable names (ASCII) are recommended. The attribute values can be up
+to 64 KB of arbitrary binary data. These attributes can be modified and listed using
+standard Linux utilities \fBsetfattr(1)\fR and \fBgetfattr(1)\fR. This feature was
+added in Linux kernel \fB2.6.29\fR and requires enabling on-disk feature \fBxattr\fR.
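+
+For example, to attach an attribute and read it back with the utilities named above;
+the attribute name, value, file name and output are illustrative.
+
+.in +4n
+.nf
+$ \fBsetfattr -n user.location -v "nyc" somefile\fR
+$ \fBgetfattr -n user.location somefile\fR
+# file: somefile
+user.location="nyc"
+.fi
+.in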
+
+.TP
+\fBMetadata Checksums\fR
+This feature allows the file system to detect silent corruptions in all metadata blocks
+like inodes and directories. This feature was added in Linux kernel \fB2.6.29\fR and
+requires enabling on-disk feature \fBmetaecc\fR.
+
+.TP
+\fBPOSIX ACLs and Security Attributes\fR
+POSIX ACLs allow assigning fine-grained discretionary access rights for files and
+directories. This security scheme is a lot more flexible than the traditional file
+access permissions, which impose a strict user-group-other model.
+
+Security attributes allow the file system to support other security regimes like SELinux,
+SMACK, AppArmor, etc.
+
+Both these security extensions were added in Linux kernel \fB2.6.29\fR and require
+enabling the on-disk feature \fBxattr\fR.
+
+.TP
+\fBUser and Group Quotas\fR
+This feature allows setting up usage quotas on a per-user and per-group basis using the
+standard utilities like \fBquota(1)\fR, \fBsetquota(8)\fR, \fBquotacheck(8)\fR, and
+\fBquotaon(8)\fR. This feature was added in Linux kernel \fB2.6.29\fR and requires
+enabling on-disk features \fBusrquota\fR and \fBgrpquota\fR.
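+
+For example, one could set block and inode limits for a user and verify them as
+follows. This is a sketch; the user name, the limits (block limits are in 1KB units)
+and the mount point are placeholders.
+
+.in +4n
+.nf
+# \fBsetquota -u jeff 5120000 6144000 10000 11000 /ocfs2\fR
+# \fBquota -u jeff\fR
+.fi
+.in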
+
+.TP
+\fBUnix File Locking\fR
+The Unix operating system has historically provided two system calls to lock files:
+\fBflock(2)\fR, or BSD locking, and \fBfcntl(2)\fR, or POSIX locking. \fIOCFS2\fR
+extends both file locks to the cluster. File locks taken on one node interact with those
+taken on other nodes.
+
+The support for clustered \fBflock(2)\fR was added in Linux kernel \fB2.6.26\fR.
+All \fBflock(2)\fR options are supported, including the kernel's ability to cancel
+a lock request when an appropriate kill signal is received by the user. This feature
+is supported with all cluster-stacks including \fBo2cb\fR.
+
+The support for clustered \fBfcntl(2)\fR was added in Linux kernel \fB2.6.28\fR.
+But because it requires group communication to make the locks coherent, it is only
+supported with userspace cluster stacks, \fBpcmk\fR and \fBcman\fR and \fInot\fR
+with the default cluster stack \fBo2cb\fR.
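+
+The clustered \fBflock(2)\fR can be exercised from the shell with the \fBflock(1)\fR
+utility from util-linux, where available. In the sketch below, a second invocation on
+any node blocks until the first exits; the lock file and command are placeholders.
+
+.in +4n
+.nf
+$ \fBflock /ocfs2/app.lock -c "sleep 60"\fR
+.fi
+.in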
+
+.TP
+\fBComprehensive Tools Support\fR
+The file system has a comprehensive EXT3-style toolset that tries to use similar
+parameters for ease-of-use. It includes mkfs.ocfs2(8) (format), tunefs.ocfs2(8)
+(tune), fsck.ocfs2(8) (check), debugfs.ocfs2(8) (debug), etc.
+
+.TP
+\fBOnline Resize\fR
+The file system can be dynamically grown using \fBtunefs.ocfs2(8)\fR. This feature
+was added in Linux kernel \fB2.6.25\fR.
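+
+For example, after growing the underlying device, the file system could be grown to
+match using the volume-size option. A sketch; the device name is a placeholder.
+
+.in +4n
+.nf
+# \fBtunefs.ocfs2 -S /dev/sda1\fR
+.fi
+.in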
+
+.SH "RECENT CHANGES"
+.PP
+The O2CB cluster stack has a \fBglobal heartbeat\fR mode. It allows users to specify
+heartbeat regions that are consistent across all nodes. The cluster stack also allows
+online addition and removal of both nodes and heartbeat regions.
+
+\fBo2cb(8)\fR is the new cluster configuration utility. It is an easy-to-use utility
+that allows users to create the cluster configuration on a node that is not part of the
+cluster. It replaces the older utility \fBo2cb_ctl(8)\fR, which has been deprecated.
+
+\fBocfs2console(8)\fR has been obsoleted.
+
+\fBo2info(8)\fR is a new utility that can be used to provide file system information.
+It allows non-privileged users to see the enabled file system features, block and
+cluster sizes, extended file stat, free space fragmentation, etc.
+
+\fBo2hbmonitor(8)\fR is an \fBo2hb\fR heartbeat monitor. It is an extremely lightweight
+utility that logs messages to the system logger once the heartbeat delay exceeds the
+warn threshold. This utility is useful in identifying volumes encountering I/O delays.
+
+\fBdebugfs.ocfs2(8)\fR has some new commands. \fInet_stats\fR shows the \fBo2net\fR
+message times between various nodes. This is useful in identifying nodes that are slowing
+down the cluster operations. \fIstat_sysdir\fR allows the user to dump the entire system
+directory that can be used to debug issues. \fIgrpextents\fR dumps the complete free space
+fragmentation in the cluster group allocator.
+
+\fBmkfs.ocfs2(8)\fR now enables \fIxattr\fR, \fIindexed-dirs\fR, \fIdiscontig-bg\fR,
+\fIrefcount\fR, \fIextended-slotmap\fR and \fIclusterinfo\fR feature flags by default,
+in addition to the older defaults, \fIsparse\fR, \fIunwritten\fR and \fIinline-data\fR.
+
+\fBmount.ocfs2(8)\fR allows users to specify the level of cache coherency between nodes.
+By default the file system operates in full coherency mode that also serializes the
+direct I/Os. While this mode is technically correct, it limits the I/O throughput in a
+clustered database. This mount option allows the user to limit the cache coherency
+to only the buffered I/Os to allow multiple nodes to do concurrent direct writes to
+the same file. This feature works with Linux kernel \fB2.6.37\fR and later.
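+
+A sketch of mounting with the relaxed coherency mode described above, as documented
+in \fBmount.ocfs2(8)\fR; the device and mount point are placeholders.
+
+.in +4n
+.nf
+# \fBmount -t ocfs2 -o coherency=buffered /dev/sda1 /ocfs2\fR
+.fi
+.in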
+
+.SH "COMPATIBILITY"
+.PP
+The OCFS2 development team goes to great lengths to maintain compatibility. It attempts
+to maintain both on-disk and network protocol compatibility across all releases of the
+file system. It does so even while adding new features that entail on-disk format and
+network protocol changes. To do this successfully, it follows a few rules:
+
+.in +4n
+\fB1\fR. The on-disk format changes are managed by a set of feature flags that can be
+turned on and off. The file system in the kernel detects these features during mount and
+continues only if it understands all of them. Users encountering such a failure have the
+option of either disabling that feature or upgrading the file system to a newer release.
+
+\fB2\fR. The latest release of ocfs2-tools is compatible with all versions of the file
+system. All utilities detect the features enabled on disk and continue only if they
+understand all of them. Users encountering such a failure have to upgrade the tools to
+a newer release.
+
+\fB3\fR. The network protocol version is negotiated by the nodes to ensure all nodes
+understand the active protocol version.
+.in
+
+.TP
+\fBFEATURE FLAGS\fR
+The feature flags are split into three categories, namely, \fBCompat\fR, \fBIncompat\fR
+and \fBRO Compat\fR.
+
+\fBCompat\fR, or compatible, is a feature that the file system does not need to fully
+understand to safely read/write to the volume. An example of this is the backup-super
+feature that added the capability to back up the super block in multiple locations in the
+file system. As the backup super blocks are typically not read nor written to by the file
+system, an older file system can safely mount a volume with this feature enabled.
+
+\fBIncompat\fR, or incompatible, is a feature that the file system needs to fully
+understand to read/write to the volume. Most features fall under this category.
+
+\fBRO Compat\fR, or read-only compatible, is a feature that the file system needs to
+fully understand to write to the volume. Older software can safely read a volume with
+this feature enabled. An example of this would be user and group quotas. As quotas are
+manipulated only when the file system is written to, older software can safely mount
+such volumes in read-only mode.
+
+The list of feature flags, the version of the kernel it was added in, the earliest
+version of the tools that understands it, etc., is as follows:
+
+.TS
+CENTER ALLBOX;
+LB LB LB LB LB
+LI C C C C.
+Feature Flags	Kernel Version	Tools Version	Category	Hex Value
+backup-super	All	ocfs2-tools 1.2	Compat	1
+strict-journal-super	All	All	Compat	2
+local	Linux 2.6.20	ocfs2-tools 1.2	Incompat	8
+sparse	Linux 2.6.22	ocfs2-tools 1.4	Incompat	10
+inline-data	Linux 2.6.24	ocfs2-tools 1.4	Incompat	40
+extended-slotmap	Linux 2.6.27	ocfs2-tools 1.6	Incompat	100
+xattr	Linux 2.6.29	ocfs2-tools 1.6	Incompat	200
+indexed-dirs	Linux 2.6.30	ocfs2-tools 1.6	Incompat	400
+metaecc	Linux 2.6.29	ocfs2-tools 1.6	Incompat	800
+refcount	Linux 2.6.32	ocfs2-tools 1.6	Incompat	1000
+discontig-bg	Linux 2.6.35	ocfs2-tools 1.6	Incompat	2000
+clusterinfo	Linux 2.6.37	ocfs2-tools 1.8	Incompat	4000
+unwritten	Linux 2.6.23	ocfs2-tools 1.4	RO Compat	1
+grpquota	Linux 2.6.29	ocfs2-tools 1.6	RO Compat	2
+usrquota	Linux 2.6.29	ocfs2-tools 1.6	RO Compat	4
+.TE
+.br
+
+To query the features enabled on a volume, do:
+.in +4n
+.nf
+.sp
+$ \fBo2info --fs-features /dev/sdf1\fR
+backup-super strict-journal-super sparse extended-slotmap inline-data xattr
+indexed-dirs refcount discontig-bg clusterinfo unwritten
+.fi
+.in
+
+.TP
+\fBENABLING AND DISABLING FEATURES\fR
+
+The format utility, \fBmkfs.ocfs2(8)\fR, allows a user to enable and disable specific
+features using the \fB--fs-features\fR option. The features are provided as a comma-separated
+list. The enabled features are listed as is. The disabled features are prefixed with
+\fBno\fR. The example below shows the file system being formatted with sparse disabled
+and inline-data enabled.
+
+.in +4n
+.nf
+# \fBmkfs.ocfs2 --fs-features=nosparse,inline-data /dev/sda1\fR
+.fi
+.in
+
+After formatting, users can toggle features using the tune utility, \fBtunefs.ocfs2(8)\fR.
+This is an \fIoffline\fR operation. The volume needs to be unmounted across the cluster.
+The example below shows the sparse feature being enabled and inline-data disabled.
+
+.in +4n
+.nf
+# \fBtunefs.ocfs2 --fs-features=sparse,noinline-data /dev/sda1\fR
+.fi
+.in
+
+Care should be taken before enabling and disabling features. Users planning to use a
+volume with an older version of the file system will be better off not enabling newer
+features, as disabling them later may not succeed.
+
+An example would be disabling the sparse feature; this requires filling every hole.
+The operation can only succeed if the file system has enough free space.
+
+.TP
+\fBDETECTING FEATURE INCOMPATIBILITY\fR
+
+Say one tries to mount a volume with an incompatible feature. What happens then? How
+does one detect the problem? How does one know the name of that incompatible feature?
+
+To begin with, one should look for error messages in \fBdmesg(8)\fR. Mount failures that
+are due to an incompatible feature will always result in an error message like the following:
+
+.in +4n
+.nf
+\fBERROR: couldn't mount because of unsupported optional features (200).\fR
+.fi
+.in
+
+Here the file system is unable to mount the volume due to an unsupported optional
+feature. That means the feature is an \fBIncompat\fR feature. By referring to the
+table above, one can then deduce that the user failed to mount a volume with the \fBxattr\fR
+feature enabled. (The value in the error message is in hexadecimal.)
+
+Another example of an error message due to incompatibility is as follows:
+
+.in +4n
+.nf
+\fBERROR: couldn't mount RDWR because of unsupported optional features (1).\fR
+.fi
+.in
+
+Here the file system is unable to mount the volume in the RW mode. That means the
+feature is a \fBRO Compat\fR feature. Another look at the table and it becomes
+apparent that the volume had the \fBunwritten\fR feature enabled.
+
+In both cases, the user has the option of disabling the feature. In the second case,
+the user has the choice of mounting the volume in the RO mode.
+
+.SH "GETTING STARTED"
+.PP
+The OCFS2 software is split into two components, namely, kernel and tools. The kernel
+component includes the core file system and the cluster stack, and is packaged along
+with the kernel. The tools component is packaged as \fBocfs2-tools\fR and needs to
+be specifically installed. It provides utilities to format, tune, mount, debug and
+check the file system.
+
+To install \fBocfs2-tools\fR, refer to the package management utility of your distribution.
+
+The next step is selecting a cluster stack. The options include:
+
+.in +4n
+\fBA\fR. No cluster stack, or \fBlocal mount\fR.
+
+\fBB\fR. In-kernel \fBo2cb\fR cluster stack with \fBlocal\fR or \fBglobal\fR heartbeat.
+
+\fBC\fR. Userspace cluster stacks \fBpcmk\fR or \fBcman\fR.
+.in
+
+The file system allows changing cluster stacks easily using \fBtunefs.ocfs2(8)\fR.
+To list the cluster stacks stamped on the OCFS2 volumes, do:
+
+.in +4n
+.nf
+# \fBmounted.ocfs2 -d\fR
+Device     Stack  Cluster     F  UUID                              Label
+/dev/sdb1  o2cb   webcluster  G  DCDA2845177F4D59A0F2DCD8DE507CC3  hbvol1
+/dev/sdc1  None                  23878C320CF3478095D1318CB5C99EED  localmount
+/dev/sdd1  o2cb   webcluster  G  8AB016CD59FC4327A2CDAB69F08518E3  webvol
+/dev/sdg1  o2cb   webcluster  G  77D95EF51C0149D2823674FCC162CF8B  logsvol
+/dev/sdh1  o2cb   webcluster  G  BBA1DBD0F73F449384CE75197D9B7098  scratch
+.fi
+.in
+
+.TP
+\fBA. NON-CLUSTERED OR LOCAL MOUNT\fR
+
+To format an \fIOCFS2\fR volume as a non-clustered (\fBlocal\fR) volume, do:
+
+.in +4n
+.nf
+$ \fBmkfs.ocfs2 -L "mylabel" --fs-features=local /dev/sda1\fR
+.fi
+.in
+
+To convert an existing clustered volume to a non-clustered volume, do:
+
+.in +4n
+.nf
+$ \fBtunefs.ocfs2 --fs-features=local /dev/sda1\fR
+.fi
+.in
+
+Non-clustered volumes do not interact with the cluster stack. One can have both
+clustered and non-clustered volumes mounted at the same time.
+
+While formatting a non-clustered volume, users should consider the possibility of later
+converting that volume to a clustered one. If there is a possibility of that, then the
+user should add enough node-slots using the -N option, as shown in the sketch below.
+Adding node-slots during format creates journals with large extents. If created later,
+the journals will be fragmented, which is not good for performance.
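+
+A sketch of formatting a non-clustered volume with four node-slots to ease a possible
+later conversion; the label, slot count and device are placeholders.
+
+.in +4n
+.nf
+$ \fBmkfs.ocfs2 -L "mylabel" --fs-features=local -N 4 /dev/sda1\fR
+.fi
+.in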
+
+.TP
+\fBB. CLUSTERED MOUNT WITH O2CB CLUSTER STACK\fR
+
+Only one of the two heartbeat modes can be active at any one time. Changing heartbeat
+modes is an offline operation.
+
+Both heartbeat modes require /etc/ocfs2/cluster.conf and /etc/sysconfig/o2cb to
+be populated as described in \fBocfs2.cluster.conf(5)\fR and \fBo2cb.sysconfig(5)\fR
+respectively. The only difference in set up between the two modes is that \fBglobal\fR
+requires heartbeat devices to be configured whereas \fBlocal\fR does not.
+
+Refer to \fBo2cb(7)\fR for more information.
+
+.RS
+.TP
+\fBLOCAL HEARTBEAT\fR
+This is the default heartbeat mode. The user needs to populate the configuration files
+as described in \fBocfs2.cluster.conf(5)\fR and \fBo2cb.sysconfig(5)\fR. In this mode,
+the cluster stack heartbeats on all mounted volumes. Thus, one does not have to specify
+heartbeat devices in cluster.conf.
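+
+As a point of reference, a minimal two-node /etc/ocfs2/cluster.conf might look as
+follows. This is purely illustrative; the names and addresses are placeholders,
+parameter lines are tab-indented in the actual file, and \fBocfs2.cluster.conf(5)\fR
+remains the authoritative reference.
+
+.in +4n
+.nf
+cluster:
+        node_count = 2
+        name = webcluster
+
+node:
+        ip_port = 7777
+        ip_address = 10.0.0.1
+        number = 0
+        name = node1
+        cluster = webcluster
+
+node:
+        ip_port = 7777
+        ip_address = 10.0.0.2
+        number = 1
+        name = node2
+        cluster = webcluster
+.fi
+.in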
+
+Once configured, the \fBo2cb\fR cluster stack can be onlined and offlined as follows:
+
+.in +4n
+.nf
+# \fBservice o2cb online\fR
+Setting cluster stack "o2cb": OK
+Registering O2CB cluster "webcluster": OK
+Setting O2CB cluster timeouts : OK
+
+# \fBservice o2cb offline\fR
+Clean userdlm domains: OK
+Stopping O2CB cluster webcluster: OK
+Unregistering O2CB cluster "webcluster": OK
+.fi
+.in
+
+.TP
+\fBGLOBAL HEARTBEAT\fR
+The configuration is similar to \fBlocal\fR heartbeat. The one additional step in
+this mode is that heartbeat devices also need to be configured.
+
+These heartbeat devices are OCFS2 formatted volumes with global heartbeat enabled
+on disk. These volumes can later be mounted and used as clustered file systems.
+
+The steps to format a volume with global heartbeat enabled are listed in \fBo2cb(7)\fR,
+as are the steps to list all volumes with the cluster stack stamped on disk.
+
+In this mode, the heartbeat is started when the cluster is onlined and stopped when
+the cluster is offlined.
+
+.in +4n
+.nf
+# \fBservice o2cb online\fR
+Setting cluster stack "o2cb": OK
+Registering O2CB cluster "webcluster": OK
+Setting O2CB cluster timeouts : OK
+Starting global heartbeat for cluster "webcluster": OK
+
+# \fBservice o2cb offline\fR
+Clean userdlm domains: OK
+Stopping global heartbeat on cluster "webcluster": OK
+Stopping O2CB cluster webcluster: OK
+Unregistering O2CB cluster "webcluster": OK
+.fi
+.in
+
+.in +4n
+.nf
+# \fBservice o2cb status\fR
+Driver for "configfs": Loaded
+Filesystem "configfs": Mounted
+Stack glue driver: Loaded
+Stack plugin "o2cb": Loaded
+Driver for "ocfs2_dlmfs": Loaded
+Filesystem "ocfs2_dlmfs": Mounted
+Checking O2CB cluster "webcluster": Online
+  Heartbeat dead threshold: 31
+  Network idle timeout: 30000
+  Network keepalive delay: 2000
+  Network reconnect delay: 2000
+  Heartbeat mode: Global
+Checking O2CB heartbeat: Active
+  77D95EF51C0149D2823674FCC162CF8B /dev/sdg1
+Nodes in O2CB cluster: 92 96
+.fi
+.in
+
+.RE
+
+.TP
+\fBC. CLUSTERED MOUNT WITH USERSPACE CLUSTER STACK\fR
+
+Configure and online the userspace stack \fBpcmk\fR or \fBcman\fR before using
+\fBtunefs.ocfs2(8)\fR to update the cluster stack on disk.
+
+.in +4n
+.nf
+# \fBtunefs.ocfs2 --update-cluster-stack /dev/sdd1\fR
+Updating on-disk cluster information to match the running cluster.
+DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS
+FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
+Update the on-disk cluster information? y
+.fi
+.in
+
+Refer to the cluster stack documentation for information on starting and stopping
+the cluster stack.
+
+.SH "FILE SYSTEM UTILITIES"
+.PP
+This section lists the utilities that are used to manage \fIOCFS2\fR file systems.
+This includes tools to format, tune, check, mount and debug the file system. Each utility
+has a man page that lists its capabilities in detail.
+
+.TP
+\fBmkfs.ocfs2(8)\fR
+This is the file system \fIformat\fR utility. All volumes have to be formatted prior to
+their use. As this utility overwrites the volume, use it with care. Double-check to ensure
+the volume is not in use on any node in the cluster.
+
+As a precaution, the utility will abort if the volume is locally mounted. It also
+detects OCFS2 use across the cluster. But these checks are not comprehensive
+and can be overridden, so use it with care.
+
+While it is not always required, the cluster should be online.
+
+.TP
+\fBtunefs.ocfs2(8)\fR
+This is the file system \fItune\fR utility. It allows users to change certain on-disk
+parameters like label, uuid, number of node-slots, volume size and the size of the
+journals. It also allows turning on and off the file system features as listed above.
+
+This utility requires the cluster to be online.
+
+.TP
+\fBfsck.ocfs2(8)\fR
+This is the file system \fIcheck\fR utility. It detects and fixes on-disk errors. All the
+check codes and their fixes are listed in \fBfsck.ocfs2.checks(8)\fR.
+
+This utility requires the cluster to be online to ensure the volume is not in use on
+another node and to prevent the volume from being mounted for the duration of the check.
+
+.TP
+\fBmount.ocfs2(8)\fR
+This is the file system \fImount\fR utility. It is invoked indirectly by the \fBmount(8)\fR
+utility.
+
+This utility detects the cluster status and aborts if the cluster is offline or does
+not match the cluster stamped on disk.
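+
+For example, a clustered volume could be listed in /etc/fstab as follows. The
+\fB_netdev\fR option delays the mount until networking, and hence the cluster stack,
+is available; the device and mount point are placeholders.
+
+.in +4n
+.nf
+/dev/sdb1  /u01  ocfs2  _netdev,defaults  0  0
+.fi
+.in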
+
+.TP
+\fBo2info(1)\fR
+This is the file system \fIinformation\fR utility. It provides information like the features
+enabled on disk, block size, cluster size, free space fragmentation, etc.
+
+It can be used by both privileged and non-privileged users. Users having read permission
+on the device can provide the path to the device. Other users can provide the path to a
+file on a mounted file system.
+
+.TP
+\fBdebugfs.ocfs2(8)\fR
+This is the file system \fIdebug\fR utility. It allows users to examine all file system
+structures including walking directory structures, displaying inodes, backing up files,
+etc., without mounting the file system.
+
+This utility requires the user to have read permission on the device.
+
+.TP
+\fBo2image(8)\fR
+This is the file system \fIimage\fR utility. It allows users to copy the file system metadata
+skeleton, including the inodes, directories, bitmaps, etc. As it excludes data, the
+image file is tremendously smaller than the original volume.
+
+The image file created can be used in debugging on-disk corruptions.
+
+.TP
+\fBmounted.ocfs2(8)\fR
+This is the file system \fIdetect\fR utility. It detects all \fIOCFS2\fR volumes in the
+system and lists their labels, uuids and cluster stacks.
+
+.SH "O2CB CLUSTER STACK UTILITIES"
+.PP
+This section lists the utilities that are used to manage the \fIO2CB\fR cluster stack.
+Each utility has a man page that lists its capabilities in detail.
+.TP
+\fBo2cb(8)\fR
+This is the cluster \fIconfiguration\fR utility. It allows users to update the cluster
+configuration by adding and removing nodes and heartbeat regions. This utility is used
+by the \fIo2cb\fR init script to online and offline the cluster.
+
+This is a \fBnew\fR utility and replaces \fBo2cb_ctl(8)\fR which has been deprecated.
+
+.TP
+\fBocfs2_hb_ctl(8)\fR
+This is the cluster heartbeat utility. It allows users to start and stop \fBlocal\fR
+heartbeat. This utility is invoked by \fBmount.ocfs2(8)\fR and should not be invoked
+directly by the user.
+
+.TP
+\fBo2hbmonitor(8)\fR
+This is the disk heartbeat monitor. It tracks the elapsed time since the last heartbeat
+and logs warnings once that time exceeds the warn threshold.
+
+.SH "FILE SYSTEM NOTES"
+.PP
+This section includes some useful notes that may prove helpful to the user.
+.TP
+\fBBALANCED CLUSTER\fR
+A cluster is a computer. This is a fact and not a slogan. What this means is that an errant
+node in the cluster can affect the behavior of other nodes. If one node is slow, the cluster
+operations will slow down on all nodes. To prevent that, it is best to have a balanced
+cluster. This is a cluster that has equally powered and loaded nodes.
+
+The standard recommendation for such clusters is to have identical hardware and
+software across all the nodes. However, that is not a hard and fast rule. After all,
+we have taken the effort to ensure that OCFS2 works in a mixed architecture environment.
+
+If one uses OCFS2 in a mixed architecture environment, try to ensure that the nodes are
+equally powered and loaded. The use of a load balancer can assist with the latter. Power
+refers to the number of processors, speed, amount of memory, I/O throughput, network
+bandwidth, etc. In reality, having equally powered heterogeneous nodes is not always
+practical. In that case, make the lower node numbers more powerful than the higher
+node numbers. The O2CB cluster stack favors lower node numbers in all of its tiebreaking logic.
+
+This is not to suggest you should add a single core node in a cluster of quad cores. No
+amount of node number juggling will help you there.
+
+.TP
+\fBFILE DELETION\fR
+In Linux, rm(1) removes the directory entry. It does not necessarily delete the corresponding
+inode. By removing the directory entry, it gives the illusion that the inode has been deleted.
+This puzzles users when they do not see a corresponding up-tick in the reported free space.
+The reason is that inode deletion has a few more hurdles to cross.
+
+First is the hard link count. This indicates the number of directory entries pointing to that
+inode. As long as a directory entry is linked to that inode, it cannot be deleted. The file
+system has to wait for that count to drop to zero.
+
+The second hurdle is the POSIX semantics allowing files to be unlinked even while they are
+in use. In OCFS2, that translates to in use across the cluster. The file system has to wait
+for all processes across the cluster to stop using the inode.
+
+Once these two conditions are met, the inode is deleted and the freed bits are flushed to
+disk on the next sync.
+
+This assumes that the inode was not reflinked. If it was, then the deletion would only
+release space that was private to the inode. Shared space would only be released when
+the last inode using it is deleted.
+
+Users interested in following the trail can use debugfs.ocfs2(8) to view the node specific
+system files orphan_dir and truncate_log. Once the link count is zero, an inode is moved
+to the orphan_dir. After deletion, the freed bits are added to the truncate_log, where
+they remain until the next sync, during which the bits are flushed to the global bitmap.
+
+.TP
+\fBDIRECTORY LISTING\fR
+ls(1) may be a simple command, but it is not cheap. What is expensive is not the part
+where it reads the directory listing, but the second part where it reads all the inodes, also
+referred to as an inode stat(2). If the inodes are not in cache, this can entail disk I/O.
+Now, while a cold cache inode stat(2) is expensive in all file systems, it is especially so in
+a clustered file system. It needs to take a cluster lock on each inode, pure overhead when
+compared to a local file system.
+
+A hot cache stat(2), on the other hand, has been shown to perform on OCFS2 like it does on
+EXT3.
+
+In other words, the second ls(1) will be quicker than the first. However, it is not
+guaranteed. Say you have a million files in a file system and not enough kernel memory
+to cache all the inodes. In that case, each ls(1) will involve some cold cache stat(2)s.
+
+.TP
+\fBALLOCATION RESERVATION\fR
+Allocation reservation allows multiple concurrently extending files to grow as contiguously
+as possible. One way to demonstrate its functioning is to run a script that extends
+multiple files in a circular order. The script below does that by writing one hundred
+4KB chunks to four files, one after another.
+
+.in +4n
+.nf
+$ for i in $(seq 0 99);
+> do
+>   for j in $(seq 4);
+>   do
+>     dd if=/dev/zero of=file$j bs=4K count=1 seek=$i;
+>   done;
+> done;
+.fi
+.in
+
+When run on a system running Linux kernel 2.6.34 or earlier, we end up with files with
+100 extents each. That is full fragmentation. As the files are being extended one after
+another, the on-disk allocations are fully interleaved.
+
+.in +4n
+.nf
+$ \fBfilefrag file1 file2 file3 file4\fR
+file1: 100 extents found
+file2: 100 extents found
+file3: 100 extents found
+file4: 100 extents found
+.fi
+.in
+
+When run on a system running Linux kernel 2.6.35 or later, we see files with 7 extents
+each. That is a lot fewer than before. Fewer extents mean more on-disk contiguity and
+that always leads to better overall performance.
+
+.in +4n
+.nf
+$ \fBfilefrag file1 file2 file3 file4\fR
+file1: 7 extents found
+file2: 7 extents found
+file3: 7 extents found
+file4: 7 extents found
+.fi
+.in
+
+.TP
+\fBREFLINK OPERATION\fR
+This feature allows a user to create a writeable snapshot of a regular file. In this
+operation, the file system creates a new inode with the same extent pointers as the
+original inode. Multiple inodes are thus able to share data extents. This adds a twist
+in file system administration because none of the existing file system utilities in
+Linux expect this behavior. du(1), a utility used to compute file space usage,
+simply adds the blocks allocated to each inode. As it does not know about shared
+extents, it overestimates the space used. Say we have a 5GB file in a volume having
+42GB free.
+
+.in +4n
+.nf
+$ \fBls -l\fR
+total 5120000
+-rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:15 myfile
+
+$ \fBdu -m myfile*\fR
+5000    myfile
+
+$ \fBdf -h .\fR
+Filesystem            Size  Used Avail Use% Mounted on
+/dev/sdd1             50G   8.2G   42G  17% /ocfs2
+.fi
+.in
+
+If we were to reflink it 4 times, we would expect the directory listing to report five 5GB
+files, but df(1) to report no loss of available space. du(1), on the other hand, would
+report the disk usage climbing to 25GB.
+
+.in +4n
+.nf
+$ \fBreflink myfile myfile-ref1\fR
+$ \fBreflink myfile myfile-ref2\fR
+$ \fBreflink myfile myfile-ref3\fR
+$ \fBreflink myfile myfile-ref4\fR
+
+$ \fBls -l\fR
+total 25600000
+-rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:15 myfile
+-rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref1
+-rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref2
+-rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref3
+-rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref4
+
+$ \fBdf -h .\fR
+Filesystem            Size  Used Avail Use% Mounted on
+/dev/sdd1             50G   8.2G   42G  17% /ocfs2
+
+$ \fBdu -m myfile*\fR
+5000    myfile
+5000    myfile-ref1
+5000    myfile-ref2
+5000    myfile-ref3
+5000    myfile-ref4
+25000 total
+.fi
+.in
+
+Enter \fBshared-du(1)\fR, a shared extent-aware du. This utility reports the shared
+extents per file in parentheses and the overall footprint. As expected, it lists the
+overall footprint at 5GB.
+
+.in +4n
+.nf
+$ \fBshared-du -m -c --shared-size myfile*\fR
+5000    (5000)  myfile
+5000    (5000)  myfile-ref1
+5000    (5000)  myfile-ref2
+5000    (5000)  myfile-ref3
+5000    (5000)  myfile-ref4
+25000 total
+5000 footprint
+.fi
+.in
+
+This utility is available at http://oss.oracle.com/~smushran/reflink-tools/. Also
+available is a shared extent-aware filefrag utility that lists the location of the extents on
+the volume.
+
+We are currently in the process of pushing the changes to the upstream maintainers of
+these utilities.
+
+.in +4n
+.nf
+# \fBshared-filefrag -v myfile\fR
+Filesystem type is: 7461636f
+File size of myfile is 5242880000 (1280000 blocks, blocksize 4096)
+ext logical physical expected length flags
+0         0  2247937            8448
+1      8448  2257921  2256384  30720
+2     39168  2290177  2288640  30720
+3     69888  2322433  2320896  30720
+4    100608  2354689  2353152  30720
+7    192768  2451457  2449920  30720
+ . . .
+37  1073408  2032129  2030592  30720 shared
+38  1104128  2064385  2062848  30720 shared
+39  1134848  2096641  2095104  30720 shared
+40  1165568  2128897  2127360  30720 shared
+41  1196288  2161153  2159616  30720 shared
+42  1227008  2193409  2191872  30720 shared
+43  1257728  2225665  2224128  22272 shared,eof
+myfile: 44 extents found
+.fi
+.in
+
+.TP
+\fBDATA COHERENCY\fR
+One of the challenges in a shared file system is data coherency when multiple nodes are
+writing to the same set of files. NFS, for example, provides close-to-open data coherency
+that results in the data being flushed to the server when the file is closed on the client.
+This leaves open a wide window for stale data being read on another node.
+
+A simple test to check the data coherency of a shared file system involves concurrently
+appending the same file. Like running "uname -a >>/dir/file" using a parallel distributed
+shell like dsh or pconsole. If coherent, the file will contain the results from all nodes.
+
+.in +4n
+.nf
+# \fBdsh -R ssh -w node32,node33,node34,node35 "uname -a >> /ocfs2/test"\fR
+# \fBcat /ocfs2/test\fR
+Linux node32 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
+Linux node35 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
+Linux node33 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
+Linux node34 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
+.fi
+.in
+
+OCFS2 is a \fBfully cache coherent\fR cluster file system.
+
+.TP
+\fBDISCONTIGUOUS BLOCK GROUP\fR
+Most file systems pre-allocate space for inodes during format. OCFS2 dynamically
+allocates this space when required.
+
+However, this dynamic allocation has been problematic when the free space is very
+fragmented, because the file system required the inode and extent allocators to
+grow in contiguous fixed-size chunks.
+
+The discontiguous block group feature takes care of this problem by allowing the
+allocators to grow in smaller, variable-sized chunks.
+
+This feature was added in Linux kernel \fB2.6.35\fR and requires enabling on-disk
+feature \fBdiscontig-bg\fR.
+
+.TP
+\fBBACKUP SUPER BLOCKS\fR
+A file system super block stores critical information that is hard to recreate.
+In OCFS2, it stores the block size, cluster size, and the locations of the root and
+system directories, among other things. As this block is close to the start of the
+disk, it is very susceptible to being overwritten by an errant write.
+Say, dd if=file of=/dev/sda1.
+
+Backup super blocks are copies of the super block. These blocks are dispersed in the
+volume to minimize the chances of being overwritten. On the small chance that the
+original gets corrupted, the backups are available to scan and fix the corruption.
+
+\fBmkfs.ocfs2(8)\fR enables this feature by default. Users can disable this by
+specifying \fB--fs-features=nobackup-super\fR during format.
+
+\fBo2info(1)\fR can be used to view whether the feature has been enabled on a device.
+
+.in +4n
+.nf
+# \fBo2info --fs-features /dev/sdb1\fR
+backup-super strict-journal-super sparse extended-slotmap inline-data xattr
+indexed-dirs refcount discontig-bg clusterinfo unwritten
+.fi
+.in
+
+In OCFS2, the super block is on the third block. The backups are located at the \fB1G,
+4G, 16G, 64G, 256G and 1T\fR byte offsets. The actual number of backup blocks depends
+on the size of the device. The super block is not backed up on devices smaller than 1GB.
+
+\fBfsck.ocfs2(8)\fR refers to these six offsets by numbers, 1 to 6. Users can specify
+any backup with the -r option to recover the volume. The example below uses the second
+backup. If successful, \fBfsck.ocfs2(8)\fR overwrites the corrupted super block with
+the backup.
+
+.in +4n
+.nf
+# \fBfsck.ocfs2 -f -r 2 /dev/sdb1\fR
+fsck.ocfs2 1.8.0
+[RECOVER_BACKUP_SUPERBLOCK] Recover superblock information from backup block#1048576? <n> y
+Checking OCFS2 filesystem in /dev/sdb1:
+  Label:              webhome
+  UUID:               B3E021A2A12B4D0EB08E9E986CDC7947
+  Number of blocks:   13107196
+  Block size:         4096
+  Number of clusters: 13107196
+  Cluster size:       4096
+  Number of slots:    8
+
+/dev/sdb1 was run with -f, check forced.
+Pass 0a: Checking cluster allocation chains
+Pass 0b: Checking inode allocation chains
+Pass 0c: Checking extent block allocation chains
+Pass 1: Checking inodes and blocks.
+Pass 2: Checking directory entries.
+Pass 3: Checking directory connectivity.
+Pass 4a: checking for orphaned inodes
+Pass 4b: Checking inodes link counts.
+All passes succeeded.
+.fi
+.in
+
+.TP
+\fBSYNTHETIC FILE SYSTEMS\fR
+The OCFS2 development effort included two synthetic file systems, configfs and dlmfs. It
+also makes use of a third, debugfs.
+
+.RS
+.TP
+\fBconfigfs\fR
+configfs has since been accepted as a generic kernel component and is also used by
+netconsole and fs/dlm. OCFS2 tools use it to communicate the list of nodes in the
+cluster, details of the heartbeat device, cluster timeouts, and so on to the in-kernel
+node manager. The o2cb init script mounts this file system at /sys/kernel/config.
+
+.TP
+\fBdlmfs\fR
+dlmfs exposes the in-kernel o2dlm to the user-space. While it was developed
+primarily for OCFS2 tools, it has seen usage by others looking to add a cluster
+locking dimension in their applications. Users interested in doing the same should
+look at the libo2dlm library provided by ocfs2-tools. The o2cb init script mounts this
+file system at /dlm.
+
+.TP
+\fBdebugfs\fR
+OCFS2 uses debugfs to expose its in-kernel information to user space. For example,
+listing the file system cluster locks, dlm locks, dlm state, o2net state, etc. Users can
+access the information by mounting the file system at /sys/kernel/debug. To automount,
+add the following to /etc/fstab:
+.in +4n
+.nf
+debugfs /sys/kernel/debug debugfs defaults 0 0
+.fi
+.in
+.RE
+
+.TP
+\fBDISTRIBUTED LOCK MANAGER\fR
+One of the key technologies in a cluster is the lock manager, which maintains the locking
+state of all resources across the cluster. An easy implementation of a lock manager
+involves designating one node to handle everything. In this model, if a node wanted to
+acquire a lock, it would send the request to the lock manager. However, this model has a
+weakness: the lock manager's death causes the cluster to seize up.
+
+A better model is one where all nodes manage a subset of the lock resources. Each node
+maintains enough information for all the lock resources it is interested in. On event
+of a node death, the remaining nodes pool in the information to reconstruct the lock
+state maintained by the dead node. In this scheme, the locking overhead is distributed
+amongst all the nodes. Hence, the term distributed lock manager.
+
+O2DLM is a distributed lock manager. It is based on the specification titled "Programming
+Locking Applications" written by Kristin Thomas and is available at the following link.
+http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
+
+.TP
+\fBDLM DEBUGGING\fR
+O2DLM has a rich debugging infrastructure that allows it to show the state of the lock
+manager, all the lock resources, among other things.
+The figure below shows the dlm state of a nine-node cluster that has just lost three
+nodes: 12, 32, and 35. It can be ascertained that node 7, the recovery master, is
+currently recovering node 12 and has received the lock states of the dead node from all
+other live nodes.
+
+.in +4n
+.nf
+# \fBcat /sys/kernel/debug/o2dlm/45F81E3B6F2B48CCAAD1AE7945AB2001/dlm_state\fR
+Domain: 45F81E3B6F2B48CCAAD1AE7945AB2001  Key: 0x10748e61
+Thread Pid: 24542  Node: 7  State: JOINED
+Number of Joins: 1  Joining Node: 255
+Domain Map: 7 31 33 34 40 50
+Live Map: 7 31 33 34 40 50
+Lock Resources: 48850 (439879)
+MLEs: 0 (1428625)
+  Blocking: 0 (1066000)
+  Mastery: 0 (362625)
+  Migration: 0 (0)
+Lists: Dirty=Empty  Purge=Empty  PendingASTs=Empty  PendingBASTs=Empty
+Purge Count: 0  Refs: 1
+Dead Node: 12
+Recovery Pid: 24543  Master: 7  State: ACTIVE
+Recovery Map: 12 32 35
+Recovery Node State:
+        7 - DONE
+        31 - DONE
+        33 - DONE
+        34 - DONE
+        40 - DONE
+        50 - DONE
+.fi
+.in
+
+The figure below shows the state of a dlm lock resource that is mastered (owned) by
+node 25, with 6 locks in the granted queue and node 26 holding the EX (writelock) lock
+on that resource.
+
+.in +4n
+.nf
+# \fBdebugfs.ocfs2 -R "dlm_locks M000000000000000022d63c00000000" /dev/sda1\fR
+Lockres: M000000000000000022d63c00000000   Owner: 25    State: 0x0
+Last Used: 0      ASTs Reserved: 0    Inflight: 0    Migration Pending: No
+Refs: 8    Locks: 6    On Lists: None
+Reference Map: 26 27 28 94 95
+ Lock-Queue  Node  Level  Conv  Cookie           Refs  AST  BAST  Pending-Action
+ Granted     94    NL     -1    94:3169409       2     No   No    None
+ Granted     28    NL     -1    28:3213591       2     No   No    None
+ Granted     27    NL     -1    27:3216832       2     No   No    None
+ Granted     95    NL     -1    95:3178429       2     No   No    None
+ Granted     25    NL     -1    25:3513994       2     No   No    None
+ Granted     26    EX     -1    26:3512906       2     No   No    None
+.fi
+.in
+
+The figure below shows a lock from the file system perspective. Specifically, it shows a
+lock that is in the process of being upconverted from NL to EX. Locks in this state
+are referred to in the file system as busy locks and can be listed using the debugfs.ocfs2
+command, "fs_locks -B".
+
+.in +4n
+.nf
+# \fBdebugfs.ocfs2 -R "fs_locks -B" /dev/sda1\fR
+Lockres: M000000000000000000000b9aba12ec  Mode: No Lock
+Flags: Initialized Attached Busy
+RO Holders: 0  EX Holders: 0
+Pending Action: Convert  Pending Unlock Action: None
+Requested Mode: Exclusive  Blocking Mode: No Lock
+PR > Gets: 0  Fails: 0    Waits Total: 0us  Max: 0us  Avg: 0ns
+EX > Gets: 1  Fails: 0    Waits Total: 544us  Max: 544us  Avg: 544185ns
+Disk Refreshes: 1
+.fi
+.in
+
+With this debugging infrastructure in place, users can debug hang issues as follows:
+
+.in +4n
+* Dump the busy fs locks for all the OCFS2 volumes on the node with hanging
+processes. If no locks are found, then the problem is not related to O2DLM.
+
+* Dump the corresponding dlm lock for all the busy fs locks. Note down the
+owner (master) of all the locks.
+
+* Dump the dlm locks on the master node for each lock.
+.in
+
+At this stage, one should note that the hanging node is waiting to get an AST from the
+master. The master, on the other hand, cannot send the AST until the current holder has
+down converted that lock, which it will do upon receiving a Blocking AST. However, a
+node can only down convert if all the lock holders have stopped using that lock.
+After dumping the dlm lock on the master node, identify the current lock holder and
+dump both the dlm and fs locks on that node.
+
+The trick here is to see whether the Blocking AST message has been relayed to file
+system. If not, the problem is in the dlm layer. If it has, then the most common reason
+would be a lock holder, the count for which is maintained in the fs lock.
+
+At this stage, printing the list of processes helps.
+
+.in +4n
+.nf
+$ \fBps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN\fR
+.fi
+.in
+
+Make a note of all D state processes. At least one of them is responsible for the hang on
+the first node.
+
+The challenge then is to figure out why those processes are hanging. Failing that, at
+least get enough information (like alt-sysrq t output) for the kernel developers to review.
+What to do next depends on where the process is hanging. If it is waiting for the I/O to
+complete, the problem could be anywhere in the I/O subsystem, from the block device
+layer through the drivers to the disk array. If the hang concerns a user lock (flock(2)),
+the problem could be in the user's application. A possible solution could be to kill the
+holder. If the hang is due to tight or fragmented memory, free up some memory by
+killing non-essential processes.
+
+The thing to note is that the symptom for the problem was on one node but the cause is
+on another. The issue can only be resolved on the node holding the lock. Sometimes, the
+best solution will be to reset that node. Once the node is reset, the O2DLM recovery process will
+clear all locks owned by the dead node and let the cluster continue to operate. As harsh
+as that sounds, at times it is the only solution. The good news is that, by following the
+trail, you now have enough information to file a bug and get the real issue resolved.
+
+.TP
+\fBNFS EXPORTING\fR
+OCFS2 volumes can be exported as NFS volumes. This support is limited to NFS version
+3, which translates to Linux kernel version 2.4 or later.
+
+If the version of the Linux kernel on the system exporting the volume is older than
+\fB2.6.30\fR, then the NFS clients must mount the volumes using the \fInordirplus\fR
+mount option. This disables the READDIRPLUS RPC call to work around a bug in NFSD,
+detailed in the following link:
+
+.in +4n
+.nf
+http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
+.fi
+.in
+
+Users running NFS version 2 can export the volume after having disabled subtree checking
+(mount option no_subtree_check). Be warned, disabling the check has security implications
+(documented in the exports(5) man page) that users must evaluate on their own.
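+
+A sketch of such an /etc/exports entry with subtree checking disabled; the export
+path and client specification are placeholders.
+
+.in +4n
+.nf
+/ocfs2  *(rw,no_subtree_check)
+.fi
+.in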
+
+.TP
+\fBFILE SYSTEM LIMITS\fR
+OCFS2 has no intrinsic limit on the total number of files and directories in the file
+system. In general, it is only limited by the size of the device. But there is one limit
+imposed by the current file system. It can address at most four billion clusters. A file
+system with a 1MB cluster size can go up to 4PB, while a file system with a 4KB cluster size
+can address up to 16TB.
+
+.TP
+\fBSYSTEM OBJECTS\fR
+The OCFS2 file system stores its internal meta-data, including bitmaps, journals, etc., as
+system files. These are grouped in a system directory. These files and directories are not
+accessible via the file system interface but can be viewed using the \fBdebugfs.ocfs2(8)\fR
+tool.
+
+To list the system directory (referred to as double-slash), do:
+
+.in +4n
+.nf
+# \fBdebugfs.ocfs2 -R "ls -l //" /dev/sde1\fR
+        66     drwxr-xr-x  10  0  0         3896 19-Jul-2011 13:36 .
+        66     drwxr-xr-x  10  0  0         3896 19-Jul-2011 13:36 ..
+        67     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 bad_blocks
+        68     -rw-r--r--   1  0  0      1179648 19-Jul-2011 13:36 global_inode_alloc
+        69     -rw-r--r--   1  0  0         4096 19-Jul-2011 14:35 slot_map
+        70     -rw-r--r--   1  0  0      1048576 19-Jul-2011 13:36 heartbeat
+        71     -rw-r--r--   1  0  0  53686960128 19-Jul-2011 13:36 global_bitmap
+        72     drwxr-xr-x   2  0  0         3896 25-Jul-2011 15:05 orphan_dir:0000
+        73     drwxr-xr-x   2  0  0         3896 19-Jul-2011 13:36 orphan_dir:0001
+        74     -rw-r--r--   1  0  0      8388608 19-Jul-2011 13:36 extent_alloc:0000
+        75     -rw-r--r--   1  0  0      8388608 19-Jul-2011 13:36 extent_alloc:0001
+        76     -rw-r--r--   1  0  0    121634816 19-Jul-2011 13:36 inode_alloc:0000
+        77     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 inode_alloc:0001
+        78     -rw-r--r--   1  0  0    268435456 19-Jul-2011 13:36 journal:0000
+        79     -rw-r--r--   1  0  0    268435456 19-Jul-2011 13:37 journal:0001
+        80     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 local_alloc:0000
+        81     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 local_alloc:0001
+        82     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 truncate_log:0000
+        83     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 truncate_log:0001
+.fi
+.in
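+
+Individual system files can be examined too. For example, to view the inode of
+the first journal (a sketch, using the same device):
+
+.in +4n
+.nf
+# \fBdebugfs.ocfs2 -R "stat //journal:0000" /dev/sde1\fR
+.fi
+.in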
+
+The file names that end with numbers are slot specific and are referred to as node-local
+system files. The set of node-local files used by a node can be determined from the slot
+map. To list the slot map, do:
+
+.in +4n
+.nf
+# \fBdebugfs.ocfs2 -R "slotmap" /dev/sde1\fR
+    Slot#    Node#
+        0       32
+        1       35
+        2       40
+        3       31
+        4       34
+        5       33
+.fi
+.in
+
+For more information, refer to the OCFS2 support guides available in the Documentation
+section at http://oss.oracle.com/projects/ocfs2.
+
+.TP
+\fBHEARTBEAT, QUORUM, AND FENCING\fR
+Heartbeat is an essential component in any cluster. It is charged with accurately
+designating nodes as dead or alive. A mistake here could lead to a cluster hang or
+corruption.
+
+\fIo2hb\fR is the disk heartbeat component of \fBo2cb\fR. It periodically updates a
+timestamp on disk, indicating to others that this node is alive. It also reads all the
+timestamps to identify other live nodes. Other cluster components, like \fIo2dlm\fR
+and \fIo2net\fR, use the \fIo2hb\fR service to get node up and down events.
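+
+The on-disk heartbeat can be inspected with \fBdebugfs.ocfs2(8)\fR (a sketch,
+using the device from the examples above):
+
+.in +4n
+.nf
+# \fBdebugfs.ocfs2 -R "hb" /dev/sde1\fR
+.fi
+.in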
+
+The quorum is the group of nodes in a cluster that is allowed to operate on the shared
+storage. When there is a failure in the cluster, nodes may be split into groups that can
+communicate in their groups and with the shared storage but not between groups.
+\fIo2quo\fR determines which group is allowed to continue and initiates fencing of
+the other group(s).
+
+Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2
+mounted will fence itself when it realizes that it does not have quorum in a degraded
+cluster. It does this so that other nodes won't be stuck trying to access its resources.
+
+\fBo2cb\fR uses a machine reset to fence. This is the quickest route for the node to
+rejoin the cluster.
+
+.TP
+\fBPROCESSES\fR
+
+.RS
+.TP
+\fB[o2net]\fR
+One per node. It is a work-queue thread started when the cluster is brought on-line
+and stopped when it is off-lined. It handles network communication for all mounts.
+It gets the list of active nodes from O2HB and sets up a TCP/IP communication
+channel with each live node. It sends regular keep-alive packets to detect any
+interruption on the channels.
+
+.TP
+\fB[user_dlm]\fR
+One per node. It is a work-queue thread started when dlmfs is loaded and stopped
+when it is unloaded (dlmfs is a synthetic file system that allows user space
+processes to access the in-kernel dlm).
+
+.TP
+\fB[ocfs2_wq]\fR
+One per node. It is a work-queue thread started when the OCFS2 module is loaded
+and stopped when it is unloaded. It is assigned background file system tasks that
+may take cluster locks, like flushing the truncate log, orphan directory recovery and
+local alloc recovery. For example, orphan directory recovery runs in the background
+so that it does not affect recovery time.
+
+.TP
+\fB[o2hb-14C29A7392]\fR
+One per heartbeat device. It is a kernel thread started when the heartbeat region is
+populated in configfs and stopped when it is removed. It writes every two seconds
+to a block in the heartbeat region, indicating that this node is alive. It also reads the
+region to maintain a map of live nodes. It notifies subscribers like o2net and o2dlm
+of any changes in the live node map.
+
+.TP
+\fB[ocfs2dc]\fR
+One per mount. It is a kernel thread started when a volume is mounted and stopped
+when it is unmounted. It downgrades locks in response to blocking ASTs (BASTs)
+requested by other nodes.
+
+.TP
+\fB[jbd2/sdf1-97]\fR
+One per mount. It is part of JBD2, which OCFS2 uses for journaling.
+
+.TP
+\fB[ocfs2cmt]\fR
+One per mount. It is a kernel thread started when a volume is mounted and stopped
+when it is unmounted. It works with kjournald2.
+
+.TP
+\fB[ocfs2rec]\fR
+It is started whenever a node has to be recovered. This thread performs file system
+recovery by replaying the journal of the dead node. It is scheduled to run after dlm
+recovery has completed.
+
+.TP
+\fB[dlm_thread]\fR
+One per dlm domain. It is a kernel thread started when a dlm domain is created and
+stopped when it is destroyed. This thread sends ASTs and blocking ASTs in response
+to lock level convert requests. It also frees unused lock resources.
+
+.TP
+\fB[dlm_reco_thread]\fR
+One per dlm domain. It is a kernel thread that handles dlm recovery when another
+node dies. If this node is the dlm recovery master, it re-masters every lock resource
+owned by the dead node.
+
+.TP
+\fB[dlm_wq]\fR
+One per dlm domain. It is a work-queue thread that o2dlm uses to queue blocking
+tasks.
+.RE
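+
+The threads described above can be spotted with a simple listing (a sketch;
+exact names vary by kernel version and mount):
+
+.in +4n
+.nf
+$ \fBps -e -o pid,comm | grep -E 'o2net|o2hb|ocfs2|dlm'\fR
+.fi
+.in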
+
+.TP
+\fBFUTURE WORK\fR
+File system development is a never-ending cycle. Faster and larger disks, faster and
+more numerous processors, larger caches, etc., keep changing the sweet spot for
+performance, forcing developers to rethink long-held beliefs. Add to that new use
+cases, which force developers to be innovative in providing solutions that meld
+seamlessly with existing semantics.
+
+We are currently looking to add features like transparent compression, transparent
+encryption, delayed allocation, multi-device support, etc. as well as work on improving
+performance on newer generation machines.
+
+If you are interested in contributing, email the development team at ocfs2-devel at oss.oracle.com.
+
+.SH "ACKNOWLEDGEMENTS"
+.PP
+The principal developers of the OCFS2 file system, its tools and the O2CB cluster stack,
+are \fBJoel Becker\fR, \fBZach Brown\fR, \fBMark Fasheh\fR, \fBJan Kara\fR, \fBKurt Hackel\fR,
+\fBTao Ma\fR, \fBSunil Mushran\fR, \fBTiger Yang\fR and \fBTristan Ye\fR.
+
+Other developers who have contributed to the file system via bug fixes, testing, etc.
+are \fBWim Coekaerts\fR, \fBSrinivas Eeda\fR, \fBColy Li\fR, \fBJeff Mahoney\fR,
+\fBMarcos Matsunaga\fR, \fBGoldwyn Rodrigues\fR, \fBManish Singh\fR and \fBWengang Wang\fR.
+
+The members of the Linux Cluster community including \fBAndrew Beekhof\fR,
+\fBLars Marowsky-Bree\fR, \fBFabio Massimo Di Nitto\fR and \fBDavid Teigland\fR.
+
+The members of the Linux File system community including \fBChristoph Hellwig\fR and
+\fBChris Mason\fR.
+
+The corporations that have contributed resources for this project including \fBOracle\fR,
+\fBSUSE Labs\fR, \fBEMC\fR, \fBEmulex\fR, \fBHP\fR, \fBIBM\fR, \fBIntel\fR and
+\fBNetwork Appliance\fR.
+
+.SH "SEE ALSO"
+.BR debugfs.ocfs2 (8),
+.BR fsck.ocfs2 (8),
+.BR fsck.ocfs2.checks (8),
+.BR mkfs.ocfs2 (8),
+.BR mount.ocfs2 (8),
+.BR mounted.ocfs2 (8),
+.BR o2image (8),
+.BR o2info (1),
+.BR o2cb (7),
+.BR o2cb (8),
+.BR o2cb.sysconfig (5),
+.BR o2hbmonitor (8),
+.BR ocfs2.cluster.conf (5),
+.BR tunefs.ocfs2 (8)
+
+.SH "AUTHOR"
+Oracle Corporation
+
+.SH "COPYRIGHT"
+Copyright \(co 2004, 2011 Oracle. All rights reserved.
diff --git a/vendor/common/ocfs2-tools.spec-generic.in b/vendor/common/ocfs2-tools.spec-generic.in
index 9117e2d..37e39bb 100644
--- a/vendor/common/ocfs2-tools.spec-generic.in
+++ b/vendor/common/ocfs2-tools.spec-generic.in
@@ -148,6 +148,7 @@ fi
 /usr/share/man/man8/o2hbmonitor.8.gz
 /usr/share/man/man5/ocfs2.cluster.conf.5.gz
 /usr/share/man/man5/o2cb.sysconfig.5.gz
+/usr/share/man/man7/ocfs2.7.gz
 
 
 %if %{build_ocfs2console}
-- 
1.7.4.1