New Slot Map Format
Joel Becker, December 2007
Introduction
Work is underway to use ocfs2 with userspace cluster stacks. However, all userspace cluster stacks support node numbers greater than 32766 (INT16_MAX-1), the maximum node number the current slot map can handle. This design document specifies a new slot map format that will support larger node numbers. At the same time, it will remove other limitations of the current design.
Current Limitations
There are three main limitations to the current slot map design.
- A slot entry cannot specify a node number larger than 32766. In practice, o2nm doesn't even allow a node number greater than 254, but the slot map format allows up to INT16_MAX-1. The userspace cluster stacks can have node numbers up to UINT32_MAX.
- A slot entry is marked invalid via a magic value. If the entry is set to -1 (int16 size, so 0xFFFF), it is considered empty or invalid. This has led to all sorts of confusion when reading or setting the value. The validity of an entry should be separate from the contents.
- The slot map is limited to 254 entries. This is not a big limitation - we may never exceed it due to other reasons.
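The magic-value confusion described above can be illustrated with a small sketch. It is not code from the filesystem; the names OLD_SLOT_EMPTY, old_slot_is_empty() and old_slot_is_empty_buggy() are hypothetical, and the sketch assumes the usual two's-complement conversion from a 16-bit unsigned value to int16_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the old format's "empty slot" magic value. */
#define OLD_SLOT_EMPTY ((int16_t)-1)

/*
 * Correct check: reinterpret the raw 16-bit on-disk value as signed
 * before comparing it with the magic value.
 */
static int old_slot_is_empty(uint16_t raw)
{
	return (int16_t)raw == OLD_SLOT_EMPTY;
}

/*
 * Easy mistake: the raw value is widened as unsigned first, so 0xFFFF
 * becomes 65535 and no longer compares equal to -1.
 */
static int old_slot_is_empty_buggy(uint16_t raw)
{
	int widened = raw;	/* 0xFFFF -> 65535, not -1 */

	return widened == -1;
}
```

A separate validity field removes the need for this kind of sign-sensitive comparison entirely.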
New Design Boundaries
- The slot map will support node numbers up to UINT32_MAX.
- A slot map entry will have a separate field to mark it valid or invalid.
- The slot map will not be arbitrarily bounded. That is, it will grow with the number of slots on the filesystem.
The New Slot Map Format
#define OCFS2_FEATURE_INCOMPAT_EXTENDED_SLOT_MAP 0x100

struct ocfs2_extended_slot {
	__u8	es_valid;
	__u8	es_reserved1[3];
	__le32	es_node_num;
};

struct ocfs2_slot_map_extended {
	struct ocfs2_extended_slot se_slots[0];
};
The new slot map format is in use if super->s_feature_incompat contains OCFS2_FEATURE_INCOMPAT_EXTENDED_SLOT_MAP. If the feature bit is not set, the original slot map format is in use. The slot map is still contained in the "slot_map" system file.
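As a minimal sketch of the format selection described above, the check reduces to testing one bit. The enum and the function slot_map_format_for() are hypothetical names, and the sketch assumes the caller has already converted s_feature_incompat from its on-disk byte order to host order.

```c
#include <assert.h>
#include <stdint.h>

#define OCFS2_FEATURE_INCOMPAT_EXTENDED_SLOT_MAP 0x100

/* Hypothetical enum naming the two on-disk formats. */
enum slot_map_format {
	SLOT_MAP_OLD,
	SLOT_MAP_EXTENDED
};

/*
 * Pick the slot map format from the superblock's incompat feature
 * flags. The argument is assumed to be in host byte order already.
 */
static enum slot_map_format slot_map_format_for(uint32_t s_feature_incompat)
{
	if (s_feature_incompat & OCFS2_FEATURE_INCOMPAT_EXTENDED_SLOT_MAP)
		return SLOT_MAP_EXTENDED;
	return SLOT_MAP_OLD;
}
```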
In the old format, the size of the slot_map file was always exactly one cluster. The new slot_map file's size is the size of the total allocation required to hold super->s_max_slots * sizeof(struct ocfs2_extended_slot). The file is merely an array of these extended slots. All array positions are formatted, even though only the first super->s_max_slots entries are used. Both schemes intentionally overcommit the slot map size to make adding slots easier. i_size is set to the full allocation.
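The sizing rule can be sketched as follows. This assumes that "total allocation" means the byte count rounded up to a whole number of clusters; the function names and the EXTENDED_SLOT_SIZE constant (8 bytes: 1 + 3 + 4) are introduced here for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* sizeof(struct ocfs2_extended_slot): 1 + 3 + 4 bytes. */
#define EXTENDED_SLOT_SIZE 8u

/*
 * Bytes allocated to the slot_map file, assuming the allocation is
 * rounded up to whole clusters. i_size would be set to this value.
 */
static uint64_t extended_slot_map_size(uint16_t max_slots,
				       uint32_t cluster_size)
{
	uint64_t needed, clusters;

	assert(cluster_size > 0);
	needed = (uint64_t)max_slots * EXTENDED_SLOT_SIZE;
	clusters = (needed + cluster_size - 1) / cluster_size;
	return clusters * cluster_size;
}

/* Number of formatted array positions, which may exceed max_slots. */
static uint64_t extended_slot_map_entries(uint16_t max_slots,
					  uint32_t cluster_size)
{
	return extended_slot_map_size(max_slots, cluster_size) /
		EXTENDED_SLOT_SIZE;
}
```

With 4 slots and a 4096-byte cluster, for example, the file occupies one cluster and holds 512 formatted entries, of which only the first 4 are used.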
The Extended Slot Entry
An extended slot entry contains a field for the node number supporting numbers up to UINT32_MAX. This field is treated as unsigned. It also contains an 8-bit field for validity. If the es_valid field is nonzero, the entry is valid and the es_node_num field contains valid data. If the es_valid field is zero, the entire entry can be considered empty.
There are three bytes of reserved space. This allows for extension of the entry, rather than another wholesale rewrite of the slot map.
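The entry layout described in this section can be sketched in userspace C by working on raw bytes. The struct slot_info type and the ext_slot_encode()/ext_slot_decode() helpers are hypothetical, not the filesystem's own accessors; the byte offsets follow the struct definition above (validity at byte 0, reserved bytes 1-3, little-endian node number at bytes 4-7).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXT_SLOT_BYTES 8

/* Hypothetical host-side view of one slot entry. */
struct slot_info {
	int valid;
	uint32_t node_num;
};

/* Write one entry; an invalid entry is stored as all zeros. */
static void ext_slot_encode(uint8_t out[EXT_SLOT_BYTES],
			    const struct slot_info *si)
{
	memset(out, 0, EXT_SLOT_BYTES);	/* also clears the reserved bytes */
	if (!si->valid)
		return;
	out[0] = 1;			/* es_valid */
	out[4] = (uint8_t)(si->node_num & 0xff);	/* es_node_num, LE */
	out[5] = (uint8_t)((si->node_num >> 8) & 0xff);
	out[6] = (uint8_t)((si->node_num >> 16) & 0xff);
	out[7] = (uint8_t)((si->node_num >> 24) & 0xff);
}

/* Read one entry; the node number is ignored when es_valid is zero. */
static struct slot_info ext_slot_decode(const uint8_t in[EXT_SLOT_BYTES])
{
	struct slot_info si = { 0, 0 };

	if (in[0] != 0) {
		si.valid = 1;
		si.node_num = (uint32_t)in[4] |
			      ((uint32_t)in[5] << 8) |
			      ((uint32_t)in[6] << 16) |
			      ((uint32_t)in[7] << 24);
	}
	return si;
}
```

Because validity lives in its own byte, every 32-bit value, including UINT32_MAX, remains usable as a node number.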
Filesystem Changes
The filesystem should be able to read and write both the old and the new format. This can be readily accomplished by isolating the code that reads and writes the map, which takes a few steps.
- Move all slot map structure access to slot_map.c. Everything else must use accessors.
- Change the recovery code to use node numbers, not a bitmap.
- Read and write the map based on s_max_slots, not the hard-coded "one block" it used to be. This I/O will work with the old format as well.
- Convert the filesystem's cached idea of the map to one without magic, basically emulating struct ocfs2_extended_slot.
- Define the old format properly, reading it and converting it to the in-memory map of nodes.
- Add the new format, reading it and converting it to the in-memory map.
With these changes, reading the slot map populates an in-memory map that isn't tied to the on-disk format. All access from the rest of the filesystem references this in-memory map. At write time, the in-memory map is converted to the appropriate format.
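The idea of a format-independent in-memory map can be sketched as below. This is an illustration under assumptions, not the code on the branch: struct mem_slot, load_old_map(), load_extended_map() and find_slot() are hypothetical names, and the on-disk data is modeled as a plain byte buffer holding little-endian values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory slot: validity is separate from the node number. */
struct mem_slot {
	int valid;
	uint32_t node_num;
};

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Old format: array of 16-bit entries, 0xFFFF (-1) means empty. */
static void load_old_map(const uint8_t *disk, size_t nslots,
			 struct mem_slot *map)
{
	for (size_t i = 0; i < nslots; i++) {
		uint16_t raw = get_le16(disk + 2 * i);

		map[i].valid = (raw != 0xFFFF);
		map[i].node_num = map[i].valid ? raw : 0;
	}
}

/* New format: 8-byte entries, es_valid at byte 0, es_node_num at byte 4. */
static void load_extended_map(const uint8_t *disk, size_t nslots,
			      struct mem_slot *map)
{
	for (size_t i = 0; i < nslots; i++) {
		const uint8_t *e = disk + 8 * i;

		map[i].valid = (e[0] != 0);
		map[i].node_num = map[i].valid ? get_le32(e + 4) : 0;
	}
}

/* Accessor for the rest of the code: returns 1 and sets *slot if found. */
static int find_slot(const struct mem_slot *map, size_t nslots,
		     uint32_t node_num, size_t *slot)
{
	for (size_t i = 0; i < nslots; i++) {
		if (map[i].valid && map[i].node_num == node_num) {
			*slot = i;
			return 1;
		}
	}
	return 0;
}
```

Callers such as find_slot() only ever see struct mem_slot, so they behave identically regardless of which loader populated the map.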
These changes are available on the new-slot-map branch of Joel's linux-2.6 Git tree.
Tools Changes
The ocfs2 toolset needs to be able to create and read this new format. mkfs.ocfs2(8) needs to create filesystems using the new format. tunefs.ocfs2(8) should switch between formats. And any tool that examines the map needs to read the new format.
These changes are available on the new-slot-map branch of the ocfs2-tools Git tree.