OCFS2 ONLINE ADD SLOTS SUPPORT
Tao Ma, August 11 2007
GOALS
The immediate goal is to allow users to add slots to a volume while some nodes have already mounted it (online add slots). There is a growing demand from users to add slots while the ocfs2 volume is mounted; see http://oss.oracle.com/pipermail/ocfs2-users/2007-September/002063.html for an example. tunefs.ocfs2 already has the ability to add slots offline, so this design doc will only focus on the online part.
USER INTERACTION
It is the same as the offline operation; we will handle the difference internally. tunefs.ocfs2 will be the front-end tool. The "-N" argument specifies the new number of node slots.
# tunefs.ocfs2
tunefs.ocfs2 1.4.1
usage: tunefs.ocfs2 [-N number-of-node-slots] device
NOTE: Joel Becker has almost finished his refactoring of tunefs.ocfs2, and I have decided to add online support on top of his new infrastructure. So the real user interface may be ocfs2ne (if it hasn't replaced tunefs.ocfs2 by then).
BACKGROUND
In OCFS2, the slot count is the number of nodes that can mount the ocfs2 volume simultaneously. The system files for every slot are stored under the system file directory, and their inodes are allocated from global_inode_alloc. So to add a new slot, we normally need to (see the sketch after this list):
1. Allocate inodes from global_inode_alloc (if it is full, we need to allocate more space from global_bitmap).
2. Initialize the new system files (for normal files, initialize the inode and directory data; for the journal, also allocate its journal clusters).
3. Link the new system files into the system directory.
4. Increase the slot count stored in the superblock.
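For reference, here is a small illustrative sketch of the per-slot system file names that steps 1-3 create and link. The name templates ("journal:%04d" and friends) follow the ocfs2_system_inodes table in ocfs2_fs.h; the helper itself is just an example, not code that will be merged.

#include <stdio.h>

/* Name templates for the per-slot system files, as in ocfs2_fs.h. */
static const char * const slot_sysfile_templates[] = {
	"extent_alloc:%04d",
	"inode_alloc:%04d",
	"journal:%04d",
	"local_alloc:%04d",
	"truncate_log:%04d",
	"orphan_dir:%04d",
};

/* Print the system file names that a newly added slot needs. */
static void print_new_slot_files(int slot)
{
	char name[64];
	unsigned int i;

	for (i = 0; i < sizeof(slot_sysfile_templates) /
			sizeof(slot_sysfile_templates[0]); i++) {
		snprintf(name, sizeof(name), slot_sysfile_templates[i], slot);
		printf("%s\n", name);		/* e.g. "journal:0004" */
	}
}

int main(void)
{
	print_new_slot_files(4);	/* the files for a hypothetical 5th slot */
	return 0;
}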
SCHEME
We want to do as much as possible in user space and let the kernel do the rest. The good news is that online resize has already been added to tunefs.ocfs2 and Joel Becker has built an infrastructure for online operations, so we only need to focus on the details of adding slots.
We will add slots one by one, so that if any mistake happens, we can recover from it quickly. And since formatting the journal is time-consuming, we will ask the kernel to do the allocation and do the formatting in user space. There are 3 more things that are also helpful for us:
1. In mkfs.ocfs2, we allocate inodes for 32 slots in global_inode_alloc, so if that is enough, we don't need to allocate inode space from global_bitmap.
32 slots means we save 800K for such inodes. For 64, 1.6M. Maybe we should bump up the min. - Sunil
No matter what the minimum, we have to get expansion right. I think 32 is just fine. -- Joel
2. When ocfs2 is mounted, the inode of global_inode_alloc isn't loaded in the kernel, so we can modify it safely in user space.
3. slot_map: the old slot_map format doesn't require any change from us. For the new format, the file is allocated 1 cluster at the very beginning, and with the minimum cluster size (4K) that holds 256 slot entries, so there is a great chance that we won't need to change it (see the check sketch below).
The new slot map also has enough space for 256 slots. So no need to add code to extend the slot map. Maybe just add check code that ensures the new slot count will fit in the slot map. If not, bail out. - Sunil
In the case of the new slot map, the code should just verify that the passed-in count fits within the current slot map size. In slot_map.c, that's ocfs2_slot_info->si_slots_per_block * ocfs2_slot_info->si_blocks. -- Joel
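A minimal sketch of the check Joel describes, assuming the new slot map is already in use. si_blocks and si_slots_per_block are the fields from slot_map.c; the structure here is trimmed down and the function name is made up for illustration.

#include <errno.h>

/* Trimmed-down stand-in for struct ocfs2_slot_info (slot_map.c). */
struct ocfs2_slot_info {
	unsigned int si_blocks;			/* blocks allocated to the slot_map file */
	unsigned int si_slots_per_block;	/* extended entries per block */
};

/* Refuse a new slot count that does not fit in the existing slot map. */
static int ocfs2_check_new_slots_fit(struct ocfs2_slot_info *si,
				     unsigned int new_num_slots)
{
	if (new_num_slots > si->si_blocks * si->si_slots_per_block)
		return -ENOSPC;	/* would need to extend slot_map: bail out */

	return 0;
}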
The scheme for ocfs2ne is as follows (it only covers adding slots online, increasing the slot count one at a time; a rough sketch of the loop follows the list):
1. Calculate whether the increased slot count will require us to allocate clusters from global_bitmap for global_inode_alloc.
2. If yes, bad. Call a system call and let the kernel allocate the space for us (for efficiency, we will ask the kernel to allocate enough space for all the added slots, not just this one).
Why bad? So we have an ioctl here. Define it. My read is that it adds a group to the global_inode_alloc. Can we reuse OCFS2_IOC_GROUP_ADD? - Sunil
We can't reuse OCFS2_IOC_GROUP_ADD here. It is defined as adding a group we've prepped, not as allocating and prepping one. More on the ioctl(2) definitions below. -- Joel
3. Now we have space; go on to inode allocation.
- 1) Allocate space from global_inode_alloc.
- 2) Initialize all the new system inodes. If inline-data is supported, also initialize the directory data inline.
Maybe we should not rely on inline-data at all. The reason is that the orphan dirs grow, meaning making them inline may save us some effort but not space. In that case, we can save the if-then-else by adding a cluster to the orphan_dir. -- Sunil
I'm interested in Mark's take on this. We do format orphan dirs as inline when available. It's not hard either way. -- Joel
- 3) Call a system call and let the kernel allocate space for the journal, the directory (if not inline-data) and the slot_map if needed.
That's the second ioctl. Again, define it. My read is that it allocates space to an inode. Again, slot_map resize is not needed. -- Sunil
4. Initialize the journal, the directory, and the new slot_map.
5. Call a system call and let the kernel do the directory entry linking for the new files.
Third ioctl. This could involve extending the sysdir. -- Sunil
6. Increase the slot count and go to step 1.
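To make the loop above concrete, here is a rough user-space sketch of steps 1-6. Every helper below, including the kernel_*() wrappers around the not-yet-defined ioctl(2)s, is a hypothetical placeholder for whatever the final ocfs2ne code ends up calling.

/* Hypothetical helpers: stand-ins for the real ocfs2ne/libocfs2 code
 * and for wrappers around the ioctl(2)s discussed below. */
extern int global_inode_alloc_has_room(void *fs, int slot);
extern int kernel_grow_global_alloc(int fd, int slots_needed);
extern int init_slot_system_inodes(void *fs, int slot);
extern int kernel_allocate_slot_space(int fd, int slot);
extern int format_slot_files(void *fs, int slot);
extern int kernel_link_slot(int fd, int slot);

/* Sketch of the per-slot loop (steps 1-6 above). */
int online_add_slots(void *fs, int fd, int current_slots, int wanted_slots)
{
	int slot, ret;

	for (slot = current_slots; slot < wanted_slots; slot++) {
		/* Steps 1-2: make sure global_inode_alloc has room; if not,
		 * ask the kernel to grow it enough for all remaining slots. */
		if (!global_inode_alloc_has_room(fs, slot)) {
			ret = kernel_grow_global_alloc(fd, wanted_slots - slot);
			if (ret)
				return ret;
		}

		/* Step 3: allocate and initialize the new system inodes in
		 * user space, then let the kernel allocate space for the
		 * journal, directory and slot_map as needed. */
		ret = init_slot_system_inodes(fs, slot);
		if (!ret)
			ret = kernel_allocate_slot_space(fd, slot);
		if (ret)
			return ret;

		/* Step 4: format the journal, directory and slot_map. */
		ret = format_slot_files(fs, slot);
		if (ret)
			return ret;

		/* Step 5: the kernel links the new files into the system
		 * directory and bumps the slot count in the superblock. */
		ret = kernel_link_slot(fd, slot);
		if (ret)
			return ret;

		/* Step 6: continue with the next slot. */
	}

	return 0;
}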
The scheme for the ocfs2 kernel is as follows:
1. Check the parameters for inode allocation.
2. If ocfs2ne has already allocated for us, cool. Go to step 4.
3. Allocate enough space from global_bitmap to global_inode_alloc and return.
4. Check whether the volume supports inline-data; if yes, go to step 6.
5. Allocate space for the orphan dir and modify the inode accordingly.
6. Check whether we have really allocated space for the journal. If yes, go to step 8.
7. Allocate enough space for the journal and return.
8. Link all the system files into the system directory and write the new slot count to the superblock (now the on-disk work for the newly added slot is finished).
9. Check whether we have finished adding the last slot; if yes, reinitialize our slot info.
I am unclear on the kernel bits. Maybe it would be better if you talk about it from the ioctls' point of view. -- Sunil
If I understand this correctly, Tao is envisioning a single ioctl(2), let's call it OCFS2_ADD_SLOTS. In the scheme above, you call it, and the kernel basically does each step, aborting at each point. That is, you call it the first time, the kernel grows the global alloc and exits. You call it a second time, the kernel allocates journal space and exits. You call it a third time and it links the inodes and bumps the slot count. I really don't like an unpredictable scheme (one that does different things at different times) like that.
The ioctl(2)s should be specific and single-tasked. I'm just not sure how we break them down. There are two ways to approach it. We can have generic ioctl(2)s that provide building blocks for multiple online operations, or we can have very specific ones that are defined for this particular operation, adding new ioctl(2)s for new online operations.
In a building block approach, we would probably have an ioctl(2) OCFS2_GROW_INODE. So, if you need more global alloc space, you call OCFS2_GROW_INODE on the global alloc inode. Then, once you've initialized the journal inode, you call OCFS2_GROW_INODE on the journal inode, asking the kernel to allocate it. The same for the orphan dirs (if needed). Next, we'd have an OCFS2_LINK_INODE ioctl(2). The code would call this once per system file to link them all into the system directory. After they are all in the system directory, you call OCFS2_ADD_SLOT, which increases the slot count and synchronizes with all nodes.
The benefit of the building block approach is that we can reuse these ioctl(2) calls for other online operations in the future. The problem is that they are possibly ripe for abuse and/or error, because they aren't tied to a specific operation.
In a specific scheme, we would have the more specific OCFS2_GROW_GLOBAL_ALLOC. It only works on the global alloc inode. Then perhaps we have the OCFS2_ALLOCATE_SLOT ioctl(2). It is passed an array of all the inodes for the new slot, and it grows each as needed (right now orphan dir and journal, maybe some other type in the future too). Finally, we have OCFS2_LINK_SLOT. It is passed the array of inodes again, and it links them to the system dir before bumping the slot count in the superblock and synchronizing with the other nodes.
These ioctl(2)s cannot be reused for other operations, but they can have a lot more error checking. The OCFS2_ALLOCATE_SLOT and OCFS2_LINK_SLOT calls can verify they have the right set of inodes before continuing. The OCFS2_GROW_GLOBAL_ALLOC call may be able to take a shortcut, knowing the limited locking rules of the global alloc file.
These are the two ways I see doing this. Thoughts? -- Joel
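To make the second, operation-specific option a bit more concrete, here is a hypothetical set of ioctl(2) definitions. The command numbers, the argument structure and the exact names are placeholders; they only illustrate the shape Joel describes, not what would actually be merged.

#include <linux/ioctl.h>
#include <linux/types.h>

/* Hypothetical argument block passed by ocfs2ne: the block numbers of
 * the per-slot system inodes it has already prepared in user space. */
struct ocfs2_new_slot_args {
	__u32 ns_slot;		/* slot being added */
	__u32 ns_count;		/* number of inodes in ns_inodes[] */
	__u64 ns_inodes[8];	/* blkno of each prepared system inode */
};

/* Grow global_inode_alloc (only works on that inode). */
#define OCFS2_GROW_GLOBAL_ALLOC	_IOW('o', 8, __u32)

/* Allocate space (journal, orphan dir, ...) to the prepared slot inodes. */
#define OCFS2_ALLOCATE_SLOT	_IOW('o', 9, struct ocfs2_new_slot_args)

/* Link the slot inodes into the system dir, bump the superblock slot
 * count and synchronize with the other nodes. */
#define OCFS2_LINK_SLOT		_IOW('o', 10, struct ocfs2_new_slot_args)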
OTHER IMPORTANT ISSUES
slot map
slot_map maps between slot numbers and node numbers. It is initialized and loaded during the mount process and refreshed whenever we take the super lock (during mount, umount and recovery). So there are at least 3 issues to think about:
1. slot_map extension and initialization. It should work like the journal file: the kernel allocates new space if needed and ocfs2ne empties the new cluster. As I have described above, there is a great chance we will not run into this.
2. Kernel slot info reinitialization (on the node running the operation). In order to simplify the process and make the kernel do less, we only reinitialize our slot info after the last slot has been added successfully.
3. When and how do we inform the other mounted nodes about the change in the max slot number? Maybe a new type of lock for "slot_map" is needed here. The behaviour of this lock would be (see the sketch after this discussion):
- 1) When we initialize the "slot_map" during mount, we take a PR lock.
- 2) After we update the slot information for one slot (the one being added by ocfs2ne), we upconvert to EX and then drop back to PR.
- 3) In the downconvert_worker for this lock type, we update the slot info and queue a request to get PR again.
- 4) We get the PR again.
- 5) Release the lock when we release our slot during umount.
Open question: maybe a vote is better here?
I pointed out in email that we need to prevent mount/unmount from happening until all live nodes agree on the new slot count. Regarding vote, we won't be reintroducing the old vote code. If we do external comms, they may only live with the ocfs2_controld -- Joel
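For clarity, a pseudocode sketch of the lock behaviour proposed in item 3 above. All of the helper names and the lock-level values are hypothetical placeholders; the real code would go through dlmglue, and the open questions about vote/external comms remain.

/* Hypothetical placeholders for the real dlmglue/slot_map calls. */
struct super_block;
extern void slot_map_cluster_lock(struct super_block *sb, int level);
extern void slot_map_cluster_unlock(struct super_block *sb);
extern void refresh_slot_info(struct super_block *sb);

#define PR	3	/* protected read (placeholder value) */
#define EX	5	/* exclusive (placeholder value) */

/* 1) Mount: take the lock PR and read the slot map. */
void slot_map_lock_on_mount(struct super_block *sb)
{
	slot_map_cluster_lock(sb, PR);
	refresh_slot_info(sb);
}

/* 2) + 4) The node running ocfs2ne: after updating one slot, raise to
 * EX (forcing the other holders to downconvert), then drop back to PR. */
void slot_map_publish_change(struct super_block *sb)
{
	slot_map_cluster_lock(sb, EX);
	slot_map_cluster_lock(sb, PR);
}

/* 3) Downconvert worker on the other nodes: re-read the slot info
 * before the PR is given up; those nodes re-take PR afterwards
 * (step 4, not shown here). */
int slot_map_downconvert_worker(struct super_block *sb)
{
	refresh_slot_info(sb);
	return 0;	/* allow the downconvert */
}

/* 5) Umount: drop the lock when we release our slot. */
void slot_map_lock_on_umount(struct super_block *sb)
{
	slot_map_cluster_unlock(sb);
}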
NEW ONLINE LOCK RESOURCE
There is a lock resource named "tunefs-online-resize-lock" that was added for online resize. It prevents nodes from doing online resize simultaneously. But I think it could be improved and renamed to "tunefs-online-lock" so that no two online operations of any kind can happen simultaneously.
We cannot rename the "tunefs-online-resize-lock"; there is already software in the wild that uses it. We have to be able to lock against that software. I think it's fair to allow only one online operation at a time, so I propose a master "tunefs-online-lock". Going forward, all online operations will take that lock. The online resize operation first takes that lock, then takes "tunefs-online-resize-lock", thus locking out older software as well. This is similar to our RESIZE_INPROG flag predating the TUNEFS_INPROG flag. -- Joel
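A minimal sketch of the ordering Joel proposes, with a hypothetical take_cluster_lock()/drop_cluster_lock() pair standing in for whatever cluster lock calls ocfs2ne actually uses:

/* Hypothetical wrappers around the real cluster lock calls. */
extern int take_cluster_lock(const char *name);
extern void drop_cluster_lock(const char *name);

/* Every online operation starts by taking the new master lock. */
int lock_for_online_op(void)
{
	return take_cluster_lock("tunefs-online-lock");
}

/* Online resize additionally takes the old lock, so that copies of
 * tunefs.ocfs2 already in the wild are still locked out. */
int lock_for_online_resize(void)
{
	int ret = take_cluster_lock("tunefs-online-lock");

	if (ret)
		return ret;

	ret = take_cluster_lock("tunefs-online-resize-lock");
	if (ret)
		drop_cluster_lock("tunefs-online-lock");

	return ret;
}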
FAILURE SCENARIOS
As the kernel has JBD support, kernel operations are much safer than user space ones. So we will focus more on how to avoid corruption from user space.
- Failure during system inode allocation: fsck.ocfs2 will find the unused inodes and clear them for us.
- Failure while formatting the journal, directory and slot_map: for the journal and directory, fsck.ocfs2 will find the problem and fix it for us. For the slot_map, we need to empty the new cluster and then increase i_size (we can do this safely in user space since the kernel doesn't read/write this field after mount).
- Failure after we have increased the slot count by one: this is OK, since the on-disk work for the new slot is finished.
Any kernel failures must be recoverable by the replay of the journal. I think this is what Tao is saying. Otherwise, yeah, we can leak at most a slot's worth of inodes+space. No big deal. Especially if tunefs.ocfs2(8) can pick up where it left off, which we can even add later. -- Joel