[Ocfs2-tools-devel] [PATCH] ocfs2_controld: Fix double-leave in complete_mount().

Sat Aug 16 02:54:50 PDT 2008

In commit f5032771bc41bf9ff31ed42f332d8ec8def39e55 (ocfs2_controld: Allow
multiple real mounts.) we stopped tracking mountpoints.  Instead, we
describe different applications as "services".  "tunefs.ocfs2" is one
service.  "fsck.ocfs2" is another.  The actual filesystem uses the
service "ocfs2".

Only one instance of a service is allowed, except for the filesystem.
You can, of course, have one device mounted at multiple mountpoints:

 # mount /dev/sdb1 /ocfs2
 # mount /dev/sdb1 /ocfs3

In the special case of the "ocfs2" service, ocfs2_controld will allow
more than one MOUNT call (send by o2cb_begin_group_join).
The additinaly mounters get EALREADY, which libo2cb knows to interpret
as success.  It goes like this:

mount.ocfs2 on /ocfs2 (first)		ocfs2_controld
-----------------------------		--------------
o2cb_begin_group_join(uuid)
					start_join(uuid)
					finish_join(uuid)
					dlmcontrol_register(uuid)
					notify_mount_client(0)
err = mount(dev, mntpnt)
o2cb_complete_group_join(uuid, err)
					complete_mount(uuid, err)

mount.ocfs2 on /ocfs3 (additional)	ocfs2_controld
----------------------------------	--------------
o2cb_begin_group_join(uuid)
					notify_client(EALREADY)
err = mount(dev, mntpnt)
o2cb_complete_group_join(uuid, err)
					complete_mount(uuid, err)

Here's the crux of the problem.  If that first mounter gets an error
from mount(2), the daemon's complete_mount() should leave the group.
There's no filesystem mounted.

However, if the *second* mounter gets an error from mount(2) (say, a
missing mountpoint), the daemon should not leave the group - that first
mount is still going!  That's the bug.  The daemon didn't know the
difference, and it would leave the group.  The first mount was left out
in the lurch.

The fix is to mark the additional mounts as such.  complete_mount()
notices the additional flag and does nothing beyond responding to
mount.ocfs2(8).

dead_mounter() had the same problem.  If an additional mounter died, it
was treated like a first mounter.  dead_mounter() now does what
complete_mount() does.  It cleans up the additional state and nothing
else.

While we're there, we've learned enough about our state to handle first
mounts that died before sending their status to the daemon.  We've
always known that a dead_mounter() during group join could just set
leave_on_join.  But if the mount program has already been notified,
there may be a mounted filesystem.  We pinned the filesystem as busy and
basically locked out all other operations.

But as it turns out, a fully operational group is a good state.  We can
clear the in-progress flag and allow new operations.  Additional mounts
can happen cleanly, and umounts as well.  If the filesystem never got
mounted, a cleanup with ocfs2_hb_ctl is safe.  It's up to the
administrator to check safety, but it's a predictable environment.

Signed-off-by: Joel Becker <joel.becker at oracle.com>
---
 ocfs2_controld/mount.c |   68 +++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/ocfs2_controld/mount.c b/ocfs2_controld/mount.c
index 3accd28..0bf9ed0 100644
--- a/ocfs2_controld/mount.c
+++ b/ocfs2_controld/mount.c
@@ -37,6 +37,7 @@
 struct service {
 	struct list_head	ms_list;
 	char			ms_service[PATH_MAX + 1];
+	int			ms_additional;	/* This is a second mount */
 };
 
 struct mountgroup {
@@ -287,6 +288,16 @@ static void add_service(struct mountgroup *mg, const char *device,
 				   mg->mg_uuid, service);
 			return;
 		}
+
+		/*
+		 * There can be more than one real mount.  However, if an
+		 * additional mount fails in sys_mount(2), we can't have
+		 * complete_mount() removing the service.  We only want that
+		 * to happen when it's the first mount.
+		 */
+		log_debug("Additional mount of %s starting for %s",
+			  mg->mg_uuid, service);
+		ms->ms_additional = 1;
 	} else {
 		ms = malloc(sizeof(struct service));
 		if (!ms) {
@@ -608,7 +619,7 @@ out:
 		send_message(fd, CM_STATUS, mg->mg_error, mg->mg_error_msg);
 		mg->mg_error = 0;
 
-		if (mg->mg_error == EALREADY)
+		if (rc == -EALREADY)
 			mg->mg_mount_notified = 1;
 		else {
 			log_error("mount: %s", mg->mg_error_msg);
@@ -711,14 +722,27 @@ int complete_mount(int ci, int fd, const char *uuid, const char *errcode,
 	 */
 	mg->mg_ms_in_progress = NULL;
 
+	if (ms->ms_additional) {
+		/*
+		 * Our additional real mount is done whether it succeeded
+		 * or failed.  We only have to clear the additional state
+		 * and reply OK.
+		 */
+		log_debug("Completed additional mount of filesystem %s, "
+			  "error is %ld",
+			  mg->mg_uuid, err);
+		ms->ms_additional = 0;
+		err = 0;
+	}
+
 	if (!err) {
 		mg->mg_mount_fd = -1;
 		mg->mg_mount_ci = -1;
 	} else {
 		/*
 		 * remove_service() will kick off a leave if this was
-		 * the last service.  As part of the leave, it will add
-		 * reset ms_in_progress.
+		 * the last service.  As part of the leave, it will set
+		 * ms_in_progress.
 		 */
 		remove_service(mg, service);
 
@@ -833,6 +857,8 @@ void dead_mounter(int ci, int fd)
 	if (!ms)
 		return;
 
+	log_error("Mounter for filesystem %s, service %s died", mg->mg_uuid,
+		  ms->ms_service);
 	mg->mg_mount_ci = -1;
 	mg->mg_mount_fd = -1;
 
@@ -845,12 +871,26 @@ void dead_mounter(int ci, int fd)
 		return;
 
 	/*
+	 * If this was just an additional real mount, we just clear the
+	 * state.
+	 */
+	if (ms->ms_additional) {
+		log_debug("Additional mounter of filesystem %s died",
+			  mg->mg_uuid);
+		ms->ms_additional = 0;
+		mg->mg_ms_in_progress = NULL;
+		return;
+	}
+
+	/*
 	 * We haven't notified the client yet.  Thus, the client can't have
 	 * called mount(2).  Let's just abort this service.  If this was
 	 * the last service, we'll plan to leave the group.
 	 */
-	if (!mg->mg_mount_notified)
+	if (!mg->mg_mount_notified) {
 		remove_service(mg, ms->ms_service);
+		return;
+	}
 
 	/*
 	 * XXX
@@ -859,9 +899,25 @@ void dead_mounter(int ci, int fd)
 	 * expecting the client to call mount(2).  But the client died.
 	 * We don't know if that happened, so we can't leave the group.
 	 *
-	 * That's not totally true, btw.  Eventually we'll be opening the
-	 * sysfs file.  Once we have that info, we'll do better here.
+	 * We do know, though, that all the other in-progress operations
+	 * (group join, dlmc_fs_register) must have completed, or we
+	 * wouldn't have set mg_mount_notified.  Thus, we can treat it as
+	 * a live mount.  Witness the power of a fully armed and
+	 * operational mountgroup.
+	 *
+	 * We can clear the in-progress flag and allow other mounters.  If
+	 * it really mounted, it can be unmounted.  If it didn't mount, the
+	 * state can be torn down with ocfs2_hb_ctl.
+	 *
+	 * Maybe later we'll learn how to detect the mount via the kernel
+	 * and tear it down ourselves.  But not right now.
 	 */
+
+	log_error("Kernel mount of filesystem %s already entered, "
+		  "assuming it succeeded",
+		  mg->mg_uuid);
+
+	mg->mg_ms_in_progress = NULL;
 }
 
 int send_mountgroups(int ci, int fd)
-- 
1.5.6.3


-- 

"Glory is fleeting, but obscurity is forever."  
         - Napoleon Bonaparte

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127