[rds-devel] deploying ADDR_CHANGE in RDS

Or Gerlitz ogerlitz at voltaire.com
Tue Aug 12 14:40:15 PDT 2008


As of 2.6.27 (and backported into ofed 1.4), the bonding driver announces
each fail-over it performs by delivering a NETDEV_BONDING_FAILOVER event.
On top of that, the RDMA-CM may deliver a callback to its consumers with the
RDMA_CM_EVENT_ADDR_CHANGE event.

A possible design that can be built on top of this is the following scheme:

1) set a primary device for a bond

2) in the steady state, all RDMA traffic goes through the link associated
with this net device. Once there's a failure, bonding fails over, the RDMA
connection gets broken, and the app reconnects: an ARP is sent through the
secondary port, etc., and a new connection is established.

3) once the problem is fixed, bonding does "fail-back" since it has a
primary definition, but this time the RDMA connection is not broken!

4) the rdma cm then generates an ADDR CHANGE event as a hint to the RDMA
ULPs (eg RDS) that their connection is "not aligned with the IP stack", so
they should reconnect

Below are the three most relevant patches that were merged and make this possible.

The approach I suggest for RDS is simply to reconnect when getting an
ADDR CHANGE rdma-cm event or, if doing it "now" is not optimal, to schedule
some task which attempts to do it when possible.
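
The reconnect-or-defer idea can be sketched in plain C as below. This is only
an illustration and is not taken from RDS: every sketch_* name, the "busy"
flag and the reconnect counter are assumptions, and the enum merely stands in
for the real rdma_cm_event_type. The handler returns 0 because, in the cma
patch below, a non-zero return from the consumer's event handler makes the
rdma cm destroy the id.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the rdma-cm event types; the real enum is in <rdma/rdma_cm.h>. */
enum sketch_cm_event {
	SKETCH_EVENT_ESTABLISHED,
	SKETCH_EVENT_DISCONNECTED,
	SKETCH_EVENT_ADDR_CHANGE
};

/* Hypothetical per-connection state, not the actual RDS structure. */
struct sketch_conn {
	bool connected;
	bool reconnect_pending;	/* set when a reconnect was deferred */
	bool busy;		/* true while reconnecting "now" is not optimal */
	int reconnects;		/* number of reconnects done, for illustration */
};

/* Placeholder for tearing down the connection and resolving/connecting again. */
static void sketch_reconnect(struct sketch_conn *c)
{
	c->connected = true;
	c->reconnect_pending = false;
	c->reconnects++;
}

/* Consumer callback: on ADDR_CHANGE either reconnect now or defer it. */
static int sketch_cm_event_handler(struct sketch_conn *c, enum sketch_cm_event ev)
{
	switch (ev) {
	case SKETCH_EVENT_ESTABLISHED:
		c->connected = true;
		break;
	case SKETCH_EVENT_DISCONNECTED:
		c->connected = false;
		break;
	case SKETCH_EVENT_ADDR_CHANGE:
		if (c->busy)
			c->reconnect_pending = true;
		else
			sketch_reconnect(c);
		break;
	}
	return 0;	/* 0 asks the rdma cm to keep the id alive */
}

/* The scheduled task: performs a deferred reconnect once it is possible. */
static void sketch_deferred_work(struct sketch_conn *c)
{
	if (c->reconnect_pending && !c->busy)
		sketch_reconnect(c);
}
```

In a real ULP the deferred part would presumably be a work item queued on a
workqueue rather than a function called by hand.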

Once you have this logic in place, you can build a load balancing scheme
which provides HA as well: define two bonds, each with a primary that
relates to a different link (hca/port), and act on ADDR CHANGE events.

The app maintains two connections per peer such that in the steady state
each connection uses a different link and under failover they both use
the same link. When the problem is fixed, ONLY one connection will get an
addr change event, and it can revert to the link it should run on.
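
A toy model of the two-bond scheme, again in plain C, may make the sequence
easier to follow. All lb_* names are hypothetical, links are reduced to the
indexes 0 and 1, and a broken connection (on failover) and an ADDR CHANGE
hint (on fail-back) are both modelled as "the connection is not on its
bond's active link, so it reconnects".

```c
#include <assert.h>
#include <stdbool.h>

#define LB_NLINKS 2

/* Hypothetical model of an active-backup bond that has a primary slave. */
struct lb_bond {
	int primary;	/* link configured as primary */
	int active;	/* link currently used by the IP stack */
};

/* Hypothetical connection tied to one bond. */
struct lb_conn {
	const struct lb_bond *bond;
	int link;	/* link the RDMA session currently runs on */
	int reconnects;
};

/* Pick the active link: the primary whenever it is up, else the backup. */
static bool lb_bond_update(struct lb_bond *b, const bool link_up[LB_NLINKS])
{
	int backup = 1 - b->primary;
	int next = b->active;

	if (link_up[b->primary])
		next = b->primary;
	else if (link_up[backup])
		next = backup;

	bool changed = (next != b->active);
	b->active = next;
	return changed;
}

/* Re-establish the connection on whatever link the bond uses now. */
static void lb_conn_reconnect(struct lb_conn *c)
{
	c->link = c->bond->active;
	c->reconnects++;
}

/* Apply a link state change; returns how many connections reconnected. */
static int lb_handle_link_change(struct lb_bond *bonds, int nbonds,
				 struct lb_conn *conns, int nconns,
				 const bool link_up[LB_NLINKS])
{
	int i, n = 0;

	for (i = 0; i < nbonds; i++)
		lb_bond_update(&bonds[i], link_up);

	for (i = 0; i < nconns; i++) {
		if (conns[i].link != conns[i].bond->active) {
			lb_conn_reconnect(&conns[i]);
			n++;
		}
	}
	return n;
}
```

With bond 0 using link 0 as primary and bond 1 using link 1, taking link 0
down moves the first connection to link 1, and bringing it back up makes
only that connection revert; the second one is never touched.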

thoughts?

Or

commit c1da4ac752b8b0411791d26c678fcf23d2eed242
Author: Or Gerlitz <ogerlitz at voltaire.com>
Date:   Fri Jun 13 18:12:00 2008 -0700

    net/core: add NETDEV_BONDING_FAILOVER event

    Add NETDEV_BONDING_FAILOVER event to be used in a successive patch
    by bonding to announce fail-over for the active-backup mode through the
    netdev events notifier chain mechanism. Such an event can be of use for the
    RDMA CM (communication manager) to let native RDMA ULPs (eg NFS-RDMA, iSER)
    always be aligned with the IP stack, in the sense that they use the same
    ports/links as the stack does. More usages can be done to allow monitoring
    tools based on netlink events being aware to bonding fail-over.

    Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
    Signed-off-by: Jay Vosburgh <fubar at us.ibm.com>
    Signed-off-by: Jeff Garzik <jgarzik at redhat.com>

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f27fd20..e92fc83 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1479,6 +1479,7 @@ extern void		__dev_addr_unsync(struct dev_addr_list **to, int *to_count, struct
 extern void		dev_set_promiscuity(struct net_device *dev, int inc);
 extern void		dev_set_allmulti(struct net_device *dev, int inc);
 extern void		netdev_state_change(struct net_device *dev);
+extern void		netdev_bonding_change(struct net_device *dev);
 extern void		netdev_features_change(struct net_device *dev);
 /* Load a device via the kmod */
 extern void		dev_load(struct net *net, const char *name);
diff --git a/include/linux/notifier.h b/include/linux/notifier.h
index 0ff6224..bd3d72d 100644
--- a/include/linux/notifier.h
+++ b/include/linux/notifier.h
@@ -197,6 +197,7 @@ static inline int notifier_to_errno(int ret)
 #define NETDEV_GOING_DOWN	0x0009
 #define NETDEV_CHANGENAME	0x000A
 #define NETDEV_FEAT_CHANGE	0x000B
+#define NETDEV_BONDING_FAILOVER 0x000C

 #define SYS_DOWN	0x0001	/* Notify of system down */
 #define SYS_RESTART	SYS_DOWN
diff --git a/net/core/dev.c b/net/core/dev.c
index 68d8df0..0e45742 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -961,6 +961,12 @@ void netdev_state_change(struct net_device *dev)
 	}
 }

+void netdev_bonding_change(struct net_device *dev)
+{
+	call_netdevice_notifiers(NETDEV_BONDING_FAILOVER, dev);
+}
+EXPORT_SYMBOL(netdev_bonding_change);
+
 /**
  *	dev_load 	- load a network module
  *	@net: the applicable net namespace


commit 01f3109de49a889db8adf9116449727547ee497e
Author: Or Gerlitz <ogerlitz at voltaire.com>
Date:   Fri Jun 13 18:12:02 2008 -0700

    bonding: deliver netdev event for fail-over under the active-backup mode

    under active-backup mode and when there's actual new_active slave,
    have bond_change_active_slave() call the networking core to deliver
    NETDEV_BONDING_FAILOVER event such that the fail-over can be notable
    by code outside of the bonding driver such as the RDMA stack and
    monitoring tools.

    As the correct context of locking appropriate for notifier calls is RTNL
    and nothing else, bond->curr_slave_lock and bond->lock are unlocked and
    later locked again. This is ensured by the rest of the code to be safe
    under backup-mode AND when new_active is not NULL.

    Jay Vosburgh modified the original patch for formatting and fixed a
    compiler error.

    Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
    Signed-off-by: Jay Vosburgh <fubar at us.ibm.com>
    Signed-off-by: Jeff Garzik <jgarzik at redhat.com>

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 2db2d05..925402b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1203,6 +1203,14 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
 				dprintk("delaying gratuitous arp on %s\n",
 					bond->curr_active_slave->dev->name);
 			}
+
+			write_unlock_bh(&bond->curr_slave_lock);
+			read_unlock(&bond->lock);
+
+			netdev_bonding_change(bond->dev);
+
+			read_lock(&bond->lock);
+			write_lock_bh(&bond->curr_slave_lock);
 		}
 	}
 }

commit dd5bdff83b19d9174126e0398b47117c3a80e22d
Author: Or Gerlitz <ogerlitz at voltaire.com>
Date:   Tue Jul 22 14:14:22 2008 -0700

    RDMA/cma: Add RDMA_CM_EVENT_ADDR_CHANGE event

    Add an RDMA_CM_EVENT_ADDR_CHANGE event that can be used by rdma-cm
    consumers that wish to have their RDMA sessions always use the same
    links (eg <hca/port>) as the IP stack does.  In the current code, this
    does not happen when bonding is used and fail-over happened but the IB
    link used by an already existing session is operating fine.

    Use the netevent notification for sensing that a change has happened
    in the IP stack, then scan the rdma-cm ID list to see if there is an
    ID that is "misaligned" with respect to the IP stack, and deliver
    RDMA_CM_EVENT_ADDR_CHANGE for this ID.  The consumer can act on the
    event or just ignore it.

    Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index ae11d5c..79792c9 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -168,6 +168,12 @@ struct cma_work {
 	struct rdma_cm_event	event;
 };

+struct cma_ndev_work {
+	struct work_struct	work;
+	struct rdma_id_private	*id;
+	struct rdma_cm_event	event;
+};
+
 union cma_ip_addr {
 	struct in6_addr ip6;
 	struct {
@@ -1598,6 +1604,30 @@ out:
 	kfree(work);
 }

+static void cma_ndev_work_handler(struct work_struct *_work)
+{
+	struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, work);
+	struct rdma_id_private *id_priv = work->id;
+	int destroy = 0;
+
+	mutex_lock(&id_priv->handler_mutex);
+	if (id_priv->state == CMA_DESTROYING ||
+	    id_priv->state == CMA_DEVICE_REMOVAL)
+		goto out;
+
+	if (id_priv->id.event_handler(&id_priv->id, &work->event)) {
+		cma_exch(id_priv, CMA_DESTROYING);
+		destroy = 1;
+	}
+
+out:
+	mutex_unlock(&id_priv->handler_mutex);
+	cma_deref_id(id_priv);
+	if (destroy)
+		rdma_destroy_id(&id_priv->id);
+	kfree(work);
+}
+
 static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms)
 {
 	struct rdma_route *route = &id_priv->id.route;
@@ -2723,6 +2753,65 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr)
 }
 EXPORT_SYMBOL(rdma_leave_multicast);

+static int cma_netdev_change(struct net_device *ndev, struct rdma_id_private *id_priv)
+{
+	struct rdma_dev_addr *dev_addr;
+	struct cma_ndev_work *work;
+
+	dev_addr = &id_priv->id.route.addr.dev_addr;
+
+	if ((dev_addr->src_dev == ndev) &&
+	    memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) {
+		printk(KERN_INFO "RDMA CM addr change for ndev %s used by id %p\n",
+		       ndev->name, &id_priv->id);
+		work = kzalloc(sizeof *work, GFP_KERNEL);
+		if (!work)
+			return -ENOMEM;
+
+		INIT_WORK(&work->work, cma_ndev_work_handler);
+		work->id = id_priv;
+		work->event.event = RDMA_CM_EVENT_ADDR_CHANGE;
+		atomic_inc(&id_priv->refcount);
+		queue_work(cma_wq, &work->work);
+	}
+
+	return 0;
+}
+
+static int cma_netdev_callback(struct notifier_block *self, unsigned long event,
+			       void *ctx)
+{
+	struct net_device *ndev = (struct net_device *)ctx;
+	struct cma_device *cma_dev;
+	struct rdma_id_private *id_priv;
+	int ret = NOTIFY_DONE;
+
+	if (dev_net(ndev) != &init_net)
+		return NOTIFY_DONE;
+
+	if (event != NETDEV_BONDING_FAILOVER)
+		return NOTIFY_DONE;
+
+	if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING))
+		return NOTIFY_DONE;
+
+	mutex_lock(&lock);
+	list_for_each_entry(cma_dev, &dev_list, list)
+		list_for_each_entry(id_priv, &cma_dev->id_list, list) {
+			ret = cma_netdev_change(ndev, id_priv);
+			if (ret)
+				goto out;
+		}
+
+out:
+	mutex_unlock(&lock);
+	return ret;
+}
+
+static struct notifier_block cma_nb = {
+	.notifier_call = cma_netdev_callback
+};
+
 static void cma_add_one(struct ib_device *device)
 {
 	struct cma_device *cma_dev;
@@ -2831,6 +2920,7 @@ static int cma_init(void)

 	ib_sa_register_client(&sa_client);
 	rdma_addr_register_client(&addr_client);
+	register_netdevice_notifier(&cma_nb);

 	ret = ib_register_client(&cma_client);
 	if (ret)
@@ -2838,6 +2928,7 @@ static int cma_init(void)
 	return 0;

 err:
+	unregister_netdevice_notifier(&cma_nb);
 	rdma_addr_unregister_client(&addr_client);
 	ib_sa_unregister_client(&sa_client);
 	destroy_workqueue(cma_wq);
@@ -2847,6 +2938,7 @@ err:
 static void cma_cleanup(void)
 {
 	ib_unregister_client(&cma_client);
+	unregister_netdevice_notifier(&cma_nb);
 	rdma_addr_unregister_client(&addr_client);
 	ib_sa_unregister_client(&sa_client);
 	destroy_workqueue(cma_wq);
diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h
index 22bb2e7..001d606 100644
--- a/include/rdma/rdma_cm.h
+++ b/include/rdma/rdma_cm.h
@@ -57,7 +57,8 @@ enum rdma_cm_event_type {
 	RDMA_CM_EVENT_DISCONNECTED,
 	RDMA_CM_EVENT_DEVICE_REMOVAL,
 	RDMA_CM_EVENT_MULTICAST_JOIN,
-	RDMA_CM_EVENT_MULTICAST_ERROR
+	RDMA_CM_EVENT_MULTICAST_ERROR,
+	RDMA_CM_EVENT_ADDR_CHANGE
 };

 enum rdma_port_space {


