[rds-devel] deploying ADDR_CHANGE in RDS

Richard Frank richard.frank at oracle.com
Wed Aug 13 07:36:34 PDT 2008


Or, do you have a patch to enable fail-back processing in the RDS
driver when running with Linux bonding?

Can you post one?

Or Gerlitz wrote:
> As of 2.6.27 (and backported into OFED 1.4), the bonding driver announces
> each fail-over it performs by delivering a NETDEV_BONDING_FAILOVER event.
> On top of that, the RDMA-CM may deliver a callback to its consumers with
> the RDMA_CM_EVENT_ADDR_CHANGE event.
>
> A possible design which can be built on top of this is a scheme where:
>
> 1) you set a primary device for a bond
>
> 2) in the steady state, all RDMA traffic goes through the link
> associated with this net device. Once there's a failure, bonding fails
> over, the RDMA connection gets broken, the app reconnects (an ARP is sent
> through the secondary port, etc.) and a new connection is established.
>
> 3) once the problem is fixed, bonding does "fail-back" since it has a
> primary definition, but the RDMA connection is not broken, so it stays on
> the secondary link
>
> 4) the rdma cm generates an ADDR CHANGE event as a hint to the RDMA
> ULP (eg RDS) that its connection is "not aligned with the IP stack", so
> it should reconnect
>
> Below are the three most relevant patches that were merged and allow this.
>
> The approach I suggest for RDS is simply to reconnect when getting an
> ADDR CHANGE rdma-cm event or, if doing it "now" is not optimal, to schedule
> some task which attempts to do it when possible.
>
> Once you have this logic in place, you can build a load balancing scheme
> which provides HA as well - define two bonds each with a primary that
> relates to a different link (hca/port) and act on ADDR CHANGE events:
>
> The app maintains two connections per peer such that in the steady state
> each connection uses a different link, and under failover they both use
> the same link. When the problem is fixed, ONLY one connection will get an
> addr change event, and it can revert to the link it should run on.
>
> thoughts?
>
> Or
>
> commit c1da4ac752b8b0411791d26c678fcf23d2eed242
> Author: Or Gerlitz <ogerlitz at voltaire.com>
> Date:   Fri Jun 13 18:12:00 2008 -0700
>
>     net/core: add NETDEV_BONDING_FAILOVER event
>
>     Add NETDEV_BONDING_FAILOVER event to be used in a successive patch
>     by bonding to announce fail-over for the active-backup mode through the
>     netdev events notifier chain mechanism. Such an event can be of use for the
>     RDMA CM (communication manager) to let native RDMA ULPs (eg NFS-RDMA, iSER)
>     always be aligned with the IP stack, in the sense that they use the same
>     ports/links as the stack does. More usages can be done to allow monitoring
>     tools based on netlink events being aware to bonding fail-over.
>
>     Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
>     Signed-off-by: Jay Vosburgh <fubar at us.ibm.com>
>     Signed-off-by: Jeff Garzik <jgarzik at redhat.com>
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f27fd20..e92fc83 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1479,6 +1479,7 @@ extern void		__dev_addr_unsync(struct dev_addr_list **to, int *to_count, struct
>  extern void		dev_set_promiscuity(struct net_device *dev, int inc);
>  extern void		dev_set_allmulti(struct net_device *dev, int inc);
>  extern void		netdev_state_change(struct net_device *dev);
> +extern void		netdev_bonding_change(struct net_device *dev);
>  extern void		netdev_features_change(struct net_device *dev);
>  /* Load a device via the kmod */
>  extern void		dev_load(struct net *net, const char *name);
> diff --git a/include/linux/notifier.h b/include/linux/notifier.h
> index 0ff6224..bd3d72d 100644
> --- a/include/linux/notifier.h
> +++ b/include/linux/notifier.h
> @@ -197,6 +197,7 @@ static inline int notifier_to_errno(int ret)
>  #define NETDEV_GOING_DOWN	0x0009
>  #define NETDEV_CHANGENAME	0x000A
>  #define NETDEV_FEAT_CHANGE	0x000B
> +#define NETDEV_BONDING_FAILOVER 0x000C
>
>  #define SYS_DOWN	0x0001	/* Notify of system down */
>  #define SYS_RESTART	SYS_DOWN
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 68d8df0..0e45742 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -961,6 +961,12 @@ void netdev_state_change(struct net_device *dev)
>  	}
>  }
>
> +void netdev_bonding_change(struct net_device *dev)
> +{
> +	call_netdevice_notifiers(NETDEV_BONDING_FAILOVER, dev);
> +}
> +EXPORT_SYMBOL(netdev_bonding_change);
> +
>  /**
>   *	dev_load 	- load a network module
>   *	@net: the applicable net namespace
>
>
> commit 01f3109de49a889db8adf9116449727547ee497e
> Author: Or Gerlitz <ogerlitz at voltaire.com>
> Date:   Fri Jun 13 18:12:02 2008 -0700
>
>     bonding: deliver netdev event for fail-over under the active-backup mode
>
>     under active-backup mode and when there's actual new_active slave,
>     have bond_change_active_slave() call the networking core to deliver
>     NETDEV_BONDING_FAILOVER event such that the fail-over can be notable
>     by code outside of the bonding driver such as the RDMA stack and
>     monitoring tools.
>
>     As the correct context of locking appropriate for notifier calls is RTNL
>     and nothing else, bond->curr_slave_lock and bond->lock are unlocked and
>     later locked again. This is ensured by the rest of the code to be safe
>     under backup-mode AND when new_active is not NULL.
>
>     Jay Vosburgh modified the original patch for formatting and fixed a
>     compiler error.
>
>     Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
>     Signed-off-by: Jay Vosburgh <fubar at us.ibm.com>
>     Signed-off-by: Jeff Garzik <jgarzik at redhat.com>
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 2db2d05..925402b 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -1203,6 +1203,14 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
>  				dprintk("delaying gratuitous arp on %s\n",
>  					bond->curr_active_slave->dev->name);
>  			}
> +
> +			write_unlock_bh(&bond->curr_slave_lock);
> +			read_unlock(&bond->lock);
> +
> +			netdev_bonding_change(bond->dev);
> +
> +			read_lock(&bond->lock);
> +			write_lock_bh(&bond->curr_slave_lock);
>  		}
>  	}
>  }
>
> commit dd5bdff83b19d9174126e0398b47117c3a80e22d
> Author: Or Gerlitz <ogerlitz at voltaire.com>
> Date:   Tue Jul 22 14:14:22 2008 -0700
>
>     RDMA/cma: Add RDMA_CM_EVENT_ADDR_CHANGE event
>
>     Add an RDMA_CM_EVENT_ADDR_CHANGE event can be used by rdma-cm
>     consumers that wish to have their RDMA sessions always use the same
>     links (eg <hca/port>) as the IP stack does.  In the current code, this
>     does not happen when bonding is used and fail-over happened but the IB
>     link used by an already existing session is operating fine.
>
>     Use the netevent notification for sensing that a change has happened
>     in the IP stack, then scan the rdma-cm ID list to see if there is an
>     ID that is "misaligned" with respect to the IP stack, and deliver
>     RDMA_CM_EVENT_ADDR_CHANGE for this ID.  The consumer can act on the
>     event or just ignore it.
>
>     Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
>     Signed-off-by: Roland Dreier <rolandd at cisco.com>
>
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index ae11d5c..79792c9 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -168,6 +168,12 @@ struct cma_work {
>  	struct rdma_cm_event	event;
>  };
>
> +struct cma_ndev_work {
> +	struct work_struct	work;
> +	struct rdma_id_private	*id;
> +	struct rdma_cm_event	event;
> +};
> +
>  union cma_ip_addr {
>  	struct in6_addr ip6;
>  	struct {
> @@ -1598,6 +1604,30 @@ out:
>  	kfree(work);
>  }
>
> +static void cma_ndev_work_handler(struct work_struct *_work)
> +{
> +	struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, work);
> +	struct rdma_id_private *id_priv = work->id;
> +	int destroy = 0;
> +
> +	mutex_lock(&id_priv->handler_mutex);
> +	if (id_priv->state == CMA_DESTROYING ||
> +	    id_priv->state == CMA_DEVICE_REMOVAL)
> +		goto out;
> +
> +	if (id_priv->id.event_handler(&id_priv->id, &work->event)) {
> +		cma_exch(id_priv, CMA_DESTROYING);
> +		destroy = 1;
> +	}
> +
> +out:
> +	mutex_unlock(&id_priv->handler_mutex);
> +	cma_deref_id(id_priv);
> +	if (destroy)
> +		rdma_destroy_id(&id_priv->id);
> +	kfree(work);
> +}
> +
>  static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms)
>  {
>  	struct rdma_route *route = &id_priv->id.route;
> @@ -2723,6 +2753,65 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr)
>  }
>  EXPORT_SYMBOL(rdma_leave_multicast);
>
> +static int cma_netdev_change(struct net_device *ndev, struct rdma_id_private *id_priv)
> +{
> +	struct rdma_dev_addr *dev_addr;
> +	struct cma_ndev_work *work;
> +
> +	dev_addr = &id_priv->id.route.addr.dev_addr;
> +
> +	if ((dev_addr->src_dev == ndev) &&
> +	    memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) {
> +		printk(KERN_INFO "RDMA CM addr change for ndev %s used by id %p\n",
> +		       ndev->name, &id_priv->id);
> +		work = kzalloc(sizeof *work, GFP_KERNEL);
> +		if (!work)
> +			return -ENOMEM;
> +
> +		INIT_WORK(&work->work, cma_ndev_work_handler);
> +		work->id = id_priv;
> +		work->event.event = RDMA_CM_EVENT_ADDR_CHANGE;
> +		atomic_inc(&id_priv->refcount);
> +		queue_work(cma_wq, &work->work);
> +	}
> +
> +	return 0;
> +}
> +
> +static int cma_netdev_callback(struct notifier_block *self, unsigned long event,
> +			       void *ctx)
> +{
> +	struct net_device *ndev = (struct net_device *)ctx;
> +	struct cma_device *cma_dev;
> +	struct rdma_id_private *id_priv;
> +	int ret = NOTIFY_DONE;
> +
> +	if (dev_net(ndev) != &init_net)
> +		return NOTIFY_DONE;
> +
> +	if (event != NETDEV_BONDING_FAILOVER)
> +		return NOTIFY_DONE;
> +
> +	if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING))
> +		return NOTIFY_DONE;
> +
> +	mutex_lock(&lock);
> +	list_for_each_entry(cma_dev, &dev_list, list)
> +		list_for_each_entry(id_priv, &cma_dev->id_list, list) {
> +			ret = cma_netdev_change(ndev, id_priv);
> +			if (ret)
> +				goto out;
> +		}
> +
> +out:
> +	mutex_unlock(&lock);
> +	return ret;
> +}
> +
> +static struct notifier_block cma_nb = {
> +	.notifier_call = cma_netdev_callback
> +};
> +
>  static void cma_add_one(struct ib_device *device)
>  {
>  	struct cma_device *cma_dev;
> @@ -2831,6 +2920,7 @@ static int cma_init(void)
>
>  	ib_sa_register_client(&sa_client);
>  	rdma_addr_register_client(&addr_client);
> +	register_netdevice_notifier(&cma_nb);
>
>  	ret = ib_register_client(&cma_client);
>  	if (ret)
> @@ -2838,6 +2928,7 @@ static int cma_init(void)
>  	return 0;
>
>  err:
> +	unregister_netdevice_notifier(&cma_nb);
>  	rdma_addr_unregister_client(&addr_client);
>  	ib_sa_unregister_client(&sa_client);
>  	destroy_workqueue(cma_wq);
> @@ -2847,6 +2938,7 @@ err:
>  static void cma_cleanup(void)
>  {
>  	ib_unregister_client(&cma_client);
> +	unregister_netdevice_notifier(&cma_nb);
>  	rdma_addr_unregister_client(&addr_client);
>  	ib_sa_unregister_client(&sa_client);
>  	destroy_workqueue(cma_wq);
> diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h
> index 22bb2e7..001d606 100644
> --- a/include/rdma/rdma_cm.h
> +++ b/include/rdma/rdma_cm.h
> @@ -57,7 +57,8 @@ enum rdma_cm_event_type {
>  	RDMA_CM_EVENT_DISCONNECTED,
>  	RDMA_CM_EVENT_DEVICE_REMOVAL,
>  	RDMA_CM_EVENT_MULTICAST_JOIN,
> -	RDMA_CM_EVENT_MULTICAST_ERROR
> +	RDMA_CM_EVENT_MULTICAST_ERROR,
> +	RDMA_CM_EVENT_ADDR_CHANGE
>  };
>
>  enum rdma_port_space {
>
>   
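As an illustration of the approach suggested in the quoted message (reconnect on an ADDR CHANGE event, or defer the reconnect if doing it "now" is not optimal), here is a minimal userspace sketch in C. It is a model only, not RDS or RDMA-CM code: the names `cm_event`, `conn`, `conn_handle_event`, `conn_reconnect` and `conn_idle` are hypothetical, and the "reconnect" just flips a state field where a real ULP would tear down and re-resolve the connection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event names; they only mirror the terms used in the thread. */
enum cm_event { EV_ESTABLISHED, EV_DISCONNECTED, EV_ADDR_CHANGE };
enum conn_state { CONN_DOWN, CONN_UP };

struct conn {
	enum conn_state state;
	bool busy;              /* true while reconnecting "now" is not optimal */
	bool reconnect_pending; /* a reconnect was requested and deferred */
	int reconnects;         /* number of reconnects performed so far */
};

/* Model of a reconnect: drop the old path, then come back up. */
static void conn_reconnect(struct conn *c)
{
	assert(c != NULL);
	c->state = CONN_DOWN;   /* a real ULP would disconnect here */
	c->state = CONN_UP;     /* ...and re-resolve address/route, then connect */
	c->reconnect_pending = false;
	c->reconnects++;
}

/* Event handler: ADDR_CHANGE triggers a reconnect, immediate or deferred. */
static void conn_handle_event(struct conn *c, enum cm_event ev)
{
	switch (ev) {
	case EV_ESTABLISHED:
		c->state = CONN_UP;
		break;
	case EV_DISCONNECTED:
		c->state = CONN_DOWN;
		break;
	case EV_ADDR_CHANGE:
		if (c->busy)
			c->reconnect_pending = true; /* do it later */
		else
			conn_reconnect(c);
		break;
	}
}

/* Called when the connection becomes idle; runs any deferred reconnect. */
static void conn_idle(struct conn *c)
{
	c->busy = false;
	if (c->reconnect_pending)
		conn_reconnect(c);
}
```

In this model an idle connection reconnects as soon as the event arrives, while a busy one only records the request and reconnects from `conn_idle()`. In a kernel ULP the deferred path would more likely be a work item on a workqueue, which is an assumption here rather than something the thread specifies.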


