[rds-devel] [PATCH RFC] RDS/IB+IW: Yet Another Stall (YAS :)

Andy Grover andy.grover at oracle.com
Thu Feb 5 16:32:16 PST 2009


Hi Steve,

I'll apply this, and also make the same change in ib transport. It fixes 
the issue, and I don't immediately see a way to do this in a more 
elegant manner.

Thanks.

Regards -- Andy

Steve Wise wrote:
> From: Steve Wise <swise at opengridcomputing.com>
> 
> Found another flow control stall.  I can reproduce it by running tkmivrpc
> on 3 nodes, 4 separate jobs of 4 senders and 4 receivers in each job
> (32 processes on each node).
> 
> Intermittently, it will hit this stall between two of the three nodes:
> 
> One node, the sender, looks like this:
> - pending RDS messages to send
> - transport send ring empty (due to no send credits)
> - 0 send credits
> - sq empty and no pending completions
> - no pending rq completions
> 
> The other node, the receiver, looks like this:
> - No RDS messages to send
> - In the middle of a fragmented receive
> - No "IB_ACK_REQUESTED" flag set
> - lots of posted credits to advertise
> - sq empty and no pending completions
> - no pending rq completions
> 
> So the connection is stalled until the receiver decides to send something.
> 
> BTW: If you rds-ping from the receiver to the sender, it unstalls the
> sender because the rds-ping messages advertise credits and get things
> going again.
> 
> Here's how the stall can happen:
> 
> Time 	Event
> -----------------
> t1	rds_iw_xmit() grabs send and posted credits, and the posted
> 	count at this point is 0.
> 
> t2	recv thread refills the recv ring, bumping up posted credits
> 	and sets IB_ACK_REQUESTED
> 
> t3	rds_iw_xmit() resets ACK_REQUESTED because it will piggy back 
> 	the ACK on outgoing data. 
> 
> t4	rds_iw_xmit posts send with 0 adv_credits from the original 
> 	grab done at t1
> 
> t5	send completion doesn't do anything special.  Namely it doesn't 
> 	attempt to advertise credits
> 
> At this point, we can stall the peer because no more credits are
> advertised to the peer unless more outgoing data is transmitted.
> 
> The solution is to refetch the posted credits when the ack is going to
> be piggy backed at time t3.
> 
> Below is a patch that solves it for iWARP.  But it should probably be
> cleaned up some:
> 
> - resolve the 255 -> 127 hack I added to avoid overflows since we grab
> posted credits twice.  This could be solved by passing in a limit to
> grab so that it only grabs up to 255.
> 
> - instead of overloading rds_iw_send_grab_credits(), maybe we should
> define a new service for getting post credits.  It will have to be
> implemented similarly to the grab function (with the goto and the
> cmpxchg).
> 
> Signed-off-by: Steve Wise <swise at opengridcomputing.com>
> ---
> 
>  drivers/infiniband/ulp/rds/iw.h      |    2 +-
>  drivers/infiniband/ulp/rds/iw_recv.c |    2 +-
>  drivers/infiniband/ulp/rds/iw_send.c |   20 +++++++++++++++++---
>  drivers/infiniband/ulp/rds/rds.h     |    2 +-
>  4 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/rds/iw.h b/drivers/infiniband/ulp/rds/iw.h
> index 600d57f..8b7f01b 100644
> --- a/drivers/infiniband/ulp/rds/iw.h
> +++ b/drivers/infiniband/ulp/rds/iw.h
> @@ -349,7 +349,7 @@ int rds_iw_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op);
>  void rds_iw_send_add_credits(struct rds_connection *conn, unsigned int credits);
>  void rds_iw_advertise_credits(struct rds_connection *conn, unsigned int posted);
>  int rds_iw_send_grab_credits(struct rds_iw_connection *ic, u32 wanted,
> -			     u32 *adv_credits);
> +			     u32 *adv_credits, int need_posted);
>  
>  /* ib_stats.c */
>  DECLARE_PER_CPU(struct rds_iw_statistics, rds_iw_stats);
> diff --git a/drivers/infiniband/ulp/rds/iw_recv.c b/drivers/infiniband/ulp/rds/iw_recv.c
> index 385f239..05bccc5 100644
> --- a/drivers/infiniband/ulp/rds/iw_recv.c
> +++ b/drivers/infiniband/ulp/rds/iw_recv.c
> @@ -524,7 +524,7 @@ void rds_iw_attempt_ack(struct rds_iw_connection *ic)
>  	}
>  
>  	/* Can we get a send credit? */
> -	if (!rds_iw_send_grab_credits(ic, 1, &adv_credits)) {
> +	if (!rds_iw_send_grab_credits(ic, 1, &adv_credits, 0)) {
>  		rds_iw_stats_inc(s_iw_tx_throttle);
>  		clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags);
>  		return;
> diff --git a/drivers/infiniband/ulp/rds/iw_send.c b/drivers/infiniband/ulp/rds/iw_send.c
> index 104662e..7b642cb 100644
> --- a/drivers/infiniband/ulp/rds/iw_send.c
> +++ b/drivers/infiniband/ulp/rds/iw_send.c
> @@ -351,7 +351,7 @@ void rds_iw_send_cq_comp_handler(struct ib_cq *cq, void *context)
>   * and using atomic_cmpxchg when updating the two counters.
>   */
>  int rds_iw_send_grab_credits(struct rds_iw_connection *ic,
> -			     u32 wanted, u32 *adv_credits)
> +			     u32 wanted, u32 *adv_credits, int need_posted)
>  {
>  	unsigned int avail, posted, got = 0, advertise;
>  	long oldval, newval;
> @@ -385,7 +385,12 @@ try_again:
>  	}
>  	newval -= IB_SET_SEND_CREDITS(got);
>  
> -	if (got && posted) {
> +	/*
> +	 * If need_posted is non-zero, then the caller wants
> +	 * the posted regardless of whether any send credits are
> +	 * available.
> +	 */
> +	if (posted && (got || need_posted)) {
>  		advertise = min_t(unsigned int, posted, RDS_MAX_ADV_CREDIT);
>  		newval -= IB_SET_POST_CREDITS(advertise);
>  	}
> @@ -507,6 +512,7 @@ int rds_iw_xmit(struct rds_connection *conn, struct rds_message *rm,
>  	u32 i;
>  	u32 work_alloc;
>  	u32 credit_alloc;
> +	u32 posted;
>  	u32 adv_credits = 0;
>  	int send_flags = 0;
>  	int sent;
> @@ -539,7 +545,8 @@ int rds_iw_xmit(struct rds_connection *conn, struct rds_message *rm,
>  
>  	credit_alloc = work_alloc;
>  	if (ic->i_flowctl) {
> -		credit_alloc = rds_iw_send_grab_credits(ic, work_alloc, &adv_credits);
> +		credit_alloc = rds_iw_send_grab_credits(ic, work_alloc, &posted, 0);
> +		adv_credits += posted;
>  		if (credit_alloc < work_alloc) {
>  			rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc - credit_alloc);
>  			work_alloc = credit_alloc;
> @@ -607,6 +614,13 @@ int rds_iw_xmit(struct rds_connection *conn, struct rds_message *rm,
>  		 * should call rds_iw_ring_alloc first. */
>  		rm->m_inc.i_hdr.h_ack = cpu_to_be64(rds_iw_piggyb_ack(ic));
>  		rds_message_make_checksum(&rm->m_inc.i_hdr);
> +	
> +		/*
> +		 * Update adv_credits since we reset the ACK_REQUIRED bit.
> +		 */
> +		rds_iw_send_grab_credits(ic, 0, &posted, 1);
> +		adv_credits += posted;
> +		BUG_ON(adv_credits > 255);
>  	} else if (ic->i_rm != rm)
>  		BUG();
>  
> diff --git a/drivers/infiniband/ulp/rds/rds.h b/drivers/infiniband/ulp/rds/rds.h
> index ea5b128..0a88b8e 100644
> --- a/drivers/infiniband/ulp/rds/rds.h
> +++ b/drivers/infiniband/ulp/rds/rds.h
> @@ -195,7 +195,7 @@ struct rds_connection {
>  #define RDS_FLAG_CONG_BITMAP	0x01
>  #define RDS_FLAG_ACK_REQUIRED	0x02
>  #define RDS_FLAG_RETRANSMITTED	0x04
> -#define RDS_MAX_ADV_CREDIT	255
> +#define RDS_MAX_ADV_CREDIT	127
>  
>  /*
>   * Maximum space available for extension headers.



More information about the rds-devel mailing list