[rds-devel] Re: [PATCH RFC RDS/IB] version 2. increase number of unsignaled work requests

Tue Jan 23 11:00:59 PST 2007

> +extern unsigned long rds_ib_sysctl_max_unsig_wrs;

I still hope we don't need this.  What happens if we *don't* have  
this tunable?  Why would anyone care?

> +	ret = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
> +	if (ret) {
> +		rdsdebug("ib_req_notify_cq send failed: %d\n", ret);
> +	}

The (arbitrary, I know) kernel coding style asks that this not have  
braces.

> +		if (oldest <= (unsigned long long)wc.wr_id)
> +			completed = (unsigned long long)wc.wr_id - oldest + 1;
> +		else
> +			completed = ic->i_send_ring.w_nr - oldest + (unsigned long long) 
> wc.wr_id + 1;

This calculation belongs in a helper function, probably marked inline.

> -		rds_ib_ring_free(&ic->i_send_ring, 1);
> +		for (; i < completed; i++) {
> +			if (send->s_rm)
> +				rds_ib_send_unmap_rm(ic, send);
> +			send->s_wr.num_sge = 1;
> +			if (++send == &ic->i_sends[ic->i_send_ring.w_nr])
> +				send = ic->i_sends;
> +			rds_ib_ring_free(&ic->i_send_ring, 1);
> +		}

I'm pretty sure we could just call rds_ib_ring_free(&.., completed);  
after having unmapped each send.  That'll result in less lock traffic.

>
>  		/* We expect errors as the qp is drained during shutdown */
>  		if (wc.status != IB_WC_SUCCESS && !ic->i_wc_err) {
> @@ -146,6 +161,7 @@
>  	u32 work_alloc;
>  	int sent;
>  	int ret;
> +	static u32 unsignaled_wrs_count;
>
>  	BUG_ON(off % RDS_FRAG_SIZE);
>  	BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header));
> @@ -155,6 +171,7 @@
>  		rm->m_count = dma_map_sg(ic->i_cm_id->device->dma_device,
>  					 rm->m_sg, rm->m_nents, DMA_TO_DEVICE);
>  		rdsdebug("ic %p mapping rm %p: %d\n", ic, rm, rm->m_count);
> +		unsignaled_wrs_count = 0;
>  		if (rm->m_count == 0) {
>  			rds_ib_stats_inc(s_ib_tx_sg_mapping_failure);
>  			ret = -ENOMEM; /* XXX ? */
> @@ -211,6 +228,15 @@
>  	/* if there's data reference it with a chain of work reqs */
>  	for(; i < work_alloc && scat != &rm->m_sg[rm->m_count]; i++) {
>
> +		++unsignaled_wrs_count;
> +
> +		if ( rds_ib_sysctl_max_unsig_wrs > 0 && unsignaled_wrs_count >=  
> rds_ib_sysctl_max_unsig_wrs ) {
> +			unsignaled_wrs_count = 0;
> +			send->s_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED;
> +		}
> +		else
> +			send->s_wr.send_flags = 0;
> +
>  		send->s_wr.next = NULL;
>  		if (prev)
>  			prev->s_wr.next = &send->s_wr;
> @@ -237,6 +263,7 @@
>  	/* if we finished the message then send completion owns it */
>  	if (scat == &rm->m_sg[rm->m_count]) {
>  		prev->s_rm = ic->i_rm;
> +		prev->s_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED;
>  		ic->i_rm = NULL;
>  	}

OK, again, this fails to set SIG|SOL on the final work request in the  
case where we consumed the ring but didn't finish the message.

Why can't we implement this with a single

   prev->s_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED;

before we post and get rid of all that noise tracking how many  
unsig_wrs there are?  Then we'd only raise a completion on the  
receiver in exactly two cases:

  1) the last fragment of a message

  2) the last fragment consumed as the sending ring became full

Am I missing something?

> -	ret = (ring->w_next_free + ring->w_nr_free) % ring->w_nr;
> +	if (ring->w_next_free + ring->w_nr_free < ring->w_nr)
> +		ret = ring->w_next_free + ring->w_nr_free;
> +	else
> +		ret = ring->w_next_free + ring->w_nr_free - ring->w_nr;

I'm going to guess that this is an optimization to prefer branching  
to division.  (A valid optimization on modern processors, I'd  
argue!).  Please don't make me guess in the future.  Describe all the  
changes in the patch description.  In any case, this separate change  
belongs in a separate patch.  Look at it this way, say the move to  
fewer completions tickles some bug somewhere and we have to revert  
it.  Why on earth should we also be forced to revert this  
optimization along with it?

- z