[rds-devel] Re: [PATCH RFC RDS/IB] version 2. increase number of unsignaled work requests

Vladimir Sokolovsky vlad at dev.mellanox.co.il
Wed Jan 24 04:43:12 PST 2007


Hi Zach,
See my answers below:

On Tue, 2007-01-23 at 11:00 -0800, Zach Brown wrote:
> > +extern unsigned long rds_ib_sysctl_max_unsig_wrs;
> 
> I still hope we don't need this.  What happens if we *don't* have  
> this tunable?  Why would anyone care?
> 
Same reason that you put:
extern unsigned long rds_ib_sysctl_max_send_wr;
extern unsigned long rds_ib_sysctl_max_recv_wr;

If somebody changes these values, I think they will also want to change
rds_ib_sysctl_max_unsig_wrs accordingly.
Also, I wanted to measure performance with different values of
rds_ib_sysctl_max_unsig_wrs in order to pick the optimal default, which
is currently PAGE_SIZE / (4 * sizeof(struct ib_send_wr)).

In any case, I can get rid of this and use rds_ib_sysctl_max_send_wr / 4
instead.
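
Roughly, that could look like this (a sketch only; the helper name is
mine, not part of the patch):

	/* Sketch: derive the unsignaled-WR threshold from the existing
	 * send-ring sysctl instead of adding a dedicated knob.  The
	 * divisor 4 mirrors the current default of
	 * PAGE_SIZE / (4 * sizeof(struct ib_send_wr)). */
	static inline unsigned long rds_ib_max_unsig_wrs(void)
	{
		return max(1UL, rds_ib_sysctl_max_send_wr / 4);
	}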

> > +	ret = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
> > +	if (ret) {
> > +		rdsdebug("ib_req_notify_cq send failed: %d\n", ret);
> > +	}
> 
> The (arbitrary, I know) kernel coding style asks that this not have  
> braces.
OK.

> 
> > +		if (oldest <= (unsigned long long)wc.wr_id)
> > +			completed = (unsigned long long)wc.wr_id - oldest + 1;
> > +		else
> > +			completed = ic->i_send_ring.w_nr - oldest + (unsigned long long) wc.wr_id + 1;
> 
> This calculation belongs in a helper function, probably marked inline.
> 
> > -		rds_ib_ring_free(&ic->i_send_ring, 1);
> > +		for (; i < completed; i++) {
> > +			if (send->s_rm)
> > +				rds_ib_send_unmap_rm(ic, send);
> > +			send->s_wr.num_sge = 1;
> > +			if (++send == &ic->i_sends[ic->i_send_ring.w_nr])
> > +				send = ic->i_sends;
> > +			rds_ib_ring_free(&ic->i_send_ring, 1);
> > +		}
> 
> I'm pretty sure we could just call rds_ib_ring_free(&.., completed);  
> after having unmapped each send.  That'll result in less lock traffic.
> 

I tried rds_ib_ring_free(&.., completed) and my server got stuck with no
memory in that case. There is probably a memory leak somewhere in
rds_ib_ring_free. With rds_ib_ring_free(&ic->i_send_ring, 1) it works
fine.
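
As for moving the completion-count calculation into an inline helper, it
could look like this (a sketch; the helper name and arguments are
illustrative, not from the patch):

	/* Sketch: how many sends this wc completes, given the oldest
	 * outstanding ring slot and the ring size, with wrap-around. */
	static inline u64 rds_ib_send_completed(u64 wr_id, u64 oldest, u32 w_nr)
	{
		if (oldest <= wr_id)
			return wr_id - oldest + 1;
		return w_nr - oldest + wr_id + 1;
	}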


> >
> >  		/* We expect errors as the qp is drained during shutdown */
> >  		if (wc.status != IB_WC_SUCCESS && !ic->i_wc_err) {
> > @@ -146,6 +161,7 @@
> >  	u32 work_alloc;
> >  	int sent;
> >  	int ret;
> > +	static u32 unsignaled_wrs_count;
> >
> >  	BUG_ON(off % RDS_FRAG_SIZE);
> >  	BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header));
> > @@ -155,6 +171,7 @@
> >  		rm->m_count = dma_map_sg(ic->i_cm_id->device->dma_device,
> >  					 rm->m_sg, rm->m_nents, DMA_TO_DEVICE);
> >  		rdsdebug("ic %p mapping rm %p: %d\n", ic, rm, rm->m_count);
> > +		unsignaled_wrs_count = 0;
> >  		if (rm->m_count == 0) {
> >  			rds_ib_stats_inc(s_ib_tx_sg_mapping_failure);
> >  			ret = -ENOMEM; /* XXX ? */
> > @@ -211,6 +228,15 @@
> >  	/* if there's data reference it with a chain of work reqs */
> >  	for(; i < work_alloc && scat != &rm->m_sg[rm->m_count]; i++) {
> >
> > +		++unsignaled_wrs_count;
> > +
> > +		if ( rds_ib_sysctl_max_unsig_wrs > 0 && unsignaled_wrs_count >= rds_ib_sysctl_max_unsig_wrs ) {
> > +			unsignaled_wrs_count = 0;
> > +			send->s_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED;
> > +		}
> > +		else
> > +			send->s_wr.send_flags = 0;
> > +
> >  		send->s_wr.next = NULL;
> >  		if (prev)
> >  			prev->s_wr.next = &send->s_wr;
> > @@ -237,6 +263,7 @@
> >  	/* if we finished the message then send completion owns it */
> >  	if (scat == &rm->m_sg[rm->m_count]) {
> >  		prev->s_rm = ic->i_rm;
> > +		prev->s_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED;
> >  		ic->i_rm = NULL;
> >  	}
> 
> OK, again, this fails to set SIG|SOL on the final work request in the  
> case where we consumed the ring but didn't finish the message.
> 
> Why can't we implement this with a single
> 
>    prev->s_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED;
> 
> before we post and get rid of all that noise tracking how many  
> unsig_wrs there are?  Then we'd only raise a completion on the  
> receiver in exactly two cases:
> 
>   1) the last fragment of a message
> 
>   2) the last fragment consumed as the sending ring became full
> 
> Am I missing something?

This scheme would not work well for large messages because it does not
allow for pipelining. For example, consider a stream of messages, each
of which consumes the whole descriptor ring.

In this case, you can only have a single message in flight! In contrast,
by signaling a completion every few fragments, you can start posting the
next message as soon as the first signaled fragment of the previous one
completes, instead of waiting for the entire previous message to finish.

As for signaling the last fragment in each send attempt (i.e., when the
ring fills up before we complete the message), this can also lead to
unnecessarily erratic behavior. For example, consider the case in which
multiple short messages are followed by a single large one. We succeed
in posting the small messages but get interrupted partway through the
large one because the ring is exhausted. From then on, each completed
short message frees just enough ring space to post a single additional
fragment of the large message, and each such fragment would be signaled
unnecessarily, thereby reducing bandwidth.
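
To make the intended policy concrete, the per-fragment decision in the
patch boils down to this (a simplified sketch of the diff above, not new
code):

	++unsignaled_wrs_count;

	/* Request a completion every rds_ib_sysctl_max_unsig_wrs
	 * fragments... */
	if (rds_ib_sysctl_max_unsig_wrs > 0 &&
	    unsignaled_wrs_count >= rds_ib_sysctl_max_unsig_wrs) {
		unsignaled_wrs_count = 0;
		send->s_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED;
	} else
		send->s_wr.send_flags = 0;

	/* ...and always on the final fragment of the message: */
	prev->s_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED;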

> 
> > -	ret = (ring->w_next_free + ring->w_nr_free) % ring->w_nr;
> > +	if (ring->w_next_free + ring->w_nr_free < ring->w_nr)
> > +		ret = ring->w_next_free + ring->w_nr_free;
> > +	else
> > +		ret = ring->w_next_free + ring->w_nr_free - ring->w_nr;
> 
> I'm going to guess that this is an optimization to prefer branching  
> to division.  (A valid optimization on modern processors, I'd  
> argue!).  Please don't make me guess in the future.  Describe all the  
> changes in the patch description.  In any case, this separate change  
> belongs in a separate patch.  Look at it this way, say the move to  
> fewer completions tickles some bug somewhere and we have to revert  
> it.  Why on earth should we also be forced to revert this  
> optimization along with it?

OK, I will drop this change.
> 
> - z

Thanks,

-- 
Vladimir Sokolovsky <vlad at dev.mellanox.co.il>
Mellanox Technologies Ltd.


