[rds-devel] RDS IB transport software flow control?

Richard Frank richard.frank at oracle.com
Wed Nov 7 03:48:10 PST 2007


<<
However, it may be worth at least implementing the c_cong_queue,
and use it to queue messages an application may send to a congested
destination. Right now, we have this very bad interaction between
congestion and poll - any congestion map update will essentially
wake up all processes waiting in poll. Thundering herd all over...
>>

If I'm understanding this correctly - one problem would be that so_sndbuf could be consumed
by sends to a single congested destination - which would block sending to all the other
un-congested destinations...?
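
As a toy illustration of that concern (purely illustrative C, not RDS code; all
names are made up): if bytes queued for every destination are charged against a
single shared so_sndbuf budget, messages parked for one congested peer can use it
up and stall sends to perfectly healthy peers as well.

/* Illustrative model only -- not RDS kernel code.  One shared sndbuf
 * budget across all destinations: bytes parked for a congested peer
 * still count against it, so sends to healthy peers block too. */
#include <stdbool.h>
#include <stdio.h>

#define SO_SNDBUF_BYTES (64 * 1024)

static long sndbuf_used;	/* bytes queued across *all* destinations */

static bool try_queue(const char *dest, long len, bool dest_congested)
{
	if (sndbuf_used + len > SO_SNDBUF_BYTES)
		return false;		/* no room: sender blocks / EWOULDBLOCK */
	sndbuf_used += len;		/* charged whether congested or not */
	printf("queued %ld bytes for %s%s\n", len, dest,
	       dest_congested ? " (parked, destination congested)" : "");
	return true;
}

int main(void)
{
	/* Fill the whole budget with traffic parked for a congested peer... */
	while (try_queue("congested-peer", 8192, true))
		;
	/* ...and a send to an un-congested destination now blocks as well. */
	if (!try_queue("healthy-peer", 4096, false))
		printf("send to healthy-peer blocked: sndbuf exhausted\n");
	return 0;
}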




Olaf Kirch wrote:
> On Tuesday 06 November 2007 19:26, Richard Frank wrote:
>   
>> Memory consumption is bounded by so_sndbuf + so_rcvbuf * number of 
>> active endpoints.
>>     
>
> Mostly, yes. Except for two problems.
>
> One, if you just use so_rcvbuf to bound your memory consumption,
> every process will have to be given max_amount_of_memory / num_processes,
> which is utterly fair but doesn't allow for peaks in network traffic.
> This is why TCP has these tcp_rmem tunables which provide a global
> upper bound across all sockets. One of my recent patches introduced
> a global RX limit very similar to that.
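
For illustration, a toy model of that kind of bound (the names and limits below
are invented, not the actual patch): each receive allocation is charged against
both a per-socket rcvbuf and a machine-wide counter, so one socket can absorb a
burst well beyond its "fair share" as long as the global budget has room.

/* Illustrative sketch only -- not the real patch.  tcp_rmem-style scheme:
 * per-socket bound plus a global cap shared by all RDS sockets. */
#include <stdbool.h>
#include <stddef.h>

#define GLOBAL_RX_LIMIT (16 * 1024 * 1024)	/* hypothetical system-wide cap */
#define PER_SOCK_RCVBUF (256 * 1024)		/* hypothetical per-socket cap */

static size_t global_rx_used;

struct toy_sock {
	size_t rcvbuf_used;
};

static bool rx_buf_allowed(struct toy_sock *sk, size_t len)
{
	if (sk->rcvbuf_used + len > PER_SOCK_RCVBUF)
		return false;			/* socket over its own bound */
	if (global_rx_used + len > GLOBAL_RX_LIMIT)
		return false;			/* machine-wide bound hit */
	sk->rcvbuf_used += len;
	global_rx_used += len;
	return true;
}

int main(void)
{
	struct toy_sock a = { 0 };

	/* One busy socket may take up to PER_SOCK_RCVBUF even if that is far
	 * more than GLOBAL_RX_LIMIT / number_of_sockets, provided the global
	 * counter still has room. */
	while (rx_buf_allowed(&a, 4096))
		;
	return 0;
}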
>
> Second, there's a delay in there:
>
>  -	several peers sending to the same socket
>  -	recvbuf fills up, we flag congestion
>  -	peers keep sending
>  -	congestion map is transmitted
>  -	peers receive congestion map
>  -	peers do not allow apps to queue more packets
> 	for the congested destination, but drain their
> 	send queues
>
> So you have a delay of several milliseconds during which clients
> can keep sending - and the receiver keeps allocating recvbufs for
> those.
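
To put a rough (purely illustrative) number on it: at IB rates on the order of
1 GB/s, every 2 ms of that window is about 2 MB of freshly allocated recvbufs
per sending peer, multiplied by however many peers are hammering the same socket.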
>
> On the other hand, it's hard to think of something better.
> Or Gerlitz referred to the credit-based system of SDP, but SDP
> is a stream protocol where you have one pipe with a single sender/
> consumer at each end.
>
> With RDS, the prevalent mode of operation is the disconnected
> socket, where you have no idea who is going to send how much to
> whom, so it's rather difficult to assign credit in advance,
> without doing something silly like so_rcvbuf / number_of_ib_connections.
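
To make the "silly" part concrete with made-up numbers: a 256 KB so_rcvbuf split
evenly across 128 IB connections leaves each peer a standing credit of only 2 KB,
regardless of which peers actually have anything to send.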
>
> It's probably worth thinking about a way that allows the
> receiver to drop packets that cannot be delivered because
> the queue is full. This would require some sort of NACK packet
> that tells the peer that some particular packet was not accepted
> and needs to be postponed/retransmitted:
>
>  -	When the receiver drops a packet because the queue is
> 	full, it immediately sends a NACK that contains the
> 	sequence number of the packet that was dropped.
>
> 	To be examined - is there any situation where a NACK can
> 	get lost, e.g. during HCA failover?
>
>  -	When the sender receives a NACK, it puts the rejected
> 	message (and any other messages for that port) onto 
> 	a special queue (conn->c_cong_queue), and marks the
> 	destination port in the congestion map.
>
>  -	The sender keeps transmitting messages off its queue
> 	as usual.
>
> 	When the application tries to send to a destination
> 	that is marked as congested, we could queue the message
> 	on the c_cong_queue rather than returning an error.
> 	This would take care of all the hassle of having to
> 	deal with congestion in user space.
>
>  -	When the application on the destination host is ready
> 	to take a few messages off its receive queue, the port
> 	is marked uncongested in the map, and a map update is
> 	queued to all connections.
>
>  -	When the sender receives the map update, it checks its
> 	queue of rejected messages (c_cong_queue) and retransmits
> 	those for which the port is no longer marked as congested.
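
To make the sender-side bookkeeping concrete, here is a minimal userspace model
of the scheme sketched above. Apart from c_cong_queue and the congestion-map
idea, every name in it (toy_msg, toy_conn, handle_nack, handle_uncongested) is
invented for illustration; this is a sketch of the proposal, not RDS code.

/* Illustrative userspace model of the proposed sender-side NACK handling.
 * A NACKed message is parked on a per-connection c_cong_queue together with
 * anything else bound for the same port; a later congestion map update
 * releases those messages back onto the send queue. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NPORTS 65536

struct toy_msg {
	uint64_t seq;
	uint16_t dport;
	struct toy_msg *next;
};

struct toy_conn {
	struct toy_msg *send_queue;	/* not yet transmitted */
	struct toy_msg *c_cong_queue;	/* rejected, waiting for un-congestion */
	bool congested[NPORTS];		/* stand-in for the congestion bitmap */
};

static void push(struct toy_msg **q, struct toy_msg *m)
{
	m->next = *q;
	*q = m;
}

/* Move every queued message for 'dport' from one queue to the other. */
static void move_port(struct toy_msg **from, struct toy_msg **to, uint16_t dport)
{
	while (*from) {
		struct toy_msg *m = *from;
		if (m->dport == dport) {
			*from = m->next;
			push(to, m);
		} else {
			from = &m->next;
		}
	}
}

/* NACK received for 'rejected': mark its port congested and park it, plus
 * any still-queued messages for the same port, on c_cong_queue. */
static void handle_nack(struct toy_conn *c, struct toy_msg *rejected)
{
	c->congested[rejected->dport] = true;
	push(&c->c_cong_queue, rejected);
	move_port(&c->send_queue, &c->c_cong_queue, rejected->dport);
}

/* Congestion map update says 'dport' is clear again: requeue its parked
 * messages for (re)transmission. */
static void handle_uncongested(struct toy_conn *c, uint16_t dport)
{
	c->congested[dport] = false;
	move_port(&c->c_cong_queue, &c->send_queue, dport);
}

int main(void)
{
	static struct toy_conn conn;
	struct toy_msg a = { 1, 4000, NULL }, b = { 2, 4000, NULL };

	push(&conn.send_queue, &b);	/* still queued for port 4000 */
	handle_nack(&conn, &a);		/* peer rejected seq 1 */
	printf("send queue empty after NACK: %s\n", conn.send_queue ? "no" : "yes");
	handle_uncongested(&conn, 4000); /* map update clears the port */
	printf("messages back on send queue: %s\n", conn.send_queue ? "yes" : "no");
	return 0;
}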
>
> In its entirety, this certainly isn't something we should introduce
> for OFED 1.2.5.x, but let's think about it for 1.3.
>
> However, it may be worth at least implementing the c_cong_queue,
> and use it to queue messages an application may send to a congested
> destination. Right now, we have this very bad interaction between
> congestion and poll - any congestion map update will essentially
> wake up all processes waiting in poll. Thundering herd all over...
>
> If you do not expose the fact that a given destination is congested
> to user space, this problem just vanishes. The application will happily
> queue up messages, and if the send queue is full, it stops - period.
> When a destination becomes uncongested, the kernel will move these
> messages to the send queue, and as ACKs start coming back, the
> application is woken up to send more.
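
A small sketch of what that buys the poll path (invented names again, purely to
illustrate the point): POLLOUT is derived from local send-buffer accounting
alone, so a congestion map update never needs to wake anyone.

/* Illustrative only.  If congestion is hidden from user space, writability
 * depends purely on local sndbuf accounting (bytes on the send queue plus
 * bytes parked on c_cong_queue); map updates wake nobody. */
#include <poll.h>

struct toy_rds_sock {
	long snd_bytes;		/* bytes queued: send queue + c_cong_queue */
	long sndbuf;		/* SO_SNDBUF limit */
};

static short toy_rds_poll(const struct toy_rds_sock *rs)
{
	short mask = 0;

	if (rs->snd_bytes < rs->sndbuf)
		mask |= POLLOUT;	/* room to queue more, congested or not */
	return mask;
}

int main(void)
{
	struct toy_rds_sock rs = { .snd_bytes = 0, .sndbuf = 64 * 1024 };

	return (toy_rds_poll(&rs) & POLLOUT) ? 0 : 1;
}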
>
> And it makes for a much saner poll implementation too :)
>
> Any comments?
>
> Olaf
>   


