[rds-devel] Fwd: Re: bcopy congestion / flow control... ?

Thu Feb 7 08:26:56 PST 2008

Rick raised the question of congestion control once more, and
I decided to look into this a little.

So here are the boundary conditions:

 -	we don't want to change the basic approach right now.
	It's okay to throttle the sender after we go over the
	recv buffer quota - in practice we can live with it,
	and it avoids complex and slow algorithms for assigning
	credit to all peers (and moving credit around!)

 -	The current approach of "wake up everyone when a congestion
	update arrives" doesn't scale at all.

 -	We can't track congestion on an individual addr/port basis;
	the memory cost could become prohibitive. Running
	rds-stress with 1024 tasks would require 1 million
	objects for tracking the remote congestion state, which
	is a bit wasteful in terms of memory and probably hard
	to do fast, too.
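
A rough sketch of the arithmetic behind that estimate. It assumes each of
the 1024 tasks owns one socket and may send to each of 1024 remote ports,
and compares that with keeping a single 64-bit mask per socket (the scheme
described in the forwarded message). The helper names are hypothetical and
used only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: number of tracking objects if every local socket
 * keeps separate congestion state for every remote port it talks to. */
static uint64_t per_port_objects(uint64_t sockets, uint64_t remote_ports)
{
	return sockets * remote_ports;
}

/* Hypothetical helper: bytes needed if every socket instead keeps one
 * 64-bit mask shared by all remote ports. */
static uint64_t per_socket_mask_bytes(uint64_t sockets)
{
	return sockets * sizeof(uint64_t);
}
```

Under these assumptions, per_port_objects(1024, 1024) is 1048576 objects,
while per_socket_mask_bytes(1024) is 8192 bytes.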

Below is an idea I tried out today. Patches and some preliminary
results will follow.

Olaf

----------  Forwarded Message  ----------

Subject: Re: bcopy congestion / flow control... ?
Date: Thursday 07 February 2008 16:58
From: Olaf Kirch <olaf.kirch at oracle.com>
To: Richard Frank <richard.frank at oracle.com>

On Thursday 07 February 2008 09:07, Richard Frank wrote:
> Assuming RDS is supposed to resolve this (in bcopy mode) implies that it 
> has an efficient flow control system - e.g. our congestion management. 
> Hopefully, it's much more efficient than UDP under the same load - even 
> if it's not perfect...

I have a smallish patch to the congestion code that implements
"congestion update" notifications. It goes like this:

 -	you enable congestion monitoring through a setsockopt. This puts
	the socket on a global list of "these sockets want active congestion
	monitoring"
 -	monitoring is based on a 64-bit bitmap; several ports map to each
	bit. Right now port N corresponds to bit (N % 64).
	Each socket has one of these (so the space overhead of this approach
	is O(N) with N the number of sockets you use).
 -	whenever sendmsg bails out because we tried to send to a congested
	port, the corresponding bit is set in the socket's congestion mask
 -	when a congestion update arrives, we check which ports changed
	from congested to uncongested. I'm combining this with the memcpy
	code that copies the new map over the old one, so it should be
	reasonably efficient.
 -	The resulting 64-bit word is passed into rds_cong_map_updated.
	There, we walk the list of sockets and check whether each one is
	interested in any of these ports. If it is, we record that fact
	and wake up the socket. On the next recvmsg, it will get a control
	message containing the 64-bit word representing those ports that
	were previously blocked.
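
The steps above could be sketched roughly as follows. This is a userspace
illustration, not the actual patch. It assumes the congestion map is a flat
array of 64-bit words in host byte order with one bit per 16-bit port, and
that the word delivered to a socket is the intersection of the update with
that socket's own mask. struct mon_sock, port_to_bit, sock_note_congested,
cong_map_copy and cong_map_updated are hypothetical names; only
rds_cong_map_updated appears in the description, and cong_map_updated
stands in for it here.

```c
#include <stddef.h>
#include <stdint.h>

#define CONG_MAP_PORTS	65536			/* one bit per 16-bit port */
#define CONG_MAP_WORDS	(CONG_MAP_PORTS / 64)

/* Hypothetical per-socket monitoring state. */
struct mon_sock {
	struct mon_sock *next;	/* global list of monitoring sockets */
	uint64_t cong_mask;	/* bits for ports a send was refused on */
	uint64_t cong_notify;	/* bits to report on the next recvmsg */
	int woken;		/* stand-in for waking up the socket */
};

/* Port N corresponds to bit (N % 64). */
static inline uint64_t port_to_bit(uint16_t port)
{
	return UINT64_C(1) << (port % 64);
}

/* Called when a send bails out because the destination is congested. */
static void sock_note_congested(struct mon_sock *s, uint16_t port)
{
	s->cong_mask |= port_to_bit(port);
}

/*
 * Copy the new map over the old one and, in the same pass, collect the
 * ports that went from congested to uncongested. Bit j of word i is
 * port i * 64 + j, and (i * 64 + j) % 64 == j, so OR-ing the per-word
 * differences together preserves the port-to-bit mapping.
 */
static uint64_t cong_map_copy(uint64_t *old_map, const uint64_t *new_map)
{
	uint64_t uncongested = 0;
	size_t i;

	for (i = 0; i < CONG_MAP_WORDS; i++) {
		uncongested |= old_map[i] & ~new_map[i];
		old_map[i] = new_map[i];
	}
	return uncongested;
}

/*
 * Walk the list of monitoring sockets and wake those that were blocked
 * on one of the ports in the update. Returns the number of sockets woken.
 * Clearing the reported bits from cong_mask is an assumption.
 */
static int cong_map_updated(struct mon_sock *list, uint64_t uncongested)
{
	struct mon_sock *s;
	int woken = 0;

	for (s = list; s != NULL; s = s->next) {
		uint64_t hit = s->cong_mask & uncongested;

		if (!hit)
			continue;
		s->cong_notify |= hit;
		s->cong_mask &= ~hit;
		s->woken = 1;
		woken++;
	}
	return woken;
}
```

Because several ports share a bit, a socket can be woken for a port it
never tried to send to; the application has to treat the word as a hint
and simply retry its blocked sends.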

I'll give you the patches later after I've done some more testing. They're
still a little raw - and I first need to do some performance testing to
see if they actually change anything for the better.

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir at lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax

-------------------------------------------------------
