[rds-devel] IB/iWARP code separation

Sun Nov 2 23:08:47 PST 2008

Andy Grover wrote:
> First, we have a number of concerns about iWARP compared to IB. These include lack of support for rdma over loopback, send completion semantics, and error reporting semantics that may require changes in the semantics of the RDS user interface. Having a clean separation allows us to assess and work on these issues with no chance of affecting IB support, which is extremely important.
Hi Andy,

Thanks for your detailed response, lets discuss a bit the issues you 
have brought, and then see :)

At a higher level, the Linux RDMA stack was designed from day one to 
provide an API which allows for one ULP IB/iWARP code base, and as I 
mentioned, so far this has been achieved by other ULPs. The way to get 
there may be not trivial and requires design and time. In that respect, 
if the RDS iWARP design/code is not ready for merge, it could have been 
put in an experimental branch (similarly to patches pushed to the -mm 
tree in the mainline kernel environment, yet one more advantage of being 
there...) till its ready for merge.

At a lower level and trying to address the differences you have raised: 
the IB spec indeed requires the HCA to implement "bridge" that allows 
for IB communication in the same node without going out to the 
wire/switch. From what you wrote I assume this is not the case for iWARP 
or is it not the case for some implementation of iWARP? Since iWARP 
connections are not seen by the host TCP stack, I assume that the 
problem is not that this stack implements such connections in 
shared-memory.

As for send completion semantics, the rds IB implementation uses 
software ACKs and doesn't rely on the IB RC ack-ing with the reasoning 
that an ACK is sent (by spec or by specific implementation) before the 
PCI data transaction is done, etc. I am quite confident (sure...) that 
under iWARP it no worse then that... do I miss something?

As for error reporting semantics, are you referring to CQ completion 
status codes? what is the issue here?

> Second, the iwarp patch added a struct of function pointers to handle IB using FMR, iWARP using fastreg (rds_ib_mr_pool_ops). We already have one struct of function pointers (struct rds_transport) so there's no need for another -- mr_pool_ops's functions should be merged into struct rds_transport.
As for the first part of your argument: --both-- IB and iWARP support 
fastreg by spec. As of now, fastreg API has been added to the main-line 
kernel and is supported by the iWARP vendors and the mlx4 IB driver. The 
"IB FMR" which you mention is a --proprietary-- implementation of one 
vendor, so in that sense, you as a maintainer can consider removing its 
support from your code base or maintain the support for both the 
standard and the proprietary ones. I would recommend using the standard 
one at least when its possible if not always (that is require the HW 
driver to support it for RDS deployments), among other reasons as its 
more efficient, will  allow you to maintain one fast reg related code 
base, etc.

I wasn't sure to follow on the second part of this argument.

> Third, I see maintaining two modules of size X as easier than maintaining one module of size 2X. I needed to get this out there so Jon could build on top of it, but this is a "future" release. It will not be submitted anywhere in its current condition. Once we have ib and iwarp transports both working then we can start pulling duplicate functionality up above the rds_transport interface, so that the two modules end up with only non-duplicated ib- and iwarp-specific code.

OK, fair-enough, makes lot of sense. I just want to clarify with you 
what actually has to be in the code which composes the delta. Once this 
is clear, it would be upto you seeing if maintaining two or one module 
makes sense. Assuming that the fastreg way of doing things is only for 
iWARP is an example to what I think need to be re-examined.

> Fourth, RDS transports were close to being modules before, but the fact they weren't meant that some things that should've been cleaned up by the transport (e.g. connections) were actually cleaned up by rds on rmmod. Making them separate modules revealed this immediately, so I like it for keeping the code properly abstracted. If this poses political issues then we can alter the makefile to just compile-in the transports to ib_rds.ko
Transport layering  / modules decomposition is a common practice in 
other domains at the kernel. So I think it perfectly fine to do so. I 
was also thinking that the loopback transport is a different code from 
the IB one, but I may be wrong.

> Going this way was always the plan -- that snafu just made me want to 
> get there faster :)

As for "the plan" - its seems to me that this plan has a "doing giant 
changes and then stabilizing" state-of-mind. Well, for few years now, 
following the 2.5 failure to see the day light, the linux kernel 
development has taken a different approach of "doing changes small 
enough that can be stabilized in about 14 weeks (semester, kidding...) 
cycles". All the evidences show that the iWARP patch doesn't fall into 
that category.  At a minimum it had to be broken into set of logical 
changes patches that built one on top of the other and can be easily 
reviewed at least by someone that knows the code being changed.

The path of taking code (module) of size X to two modules each of size X 
each and then converging to one shared code of size aX and two addons of 
size bX (e.g a=0.9 b=0.1) is possible, but I believe the kernel 
development methods / practices have proved that going the other way 
(taking code of X, finding the aX fraction of general functionality, 
then implementing the bX per-transport functionality) would deliver 
faster and better (larger quality), some people call it start with the 
end-in-mind

Or