[rds-devel] IB/iWARP code separation
Or Gerlitz
ogerlitz at voltaire.com
Sun Nov 2 23:08:47 PST 2008
Andy Grover wrote:
> First, we have a number of concerns about iWARP compared to IB. These include lack of support for rdma over loopback, send completion semantics, and error reporting semantics that may require changes in the semantics of the RDS user interface. Having a clean separation allows us to assess and work on these issues with no chance of affecting IB support, which is extremely important.
Hi Andy,
Thanks for your detailed response, lets discuss a bit the issues you
have brought, and then see :)
At a higher level, the Linux RDMA stack was designed from day one to
provide an API which allows for one ULP IB/iWARP code base, and as I
mentioned, so far this has been achieved by other ULPs. The way to get
there may be not trivial and requires design and time. In that respect,
if the RDS iWARP design/code is not ready for merge, it could have been
put in an experimental branch (similarly to patches pushed to the -mm
tree in the mainline kernel environment, yet one more advantage of being
there...) till its ready for merge.
At a lower level and trying to address the differences you have raised:
the IB spec indeed requires the HCA to implement "bridge" that allows
for IB communication in the same node without going out to the
wire/switch. From what you wrote I assume this is not the case for iWARP
or is it not the case for some implementation of iWARP? Since iWARP
connections are not seen by the host TCP stack, I assume that the
problem is not that this stack implements such connections in
shared-memory.
As for send completion semantics, the rds IB implementation uses
software ACKs and doesn't rely on the IB RC ack-ing with the reasoning
that an ACK is sent (by spec or by specific implementation) before the
PCI data transaction is done, etc. I am quite confident (sure...) that
under iWARP it no worse then that... do I miss something?
As for error reporting semantics, are you referring to CQ completion
status codes? what is the issue here?
> Second, the iwarp patch added a struct of function pointers to handle IB using FMR, iWARP using fastreg (rds_ib_mr_pool_ops). We already have one struct of function pointers (struct rds_transport) so there's no need for another -- mr_pool_ops's functions should be merged into struct rds_transport.
As for the first part of your argument: --both-- IB and iWARP support
fastreg by spec. As of now, fastreg API has been added to the main-line
kernel and is supported by the iWARP vendors and the mlx4 IB driver. The
"IB FMR" which you mention is a --proprietary-- implementation of one
vendor, so in that sense, you as a maintainer can consider removing its
support from your code base or maintain the support for both the
standard and the proprietary ones. I would recommend using the standard
one at least when its possible if not always (that is require the HW
driver to support it for RDS deployments), among other reasons as its
more efficient, will allow you to maintain one fast reg related code
base, etc.
I wasn't sure to follow on the second part of this argument.
> Third, I see maintaining two modules of size X as easier than maintaining one module of size 2X. I needed to get this out there so Jon could build on top of it, but this is a "future" release. It will not be submitted anywhere in its current condition. Once we have ib and iwarp transports both working then we can start pulling duplicate functionality up above the rds_transport interface, so that the two modules end up with only non-duplicated ib- and iwarp-specific code.
OK, fair-enough, makes lot of sense. I just want to clarify with you
what actually has to be in the code which composes the delta. Once this
is clear, it would be upto you seeing if maintaining two or one module
makes sense. Assuming that the fastreg way of doing things is only for
iWARP is an example to what I think need to be re-examined.
> Fourth, RDS transports were close to being modules before, but the fact they weren't meant that some things that should've been cleaned up by the transport (e.g. connections) were actually cleaned up by rds on rmmod. Making them separate modules revealed this immediately, so I like it for keeping the code properly abstracted. If this poses political issues then we can alter the makefile to just compile-in the transports to ib_rds.ko
Transport layering / modules decomposition is a common practice in
other domains at the kernel. So I think it perfectly fine to do so. I
was also thinking that the loopback transport is a different code from
the IB one, but I may be wrong.
> Going this way was always the plan -- that snafu just made me want to
> get there faster :)
As for "the plan" - its seems to me that this plan has a "doing giant
changes and then stabilizing" state-of-mind. Well, for few years now,
following the 2.5 failure to see the day light, the linux kernel
development has taken a different approach of "doing changes small
enough that can be stabilized in about 14 weeks (semester, kidding...)
cycles". All the evidences show that the iWARP patch doesn't fall into
that category. At a minimum it had to be broken into set of logical
changes patches that built one on top of the other and can be easily
reviewed at least by someone that knows the code being changed.
The path of taking code (module) of size X to two modules each of size X
each and then converging to one shared code of size aX and two addons of
size bX (e.g a=0.9 b=0.1) is possible, but I believe the kernel
development methods / practices have proved that going the other way
(taking code of X, finding the aX fraction of general functionality,
then implementing the bX per-transport functionality) would deliver
faster and better (larger quality), some people call it start with the
end-in-mind
Or
More information about the rds-devel
mailing list