[rds-devel] IB/iWARP code separation

Richard Frank richard.frank at oracle.com
Mon Nov 3 07:07:17 PST 2008


Or, please keep your comments on a technical level; if you insist on acting as a project proctor, please do so outside of this email distribution.

As Andy pointed out, these changes are not ready for check-in; his posting is an RFC to the iWARP contributors.


WRT:

"
As for send completion semantics, the rds IB implementation uses 
software ACKs and doesn't rely on the IB RC ack-ing with the reasoning 
that an ACK is sent (by spec or by specific implementation) before the 
PCI data transaction is done, etc. I am quite confident (sure...) that 
under iWARP it no worse then that... do I miss something?

As for error reporting semantics, are you referring to CQ completion 
status codes? what is the issue here?
"

RDS uses driver-level acking for reliable bcopy operations. RDS V3 rdma operations are not reliable and therefore do not use RDS driver-level acking.

Status of an operation in the RDS V3 rdma interface is provided via a completion notification. A successful completion indicates that the rdma operation was processed by the remote HCA without error (the rdma key was valid, etc.), and further that the local HCA has completed all operations on the local buffers - thus returning ownership of the local buffers to the client app. The fact that the data may not yet have been placed into remote host memory (the PCI transaction) is understood. If the remote HCA or host fails after the RDMA completion returns success, the application must detect the loss of the update and resubmit if required. An RDS client can depend on the following (a minimal sketch follows the list):

1) rdma completion indicates buffer ownership is returned to client.
2) rdma has been processed by remote HCA - and is pending delivery to remote host memory.
3) rdma completion indicates status of the rdma operation itself - not just that it was submitted.
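
To make point 3 concrete, here is a minimal sketch of how a ULP might drain its send CQ and treat the completion status as the outcome of the rdma itself. The rds_complete_rdma / rds_fail_rdma upcalls are hypothetical placeholders, not actual RDS functions:

#include <rdma/ib_verbs.h>

/* Hypothetical upcalls into the RDS core, named here for illustration only. */
extern void rds_complete_rdma(void *op);
extern void rds_fail_rdma(void *op, enum ib_wc_status status);

/* Minimal sketch: drain the send CQ and treat the completion status as the
 * outcome of the rdma operation itself, not merely of its submission. */
static void poll_send_cq(struct ib_cq *cq)
{
        struct ib_wc wc;

        while (ib_poll_cq(cq, 1, &wc) > 0) {
                if (wc.status == IB_WC_SUCCESS)
                        /* Remote HCA processed the rdma without error; local
                         * buffer ownership returns to the client. */
                        rds_complete_rdma((void *)(unsigned long)wc.wr_id);
                else
                        /* e.g. IB_WC_REM_ACCESS_ERR for an invalid rkey. */
                        rds_fail_rdma((void *)(unsigned long)wc.wr_id,
                                      wc.status);
        }
}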

With IB, the send completion is used to determine the success of the rdma operation. With iWARP, the send completion does not provide the same semantics: it only indicates that the operation has been accepted by the local driver / NIC. iWARP errors are delivered asynchronously - out of band - raising at least three issues:

1) The RDS driver must compensate for these differences by holding off returning a completion until the rdma operation has actually been processed!
2) To force processing of the submitted rdma operation requires submitting a subsequent rdma op - perhaps a 0bread (0-byte read) - to flush out the prior rdma, thereby triggering any pending async error notifications, etc. (see the sketch after this list).
3) We need to correlate an iWARP async error with the operation that triggered the error - which has already received a transport ack.
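
To illustrate the 0-byte-read flush in point 2, here is a hedged sketch of chaining a zero-length RDMA READ behind the real RDMA WRITE, so that the READ's completion stands in for "the WRITE has actually been processed". Field names follow the 2.6.27-era struct ib_send_wr layout and may differ on other kernel versions:

#include <linux/string.h>
#include <rdma/ib_verbs.h>

/* Sketch only: chain a 0-byte RDMA READ after the RDMA WRITE. Because read
 * responses are generated only after prior requests have been processed, the
 * READ completion can serve as the "write really happened" signal on a
 * transport whose send completion is weaker. */
static int post_write_with_flush(struct ib_qp *qp, struct ib_sge *sge,
                                 u64 remote_addr, u32 rkey)
{
        struct ib_send_wr write_wr, read_wr, *bad_wr;

        memset(&write_wr, 0, sizeof(write_wr));
        write_wr.opcode              = IB_WR_RDMA_WRITE;
        write_wr.sg_list             = sge;
        write_wr.num_sge             = 1;
        write_wr.wr.rdma.remote_addr = remote_addr;
        write_wr.wr.rdma.rkey        = rkey;
        write_wr.next                = &read_wr;

        memset(&read_wr, 0, sizeof(read_wr));
        read_wr.opcode               = IB_WR_RDMA_READ;
        read_wr.num_sge              = 0;               /* zero-length read */
        read_wr.send_flags           = IB_SEND_SIGNALED;
        read_wr.wr.rdma.remote_addr  = remote_addr;
        read_wr.wr.rdma.rkey         = rkey;

        return ib_post_send(qp, &write_wr, &bad_wr);
}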

As these issues are iWARP specific, they should be handled in an iWARP-specific implementation.

WRT:

"
As for the first part of your argument: --both-- IB and iWARP support 
fastreg by spec. As of now, fastreg API has been added to the main-line 
kernel and is supported by the iWARP vendors and the mlx4 IB driver. The 
"IB FMR" which you mention is a --proprietary-- implementation of one 
vendor, so in that sense, you as a maintainer can consider removing its 
support from your code base or maintain the support for both the 
standard and the proprietary ones. I would recommend using the standard 
one at least when its possible if not always (that is require the HW 
driver to support it for RDS deployments), among other reasons as its 
more efficient, will  allow you to maintain one fast reg related code 
base, etc.
"

When we designed RDS:

1) The fastreg interface was not yet completed.
2) It was clear that FMRs were a) operational and b) met our requirements for RDS, which specifically stated that rdma key creation must not require ordering of operations to a destination... fastreg does (see the sketch after this list).
3) FMRs offered outstanding performance.
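
The ordering point in 2) is the heart of the difference. With FMRs, key creation is a synchronous verb call with no tie to any QP or destination; with fastreg, key creation is itself a work request that must be posted - and is therefore ordered - on one particular connection's send queue before the key is usable. A rough sketch of the two flavours using the 2.6.27-era kernel APIs (exact field names changed in later kernels):

#include <linux/err.h>
#include <linux/string.h>
#include <rdma/ib_verbs.h>
#include <rdma/ib_fmr_pool.h>

/* FMR flavour: mapping is a synchronous verb, independent of any QP or
 * destination; the returned rkey is usable as soon as the call returns. */
static u32 map_with_fmr(struct ib_fmr_pool *pool, u64 *dma_pages,
                        int npages, u64 iova)
{
        struct ib_pool_fmr *fmr;

        fmr = ib_fmr_pool_map_phys(pool, dma_pages, npages, iova);
        if (IS_ERR(fmr))
                return 0;
        return fmr->fmr->rkey;
}

/* Fastreg flavour: key creation is a work request posted on one particular
 * connection's QP; the rkey is not usable until that WR has executed there.
 * mr and pl come from ib_alloc_fast_reg_mr() / ib_alloc_fast_reg_page_list(). */
static int map_with_fastreg(struct ib_qp *qp, struct ib_mr *mr,
                            struct ib_fast_reg_page_list *pl,
                            int npages, u64 iova)
{
        struct ib_send_wr reg_wr, *bad_wr;

        memset(&reg_wr, 0, sizeof(reg_wr));
        reg_wr.opcode                    = IB_WR_FAST_REG_MR;
        reg_wr.send_flags                = IB_SEND_SIGNALED;
        reg_wr.wr.fast_reg.iova_start    = iova;
        reg_wr.wr.fast_reg.page_list     = pl;
        reg_wr.wr.fast_reg.page_list_len = npages;
        reg_wr.wr.fast_reg.page_shift    = PAGE_SHIFT;
        reg_wr.wr.fast_reg.length        = npages * PAGE_SIZE;
        reg_wr.wr.fast_reg.access_flags  = IB_ACCESS_LOCAL_WRITE |
                                           IB_ACCESS_REMOTE_WRITE;
        reg_wr.wr.fast_reg.rkey          = mr->rkey;

        return ib_post_send(qp, &reg_wr, &bad_wr);
}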

The issues with fastreg are:

1) It is not clear how well they will perform - we will switch from FMRs on IB when / if performance is on par. As we are shipping product with IB, the performance and functional criteria are well established.
2) We need to modify the existing RDS interface to support fastreg use, which requires modifying the RDS apps to deal with identifying the destination of the rdma. Fastreg implements a more restrictive model than FMRs, which imposes some limitations on how we use rdma operations.
3) As iWARP only implements fastreg, the iWARP-specific implementation will serve as validation for them.

Another related issue is that we are planning to move to SWI (send with invalidate) operations, which I believe are not supported by iWARP NICs... this will be another piece of transport-dependent support.
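
For reference, a minimal sketch of what posting an SWI looks like with the verbs API of that era (the rkey to be invalidated rides in the ex union that was added alongside the fastreg support); whether a given NIC accepts IB_WR_SEND_WITH_INV is exactly the portability question raised above:

#include <linux/string.h>
#include <rdma/ib_verbs.h>

/* Sketch: send-with-invalidate delivers the message and asks the remote HCA
 * to invalidate the named rkey on receipt, saving the peer a separate
 * invalidate step. */
static int post_send_with_inv(struct ib_qp *qp, struct ib_sge *sge,
                              u32 peer_rkey)
{
        struct ib_send_wr wr, *bad_wr;

        memset(&wr, 0, sizeof(wr));
        wr.opcode             = IB_WR_SEND_WITH_INV;
        wr.sg_list            = sge;
        wr.num_sge            = 1;
        wr.send_flags         = IB_SEND_SIGNALED;
        wr.ex.invalidate_rkey = peer_rkey;

        return ib_post_send(qp, &wr, &bad_wr);
}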

Adding up all of the above - plus a) no native loopback support for bcopy or rdma with at least the current set of iWARP NICs, b) the fact that we are shipping product with RDS on IB today, and c) RDS is designed to support modular transports (ok, still a work in progress) - it is prudent to handle the iWARP-specific support in an iWARP-specific module.


Or Gerlitz wrote:
> Andy Grover wrote:
>   
>> First, we have a number of concerns about iWARP compared to IB. These include lack of support for rdma over loopback, send completion semantics, and error reporting semantics that may require changes in the semantics of the RDS user interface. Having a clean separation allows us to assess and work on these issues with no chance of affecting IB support, which is extremely important.
>>     
> Hi Andy,
>
> Thanks for your detailed response, lets discuss a bit the issues you 
> have brought, and then see :)
>
> At a higher level, the Linux RDMA stack was designed from day one to 
> provide an API which allows for one ULP IB/iWARP code base, and as I 
> mentioned, so far this has been achieved by other ULPs. The way to get 
> there may be not trivial and requires design and time. In that respect, 
> if the RDS iWARP design/code is not ready for merge, it could have been 
> put in an experimental branch (similarly to patches pushed to the -mm 
> tree in the mainline kernel environment, yet one more advantage of being 
> there...) till its ready for merge.
>
> At a lower level and trying to address the differences you have raised: 
> the IB spec indeed requires the HCA to implement "bridge" that allows 
> for IB communication in the same node without going out to the 
> wire/switch. From what you wrote I assume this is not the case for iWARP 
> or is it not the case for some implementation of iWARP? Since iWARP 
> connections are not seen by the host TCP stack, I assume that the 
> problem is not that this stack implements such connections in 
> shared-memory.
>
> As for send completion semantics, the rds IB implementation uses 
> software ACKs and doesn't rely on the IB RC ack-ing with the reasoning 
> that an ACK is sent (by spec or by specific implementation) before the 
> PCI data transaction is done, etc. I am quite confident (sure...) that 
> under iWARP it no worse then that... do I miss something?
>
> As for error reporting semantics, are you referring to CQ completion 
> status codes? what is the issue here?
>
>
>   
>> Second, the iwarp patch added a struct of function pointers to handle IB using FMR, iWARP using fastreg (rds_ib_mr_pool_ops). We already have one struct of function pointers (struct rds_transport) so there's no need for another -- mr_pool_ops's functions should be merged into struct rds_transport.
>>     
> As for the first part of your argument: --both-- IB and iWARP support 
> fastreg by spec. As of now, fastreg API has been added to the main-line 
> kernel and is supported by the iWARP vendors and the mlx4 IB driver. The 
> "IB FMR" which you mention is a --proprietary-- implementation of one 
> vendor, so in that sense, you as a maintainer can consider removing its 
> support from your code base or maintain the support for both the 
> standard and the proprietary ones. I would recommend using the standard 
> one at least when its possible if not always (that is require the HW 
> driver to support it for RDS deployments), among other reasons as its 
> more efficient, will  allow you to maintain one fast reg related code 
> base, etc.
>
> I wasn't sure to follow on the second part of this argument.
>
>
>   
>> Third, I see maintaining two modules of size X as easier than maintaining one module of size 2X. I needed to get this out there so Jon could build on top of it, but this is a "future" release. It will not be submitted anywhere in its current condition. Once we have ib and iwarp transports both working then we can start pulling duplicate functionality up above the rds_transport interface, so that the two modules end up with only non-duplicated ib- and iwarp-specific code.
>>     
>
> OK, fair-enough, makes lot of sense. I just want to clarify with you 
> what actually has to be in the code which composes the delta. Once this 
> is clear, it would be upto you seeing if maintaining two or one module 
> makes sense. Assuming that the fastreg way of doing things is only for 
> iWARP is an example to what I think need to be re-examined.
>
>
>   
>> Fourth, RDS transports were close to being modules before, but the fact they weren't meant that some things that should've been cleaned up by the transport (e.g. connections) were actually cleaned up by rds on rmmod. Making them separate modules revealed this immediately, so I like it for keeping the code properly abstracted. If this poses political issues then we can alter the makefile to just compile-in the transports to ib_rds.ko
>>     
> Transport layering  / modules decomposition is a common practice in 
> other domains at the kernel. So I think it perfectly fine to do so. I 
> was also thinking that the loopback transport is a different code from 
> the IB one, but I may be wrong.
>
>   
>> Going this way was always the plan -- that snafu just made me want to 
>> get there faster :)
>>     
>
> As for "the plan" - its seems to me that this plan has a "doing giant 
> changes and then stabilizing" state-of-mind. Well, for few years now, 
> following the 2.5 failure to see the day light, the linux kernel 
> development has taken a different approach of "doing changes small 
> enough that can be stabilized in about 14 weeks (semester, kidding...) 
> cycles". All the evidences show that the iWARP patch doesn't fall into 
> that category.  At a minimum it had to be broken into set of logical 
> changes patches that built one on top of the other and can be easily 
> reviewed at least by someone that knows the code being changed.
>
> The path of taking code (module) of size X to two modules each of size X 
> each and then converging to one shared code of size aX and two addons of 
> size bX (e.g a=0.9 b=0.1) is possible, but I believe the kernel 
> development methods / practices have proved that going the other way 
> (taking code of X, finding the aX fraction of general functionality, 
> then implementing the bX per-transport functionality) would deliver 
> faster and better (larger quality), some people call it start with the 
> end-in-mind
>
> Or
>
>
>
>
>