Published on Thu Sep 30 17:23:36 EDT 2004
The following patches are against 2.6.9-rc3. Included for reference is the latest version of the RPC transport switch design document, and a set of notes describing what's next.
Subject: [PATCH] NFS: incorrect "df" results Description: Fix an NFS client bug introduced in 2.6.9-rc1. The "df" command was reporting the size of NFS file systems incorrectly. Test-plan: Compare NFS mounts in the output of "df -k" and "df -h".
Subject: [PATCH] NFS: short write warning Recently a patch set was accepted to allow the Linux NFS client to handle short writes by retrying the unwritten portion of the request. The only case that now results in an error is when the server makes no progress; that is, writes zero bytes. This patch changes the kernel log warning that is generated in that case to reflect more accurately the error condition. Test-plan: None.
Subject: [PATCH] NFS: report return code on GETATTR and SETATTR Improve trace debugging messages for NFSv2/3 GETATTR and SETATTR procedures. Test-plan: Enable NFS trace debugging for NFSDBG_PROC and watch for setattr and getattr result messages. Try with NFSv2 and NFSv3.
Subject: [PATCH] NFS: directory trace messages Description: Those who use pre-built kernels from a distribution can't add more trace messages to a kernel when diagnosing a problem in the field. On the other hand, we don't want so many trace messages that the kernel log is overwhelmed with noise when tracing is enabled. This patch reuses NFSDBG_DIRCACHE and NFSDBG_LOOKUPCACHE to provide additional diagnostic messages that trace the operation of the NFS client's directory cache and lookup cache. A few other messages are now generated when NFSDBG_VFS is active, as well, to trace normal VFS activity. Test-plan: Enable NFS trace debugging with flags 1, 2, or 4. You should be able to see different types of trace messages with each flag setting.
Subject: [PATCH] NFS: Sync NFS writes still use kmalloc Replace the kmalloc() and kfree() calls in this path with appropriate invocations of nfs_writedata_alloc() and nfs_writedata_free(). This makes nfs_writepage_sync match all the other write paths in fs/nfs/write.c. Test-plan: Mount with the "sync" option, run millions of fsx operations, then review system memory utilization.
Subject: [PATCH] NFS: Use sizeof() instead of C macro Replace a C macro with sizeof(). Test-plan: Rig a server to return a bad write verifier.
Subject: [PATCH] NFS: better handling of short writes in direct write path Immediately return control to the application if a short NFS write is detected in the NFS client's direct write path. If a short write occurs and the client continues writing, the short write will leave a gap somewhere in the middle of a write request, which is more difficult for applications to recover from. Test-plan: Rig a server to return a short NFS write while running an application that uses direct I/O over NFS.
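The stop-on-short-write policy described above can be sketched in user space as follows. This is only an illustration under stated assumptions, not the kernel code: send_chunk_fn, direct_write_sketch and capped_send are hypothetical names, and the callback stands in for one NFS WRITE.

```c
#include <stddef.h>

/* Hypothetical stand-in for one NFS WRITE: returns the bytes accepted. */
typedef size_t (*send_chunk_fn)(const char *buf, size_t len, void *ctx);

/*
 * Write 'count' bytes in chunks of at most 'wsize' bytes. Stop at the
 * first short write so the bytes reported to the caller are contiguous
 * and no gap is left in the middle of the request.
 */
size_t direct_write_sketch(const char *buf, size_t count, size_t wsize,
                           send_chunk_fn send, void *ctx)
{
    size_t done = 0;

    if (wsize == 0)
        return 0;

    while (done < count) {
        size_t want = count - done;
        size_t got;

        if (want > wsize)
            want = wsize;

        got = send(buf + done, want, ctx);
        done += got;
        if (got < want)
            break;  /* short write: return control to the caller now */
    }
    return done;
}

/* Example callback: a fake server that accepts at most *ctx more bytes. */
size_t capped_send(const char *buf, size_t len, void *ctx)
{
    size_t *room = ctx;
    size_t n = len < *room ? len : *room;

    (void)buf;
    *room -= n;
    return n;
}
```

With a 32-byte request, wsize of 8, and a fake server that has room for only 10 bytes, the sketch returns 10 and never issues the third chunk, so the application sees a single contiguous short write.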
Subject: [PATCH] NFS: Direct write path allocates nfs_write_data on the stack Reduce stack utilization in the NFS direct write path by using a dynamically allocated nfs_write_data structure instead of allocating one on the stack. This reduces stack utilization of nfs_direct_write_seg from over 900 bytes to less than 100 bytes. Test-plan: Millions of operations with fsx-odirect. OraSim with a direct job file.
Subject: [PATCH] NFS: Direct read path allocates nfs_read_data on the stack Reduce stack utilization in the NFS direct read path by using a dynamically allocated nfs_read_data structure instead of allocating one on the stack. This reduces stack utilization of nfs_direct_read_seg from over 900 bytes to less than 100 bytes. Test-plan: Millions of operations with fsx-odirect. OraSim with a direct job file.
Subject: [PATCH] NFS: Use parallel read operations to do direct read requests The initial implementation of NFS direct reads was entirely synchronous. The direct read logic issued one NFS READ operation at a time, and waited for the server's reply before issuing the next one. For large direct read requests, this is unnecessarily slow. This patch changes the NFS direct read path to dispatch NFS READ operations for a single direct read request in parallel and wait for them once. The direct read path is still synchronous in nature, but because the NFS READ operations are going in parallel, the completion wait should be much shorter. Test-plan: Millions of operations with fsx-odirect. OraSim with a direct job file and small rsize. Use sio with -direct to generate large sequential reads or large random reads. Check that ^C or application segmentation faults do not cause an oops. Verify behavior when reading up to EOF.
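To make the idea concrete, here is a small user-space sketch of the step that carves one direct read request into rsize-sized READ operations before they are dispatched together. The structure and function names are hypothetical, and the dispatch and the single completion wait are not shown.

```c
#include <stddef.h>

/* One READ operation to be issued: byte range within the file. */
struct read_chunk_sketch {
    unsigned long long offset;
    size_t len;
};

/*
 * Split the range [offset, offset + count) into chunks of at most
 * 'rsize' bytes. Fills at most 'max' entries of 'out' and returns the
 * number of entries written. The caller would dispatch all of them,
 * then wait once for the whole set to complete.
 */
size_t split_read_sketch(unsigned long long offset, size_t count,
                         size_t rsize, struct read_chunk_sketch *out,
                         size_t max)
{
    size_t n = 0;

    if (rsize == 0)
        return 0;

    while (count > 0 && n < max) {
        size_t len = count < rsize ? count : rsize;

        out[n].offset = offset;
        out[n].len = len;
        offset += len;
        count -= len;
        n++;
    }
    return n;
}
```

A 10000-byte read at offset 100 with an rsize of 4096 yields three chunks, the last one covering the 1808-byte tail.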
Subject: [PATCH] RPC: extract socket logic common to both client and server Move some code that is common to both RPC client- and server-side socket transports into its own source file, net/sunrpc/socklib.c. Test-plan: Millions of fsx operations over UDP. Connectathon over UDP.
Subject: [PATCH] RPC: introduce client-side transport switch This patch introduces a generic transport switch into the kernel RPC client. The RPC transport switch divorces socket-specific implementation from the generic pieces of the RPC client. Such a switch will allow support for RPC over 10GbE, IPsec offload, multiple sockets per mount, IPv6, and transports capable of direct data placement. Here, we move the bulk of socket-specific code into a separate source file, net/sunrpc/clntsock.c. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone". Destructive testing (unplugging the network temporarily). Connectathon with v2, v3, and v4.
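The general shape of such a switch is a table of function pointers that generic code calls through. The sketch below is a user-space illustration only; every name in it (xprt_ops_sketch, sock_ops_sketch, rpc_transmit_sketch, and the method set) is an assumption and does not reflect the interface the patch actually defines.

```c
#include <stddef.h>

struct xprt_sketch;

/* Hypothetical method table supplied by each transport implementation. */
struct xprt_ops_sketch {
    const char *name;
    int (*connect)(struct xprt_sketch *xprt);
    int (*send)(struct xprt_sketch *xprt, const void *buf, size_t len);
};

/* Generic transport state; knows nothing about sockets. */
struct xprt_sketch {
    const struct xprt_ops_sketch *ops;
    int connected;
    size_t bytes_sent;
};

/* A pretend socket transport: it only records what it was asked to do. */
static int sock_connect_sketch(struct xprt_sketch *xprt)
{
    xprt->connected = 1;
    return 0;
}

static int sock_send_sketch(struct xprt_sketch *xprt, const void *buf,
                            size_t len)
{
    (void)buf;
    if (!xprt->connected)
        return -1;
    xprt->bytes_sent += len;
    return 0;
}

const struct xprt_ops_sketch sock_ops_sketch = {
    "socket-sketch",
    sock_connect_sketch,
    sock_send_sketch,
};

/* Generic RPC client code: reaches the transport only through the table. */
int rpc_transmit_sketch(struct xprt_sketch *xprt, const void *buf, size_t len)
{
    if (!xprt->connected && xprt->ops->connect(xprt) != 0)
        return -1;
    return xprt->ops->send(xprt, buf, len);
}
```

Under this arrangement a new transport supplies its own table and the generic path is left untouched.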
Subject: [PATCH] RPC: move xdr_sendpages under transport switch Move socket-dependent code from net/sunrpc/xdr.c to net/sunrpc/clntsock.c. Reduce stack utilization of the RPC send path, while we're at it. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone".
Subject: [PATCH] RPC: client-side transport switch cleanup Clean up remaining socket-specific structure naming, remove include/socket.h from most RPC client source files, and change some comments to reflect the realities of the new RPC transport switch mechanism. Also remove the cong_wait field from rpc_xprt, as it is no longer used. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone".
Subject: [PATCH] RPC: separate TCP and UDP write space callbacks Split the socket write space callback function into a TCP version and UDP version, eliminating one dependence on the "xprt->stream" variable. Also, make both callbacks more CPU efficient by reducing the number of conditional branches taken in the hot path in each function. Test-plan: Write-intensive workload on a single mount point.
Subject: [PATCH] RPC: separate TCP and UDP transport connection logic Remove the xprt_sock_create and xprt_sock_bind functions. Create separate connection worker functions for managing UDP and TCP transport sockets. This eliminates several dependencies on "xprt->stream". Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with v2, v3, and v4.
Subject: [PATCH] RPC: separate TCP and UDP socket write paths Split the RPC client's main socket write path into a TCP version and a UDP version to eliminate another dependency on the "xprt->stream" variable. Rely on compiler optimization to remove conditional branches from xprt_sock_sendpages, as this function is now called with some constant arguments. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone". Examine oprofile results for any changes before and after this patch is applied.
Subject: [PATCH] RPC: skip over transport-specific heads automatically Add a mechanism for skipping over transport-specific headers when constructing an RPC request. This removes another "xprt->stream" dependency. Test-plan: Write-intensive workload on a single mount point.
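As an illustration, RPC over TCP prefixes each record with a 4-byte record marker, while UDP has no such prefix. A generic encoder can reserve whatever length the transport asks for and let the transport fill it in afterwards. The user-space sketch below assumes hypothetical names throughout and is not the mechanism used in the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define TCP_RECORD_MARKER_LEN 4u
#define LAST_FRAGMENT 0x80000000u

struct encode_buf_sketch {
    uint8_t *base;
    size_t size;
    size_t pos;     /* next byte to encode into */
};

/* Store a 32-bit value in network (big-endian) byte order. */
static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Generic code: skip the transport's header space before encoding. */
void start_request_sketch(struct encode_buf_sketch *b, uint8_t *mem,
                          size_t size, size_t transport_hdr_len)
{
    b->base = mem;
    b->size = size;
    b->pos = transport_hdr_len <= size ? transport_hdr_len : size;
}

/* Generic code: append one XDR word. Returns 0, or -1 if out of space. */
int put_u32_sketch(struct encode_buf_sketch *b, uint32_t v)
{
    if (b->size - b->pos < 4)
        return -1;
    store_be32(b->base + b->pos, v);
    b->pos += 4;
    return 0;
}

/*
 * TCP-specific code: fill in the record marker once the payload length
 * is known. Assumes the request was started with a 4-byte header.
 */
void finish_tcp_record_sketch(struct encode_buf_sketch *b)
{
    uint32_t payload = (uint32_t)(b->pos - TCP_RECORD_MARKER_LEN);

    store_be32(b->base, LAST_FRAGMENT | payload);
}
```

A UDP-like transport would pass a header length of zero to start_request_sketch and skip the final step, so the encoder itself never needs to ask which kind of transport it is writing for.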
Subject: [PATCH] RPC: get rid of xprt->stream Description: Now we can remove the last few places that use the "xprt->stream" variable, and get rid of it from the rpc_xprt structure. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP.
Subject: [PATCH] RPC: transport-specific timeouts Description: This patch prepares the way to remove the "xprt->nocong" variable by adding callouts to the RPC client transport switch API to handle setting RPC timeouts. Test-plan: Use WAN simulation to cause sporadic bursty packet loss. Look for significant regression in performance or client stability.
Subject: [PATCH] RPC: remove xprt->nocong Description: Get rid of the "xprt->nocong" variable. Test-plan: Use WAN simulation to cause sporadic bursty packet loss. Look for significant regression in performance or client stability.
Subject: [PATCH] RPC: switchable buffer allocation Description: In the IPv4 socket transport implementation, RPC buffers are allocated as needed for each RPC message that is sent. Some transport implementations may choose to use pre-allocated buffers for encoding, sending, receiving, and unmarshalling RPC messages. This patch adds RPC client transport switch support for replacing buffer management on a per-transport basis. Test-plan: Millions of fsx operations. Performance characterization with "sio" and "iozone". Use oprofile and other tools to look for significant regression in CPU utilization.
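A rough user-space sketch of per-transport buffer management follows. It contrasts an allocate-as-needed policy with a small pre-allocated pool behind the same pair of methods. All names and sizes here are hypothetical and chosen for illustration; the patch's real method names are not shown in this summary.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical buffer methods a transport could plug in. */
struct buf_ops_sketch {
    void *(*alloc)(void *ctx, size_t len);
    void (*free)(void *ctx, void *buf);
};

/* Policy 1: allocate as needed for each message. */
static void *heap_alloc_sketch(void *ctx, size_t len)
{
    (void)ctx;
    return malloc(len);
}

static void heap_free_sketch(void *ctx, void *buf)
{
    (void)ctx;
    free(buf);
}

/* Policy 2: hand out fixed-size buffers from a pre-allocated pool. */
#define POOL_BUFS 2
#define POOL_BUF_SIZE 1024

struct pool_sketch {
    unsigned char mem[POOL_BUFS][POOL_BUF_SIZE];
    int busy[POOL_BUFS];
};

static void *pool_alloc_sketch(void *ctx, size_t len)
{
    struct pool_sketch *p = ctx;
    int i;

    if (len > POOL_BUF_SIZE)
        return NULL;
    for (i = 0; i < POOL_BUFS; i++) {
        if (!p->busy[i]) {
            p->busy[i] = 1;
            return p->mem[i];
        }
    }
    return NULL;    /* pool exhausted: caller must wait or fail */
}

static void pool_free_sketch(void *ctx, void *buf)
{
    struct pool_sketch *p = ctx;
    int i;

    for (i = 0; i < POOL_BUFS; i++)
        if (buf == (void *)p->mem[i])
            p->busy[i] = 0;
}

const struct buf_ops_sketch heap_buf_ops = { heap_alloc_sketch, heap_free_sketch };
const struct buf_ops_sketch pool_buf_ops = { pool_alloc_sketch, pool_free_sketch };
```

Generic code that calls only ops->alloc and ops->free works unchanged with either policy, which is the property a per-transport switch is meant to provide.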
Subject: [PATCH] RPC: pluggable portmapping Description: Introduce new RPC client transport switch methods to handle RPC portmapping in a transport independent way. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked. Probably need to rig a server where certain services aren't running, or that returns an error for some typical operation.
Subject: [PATCH] RPC: API for getting remote peer address Description: Provide an API for retrieving the remote peer address without allowing direct access to the rpc_xprt struct. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked. Probably need to rig a server where certain services aren't running, or that returns an error for some typical operation.
Subject: [PATCH] RPC: display XIDs in network order Description: Ethereal and other tools display RPC XIDs in network order. This patch changes the RPC trace messages that display XIDs to print them in network order so they can be easily matched to XIDs that appear in Ethereal. Test-plan: Run a short program with RPC trace debugging enabled while capturing a packet trace with Ethereal. Compare the output of the trace debugging messages with the contents of the Ethereal window.
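For illustration, assuming the XID is held in memory exactly as it appears on the wire, converting it with ntohl() before printing yields the same digits a packet analyzer shows, regardless of host endianness. The helper name below is hypothetical and the snippet is a user-space sketch, not the kernel trace code.

```c
#include <arpa/inet.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format an XID that is stored in network byte order. Converting to
 * host order first means the printed hex digits read in wire order on
 * both little-endian and big-endian machines.
 * Returns the value returned by snprintf().
 */
int format_xid_sketch(char *out, size_t outlen, uint32_t xid_net)
{
    return snprintf(out, outlen, "%08" PRIx32, (uint32_t)ntohl(xid_net));
}
```

Printing the raw stored value with the same format string on a little-endian host would show the bytes reversed, which is what makes matching against a packet trace awkward.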
Subject: [PATCH] RPC: load RPC transport implementations dynamically Description: Now that we have an RPC transport switch, we have introduced the potential to add new transport capabilities for use by the RPC client at run-time. This patch allows RPC client transport implementations to be loaded as needed, or as they become available from distributors or third-party vendors. This patch is experimental. It is safe to apply, but is functionally incomplete. Currently it is acting only as a placeholder to collect changes related to transport module loading. Test-plan: Build kernel with NFS and SUNRPC in a module. Try loading and unloading the IPv4 transport module. Try mounting without the IPv4 module loaded. Destructive testing. Try unloading SUNRPC or the IPv4 transport module while there are active NFS mounts.
Subject: [PATCH] RPC: add function to get tk_auth from rpc_rqst pointer Description: Create a standardized way to derive the appropriate tk_auth for a given rpc_rqst. This cleanup is required for the next patch. Test-plan: None.
Subject: [PATCH] RPC: make rpc_rqst a part of rpc_task Description: The RPC client allocates rpc_rqst structures from slots in a static table. The size of this slot table determines how many RPC requests can be in flight concurrently for that rpc_clnt (NFS mount point). Previously, this static slot table was fixed in size, and contained just 16 slots for every rpc_clnt. In recent kernels, logic was added to allow this static slot table to be sized and allocated dynamically at the time an rpc_clnt is created. The slot table can now be made as large as 128 slots to permit many more concurrent RPC requests, but the memory remains allocated until each rpc_clnt is destroyed, even when there are no outstanding RPC requests. NFSv4.1 sessions provide the ability to negotiate the maximum number of concurrent requests that the transport and the server can handle. The maximum can be increased or decreased during a transport session. This patch makes RPC requests a part of the RPC task structure. This reduces memory fragmentation by keeping all the pieces of an NFS and RPC request together. For asynchronous reads and writes, this means the RPC request, RPC task, and nfs_read/write_data structures reside in the same CPU cache lines, and will move between caches together. RPC slot tables are eliminated entirely, preventing memory from being held even when it's not being used, and allowing the maximum number of outstanding RPC requests to be modulated dynamically. The call_reserve path has been simplified. Its only jobs now are to initialize each rpc_rqst structure, or to queue RPC tasks on the xprt's backlog queue if the maximum number per transport has been exceeded. We no longer have a free list, and each XID can be allocated while holding the xprt_lock. Thus we can do away with the reserve_lock entirely. Test-plan: Heavy multi-threaded tests like "sio" and "iozone" on SMP systems. Watch for significant regression in CPU utilization with oprofile and other tools. Watch for multiple RPCs with the same XID. Destructive testing.
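The simplified reserve step might look roughly like the user-space sketch below: the request lives inside the task, there is no slot table or free list, and the only limit is a per-transport counter that can be changed at run time. All names are hypothetical, locking is omitted, and the backlog is reduced to a counter for brevity.

```c
/* Request state embedded directly in the task (no separate slot). */
struct rqst_embed_sketch {
    unsigned int xid;
    int in_use;
};

struct task_embed_sketch {
    int id;
    struct rqst_embed_sketch rqst;
};

/* Per-transport limit; 'max_reqs' may be raised or lowered at any time. */
struct xprt_limit_sketch {
    unsigned int max_reqs;
    unsigned int in_flight;
    unsigned int next_xid;
    unsigned int backlog;   /* tasks waiting for the count to drop */
};

/*
 * Returns 1 if the task's request was initialized, or 0 if the task
 * has to wait on the backlog because the transport is at its limit.
 */
int reserve_sketch(struct xprt_limit_sketch *x, struct task_embed_sketch *t)
{
    if (x->in_flight >= x->max_reqs) {
        x->backlog++;
        return 0;
    }
    x->in_flight++;
    t->rqst.xid = x->next_xid++;
    t->rqst.in_use = 1;
    return 1;
}

/* Release the request; a real client would also wake a backlogged task. */
void release_sketch(struct xprt_limit_sketch *x, struct task_embed_sketch *t)
{
    if (t->rqst.in_use) {
        t->rqst.in_use = 0;
        x->in_flight--;
    }
}
```

Because no memory is tied to the limit, raising max_reqs takes effect on the next reserve call without allocating anything.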
Subject: [PATCH] RPC: remove tk_rqstp field from rpc_task Description: Now that the rpc_rqst structure is part of rpc_task, there is no need for a field in rpc_task that points to the associated RPC request slot. This patch removes the tk_rqstp field. Test-plan: None.
Subject: [PATCH] RPC: remove rq_task field from rpc_rqst Description: Now that the rpc_rqst structure is part of rpc_task, there is no need for a field in rpc_rqst that points to its associated rpc_task. This patch removes the rq_task field. Test-plan: None.
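Once one structure is embedded in the other, each can be found from the other by address arithmetic alone, which is why neither back-pointer is needed. The user-space sketch below shows the idea with offsetof(); the structure layouts and the task_of_rqst macro are illustrative assumptions, with only the tk_rqst member name taken from the series.

```c
#include <stddef.h>

struct rpc_rqst_sketch {
    unsigned int rq_xid;
};

struct rpc_task_sketch {
    int tk_pid;
    struct rpc_rqst_sketch tk_rqst;     /* embedded, not pointed to */
};

/*
 * Recover the enclosing task from a pointer to its embedded request by
 * subtracting the member's offset within the task structure.
 */
#define task_of_rqst(req) \
    ((struct rpc_task_sketch *)((char *)(req) - \
        offsetof(struct rpc_task_sketch, tk_rqst)))

/* Going the other way is just taking the member's address. */
struct rpc_rqst_sketch *rqst_of_task(struct rpc_task_sketch *task)
{
    return &task->tk_rqst;
}
```

In kernel code the same computation is normally written with the container_of() macro.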
Subject: [PATCH] RPC: use rq_xprt instead of tk_xprt Description: Minor performance optimization. Now that the rpc_rqst structure is part of rpc_task, we can replace this double indirection (task->tk_client->cl_xprt) with a single indirection (task->tk_rqst.rq_xprt) to save a load instruction and eliminate one AGI in several places. Test-plan: None.
Subject: [PATCH] RPC: add rq_rtt field to rpc_rqst Description: Minor performance optimization. Currently it is cumbersome to derive the RTT data from the address of an RPC request. This patch adds an rq_rtt field which is a copy of the RPC client's cl_rtt field. That makes it easier and cleaner to find the RTT data when needed, and reduces the number of loads and AGIs in several places. Test-plan: None.
Subject: [PATCH] NFS: use attribute timeout instead of "noac" mount option The behavior enabled by the "noac" mount option should be precisely equivalent to setting acreg{min,max} and acdir{min,max} to zero via mount options. Test-plan: Compare behavior of mounts with "actimeo=0" and "noac".
Subject: [PATCH] NFS: readdir pre-emption The NFS directory logic provides several pre-emption points to allow other work on the system to proceed during the potentially lengthy searches in the NFS directory cache. However, the pre-emption logic is almost never triggered because it waits for 200 loop iterations before calling schedule(). On most 4KB-per-page clients, it is nearly impossible to get 200 directory entries into a single page. This patch adds more frequent pre-emption to the readdir and cached lookup paths. Pre-emption will occur about once per page during multi-page scans, and not at all if only a single page is involved. Category: Performance scalability enhancement Test-plan: Performance characterization of a directory scan like "find" while running a multi-threaded workload.
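The once-per-page policy can be sketched in user space as below, with sched_yield() standing in for the kernel's rescheduling point. The function name and its parameters are hypothetical; the sketch only counts entries instead of searching them.

```c
#include <sched.h>
#include <stddef.h>

/*
 * Scan 'npages' pages of directory entries, yielding the CPU between
 * pages. A single-page scan never yields; a multi-page scan yields
 * once before each page after the first. Returns the number of yields
 * and stores the number of entries visited in '*entries_seen'.
 */
size_t scan_pages_sketch(size_t npages, size_t entries_per_page,
                         size_t *entries_seen)
{
    size_t yields = 0;
    size_t page, entry;

    *entries_seen = 0;
    for (page = 0; page < npages; page++) {
        if (page > 0) {
            sched_yield();  /* pre-emption point between pages */
            yields++;
        }
        for (entry = 0; entry < entries_per_page; entry++)
            (*entries_seen)++;
    }
    return yields;
}
```

Tying the yield to page boundaries instead of a fixed iteration count means the pre-emption point is reached regardless of how many entries fit in a page.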
Subject: [PATCH] NFS: getdents(3) hints When an application invokes getdents(3) on a directory stored in NFS, the directory cache logic always searches from the beginning of the directory to find the cookie in question. For large directories, this is significant overhead, and means that a single walk through the directory using getdents(3) calls can be much worse than O(n), approaching O(n^2), because every call rescans the entries already returned. This patch adds a page index hint to the directory search algorithm so that getdents(3) can start where it left off, rather than walking the entire directory from the beginning each time it is called. Category: Scalability enhancement Test-plan: Connectathon, "rm -rf" on a large directory tree; tar and untar.
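A user-space sketch of a search that starts at a remembered page is shown below. It models the cached directory as a flat array of cookies split into fixed-size pages. The names, the page size, and the wrap-around fallback are all assumptions made for illustration and are not taken from the patch.

```c
#include <stddef.h>

#define ENTRIES_PER_PAGE_SKETCH 4

struct dir_sketch {
    const unsigned long *cookies;   /* one cookie per cached entry */
    size_t nentries;
    size_t hint_page;               /* page where the last search ended */
    size_t pages_visited;           /* instrumentation for the example */
};

/*
 * Find the index of 'cookie'. Search from the hinted page to the end
 * first, then wrap around to the pages before the hint. On success the
 * hint is updated and the entry index is returned; otherwise -1.
 */
long find_cookie_sketch(struct dir_sketch *d, unsigned long cookie)
{
    size_t npages = (d->nentries + ENTRIES_PER_PAGE_SKETCH - 1) /
                    ENTRIES_PER_PAGE_SKETCH;
    size_t start = d->hint_page < npages ? d->hint_page : 0;
    int pass;

    for (pass = 0; pass < 2; pass++) {
        size_t first = pass == 0 ? start : 0;
        size_t last = pass == 0 ? npages : start;
        size_t p, i;

        for (p = first; p < last; p++) {
            d->pages_visited++;
            for (i = p * ENTRIES_PER_PAGE_SKETCH;
                 i < d->nentries && i < (p + 1) * ENTRIES_PER_PAGE_SKETCH;
                 i++) {
                if (d->cookies[i] == cookie) {
                    d->hint_page = p;
                    return (long)i;
                }
            }
        }
    }
    return -1;
}
```

On a sequential walk each lookup after the first touches only the hinted page or the one following it, so the total work grows linearly with the size of the directory.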