Published on Wed Aug 18 00:02:57 EDT 2004
The following patches are against 2.6.8.1. Included for reference are the latest version of the RPC transport switch design document and a set of notes describing what's next.
Subject: [PATCH] NFS: use reasonable block size when copying files Category: Reversion, new feature Description: In 2.4, NFS O_DIRECT used the VFS's O_DIRECT logic to provide direct I/O support for NFS files. The 2.4 VFS O_DIRECT logic was block based, thus the NFS client had to provide a minimum allowable blocksize for O_DIRECT reads and writes on NFS files. For various reasons we chose 512 bytes. In 2.6, there is no requirement for a minimum blocksize. NFS O_DIRECT reads and writes can go to any byte at any offset in a file. Thus we revert the blocksize setting for NFS file systems to the previous behavior, which was to advertise the "wsize" setting as the optimal I/O block size. This improves the performance of applications like 'cp' which use this value as their transfer size. This patch also exposes the server's reported disk block size in the f_frsize of the vfsstat structure. Test-plan: Standard performance regression tests using 'cp' and 'dd'.
Subject: [PATCH] NFS: mount failure recovery cleanup Category: Code re-organization, maintainability Description: Simplify mount error recovery logic. Get rid of nfs_put_super. Test-plan: We don't have any good mount test cases at this time. However, we should make certain that NFSv2/3 and NFSv4 mounting is carefully checked. Probably need to rig a server where certain services aren't running, or that returns an error for some typical operations.
Subject: [PATCH] NFS: break single global lock into per-inode lock Category: Performance scalability enhancement Description: Break the nfs_wreq_lock into per-inode locks. This helps prevent a heavy read or write workload on one file from interfering with workloads against other NFS files. Note that there is still some serialization due to the big kernel lock. Test-plan: Run multi-threaded multi-file tests on large-scale SMP and NUMA clients. Look for performance regressions or stability problems, such as hanging mount points or applications, oopses, or system deadlocks.
Subject: [PATCH] NFS: compare file handles efficiently Category: Performance scalability enhancement Description: NFS file handles can be as large as 128 bytes, but are most commonly no more than 32 bytes. While the storage container for NFS file handles must be able to store a maximum of 128 bytes, it should be necessary to compare only the valid bytes between two file handles, and not the extra pad bytes. This patch creates an efficient standard mechanism for comparing NFS file handles that ignores the unused bytes in a file handle container. This reduces the size of most file handle comparisons from all 128 bytes in the storage container to about 32 bytes on average. Test-plan: All connectathon tests should pass with NFSv2, v3, and v4. Watch for CPU utilization regressions as measured by oprofile and other tools.
Subject: [PATCH] NFS: copy file handles efficiently Category: Performance scalability enhancement Description: Now that file handle comparison ignores the unused parts of the file handle container, there is no longer any need to clear each NFS file handle container before copying in a new file handle. This allows the removal of a 128 byte memset() from several hot paths. Test-plan: All connectathon tests should pass with NFSv2, v3, and v4. Watch for CPU utilization regressions as measured by oprofile and other tools.
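The idea behind the two file handle patches above can be sketched in user-space C. This is an illustration only: struct fh, fh_compare(), and fh_copy() are hypothetical stand-ins, not the kernel's actual structure or function names. The one detail taken from the descriptions is the 128-byte container holding a shorter run of valid bytes.

```c
#include <assert.h>
#include <string.h>

#define FH_MAXSIZE 128          /* container capacity, per the description */

/* Hypothetical stand-in for an NFS file handle container. */
struct fh {
    unsigned short size;            /* number of valid bytes in data[] */
    unsigned char data[FH_MAXSIZE]; /* bytes past `size` are unused padding */
};

/* Compare only the valid bytes. Returns 0 when the handles are equal. */
int fh_compare(const struct fh *a, const struct fh *b)
{
    assert(a->size <= FH_MAXSIZE && b->size <= FH_MAXSIZE);
    return a->size != b->size || memcmp(a->data, b->data, a->size) != 0;
}

/* Copy only the valid bytes. The destination is not cleared first,
 * because fh_compare() never looks at the padding. */
void fh_copy(struct fh *dst, const struct fh *src)
{
    assert(src->size <= FH_MAXSIZE);
    dst->size = src->size;
    memcpy(dst->data, src->data, src->size);
}
```

Because the comparison stops at the valid length, stale bytes left in the padding of a reused container cannot cause two equal handles to compare as different, which is what makes dropping the memset() safe.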
Subject: [PATCH] NFS: short write warning Category: Serviceability Description: Recently a patch set was submitted to allow the Linux NFS client to handle short writes by retrying the unwritten portion of the request. The only case that now results in an error is when the server makes no progress; that is, writes zero bytes. This patch modifies the kernel log warning that is generated in that case to reflect the error condition more accurately. Test-plan: None.
Subject: [PATCH] NFS: Use asynchronous reads to handle direct read requests Category: Performance enhancement Description: The initial implementation of NFS direct reads was entirely synchronous. The direct read logic issued one NFS READ operation at a time, and waited for the server's reply before issuing the next one. For large direct read requests, this is very slow. This patch changes the NFS direct read path to dispatch NFS READ operations for a single direct read request in parallel and wait for them once. The direct read path is still synchronous in nature, but because the NFS READ operations are going in parallel, the completion wait should be much shorter. Test-plan: Millions of operations with fsx-odirect. OraSim with a direct job file and small rsize. Use sio with -direct to generate large sequential reads or large random reads.
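To make the dispatch step concrete, here is a rough sketch of how one direct read request might be cut into rsize-sized READ operations before they are all sent and then waited on together. The struct read_chunk type and split_direct_read() are hypothetical names chosen for illustration. The actual patch operates on kernel structures and is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one on-the-wire READ operation. */
struct read_chunk {
    unsigned long long offset;  /* file offset of this READ */
    size_t count;               /* bytes requested, at most rsize */
};

/* Split the byte range [offset, offset + len) into chunks of at most
 * `rsize` bytes. Returns the number of chunks written to `chunks`, or 0
 * if rsize is zero, len is zero, or `max_chunks` is too small. */
size_t split_direct_read(unsigned long long offset, size_t len, size_t rsize,
                         struct read_chunk *chunks, size_t max_chunks)
{
    size_t n = 0;

    assert(chunks != NULL);
    if (rsize == 0)
        return 0;
    while (len > 0) {
        size_t count = len < rsize ? len : rsize;

        if (n == max_chunks)
            return 0;
        chunks[n].offset = offset;
        chunks[n].count = count;
        n++;
        offset += count;
        len -= count;
    }
    return n;
}
```

With a serial implementation, the caller would wait for each chunk's reply before sending the next. With the approach described above, all chunks are issued first and the caller waits once for the whole set.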
Subject: [PATCH] NFS: Direct writes allocate rpc_task on the stack Category: Stability enhancement Description: Kernel stack utilization is on a diet. Reduce stack utilization in the NFS direct write path by using a dynamically allocated nfs_write_data structure instead of allocating one on the stack. Test-plan: Millions of operations with fsx-odirect. OraSim with a direct job file.
Subject: [PATCH] RPC: Synchronous RPC calls allocate rpc_task on the stack Category: Stability enhancement Description: Reduce stack utilization for all synchronous NFS operations by using a dynamically allocated rpc_task structure instead of allocating one on the stack. This reduces stack utilization by over 200 bytes for all synchronous NFS operations. Test-plan: Performance regression tests that emphasize synchronous metadata operations such as LOOKUP and GETATTR. Examine client-side CPU utilization using kernel profiling tools such as oprofile.
Subject: [PATCH] NFS: getdents(3) hints Category: Scalability enhancement Description: When an application invokes getdents(3) on a directory stored in NFS, the directory cache logic always searches from the beginning of the directory to find the cookie in question. For large directories, this is significant overhead, and means that a single walk through the directory using getdents(3) calls can be O(n^2), because each call rescans everything that earlier calls already passed over. This patch adds a page index hint to the directory search algorithm so that getdents(3) can start where it left off, rather than walking the entire directory from the beginning each time it is called. Test-plan: Connectathon, "rm -rf" on a large directory tree; tar and untar.
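The effect of a page index hint can be sketched with a toy directory cache. Everything here is hypothetical: the struct layouts, the find_cookie() name, the tiny page size, and in particular the wrap-around search order, which is an assumption of this sketch and not something the patch description states.

```c
#include <assert.h>
#include <stddef.h>

#define ENTRIES_PER_PAGE 4      /* tiny on purpose, for illustration */

/* Hypothetical cached directory page holding opaque, unordered cookies. */
struct dir_page {
    unsigned long long cookies[ENTRIES_PER_PAGE];
    size_t nr;                  /* valid entries in cookies[] */
};

/* Hypothetical directory cache with a "last page found" hint. */
struct dir_cache {
    const struct dir_page *pages;
    size_t nr_pages;
    size_t hint;                /* page index where the last search ended */
};

/* Search for `cookie`, starting at the hinted page and wrapping around.
 * Returns the page index or -1, and stores the number of pages examined
 * in *visited. On success the hint is updated for the next call. */
long find_cookie(struct dir_cache *dc, unsigned long long cookie,
                 size_t *visited)
{
    size_t i, j;

    assert(dc != NULL && visited != NULL);
    *visited = 0;
    for (i = 0; i < dc->nr_pages; i++) {
        size_t p = (dc->hint + i) % dc->nr_pages;

        (*visited)++;
        for (j = 0; j < dc->pages[p].nr; j++) {
            if (dc->pages[p].cookies[j] == cookie) {
                dc->hint = p;
                return (long)p;
            }
        }
    }
    return -1;
}
```

A sequential walk asks for cookies in roughly the order they are cached, so after the first lookup each search typically examines one or two pages instead of every page before the target.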
Subject: [PATCH] NFS: optimize out READDIR operation on empty directories Category: Performance scalability enhancement Description: NFS directory cookies are opaque and unordered. Thus nfs_readdir() always queries the server to determine which is the next cookie to return when it is asked about a cookie that is not in its cache. If the directory is empty, then it is already clear the cookie does not exist, so we don't need to do a READDIR operation in that case. This eliminates an NFS READDIR that occurs at the end of every directory during "rm -rf". Test-plan: Connectathon, "rm -rf" on a large directory tree; tar and untar.
Subject: [PATCH] NFS: readdir pre-emption Category: Performance scalability enhancement Description: The NFS directory logic provides several pre-emption points to allow other work on the system to proceed during the potentially lengthy searches in the NFS directory cache. However, the pre-emption logic is almost never triggered because it waits for 200 loop iterations before calling schedule(). On most 4KB-per-page clients, it is nearly impossible to get 200 directory entries into a single page. This patch adds more frequent pre-emption to the readdir and cached lookup paths. Pre-emption will occur about once per page during multi-page scans, and not at all if only a single page is involved. Test-plan: Performance characterization of a directory scan like "find" while running a multi-threaded workload.
Subject: [PATCH] NFS: directory trace messages Category: Serviceability Description: Those who use pre-built kernels from a distribution can't add more trace messages to a kernel when diagnosing a problem in the field. On the other hand, we don't want so many trace messages that the kernel log is overwhelmed with noise when tracing is enabled. This patch reuses NFSDBG_DIRCACHE and NFSDBG_LOOKUPCACHE to provide additional diagnostic messages that trace the operation of the NFS client's directory cache and lookup cache. A few other messages are now generated when NFSDBG_VFS is active, as well. Test-plan: Enable NFS trace debugging with flags 1, 2, or 4. You should be able to see different types of trace messages with each flag setting.
Subject: [PATCH] NFS: more efficient NFSv3 directory decoding Category: Performance scalability enhancement Description: A returned NFSv3 READDIRPLUS directory entry is more complicated than a READDIR directory entry. Sometimes, when walking through a directory that was read in via READDIRPLUS, it is necessary to roll back to the previous entry because there is not enough room in the allotted buffer to decode a full entry along with attributes and file handle. To allow this roll back to occur, each time an NFSv3 directory entry is decoded, the old version of the current 54-byte entry is saved in its entirety in the current stack frame, even if the entry comes from READDIR, and not from READDIRPLUS. This patch replaces the save operation in the hot path of NFSv3 directory entry decoding with a single u64 copy. The u64 copy operation should be sufficient to allow the roll back in the rare case where a buffer overflow occurs. Test-plan: Walking a 20K entry directory should take less system CPU time, as measured with oprofile.
Subject: [PATCH] NFS: negative lookup caching Category: Performance scalability enhancement Description: In 2.6, the NFS client uses the readdir cache to eliminate on-the-wire lookup operations in NFS version 3. The version 3 READDIRPLUS operation can return file handles and attributes for each directory entry. However, the lookup cache does not distinguish between "cache valid, entry not found" and "cache invalid". This patch adds that capability. Using this new feature, we add support for revalidating negative dentries using the lookup cache. Test-plan: Connectathon, "rm -rf" and tar/untar. Multi-client software builds.
Subject: [PATCH] NFS: directory cleanups Category: Maintainability Description: This patch provides various cleanups and comment corrections in the NFS directory logic. Test-plan: None.
Subject: [PATCH] RPC: introduce client-side transport switch Category: Code re-organization Description: This patch introduces a transport switch into the kernel RPC client. The RPC transport switch divorces socket-specific implementation logic from the generic pieces of the RPC client. Such a switch will allow support for RPC over 10GbE, IPsec offload, multiple sockets per mount, IPv6, and transports capable of direct data placement. The first patch in the series moves the bulk of socket-specific code into a separate source file, net/sunrpc/ipv4_sock.c. The patch attempts to avoid any functional changes or rewrites. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone". Destructive testing (unplugging the network temporarily). Connectathon with v2, v3, and v4.
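For readers unfamiliar with the pattern, a transport switch is essentially a table of function pointers that generic code calls through. The sketch below is a user-space illustration under stated assumptions: struct xprt_ops, struct xprt, xprt_send(), and the two method tables are hypothetical names and do not reproduce the interface in the patch. The 4-byte stream header stands in for the RPC record marker used on stream transports.

```c
#include <assert.h>
#include <stddef.h>

struct xprt;

/* Hypothetical per-transport method table. */
struct xprt_ops {
    const char *name;
    size_t header_len;                  /* transport-specific header bytes */
    int (*connect)(struct xprt *xprt);  /* returns 0 on success */
};

/* Hypothetical transport instance; generic code sees only this. */
struct xprt {
    const struct xprt_ops *ops;
    int connected;
    size_t bytes_queued;
};

static int stream_connect(struct xprt *xprt)
{
    xprt->connected = 1;    /* a real stream transport would handshake here */
    return 0;
}

static int dgram_connect(struct xprt *xprt)
{
    xprt->connected = 1;    /* datagram transports have little to set up */
    return 0;
}

const struct xprt_ops stream_ops = { "stream", 4, stream_connect };
const struct xprt_ops dgram_ops = { "datagram", 0, dgram_connect };

/* Generic send path: no "is this a stream?" test, only calls and
 * values obtained through the method table. */
int xprt_send(struct xprt *xprt, size_t payload_len)
{
    assert(xprt != NULL && xprt->ops != NULL);
    if (!xprt->connected && xprt->ops->connect(xprt) != 0)
        return -1;
    xprt->bytes_queued += xprt->ops->header_len + payload_len;
    return 0;
}
```

Adding a new transport in this style means supplying another method table. The generic send path does not change, which is the property the later patches in this series rely on when they remove flag variables such as "xprt->stream".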
Subject: [PATCH] RPC: move xdr_sendpages under transport switch Category: Code re-organization Description: This patch removes socket-dependent code from net/sunrpc/xdr.c and adds it to net/sunrpc/sock.c. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone".
Subject: [PATCH] RPC: client-side transport switch cleanup Category: Code re-organization Description: This patch cleans up remaining socket-specific structure naming, removes include/socket.h from most RPC client source files, and changes some comments to reflect the realities of the new RPC transport switch mechanism. It also removes the cong_wait field from rpc_xprt, as it is no longer used. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone".
Subject: [PATCH] RPC: separate TCP and UDP write space callbacks Category: Code re-organization, maintainability Description: This patch splits the socket write space callback function into a TCP version and UDP version, eliminating one dependence on the "xprt->stream" variable. It also makes both callbacks more CPU efficient by reducing the number of conditional branches taken in the hot path in each function. Test-plan: Write-intensive workload on a single mount point.
Subject: [PATCH] RPC: separate TCP and UDP transport connection logic Category: Code re-organization, maintainability Description: This patch splits the RPC client's connection logic into separate paths for UDP and TCP, eliminating another dependency on the "xprt->stream" variable. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with v2, v3, and v4.
Subject: [PATCH] RPC: separate TCP and UDP socket write paths Category: Code re-organization, maintainability Description: This patch splits the RPC client's main socket write path into a TCP version and a UDP version to eliminate another dependency on the "xprt->stream" variable. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone". Examine oprofile results for any changes before and after this patch is applied.
Subject: [PATCH] RPC: skip over transport-specific headers automatically Category: Code re-organization, maintainability Description: Add a mechanism for skipping over transport-specific headers when constructing an RPC request. This removes another "xprt->stream" dependency, and gets rid of a conditional branch in the RPC send hot path. Test-plan: Write-intensive workload on a single mount point.
Subject: [PATCH] RPC: get rid of xprt->stream Category: Code re-organization, maintainability Description: Now we can remove the last few places that use the "xprt->stream" variable, and get rid of it from the rpc_xprt structure. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP.
Subject: [PATCH] RPC: transport-specific timeouts Category: Code re-organization, maintainability Description: This patch prepares the way to remove the "xprt->nocong" variable by adding callouts to the RPC client transport switch API to handle setting RPC timeouts. Test-plan: Use WAN simulation to cause sporadic bursty packet loss. Look for significant regression in performance or client stability.
Subject: [PATCH] RPC: remove xprt->nocong Category: Code re-organization, maintainability Description: Get rid of the "xprt->nocong" variable. Test-plan: Use WAN simulation to cause sporadic bursty packet loss. Look for significant regression in performance or client stability.
Subject: [PATCH] RPC: switchable buffer allocation Category: Code re-organization Description: In the IPv4 socket transport implementation, RPC buffers are allocated as needed for each RPC message that is sent. Some transport implementations may choose to use pre-allocated buffers for encoding, sending, receiving, and unmarshalling RPC messages. This patch adds RPC client transport switch support for replacing buffer management on a per-transport basis. Test-plan: Millions of fsx operations. Performance characterization with "sio" and "iozone". Use oprofile and other tools to look for significant regression in CPU utilization.
Subject: [PATCH] RPC: rpc_task cleanup Category: Code re-organization Description: Move some retry counters from the rpc_task structure to the rpc_rqst structure. These count request retry failures, and belong to each request rather than to each task. Test-plan: None.
Subject: [PATCH] RPC: pluggable portmapping Category: Code re-organization, maintainability Description: Introduce new RPC client transport switch methods to handle RPC portmapping in a transport independent way. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked. Probably need to rig a server where certain services aren't running, or that returns an error for some typical operation.
Subject: [PATCH] RPC: API for getting remote peer address Category: Code re-organization, maintainability Description: Provide an API for retrieving the remote peer address without allowing direct access to the rpc_xprt struct. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked. Probably need to rig a server where certain services aren't running, or that returns an error for some typical operation.
Subject: [PATCH] RPC: display XIDs in network order Category: Serviceability Description: Ethereal and other tools display RPC XIDs in network order. This patch changes the RPC trace messages that display XIDs to print them in network order so they can be easily matched to XIDs that appear in Ethereal. Test-plan: Run a short program with RPC trace debugging enabled while capturing a packet trace with Ethereal. Compare the output of the trace debugging messages with the contents of the Ethereal window.
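A small user-space sketch of the formatting issue, assuming the XID is held as the four bytes that appear on the wire. The format_xid() helper is hypothetical. In user space the conversion is ntohl(), and the kernel's trace code would use its own byte-order helpers.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a 4-byte on-the-wire XID the way a packet analyzer shows it.
 * Printing the raw 32-bit value without ntohl() would give a byte-swapped
 * number on little-endian hosts. Returns the snprintf() result. */
int format_xid(char *buf, size_t len, const unsigned char wire[4])
{
    uint32_t xid_be;

    assert(buf != NULL && wire != NULL);
    memcpy(&xid_be, wire, sizeof xid_be);
    return snprintf(buf, len, "%08x", (unsigned int)ntohl(xid_be));
}
```

The resulting string is the same on big-endian and little-endian hosts, so it can be matched directly against the XID column of a packet trace.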
Subject: [PATCH] RPC: load RPC transport implementations dynamically Category: Experimental Description: Now that we have an RPC transport switch, we have introduced the potential to add new transport capabilities for use by the RPC client at run-time. This patch allows RPC client transport implementations to be loaded as needed, or as they become available from distributors or third-party vendors. This patch is experimental. It is safe to apply, but is functionally incomplete. Currently it is acting only as a placeholder to collect changes related to transport module loading. Test-plan: Build kernel with NFS and SUNRPC in a module. Try loading and unloading the IPv4 transport module. Try mounting without the IPv4 module loaded. Destructive testing. Try unloading SUNRPC or the IPv4 transport module while there are active NFS mounts.
Subject: [PATCH] RPC: add function to get tk_auth from rpc_rqst pointer Category: Minor code re-organization Description: Create a standardized way to derive the appropriate tk_auth for a given rpc_rqst. This is clean up required for the next patch. Test-plan: None.
Subject: [PATCH] RPC: make rpc_rqst a part of rpc_task Category: Performance scalability enhancement Description: The RPC client allocates rpc_rqst structures from slots in a static table. The size of this slot table determines how many RPC requests can be in flight concurrently for that rpc_clnt (NFS mount point). Previously, this static slot table was fixed in size, and contained just 16 slots for every rpc_clnt. In recent kernels, logic was added to allow this static slot table to be sized and allocated dynamically at the time an rpc_clnt is created. The slot table can now be made as large as 128 slots to permit many more concurrent RPC requests, but the memory remains allocated until each rpc_clnt is destroyed, even when there are no outstanding RPC requests. NFSv4.1 sessions provide the ability to negotiate the maximum number of concurrent requests that the transport and the server can handle. The maximum can be increased or decreased during a transport session. This patch makes RPC requests a part of the RPC task structure. This reduces memory fragmentation by keeping all the pieces of an NFS and RPC request together. For asynchronous reads and writes, this means the RPC request, RPC task, and nfs_read/write_data structures reside in the same CPU cache lines, and will move between caches together. RPC slot tables are eliminated entirely, preventing memory from being held even when it's not being used, and allowing the maximum number of outstanding RPC requests to be modulated dynamically. The call_reserve path has been simplified. The only jobs it has now are to initialize each rpc_rqst structure, or to queue RPC tasks on the xprt's backlog queue if the maximum number per transport has been exceeded. We no longer have a free list, and each XID can be allocated while holding the xprt_lock. Thus we can do away with the reserve_lock entirely. Test-plan: Heavy multi-threaded tests like "sio" and "iozone" on SMP systems. Watch for significant regression in CPU utilization with oprofile and other tools. Watch for multiple RPCs with the same XID. Destructive testing.
Subject: [PATCH] RPC: remove tk_rqstp field from rpc_task Category: Minor code re-organization Description: Now that the rpc_rqst structure is part of rpc_task, there is no need for a field in rpc_task that points to the associated RPC request slot. This patch removes the tk_rqstp field. Test-plan: None.
Subject: [PATCH] RPC: remove rq_task field from rpc_rqst Category: Minor code re-organization Description: Now that the rpc_rqst structure is part of rpc_task, there is no need for a field in rpc_rqst that points to its associated rpc_task. This patch removes the rq_task field. Test-plan: None.
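Once one structure is embedded in another, neither back-pointer is needed, because each address can be computed from the other. The sketch below shows the general technique in user-space C with simplified, hypothetical types (struct rqst, struct task, task_of()). It mirrors the container_of idiom used throughout the kernel, but whether the patches use that macro is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much simplified request and task structures. */
struct rqst {
    unsigned int xid;
};

struct task {
    int id;
    struct rqst rqst;   /* embedded, not pointed to */
};

/* Recover the enclosing task from a pointer to its embedded request by
 * subtracting the member's offset. Only valid for a struct rqst that
 * really lives inside a struct task. */
struct task *task_of(struct rqst *rq)
{
    assert(rq != NULL);
    return (struct task *)((char *)rq - offsetof(struct task, rqst));
}
```

Going the other way is simply &task->rqst, so the task-to-request pointer and the request-to-task pointer both reduce to address arithmetic with a compile-time constant.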
Subject: [PATCH] RPC: use rq_xprt instead of tk_xprt Description: Minor performance optimization. Now that the rpc_rqst structure is part of rpc_task, we can replace this double indirection (task->tk_client->cl_xprt) with a single indirection (task->tk_rqst.rq_xprt) to save a load instruction and eliminate one AGI in several places. Test-plan: None.
Subject: [PATCH] RPC: add rq_rtt field to rpc_rqst Description: Minor performance optimization. Currently it is cumbersome to derive the RTT data from the address of an RPC request. This patch adds an rq_rtt field which is a copy of the RPC client's cl_rtt field. That makes it easier and cleaner to find the RTT data when needed, and reduces the number of loads and AGIs in several places. Test-plan: None.