Published on Thu Mar 10 20:03:58 EST 2005
Please read the Release Notes for this patchset release.
[PATCH] linux-2.6.11-CITI_NFS4_ALL-1 CITI_NFS4_ALL includes all 2.6.11 patches available from the CITI web site as of March 7, 2005. This has Trond's 2.6.11 NFS_ALL patch, as well. Test-plan: None.
[PATCH] RPC: extract socket logic common to both client and server Cleanup: Move some code that is common to both RPC client- and server-side socket transports into its own source file, net/sunrpc/socklib.c. Test-plan: Compile kernel with CONFIG_NFS enabled. Millions of fsx operations over UDP, client and server. Connectathon over UDP.
[PATCH] RPC: introduce client-side transport switch Introduce a generic RPC client-side transport API. This "RPC transport switch" divorces socket-specific implementation from the generic pieces of the RPC client. Such a switch will allow efficient support for RPC over TOE, IPsec offload, multiple sockets per mount, IPv6, and transports capable of direct data placement. Here, we move the bulk of client-side socket-specific code into a separate source file, net/sunrpc/xprtsock.c. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone". Destructive testing (unplugging the network temporarily, server reboots). Connectathon with v2, v3, and v4.
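As a rough illustration of the transport switch idea in the entry above, the sketch below routes generic send logic through a table of function pointers. Every name in it (xprt_ops_sketch, demo_xprt, loopback_*) is hypothetical and is not taken from the patch; it only shows the general shape such a switch could have.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_xprt;

/* Hypothetical per-transport operations table (names are assumptions). */
struct xprt_ops_sketch {
    int  (*connect)(struct demo_xprt *xprt);
    long (*send)(struct demo_xprt *xprt, const void *buf, size_t len);
    void (*close)(struct demo_xprt *xprt);
};

/* Generic transport state; knows nothing about sockets. */
struct demo_xprt {
    const struct xprt_ops_sketch *ops;
    int connected;
    char last[64];      /* loopback transport keeps the last payload here */
    size_t last_len;
};

/* A toy "loopback" transport standing in for a socket implementation. */
static int loopback_connect(struct demo_xprt *x)
{
    x->connected = 1;
    return 0;
}

static long loopback_send(struct demo_xprt *x, const void *buf, size_t len)
{
    if (!x->connected)
        return -1;
    if (len > sizeof(x->last))
        len = sizeof(x->last);
    memcpy(x->last, buf, len);
    x->last_len = len;
    return (long)len;
}

static void loopback_close(struct demo_xprt *x)
{
    x->connected = 0;
}

static const struct xprt_ops_sketch loopback_ops = {
    .connect = loopback_connect,
    .send    = loopback_send,
    .close   = loopback_close,
};

/* Generic code path: connects on demand, then sends via the ops table. */
long demo_xprt_transmit(struct demo_xprt *x, const void *buf, size_t len)
{
    assert(x != NULL && x->ops != NULL);
    if (!x->connected && x->ops->connect(x) != 0)
        return -1;
    return x->ops->send(x, buf, len);
}
```

A caller would initialize a demo_xprt with .ops = &loopback_ops and call demo_xprt_transmit(); swapping in another ops table changes the transport without touching the generic path.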
[PATCH] RPC: transport switch function naming Introduce block header comments and a function naming convention to the socket transport implementation. Provide a debug setting for transports that is separate from RPCDBG_XPRT. Eliminate xprt_default_timeout(). Test-plan: Compile kernel with CONFIG_NFS enabled.
[PATCH] RPC: Reduce stack utilization in xs_sendpages Reduce stack utilization of the RPC socket transport's send path. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone".
[PATCH] RPC: clean up xprt_transmit Cleanup: minor code optimization in xprt_transmit, eliminating a now unnecessary goto. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone".
[PATCH] RPC: Rename sock_lock Cleanup: replace a name reference to sockets in the generic parts of the RPC client by renaming sock_lock in the rpc_xprt structure. Test-plan: Compile kernel with CONFIG_NFS enabled.
[PATCH] RPC: Rename xprt_lock Cleanup: Replace the xprt_lock with something more aptly named. This lock single-threads the XID and request slot reservation process. Test-plan: Compile kernel with CONFIG_NFS enabled.
[PATCH] RPC: rename the sockstate field Cleanup: get rid of a name reference to sockets in the generic parts of the RPC client by renaming the sockstate field in the rpc_xprt structure. Test-plan: Compile kernel with CONFIG_NFS enabled.
[PATCH] RPC: Eliminate socket.h includes in RPC client Cleanup: get rid of unnecessary socket.h and in.h includes in the generic parts of the RPC client. Test-plan: Compile kernel with CONFIG_NFS enabled.
[PATCH] RPC: Eliminate unused wait field Cleanup: get rid of the cong_wait field in rpc_xprt, which is no longer used. Test-plan: Compile kernel with CONFIG_NFS enabled.
[PATCH] RPC: client-side transport switch cleanup Cleanup: change some comments to reflect the realities of the new RPC transport switch mechanism. Test-plan: Compile kernel with CONFIG_NFS enabled.
[PATCH] RPC: separate TCP and UDP write space callbacks Split the socket write space callback function into a TCP version and UDP version, eliminating one dependence on the "xprt->stream" variable. Keep the common pieces of this path in a single function. Test-plan: Write-intensive workload on a single mount point.
[PATCH] RPC: separate TCP and UDP transport connection logic Create separate connection worker functions for managing UDP and TCP transport sockets. This eliminates several dependencies on "xprt->stream". Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with v2, v3, and v4.
[PATCH] RPC: separate TCP and UDP socket write paths Split the RPC client's main socket write path into a TCP version and a UDP version to eliminate another dependency on the "xprt->stream" variable. Compiler optimization removes unneeded code from xs_sendpages, as this function is now called with some constant arguments. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone". Examine oprofile results for any changes before and after this patch is applied.
[PATCH] RPC: skip over transport-specific heads automatically Add a generic mechanism for skipping over transport-specific headers when constructing an RPC request. This removes another "xprt->stream" dependency. Test-plan: Write-intensive workload on a single mount point.
[PATCH] RPC: get rid of xprt->stream Now we can fix up the last few places that use the "xprt->stream" variable, and get rid of it from the rpc_xprt structure. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP.
[PATCH] RPC: Disconnect TCP sockets after a major timeout Implement a best practice: When a major RPC timeout occurs on a stream transport, disconnect, then reconnect the transport before retransmitting the timed-out RPC request. We allow minor timeouts to retransmit the RPC request over stream transports. This is because NFSv2 and v3 servers can potentially drop a request, even when using a stream transport. Note that NFSv4 servers are not allowed to drop requests. Some servers already terminate a TCP connection if any retransmit occurs. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP.
[PATCH] RPC: Parametrize various transport connect timeouts Each transport implementation can now set unique bind, connect, reestablishment, and idle timeout values. These are variables, allowing the values to be modified dynamically. This permits exponential backoff of any of these values, for instance. As an example, we implement exponential backoff for the reestablishment timeout. Also fix up xprt_connect_status: the soft timeout logic was clobbering tk_status, so TCP connect errors were not properly reported on soft mounts. Always use a printk to report errors when connecting. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP.
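The exponential backoff mentioned in the entry above could look roughly like the following. This is a minimal sketch only: the function name and both bounds (3 and 300 seconds) are assumptions chosen for illustration, not values taken from the patch.

```c
#include <assert.h>

/* Hypothetical bounds, in seconds; the patch's real values may differ. */
#define REEST_TO_INIT 3UL
#define REEST_TO_MAX  300UL

/*
 * Given the current reestablishment timeout, return the next one:
 * start at REEST_TO_INIT, double on each failure, cap at REEST_TO_MAX.
 */
unsigned long reestablish_backoff(unsigned long cur)
{
    assert(cur <= REEST_TO_MAX);
    if (cur == 0)
        return REEST_TO_INIT;
    if (cur > REEST_TO_MAX / 2)
        return REEST_TO_MAX;    /* doubling would overshoot the cap */
    return cur * 2;
}
```

After a successful connect, a transport would presumably reset its stored timeout to zero so the next failure starts again from the initial value.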
[PATCH] RPC: kick off socket connect operations faster Make the socket transport kick the event queue to start socket connects immediately. This should improve responsiveness of applications that are sensitive to slow mount operations (like automounters). We are also now careful to cancel the connect worker before destroying the xprt. This eliminates a race where xprt_destroy can finish before the connect worker is even allowed to run. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP. Hard-code impossibly small connect timeout.
[PATCH] RPC: transport-specific timeouts Prepare the way to remove the "xprt->nocong" variable by adding callouts to the RPC client transport switch API to handle setting RPC retransmit timeouts. Note that we move __xprt_{get,put}_cong to work around a compiler inlining bug. [ gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3) ] Test-plan: Use WAN simulation to cause sporadic bursty packet loss. Look for significant regression in performance or client stability.
[PATCH] RPC: remove xprt->nocong Get rid of the "xprt->nocong" variable. Test-plan: Use WAN simulation to cause sporadic bursty packet loss. Look for significant regression in performance or client stability.
[PATCH] RPC: simplify calling xprt_complete_rqst Cleanup: remove an "unused" argument from xprt_complete_rqst. Test-plan: Compile kernel with CONFIG_NFS enabled.
[PATCH] RPC: new interface to force an RPC rebind We'd like to hide fields in rpc_xprt and rpc_clnt from upper layer protocols. Start by creating an API to force RPC rebind, replacing logic that simply sets cl_port to zero. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked. Probably need to rig a server where certain services aren't running, or that returns an error for some typical operation.
[PATCH] RPC: transport switch API for setting port number At some point, transport endpoint addresses will no longer be IPv4. To hide the structure of the rpc_xprt's address field from ULPs and port mappers, add an API for setting the port number during an RPC bind operation. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked. Probably need to rig a server where certain services aren't running, or that returns an error for some typical operation.
[PATCH] RPC: pluggable rpcbind Make the RPC portmapper completely self-contained. Identify and remove all connection and bind state that was maintained in the rpc_clnt structure. This allows us to create a clean interface for plugging in different types of bind mechanisms. For instance, rpcbind can cleanly replace the existing portmapper client, or a transport can choose to implement RPC binding any way it likes. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked. Probably need to rig a server where certain services aren't running, or that returns an error for some typical operation.
[PATCH] RPC: Create API for getting remote peer address Increase the storage capacity of the rpc_xprt struct to allow for future transports (e.g., IPv6) that may have an address that is larger than sockaddr_in. Provide an API for retrieving the remote peer address without allowing direct access to the rpc_xprt struct. Also provide an API for formatting this address for printing without knowing its internal structure. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked. Probably need to rig a server where certain services aren't running, or that returns an error for some typical operation.
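To illustrate formatting a peer address without the caller knowing its internal structure, here is a user-space sketch built on the standard sockaddr_storage type and inet_ntop(). The helper name peer_addr_to_string is hypothetical; the kernel-side API added by the patch is not shown here and may look quite different.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Format the address held in a sockaddr_storage, whatever its family.
 * Returns buf on success, or NULL for an unsupported family or a
 * buffer that is too small. (Hypothetical helper, for illustration.)
 */
const char *peer_addr_to_string(const struct sockaddr_storage *ss,
                                char *buf, size_t len)
{
    assert(ss != NULL && buf != NULL);
    switch (ss->ss_family) {
    case AF_INET:
        return inet_ntop(AF_INET,
                         &((const struct sockaddr_in *)ss)->sin_addr,
                         buf, (socklen_t)len);
    case AF_INET6:
        return inet_ntop(AF_INET6,
                         &((const struct sockaddr_in6 *)ss)->sin6_addr,
                         buf, (socklen_t)len);
    default:
        return NULL;
    }
}
```

Because sockaddr_storage is large enough for both IPv4 and IPv6 addresses, callers can pass a buffer of INET6_ADDRSTRLEN bytes and never inspect the address family themselves.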
[PATCH] RPC: Use sockaddr + size when creating remote transport endpoints Prepare for more generic transport endpoint handling needed by transports such as IPv6. Replace the two-call xprt_create_proto/rpc_create_client API with a single rpc_create call. Define a new rpc_create_args structure that allows callers to pass in remote endpoint addresses of varying length. Finally, eliminate now obsolete external xprt_destroy and xprt_shutdown interfaces. Test-plan: Repeated mount and unmount. TCP connects and reconnects. Idle timeouts.
[PATCH] RPC: Hide xprt->prot field Since we need to increment the xprt reference count before referring to xprt->prot, put it in a helper function and hide the field from users outside the RPC client. Test-plan: Repeated mount and unmount. TCP connects and reconnects. Idle timeouts.
[PATCH] RPC: Allow RPC client's port range to be adjustable Select an RPC client source port between 650 and 1023 instead of between 1 and 800. The old range conflicts with a number of network services. Provide sysctls to allow admins to select a different port range. Based on a suggestion by Olaf Kirch. Test-plan: Repeated mount and unmount. TCP connects and reconnects. Idle timeouts.
[PATCH] RPC: Make sure we get the same local port number when reconnecting If the remote end drops our connection, try to reconnect using the same port number. This is important because the RPC server's duplicate reply cache often keys on the source port number. If the client reuses the port number when it reconnects, the server's DRC will be more effective. Based on suggestions by Mike Eisler and Olaf Kirch. Test-plan: Repeated mount and unmount. TCP connects and reconnects. Idle timeouts.
[PATCH] RPC: proper soft timeout behavior for rpcbind and connect Previously, the RPC client's TCP connection logic would continue to retry connecting to a server, even if the mount was soft. Change it to time out a connection attempt properly on soft mounts. Likewise, provide proper behavior for rpcbind under the same circumstances. And, if an rpcbind request fails on a hard mount, retry indefinitely. This also provides an FSM hook for retrying a bind with a different rpcbind protocol version. We'll use this later to try all three rpcbind protocol versions when binding. Test-plan: Hundreds of passes with connectathon NFSv3 locking suite.
[PATCH] RPC: switchable buffer allocation Add RPC client transport switch support for replacing buffer management on a per-transport basis. In the IPv4 socket transport implementation, RPC buffers are allocated as needed for each RPC message that is sent. Some transport implementations may choose to use pre-allocated buffers for encoding, sending, receiving, and unmarshalling RPC messages, however. For transports capable of direct data placement, the buffers can be carved out of a pre-registered area of memory rather than from a slab cache. Test-plan: Millions of fsx operations. Performance characterization with "sio" and "iozone". Use oprofile and other tools to look for significant regression in CPU utilization.
[PATCH] RPC: use size_t where appropriate Cleanup: Let's be consistent about what type is used for variables that store byte counts. Also, use the term "fragment header" instead of "record marker" to be more consistent with terminology found in IETF standards documents. Test-plan: Check socket buffer size on UDP sockets over time. Millions of fsx operations on TCP.
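For readers unfamiliar with the "fragment header" named in the entry above: in the ONC RPC record marking standard used over TCP, each fragment is preceded by a 4-byte big-endian word whose top bit flags the last fragment and whose low 31 bits carry the fragment length. The sketch below encodes and decodes that word using size_t for the byte count; the function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAG_LAST_BIT  0x80000000u  /* set on the final fragment */
#define FRAG_SIZE_MASK 0x7fffffffu  /* low 31 bits: fragment length */

/* Write a 4-byte big-endian fragment header into out[]. */
void encode_fragment_header(uint8_t out[4], size_t len, int last)
{
    uint32_t v;

    assert(len <= FRAG_SIZE_MASK);
    v = (uint32_t)len | (last ? FRAG_LAST_BIT : 0u);
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Read a fragment header; returns the length and sets *last to 0 or 1. */
size_t decode_fragment_header(const uint8_t in[4], int *last)
{
    uint32_t v = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                 ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];

    assert(last != NULL);
    *last = (v & FRAG_LAST_BIT) != 0;
    return (size_t)(v & FRAG_SIZE_MASK);
}
```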
[PATCH] RPC: private transport-specific fields Add a facility for transport implementations to store their own data in the rpc_xprt. Move socket-specific fields into a private struct defined in net/sunrpc/xprtsock.c. Test-plan: Check socket buffer size on UDP sockets over time. Millions of fsx operations on TCP.
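One common C pattern for keeping transport-private data out of a generic structure is to embed the generic struct inside a private one and recover the outer struct by pointer arithmetic. The sketch below shows that pattern under assumed names (generic_xprt, demo_sock_xprt); the patch may use a different mechanism, such as a private pointer field.

```c
#include <assert.h>
#include <stddef.h>

/* Generic part, visible to all RPC client code (hypothetical layout). */
struct generic_xprt {
    unsigned long timeout;
};

/* Socket-specific part, private to one transport implementation. */
struct demo_sock_xprt {
    struct generic_xprt xprt;   /* embedded generic transport */
    int sock_fd;
    size_t sndbuf_size;
};

/* Recover the enclosing struct from a pointer to one of its members. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Only the socket transport calls this; generic code never needs it. */
struct demo_sock_xprt *to_sock_xprt(struct generic_xprt *x)
{
    assert(x != NULL);
    return demo_container_of(x, struct demo_sock_xprt, xprt);
}
```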
[PATCH] RPC: load RPC transport implementations dynamically Now that we have an RPC transport switch, we have introduced the potential to add new transport capabilities for use by the RPC client at run-time. Allow RPC client transport implementations to be loaded as needed, or as they become available from distributors or third-party vendors. Test-plan: Build kernel with NFS and SUNRPC in a module. Try loading and unloading the IPv4 transport module. Try mounting without the IPv4 module loaded. Destructive testing. Try unloading SUNRPC or the IPv4 transport module while there are active NFS mounts. Reboot the client after mounting NFS shares.
[PATCH] RPC: RPC transport switch documentation Add a file to the Documentation directory describing the new client-side RPC transport switch. The new file is based on the original RPC transport switch design document. Test-plan: None.
[PATCH] RPC: add IPv6 address support to RPC portmapper Add support in RPC portmapper for IPv6 socket addresses. Later we will implement v3 and v4 rpcbind operations as well. Test-plan: Destructive testing to ensure IPv4 functionality is undisturbed.
[PATCH] RPC: support IPv6 in the socket transport implementation Use the new RPC transport switch API to add basic support for RPC over IPv6 socket addresses. Based on work done by Gilles Quillard at Bull Open Source. Test-plan: Destructive testing to ensure IPv4 functionality is undisturbed. Standard tests (mount, connectathon) using IPv6 addressing.
[PATCH] RPC: make rpc_rqst a part of rpc_task The RPC client allocates rpc_rqst structures from slots in a static table. The size of this slot table determines how many RPC requests can be in flight concurrently for that rpc_xprt (NFS mount point). Previously, this static slot table was fixed in size, and contained just 16 slots for every rpc_xprt. In recent kernels, support was added to allow this static slot table to be sized and allocated dynamically at the time an rpc_xprt is created. The slot table can now be made as large as 128 slots to permit many more concurrent RPC requests, but the memory remains allocated until each rpc_xprt is destroyed, even when there are no outstanding RPC requests. NFSv4.1 sessions provide the ability to negotiate the maximum number of concurrent requests that the transport and the server can handle. The maximum can be increased or decreased during a transport session. This patch makes RPC requests a part of the RPC task structure. This reduces memory fragmentation by keeping all the pieces of an NFS and RPC request together. For asynchronous reads and writes, this means the RPC request, RPC task, and nfs_read/write_data structures reside in the same CPU cache lines, and will move between caches together. RPC slot tables are eliminated entirely, preventing memory from being held even when it's not being used, and allowing the maximum number of outstanding RPC requests to be modulated dynamically. Test-plan: Heavy multi-threaded tests like "sio" and "iozone" on SMP systems. Watch for significant regression in CPU utilization with oprofile and other tools. Watch for multiple RPCs with the same XID and NFS workload hangs. Destructive testing.
[PATCH] RPC: remove tk_rqstp field from rpc_task Now that the rpc_rqst structure is part of rpc_task, there is no need for a field in rpc_task that points to the associated RPC request slot. Test-plan: None.
[PATCH] RPC: remove rq_task field from rpc_rqst Now that the rpc_rqst structure is part of rpc_task, there is no need for a field in rpc_rqst that points to its associated rpc_task. Test-plan: None.
[PATCH] RPC: eliminate rq_xprt field rq_xprt is just a copy of tk_xprt, and both are now in the same structure. We can get rid of rq_xprt. Test-plan: Compile kernel with CONFIG_NFS enabled.
[PATCH] RPC: fix print format for tk_pid The tk_pid field is an unsigned short. The proper print format specifier for that type is %5u, not %4d. Also clean up some miscellaneous print formatting nits. Test-plan: Compile kernel with CONFIG_NFS enabled on 32-bit and 64-bit architectures.
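A small user-space sketch of the format choice in the entry above: an unsigned short can reach 65535, so a field width of 5 with %u keeps columns aligned for every possible value. The helper name format_task_id is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a 16-bit task id right-aligned in a 5-character field.
 * Returns the number of characters snprintf would have written.
 */
int format_task_id(char *buf, size_t len, unsigned short pid)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%5u", (unsigned int)pid);
}
```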
[PATCH] NFS: support large reads and writes on the wire Most NFS server implementations allow up to 64KB reads and writes on the wire. The Solaris NFS server allows up to a megabyte. Now the Linux NFS client supports transfer sizes up to 1MB, too. This will help reduce protocol and context switch overhead on read/write intensive NFS workloads. Test-plan: Connectathon and iozone on mount point with wsize=rsize>32768 over TCP. Tests with NFS over UDP to verify the maximum RPC payload size cap.
[PATCH] NFS: readdir pre-emption The NFS directory logic provides a pre-emption point to allow other work on the system to proceed during potentially lengthy searches through cached NFS directories. However, the current pre-emption logic is almost never triggered because it waits for 200 loop iterations before calling schedule(). On 4KB-per-page clients, it is nearly impossible to get 200 directory entries into a single page. Add more frequent pre-emption to the NFS readdir path. Pre-emption will occur about once per page during multi-page scans, and not at all if only a single page is involved. Test-plan: Performance characterization of a directory scan like "find" while running a multi-threaded workload.
[PATCH] NFS: directory trace messages Reuse NFSDBG_DIRCACHE and NFSDBG_LOOKUPCACHE to provide additional diagnostic messages that trace the operation of the NFS client's directory cache. A few new messages are now generated when NFSDBG_VFS is active, as well, to trace normal VFS activity. This compromise provides better trace debugging for those who use pre-built kernels, without adding a lot of extra noise to the standard debug settings. Test-plan: Enable NFS trace debugging with flags 1, 2, or 4. You should be able to see different types of trace messages with each flag setting.
[PATCH] NFS: Detect directory restoration READDIR and READDIRPLUS read not only cookies but also file handles from the server. The Linux NFS client caches these items in the system's page cache. If a directory is restored via NDMP restore, rsync, or some other mechanism, the object names in the directory and the directory size and mtime could remain the same, while the file handles and cookies for the objects contained in the directory have changed because they were recreated by the underlying local file system. To validate contents of the page cache, the NFS client uses mtime and size returned by the server to determine whether a file or directory has been changed on the server. The NFS client now watches ctime on directories to catch restore operations that might invalidate cached file handles. Test-plan: Combinations of rsync and "ls -l" on multiple clients. No stale file handles should be reported on the contents of changed directories. Standard performance tests; little or no loss of performance is expected.
[PATCH] RPC: BKL no longer required for ->tk_callback callback The tk_callback callback function is used to invoke only two remaining functions: pmap_getport_done and xprt_connect_status. Neither of these requires the BKL, so now the RPC client no longer holds the BKL while invoking the tk_callback function. Test-plan: OraSim, fsx, and iozone on SMP clients. Run multiple parallel "rm -rf" jobs on the same directory tree on SMP clients. Generate a mount flood and look for TCP connection and portmapper races.
[PATCH] RPC: BKL no longer required for ->tk_action callback The RPC client finite state machine does not require that the global kernel lock is held for any of the typical RPC client states. Only the NFS client's asynchronous unlink logic needs the lock. Remove the BKL around the tk_action invocations in the RPC scheduler. The tk_action callback is invoked on average ten times per RPC request, so removing the BKL will have obvious SMP performance scalability implications when the BKL is no longer taken in hot NFS paths. Test-plan: OraSim, fsx, and iozone on SMP clients. Run multiple parallel "rm -rf" jobs on the same directory tree on SMP clients. Generate a mount flood and look for TCP connection and portmapper races.
[PATCH] RPC: BKL no longer required for ->tk_exit callback Move BKL acquisition into the callback functions that need it (NFS async read, write, commit, and unlink completion, and NLM callbacks). This completes the removal of the BKL from the RPC client, and makes clear where the NFS and NLM clients still have strong dependencies on global kernel locking. Test-plan: OraSim, fsx, and iozone on SMP clients. Run multiple parallel "rm -rf" jobs on the same directory tree on SMP clients.
[PATCH] NFS: Direct I/O no longer acquires BKL Now that the RPC client no longer acquires the BKL, we can begin removing it from the NFS client. A logical first step is to remove it completely from the direct I/O path. Test-plan: OraSim, fsx, and iozone in direct I/O mode on SMP clients.
[PATCH] VFS: New /proc file /proc/self/mountstats Create a new file under /proc/self, called mountstats, where mounted file systems can export information (configuration options, performance counters, and so on). Use a mechanism similar to /proc/mounts and s_ops->show_options. This mechanism does not violate namespace security, and is safe to use while other processes are unmounting file systems. Test-plan: Test concurrent mount/unmount operations while cat'ing /proc/self/mountstats.
[PATCH] NFS: add I/O performance counters Add an extensible per-superblock performance counter facility to the NFS client. This facility mimics the counters available for block devices and for networking. Expose these new counters via /proc/self/mountstats. Test-plan: fsx and iozone on SMP systems. Watch for memory overwrite bugs, and performance loss (significantly more CPU required per op).
[PATCH] NFS/RPC: roll-up all of 2.6.11 patchset Roll up all of cel's 2.6.11 patches. For a complete description, see the individual patches.