Published on Mon Dec 6 09:17:48 EST 2004
Please read the Release Notes for this patchset release.
Subject: [NFS/RPC] Trond's NFS_ALL patch for 2.6.9 Description: Final version of Trond's NFS_ALL patch for 2.6.9. This includes an extra patch that does not appear in Trond's published version of NFS_ALL, but will appear in 2.6.11.
Subject: [NFS/RPC]: CITI's NFS4_ALL for 2.6.9-rc3 Description: CITI's NFS4_ALL for 2.6.9-rc3, adapted and applied to 2.6.9 + Trond's 2.6.9 NFS_ALL.
Subject: [PATCH] NFS: readdir pre-emption Add more frequent pre-emption to the NFS readdir path. Pre-emption will occur about once per page during multi-page scans, and not at all if only a single page is involved. The NFS directory logic provides several pre-emption points to allow other work on the system to proceed during potentially lengthy searches through cached NFS directories. However, the current pre-emption logic is almost never triggered because it waits for 200 loop iterations before calling schedule(). On 4KB-per-page clients, it is nearly impossible to get 200 directory entries into a single page. Test-plan: Performance characterization of a directory scan like "find" while running a multi-threaded workload.
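The difference between the two yield policies can be sketched in user space. This is a minimal model, not the kernel code: the function names are hypothetical, a counter increment stands in for schedule(), and it assumes the old loop counter restarts on every page, which is how the description above reads.

```c
#include <assert.h>
#include <stddef.h>

/* Iteration count the old logic waited for before yielding (from the patch text). */
#define OLD_YIELD_INTERVAL 200u

/*
 * Old policy, as modelled here: the counter restarts on every page, so a
 * yield needs at least 200 entries in a single page. Returns the yield count.
 */
unsigned int yields_every_200(const unsigned int *entries_per_page, size_t npages)
{
    unsigned int yields = 0;

    assert(entries_per_page != NULL);
    for (size_t page = 0; page < npages; page++) {
        unsigned int loop_count = 0;

        for (unsigned int entry = 0; entry < entries_per_page[page]; entry++) {
            if (++loop_count >= OLD_YIELD_INTERVAL) {
                yields++;          /* stands in for schedule() */
                loop_count = 0;
            }
        }
    }
    return yields;
}

/* New policy: yield once each time the scan moves on to another page. */
unsigned int yields_per_page(const unsigned int *entries_per_page, size_t npages)
{
    unsigned int yields = 0;

    assert(entries_per_page != NULL);
    for (size_t page = 0; page + 1 < npages; page++)
        yields++;                  /* stands in for schedule() */
    return yields;
}
```

With three pages of 100 entries each, the old policy in this model never yields, while the per-page policy yields twice; a single-page scan yields zero times under either policy.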
Subject: [PATCH] NFS: directory trace messages Description: Reuse NFSDBG_DIRCACHE and NFSDBG_LOOKUPCACHE to provide additional diagnostic messages that trace the operation of the NFS client's directory cache. A few new messages are now generated when NFSDBG_VFS is active, as well, to trace normal VFS activity. This compromise provides better trace debugging for those who use pre-built kernels without adding a lot of extra noise to the standard debug settings. Test-plan: Enable NFS trace debugging with flags 1, 2, or 4. You should be able to see different types of trace messages with each flag setting.
Subject: [PATCH] NFS: support 64KB reads and writes on the wire Most NFS client implementations allow up to 64KB reads and writes on the wire. Now Linux does too. This will help reduce protocol and context switch overhead on read/write intensive NFS workloads. Some header file and macro cleanup simplifies the calculation of the number of page slots needed in the nfs_readargs and nfs_writeargs structures, and dispenses with the need for the MAX_IOVEC macro. As a bonus, the RPC client now reports the maximum payload size it can support. This is something a little less than 64KB for RPC over UDP, and about 2GB - 1 for RPC over TCP. The effective rsize and wsize values are not allowed to exceed the reported maximum RPC payload size. Test-plan: Connectathon and iozone on mount point with wsize=rsize=65536 over TCP. Tests with NFS over UDP to verify the maximum RPC payload size cap. Note: server side support for 64KB reads and writes is also required.
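The rsize/wsize cap described above amounts to a clamp against the transport's reported maximum payload. The sketch below is illustrative only: the function name is hypothetical, and the UDP figure is an assumed placeholder for "a little less than 64KB", not the value the RPC client actually reports.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical payload limits; the real values come from the RPC client. */
#define ASSUMED_UDP_MAX_PAYLOAD 64512u        /* assumption: 64KB minus 1KB of headers */
#define ASSUMED_TCP_MAX_PAYLOAD 0x7fffffffu   /* 2GB - 1, as stated in the patch text */

/* Clamp a requested rsize or wsize to the transport's maximum payload. */
uint32_t effective_io_size(uint32_t requested, uint32_t max_payload)
{
    assert(max_payload > 0);
    return requested > max_payload ? max_payload : requested;
}
```

Under these assumptions, a mount requesting rsize=65536 keeps the full 64KB over TCP but is reduced to the smaller cap over UDP.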
Subject: [PATCH] RPC: cancel outstanding keventd work before freeing rpc_xprt It is possible for a connect operation to time out before keventd can run the connect worker. In that case, xprt_destroy can race with the still pending connect worker. If the worker runs after rpc_xprt is freed, a variety of bad things can occur, depending on how the freed memory is then reused. Test-plan: Hard code an unrealistically short connect timeout, then run basic tests to make sure the failure modes are clean.
Subject: [PATCH] NFS: Detect directory restoration Description: READDIR and READDIRPLUS read not only cookies but also file handles from the server. The Linux NFS client caches these items for objects in each directory in the page cache. To validate contents of the page cache, the NFS client uses mtime and size returned by the server to determine whether a file or directory has been changed on the server. However, if a directory is restored via NDMP restore, rsync, or some other mechanism, the object names in the directory and the directory size and mtime could remain the same, while the file handles and cookies for the objects contained in the directory have changed. This patch changes the NFS client to watch ctime on directories to catch restore operations that might invalidate cached file handles. If a directory ctime change is detected, the client now invalidates any file handle information stored in the page cache for that directory. Test-plan: Combinations of rsync and "ls -l" on multiple clients. No stale file handles should be reported on the contents of changed directories. Standard performance tests; little or no loss of performance is expected.
Subject: [PATCH] RPC: extract socket logic common to both client and server Move some code that is common to both RPC client- and server-side socket transports into its own source file, net/sunrpc/socklib.c. Test-plan: Millions of fsx operations over UDP. Connectathon over UDP.
Subject: [PATCH] RPC: introduce client-side transport switch Introduce a generic RPC client-side transport API. This "RPC transport switch" divorces socket-specific implementation from the generic pieces of the RPC client. Such a switch will allow efficient support for RPC over TOE, IPsec offload, multiple sockets per mount, IPv6, and transports capable of direct data placement. Here, we move the bulk of client-side socket-specific code into a separate source file, net/sunrpc/xprtsock.c. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone". Destructive testing (unplugging the network temporarily, server reboots). Connectathon with v2, v3, and v4.
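A transport switch of this kind is commonly built as a table of function pointers that generic code calls without knowing the underlying transport. The sketch below illustrates that pattern only; every type and function name in it is hypothetical and is not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

struct xprt;

/* Operations each transport implementation supplies (hypothetical set). */
struct xprt_ops {
    int  (*connect)(struct xprt *xprt);
    long (*send)(struct xprt *xprt, const void *buf, size_t len);
    void (*close)(struct xprt *xprt);
};

/* Generic transport state; only the ops table is transport-specific. */
struct xprt {
    const struct xprt_ops *ops;
    int connected;
    size_t bytes_sent;
};

/* Generic layer: connects on demand, then hands the buffer to the transport. */
long xprt_transmit(struct xprt *xprt, const void *buf, size_t len)
{
    assert(xprt != NULL && xprt->ops != NULL);
    if (!xprt->connected && xprt->ops->connect(xprt) != 0)
        return -1;
    return xprt->ops->send(xprt, buf, len);
}

/* A toy transport that only counts bytes, standing in for a socket transport. */
static int toy_connect(struct xprt *xprt)
{
    xprt->connected = 1;
    return 0;
}

static long toy_send(struct xprt *xprt, const void *buf, size_t len)
{
    (void)buf;                 /* payload is ignored in this toy transport */
    xprt->bytes_sent += len;
    return (long)len;
}

static void toy_close(struct xprt *xprt)
{
    xprt->connected = 0;
}

const struct xprt_ops toy_ops = {
    .connect = toy_connect,
    .send    = toy_send,
    .close   = toy_close,
};
```

Adding another transport in this model means supplying a second ops table; xprt_transmit stays unchanged.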
Subject: [PATCH] RPC: move xdr_sendpages under transport switch Move socket-dependent code from net/sunrpc/xdr.c to net/sunrpc/xprtsock.c. Reduce stack utilization of the RPC send path, while we're at it. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone".
Subject: [PATCH] RPC: client-side transport switch cleanup Clean up remaining socket-specific structure naming, remove include/socket.h from most RPC client source files, and change some comments to reflect the realities of the new RPC transport switch mechanism. Remove the cong_wait field from rpc_xprt, as it is no longer used. Remove the rq_creddata field from rpc_rqst, as it is not used. Fix some dprintk nits in clnt.c, sched.c, and auth_*.c. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone".
Subject: [PATCH] RPC: separate TCP and UDP write space callbacks Split the socket write space callback function into a TCP version and UDP version, eliminating one dependence on the "xprt->stream" variable. Keep the common pieces of this path in a single function. Test-plan: Write-intensive workload on a single mount point.
Subject: [PATCH] RPC: separate TCP and UDP transport connection logic Remove the xs_create and xs_bind functions. Create separate connection worker functions for managing UDP and TCP transport sockets. This eliminates several dependencies on "xprt->stream". Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with v2, v3, and v4.
Subject: [PATCH] RPC: separate TCP and UDP socket write paths Split the RPC client's main socket write path into a TCP version and a UDP version to eliminate another dependency on the "xprt->stream" variable. Rely on compiler optimization to remove conditional branches from xs_sendpages, as this function is now called with some constant arguments. Test-plan: Millions of fsx operations. Performance characterization such as "sio" or "iozone". Examine oprofile results for any changes before and after this patch is applied.
Subject: [PATCH] RPC: skip over transport-specific heads automatically Add a generic mechanism for skipping over transport-specific headers when constructing an RPC request. This removes another "xprt->stream" dependency. Test-plan: Write-intensive workload on a single mount point.
Subject: [PATCH] RPC: get rid of xprt->stream Description: Now we can remove the last few places that use the "xprt->stream" variable, and get rid of it from the rpc_xprt structure. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP.
Subject: [PATCH] RPC: Disconnect TCP sockets after a major timeout Description: Implement a best practice: When a major RPC timeout occurs on a stream transport, disconnect then reconnect the transport before sending the next RPC request. We allow minor timeouts to retransmit the RPC request over stream transports. This is because NFSv2 and v3 servers can potentially drop a request, even when using a stream transport. Note that NFSv4 servers are not allowed to drop requests. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP.
Subject: [PATCH] RPC: Parametrize various transport connect timeouts Description: Each transport implementation can now set unique bind, connect, reestablishment, and idle timeout values. These are variables, allowing the values to be modified dynamically. This permits exponential backoff of any of these values, for instance. As an example, we implement exponential backoff for the reestablishment timeout. Also fix up xprt_connect_status: the soft timeout logic was clobbering tk_status, so TCP connect errors were not properly reported on soft mounts. Always use a printk to report errors when connecting. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP.
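Exponential backoff of the reestablishment timeout can be sketched as a doubling step with a ceiling. The function name, the units, and the initial and maximum values used in the usage note are assumptions for illustration, not values taken from the patch.

```c
#include <assert.h>

/*
 * Return the next reestablishment timeout: start at `initial`, double after
 * each failed attempt, and never exceed `maximum`. A `current` value of 0
 * means no attempt has been made yet.
 */
unsigned long next_reestablish_timeout(unsigned long current,
                                       unsigned long initial,
                                       unsigned long maximum)
{
    assert(initial > 0 && initial <= maximum);
    if (current == 0)
        return initial;
    if (current > maximum / 2)     /* doubling would pass the ceiling */
        return maximum;
    return current * 2;
}
```

Starting from 3 with a ceiling of 20, successive calls give 3, 6, 12, 20, and then stay at 20. Resetting the stored value to 0 after a successful connect restarts the sequence.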
Subject: [PATCH] RPC: kick off connect operations faster Description: Make the socket transport kick the event queue to start socket connects immediately. This should improve responsiveness of applications that are sensitive to slow mount operations (like automounters). Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP.
Subject: [PATCH] RPC: transport-specific timeouts Description: Prepare the way to remove the "xprt->nocong" variable by adding callouts to the RPC client transport switch API to handle setting RPC timeouts. Note that we move __xprt_{get,put}_cong to work around a compiler inlining bug. [ gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3) ] Test-plan: Use WAN simulation to cause sporadic bursty packet loss. Look for significant regression in performance or client stability.
Subject: [PATCH] RPC: remove xprt->nocong Description: Get rid of the "xprt->nocong" variable. Test-plan: Use WAN simulation to cause sporadic bursty packet loss. Look for significant regression in performance or client stability.
Subject: [PATCH] RPC: pluggable portmapping Description: Introduce new RPC client transport switch methods to handle RPC portmapping in a transport independent way. Some added complexity is due to the desire to prevent applications from diddling directly with the port value in the rpc_xprt structure. Also, remove pmap_lock, replacing it with test_and_set style synchronization. Move the pmap arguments into rpc_xprt. Handle rpc_portmap fields only by a single rpc_task at a time. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked. Probably need to rig a server where certain services aren't running, or that returns an error for some typical operation.
Subject: [PATCH] RPC: Create API for getting remote peer address Description: Provide an API for retrieving the remote peer address without allowing direct access to the rpc_xprt struct. Again, the desire is to provide a cleaner mechanism for callers to access the remote peer address and port number without diddling directly with the contents of the rpc_xprt struct. At the same time, increase the storage capacity of the rpc_xprt struct to allow for future transports (e.g., IPv6) that may have an address that is larger than sockaddr_in. Test-plan: Destructive testing (unplugging the network temporarily). Connectathon with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked. Probably need to rig a server where certain services aren't running, or that returns an error for some typical operation.
Subject: [PATCH] RPC: Use sockaddr + size for remote transport endpoints Description: Prepare for more generic transport endpoint handling needed by transports such as IPv6. Replace the two-call xprt_create_proto/rpc_create_client API with a single rpc_create call. Define a new rpc_create_args structure that allows callers to pass in remote endpoint addresses of varying length. Finally, eliminate no-longer-needed external xprt_destroy and xprt_shutdown interfaces. Test-plan: Repeated mount and unmount. TCP connects and reconnects. Idle timeouts.
Subject: [PATCH] RPC: switchable buffer allocation Description: Add RPC client transport switch support for replacing buffer management on a per-transport basis. In the IPv4 socket transport implementation, RPC buffers are allocated as needed for each RPC message that is sent. Some transport implementations may choose to use pre-allocated buffers for encoding, sending, receiving, and unmarshalling RPC messages, however. For transports capable of direct data placement, the buffers are carved out of a pre-registered area of memory rather than from a slab cache. Test-plan: Millions of fsx operations. Performance characterization with "sio" and "iozone". Use oprofile and other tools to look for significant regression in CPU utilization.
Subject: [PATCH] RPC: private transport-specific fields Description: Add a facility for transport implementations to store their own data in the rpc_xprt. Move socket-specific fields into a private struct defined in net/sunrpc/xprtsock.c. Still need to do something with rpc_portmap. Test-plan: Check socket buffer size on UDP sockets over time. Millions of fsx operations on TCP.
Subject: [PATCH] RPC: load RPC transport implementations dynamically Description: Now that we have an RPC transport switch, we have introduced the potential to add new transport capabilities for use by the RPC client at run-time. This patch allows RPC client transport implementations to be loaded as needed, or as they become available from distributors or third-party vendors. Test-plan: Build kernel with NFS and SUNRPC in a module. Try loading and unloading the IPv4 transport module. Try mounting without the IPv4 module loaded. Destructive testing. Try unloading SUNRPC or the IPv4 transport module while there are active NFS mounts. Reboot the client after mounting NFS shares.
Subject: [PATCH] RPC: RPC transport switch documentation Description: Add a file to the Documentation directory describing the new RPC transport switch. The file is based on the original RPC transport switch design document. Test-plan: None.
Subject: [PATCH] RPC: make rpc_rqst a part of rpc_task Description: The RPC client allocates rpc_rqst structures from slots in a static table. The size of this slot table determines how many RPC requests can be in flight concurrently for that rpc_xprt (NFS mount point). Previously, this static slot table was fixed in size, and contained just 16 slots for every rpc_xprt. In recent kernels, support was added to allow this static slot table to be sized and allocated dynamically at the time an rpc_xprt is created. The slot table can now be made as large as 128 slots to permit many more concurrent RPC requests, but the memory remains allocated until each rpc_xprt is destroyed, even when there are no outstanding RPC requests. NFSv4.1 sessions provide the ability to negotiate the maximum number of concurrent requests that the transport and the server can handle. The maximum can be increased or decreased during a transport session. This patch makes RPC requests a part of the RPC task structure. This reduces memory fragmentation by keeping all the pieces of an NFS and RPC request together. For asynchronous reads and writes, this means the RPC request, RPC task, and nfs_read/write_data structures reside in the same CPU cache lines, and will move between caches together. RPC slot tables are eliminated entirely, preventing memory from being held even when it's not being used, and allowing the maximum number of outstanding RPC requests to be modulated dynamically. The call_reserve path has been simplified. Its only job now is to initialize each rpc_rqst structure, or to queue RPC tasks on the xprt's backlog queue if the maximum number per transport has been exceeded. We no longer have a free list, and each XID can be allocated while holding the xprt_lock. Thus we can do away with the reserve_lock entirely. Test-plan: Heavy multi-threaded tests like "sio" and "iozone" on SMP systems. Watch for significant regression in CPU utilization with oprofile and other tools. Watch for multiple RPCs with the same XID. Destructive testing.
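Embedding the request in the task means one allocation holds both, and either structure can be reached from the other by address arithmetic rather than through a stored pointer. The sketch below uses simplified, hypothetical stand-ins for rpc_task and rpc_rqst; the field names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for rpc_rqst. */
struct rqst {
    uint32_t xid;
};

/* Simplified stand-in for rpc_task, with the request embedded by value. */
struct task {
    int pid;
    struct rqst rqst;          /* no separate slot table entry, no pointer */
};

/* Task to request: just the address of the embedded member. */
struct rqst *rqst_of_task(struct task *t)
{
    assert(t != NULL);
    return &t->rqst;
}

/* Request to task: subtract the member offset (the container_of idiom). */
struct task *task_of_rqst(struct rqst *rq)
{
    assert(rq != NULL);
    return (struct task *)((char *)rq - offsetof(struct task, rqst));
}
```

Because both directions are computed from the layout, back-pointer fields between the two structures carry no extra information in this model.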
Subject: [PATCH] RPC: remove tk_rqstp field from rpc_task Description: Remove the tk_rqstp field. Now that the rpc_rqst structure is part of rpc_task, there is no need for a field in rpc_task that points to the associated RPC request slot. Also fix some printk alignment issues in rpc_show_tasks while we're at it. Test-plan: None.
Subject: [PATCH] RPC: remove rq_task field from rpc_rqst Description: Remove the rq_task field. Now that the rpc_rqst structure is part of rpc_task, there is no need for a field in rpc_rqst that points to its associated rpc_task. Test-plan: None.
Subject: [PATCH] RPC: use rq_xprt instead of tk_xprt Description: Now that the rpc_rqst structure is part of rpc_task, we can replace this double indirection (task->tk_client->cl_xprt) with a single indirection (task->tk_rqst.rq_xprt) to save a load instruction and eliminate one AGI in several places. Test-plan: None.
Subject: [PATCH] RPC: add rq_rtt field to rpc_rqst Description: Currently it is cumbersome to derive the RTT data from the address of an RPC request. This patch adds an rq_rtt field which is a copy of the RPC client's cl_rtt field. That makes it easier and cleaner to find the RTT data when needed, and reduces the number of loads and AGIs in several places. Test-plan: None.
Subject: [PATCH] RPC: fix print format for tk_pid Description: The tk_pid field is an unsigned short. The proper print format specifier for that type is %5u, not %4d. So there. Test-plan: None.
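The effect of the two format specifiers on column width can be checked with snprintf. This is a user-space sketch with hypothetical function names; it measures only the printed width and assumes a 16-bit unsigned short, whose largest value needs five digits.

```c
#include <assert.h>
#include <stdio.h>

/* Width of the field printed by the old "%4d" specifier. */
int width_with_4d(unsigned short pid)
{
    int n = snprintf(NULL, 0, "%4d", (int)pid);

    assert(n >= 0);            /* snprintf returns a negative value on error */
    return n;
}

/* Width of the field printed by the "%5u" specifier. */
int width_with_5u(unsigned short pid)
{
    int n = snprintf(NULL, 0, "%5u", (unsigned int)pid);

    assert(n >= 0);
    return n;
}
```

With "%4d" the field is four columns wide for small values and five for large ones, so columns drift; "%5u" gives five columns for every value in that range.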
Subject: [PATCH] RPC: BKL no longer required for ->tk_callback callback Description: The tk_callback callback function is used to invoke only two remaining functions: pmap_getport_done, and xprt_connect_status. Neither of these requires the BKL, so now the RPC client no longer holds the BKL while invoking the tk_callback function. Test-plan: OraSim, fsx, and iozone on SMP clients. Run multiple parallel "rm -rf" jobs on the same directory tree on SMP clients. Generate a mount flood and look for TCP connection and portmapper races.
Subject: [PATCH] RPC: BKL no longer required for ->tk_action callback Description: The RPC client finite state machine does not require that the global kernel lock be held for any of the typical RPC client states. Only the NFS client's asynchronous unlink logic needs the lock. Remove the BKL around the tk_action invocations in the RPC scheduler. The tk_action callback is invoked on average ten times per RPC request, so no longer taking the BKL in these hot NFS paths should improve SMP scalability. Test-plan: OraSim, fsx, and iozone on SMP clients. Run multiple parallel "rm -rf" jobs on the same directory tree on SMP clients. Generate a mount flood and look for TCP connection and portmapper races.
Subject: [PATCH] RPC: BKL no longer required for ->tk_exit callback Description: Move BKL acquisition into the callback functions that need it (NFS async read, write, commit, and unlink completion, and NLM callbacks). This completes the removal of the BKL from the RPC client, and makes clear where the NFS and NLM clients still have strong dependencies on global kernel locking. Test-plan: OraSim, fsx, and iozone on SMP clients. Run multiple parallel "rm -rf" jobs on the same directory tree on SMP clients.
Subject: [PATCH] NFS: Direct I/O no longer acquires BKL Description: Now that the RPC client no longer acquires the BKL, we can begin removing it from the NFS client. A logical first step is to remove it completely from the direct I/O path. Test-plan: OraSim, fsx, and iozone in direct I/O mode on SMP clients.
Subject: [PATCH] NFS: add I/O performance counters Description: Add an extensible per-superblock performance counter facility to the NFS client. This facility mimics the counters available for block devices and for networking. Currently there is no way to view the counter data from user-land. Eventually we plan to use named attributes to export this data from the kernel securely. However, at this time named attribute support is not available in the Linux NFS client. Until it is, distributors may use this patch plus an export mechanism of their choice to make NFS I/O performance counters available to user-level applications. Test-plan: fsx and iozone on SMP systems. Watch for memory overwrite bugs, and performance loss (significantly more CPU required per op).
Subject: [PATCH] NFS/RPC: roll-up all of 2.6.9-a patchset Description: Roll up all of cel's 2.6.9 patches. For a complete description, see the individual patches.