Patchset for Linux NFS client

Published on Mon Dec 6 09:17:48 EST 2004

Please read the Release Notes for this patchset release.


File 01-trond-NFS_ALL.patch

 Subject: [NFS/RPC] Trond's NFS_ALL patch for 2.6.9

 Description:
 Final version of Trond's NFS_ALL patch for 2.6.9.  This includes an extra
 patch that does not appear in Trond's published version of NFS_ALL, but will
 appear in 2.6.11.


File 02-CITI-NFS4_ALL.patch

 Subject: [NFS/RPC]: CITI's NFS4_ALL for 2.6.9-rc3

 Description:
 CITI's NFS4_ALL for 2.6.9-rc3, adapted and applied to 2.6.9 + Trond's 2.6.9
 NFS_ALL.


File 10-nfs_readdir-sched.patch

 Subject: [PATCH] NFS: readdir pre-emption

 Add more frequent pre-emption to the NFS readdir path.  Pre-emption will
 occur about once per page during multi-page scans, and not at all if only
 a single page is involved.

 The NFS directory logic provides several pre-emption points to allow other
 work on the system to proceed during potentially lengthy searches through
 cached NFS directories.  However, the current pre-emption logic is almost
 never triggered because it waits for 200 loop iterations before calling
 schedule().  On 4KB-per-page clients, it is nearly impossible to get 200
 directory entries into a single page.
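
 The difference between the two policies can be sketched in user-space C.
 This is an illustration only, not the kernel code: yield_cpu() is a
 hypothetical stand-in for schedule(), the function names are invented, and
 the assumption that the old loop counter restarts for each page is ours.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for schedule(); it only counts how often we yield. */
static unsigned int yield_count;
static void yield_cpu(void) { yield_count++; }

/* Old policy: yield only after more than 200 loop iterations in one page. */
static void scan_old(const unsigned int *entries_per_page, size_t npages)
{
    assert(entries_per_page != NULL || npages == 0);
    for (size_t p = 0; p < npages; p++) {
        unsigned int loop_count = 0;    /* assumed to restart for each page */
        for (unsigned int e = 0; e < entries_per_page[p]; e++) {
            if (loop_count++ > 200) {
                loop_count = 0;
                yield_cpu();
            }
        }
    }
}

/* New policy: yield once between pages, never for a single-page scan. */
static void scan_new(const unsigned int *entries_per_page, size_t npages)
{
    assert(entries_per_page != NULL || npages == 0);
    for (size_t p = 0; p < npages; p++) {
        if (p > 0)
            yield_cpu();
        /* ... examine the entries_per_page[p] cached entries here ... */
    }
}
```

 With four pages of 100 entries each, scan_old() never yields, while
 scan_new() yields three times.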

 Test-plan:
 Performance characterization of a directory scan like "find" while running
 a multi-threaded workload.


File 11-nfs-dir-msgs.patch

 Subject: [PATCH] NFS: directory trace messages

 Description:
 Reuse NFSDBG_DIRCACHE and NFSDBG_LOOKUPCACHE to provide additional diagnostic
 messages that trace the operation of the NFS client's directory cache.  A few
 new messages are now generated when NFSDBG_VFS is active, as well, to trace
 normal VFS activity.  This compromise provides better trace debugging for
 those who use pre-built kernels without adding a lot of extra noise to the
 standard debug settings.

 Test-plan:
 Enable NFS trace debugging with flags 1, 2, or 4.  You should be able to
 see different types of trace messages with each flag setting.


File 12-nfs-large-io.patch

 Subject: [PATCH] NFS: support 64KB reads and writes on the wire

 Most NFS client implementations allow up to 64KB reads and writes on the wire.
 Now Linux does too.  This will help reduce protocol and context switch
 overhead on read/write intensive NFS workloads.  Some header file and macro
 cleanup simplifies the calculation of the number of page slots needed in
 the nfs_readargs and nfs_writeargs structures, and dispenses with the
 need for the MAX_IOVEC macro.

 As a bonus, the RPC client now reports the maximum payload size it can
 support.  This is something a little less than 64KB for RPC over UDP, and
 about 2GB - 1 for RPC over TCP.  The effective rsize and wsize values are not
 allowed to exceed the reported maximum RPC payload size.
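
 The cap and the page-slot arithmetic can be sketched as follows.  The
 function names and the 4KB page size are assumptions made for illustration;
 the patch itself may compute these values differently.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */

/* Clamp a requested rsize or wsize to the transport's maximum payload. */
static size_t effective_io_size(size_t requested, size_t max_payload)
{
    assert(max_payload > 0);
    return requested < max_payload ? requested : max_payload;
}

/* Number of page slots needed to hold io_size bytes, rounded up. */
static size_t page_slots(size_t io_size)
{
    return (io_size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

 Under these assumptions a 65536-byte request over TCP stays at 65536 bytes
 and needs 16 page slots, while the same request over a transport whose
 maximum payload is smaller is reduced to that maximum.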

 Test-plan:
 Connectathon and iozone on mount point with wsize=rsize=65536 over TCP.  Tests
 with NFS over UDP to verify the maximum RPC payload size cap.  Note: server
 side support for 64KB reads and writes is also required.


File 13-rpc-cancel-work.patch

 Subject: [PATCH] RPC: cancel outstanding keventd work before freeing rpc_xprt

 It is possible for a connect operation to time out before keventd can run the
 connect worker.  In that case, xprt_destroy can race with the still pending
 connect worker.  If the worker runs after rpc_xprt is freed, a variety of
 bad things can occur, depending on how the freed memory is then reused.

 Test-plan:
 Hard code an unrealistically short connect timeout, then run basic tests to
 make sure the failure modes are clean.


File 14-nfs-readdir-ctime.patch

 Subject: [PATCH] NFS: Detect directory restoration
 
 Description:
 READDIR and READDIRPLUS read not only cookies but also file handles from
 the server.  The Linux NFS client caches these items for objects in each
 directory in the page cache.

 To validate contents of the page cache, the NFS client uses mtime and size
 returned by the server to determine whether a file or directory has been
 changed on the server.  However, if a directory is restored via NDMP restore,
 rsync, or some other mechanism, the object names in the directory and the
 directory size and mtime could remain the same, while the file handles and
 cookies for the objects contained in the directory have changed.

 This patch changes the NFS client to watch ctime on directories to catch
 restore operations that might invalidate cached file handles.  If a
 directory ctime change is detected, the client now invalidates any file
 handle information stored in the page cache for that directory.

 Test-plan:
 Combinations of rsync and "ls -l" on multiple clients.  No stale file handles
 should be reported on the contents of changed directories.  Standard
 performance tests; little or no loss of performance is expected.


File 20-rpc-socklib.patch

 Subject: [PATCH] RPC: extract socket logic common to both client and server

 Move some code that is common to both RPC client- and server-side socket
 transports into its own source file, net/sunrpc/socklib.c.

 Test-plan:
 Millions of fsx operations over UDP.  Connectathon over UDP.


File 21-rpc-xprt-switch.patch

 Subject: [PATCH] RPC: introduce client-side transport switch

 Introduce a generic RPC client-side transport API.  This "RPC transport
 switch" divorces socket-specific implementation from the generic pieces
 of the RPC client.  Such a switch will allow efficient support for RPC
 over TOE, IPsec offload, multiple sockets per mount, IPv6, and transports
 capable of direct data placement.

 Here, we move the bulk of client-side socket-specific code into a separate
 source file, net/sunrpc/xprtsock.c.
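
 The general shape of such a switch is a table of function pointers that
 generic code calls without knowing which transport is underneath.  The
 sketch below shows the idea in user-space C; every name in it (sw_xprt,
 sw_xprt_ops, the toy_* methods) is hypothetical and does not reflect the
 actual method set introduced by this patch.

```c
#include <assert.h>
#include <stddef.h>

struct sw_xprt;

struct sw_xprt_ops {            /* hypothetical per-transport method table */
    int  (*connect)(struct sw_xprt *);
    int  (*send_request)(struct sw_xprt *, const void *buf, size_t len);
    void (*close)(struct sw_xprt *);
};

struct sw_xprt {
    const struct sw_xprt_ops *ops;
    int connected;
    size_t bytes_sent;
};

/* Generic code: knows nothing about sockets, only about the ops table. */
static int sw_transmit(struct sw_xprt *x, const void *buf, size_t len)
{
    assert(x != NULL && x->ops != NULL);
    if (!x->connected) {
        int err = x->ops->connect(x);
        if (err)
            return err;
    }
    return x->ops->send_request(x, buf, len);
}

/* A toy transport implementation that only records what it was asked to do. */
static int toy_connect(struct sw_xprt *x) { x->connected = 1; return 0; }
static int toy_send(struct sw_xprt *x, const void *buf, size_t len)
{
    (void)buf;
    x->bytes_sent += len;
    return 0;
}
static void toy_close(struct sw_xprt *x) { x->connected = 0; }

static const struct sw_xprt_ops toy_ops = {
    .connect      = toy_connect,
    .send_request = toy_send,
    .close        = toy_close,
};
```

 A new transport would supply its own ops table; sw_transmit() would not
 need to change.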

 Test-plan:
 Millions of fsx operations.  Performance characterization such as "sio" or
 "iozone".  Destructive testing (unplugging the network temporarily, server
 reboots).  Connectathon with v2, v3, and v4.


File 22-rpc-xdr_sendpages.patch

 Subject: [PATCH] RPC: move xdr_sendpages under transport switch

 Move socket-dependent code from net/sunrpc/xdr.c to net/sunrpc/xprtsock.c.
 Reduce stack utilization of the RPC send path, while we're at it.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as "sio" or
 "iozone".


File 23-rpc-switch-cleanup.patch

 Subject: [PATCH] RPC: client-side transport switch cleanup

 Clean up remaining socket-specific structure naming, remove include/socket.h
 from most RPC client source files, and change some comments to reflect the
 realities of the new RPC transport switch mechanism.  Remove the cong_wait
 field from rpc_xprt, as it is no longer used.  Remove the rq_creddata field
 from rpc_rqst, as it is not used.  Fix some dprintk nits in clnt.c, sched.c,
 and auth_*.c.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as "sio" or
 "iozone".


File 30-rpc-write_space.patch

 Subject: [PATCH] RPC: separate TCP and UDP write space callbacks

 Split the socket write space callback function into a TCP version and UDP
 version, eliminating one dependence on the "xprt->stream" variable.  Keep the
 common pieces of this path in a single function.

 Test-plan:
 Write-intensive workload on a single mount point.


File 31-rpc-connect.patch

 Subject: [PATCH] RPC: separate TCP and UDP transport connection logic

 Remove the xs_create and xs_bind functions.  Create separate connection
 worker functions for managing UDP and TCP transport sockets.  This
 eliminates several dependencies on "xprt->stream".

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon with
 v2, v3, and v4.


File 32-rpc-send_request.patch

 Subject: [PATCH] RPC: separate TCP and UDP socket write paths

 Split the RPC client's main socket write path into a TCP version and a UDP
 version to eliminate another dependency on the "xprt->stream" variable.
 Rely on compiler optimization to remove conditional branches from
 xs_sendpages, as this function is now called with some constant arguments.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as
 "sio" or "iozone".  Examine oprofile results for any changes before and
 after this patch is applied.


File 33-rpc-tsh_size.patch

 Subject: [PATCH] RPC: skip over transport-specific heads automatically

 Add a generic mechanism for skipping over transport-specific headers when
 constructing an RPC request.  This removes another "xprt->stream" dependency.

 Test-plan:
 Write-intensive workload on a single mount point.


File 34-rpc-xprt-stream.patch

 Subject: [PATCH] RPC: get rid of xprt->stream

 Description:
 Now we can remove the last few places that use the "xprt->stream"
 variable, and get rid of it from the rpc_xprt structure.

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.


File 35-rpc-stream-disconnect.patch

 Subject: [PATCH] RPC: Disconnect TCP sockets after a major timeout

 Description:
 Implement a best practice: When a major RPC timeout occurs on a stream
 transport, disconnect then reconnect the transport before sending the
 next RPC request.

 We allow minor timeouts to retransmit the RPC request over stream transports.
 This is because NFSv2 and v3 servers can potentially drop a request, even
 when using a stream transport.  Note that NFSv4 servers are not allowed to
 drop requests.
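
 The policy described above reduces to a small decision table.  The sketch
 below uses hypothetical enum and function names to make it explicit; it is
 not the code in the patch.

```c
#include <assert.h>

enum xprt_kind      { XPRT_DATAGRAM, XPRT_STREAM };
enum timeout_kind   { TIMEOUT_MINOR, TIMEOUT_MAJOR };
enum timeout_action { ACT_RETRANSMIT, ACT_RECONNECT_THEN_RETRANSMIT };

/* Decide what to do with a request that has timed out. */
static enum timeout_action on_timeout(enum xprt_kind x, enum timeout_kind t)
{
    assert(t == TIMEOUT_MINOR || t == TIMEOUT_MAJOR);
    /* Only a major timeout on a stream transport forces a reconnect. */
    if (x == XPRT_STREAM && t == TIMEOUT_MAJOR)
        return ACT_RECONNECT_THEN_RETRANSMIT;
    /* Minor timeouts, and all datagram timeouts, simply retransmit. */
    return ACT_RETRANSMIT;
}
```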

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.


File 36-rpc-timeouts.patch

 Subject: [PATCH] RPC: Parametrize various transport connect timeouts

 Description:
 Each transport implementation can now set unique bind, connect,
 reestablishment, and idle timeout values.  These are variables, allowing the
 values to be modified dynamically.  This permits exponential backoff of any
 of these values, for instance.

 As an example, we implement exponential backoff for the reestablishment
 timeout.
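
 Exponential backoff of a timeout value can be sketched as below.  The
 function name, the doubling factor, and the initial and maximum values used
 in the note that follows are assumptions for illustration; the patch may
 choose different ones.

```c
#include <assert.h>

/*
 * Return the next timeout to use after a failed attempt: start at init,
 * then double on each call, never exceeding max.
 */
static unsigned long backoff_next(unsigned long cur, unsigned long init,
                                  unsigned long max)
{
    assert(init > 0 && init <= max);
    if (cur == 0)
        return init;
    if (cur > max / 2)          /* doubling would overshoot the cap */
        return max;
    return cur * 2;
}
```

 Starting from 0 with init = 3 and max = 60, successive calls return 3, 6,
 12, 24, 48, 60, 60, and so on.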

 Also fix up xprt_connect_status: the soft timeout logic was clobbering
 tk_status, so TCP connect errors were not properly reported on soft mounts.
 Always use a printk to report errors when connecting.

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.


File 37-rpc-flush-connects.patch

 Subject: [PATCH] RPC: kick off connect operations faster

 Description:
 Make the socket transport kick the event queue to start socket connects
 immediately.  This should improve responsiveness of applications that are
 sensitive to slow mount operations (like automounters).

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.


File 38-rpc-congestion.patch

 Subject: [PATCH] RPC: transport-specific timeouts

 Description:
 Prepare the way to remove the "xprt->nocong" variable by adding callouts to
 the RPC client transport switch API to handle setting RPC timeouts.

 Note that we move __xprt_{get,put}_cong to work around a compiler inlining
 bug.  [ gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3) ]

 Test-plan:
 Use WAN simulation to cause sporadic bursty packet loss.  Look for
 significant regression in performance or client stability.


File 39-rpc-xprt-nocong.patch

 Subject: [PATCH] RPC: remove xprt->nocong

 Description:
 Get rid of the "xprt->nocong" variable.

 Test-plan:
 Use WAN simulation to cause sporadic bursty packet loss.  Look for
 significant regression in performance or client stability.


File 40-rpc_portmap.patch

 Subject: [PATCH] RPC: pluggable portmapping

 Description:
 Introduce new RPC client transport switch methods to handle RPC portmapping
 in a transport independent way.  Some added complexity is due to the desire
 to prevent applications from diddling directly with the port value in the
 rpc_xprt structure.

 Also, remove pmap_lock, replacing it with test_and_set style synchronization.
 Move the pmap arguments into rpc_xprt.  Allow only a single rpc_task at a
 time to handle the rpc_portmap fields.

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.  NFSv2/3 and NFSv4 mounting should be carefully checked.
 Probably need to rig a server where certain services aren't running, or
 that returns an error for some typical operation.


File 41-rpc_peeraddr.patch

 Subject: [PATCH] RPC: Create API for getting remote peer address

 Description:
 Provide an API for retrieving the remote peer address without allowing
 direct access to the rpc_xprt struct.  Again, the desire is to provide
 a cleaner mechanism for callers to access the remote peer address and
 port number without diddling directly with the contents of the rpc_xprt
 struct.

 At the same time, increase the storage capacity of the rpc_xprt struct
 to allow for future transports (e.g., IPv6) that may have an address that
 is larger than sockaddr_in.
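
 A user-space sketch of the idea, assuming the address is stored in a
 sockaddr_storage and handed out only through an accessor.  The names
 peer_xprt, peer_set_addr, and peer_get_addr are hypothetical and are not
 the API added by this patch.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

struct peer_xprt {
    struct sockaddr_storage addr;   /* large enough for IPv4 and IPv6 */
    socklen_t addrlen;
};

/* Record the remote endpoint; called by the transport, not by its users. */
static void peer_set_addr(struct peer_xprt *x, const struct sockaddr *sa,
                          socklen_t len)
{
    assert(len <= sizeof(x->addr));
    memcpy(&x->addr, sa, len);
    x->addrlen = len;
}

/*
 * Copy the remote endpoint into the caller's buffer.  Returns the address
 * length, or 0 if the buffer is too small.
 */
static socklen_t peer_get_addr(const struct peer_xprt *x,
                               struct sockaddr *buf, socklen_t buflen)
{
    if (buflen < x->addrlen)
        return 0;
    memcpy(buf, &x->addr, x->addrlen);
    return x->addrlen;
}
```

 Callers receive a copy, so they cannot modify the transport's idea of its
 peer address or port.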

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.  NFSv2/3 and NFSv4 mounting should be carefully checked.
 Probably need to rig a server where certain services aren't running, or
 that returns an error for some typical operation.


File 42-rpc_create.patch

 Subject: [PATCH] RPC: Use sockaddr + size for remote transport endpoints

 Description:
 Prepare for more generic transport endpoint handling needed by transports
 such as IPv6.  Replace the two-call xprt_create_proto/rpc_create_client
 API with a single rpc_create call.  Define a new rpc_create_args structure
 that allows callers to pass in remote endpoint addresses of varying length.

 Finally, eliminate no-longer-needed external xprt_destroy and xprt_shutdown
 interfaces.

 Test-plan:
 Repeated mount and unmount.  TCP connects and reconnects.  Idle timeouts.


File 50-rpc-buffer.patch

 Subject: [PATCH] RPC: switchable buffer allocation

 Description:
 Add RPC client transport switch support for replacing buffer management on
 a per-transport basis.

 In the IPv4 socket transport implementation, RPC buffers are allocated as
 needed for each RPC message that is sent.  Some transport implementations
 may choose to use pre-allocated buffers for encoding, sending, receiving,
 and unmarshalling RPC messages, however.  For transports capable of direct
 data placement, the buffers are carved out of a pre-registered area of
 memory rather than from a slab cache.

 Test-plan:
 Millions of fsx operations.  Performance characterization with "sio" and
 "iozone".  Use oprofile and other tools to look for significant regression
 in CPU utilization.


File 51-rpc-private-xprt.patch

 Subject: [PATCH] RPC: private transport-specific fields

 Description:
 Add a facility for transport implementations to store their own data
 in the rpc_xprt.  Move socket-specific fields into a private struct
 defined in net/sunrpc/xprtsock.c.
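
 One common way to provide such a facility is an opaque pointer in the
 generic structure that only the transport implementation dereferences.  The
 sketch below assumes that approach; the struct layouts and names are
 hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct priv_xprt {
    unsigned long timeout;      /* generic field, visible to everyone */
    void *priv;                 /* owned by the transport implementation */
};

struct sock_priv {              /* known only to the socket transport */
    int fd;
    unsigned int sndbuf;
    unsigned int rcvbuf;
};

/* Allocate and attach the socket transport's private data. */
static int sock_setup(struct priv_xprt *x, int fd)
{
    struct sock_priv *p = calloc(1, sizeof(*p));

    if (p == NULL)
        return -1;
    p->fd = fd;
    x->priv = p;
    return 0;
}

/* Only socket-transport code looks inside the private area. */
static int sock_fd(const struct priv_xprt *x)
{
    const struct sock_priv *p = x->priv;

    assert(p != NULL);
    return p->fd;
}

static void sock_teardown(struct priv_xprt *x)
{
    free(x->priv);
    x->priv = NULL;
}
```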

 Still need to do something with rpc_portmap.

 Test-plan:
 Check socket buffer size on UDP sockets over time.  Millions of fsx
 operations on TCP.


File 52-rpc-xprt-modules.patch

 Subject: [PATCH] RPC: load RPC transport implementations dynamically
 
 Description:
 Now that we have an RPC transport switch, we have introduced the potential
 to add new transport capabilities for use by the RPC client at run-time.
 This patch allows RPC client transport implementations to be loaded as
 needed, or as they become available from distributors or third-party
 vendors.

 Test-plan:
 Build kernel with NFS and SUNRPC in a module.  Try loading and unloading
 the IPv4 transport module.  Try mounting without the IPv4 module loaded.
 Destructive testing.  Try unloading SUNRPC or the IPv4 transport module
 while there are active NFS mounts.  Reboot the client after mounting NFS
 shares.


File 53-rpc-xprt-doc.patch

 Subject: [PATCH] RPC: RPC transport switch documentation
 
 Description:
 Add a file to the Documentation directory describing the new RPC transport
 switch.  The file is based on the original RPC transport switch design
 document.

 Test-plan:
 None.


File 60-rpc_rqst.patch

 Subject: [PATCH] RPC: make rpc_rqst a part of rpc_task
 
 Description:
 The RPC client allocates rpc_rqst structures from slots in a static table.
 The size of this slot table determines how many RPC requests can be in
 flight concurrently for that rpc_xprt (NFS mount point).

 Previously, this static slot table was fixed in size, and contained just 16
 slots for every rpc_xprt.  In recent kernels, support was added to allow this
 static slot table to be sized and allocated dynamically at the time an
 rpc_xprt is created.  The slot table can now be made as large as 128 slots
 to permit many more concurrent RPC requests, but the memory remains
 allocated until each rpc_xprt is destroyed, even when there are no
 outstanding RPC requests.

 NFSv4.1 sessions provide the ability to negotiate the maximum number of
 concurrent requests that the transport and the server can handle.  The
 maximum can be increased or decreased during a transport session.

 This patch makes RPC requests a part of the RPC task structure.  This
 reduces memory fragmentation by keeping all the pieces of an NFS and RPC
 request together.  For asynchronous reads and writes, this means the RPC
 request, RPC task, and nfs_read/write_data structures reside in the same
 CPU cache lines, and will move between caches together.  RPC slot tables
 are eliminated entirely, preventing memory from being held even when it's
 not being used, and allowing the maximum number of outstanding RPC requests
 to be modulated dynamically.

 The call_reserve path has been simplified.  Its only jobs now are to
 initialize each rpc_rqst structure, or to queue RPC tasks on the xprt's
 backlog queue if the maximum number per transport has been exceeded.
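
 A simplified user-space model of this arrangement: the request is embedded
 in the task, and reservation either initializes it or backlogs the task.
 All names are hypothetical, and locking is omitted, so this is a sketch of
 the bookkeeping only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_rqst { uint32_t xid; bool in_use; };

struct toy_task {
    int pid;
    struct toy_rqst rqst;       /* embedded: no slot table, no pointer */
    bool backlogged;
};

struct toy_xprt {
    unsigned int in_flight;
    unsigned int max_reqs;      /* may be raised or lowered at run time */
    uint32_t next_xid;
    unsigned int backlog_len;
};

/* Returns true if the task may proceed, false if it was backlogged. */
static bool toy_reserve(struct toy_xprt *x, struct toy_task *t)
{
    assert(!t->rqst.in_use);
    if (x->in_flight >= x->max_reqs) {
        t->backlogged = true;
        x->backlog_len++;
        return false;
    }
    x->in_flight++;
    t->rqst.xid = x->next_xid++;    /* real code would hold the xprt lock */
    t->rqst.in_use = true;
    return true;
}

/* Give the request back once the RPC completes. */
static void toy_release(struct toy_xprt *x, struct toy_task *t)
{
    assert(t->rqst.in_use && x->in_flight > 0);
    t->rqst.in_use = false;
    x->in_flight--;
}
```

 Because the limit is just a number in the transport, changing max_reqs
 takes effect on the next reservation without reallocating anything.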

 We no longer have a free list, and each XID can be allocated while holding
 the xprt_lock.  Thus we can do away with the reserve_lock entirely.

 Test-plan:
 Heavy multi-threaded tests like "sio" and "iozone" on SMP systems.  Watch
 for significant regression in CPU utilization with oprofile and other tools.
 Watch for multiple RPCs with the same XID.  Destructive testing.


File 61-rpc-tk_rqstp.patch

 Subject: [PATCH] RPC: remove tk_rqstp field from rpc_task
 
 Description:
 Remove the tk_rqstp field.  Now that the rpc_rqst structure is part of
 rpc_task, there is no need for a field in rpc_task that points to the
 associated RPC request slot.

 Also fix some printk alignment issues in rpc_show_tasks while we're at it.

 Test-plan:
 None.


File 62-rpc-rq_task.patch

 Subject: [PATCH] RPC: remove rq_task field from rpc_rqst
 
 Description:
 Remove the rq_task field.  Now that the rpc_rqst structure is part of
 rpc_task, there is no need for a field in rpc_rqst that points to its
 associated rpc_task.
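
 When one structure is embedded in another, the enclosing structure can be
 recovered from a pointer to the member by subtracting the member's offset.
 The sketch below shows that technique with hypothetical names; whether the
 patch uses exactly this mechanism is an assumption on our part.

```c
#include <assert.h>
#include <stddef.h>

struct emb_rqst { unsigned int xid; };

struct emb_task {
    int pid;
    struct emb_rqst rqst;       /* embedded request */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define emb_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct emb_task *rqst_to_task(struct emb_rqst *r)
{
    assert(r != NULL);
    return emb_container_of(r, struct emb_task, rqst);
}
```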

 Test-plan:
 None.


File 63-rpc-rq_xprt.patch

 Subject: [PATCH] RPC: use rq_xprt instead of tk_xprt
 
 Description:
 Now that the rpc_rqst structure is part of rpc_task, we can replace this
 double indirection (task->tk_client->cl_xprt) with a single indirection
 (task->tk_rqst.rq_xprt) to save a load instruction and eliminate one AGI
 in several places.

 Test-plan:
 None.


File 64-rpc-rq_rtt.patch

 Subject: [PATCH] RPC: add rq_rtt field to rpc_rqst
 
 Description:
 Currently it is cumbersome to derive the RTT data from the address of an
 RPC request.  This patch adds an rq_rtt field which is a copy of the RPC
 client's cl_rtt field.  That makes it easier and cleaner to find the RTT
 data when needed, and reduces the number of loads and AGIs in several
 places.

 Test-plan:
 None.


File 65-rpc-dprintk.patch

 Subject: [PATCH] RPC: fix print format for tk_pid
 
 Description:
 The tk_pid field is an unsigned short.  The proper print format specifier for
 that type is %5u, not %4d.  So there.
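
 The same specifier behaves identically in user-space printf, which makes
 the field width easy to check.  The helper name below is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Format an unsigned short task id in a five-character, unsigned field. */
static int format_pid(char *buf, size_t len, unsigned short pid)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%5u", (unsigned int)pid);
}
```

 Five columns are enough for the largest unsigned short value, 65535, so
 the output stays aligned for every possible id.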

 Test-plan:
 None.


File 70-rpc-bkl1.patch

 Subject: [PATCH] RPC: BKL no longer required for ->tk_callback callback
 
 Description:
 The tk_callback callback function is used to invoke only two remaining
 functions: pmap_getport_done, and xprt_connect_status.  Neither of these
 requires the BKL, so now the RPC client no longer holds the BKL while
 invoking the tk_callback function.

 Test-plan:
 OraSim, fsx, and iozone on SMP clients.  Run multiple parallel "rm -rf"
 jobs on the same directory tree on SMP clients.  Generate a mount flood
 and look for TCP connection and portmapper races.


File 71-rpc-bkl2.patch

 Subject: [PATCH] RPC: BKL no longer required for ->tk_action callback
 
 Description:
 The RPC client finite state machine does not require that the global kernel
 lock be held for any of the typical RPC client states.  Only the NFS client's
 asynchronous unlink logic needs the lock.

 Remove the BKL around the tk_action invocations in the RPC scheduler.  The
 tk_action callback is invoked on average ten times per RPC request, so
 no longer taking the BKL in these hot NFS paths should noticeably improve
 SMP scalability.

 Test-plan:
 OraSim, fsx, and iozone on SMP clients.  Run multiple parallel "rm -rf"
 jobs on the same directory tree on SMP clients.  Generate a mount flood
 and look for TCP connection and portmapper races.


File 72-rpc-bkl3.patch

 Subject: [PATCH] RPC: BKL no longer required for ->tk_exit callback
 
 Description:
 Move BKL acquisition into the callback functions that need it (NFS async
 read, write, commit, and unlink completion, and NLM callbacks).

 This completes the removal of the BKL from the RPC client, and makes
 clear where the NFS and NLM clients still have strong dependencies on
 global kernel locking.

 Test-plan:
 OraSim, fsx, and iozone on SMP clients.  Run multiple parallel "rm -rf"
 jobs on the same directory tree on SMP clients.


File 73-nfs-direct-bkl.patch

 Subject: [PATCH] NFS: Direct I/O no longer acquires BKL
 
 Description:
 Now that the RPC client no longer acquires the BKL, we can begin removing
 it from the NFS client.  A logical first step is to remove it completely
 from the direct I/O path.

 Test-plan:
 OraSim, fsx, and iozone in direct I/O mode on SMP clients.


File 80-nfs-counters.patch

 Subject: [PATCH] NFS: add I/O performance counters
 
 Description:
 Add an extensible per-superblock performance counter facility to the NFS
 client.  This facility mimics the counters available for block devices and
 for networking.

 Currently there is no way to view the counter data from user-land.
 Eventually we plan to use named attributes to export this data from the
 kernel securely.  However, at this time named attribute support is not
 available in the Linux NFS client.  Until it is, distributors may use this
 patch plus an export mechanism of their choice to make NFS I/O performance
 counters available to user-level applications.

 Test-plan:
 fsx and iozone on SMP systems.  Watch for memory overwrite bugs, and
 performance loss (significantly more CPU required per op).


File CEL_NFS-ALL.patch

 Subject: [PATCH] NFS/RPC: roll-up all of 2.6.9-a patchset
 
 Description:
 Roll up all of cel's 2.6.9 patches.  For a complete description, see the
 individual patches.