Patchset for Linux NFS client

Published on Fri Oct 22 17:38:36 EDT 2004

The following patches are against 2.6.9 + Trond's NFS4_ALL patch. Included for reference is the latest version of the RPC transport switch design document, and a set of notes describing what's next.

A patch that rolls up these patches into a single diff is here.


File 02-nfs-short-write-msg.patch

 Subject: [PATCH] NFS: short write warning

 Recently a patch set was accepted to allow the Linux NFS client to handle
 short writes by retrying the unwritten portion of the request.  The only
 case that now results in an error is when the server makes no progress;
 that is, writes zero bytes.

 This patch changes the kernel log warning that is generated in that case
 to reflect the error condition more accurately.

 Test-plan:
 None.
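
The short-write handling described for 02-nfs-short-write-msg.patch can be sketched in user-space C roughly as follows. This is only an illustration, not the kernel code: write_fn, write_all, and capped_write are hypothetical names, and the choice of EIO for the "no progress" case is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical callback standing in for one WRITE request to the server.
 * Returns the number of bytes the server accepted, or -1 on error. */
typedef ssize_t (*write_fn)(const char *buf, size_t count, void *ctx);

/* Retry the unwritten tail of a request after each short write.
 * Fails only when the server makes no progress at all (assumed: EIO). */
static ssize_t write_all(write_fn do_write, const char *buf, size_t count,
                         void *ctx)
{
    size_t done = 0;

    assert(do_write != NULL);
    while (done < count) {
        ssize_t n = do_write(buf + done, count - done, ctx);

        if (n < 0)
            return -1;          /* hard error reported by the callback */
        if (n == 0) {
            errno = EIO;        /* zero bytes written: no progress */
            return -1;
        }
        done += (size_t)n;      /* short write: loop and send the rest */
    }
    return (ssize_t)done;
}

/* Fake server for illustration: accepts at most *(size_t *)ctx bytes
 * per call, so a cap of 0 models a server that makes no progress. */
static ssize_t capped_write(const char *buf, size_t count, void *ctx)
{
    size_t cap = *(const size_t *)ctx;

    (void)buf;
    return (ssize_t)(count < cap ? count : cap);
}
```

With a cap of 3, a 10-byte request completes after four calls; with a cap of 0, write_all gives up on the first call.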


File 03-nfs-getattr-msgs.patch

 Subject: [PATCH] NFS: report return code on GETATTR and SETATTR

 Improve trace debugging messages for NFSv2/3 GETATTR and SETATTR
 procedures.

 Test-plan:
 Enable NFS trace debugging for NFSDBG_PROC and watch for setattr and
 getattr result messages.  Try with NFSv2 and NFSv3.


File 04-nfs-dir-msgs.patch

 Subject: [PATCH] NFS: directory trace messages

 Description:
 Those who use pre-built kernels from a distribution can't add more trace
 messages to a kernel when diagnosing a problem in the field.  On the other
 hand, we don't want so many trace messages that the kernel log is
 overwhelmed with noise when tracing is enabled.

 This patch reuses NFSDBG_DIRCACHE and NFSDBG_LOOKUPCACHE to provide
 additional diagnostic messages that trace the operation of the NFS
 client's directory cache and lookup cache.  A few other messages are
 now generated when NFSDBG_VFS is active, as well, to trace normal VFS
 activity.

 Test-plan:
 Enable NFS trace debugging with flags 1, 2, or 4.  You should be able to
 see different types of trace messages with each flag setting.


File 05-rpc-htonl.patch

 Subject: [PATCH] RPC: display XIDs in network order

 Description:
 Ethereal and other tools display RPC XIDs in network order.  This patch
 changes the RPC trace messages that display XIDs to print them in network
 order so they can be easily matched to XIDs that appear in Ethereal.

 Test-plan:
 Run a short program with RPC trace debugging enabled while capturing a
 packet trace with Ethereal.  Compare the output of the trace debugging
 messages with the contents of the Ethereal window.
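
The byte-order issue behind 05-rpc-htonl.patch can be illustrated with a small user-space sketch. It assumes the XID is held in memory in network byte order, exactly as it appears on the wire; format_xid and xid_from_wire are hypothetical helpers, not functions from the patch.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Load four wire bytes into a 32-bit value without changing their order. */
static uint32_t xid_from_wire(const unsigned char bytes[4])
{
    uint32_t xid_net;

    memcpy(&xid_net, bytes, sizeof(xid_net));
    return xid_net;
}

/* Format an XID held in network byte order so that the hex digits read
 * the same as the bytes shown in a packet capture. */
static void format_xid(uint32_t xid_net, char *out, size_t outlen)
{
    assert(out != NULL && outlen >= 11);   /* "0x" + 8 digits + NUL */
    snprintf(out, outlen, "0x%08x", (unsigned int)ntohl(xid_net));
}
```

Printing xid_net directly with %08x would show the digits byte-reversed on a little-endian host; the ntohl() call is what makes the output match on either kind of machine.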


File 06-nfs-sync-write-alloc.patch

 Subject: [PATCH] NFS: Sync NFS writes still use kmalloc

 Replace the kmalloc() and kfree() calls in this path with appropriate
 invocations of nfs_writedata_alloc() and nfs_writedata_free().  This
 makes nfs_writepage_sync match all the other write paths in fs/nfs/write.c.

 Test-plan:
 Mount with the "sync" option, run millions of fsx operations, then
 review system memory utilization.


File 07-nfs-large-io.patch

 Subject: [PATCH] NFS: support 64KB reads and writes on the wire

 Most NFS client implementations allow up to 64KB reads and writes
 on the wire.  Now Linux does too.  This will help reduce protocol
 and context switch overhead on read/write intensive NFS workloads.

 Test-plan:
 Connectathon and iozone on mount point with wsize=rsize=65536.


File 10-nfs-direct-write-verf.patch

 Subject: [PATCH] NFS: Use sizeof() instead of C macro

 Replace a C macro with sizeof().

 Test-plan:
 Rig a server to return a bad write verifier.
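
The description of 10-nfs-direct-write-verf.patch does not show the code involved; judging by the file name and test plan, it presumably concerns the write verifier comparison. A minimal sketch of that idea, with a made-up struct and function name, might look like this. The 8-byte size matches the NFSv3 write verifier.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical verifier type; the struct and field names are invented
 * for this sketch. */
struct write_verf {
    uint32_t verifier[2];
};

/* Compare two verifiers.  Using sizeof() ties the length to the type
 * itself instead of to a separately maintained size macro. */
static int verf_matches(const struct write_verf *a,
                        const struct write_verf *b)
{
    assert(a != NULL && b != NULL);
    return memcmp(a->verifier, b->verifier, sizeof(a->verifier)) == 0;
}
```

If the field ever changes size, the comparison length follows automatically, which is the usual reason to prefer sizeof() over a constant.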


File 11-nfs-short-direct-write.patch

 Subject: [PATCH] NFS: better handling of short writes in direct write path

 Immediately return control to the application if a short NFS write is
 detected in the NFS client's direct write path.  This is better behavior
 than what the direct write path does today, which could result in data
 appearing at the wrong offset in the file.

 Eventually this code path should retry short writes at least once before
 giving up.

 Test-plan:
 Rig a server to return a short NFS write while running an application that
 uses direct I/O over NFS.


File 12-nfs-direct-write-alloc.patch

 Subject: [PATCH] NFS: Direct write path allocates nfs_write_data on the stack

 Reduce stack utilization in the NFS direct write path by using a
 dynamically allocated nfs_write_data structure instead of allocating one
 on the stack.  This reduces stack utilization of nfs_direct_write_seg
 from over 900 bytes to less than 100 bytes.

 Test-plan:
 Millions of operations with fsx-odirect.  OraSim with a direct job file.


File 13-nfs-direct-read-alloc.patch

 Subject: [PATCH] NFS: Direct read path allocates nfs_read_data on the stack

 Reduce stack utilization in the NFS direct read path by using a
 dynamically allocated nfs_read_data structure instead of allocating one
 on the stack.  This reduces stack utilization of nfs_direct_read_seg
 from over 900 bytes to less than 100 bytes.

 Test-plan:
 Millions of operations with fsx-odirect.  OraSim with a direct job file.


File 14-nfs-direct-parallel-read.patch

 Subject: [PATCH] NFS: Use parallel read operations to do direct read requests

 The initial implementation of NFS direct reads was entirely synchronous.
 The direct read logic issued one NFS READ operation at a time, and waited
 for the server's reply before issuing the next one.  For large direct
 read requests, this is unnecessarily slow.

 This patch changes the NFS direct read path to dispatch NFS READ operations
 for a single direct read request in parallel and wait for them once.  The
 direct read path is still synchronous in nature, but because the NFS READ
 operations are going in parallel, the completion wait should be much shorter.

 Test-plan:
 Millions of operations with fsx-odirect.  OraSim with a direct job file and
 small rsize.  Use sio with -direct to generate large sequential reads or
 large random reads.  Check that ^C or application segmentation faults do
 not cause an oops.  Verify behavior when reading up to EOF.
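
The "dispatch in parallel, wait once" scheme described for 14-nfs-direct-parallel-read.patch can be modeled with a simple outstanding-request counter. The sketch below is single-threaded and omits locking and the actual I/O; struct dreq and both functions are hypothetical names, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one direct read request. */
struct dreq {
    size_t outstanding;   /* READs dispatched but not yet completed */
    size_t bytes_read;    /* total bytes returned so far */
    int    error;         /* first error seen, 0 if none */
};

/* Dispatch phase: one READ per rsize-sized chunk, all issued up front.
 * Returns the number of READs dispatched. */
static size_t dreq_dispatch(struct dreq *d, size_t count, size_t rsize)
{
    assert(d != NULL && rsize > 0);
    d->outstanding = 0;
    d->bytes_read = 0;
    d->error = 0;
    while (count > 0) {
        size_t chunk = count < rsize ? count : rsize;

        /* a real client would start an asynchronous READ here */
        d->outstanding++;
        count -= chunk;
    }
    return d->outstanding;
}

/* Completion callback for a single READ.  Returns 1 when this was the
 * last outstanding READ, which is when the single waiter can be woken. */
static int dreq_complete(struct dreq *d, size_t got, int error)
{
    assert(d != NULL && d->outstanding > 0);
    if (error) {
        if (!d->error)
            d->error = error;   /* remember only the first failure */
    } else {
        d->bytes_read += got;
    }
    return --d->outstanding == 0;
}
```

A 100000-byte request with an rsize of 32768 dispatches four READs; the caller sleeps once and is woken by whichever completion arrives last.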


File 15-nfs-direct-wb-all.patch

 Subject: [PATCH] NFS: Direct reads and writes need to flush dirty cache pages

 Other parts of the NFS client invoke nfs_wb_all() when they want to flush dirty
 cache pages.  The direct path needs to do that, too.

 Test-plan:
 Millions of operations with fsx-odirect using very large files and op sizes.


File 20-rpc-socklib.patch

 Subject: [PATCH] RPC: extract socket logic common to both client and server

 Move some code that is common to both RPC client- and server-side socket
 transports into its own source file, net/sunrpc/socklib.c.

 Test-plan:
 Millions of fsx operations over UDP.  Connectathon over UDP.


File 21-rpc-xprt-switch.patch

 Subject: [PATCH] RPC: introduce client-side transport switch

 This patch introduces a generic transport switch into the kernel RPC
 client.  The RPC transport switch divorces socket-specific implementation
 from the generic pieces of the RPC client.  Such a switch will allow
 support for RPC over 10GbE, IPsec offload, multiple sockets per mount,
 IPv6, and transports capable of direct data placement.

 Here, we move the bulk of client-side socket-specific code into a separate
 source file, net/sunrpc/xprtsock.c.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as
 "sio" or "iozone".  Destructive testing (unplugging the network
 temporarily).  Connectathon with v2, v3, and v4.
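
The general shape of a transport switch is a table of function pointers that generic code calls through without knowing which transport sits underneath. The sketch below illustrates that pattern only; every name in it (xprt_sketch, xprt_ops_sketch, the stream and dgram methods) is hypothetical and is not the interface introduced by 21-rpc-xprt-switch.patch.

```c
#include <assert.h>
#include <stddef.h>

struct xprt_sketch;

/* Hypothetical method table; each transport type supplies one. */
struct xprt_ops_sketch {
    const char *name;
    int (*connect)(struct xprt_sketch *xprt);
    int (*send)(struct xprt_sketch *xprt, size_t len);
};

/* Generic transport object: it only knows about the method table. */
struct xprt_sketch {
    const struct xprt_ops_sketch *ops;
    int    connected;
    size_t bytes_sent;
};

/* Stream transport: must be connected before it can send. */
static int stream_connect(struct xprt_sketch *x) { x->connected = 1; return 0; }
static int stream_send(struct xprt_sketch *x, size_t len)
{
    if (!x->connected)
        return -1;
    x->bytes_sent += len;
    return 0;
}

/* Datagram transport: connecting is a no-op, sending always works. */
static int dgram_connect(struct xprt_sketch *x) { (void)x; return 0; }
static int dgram_send(struct xprt_sketch *x, size_t len)
{
    x->bytes_sent += len;
    return 0;
}

static const struct xprt_ops_sketch stream_ops = {
    "stream", stream_connect, stream_send
};
static const struct xprt_ops_sketch dgram_ops = {
    "dgram", dgram_connect, dgram_send
};

/* Generic code: never tests the transport type, only calls the table.
 * If the first send fails, connect and retry once. */
static int xprt_transmit(struct xprt_sketch *xprt, size_t len)
{
    assert(xprt != NULL && xprt->ops != NULL);
    if (xprt->ops->send(xprt, len) == 0)
        return 0;
    if (xprt->ops->connect(xprt) != 0)
        return -1;
    return xprt->ops->send(xprt, len);
}
```

Adding a new transport in this model means supplying another method table; xprt_transmit does not change.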


File 22-rpc-xdr_sendpages.patch

 Subject: [PATCH] RPC: move xdr_sendpages under transport switch

 Move socket-dependent code from net/sunrpc/xdr.c to net/sunrpc/xprtsock.c.
 Reduce stack utilization of the RPC send path, while we're at it.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as "sio" or
 "iozone".


File 23-rpc-switch-cleanup.patch

 Subject: [PATCH] RPC: client-side transport switch cleanup

 Clean up remaining socket-specific structure naming, remove include/socket.h
 from most RPC client source files, and change some comments to reflect the
 realities of the new RPC transport switch mechanism.  Also remove the
 cong_wait field from rpc_xprt, as it is no longer used.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as
 "sio" or "iozone".


File 24-rpc-write_space.patch

 Subject: [PATCH] RPC: separate TCP and UDP write space callbacks

 Split the socket write space callback function into a TCP version and
 UDP version, eliminating one dependence on the "xprt->stream" variable.

 Also, make both callbacks more CPU efficient by reducing the number
 of conditional branches taken in the hot path in each function.

 Test-plan:
 Write-intensive workload on a single mount point.


File 25-rpc-connect.patch

 Subject: [PATCH] RPC: separate TCP and UDP transport connection logic

 Remove the xs_create and xs_bind functions.  Create separate connection
 worker functions for managing UDP and TCP transport sockets.  This
 eliminates several dependencies on "xprt->stream".

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with v2, v3, and v4.


File 26-rpc-send_request.patch

 Subject: [PATCH] RPC: separate TCP and UDP socket write paths

 Split the RPC client's main socket write path into a TCP version and a UDP
 version to eliminate another dependency on the "xprt->stream" variable.
 Rely on compiler optimization to remove conditional branches from
 xs_sendpages, as this function is now called with some constant arguments.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as
 "sio" or "iozone".  Examine oprofile results for any changes before and
 after this patch is applied.


File 27-rpc-tsh_size.patch

 Subject: [PATCH] RPC: skip over transport-specific heads automatically

 Add a mechanism for skipping over transport-specific headers when constructing
 an RPC request.  This removes another "xprt->stream" dependency.

 Test-plan:
 Write-intensive workload on a single mount point.
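
One way to picture the header-skipping mechanism of 27-rpc-tsh_size.patch: each transport states how much room its own header needs, and RPC header encoding begins after that space. The sketch below assumes the size is counted in 32-bit words (1 for the 4-byte record marker that RPC uses on stream transports, 0 for datagrams); struct hdr_xprt and both helpers are hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transport descriptor.  tsh_size is assumed to be the
 * transport-specific header size in 32-bit words. */
struct hdr_xprt {
    size_t tsh_size;
};

/* Return where RPC header encoding should start inside `buf`.
 * No test of the transport type is needed. */
static uint32_t *rpc_header_start(const struct hdr_xprt *xprt, uint32_t *buf)
{
    assert(xprt != NULL && buf != NULL);
    return buf + xprt->tsh_size;
}

/* Fill in a stream record marker once the message length is known:
 * the high bit marks the last fragment, the low 31 bits hold the length. */
static void set_record_marker(uint32_t *buf, uint32_t msg_len)
{
    assert(buf != NULL && msg_len < 0x80000000u);
    buf[0] = htonl(0x80000000u | msg_len);
}
```

The encoder asks rpc_header_start() where to begin and never needs to know whether a record marker will be written in front of its output.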


File 28-rpc-xprt-stream.patch

 Subject: [PATCH] RPC: get rid of xprt->stream

 Description:
 Now we can remove the last few places that use the "xprt->stream"
 variable, and get rid of it from the rpc_xprt structure.

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.


File 29-rpc-congestion.patch

 Subject: [PATCH] RPC: transport-specific timeouts

 Description:
 This patch prepares the way to remove the "xprt->nocong" variable by
 adding callouts to the RPC client transport switch API to handle
 setting RPC timeouts.

 Test-plan:
 Use WAN simulation to cause sporadic bursty packet loss.  Look for
 significant regression in performance or client stability.


File 30-rpc-xprt-nocong.patch

 Subject: [PATCH] RPC: remove xprt->nocong

 Description:
 Get rid of the "xprt->nocong" variable.

 Test-plan:
 Use WAN simulation to cause sporadic bursty packet loss.  Look for
 significant regression in performance or client stability.


File 31-rpc-buffer.patch

 Subject: [PATCH] RPC: switchable buffer allocation

 Description:
 In the IPv4 socket transport implementation, RPC buffers are allocated as
 needed for each RPC message that is sent.  Some transport implementations
 may choose to use pre-allocated buffers for encoding, sending, receiving,
 and unmarshalling RPC messages.

 This patch adds RPC client transport switch support for replacing buffer
 management on a per-transport basis.

 Test-plan:
 Millions of fsx operations.  Performance characterization with "sio" and
 "iozone".  Use oprofile and other tools to look for significant regression
 in CPU utilization.


File 32-rpc-portmap.patch

 Subject: [PATCH] RPC: pluggable portmapping

 Description:
 Introduce new RPC client transport switch methods to handle RPC portmapping
 in a transport independent way.

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.  NFSv2/3 and NFSv4 mounting should be carefully checked.
 Probably need to rig a server where certain services aren't running, or
 that returns an error for some typical operation.


File 33-rpc-xprt_peeraddr.patch

 Subject: [PATCH] RPC: API for getting remote peer address

 Description:
 Provide an API for retrieving the remote peer address without allowing
 direct access to the rpc_xprt struct.

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.  NFSv2/3 and NFSv4 mounting should be carefully checked.
 Probably need to rig a server where certain services aren't running, or
 that returns an error for some typical operation.


File 34-rpc-private-xprt.patch

 Subject: [PATCH] RPC: private transport-specific fields

 Description:
 Add a facility for transport implementations to store their own data
 in the rpc_xprt.  Move socket-specific fields into a private struct.

 Still need to do something with rpc_portmap.

 Test-plan:
 Check socket buffer size on UDP sockets over time.  Millions of fsx
 operations on TCP.
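
A common way to give each transport implementation its own storage is an opaque pointer in the generic structure, as in the following sketch. The field name private_data, both structs, and the helper functions are hypothetical; the actual layout chosen by 34-rpc-private-xprt.patch is not shown in its description.

```c
#include <assert.h>
#include <stdlib.h>

/* Generic transport: carries an opaque pointer owned by the
 * transport implementation. */
struct xprt_generic {
    int   id;
    void *private_data;   /* hypothetical field name */
};

/* Socket-specific state, visible only to the socket transport code. */
struct sock_private {
    int    fd;
    size_t sndbuf_size;
    size_t rcvbuf_size;
};

/* Allocate and attach the socket transport's private state. */
static int sock_xprt_setup(struct xprt_generic *xprt, int fd)
{
    struct sock_private *priv = calloc(1, sizeof(*priv));

    if (priv == NULL)
        return -1;
    priv->fd = fd;
    xprt->private_data = priv;
    return 0;
}

/* Accessor used only inside the socket transport implementation. */
static struct sock_private *sock_priv(const struct xprt_generic *xprt)
{
    assert(xprt != NULL && xprt->private_data != NULL);
    return xprt->private_data;
}

/* Release the private state when the transport is torn down. */
static void sock_xprt_destroy(struct xprt_generic *xprt)
{
    free(xprt->private_data);
    xprt->private_data = NULL;
}
```

Generic RPC code in this model never dereferences private_data, so socket buffer sizes and similar fields can change without touching it.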


File 35-rpc-xprt-modules.patch

 Subject: [PATCH] RPC: load RPC transport implementations dynamically
 
 Description:
 Now that we have an RPC transport switch, we have introduced the potential
 to add new transport capabilities for use by the RPC client at run-time.
 This patch allows RPC client transport implementations to be loaded as
 needed, or as they become available from distributors or third-party
 vendors.

 This patch is experimental.  It is safe to apply, but is functionally
 incomplete.  Currently it is acting only as a placeholder to collect
 changes related to transport module loading.

 Test-plan:
 Build kernel with NFS and SUNRPC in a module.  Try loading and unloading
 the IPv4 transport module.  Try mounting without the IPv4 module loaded.
 Destructive testing.  Try unloading SUNRPC or the IPv4 transport module
 while there are active NFS mounts.


File 40-rpc-tk_auth.patch

 Subject: [PATCH] RPC: add function to get tk_auth from rpc_rqst pointer

 Description:
 Create a standardized way to derive the appropriate tk_auth for a given
 rpc_rqst.  This is clean up required for the next patch.

 Test-plan:
 None.


File 41-rpc_rqst.patch

 Subject: [PATCH] RPC: make rpc_rqst a part of rpc_task
 
 Description:
 The RPC client allocates rpc_rqst structures from slots in a static table.
 The size of this slot table determines how many RPC requests can be in
 flight concurrently for that rpc_clnt (NFS mount point).

 Previously, this static slot table was fixed in size, and contained just 16
 slots for every rpc_clnt.  In recent kernels, logic was added to allow this
 static slot table to be sized and allocated dynamically at the time an
 rpc_clnt is created.  The slot table can now be made as large as 128 slots
 to permit many more concurrent RPC requests, but the memory remains
 allocated until each rpc_clnt is destroyed, even when there are no
 outstanding RPC requests.

 NFSv4.1 sessions provide the ability to negotiate the maximum number of
 concurrent requests that the transport and the server can handle.  The
 maximum can be increased or decreased during a transport session.

 This patch makes RPC requests a part of the RPC task structure.  This
 reduces memory fragmentation by keeping all the pieces of an NFS and RPC
 request together.  For asynchronous reads and writes, this means the RPC
 request, RPC task, and nfs_read/write_data structures reside in the same
 CPU cache lines, and will move between caches together.  RPC slot tables
 are eliminated entirely, preventing memory from being held even when it's
 not being used, and allowing the maximum number of outstanding RPC requests
 to be modulated dynamically.

 The call_reserve path has been simplified.  Its only jobs now are to
 initialize each rpc_rqst structure, or to queue RPC tasks on the xprt's
 backlog queue if the maximum number per transport has been exceeded.

 We no longer have a free list, and each XID can be allocated while holding
 the xprt_lock.  Thus we can do away with the reserve_lock entirely.

 Test-plan:
 Heavy multi-threaded tests like "sio" and "iozone" on SMP systems.  Watch
 for significant regression in CPU utilization with oprofile and other tools.
 Watch for multiple RPCs with the same XID.  Destructive testing.
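
Embedding the request in the task, as 41-rpc_rqst.patch describes, means either structure can be found from the other with address arithmetic instead of a stored pointer. The sketch below shows the idea with hypothetical types (task_sketch, rqst_sketch) and a hypothetical helper; it is not the kernel's structure layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the request structure. */
struct rqst_sketch {
    uint32_t xid;
};

/* Hypothetical stand-in for the task structure. */
struct task_sketch {
    int                task_id;
    struct rqst_sketch rqst;   /* embedded, not a pointer into a slot table */
};

/* Recover the enclosing task from a pointer to its embedded request. */
static struct task_sketch *task_of(struct rqst_sketch *req)
{
    assert(req != NULL);
    return (struct task_sketch *)
        ((char *)req - offsetof(struct task_sketch, rqst));
}
```

Going from task to request is just &task->rqst, and going back is task_of(), so neither structure needs a pointer to the other. Both also come from a single allocation, which is what keeps them close together in memory.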


File 42-rpc-tk_rqstp.patch

 Subject: [PATCH] RPC: remove tk_rqstp field from rpc_task
 
 Description:
 Now that the rpc_rqst structure is part of rpc_task, there is no need for
 a field in rpc_task that points to the associated RPC request slot.

 This patch removes the tk_rqstp field.

 Test-plan:
 None.


File 43-rpc-rq_task.patch

 Subject: [PATCH] RPC: remove rq_task field from rpc_rqst
 
 Description:
 Now that the rpc_rqst structure is part of rpc_task, there is no need for a
 field in rpc_rqst that points to its associated rpc_task.

 This patch removes the rq_task field.

 Test-plan:
 None.


File 44-rpc-rq_xprt.patch

 Subject: [PATCH] RPC: use rq_xprt instead of tk_xprt
 
 Description: Minor performance optimization.

 Now that the rpc_rqst structure is part of rpc_task, we can replace this
 double indirection (task->tk_client->cl_xprt) with a single indirection
 (task->tk_rqst.rq_xprt) to save a load instruction and eliminate one AGI
 in several places.

 Test-plan:
 None.


File 45-rpc-rq_rtt.patch

 Subject: [PATCH] RPC: add rq_rtt field to rpc_rqst
 
 Description: Minor performance optimization.

 Currently it is cumbersome to derive the RTT data from the address of an
 RPC request.  This patch adds an rq_rtt field which is a copy of the RPC
 client's cl_rtt field.  That makes it easier and cleaner to find the RTT
 data when needed, and reduces the number of loads and AGIs in several
 places.

 Test-plan:
 None.


File 50-nfs-noac.patch

 Subject: [PATCH] NFS: use attribute timeout instead of "noac" mount option

 The behavior enabled by the "noac" mount option should be precisely
 equivalent to setting acreg{min,max} or acdir{min,max} to zero via mount
 options.

 Test-plan:
 Compare behavior of mounts with "actimeo=0" and "noac".


File 51-nfs_readdir-sched.patch

 Subject: [PATCH] NFS: readdir pre-emption

 The NFS directory logic provides several pre-emption points to allow other
 work on the system to proceed during the potentially lengthy searches in
 the NFS directory cache.

 However, the pre-emption logic is almost never triggered because it waits
 for 200 loop iterations before calling schedule().  On most 4KB-per-page
 clients, it is nearly impossible to get 200 directory entries into a
 single page.

 This patch adds more frequent pre-emption to the readdir and cached lookup
 paths.  Pre-emption will occur about once per page during multi-page scans,
 and not at all if only a single page is involved.

 Test-plan:
 Performance characterization of a directory scan like "find" while
 running a multi-threaded workload.
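
The difference between the two pre-emption policies described for 51-nfs_readdir-sched.patch can be modeled with a little arithmetic. The sketch assumes the old iteration counter starts over on every page, which is how the description reads, and that the new policy yields each time a scan moves on to another page; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Old policy (as modeled here): yield once every `interval` entries,
 * with the counter assumed to restart on each page. */
static size_t yields_old(size_t pages, size_t entries_per_page,
                         size_t interval)
{
    assert(interval > 0);
    return pages * (entries_per_page / interval);
}

/* New policy (as modeled here): yield each time the scan moves on to
 * another page, so a single-page scan never yields. */
static size_t yields_new(size_t pages)
{
    return pages > 1 ? pages - 1 : 0;
}
```

With 100 entries per page and an interval of 200, the old policy never yields however many pages are scanned, while the new one yields nine times during a ten-page scan.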