Patchset for 2.6.9-rc3

Published on Thu Sep 30 17:23:36 EDT 2004

The following patches are against 2.6.9-rc3. Included for reference is the latest version of the RPC transport switch design document, and a set of notes describing what's next.


File 02-nfs-blocksize2.patch

 Subject: [PATCH] NFS: incorrect "df" results

 Description:
 Fix an NFS client bug introduced in 2.6.9-rc1.  The "df" command was
 reporting the size of NFS file systems incorrectly.

 Test-plan:
 Compare NFS mounts in the output of "df -k" and "df -h".


File 03-nfs-short-write-msg.patch

 Subject: [PATCH] NFS: short write warning

 Recently a patch set was accepted to allow the Linux NFS client to handle
 short writes by retrying the unwritten portion of the request.  The only
 case that now results in an error is when the server makes no progress;
 that is, writes zero bytes.

 This patch changes the kernel log warning that is generated in that case
 to reflect more accurately the error condition.

 Test-plan:
 None.


File 04-nfs-getattr-msgs.patch

 Subject: [PATCH] NFS: report return code on GETATTR and SETATTR

 Improve trace debugging messages for NFSv2/3 GETATTR and SETATTR
 procedures.

 Test-plan:
 Enable NFS trace debugging for NFSDBG_PROC and watch for setattr and
 getattr result messages.  Try with NFSv2 and NFSv3.


File 05-nfs-dir-msgs.patch

 Subject: [PATCH] NFS: directory trace messages

 Description:
 Those who use pre-built kernels from a distribution can't add more trace
 messages to a kernel when diagnosing a problem in the field.  On the other
 hand, we don't want so many trace messages that the kernel log is
 overwhelmed with noise when tracing is enabled.

 This patch reuses NFSDBG_DIRCACHE and NFSDBG_LOOKUPCACHE to provide
 additional diagnostic messages that trace the operation of the NFS
 client's directory cache and lookup cache.  A few other messages are
 now generated when NFSDBG_VFS is active, as well, to trace normal VFS
 activity.

 Test-plan:
 Enable NFS trace debugging with flags 1, 2, or 4.  You should be able to
 see different types of trace messages with each flag setting.


File 06-nfs-sync-write-alloc.patch

 Subject: [PATCH] NFS: Sync NFS writes still use kmalloc

 Replace the kmalloc() and kfree() calls in this path with appropriate
 invocations of nfs_writedata_alloc() and nfs_writedata_free().  This
 makes nfs_writepage_sync match all the other write paths in fs/nfs/write.c.

 Test-plan:
 Mount with the "sync" option, run millions of fsx operations, then
 review system memory utilization.


File 10-nfs-direct-write-verf.patch

 Subject: [PATCH] NFS: Use sizeof() instead of C macro

 Replace a C macro with sizeof().

 Test-plan:
 Rig a server to return a bad write verifier.
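
 Example (illustrative, not taken from the patch):
 A minimal sketch of why sizeof() is preferable to a separately maintained
 length macro when comparing write verifiers.  The structure, the macro, and
 the function names below are hypothetical stand-ins, not the kernel's own
 definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a write verifier: an opaque 8-byte cookie. */
struct demo_writeverf {
    uint8_t data[8];
};

/* Old style: the length is a constant that must be kept in sync by hand. */
#define DEMO_VERF_SIZE 8

int verf_differs_macro(const struct demo_writeverf *a,
                       const struct demo_writeverf *b)
{
    return memcmp(a->data, b->data, DEMO_VERF_SIZE) != 0;
}

/* New style: the length is derived from the object being compared. */
int verf_differs_sizeof(const struct demo_writeverf *a,
                        const struct demo_writeverf *b)
{
    assert(a != NULL && b != NULL);
    return memcmp(a->data, b->data, sizeof(a->data)) != 0;
}
```

 If the verifier field ever changes size, the sizeof() version follows
 automatically, while the macro version silently compares the wrong length.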


File 11-nfs-short-direct-write.patch

 Subject: [PATCH] NFS: better handling of short writes in direct write path

 Immediately return control to the application if a short NFS write is
 detected in the NFS client's direct write path.  If a short write occurs
 and the client continues writing, the short write will leave a gap
 somewhere in the middle of a write request, which is more difficult
 for applications to recover from.

 Test-plan:
 Rig a server to return a short NFS write while running an application that
 uses direct I/O over NFS.
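
 Example (illustrative, not taken from the patch):
 A user-space sketch of the "stop at the first short write" policy described
 above.  The callback type, both function names, and the error convention are
 assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical per-segment writer: returns the number of bytes the server
 * accepted for this segment, or a negative value on error. */
typedef ssize_t (*seg_write_fn)(size_t offset, size_t len, void *ctx);

/* Write `total` bytes in segments of at most `seg_size` bytes.  Stop as soon
 * as any segment is written short, so the bytes reported back to the caller
 * always form one contiguous range starting at offset 0. */
ssize_t direct_write_segs(size_t total, size_t seg_size,
                          seg_write_fn write_seg, void *ctx)
{
    size_t done = 0;

    assert(seg_size > 0 && write_seg != NULL);
    while (done < total) {
        size_t want = total - done;
        ssize_t got;

        if (want > seg_size)
            want = seg_size;
        got = write_seg(done, want, ctx);
        if (got < 0)
            return done ? (ssize_t)done : got;  /* sketch-only convention */
        done += (size_t)got;
        if ((size_t)got < want)
            break;      /* short write: do not issue later segments */
    }
    return (ssize_t)done;
}

/* Demo writer: accepts bytes only up to the cap stored in *ctx. */
ssize_t capped_writer(size_t offset, size_t len, void *ctx)
{
    size_t cap = *(size_t *)ctx;

    if (offset >= cap)
        return 0;
    return (ssize_t)(offset + len > cap ? cap - offset : len);
}
```

 With a cap of 10 bytes and 4-byte segments, a 16-byte request returns 10 and
 never writes bytes 12-15, so no hole is left behind the short segment.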


File 12-nfs-direct-write-alloc.patch

 Subject: [PATCH] NFS: Direct write path allocates nfs_write_data on the stack

 Reduce stack utilization in the NFS direct write path by using a
 dynamically allocated nfs_write_data structure instead of allocating one
 on the stack.  This reduces stack utilization of nfs_direct_write_seg
 from over 900 bytes to less than 100 bytes.

 Test-plan:
 Millions of operations with fsx-odirect.  OraSim with a direct job file.
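
 Example (illustrative, not taken from the patch):
 A user-space sketch of the stack-versus-dynamic trade-off.  The structure
 below is a hypothetical stand-in whose size is chosen only to resemble the
 "over 900 bytes" figure; it is not the layout of nfs_write_data.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a large per-request structure. */
struct demo_write_data {
    char args[400];
    char res[400];
    char fattr[128];
};

/* Before: the whole structure occupies this function's stack frame. */
int write_seg_on_stack(const char *buf, size_t len)
{
    struct demo_write_data wdata;

    memset(&wdata, 0, sizeof(wdata));
    if (len > sizeof(wdata.args))
        return -1;
    memcpy(wdata.args, buf, len);
    return (int)len;
}

/* After: only a pointer lives on the stack; the structure is allocated
 * dynamically and released before returning. */
int write_seg_dynamic(const char *buf, size_t len)
{
    struct demo_write_data *wdata;
    int ret = -1;

    assert(buf != NULL);
    wdata = calloc(1, sizeof(*wdata));
    if (wdata == NULL)
        return -1;      /* allocation failure is a new error path */
    if (len <= sizeof(wdata->args)) {
        memcpy(wdata->args, buf, len);
        ret = (int)len;
    }
    free(wdata);
    return ret;
}
```

 The cost of the dynamic version is an extra failure path and an
 allocate/free pair per call, in exchange for a much smaller stack frame.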


File 13-nfs-direct-read-alloc.patch

 Subject: [PATCH] NFS: Direct read path allocates nfs_read_data on the stack

 Reduce stack utilization in the NFS direct read path by using a
 dynamically allocated nfs_read_data structure instead of allocating one
 on the stack.  This reduces stack utilization of nfs_direct_read_seg
 from over 900 bytes to less than 100 bytes.

 Test-plan:
 Millions of operations with fsx-odirect.  OraSim with a direct job file.


File 14-nfs-direct-parallel-read.patch

 Subject: [PATCH] NFS: Use parallel read operations to do direct read requests

 The initial implementation of NFS direct reads was entirely synchronous.
 The direct read logic issued one NFS READ operation at a time, and waited
 for the server's reply before issuing the next one.  For large direct
 read requests, this is unnecessarily slow.

 This patch changes the NFS direct read path to dispatch NFS READ operations
 for a single direct read request in parallel and wait for them once.  The
 direct read path is still synchronous in nature, but because the NFS READ
 operations are going in parallel, the completion wait should be much shorter.

 Test-plan:
 Millions of operations with fsx-odirect.  OraSim with a direct job file and
 small rsize.  Use sio with -direct to generate large sequential reads or
 large random reads.  Check that ^C or application segmentation faults do
 not cause an oops.  Verify behavior when reading up to EOF.
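
 Example (illustrative, not taken from the patch):
 A single-threaded model of the bookkeeping implied by "dispatch all READs,
 then wait once".  Every name here is hypothetical, and real completions
 would arrive asynchronously; the sketch only shows that replies may complete
 in any order and that the byte count is computed once at the end.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_READS 16

/* Hypothetical state for one direct read split into several READs. */
struct demo_dreq {
    size_t nreads;                  /* READs dispatched */
    size_t outstanding;             /* READs not yet completed */
    size_t want[DEMO_MAX_READS];    /* bytes requested by each READ */
    size_t got[DEMO_MAX_READS];     /* bytes returned by each READ */
};

/* Dispatch phase: issue every READ up front instead of one at a time. */
void dreq_dispatch(struct demo_dreq *d, size_t count, size_t rsize)
{
    assert(rsize > 0 && (count + rsize - 1) / rsize <= DEMO_MAX_READS);
    d->nreads = 0;
    d->outstanding = 0;
    while (count > 0) {
        size_t len = count < rsize ? count : rsize;

        d->want[d->nreads] = len;
        d->got[d->nreads] = 0;
        d->nreads++;
        d->outstanding++;
        count -= len;
    }
}

/* Completion phase: record one reply; replies may arrive in any order. */
void dreq_complete(struct demo_dreq *d, size_t idx, size_t bytes)
{
    assert(idx < d->nreads && d->outstanding > 0);
    d->got[idx] = bytes;
    d->outstanding--;
}

/* Called once, after every READ has completed: count the contiguous bytes
 * from the start of the request, stopping at the first short READ (EOF). */
size_t dreq_result(const struct demo_dreq *d)
{
    size_t i, total = 0;

    assert(d->outstanding == 0);
    for (i = 0; i < d->nreads; i++) {
        total += d->got[i];
        if (d->got[i] < d->want[i])
            break;
    }
    return total;
}
```

 A 10000-byte request with a 4096-byte rsize becomes three READs; the caller
 sleeps until "outstanding" reaches zero rather than once per READ.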


File 20-rpc-socklib.patch

 Subject: [PATCH] RPC: extract socket logic common to both client and server

 Move some code that is common to both RPC client- and server-side socket
 transports into its own source file, net/sunrpc/socklib.c.

 Test-plan:
 Millions of fsx operations over UDP.  Connectathon over UDP.


File 21-rpc-xprt-switch.patch

 Subject: [PATCH] RPC: introduce client-side transport switch

 This patch introduces a generic transport switch into the kernel RPC
 client.  The RPC transport switch divorces socket-specific implementation
 from the generic pieces of the RPC client.  Such a switch will allow
 support for RPC over 10GbE, IPsec offload, multiple sockets per mount,
 IPv6, and transports capable of direct data placement.

 Here, we move the bulk of socket-specific code into a separate source file,
 net/sunrpc/clntsock.c.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as
 "sio" or "iozone".  Destructive testing (unplugging the network
 temporarily).  Connectathon with v2, v3, and v4.
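
 Example (illustrative, not taken from the patch):
 A minimal sketch of what a transport switch looks like in C: generic code
 calls through a per-transport method table and never tests the transport
 type itself.  The structure layout and all names are assumptions; the
 methods in the actual switch may differ.  The 4-byte overhead on the stream
 transport models the RPC record marker.

```c
#include <assert.h>
#include <stddef.h>

struct demo_xprt;

/* Hypothetical method table supplied by each transport implementation. */
struct demo_xprt_ops {
    const char *name;
    int (*connect)(struct demo_xprt *xprt);
    int (*send_request)(struct demo_xprt *xprt, size_t len);
};

struct demo_xprt {
    const struct demo_xprt_ops *ops;
    int connected;
    size_t wire_bytes;      /* bytes this transport put on the wire */
};

static int demo_sock_connect(struct demo_xprt *xprt)
{
    xprt->connected = 1;
    return 0;
}

/* Datagram transport: the RPC message is sent as is. */
static int demo_udp_send_request(struct demo_xprt *xprt, size_t len)
{
    xprt->wire_bytes += len;
    return 0;
}

/* Stream transport: a 4-byte record marker precedes each message. */
static int demo_tcp_send_request(struct demo_xprt *xprt, size_t len)
{
    xprt->wire_bytes += len + 4;
    return 0;
}

const struct demo_xprt_ops demo_udp_ops = {
    "udp", demo_sock_connect, demo_udp_send_request
};
const struct demo_xprt_ops demo_tcp_ops = {
    "tcp", demo_sock_connect, demo_tcp_send_request
};

/* Generic code: no knowledge of sockets, only of the method table. */
int demo_transmit(struct demo_xprt *xprt, size_t len)
{
    assert(xprt != NULL && xprt->ops != NULL);
    if (!xprt->connected) {
        int err = xprt->ops->connect(xprt);

        if (err)
            return err;
    }
    return xprt->ops->send_request(xprt, len);
}
```

 Adding a new transport in this model means supplying another method table;
 demo_transmit() does not change.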


File 22-rpc-xdr_sendpages.patch

 Subject: [PATCH] RPC: move xdr_sendpages under transport switch

 Move socket-dependent code from net/sunrpc/xdr.c to net/sunrpc/clntsock.c.
 Reduce stack utilization of the RPC send path, while we're at it.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as "sio" or
 "iozone".


File 23-rpc-switch-cleanup.patch

 Subject: [PATCH] RPC: client-side transport switch cleanup

 Clean up remaining socket-specific structure naming, remove include/socket.h
 from most RPC client source files, and change some comments to reflect the
 realities of the new RPC transport switch mechanism.  Also remove the
 cong_wait field from rpc_xprt, as it is no longer used.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as
 "sio" or "iozone".


File 24-rpc-write_space.patch

 Subject: [PATCH] RPC: separate TCP and UDP write space callbacks

 Split the socket write space callback function into a TCP version and
 UDP version, eliminating one dependence on the "xprt->stream" variable.

 Also, make both callbacks more CPU efficient by reducing the number
 of conditional branches taken in the hot path in each function.

 Test-plan:
 Write-intensive workload on a single mount point.


File 25-rpc-connect.patch

 Subject: [PATCH] RPC: separate TCP and UDP transport connection logic

 Remove the xprt_sock_create and xprt_sock_bind functions.  Create separate
 connection worker functions for managing UDP and TCP transport sockets.
 This eliminates several dependencies on "xprt->stream".

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with v2, v3, and v4.


File 26-rpc-send_request.patch

 Subject: [PATCH] RPC: separate TCP and UDP socket write paths

 Split the RPC client's main socket write path into a TCP version and a UDP
 version to eliminate another dependency on the "xprt->stream" variable.
 Rely on compiler optimization to remove conditional branches from
 xprt_sock_sendpages, as this function is now called with some constant
 arguments.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as
 "sio" or "iozone".  Examine oprofile results for any changes before and
 after this patch is applied.
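
 Example (illustrative, not taken from the patch):
 A sketch of the "constant argument" technique mentioned above.  A shared
 inline helper takes a flag; each transport-specific wrapper passes a
 compile-time constant, so an optimizing compiler can drop the untaken
 branch in each wrapper.  The helper and its argument are hypothetical and
 do not reflect the real signature of xprt_sock_sendpages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MARKER_LEN 4   /* size of a stream record marker */

/* Shared helper: returns the number of bytes that would go on the wire. */
static inline size_t demo_sendpages(size_t len, int is_stream)
{
    assert(len <= SIZE_MAX - DEMO_MARKER_LEN);
    if (is_stream)
        return len + DEMO_MARKER_LEN;
    return len;
}

/* UDP path: is_stream is the constant 0, so the branch can be removed. */
size_t demo_udp_write(size_t len)
{
    return demo_sendpages(len, 0);
}

/* TCP path: is_stream is the constant 1, so the branch can be removed. */
size_t demo_tcp_write(size_t len)
{
    return demo_sendpages(len, 1);
}
```

 Whether the branch actually disappears depends on the compiler and its
 optimization level, which is why the test plan suggests checking profiles.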


File 27-rpc-tsh_size.patch

 Subject: [PATCH] RPC: skip over transport-specific heads automatically

 Add a mechanism for skipping over transport-specific headers when constructing
 an RPC request.  This removes another "xprt->stream" dependency.

 Test-plan:
 Write-intensive workload on a single mount point.
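
 Example (illustrative, not taken from the patch):
 A sketch of skipping a transport-specific header while encoding.  The
 encoder is told how many 32-bit words to leave free at the front of the
 buffer (one word for a stream record marker, zero for a datagram) and never
 asks which transport is in use.  Function names and the parameter are
 assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>   /* htonl, ntohl */

/* Begin an RPC call message, leaving `tsh_words` words for the transport.
 * Returns the position where encoding of the rest of the call continues. */
uint32_t *demo_encode_call(uint32_t *buf, size_t tsh_words, uint32_t xid)
{
    uint32_t *p;

    assert(buf != NULL);
    p = buf + tsh_words;    /* skip the space reserved for the transport */
    *p++ = xid;             /* XID, already in network byte order */
    *p++ = htonl(0);        /* message type 0 = CALL */
    return p;
}

/* Stream transports fill in the record marker afterwards: the top bit marks
 * the last fragment, the low 31 bits give the length of what follows. */
void demo_fill_record_marker(uint32_t *buf, const uint32_t *end)
{
    size_t len;

    assert(end > buf);
    len = (size_t)(end - buf - 1) * sizeof(uint32_t);
    buf[0] = htonl(0x80000000u | (uint32_t)len);
}
```

 A datagram transport passes 0 and the message starts at the first word; a
 stream transport passes 1 and fills in word 0 once the length is known.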


File 28-rpc-xprt-stream.patch

 Subject: [PATCH] RPC: get rid of xprt->stream

 Description:
 Now we can remove the last few places that use the "xprt->stream"
 variable, and get rid of it from the rpc_xprt structure.

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.


File 29-rpc-congestion.patch

 Subject: [PATCH] RPC: transport-specific timeouts

 Description:
 This patch prepares the way to remove the "xprt->nocong" variable by
 adding callouts to the RPC client transport switch API to handle
 setting RPC timeouts.

 Test-plan:
 Use WAN simulation to cause sporadic bursty packet loss.  Look for
 significant regression in performance or client stability.


File 30-rpc-xprt-nocong.patch

 Subject: [PATCH] RPC: remove xprt->nocong

 Description:
 Get rid of the "xprt->nocong" variable.

 Test-plan:
 Use WAN simulation to cause sporadic bursty packet loss.  Look for
 significant regression in performance or client stability.


File 31-rpc-buffer.patch

 Subject: [PATCH] RPC: switchable buffer allocation

 Description:
 In the IPv4 socket transport implementation, RPC buffers are allocated as
 needed for each RPC message that is sent.  Some transport implementations
 may choose to use pre-allocated buffers for encoding, sending, receiving,
 and unmarshalling RPC messages.

 This patch adds RPC client transport switch support for replacing buffer
 management on a per-transport basis.

 Test-plan:
 Millions of fsx operations.  Performance characterization with "sio" and
 "iozone".  Use oprofile and other tools to look for significant regression
 in CPU utilization.


File 32-rpc-portmap.patch

 Subject: [PATCH] RPC: pluggable portmapping

 Description:
 Introduce new RPC client transport switch methods to handle RPC portmapping
 in a transport independent way.

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.  NFSv2/3 and NFSv4 mounting should be carefully checked.
 Probably need to rig a server where certain services aren't running, or
 that returns an error for some typical operation.


File 33-rpc-xprt_peeraddr.patch

 Subject: [PATCH] RPC: API for getting remote peer address

 Description:
 Provide an API for retrieving the remote peer address without allowing
 direct access to the rpc_xprt struct.

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.  NFSv2/3 and NFSv4 mounting should be carefully checked.
 Probably need to rig a server where certain services aren't running, or
 that returns an error for some typical operation.


File 34-rpc-xid-htonl.patch

 Subject: [PATCH] RPC: display XIDs in network order

 Description:
 Ethereal and other tools display RPC XIDs in network order.  This patch
 changes the RPC trace messages that display XIDs to print them in network
 order so they can be easily matched to XIDs that appear in Ethereal.

 Test-plan:
 Run a short program with RPC trace debugging enabled while capturing a
 packet trace with Ethereal.  Compare the output of the trace debugging
 messages with the contents of the Ethereal window.
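
 Example (illustrative, not taken from the patch):
 A user-space sketch of printing an XID so that it matches a packet trace.
 The value is converted from its on-the-wire representation before being
 formatted, which is a byte swap on little-endian hosts and a no-op on
 big-endian ones.  The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>   /* ntohl */

/* Format the 4 XID bytes, given exactly as they appear on the wire, as the
 * 8 hex digits a packet analyzer would display. */
int demo_format_xid(char *out, size_t outlen, const unsigned char wire[4])
{
    uint32_t raw;

    assert(out != NULL && wire != NULL);
    memcpy(&raw, wire, sizeof(raw));    /* raw holds the wire representation */
    return snprintf(out, outlen, "%08x", (unsigned int)ntohl(raw));
}
```

 Printing "raw" directly with %08x would show the bytes reversed on a
 little-endian client, which makes matching against a capture tedious.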


File 35-rpc-xprt-modules.patch

 Subject: [PATCH] RPC: load RPC transport implementations dynamically
 
 Description:
 Now that we have an RPC transport switch, we have introduced the potential
 to add new transport capabilities for use by the RPC client at run-time.
 This patch allows RPC client transport implementations to be loaded as
 needed, or as they become available from distributors or third-party
 vendors.

 This patch is experimental.  It is safe to apply, but is functionally
 incomplete.  Currently it is acting only as a placeholder to collect
 changes related to transport module loading.

 Test-plan:
 Build kernel with NFS and SUNRPC in a module.  Try loading and unloading
 the IPv4 transport module.  Try mounting without the IPv4 module loaded.
 Destructive testing.  Try unloading SUNRPC or the IPv4 transport module
 while there are active NFS mounts.


File 36-rpc-tk_auth.patch

 Subject: [PATCH] RPC: add function to get tk_auth from rpc_rqst pointer

 Description:
 Create a standardized way to derive the appropriate tk_auth for a given
 rpc_rqst.  This is cleanup required for the next patch.

 Test-plan:
 None.


File 37-rpc_rqst.patch

 Subject: [PATCH] RPC: make rpc_rqst a part of rpc_task
 
 Description:
 The RPC client allocates rpc_rqst structures from slots in a static
 table.  The size of this slot table determines how many RPC requests
 can be in flight concurrently for that rpc_clnt (NFS mount point).

 Previously, this static slot table was fixed in size, and contained
 just 16 slots for every rpc_clnt.  In recent kernels, logic was added
 to allow this static slot table to be sized and allocated dynamically
 at the time an rpc_clnt is created.  The slot table can now be made as
 large as 128 slots to permit many more concurrent RPC requests, but the
 memory remains allocated until each rpc_clnt is destroyed, even when
 there are no outstanding RPC requests.

 NFSv4.1 sessions provide the ability to negotiate the maximum number of
 concurrent requests that the transport and the server can handle.  The
 maximum can be increased or decreased during a transport session.

 This patch makes RPC requests a part of the RPC task structure.  This
 reduces memory fragmentation by keeping all the pieces of an NFS and
 RPC request together.  For asynchronous reads and writes, this means
 the RPC request, RPC task, and nfs_read/write_data structures reside
 in the same CPU cache lines, and will move between caches together.
 RPC slot tables are eliminated entirely, preventing memory from being
 held even when it's not being used, and allowing the maximum number of
 outstanding RPC requests to be modulated dynamically.

 The call_reserve path has been simplified.  Its only jobs now are to
 initialize each rpc_rqst structure, or to queue RPC tasks on the xprt's
 backlog queue if the maximum number per transport has been exceeded.

 We no longer have a free list, and each XID can be allocated while holding
 the xprt_lock.  Thus we can do away with the reserve_lock entirely.

 Test-plan:
 Heavy multi-threaded tests like "sio" and "iozone" on SMP systems.  Watch
 for significant regression in CPU utilization with oprofile and other tools.
 Watch for multiple RPCs with the same XID.  Destructive testing.


File 38-rpc-tk_rqstp.patch

 Subject: [PATCH] RPC: remove tk_rqstp field from rpc_task
 
 Description:
 Now that the rpc_rqst structure is part of rpc_task, there is no need
 for a field in rpc_task that points to the associated RPC request slot.

 This patch removes the tk_rqstp field.

 Test-plan:
 None.


File 39-rpc-rq_task.patch

 Subject: [PATCH] RPC: remove rq_task field from rpc_rqst
 
 Description:
 Now that the rpc_rqst structure is part of rpc_task, there is no need
 for a field in rpc_rqst that points to its associated rpc_task.

 This patch removes the rq_task field.

 Test-plan:
 None.
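
 Example (illustrative, not taken from the patch):
 A sketch of why neither back pointer is needed once the request is embedded
 in the task.  The task reaches its request by member access, and the
 request reaches its task by subtracting the member's offset.  The demo_
 structures are simplified stand-ins; only the member name tk_rqst is taken
 from this patch series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_rqst {
    uint32_t rq_xid;
};

struct demo_task {
    int tk_status;
    struct demo_rqst tk_rqst;   /* embedded, not pointed to */
};

/* task -> request: plain member access replaces a stored pointer. */
struct demo_rqst *demo_task_rqst(struct demo_task *task)
{
    assert(task != NULL);
    return &task->tk_rqst;
}

/* request -> task: step back by the member's offset within the task. */
struct demo_task *demo_rqst_task(struct demo_rqst *req)
{
    assert(req != NULL);
    return (struct demo_task *)((char *)req -
                                offsetof(struct demo_task, tk_rqst));
}
```

 The second conversion is only valid for requests that really are embedded
 in a task, which is the invariant the preceding patches establish.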


File 40-rpc-rq_xprt.patch

 Subject: [PATCH] RPC: use rq_xprt instead of tk_xprt
 
 Description: Minor performance optimization.

 Now that the rpc_rqst structure is part of rpc_task, we can replace
 this double indirection (task->tk_client->cl_xprt) with a single
 indirection (task->tk_rqst.rq_xprt) to save a load instruction and
 eliminate one address generation interlock (AGI) in several places.

 Test-plan:
 None.


File 41-rpc-rq_rtt.patch

 Subject: [PATCH] RPC: add rq_rtt field to rpc_rqst
 
 Description: Minor performance optimization.

 Currently it is cumbersome to derive the RTT data from the address
 of an RPC request.  This patch adds an rq_rtt field which is a copy
 of the RPC client's cl_rtt field.  That makes it easier and cleaner
 to find the RTT data when needed, and reduces the number of loads
 and AGIs in several places.

 Test-plan:
 None.


File 60-nfs-noac.patch

 Subject: [PATCH] NFS: use attribute timeout instead of "noac" mount option

 The behavior enabled by the "noac" mount option should be precisely
 equivalent to setting acreg{min,max} and acdir{min,max} to zero via
 mount options.

 Test-plan:
 Compare behavior of mounts with "actimeo=0" and "noac".


File 61-nfs_readdir-sched.patch

 Subject: [PATCH] NFS: readdir pre-emption

 The NFS directory logic provides several pre-emption points to allow
 other work on the system to proceed during the potentially lengthy
 searches in the NFS directory cache.

 However, the pre-emption logic is almost never triggered because it
 waits for 200 loop iterations before calling schedule().  On most
 4KB-per-page clients, it is nearly impossible to get 200 directory
 entries into a single page.

 This patch adds more frequent pre-emption to the readdir and cached
 lookup paths.  Pre-emption will occur about once per page during
 multi-page scans, and not at all if only a single page is involved.

 Category: Performance scalability enhancement

 Test-plan:
 Performance characterization of a directory scan like "find" while
 running a multi-threaded workload.
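
 Example (illustrative, not taken from the patch):
 A sketch of the "once per page" yield policy.  The scan offers a yield
 point each time it crosses into a new page and never inside the first page,
 so a single-page scan does not yield at all.  The flat entry array, the
 callback, and all names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*demo_yield_fn)(void *ctx);

/* Demo yield hook: counts how often the scan offered to reschedule. */
void demo_count_yield(void *ctx)
{
    (*(size_t *)ctx)++;
}

/* Scan `nentries` entries laid out `per_page` to a page.  Returns the index
 * of `target`, or -1 if it is not present. */
long demo_scan(const unsigned long *entries, size_t nentries, size_t per_page,
               unsigned long target, demo_yield_fn yield_fn, void *ctx)
{
    size_t i;

    assert(per_page > 0);
    for (i = 0; i < nentries; i++) {
        /* Page boundary: offer a yield before touching the next page. */
        if (i > 0 && i % per_page == 0 && yield_fn != NULL)
            yield_fn(ctx);
        if (entries[i] == target)
            return (long)i;
    }
    return -1;
}
```

 Tying the yield to page boundaries rather than to a fixed iteration count
 keeps the policy independent of how many entries fit in a page.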


File 62-nfs_readdir-hint.patch

 Subject: [PATCH] NFS: getdents(2) hints

 When an application invokes getdents(2) on a directory stored in
 NFS, the directory cache logic always searches from the beginning
 of the directory to find the cookie in question.  For large
 directories, this is significant overhead, and means that a single
 walk through the directory using getdents(2) calls can take
 O(n^2) time rather than O(n).

 This patch adds a page index hint to the directory search algorithm
 so that getdents(2) can start where it left off, rather than walking
 the entire directory from the beginning each time it is called.

 Category: Scalability enhancement

 Test-plan:
 Connectathon, "rm -rf" on a large directory tree; tar and untar.
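
 Example (illustrative, not taken from the patch):
 A sketch of a search that starts at a remembered page index.  The search
 begins at the page where the previous lookup succeeded and wraps around to
 the front only if needed, so a sequential walk touches each page about once
 instead of rescanning from the start on every call.  The structure, the
 pages_visited counter, and the wrap-around fallback are assumptions made
 for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cached directory: cookies stored in order, per_page each. */
struct demo_dir {
    const unsigned long *cookies;
    size_t nentries;
    size_t per_page;
    size_t hint_page;       /* page where the last search succeeded */
    size_t pages_visited;   /* instrumentation for the example only */
};

/* Return the index of `cookie`, or -1 if it is not cached. */
long demo_find_cookie(struct demo_dir *d, unsigned long cookie)
{
    size_t npages, start, n;

    assert(d != NULL && d->per_page > 0);
    npages = (d->nentries + d->per_page - 1) / d->per_page;
    start = d->hint_page < npages ? d->hint_page : 0;

    for (n = 0; n < npages; n++) {
        size_t pg = (start + n) % npages;   /* begin at the hint, then wrap */
        size_t i = pg * d->per_page;
        size_t end = i + d->per_page;

        if (end > d->nentries)
            end = d->nentries;
        d->pages_visited++;
        for (; i < end; i++) {
            if (d->cookies[i] == cookie) {
                d->hint_page = pg;          /* remember for the next call */
                return (long)i;
            }
        }
    }
    return -1;
}
```

 With eight cookies at two per page, looking up the fifth cookie from a cold
 hint visits three pages, and the following lookup of the sixth visits one.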