Patchset for 2.6.8.1

Published on Wed Aug 18 00:02:57 EDT 2004

The following patches are against 2.6.8.1. Included for reference are the latest version of the RPC transport switch design document and a set of notes describing what's next.


File 01-blocksize.patch

 Subject: [PATCH] NFS: use reasonable block size when copying files
 
 Category: Reversion, new feature

 Description:
 In 2.4, NFS O_DIRECT used the VFS's O_DIRECT logic to provide direct I/O
 support for NFS files.  The 2.4 VFS O_DIRECT logic was block based, thus
 the NFS client had to provide a minimum allowable blocksize for O_DIRECT
 reads and writes on NFS files.  For various reasons we chose 512 bytes.

 In 2.6, there is no requirement for a minimum blocksize.  NFS O_DIRECT
 reads and writes can go to any byte at any offset in a file.  Thus we
 revert the blocksize setting for NFS file systems to the previous
 behavior, which was to advertise the "wsize" setting as the optimal I/O
 block size.  This improves the performance of applications like 'cp'
 which use this value as their transfer size.

 This patch also exposes the server's reported disk block size in the
 f_frsize of the vfsstat structure.
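
 As an illustration only, a user-space tool might pick up these two values
 as sketched below.  The sketch assumes the advertised optimal I/O size
 surfaces in st_blksize (where tools such as 'cp' conventionally look) and
 that the server's block size is what statvfs(3) reports in f_frsize.  The
 function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/statvfs.h>

/* Return the preferred I/O transfer size for `path`, as reported in
 * st_blksize, or `fallback` if stat() fails or reports nothing useful. */
size_t preferred_io_size(const char *path, size_t fallback)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0 || st.st_blksize <= 0)
        return fallback;
    return (size_t)st.st_blksize;
}

/* Return the fundamental block size (f_frsize) of the file system that
 * holds `path`, or 0 if statvfs() fails. */
unsigned long fundamental_block_size(const char *path)
{
    struct statvfs vfs;

    assert(path != NULL);
    if (statvfs(path, &vfs) != 0)
        return 0;
    return vfs.f_frsize;
}
```

 A copy loop would then size its buffer with preferred_io_size() rather
 than a hard-coded constant.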

 Test-plan:
 Standard performance regression tests using 'cp' and 'dd'.


File 02-nfs_put_super.patch

 Subject: [PATCH] NFS: mount failure recovery cleanup

 Category: Code re-organization, maintainability

 Description:
 Simplify mount error recovery logic.  Get rid of nfs_put_super.

 Test-plan:
 We don't have any good mount test cases at this time.  However, we should
 make certain that NFSv2/3 and NFSv4 mounting is carefully checked.
 Probably need to rig a server where certain services aren't running, or
 that returns an error for some typical operations.


File 03-nfsi-req-lock.patch

 Subject: [PATCH] NFS: break single global lock into per-inode lock

 Category: Performance scalability enhancement

 Description:
 Break the nfs_wreq_lock into per-inode locks.  This helps prevent a heavy
 read or write workload on one file from interfering with workloads against
 other NFS files.

 Note that there is still some serialization due to the big kernel lock.

 Test-plan:
 Run multi-threaded multi-file tests on large-scale SMP and NUMA clients.
 Look for performance regressions or stability problems, such as hanging
 mount points or applications, oopses, or system deadlocks.


File 04-nfs_fh_compare.patch

 Subject: [PATCH] NFS: compare file handles efficiently

 Category: Performance scalability enhancement

 Description:
 NFS file handles can be as large as 128 bytes, but are most commonly no
 more than 32 bytes.  While the storage container for NFS file handles
 must be able to store a maximum of 128 bytes, it should be necessary to
 compare only the valid bytes between two file handles, and not the extra
 pad bytes.

 This patch creates an efficient standard mechanism for comparing NFS file
 handles that ignores the unused bytes in a file handle container.  This
 reduces the size of most file handle comparisons from all 128 bytes in
 the storage container to about 32 bytes on average.
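
 A minimal user-space sketch of the idea follows.  The structure layout
 and the names fh, FH_MAXSIZE, and fh_compare are stand-ins chosen for
 illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <string.h>

#define FH_MAXSIZE 128  /* container capacity, per the description above */

/* Hypothetical file handle container: `size` valid bytes in `data`. */
struct fh {
    unsigned short size;
    unsigned char  data[FH_MAXSIZE];
};

/* Return 0 if the two handles match, nonzero otherwise.  Only the valid
 * bytes are examined; padding beyond `size` is never read. */
int fh_compare(const struct fh *a, const struct fh *b)
{
    assert(a->size <= FH_MAXSIZE && b->size <= FH_MAXSIZE);
    if (a->size != b->size)
        return 1;
    return memcmp(a->data, b->data, a->size) != 0;
}
```

 Checking the lengths first means handles of different sizes are rejected
 without touching the data at all.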

 Test-plan:
 All connectathon tests should pass with NFSv2, v3, and v4.  Watch for CPU
 utilization regressions as measured by oprofile and other tools.


File 05-nfs_copy_fh.patch

 Subject: [PATCH] NFS: copy file handles efficiently

 Category: Performance scalability enhancement

 Description:
 Now that file handle comparison ignores the unused parts of the file
 handle container, there is no longer any need to clear each NFS file
 handle container before copying in a new file handle.  This allows the
 removal of a 128 byte memset() from several hot paths.
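
 The copy can be sketched as below, again with a stand-in structure and
 hypothetical names (fh, FH_MAXSIZE, fh_copy).  Stale bytes past the new
 length are left in place, which is harmless only because comparisons
 never look at them.

```c
#include <assert.h>
#include <string.h>

#define FH_MAXSIZE 128

/* Hypothetical file handle container: `size` valid bytes in `data`. */
struct fh {
    unsigned short size;
    unsigned char  data[FH_MAXSIZE];
};

/* Copy only the valid bytes of `src`.  The destination is deliberately
 * not cleared first. */
void fh_copy(struct fh *dst, const struct fh *src)
{
    assert(dst != NULL && src != NULL && src->size <= FH_MAXSIZE);
    dst->size = src->size;
    memcpy(dst->data, src->data, src->size);
}
```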

 Test-plan:
 All connectathon tests should pass with NFSv2, v3, and v4.  Watch for CPU
 utilization regressions as measured by oprofile and other tools.


File 06-nfs-short-write-msg.patch

 Subject: [PATCH] NFS: short write warning

 Category: Serviceability

 Description:
 Recently a patch set was submitted to allow the Linux NFS client to handle
 short writes by retrying the unwritten portion of the request.  The only
 case that now results in an error is when the server makes no progress;
 that is, writes zero bytes.

 This patch modifies the kernel log warning that is generated in that case
 to reflect more accurately the error condition.

 Test-plan:
 None.


File 07-nfs-odirect-read.patch

 Subject: [PATCH] NFS: Use asynchronous reads to handle direct read requests

 Category: Performance enhancement

 Description:
 The initial implementation of NFS direct reads was entirely synchronous.
 The direct read logic issued one NFS READ operation at a time, and waited
 for the server's reply before issuing the next one.  For large direct
 read requests, this is very slow.

 This patch changes the NFS direct read path to dispatch NFS READ
 operations for a single direct read request in parallel and wait for them
 once.  The direct read path is still synchronous in nature, but because
 the NFS READ operations are going in parallel, the completion wait should
 be much shorter.
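
 The first step of such a dispatch is carving one request into READs of
 at most rsize bytes, all of which can be issued before waiting.  The
 sketch below shows only that splitting step; read_chunk and
 split_direct_read are hypothetical names, and the actual patch may
 organize this differently.

```c
#include <assert.h>
#include <stddef.h>

/* One NFS READ worth of work (hypothetical descriptor). */
struct read_chunk {
    unsigned long long offset;
    size_t             count;
};

/* Split the byte range [offset, offset + count) into READs of at most
 * `rsize` bytes each.  Returns the number of chunks stored, or 0 if the
 * request is empty or needs more than `max_chunks` entries. */
size_t split_direct_read(unsigned long long offset, size_t count,
                         size_t rsize, struct read_chunk *chunks,
                         size_t max_chunks)
{
    size_t n = 0;

    assert(rsize > 0 && chunks != NULL);
    while (count > 0) {
        size_t len = count < rsize ? count : rsize;

        if (n == max_chunks)
            return 0;
        chunks[n].offset = offset;
        chunks[n].count = len;
        n++;
        offset += len;
        count -= len;
    }
    return n;
}
```

 With every chunk known up front, the caller can start all of the READs
 and then block a single time until the last one completes.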

 Test-plan:
 Millions of operations with fsx-odirect.  OraSim with a direct job file
 and small rsize.  Use sio with -direct to generate large sequential reads
 or large random reads.


File 08-nfs-odirect-write.patch

 Subject: [PATCH] NFS: Direct writes allocate rpc_task on the stack

 Category: Stability enhancement

 Description:
 Kernel stack utilization is on a diet.  Reduce stack utilization in the
 NFS direct write path by using a dynamically allocated nfs_write_data
 structure instead of allocating one on the stack.

 Test-plan:
 Millions of operations with fsx-odirect.  OraSim with a direct job file.


File 09-rpc_call_sync.patch

 Subject: [PATCH] RPC: Synchronous RPC calls allocate rpc_task on the stack

 Category: Stability enhancement

 Description:
 Reduce stack utilization for all synchronous NFS operations by using a
 dynamically allocated rpc_task structure instead of allocating one on
 the stack.  This reduces stack utilization by over 200 bytes for all
 synchronous NFS operations.

 Test-plan:
 Performance regression tests that emphasize synchronous metadata operations
 such as LOOKUP and GETATTR.  Examine client-side CPU utilization using
 kernel profiling tools such as oprofile.


File 20-nfs_readdir-hint.patch

 Subject: [PATCH] NFS: getdents(2) hints

 Category: Scalability enhancement

 Description:
 When an application invokes getdents(2) on a directory stored in
 NFS, the directory cache logic always searches from the beginning
 of the directory to find the cookie in question.  For large
 directories, this is significant overhead, and means that a single
 walk through the directory using getdents(2) calls can take
 O(n^2) time.

 This patch adds a page index hint to the directory search algorithm
 so that getdents(2) can start where it left off, rather than walking
 the entire directory from the beginning each time it is called.
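
 The effect of a hint can be sketched with a toy directory cache.  The
 names here (dir_page, dir_cache, find_cookie) and the wrap-around search
 order are assumptions made for illustration; the patch itself may fall
 back differently when the hint misses.

```c
#include <assert.h>
#include <stddef.h>

#define ENTRIES_PER_PAGE 4

struct dir_page {
    unsigned long long cookies[ENTRIES_PER_PAGE];
};

struct dir_cache {
    const struct dir_page *pages;
    size_t npages;
    size_t hint;           /* page where the previous search ended */
    size_t pages_scanned;  /* instrumentation for this sketch only */
};

static int page_has_cookie(const struct dir_page *p,
                           unsigned long long cookie)
{
    for (size_t i = 0; i < ENTRIES_PER_PAGE; i++)
        if (p->cookies[i] == cookie)
            return 1;
    return 0;
}

/* Find the page holding `cookie`, starting at the hint and wrapping
 * around.  Returns the page index, or -1 if no page holds it. */
long find_cookie(struct dir_cache *dc, unsigned long long cookie)
{
    assert(dc != NULL);
    for (size_t n = 0; n < dc->npages; n++) {
        size_t idx = (dc->hint + n) % dc->npages;

        dc->pages_scanned++;
        if (page_has_cookie(&dc->pages[idx], cookie)) {
            dc->hint = idx;   /* next search resumes here */
            return (long)idx;
        }
    }
    return -1;
}
```

 In a sequential walk, each lookup usually hits in the hinted page or the
 one after it, so the total number of pages scanned grows linearly with
 directory size instead of quadratically.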

 Test-plan:
 Connectathon, "rm -rf" on a large directory tree; tar and untar.


File 21-nfs_readdir-eof.patch

 Subject: [PATCH] NFS: optimize out READDIR operation on empty directories

 Category: Performance scalability enhancement

 Description:
 NFS directory cookies are opaque and unordered.  Thus nfs_readdir()
 always queries the server to determine which is the next cookie to
 return when it is asked about a cookie that is not in its cache.

 If the directory is empty, then it is already clear the cookie does
 not exist, so we don't need to do a READDIR operation in that case.
 This eliminates an NFS READDIR that occurs at the end of every
 directory during "rm -rf".

 Test-plan:
 Connectathon, "rm -rf" on a large directory tree; tar and untar.


File 22-nfs_readdir-sched.patch

 Subject: [PATCH] NFS: readdir pre-emption

 Category: Performance scalability enhancement

 Description:
 The NFS directory logic provides several pre-emption points to allow
 other work on the system to proceed during the potentially lengthy
 searches in the NFS directory cache.

 However, the pre-emption logic is almost never triggered because it
 waits for 200 loop iterations before calling schedule().  On most
 4KB-per-page clients, it is nearly impossible to get 200 directory
 entries into a single page.

 This patch adds more frequent pre-emption to the readdir and cached
 lookup paths.  Pre-emption will occur about once per page during
 multi-page scans, and not at all if only a single page is involved.

 Test-plan:
 Performance characterization of a directory scan like "find" while
 running a multi-threaded workload.


File 23-nfs-dir-printk.patch

 Subject: [PATCH] NFS: directory trace messages

 Category: Serviceability

 Description:
 Those who use pre-built kernels from a distribution can't add more trace
 messages to a kernel when diagnosing a problem in the field.  On the other
 hand, we don't want so many trace messages that the kernel log is
 overwhelmed with noise when tracing is enabled.

 This patch reuses NFSDBG_DIRCACHE and NFSDBG_LOOKUPCACHE to provide
 additional diagnostic messages that trace the operation of the NFS
 client's directory cache and lookup cache.  A few other messages are
 now generated when NFSDBG_VFS is active, as well.

 Test-plan:
 Enable NFS trace debugging with flags 1, 2, or 4.  You should be able to
 see different types of trace messages with each flag setting.


File 24-nfs-decode-dirent.patch

 Subject: [PATCH] NFS: more efficient NFSv3 directory decoding

 Category: Performance scalability enhancement

 Description:
 A returned NFSv3 READDIRPLUS directory entry is more complicated than
 a READDIR directory entry.  Sometimes, when walking through a
 directory that was read in via READDIRPLUS, it is necessary to roll
 back to the previous entry because there is not enough room in the
 allotted buffer to decode a full entry along with attributes and file
 handle.

 To allow this roll back to occur, each time an NFSv3 READDIRPLUS
 directory entry is decoded, the old version of the current 54-byte
 entry is saved in its entirety in the current stack frame, even if
 the entry comes from READDIR, and not from READDIRPLUS.

 This patch replaces the save operation in the hot path of NFSv3
 directory entry decoding with a single u64 copy.  The u64 copy
 operation should be sufficient to allow the roll back in the rare
 case where a buffer overflow occurs.

 Test-plan:
 Walking a 20K entry directory should take less system CPU time, as
 measured with oprofile.


File 25-nfs-neg-lookup-cache.patch

 Subject: [PATCH] NFS: negative lookup caching

 Category: Performance scalability enhancement

 Description:
 In 2.6, the NFS client uses the readdir cache to eliminate on-the-wire
 lookup operations in NFS version 3.  The version 3 READDIRPLUS
 operation can return file handles and attributes for each directory
 entry.  However, the lookup cache does not distinguish between "cache
 valid, entry not found" and "cache invalid".  This patch adds that
 capability.

 Using this new feature, we add support for revalidating negative
 dentries using the lookup cache.
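
 The distinction can be expressed as a three-way result.  The sketch
 below is a user-space illustration with hypothetical names
 (lookup_result, lookup_cache, cached_lookup) and is not the kernel
 interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum lookup_result {
    LOOKUP_CACHE_INVALID,  /* cache cannot answer; ask the server */
    LOOKUP_FOUND,          /* cache valid, name present */
    LOOKUP_NOT_FOUND       /* cache valid, name definitely absent */
};

/* Hypothetical cached directory listing. */
struct lookup_cache {
    int valid;
    const char *const *names;
    size_t nnames;
};

enum lookup_result cached_lookup(const struct lookup_cache *lc,
                                 const char *name)
{
    assert(lc != NULL && name != NULL);
    if (!lc->valid)
        return LOOKUP_CACHE_INVALID;
    for (size_t i = 0; i < lc->nnames; i++)
        if (strcmp(lc->names[i], name) == 0)
            return LOOKUP_FOUND;
    return LOOKUP_NOT_FOUND;
}
```

 Only LOOKUP_CACHE_INVALID requires an on-the-wire LOOKUP; a
 LOOKUP_NOT_FOUND result lets a negative dentry be revalidated locally.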

 Test-plan:
 Connectathon, "rm -rf" and tar/untar.  Multi-client software builds.


File 26-nfs-dir-cleanup.patch

 Subject: [PATCH] NFS: directory cleanups

 Category: Maintainability

 Description:
 This patch provides various cleanups and comment corrections in the
 NFS directory logic.

 Test-plan:
 None.


File 40-xprt-switch.patch

 Subject: [PATCH] RPC: introduce client-side transport switch

 Category: Code re-organization

 Description:
 This patch introduces a transport switch into the kernel RPC client.
 The RPC transport switch divorces socket-specific implementation logic
 from the generic pieces of the RPC client.  Such a switch will allow
 support for RPC over 10GbE, IPsec offload, multiple sockets per mount,
 IPv6, and transports capable of direct data placement.

 The first patch in the series moves the bulk of socket-specific code
 into a separate source file, net/sunrpc/ipv4_sock.c.  The patch attempts
 to avoid any functional changes or rewrites.
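
 In general terms, a transport switch is a table of function pointers
 that the generic code calls through.  The sketch below illustrates the
 pattern in user space; the method names, the xprt layout, and the toy
 "loopback" transport are all hypothetical and are not taken from the
 patch.

```c
#include <assert.h>
#include <stddef.h>

struct xprt;

/* Hypothetical per-transport method table. */
struct xprt_ops {
    const char *name;
    int (*connect)(struct xprt *xprt);
    int (*send_request)(struct xprt *xprt, const void *buf, size_t len);
};

struct xprt {
    const struct xprt_ops *ops;
    int connected;
    size_t bytes_sent;
};

/* Generic code: it knows nothing about sockets and only calls through
 * the method table. */
int xprt_transmit(struct xprt *xprt, const void *buf, size_t len)
{
    assert(xprt != NULL && xprt->ops != NULL);
    if (!xprt->connected) {
        int err = xprt->ops->connect(xprt);

        if (err)
            return err;
    }
    return xprt->ops->send_request(xprt, buf, len);
}

/* A toy transport that merely counts the bytes it is asked to send. */
static int loop_connect(struct xprt *xprt)
{
    xprt->connected = 1;
    return 0;
}

static int loop_send(struct xprt *xprt, const void *buf, size_t len)
{
    (void)buf;
    xprt->bytes_sent += len;
    return 0;
}

const struct xprt_ops loopback_ops = { "loopback", loop_connect, loop_send };
```

 Adding a new transport then means supplying another xprt_ops table, with
 no change to xprt_transmit().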

 Test-plan:
 Millions of fsx operations.  Performance characterization such as
 "sio" or "iozone".  Destructive testing (unplugging the network
 temporarily).  Connectathon with v2, v3, and v4.


File 41-xdr_sendpages.patch

 Subject: [PATCH] RPC: move xdr_sendpages under transport switch

 Category: Code re-organization

 Description:
 This patch removes socket-dependent code from net/sunrpc/xdr.c and adds it
 to net/sunrpc/sock.c.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as "sio" or
 "iozone".


File 42-xprt-switch-cleanup.patch

 Subject: [PATCH] RPC: client-side transport switch cleanup

 Category: Code re-organization

 Description:
 This patch cleans up remaining socket-specific structure naming, removes
 include/socket.h from most RPC client source files, and changes some
 comments to reflect the realities of the new RPC transport switch
 mechanism.  It also removes the cong_wait field from rpc_xprt, as it
 is no longer used.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as
 "sio" or "iozone".


File 43-write_space-tcp-udp.patch

 Subject: [PATCH] RPC: separate TCP and UDP write space callbacks

 Category: Code re-organization, maintainability

 Description:
 This patch splits the socket write space callback function into a TCP
 version and UDP version, eliminating one dependence on the "xprt->stream"
 variable.

 It also makes both callbacks more CPU efficient by reducing the number
 of conditional branches taken in the hot path in each function.

 Test-plan:
 Write-intensive workload on a single mount point.


File 44-connect-tcp-udp.patch

 Subject: [PATCH] RPC: separate TCP and UDP transport connection logic

 Category: Code re-organization, maintainability

 Description:
 This patch splits the RPC client's connection logic into separate paths
 for UDP and TCP, eliminating another dependency on the "xprt->stream"
 variable.

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with v2, v3, and v4.


File 45-send_request-tcp-udp.patch

 Subject: [PATCH] RPC: separate TCP and UDP socket write paths

 Category: Code re-organization, maintainability

 Description:
 This patch splits the RPC client's main socket write path into a TCP version
 and a UDP version to eliminate another dependency on the "xprt->stream"
 variable.

 Test-plan:
 Millions of fsx operations.  Performance characterization such as
 "sio" or "iozone".  Examine oprofile results for any changes before and
 after this patch is applied.


File 46-tsh_size.patch

 Subject: [PATCH] RPC: skip over transport-specific headers automatically

 Category: Code re-organization, maintainability

 Description:
 Add a mechanism for skipping over transport-specific headers when constructing
 an RPC request.  This removes another "xprt->stream" dependency, and gets rid
 of a conditional branch in the RPC send hot path.
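
 For example, RPC over TCP prefixes each record with a 4-byte record
 marker, while RPC over UDP has no such prefix.  One way to avoid the
 branch is to store the header size per transport and add it
 unconditionally, as sketched below.  The field name tsh_size is
 borrowed from the patch file name; its unit (32-bit words) and the
 other names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slice of a transport structure. */
struct xprt_hdr {
    size_t tsh_size;  /* transport-specific header size, in 32-bit words */
};

/* Return where the RPC header should begin within the send buffer.
 * There is no "is this a stream transport?" test; a transport without
 * a header simply sets tsh_size to zero. */
uint32_t *rpc_header_start(const struct xprt_hdr *xprt, uint32_t *buf)
{
    assert(xprt != NULL && buf != NULL);
    return buf + xprt->tsh_size;
}
```

 A TCP transport would set tsh_size to 1 and a UDP transport to 0.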

 Test-plan:
 Write-intensive workload on a single mount point.


File 47-xprt-stream.patch

 Subject: [PATCH] RPC: get rid of xprt->stream

 Category: Code re-organization, maintainability

 Description:
 Now we can remove the last few places that use the "xprt->stream"
 variable, and get rid of it from the rpc_xprt structure.

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.


File 48-congestion.patch

 Subject: [PATCH] RPC: transport-specific timeouts

 Category: Code re-organization, maintainability

 Description:
 This patch prepares the way to remove the "xprt->nocong" variable by
 adding callouts to the RPC client transport switch API to handle
 setting RPC timeouts.

 Test-plan:
 Use WAN simulation to cause sporadic bursty packet loss.  Look for
 significant regression in performance or client stability.


File 49-xprt-nocong.patch

 Subject: [PATCH] RPC: remove xprt->nocong

 Category: Code re-organization, maintainability

 Description:
 Get rid of the "xprt->nocong" variable.

 Test-plan:
 Use WAN simulation to cause sporadic bursty packet loss.  Look for
 significant regression in performance or client stability.


File 50-xprt-buffer.patch

 Subject: [PATCH] RPC: switchable buffer allocation

 Category: Code re-organization

 Description:
 In the IPv4 socket transport implementation, RPC buffers are allocated as
 needed for each RPC message that is sent.  Some transport implementations
 may choose to use pre-allocated buffers for encoding, sending, receiving,
 and unmarshalling RPC messages.

 This patch adds RPC client transport switch support for replacing buffer
 management on a per-transport basis.

 Test-plan:
 Millions of fsx operations.  Performance characterization with "sio" and
 "iozone".  Use oprofile and other tools to look for significant regression
 in CPU utilization.


File 51-rpc_task-cleanup.patch

 Subject: [PATCH] RPC: rpc_task cleanup

 Category: Code re-organization

 Description:
 Move some retry counters from the rpc_task structure to the rpc_rqst
 structure.  These count request retry failures, and belong to each
 request rather than to each task.

 Test-plan:
 None.


File 52-xprt-portmap.patch

 Subject: [PATCH] RPC: pluggable portmapping

 Category: Code re-organization, maintainability

 Description:
 Introduce new RPC client transport switch methods to handle RPC portmapping
 in a transport independent way.

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.  NFSv2/3 and NFSv4 mounting should be carefully checked.
 Probably need to rig a server where certain services aren't running, or
 that returns an error for some typical operation.


File 53-xprt_peeraddr.patch

 Subject: [PATCH] RPC: API for getting remote peer address

 Category: Code re-organization, maintainability

 Description:
 Provide an API for retrieving the remote peer address without allowing
 direct access to the rpc_xprt struct.

 Test-plan:
 Destructive testing (unplugging the network temporarily).  Connectathon
 with UDP and TCP.  NFSv2/3 and NFSv4 mounting should be carefully checked.
 Probably need to rig a server where certain services aren't running, or
 that returns an error for some typical operation.


File 54-rpc-xid-htonl.patch

 Subject: [PATCH] RPC: display XIDs in network order

 Category: Serviceability

 Description:
 Ethereal and other tools display RPC XIDs in network order.  This patch
 changes the RPC trace messages that display XIDs to print them in network
 order so they can be easily matched to XIDs that appear in Ethereal.
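
 The conversion can be sketched in user space as follows, assuming the
 XID is held in memory exactly as it appears on the wire.  The function
 name format_wire_xid is hypothetical.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format the four XID bytes, as they appear on the wire, so the text
 * matches what a packet analyzer displays regardless of host byte
 * order.  Returns the snprintf() result. */
int format_wire_xid(char *out, size_t outlen, const unsigned char wire[4])
{
    uint32_t xid_net;

    assert(out != NULL && wire != NULL && outlen >= 9);
    memcpy(&xid_net, wire, sizeof(xid_net));
    return snprintf(out, outlen, "%08x", (unsigned int)ntohl(xid_net));
}
```

 Printing xid_net directly with "%08x" would show the bytes reversed on
 a little-endian client, which is the mismatch this patch addresses.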

 Test-plan:
 Run a short program with RPC trace debugging enabled while capturing a
 packet trace with Ethereal.  Compare the output of the trace debugging
 messages with the contents of the Ethereal window.


File 55-rpc-xprt-modules.patch

 Subject: [PATCH] RPC: load RPC transport implementations dynamically
 
 Category: Experimental

 Description:
 Now that we have an RPC transport switch, we have introduced the potential
 to add new transport capabilities for use by the RPC client at run-time.
 This patch allows RPC client transport implementations to be loaded as
 needed, or as they become available from distributors or third-party
 vendors.

 This patch is experimental.  It is safe to apply, but is functionally
 incomplete.  Currently it is acting only as a placeholder to collect
 changes related to transport module loading.

 Test-plan:
 Build kernel with NFS and SUNRPC in a module.  Try loading and unloading
 the IPv4 transport module.  Try mounting without the IPv4 module loaded.
 Destructive testing.  Try unloading SUNRPC or the IPv4 transport module
 while there are active NFS mounts.


File 56-tk_auth.patch

 Subject: [PATCH] RPC: add function to get tk_auth from rpc_rqst pointer

 Category: Minor code re-organization

 Description:
 Create a standardized way to derive the appropriate tk_auth for a given
 rpc_rqst.  This cleanup is required for the next patch.

 Test-plan:
 None.


File 57-rpc_rqst.patch

 Subject: [PATCH] RPC: make rpc_rqst a part of rpc_task
 
 Category: Performance scalability enhancement

 Description:
 The RPC client allocates rpc_rqst structures from slots in a static
 table.  The size of this slot table determines how many RPC requests
 can be in flight concurrently for that rpc_clnt (NFS mount point).

 Previously, this static slot table was fixed in size, and contained
 just 16 slots for every rpc_clnt.  In recent kernels, logic was added
 to allow this static slot table to be sized and allocated dynamically
 at the time an rpc_clnt is created.  The slot table can now be made as
 large as 128 slots to permit many more concurrent RPC requests, but the
 memory remains allocated until each rpc_clnt is destroyed, even when
 there are no outstanding RPC requests.

 NFSv4.1 sessions provide the ability to negotiate the maximum number of
 concurrent requests that the transport and the server can handle.  The
 maximum can be increased or decreased during a transport session.

 This patch makes RPC requests a part of the RPC task structure.  This
 reduces memory fragmentation by keeping all the pieces of an NFS and
 RPC request together.  For asynchronous reads and writes, this means
 the RPC request, RPC task, and nfs_read/write_data structures reside
 in the same CPU cache lines, and will move between caches together.
 RPC slot tables are eliminated entirely, preventing memory from being
 held even when it's not being used, and allowing the maximum number of
 outstanding RPC requests to be modulated dynamically.

 The call_reserve path has been simplified.  Its only jobs now are to
 initialize each rpc_rqst structure, or to queue RPC tasks on the xprt's
 backlog queue if the maximum number per transport has been exceeded.

 We no longer have a free list, and each XID can be allocated while holding
 the xprt_lock.  Thus we can do away with the reserve_lock entirely.
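
 A simplified model of such a reserve path is sketched below.  The
 structures are stand-ins, the names (xprt_lim, xprt_reserve_sketch,
 xprt_release_sketch) are hypothetical, and the caller is assumed to
 hold the transport lock.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Simplified stand-ins, not the kernel definitions. */
struct rqst {
    unsigned int xid;
};

struct task {
    struct rqst rqst;   /* request embedded in the task */
    int backlogged;
};

struct xprt_lim {
    unsigned int max_reqs;  /* limit; may be changed at run time */
    unsigned int num_reqs;  /* requests currently in flight */
    unsigned int next_xid;
};

/* Either initialize the task's embedded request, or mark the task as
 * backlogged when the transport is already at its limit. */
int xprt_reserve_sketch(struct xprt_lim *xprt, struct task *task)
{
    assert(xprt != NULL && task != NULL);
    if (xprt->num_reqs >= xprt->max_reqs) {
        task->backlogged = 1;
        return -EAGAIN;
    }
    memset(&task->rqst, 0, sizeof(task->rqst));
    task->rqst.xid = xprt->next_xid++;
    task->backlogged = 0;
    xprt->num_reqs++;
    return 0;
}

/* Account for a completed request. */
void xprt_release_sketch(struct xprt_lim *xprt)
{
    assert(xprt != NULL && xprt->num_reqs > 0);
    xprt->num_reqs--;
}
```

 Because the limit is just a counter rather than the size of a
 pre-allocated table, raising or lowering max_reqs takes effect on the
 next reservation.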

 Test-plan:
 Heavy multi-threaded tests like "sio" and "iozone" on SMP systems.  Watch
 for significant regression in CPU utilization with oprofile and other tools.
 Watch for multiple RPCs with the same XID.  Destructive testing.


File 58-rpc-tk_rqstp.patch

 Subject: [PATCH] RPC: remove tk_rqstp field from rpc_task
 
 Category: Minor code re-organization

 Description:
 Now that the rpc_rqst structure is part of rpc_task, there is no need
 for a field in rpc_task that points to the associated RPC request slot.

 This patch removes the tk_rqstp field.

 Test-plan:
 None.


File 59-rpc-rq_task.patch

 Subject: [PATCH] RPC: remove rq_task field from rpc_rqst
 
 Category: Minor code re-organization

 Description:
 Now that the rpc_rqst structure is part of rpc_task, there is no need
 for a field in rpc_rqst that points to its associated rpc_task.

 This patch removes the rq_task field.
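
 When one structure is embedded in another, the enclosing structure can
 be recovered from the member's address by pointer arithmetic, so no
 back pointer is needed.  The sketch below shows the idea with
 simplified stand-in structures; rqst_to_task is a hypothetical helper
 name.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins, not the kernel definitions. */
struct rpc_rqst {
    unsigned int rq_xid;
};

struct rpc_task {
    int tk_status;
    struct rpc_rqst tk_rqst;   /* embedded request */
};

/* Recover the enclosing task from a pointer to its embedded request by
 * subtracting the member's offset within the task. */
struct rpc_task *rqst_to_task(struct rpc_rqst *req)
{
    assert(req != NULL);
    return (struct rpc_task *)((char *)req -
                               offsetof(struct rpc_task, tk_rqst));
}
```

 This is valid only for requests that really are embedded in a task,
 which is the situation once the earlier patches in this series are
 applied.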

 Test-plan:
 None.


File 60-rpc-rq_xprt.patch

 Subject: [PATCH] RPC: use rq_xprt instead of tk_xprt
 
 Category: Minor performance optimization

 Description:
 Now that the rpc_rqst structure is part of rpc_task, we can replace
 this double indirection (task->tk_client->cl_xprt) with a single
 indirection (task->tk_rqst.rq_xprt) to save a load instruction and
 eliminate one AGI in several places.

 Test-plan:
 None.


File 61-rpc-rq_rtt.patch

 Subject: [PATCH] RPC: add rq_rtt field to rpc_rqst
 
 Category: Minor performance optimization

 Description:
 Currently it is cumbersome to derive the RTT data from the address
 of an RPC request.  This patch adds an rq_rtt field which is a copy
 of the RPC client's cl_rtt field.  That makes it easier and cleaner
 to find the RTT data when needed, and reduces the number of loads
 and AGIs in several places.

 Test-plan:
 None.