Proposed Strategy for managing ESTALE results in the Linux NFS client

Author: Chuck Lever
Date: Mon Dec 20 11:32:41 EST 2004

Introduction and Problem Statement

According to RFC 1813, an NFS server returns the error status NFS3ERR_STALE whenever a client requests an operation against a file handle that refers to an object that no longer exists, or to which the client no longer has access and an NFS3ERR_ACCES return is not appropriate. This occurs most often in multi-client sharing environments where one client removes or replaces directories and files that other clients have cached.

Currently, applications on Linux receive ESTALE errors more often than applications using other NFS client implementations. In addition, local file systems on Linux never return ESTALE, so applications that were designed for local file systems face a compatibility issue when they run on NFS file systems. Finally, when an ESTALE error does occur, the Linux NFS client does not recover without user-level intervention.

The goal of this work is to improve the usability of the Linux NFS client by greatly reducing the likelihood that an application will ever receive an ESTALE result. We will accomplish this in two ways: first, by creating a robust mechanism for recovering directory cache state after a server ESTALE result; second, by substituting more familiar and intuitive failure modes instead of returning ESTALE to applications. In other words, the resulting implementation will behave more like local file systems wherever possible.

Even though this work will proceed on NFSv3, all NFS versions have these issues. Until the NFSv4 implementation has directory delegation, it will also require adequate ESTALE recovery.

Pathname Recovery

Applications expect that open(2) does not return ESTALE. We have already fixed two bugs, and both fixes reduce the likelihood that ESTALE will be returned from open(2):

  1. Clients must detect directory restore operations that can cause the file handles and cookies in a directory to change. When a client detects such a change, it should purge any data (file handles, file attributes, or cookies) it has cached for that directory.
  2. Close-to-open requires a client to perform a GETATTR during open(2) to verify cached data and metadata. The fix moves that GETATTR into the pathname resolution logic (lookup revalidation) so that a client can detect and recover from a stale file handle before recovery becomes impractical.

But these are not enough.

If an ESTALE result occurs during pathname resolution, the best recovery strategy is to purge the stale cached pathname components and retry the pathname resolution from the top. This can happen if the VFS layer itself recognizes the ESTALE and redrives the pathname resolution process. The VFS layer must handle this case whenever a pathname resolution is required, not just at open(2) time.

In addition, the retry logic must take care not to loop forever. When resolving a relative pathname, for example, it is possible that "." or ".." is itself stale. In that case there is no way to resolve the pathname: purging the stale file handle from the local cache and retrying the resolution will never result in anything but ESTALE.

The Solaris VFS layer retries any pathname resolution that returns an ESTALE. The NFS client implementation has logic to return ESTALE only once for a given stale cached file handle, which means the pathname resolution will always terminate. If the ESTALE condition is not cleared by retries, the resolution logic returns ENOENT. The downside of this is that two processes trying a pathname resolution at the same time can get different results.

Implementation

One possible implementation wraps the core pathname resolution engine in fs/namei.c with simple logic to redrive the resolution from the top of the pathname in the event any part of the resolution fails with ESTALE. If redriving the resolution does not clear the ESTALE condition after a limited number of retries, pathname resolution exits with ENOENT. We've prototyped this idea by wrapping the link_path_walk function.

Redriving pathname resolution requires that the NFS client purges all data for a file object, and as much metadata as possible, when a stale file handle is encountered. This might include invalidating the parent directory's cache. If the NFS client does not invalidate stale inodes, retrying pathname resolution will continue using the stale cached information.
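
The redrive logic described above can be sketched in user space. This is only an illustration under stated assumptions, not the prototype itself: walk_with_estale_retry, walk_fn, purge_fn, and the demo_* helpers are hypothetical names, the single-pass resolver is passed in as a callback rather than being link_path_walk, and the retry limit is supplied by the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical single-pass resolver: returns 0 or a negative errno. */
typedef int (*walk_fn)(const char *path, void *ctx);
/* Hypothetical hook that drops cached state for stale components. */
typedef void (*purge_fn)(const char *path, void *ctx);

/*
 * Redrive the walk from the top of the pathname for as long as it
 * fails with -ESTALE, purging cached state before each retry. Once
 * max_retries retries have failed, report -ENOENT instead of -ESTALE.
 */
static int walk_with_estale_retry(const char *path, walk_fn walk,
                                  purge_fn purge, void *ctx,
                                  unsigned int max_retries)
{
    unsigned int attempt;
    int err;

    assert(path != NULL && walk != NULL);

    for (attempt = 0; ; attempt++) {
        err = walk(path, ctx);
        if (err != -ESTALE)
            return err;         /* success, or an unrelated error */
        if (attempt == max_retries)
            return -ENOENT;     /* the ESTALE condition never cleared */
        if (purge != NULL)
            purge(path, ctx);   /* without this, the retry reuses stale data */
    }
}

/* Toy cache for demonstration: stale_left entries are still stale. */
struct demo_cache {
    int stale_left;
    int purges;
};

static int demo_walk(const char *path, void *ctx)
{
    struct demo_cache *c = ctx;

    (void)path;
    return c->stale_left > 0 ? -ESTALE : 0;
}

static void demo_purge(const char *path, void *ctx)
{
    struct demo_cache *c = ctx;

    (void)path;
    c->stale_left--;            /* each purge clears one stale entry */
    c->purges++;
}
```

With the toy cache, a walk that hits two stale entries succeeds on its third pass, while a walk whose stale entries outnumber the retry limit ends with -ENOENT. If the purge hook does nothing, every pass sees the same stale state, which is the failure described in the paragraph above.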

There are several issues with this implementation.

  1. Not every system call that requires pathname resolution uses link_path_walk. Most do, however, including all variants of open(2).
  2. We don't have a clean mechanism for preventing infinite pathname resolution retries when it becomes obvious that the ESTALE condition won't be cleared (currently we cap the number of retries at the number of slashes in the pathname, but this is a workaround more than a final solution).
  3. The Linux NFS client never allows the root dentry of a mounted NFS file system to go stale (i.e., it lies). With the addition of pathname resolution retry, the Linux NFS client may finally be able to report safely that the root file handle has gone stale.
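
The retry cap mentioned in issue 2 amounts to counting the slashes in the pathname. A minimal sketch, assuming the pathname is a NUL-terminated string and that every '/' counts, including repeated ones (estale_retry_limit is a hypothetical name, and the prototype may count differently):

```c
#include <assert.h>
#include <stddef.h>

/* Retry budget for ESTALE redrives: one retry per '/' in the pathname. */
static unsigned int estale_retry_limit(const char *path)
{
    unsigned int slashes = 0;

    assert(path != NULL);

    for (; *path != '\0'; path++) {
        if (*path == '/')
            slashes++;
    }
    return slashes;
}
```

Note that a relative pathname with a single component contains no slashes, so under this rule it gets no retries at all. That is one reason the cap is better viewed as a workaround than as a final solution.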

Renaming ESTALE

Although the Linux NFS client must signal an error when an operation fails because a stale file handle is encountered, applications are often unprepared for the ESTALE return code. Local file systems use ENOENT to indicate that an object no longer exists, and the NFS client should do the same for metadata operations such as rmdir or setattr in order to be consistent with other file system implementations.

File I/O operations are somewhat different. In general, operating systems that comply with the POSIX standard allow applications to continue reading from and writing to files that are open but unlinked. The NFS client can usually accommodate this, except when the file was unlinked by some other client. One can argue whether it is better to return EIO, which applications may already be prepared to handle, or ESTALE, which applications recognize as an error but might also recognize as an indication that they can recover by re-opening the file (thus redriving pathname resolution from user level).
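
To make the user-level recovery option concrete, here is a rough sketch of what an application might do if read requests can fail with ESTALE. The helper name read_reopen_on_estale is hypothetical, and the sketch assumes the application still knows the pathname it opened and reads with explicit offsets, so that re-opening loses no file position.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read len bytes at offset off from *fd. If the read fails with
 * ESTALE, re-open path once (redriving pathname resolution from user
 * level), replace *fd with the new descriptor, and retry the read.
 * Returns the pread(2) result; on failure errno is left set.
 */
static ssize_t read_reopen_on_estale(int *fd, const char *path,
                                     void *buf, size_t len, off_t off)
{
    ssize_t n;
    int newfd;

    assert(fd != NULL && path != NULL && buf != NULL);

    n = pread(*fd, buf, len, off);
    if (n >= 0 || errno != ESTALE)
        return n;               /* success, or an error we do not handle */

    newfd = open(path, O_RDONLY);
    if (newfd < 0)
        return -1;              /* the pathname no longer resolves */

    close(*fd);
    *fd = newfd;
    return pread(*fd, buf, len, off);
}
```

An application written this way benefits from seeing ESTALE rather than EIO, because EIO gives it no hint that re-opening might help. If the file was replaced rather than removed, the retried read returns data from the replacement object, which may or may not be what the application wants.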

As a general rule, to emulate local file system behavior completely, the Linux NFS client should avoid returning an ESTALE result to the VFS layer, except during pathname resolution.

For example, in this set of operations, returning ESTALE from follow_link, lookup, and permission would allow the VFS layer to detect stale directory entries, and redrive pathname resolution if necessary. In all other cases, the NFS client could return ENOENT if it receives an ESTALE while revalidating the inode, or during the requested operation itself. This would prevent applications from seeing ESTALE, but allow an error return that is meaningful and consistent with other file system implementations in Linux.

struct inode_operations {
        int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
        struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
        int (*link) (struct dentry *,struct inode *,struct dentry *);
        int (*unlink) (struct inode *,struct dentry *);
        int (*symlink) (struct inode *,struct dentry *,const char *);
        int (*mkdir) (struct inode *,struct dentry *,int);
        int (*rmdir) (struct inode *,struct dentry *);
        int (*mknod) (struct inode *,struct dentry *,int,dev_t);
        int (*rename) (struct inode *, struct dentry *,
                        struct inode *, struct dentry *);
        int (*readlink) (struct dentry *, char __user *,int);
        int (*follow_link) (struct dentry *, struct nameidata *);
        void (*put_link) (struct dentry *, struct nameidata *);
        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int, struct nameidata *);
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
        int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
        ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*removexattr) (struct dentry *, const char *);
};
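
The rule for inode operations can be expressed as a small filter applied to each operation's return value. The following is a user-space sketch only: enum iop_kind, iop_is_path_walk, and iop_filter_estale are hypothetical names that do not exist in the kernel, and only a few operations are tagged.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical tags for a few inode operations; not kernel identifiers. */
enum iop_kind {
    IOP_LOOKUP,
    IOP_FOLLOW_LINK,
    IOP_PERMISSION,
    IOP_SETATTR,
    IOP_RMDIR,
    IOP_UNLINK
};

/* Nonzero if the operation takes part in pathname resolution. */
static int iop_is_path_walk(enum iop_kind op)
{
    return op == IOP_LOOKUP || op == IOP_FOLLOW_LINK ||
           op == IOP_PERMISSION;
}

/*
 * Filter an operation's result (0 or a negative errno). -ESTALE is
 * passed through only for pathname resolution operations, so the VFS
 * layer can redrive the walk; everywhere else it becomes -ENOENT.
 */
static int iop_filter_estale(enum iop_kind op, int err)
{
    assert(err <= 0);

    if (err == -ESTALE && !iop_is_path_walk(op))
        return -ENOENT;
    return err;
}
```

Under this filter a stale handle seen by rmdir surfaces as ENOENT, the same result a local file system gives for an object that no longer exists, while lookup still reports ESTALE to the VFS layer. Errors other than ESTALE pass through unchanged.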

Instead of returning ESTALE, the data-related operations in this collection (such as read, sendfile, or fsync) could return EIO if a stale file handle is encountered. The metadata operations, such as lock and llseek, could return either EIO or ENOENT.

struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
        int (*readdir) (struct file *, void *, filldir_t);
        unsigned int (*poll) (struct file *, struct poll_table_struct *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, struct dentry *, int datasync);
        int (*aio_fsync) (struct kiocb *, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
        ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
        ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
        ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long,
                unsigned long, unsigned long);
        int (*check_flags)(int);
        int (*dir_notify)(struct file *filp, unsigned long arg);
        int (*flock) (struct file *, int, struct file_lock *);
};
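
A similar filter could cover the file operations. Because the choice between EIO and ENOENT for metadata operations is still open, this sketch takes that choice as a parameter instead of fixing it. The names enum fop_class and fop_filter_estale are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical classification of file operations. */
enum fop_class {
    FOP_DATA,       /* e.g. read, sendfile, fsync */
    FOP_METADATA    /* e.g. lock, llseek */
};

/*
 * Filter a file operation's result (0, a byte count, or a negative
 * errno). -ESTALE becomes -EIO for data operations. For metadata
 * operations it becomes -metadata_errno, which must be EIO or ENOENT.
 */
static long fop_filter_estale(enum fop_class cls, long err,
                              int metadata_errno)
{
    assert(metadata_errno == EIO || metadata_errno == ENOENT);

    if (err != -ESTALE)
        return err;
    return cls == FOP_DATA ? -EIO : -(long)metadata_errno;
}
```

The result type is long so that the same filter fits operations that return int and operations that return a byte count. Non-negative results and unrelated errors are returned untouched.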

The dentry operations return a boolean value or nothing at all. d_revalidate should continue to return zero if it encounters a stale file handle.

struct dentry_operations {
        int (*d_revalidate)(struct dentry *, struct nameidata *);
        int (*d_hash) (struct dentry *, struct qstr *);
        int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
        int (*d_delete)(struct dentry *);
        void (*d_release)(struct dentry *);
        void (*d_iput)(struct dentry *, struct inode *);
};