[Ocfs2-users] OCFS2, NFS and random Stale NFS file handles

Adam Randall randalla at gmail.com
Tue Jul 16 21:44:50 PDT 2013


Here are the various outputs:

# grep nfs /etc/mtab:
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
192.168.0.160:/var/log/dms /mnt/dmslogs nfs
rw,noexec,nosuid,nodev,noatime,vers=4,addr=192.168.0.160,clientaddr=192.168.0.150
0 0
192.168.0.160:/mnt/storage /mnt/storage nfs
rw,noexec,nosuid,nodev,noatime,vers=4,addr=192.168.0.160,clientaddr=192.168.0.150
0 0
# grep nfs /proc/mounts:
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
192.168.0.160:/var/log/dms /mnt/dmslogs nfs4
rw,nosuid,nodev,noexec,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.150,local_lock=none,addr=192.168.0.160
0 0
192.168.0.160:/mnt/storage /mnt/storage nfs4
rw,nosuid,nodev,noexec,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.150,local_lock=none,addr=192.168.0.160
0 0

Also, the output of df -hT | grep nfs:
192.168.0.160:/var/log/dms nfs       273G  5.6G  253G   3% /mnt/dmslogs
192.168.0.160:/mnt/storage nfs       2.8T  1.8T  986G  65% /mnt/storage

From the looks of it, it appears to be NFS version 4 (though I thought
I was running version 3, hrm...).
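
For completeness, nfsstat (from the nfs-utils package) is another way
to read the negotiated mount options, assuming it's installed on these
clients:

# nfsstat -m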

With regard to ls -lid: here's one of the directories that wasn't
altered, but for whatever reason became inaccessible due to a stale
handle:

# ls -lid /mnt/storage/reports/5306
185862043 drwxrwxrwx 4 1095 users 45056 Jul 15 21:37 /mnt/storage/reports/5306

In the directory where we create new documents (the application creates
a folder per document, a legacy decision), it looks something like this:

# ls -lid /mnt/storage/dms/documents/819/* | head -n 10
290518712 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39
/mnt/storage/dms/documents/819/8191174
290518714 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39
/mnt/storage/dms/documents/819/8191175
290518716 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39
/mnt/storage/dms/documents/819/8191176
290518718 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39
/mnt/storage/dms/documents/819/8191177
290518720 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39
/mnt/storage/dms/documents/819/8191178
290518722 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:40
/mnt/storage/dms/documents/819/8191179
290518724 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:40
/mnt/storage/dms/documents/819/8191180
290518726 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:47
/mnt/storage/dms/documents/819/8191181
290518728 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:50
/mnt/storage/dms/documents/819/8191182
290518730 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:52
/mnt/storage/dms/documents/819/8191183
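
To sanity-check Pat's inode-number theory below, a one-liner like this
(GNU find; 4294967295 is 2^32 - 1) should flag any directory whose
inode number won't fit in 32 bits:

# find /mnt/storage/dms/documents -printf '%i %p\n' | awk '$1 > 4294967295'

All of the inodes listed above are well under that limit.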

The stale handles seem to appear more often when there's load on the
system, but the correlation isn't strong. I received notice of two
failures (both from the same server) tonight, as seen here:

Jul 16 19:27:40 imaging4 php: Output of: ls -l
/mnt/storage/dms/documents/819/8191226/ 2>&1:
Jul 16 19:27:40 imaging4 php:    ls: cannot access
/mnt/storage/dms/documents/819/8191226/: Stale NFS file handle
Jul 16 19:44:15 imaging4 php: Output of: ls -l
/mnt/storage/dms/documents/819/8191228/ 2>&1:
Jul 16 19:44:15 imaging4 php:    ls: cannot access
/mnt/storage/dms/documents/819/8191228/: Stale NFS file handle

The above was logged by my e-mail collecting daemon, which is written
in PHP. When it can't access a directory that was just created, it
writes the information out via syslog().

From the same server, doing ls -lid I get these for those two directories:

290518819 drwxrwxrwx 2 nobody nobody 3896 Jul 16 19:44
/mnt/storage/dms/documents/819/8191228
290518816 drwxrwxrwx 2 nobody nobody 3896 Jul 16 19:27
/mnt/storage/dms/documents/819/8191226

Running stat on the directories showed that the modify times
correspond to the log entries above:

Modify: 2013-07-16 19:27:40.786142391 -0700
Modify: 2013-07-16 19:44:15.458250738 -0700

Between the time it happened and the time I got back to look, the
stale handle had cleared itself.
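
Since the handle does clear on its own, I may have the daemon retry a
few times before reporting a failure. Roughly this logic, sketched here
as a shell loop ($dir is a placeholder for the newly created directory;
the daemon itself is PHP):

for i in 1 2 3; do
    ls -l "$dir" && break   # listing succeeded, the handle is good
    sleep 2                 # give the stale handle a moment to clear
done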

If it's at all relevant, this is the fstab:

192.168.0.160:/var/log/dms                 /mnt/dmslogs      nfs
defaults,nodev,nosuid,noexec,noatime            0 0
192.168.0.160:/mnt/storage                 /mnt/storage      nfs
defaults,nodev,nosuid,noexec,noatime            0 0
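
Since "defaults" leaves the client free to negotiate the highest
version the server offers (presumably how these ended up at v4), I
gather that pinning version 3 would just mean adding an explicit vers
option, something like:

192.168.0.160:/mnt/storage  /mnt/storage  nfs  defaults,nodev,nosuid,noexec,noatime,vers=3  0 0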

Lastly, grasping at straws, I unmounted the ocfs2 partition on the
secondary server and stopped the ocfs2 service, thinking that running
in master/master mode might be causing what I was seeing. Alas, that
doesn't appear to be the case, as the errors above arrived after I did
that.
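
For the record, that amounted to something like the following on the
secondary (the mount point and init script name are my best guesses
here; depending on the install, the ocfs2-tools script may be called
o2cb instead):

# umount /mnt/storage
# /etc/init.d/ocfs2 stop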

Is there anything else that I can provide that might be of help?

Adam.



On Tue, Jul 16, 2013 at 5:15 PM, Patrick J. LoPresti <lopresti at gmail.com> wrote:

> What version is the NFS mount? ("cat /proc/mounts" on the NFS client)
>
> NFSv2 only allowed 64 bits in the file handle. With the
> "subtree_check" option on the NFS server, 32 of those bits are used
> for the subtree check, leaving only 32 for the inode. (This is from
> memory; I may have the exact numbers wrong. But the principle
> applies.)
>
> See <https://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html#NFS>
>
> If you run "ls -lid <directory>" for directories that work and those
> that fail, and you find that the failing directories all have huge
> inode numbers, that will help confirm that this is the problem.
>
> Also if you are using NFSv2 and switch to v3 or set the
> "no_subtree_check" option and it fixes the problem, that will also
> help confirm that this is the problem. :-)
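>
> For illustration (the export path and client spec below are just
> placeholders), an /etc/exports entry with that option might look
> like:
>
> /mnt/storage 192.168.0.0/24(rw,no_subtree_check)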
>
>  - Pat
>
>
> On Tue, Jul 16, 2013 at 5:07 PM, Adam Randall <randalla at gmail.com> wrote:
> > Please forgive my lack of experience, but I've just recently started
> > working deeply with ocfs2 and am not familiar with all its caveats.
> >
> > We've just deployed two servers that have SAN arrays attached to them.
> > These arrays are synchronized with DRBD in master/master mode, with
> > ocfs2 configured on top of that. In all my testing everything worked
> > well, except for an issue with symbolic links throwing an exception in
> > the kernel (this was fixed by applying a patch I found here:
> > comments.gmane.org/gmane.comp.file-systems.ocfs2.devel/8008). Of these
> > machines, one is designated the master and the other is its backup.
> >
> > The host is Gentoo Linux running kernel 3.8.13.
> >
> > I have four other machines that connect to the master ocfs2 partition
> > using NFS. The problem I'm having is that on these machines, I randomly
> > get read errors while trying to enter directories over NFS. In all of
> > these cases, except one, the directories are immediately unavailable
> > after they are created. The error that comes back is always something
> > like this:
> >
> > ls: cannot access /mnt/storage/documents/818/8189794/: Stale NFS file
> > handle
> >
> > The mount point is /mnt/storage. Other directories on the mount are
> > available, and on other servers the same directory can be accessed
> > perfectly fine.
> >
> > I haven't been able to reproduce this issue in isolated testing.
> >
> > The four machines that connect via NFS are doing one of two things:
> >
> > 1) processing e-mail through a PHP-driven daemon (read and write,
> > creating directories)
> > 2) serving report files in PDF format over the web via a PHP web
> > application (read only)
> >
> > I believe that the ocfs2 version is 1.5. I found this in the kernel
> > source itself, but haven't figured out how to determine it from the
> > shell. ocfs2-tools is version 1.8.2, which is what ocfs2 wanted (maybe
> > this is ocfs2 1.8 then?).
> >
> > The only other path I can think to take is to abandon OCFS2 and use
> > DRBD in master/slave mode with ext4 on top of that. This would still
> > give me the redundancy I want, at the cost of not being able to use
> > both machines simultaneously.
> >
> > If anyone has any advice, I'd love to hear it.
> >
> > Thanks in advance,
> >
> > Adam.
> >
> >
> > --
> > Adam Randall
> > http://www.xaren.net
> > AIM: blitz574
> > Twitter: @randalla0622
> >
> > "To err is human... to really foul up requires the root password."
> >



-- 
Adam Randall
http://www.xaren.net
AIM: blitz574
Twitter: @randalla0622

"To err is human... to really foul up requires the root password."