[Ocfs2-users] OCFS2, NFS and random Stale NFS file handles

Adam Randall randalla at gmail.com
Tue Jul 16 23:22:25 PDT 2013


I've been doing more digging, and I've changed some of the configuration:

1) I've changed my nfs mount options to this:

192.168.0.160:/mnt/storage  /mnt/i2xstorage  nfs  defaults,nosuid,noexec,noatime,nodiratime  0 0

2) I've changed the /etc/exports for /mnt/storage to this:

     /mnt/storage -rw,sync,subtree_check,no_root_squash @trusted

In #1, I've removed nodev, which I think I accidentally copied over from a
tmpfs mount point above it when I originally set up the nfs mount point
long ago, and I've added nodiratime. In #2, the line used to be
-rw,async,no_subtree_check,no_root_squash. I suspect async may be
contributing to what I'm seeing, and subtree_check should be fine for
testing.
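
For the record, this is roughly how I applied the changes without
rebooting anything (a sketch of my steps, in case I've missed one):

    # on the NFS server, after editing /etc/exports:
    exportfs -ra                      # re-read /etc/exports
    exportfs -v | grep /mnt/storage   # confirm sync and subtree_check took

    # on each NFS client, after editing /etc/fstab:
    umount /mnt/i2xstorage
    mount /mnt/i2xstorage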

Hopefully, this will have an effect.

Adam.


On Tue, Jul 16, 2013 at 9:44 PM, Adam Randall <randalla at gmail.com> wrote:

> Here's various outputs:
>
> # grep nfs /etc/mtab:
> rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
> 192.168.0.160:/var/log/dms /mnt/dmslogs nfs
> rw,noexec,nosuid,nodev,noatime,vers=4,addr=192.168.0.160,clientaddr=192.168.0.150
> 0 0
> 192.168.0.160:/mnt/storage /mnt/storage nfs
> rw,noexec,nosuid,nodev,noatime,vers=4,addr=192.168.0.160,clientaddr=192.168.0.150
> 0 0
> # grep nfs /proc/mounts:
> rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
> 192.168.0.160:/var/log/dms /mnt/dmslogs nfs4
> rw,nosuid,nodev,noexec,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.150,local_lock=none,addr=192.168.0.160
> 0 0
> 192.168.0.160:/mnt/storage /mnt/storage nfs4
> rw,nosuid,nodev,noexec,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.150,local_lock=none,addr=192.168.0.160
> 0 0
>
> Also, the output of df -hT | grep nfs:
> 192.168.0.160:/var/log/dms nfs       273G  5.6G  253G   3% /mnt/dmslogs
> 192.168.0.160:/mnt/storage nfs       2.8T  1.8T  986G  65% /mnt/storage
>
> From the looks of it, the mounts are NFS version 4 (though I thought I
> was running version 3, hrm...).
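>
> (If I end up wanting to force v3 for testing, my understanding is that
> pinning it in fstab with the nfsvers=3 mount option would do it, e.g.:
>
>     192.168.0.160:/mnt/storage  /mnt/storage  nfs  defaults,nodev,nosuid,noexec,noatime,nfsvers=3  0 0
>
> though I haven't tried that yet.)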
>
> With regards to the ls -lid, here's one of the directories that wasn't altered but, for whatever reason, was inaccessible due to a stale handle:
>
> # ls -lid /mnt/storage/reports/5306
> 185862043 drwxrwxrwx 4 1095 users 45056 Jul 15 21:37 /mnt/storage/reports/5306
>
> In the directory where we create new documents (one folder is created per document, a legacy decision), the listing looks something like this:
>
> # ls -lid /mnt/storage/dms/documents/819/* | head -n 10
> 290518712 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39 /mnt/storage/dms/documents/819/8191174
> 290518714 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39 /mnt/storage/dms/documents/819/8191175
> 290518716 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39 /mnt/storage/dms/documents/819/8191176
> 290518718 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39 /mnt/storage/dms/documents/819/8191177
> 290518720 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39 /mnt/storage/dms/documents/819/8191178
> 290518722 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:40 /mnt/storage/dms/documents/819/8191179
> 290518724 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:40 /mnt/storage/dms/documents/819/8191180
> 290518726 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:47 /mnt/storage/dms/documents/819/8191181
> 290518728 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:50 /mnt/storage/dms/documents/819/8191182
> 290518730 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:52 /mnt/storage/dms/documents/819/8191183
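>
> For what it's worth, every inode number above is comfortably below 2^32,
> so none of these should overflow a 32-bit slot in the file handle. As a
> rough (and slow) way to scan for any inode numbers that need more than
> 32 bits (i.e. greater than 4294967295), I could run this on the server
> against the ocfs2 mount:
>
>     find /mnt/storage -xdev -printf '%i %p\n' | awk '$1 > 4294967295'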
>
> The stale handles seem to appear more often when there's load on the system, though the correlation isn't strong. I received notice of two failures (both from the same server) tonight, as seen here:
>
> Jul 16 19:27:40 imaging4 php: Output of: ls -l /mnt/storage/dms/documents/819/8191226/ 2>&1:
> Jul 16 19:27:40 imaging4 php:    ls: cannot access /mnt/storage/dms/documents/819/8191226/: Stale NFS file handle
> Jul 16 19:44:15 imaging4 php: Output of: ls -l /mnt/storage/dms/documents/819/8191228/ 2>&1:
> Jul 16 19:44:15 imaging4 php:    ls: cannot access /mnt/storage/dms/documents/819/8191228/: Stale NFS file handle
>
> The above is logged by my e-mail collecting daemon, which is written in PHP. When it can't access a directory it has just created, it uses syslog() to write the information out.
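>
> (For anyone who wants to poke at this outside the daemon, the failing
> sequence boils down to roughly this, run on one of the NFS clients;
> test.$$ is just a placeholder name:
>
>     d=/mnt/storage/dms/documents/819/test.$$
>     mkdir "$d" && ls -l "$d" 2>&1 | logger -t stale-check
>
> i.e. mkdir a directory over NFS and then list it immediately.)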
>
> From the same server, doing ls -lid I get these for those two directories:
>
> 290518819 drwxrwxrwx 2 nobody nobody 3896 Jul 16 19:44 /mnt/storage/dms/documents/819/8191228
> 290518816 drwxrwxrwx 2 nobody nobody 3896 Jul 16 19:27 /mnt/storage/dms/documents/819/8191226
>
> Running stat on the directories showed that the modify times correspond to the logs above:
>
> Modify: 2013-07-16 19:27:40.786142391 -0700
> Modify: 2013-07-16 19:44:15.458250738 -0700
>
> Between the time it happened and the time I got back to look, the stale handle had cleared itself.
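>
> (Next time I catch one live, I'll try timing how long the handle stays
> stale with something like this, substituting whichever directory just
> failed:
>
>     d=/mnt/storage/dms/documents/819/8191228
>     until ls -ld "$d" >/dev/null 2>&1; do sleep 1; done; date
>
> so I can report how long it takes to clear.)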
>
> If it's at all relevant, this is the fstab:
>
> 192.168.0.160:/var/log/dms                 /mnt/dmslogs      nfs defaults,nodev,nosuid,noexec,noatime            0 0
> 192.168.0.160:/mnt/storage                 /mnt/storage      nfs defaults,nodev,nosuid,noexec,noatime            0 0
>
> Lastly, in a fit of grasping at straws, I unmounted the ocfs2 partition on the secondary server and stopped the ocfs2 service, thinking that maybe running in master/master mode could cause what I was seeing. Alas, that's not it: the errors above came after I did that.
>
> Is there anything else that I can provide that might be of help?
>
> Adam.
>
>
>
> On Tue, Jul 16, 2013 at 5:15 PM, Patrick J. LoPresti <lopresti at gmail.com> wrote:
>
>> What version is the NFS mount? ("cat /proc/mounts" on the NFS client)
>>
>> NFSv2 only allowed 64 bits in the file handle. With the
>> "subtree_check" option on the NFS server, 32 of those bits are used
>> for the subtree check, leaving only 32 for the inode. (This is from
>> memory; I may have the exact numbers wrong. But the principle
>> applies.)
>>
>> See <https://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html#NFS>
>>
>> If you run "ls -lid <directory>" for directories that work and those
>> that fail, and you find that the failing directories all have huge
>> inode numbers, that will help confirm that this is the problem.
>>
>> Also if you are using NFSv2 and switch to v3 or set the
>> "no_subtree_check" option and it fixes the problem, that will also
>> help confirm that this is the problem. :-)
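>>
>> For example, an exports line with the check disabled might look like
>> this (just a sketch; keep whatever other options you need):
>>
>>     /mnt/storage  -rw,no_subtree_check,no_root_squash  @trusted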
>>
>>  - Pat
>>
>>
>> On Tue, Jul 16, 2013 at 5:07 PM, Adam Randall <randalla at gmail.com> wrote:
>> > Please forgive my lack of experience, but I've only recently started
>> > working deeply with ocfs2 and am not familiar with all its caveats.
>> >
>> > We've just deployed two servers that have SAN arrays attached to them.
>> > These arrays are synchronized with DRBD in master/master mode, with
>> > ocfs2 configured on top of that. In all my testing everything worked
>> > well, except for an issue with symbolic links throwing an exception in
>> > the kernel (this was fixed by applying a patch I found here:
>> > comments.gmane.org/gmane.comp.file-systems.ocfs2.devel/8008). Of these
>> > machines, one is designated the master and the other is its backup.
>> >
>> > The hosts are Gentoo Linux running kernel 3.8.13.
>> >
>> > I have four other machines that connect to the master ocfs2 partition
>> > using nfs. The problem I'm having is that, on these machines, I'm
>> > randomly getting read errors while trying to enter directories over
>> > nfs. In all of these cases except one, the directories are immediately
>> > unavailable after they are created. The error that comes back is
>> > always something like this:
>> >
>> > ls: cannot access /mnt/storage/documents/818/8189794/: Stale NFS file handle
>> >
>> > The mount point is /mnt/storage. Other directories on the mount are
>> > available, and on other servers the same directory can be accessed
>> > perfectly fine.
>> >
>> > I haven't been able to reproduce this issue in isolated testing.
>> >
>> > The four machines that connect via NFS are doing one of two things:
>> >
>> > 1) processing e-mail through a php-driven daemon (read and write,
>> > creating directories)
>> > 2) serving report files in PDF format over the web via a php web
>> > application (read only)
>> >
>> > I believe that the ocfs2 version is 1.5. I found this in the kernel
>> > source itself, but haven't figured out how to determine it from the
>> > shell. ocfs2-tools is version 1.8.2, which is what ocfs2 wanted (maybe
>> > this is ocfs2 1.8 then?).
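>> >
>> > (One guess at checking from the shell, assuming ocfs2 is built as a
>> > module rather than compiled in, would be:
>> >
>> >     modinfo ocfs2 | grep -i version
>> >
>> > but I'm not sure the version field there maps to the ocfs2 release.)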
>> >
>> > The only other path I can think to take is to abandon OCFS2 and use
>> > DRBD in master/slave mode with ext4 on top of that. This would still
>> > provide me with the redundancy I want, but at the cost of not being
>> > able to use both machines simultaneously.
>> >
>> > If anyone has any advice, I'd love to hear it.
>> >
>> > Thanks in advance,
>> >
>> > Adam.
>> >
>> >
>> > --
>> > Adam Randall
>> > http://www.xaren.net
>> > AIM: blitz574
>> > Twitter: @randalla0622
>> >
>> > "To err is human... to really foul up requires the root password."
>> >
>> > _______________________________________________
>> > Ocfs2-users mailing list
>> > Ocfs2-users at oss.oracle.com
>> > https://oss.oracle.com/mailman/listinfo/ocfs2-users
>>
>
>
>
> --
> Adam Randall
> http://www.xaren.net
> AIM: blitz574
> Twitter: @randalla0622
>
> "To err is human... to really foul up requires the root password."
>



-- 
Adam Randall
http://www.xaren.net
AIM: blitz574
Twitter: @randalla0622

"To err is human... to really foul up requires the root password."