[Ocfs2-users] OCFS2, NFS and random Stale NFS file handles

Adam Randall randalla at gmail.com
Wed Jul 17 07:47:00 PDT 2013


It seems my changes to the exports had no effect. I awoke to four errors from
my processing engine. All of them came from the same server, which makes me
curious. I've turned that one off and will see what happens.


On Tue, Jul 16, 2013 at 11:22 PM, Adam Randall <randalla at gmail.com> wrote:

> I've been doing more digging, and I've changed some of the configuration:
>
> 1) I've changed my nfs mount options to this:
>
> 192.168.0.160:/mnt/storage                 /mnt/i2xstorage   nfs
>  defaults,nosuid,noexec,noatime,nodiratime        0 0
>
> 2) I've changed the /etc/exports for /mnt/storage to this:
>
>      /mnt/storage -rw,sync,subtree_check,no_root_squash @trusted
>
> In #1, I've removed nodev, which I think I accidentally copied over from a
> tmpfs mount point above it when I originally set up the nfs mount point so
> long ago. Additionally, I added nodiratime. In #2, it used to be
> -rw,async,no_subtree_check,no_root_squash. I suspect the async option may be
> causing what I'm seeing, and subtree_check should be okay for testing.
>
> Hopefully, this will have an effect.
>
> Adam.
>
>
> On Tue, Jul 16, 2013 at 9:44 PM, Adam Randall <randalla at gmail.com> wrote:
>
>>
>> Here's various outputs:
>>
>> # grep nfs /etc/mtab:
>> rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
>> 192.168.0.160:/var/log/dms /mnt/dmslogs nfs
>> rw,noexec,nosuid,nodev,noatime,vers=4,addr=192.168.0.160,clientaddr=192.168.0.150
>> 0 0
>> 192.168.0.160:/mnt/storage /mnt/storage nfs
>> rw,noexec,nosuid,nodev,noatime,vers=4,addr=192.168.0.160,clientaddr=192.168.0.150
>> 0 0
>> # grep nfs /proc/mounts:
>> rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
>> 192.168.0.160:/var/log/dms /mnt/dmslogs nfs4
>> rw,nosuid,nodev,noexec,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.150,local_lock=none,addr=192.168.0.160
>> 0 0
>> 192.168.0.160:/mnt/storage /mnt/storage nfs4
>> rw,nosuid,nodev,noexec,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.150,local_lock=none,addr=192.168.0.160
>> 0 0
>>
>> Also, the output of df -hT | grep nfs:
>> 192.168.0.160:/var/log/dms nfs       273G  5.6G  253G   3% /mnt/dmslogs
>> 192.168.0.160:/mnt/storage nfs       2.8T  1.8T  986G  65% /mnt/storage
>>
>> From the looks of it, I'm running nfs version 4 (though I thought that I
>> was running version 3, hrm...).
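>>
>> (If I need to pin the version to test, I believe adding vers=3 to the
>> mount options would force NFSv3, e.g. in fstab:
>>
>> 192.168.0.160:/mnt/storage  /mnt/storage  nfs  defaults,nosuid,noexec,noatime,vers=3  0 0
>>
>> though I haven't tried that yet.)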
>>
>> With regards to the ls -lid, one of the directories that wasn't altered, but for whatever reason was not accessible due to a stale handle, is this:
>>
>> # ls -lid /mnt/storage/reports/5306
>> 185862043 drwxrwxrwx 4 1095 users 45056 Jul 15 21:37 /mnt/storage/reports/5306
>>
>> In the directory where we create new documents (a folder is created for each document; a legacy decision), it looks something like this:
>>
>> # ls -lid /mnt/storage/dms/documents/819/* | head -n 10
>> 290518712 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39 /mnt/storage/dms/documents/819/8191174
>> 290518714 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39 /mnt/storage/dms/documents/819/8191175
>> 290518716 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39 /mnt/storage/dms/documents/819/8191176
>> 290518718 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39 /mnt/storage/dms/documents/819/8191177
>> 290518720 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:39 /mnt/storage/dms/documents/819/8191178
>> 290518722 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:40 /mnt/storage/dms/documents/819/8191179
>> 290518724 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:40 /mnt/storage/dms/documents/819/8191180
>> 290518726 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:47 /mnt/storage/dms/documents/819/8191181
>> 290518728 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:50 /mnt/storage/dms/documents/819/8191182
>> 290518730 drwxrwxrwx 2 nobody nobody 3896 Jul 16 18:52 /mnt/storage/dms/documents/819/8191183
>>
>> The stale handles seem to appear more when there's load on the system, though the correlation isn't strong. I received notice of two failures (both from the same server) tonight, as seen here:
>>
>> Jul 16 19:27:40 imaging4 php: Output of: ls -l /mnt/storage/dms/documents/819/8191226/ 2>&1:
>> Jul 16 19:27:40 imaging4 php:    ls: cannot access /mnt/storage/dms/documents/819/8191226/: Stale NFS file handle
>> Jul 16 19:44:15 imaging4 php: Output of: ls -l /mnt/storage/dms/documents/819/8191228/ 2>&1:
>> Jul 16 19:44:15 imaging4 php:    ls: cannot access /mnt/storage/dms/documents/819/8191228/: Stale NFS file handle
>>
>> The above is logged by my e-mail collection daemon, which is written in PHP. When it can't access a directory that was just created, it uses syslog() to write out the information above.
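>>
>> As a possible stopgap (just a sketch; the daemon doesn't do this today),
>> a short retry before treating the failure as fatal would probably paper
>> over the transient cases:
>>
>> #!/bin/sh
>> # retry-ls.sh: retry a listing a few times before giving up on a
>> # directory that comes back with a stale NFS handle.
>> dir="$1"
>> for i in 1 2 3 4 5; do
>>     ls -l "$dir" >/dev/null 2>&1 && exit 0
>>     sleep 1
>> done
>> echo "still stale after 5 tries: $dir" >&2
>> exit 1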
>>
>> From the same server, doing ls -lid I get these for those two directories:
>>
>> 290518819 drwxrwxrwx 2 nobody nobody 3896 Jul 16 19:44 /mnt/storage/dms/documents/819/8191228
>> 290518816 drwxrwxrwx 2 nobody nobody 3896 Jul 16 19:27 /mnt/storage/dms/documents/819/8191226
>>
>> Running stat on the directories showed that the modify times correspond to the log entries above:
>>
>> Modify: 2013-07-16 19:27:40.786142391 -0700
>> Modify: 2013-07-16 19:44:15.458250738 -0700
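>>
>> (That's from running something like "stat /mnt/storage/dms/documents/819/8191226 | grep Modify" on each directory.)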
>>
>> Between when it happened and when I got back to look, the stale handle had cleared itself.
>>
>> If it's at all relevant, this is the fstab:
>>
>> 192.168.0.160:/var/log/dms                 /mnt/dmslogs      nfs defaults,nodev,nosuid,noexec,noatime            0 0
>> 192.168.0.160:/mnt/storage                 /mnt/storage      nfs defaults,nodev,nosuid,noexec,noatime            0 0
>>
>> Lastly, in a fit of grasping at straws, I unmounted the ocfs2 partition on the secondary server and stopped the ocfs2 service. I was thinking that maybe having it in master/master mode could cause what I was seeing. Alas, that's not the case, as the above errors came after I did that.
>>
>> Is there anything else that I can provide that might be of help?
>>
>> Adam.
>>
>>
>>
>> On Tue, Jul 16, 2013 at 5:15 PM, Patrick J. LoPresti <lopresti at gmail.com> wrote:
>>
>>> What version is the NFS mount? ("cat /proc/mounts" on the NFS client)
>>>
>>> NFSv2 only allowed 64 bits in the file handle. With the
>>> "subtree_check" option on the NFS server, 32 of those bits are used
>>> for the subtree check, leaving only 32 for the inode. (This is from
>>> memory; I may have the exact numbers wrong. But the principle
>>> applies.)
>>>
>>> See <https://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html#NFS>
>>>
>>> If you run "ls -lid <directory>" for directories that work and those
>>> that fail, and you find that the failing directories all have huge
>>> inode numbers, that will help confirm that this is the problem.
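>>>
>>> For example, with GNU find you could sweep the whole mount for inode
>>> numbers that don't fit in 32 bits (illustrative only; adjust the path
>>> to your mount):
>>>
>>> find /mnt/storage -printf '%i %p\n' | awk '$1 > 4294967295'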
>>>
>>> Also if you are using NFSv2 and switch to v3 or set the
>>> "no_subtree_check" option and it fixes the problem, that will also
>>> help confirm that this is the problem. :-)
>>>
>>>  - Pat
>>>
>>>
>>> On Tue, Jul 16, 2013 at 5:07 PM, Adam Randall <randalla at gmail.com> wrote:
>>> > Please forgive my lack of experience, but I've just recently started
>>> > deeply working with ocfs2 and am not familiar with all its caveats.
>>> >
>>> > We've just deployed two servers that have SAN arrays attached to them.
>>> > These arrays are synchronized with DRBD in master/master mode, with
>>> > ocfs2 configured on top of that. In all my testing everything worked
>>> > well, except for an issue with symbolic links throwing an exception in
>>> > the kernel (this was fixed by applying a patch I found here:
>>> > comments.gmane.org/gmane.comp.file-systems.ocfs2.devel/8008). Of these
>>> > machines, one of them is designated the master and the other is its
>>> > backup.
>>> >
>>> > Host is Gentoo Linux running kernel 3.8.13.
>>> >
>>> > I have four other machines that are connecting to the master ocfs2
>>> > partition using nfs. The problem I'm having is that on these machines,
>>> > I'm randomly getting read errors while trying to enter directories over
>>> > nfs. In all of these cases, except one, these directories are
>>> > immediately unavailable after they are created. The error that comes
>>> > back is always something like this:
>>> >
>>> > ls: cannot access /mnt/storage/documents/818/8189794/: Stale NFS file handle
>>> >
>>> > The mount point is /mnt/storage. Other directories on the mount are
>>> > available, and on other servers the same directory can be accessed
>>> > perfectly fine.
>>> >
>>> > I haven't been able to reproduce this issue in isolated testing.
>>> >
>>> > The four machines that connect via NFS are doing one of two things:
>>> >
>>> > 1) processing e-mail through a php-driven daemon (read and write,
>>> > creating directories)
>>> > 2) serving report files in PDF format over the web via a php web
>>> > application (read only)
>>> >
>>> > I believe that the ocfs2 version is 1.5. I found this in the kernel
>>> > source itself, but haven't figured out how to determine this in the
>>> > shell. ocfs2-tools is version 1.8.2, which is what ocfs2 wanted (maybe
>>> > this is ocfs2 1.8 then?).
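>>> >
>>> > (Presumably something like "modinfo ocfs2 | grep -i version" would show
>>> > the module version, assuming ocfs2 is built as a module rather than
>>> > compiled into the kernel.)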
>>> >
>>> > The only other path I can think to take is to abandon OCFS2 and use
>>> > DRBD in master/slave mode with ext4 on top of that. This would still
>>> > provide me with the redundancy I want, but at the cost of not being
>>> > able to use both machines simultaneously.
>>> >
>>> > If anyone has any advice, I'd love to hear it.
>>> >
>>> > Thanks in advance,
>>> >
>>> > Adam.
>>> >
>>>
>>
>>
>>
>>
>
>
>
>



-- 
Adam Randall
http://www.xaren.net
AIM: blitz574
Twitter: @randalla0622

"To err is human... to really foul up requires the root password."