[Ocfs-users] Node hangs when trying to create/delete file

Jeremy Schneider jer1887 at asugroup.com
Fri Mar 12 11:10:56 CST 2004


Here's a basic overview of the bug and a workaround for any DBAs or
SysAdmins reading this list.  I'm sure there will be an official fix
soon; this is just an FYI in case you run into the problem I had in the
meantime.  As soon as you install the updated ocfs-*.rpm, the problem
will go away.  (You won't even need to fsck or anything... aren't they
such nice guys?)

SYMPTOM:
  When you try to create or delete a file in a directory with more than
254 files, the process hangs indefinitely.  If you then try to kill the
process (via CTRL-C or /bin/kill), it remains stuck in the 'D'
(uninterruptible disk wait) state.
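To spot directories that are already past the limit before someone trips over them, something like the following POSIX shell sketch could be run on one node. The 254 threshold comes from the symptom above; the function name `find_crowded_dirs` is hypothetical, and counting subdirectories and dotfiles together with regular files is a conservative assumption, not something confirmed for OCFS.

```shell
#!/bin/sh
# Sketch (hypothetical helper): list directories under a root that hold more
# than 254 entries, i.e. directories where creating or deleting a file may hang.
# Assumptions: GNU or BSD find (for -mindepth/-maxdepth), and directory names
# without embedded newlines.

THRESHOLD=254

find_crowded_dirs() {
    root="$1"
    # -xdev keeps the walk on the filesystem that contains $root (one volume).
    find "$root" -xdev -type d | while IFS= read -r dir; do
        # Count direct entries only: regular files, subdirectories and dotfiles alike.
        count=$(find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' ')
        if [ "$count" -gt "$THRESHOLD" ]; then
            # Output format: <entry count><TAB><directory path>
            printf '%s\t%s\n' "$count" "$dir"
        fi
    done
}
```

For example, `find_crowded_dirs /ocfs` (mount point hypothetical) prints one line per directory that is over the limit, and nothing for directories at or below it.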

SHORT-TERM WORKAROUND:
  You need to know which directory you were trying to create (or delete)
the file in; one of the other nodes has that directory locked.  It's
real easy in a 2-node cluster: just go to that directory on the other
node and delete any file.  This releases the lock on that directory.
If there's no file you can safely delete, create one first so you can
delete it.  :)
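The short-term workaround boils down to a create followed by a delete, run on the node that holds the lock. A minimal sketch, assuming the directory has the same path on every node; the function name `release_dir_lock` and the `.ocfs_unlock.<pid>` file name are hypothetical. Creating a throwaway file first means the workaround never has to delete a real (possibly database) file.

```shell
#!/bin/sh
# Sketch (hypothetical helper): run on the OTHER node, against the directory
# the hung process was trying to create or delete a file in.

release_dir_lock() {
    dir="$1"
    # Hypothetical throwaway name; $$ (this shell's PID) keeps it unique per run.
    probe="$dir/.ocfs_unlock.$$"
    # Create a file we own, so there is always something safe to delete.
    : > "$probe" || return 1
    # Deleting a file in the directory is what releases the lock.
    rm -f "$probe"
}
```

For example, `release_dir_lock /ocfs/oradata` (path hypothetical) on the other node, while the hung process is still waiting.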

LONG-TERM WORKAROUND:
  If this happens once, it will likely happen again.  You can fix the
directory permanently so that the bug won't happen anymore, but this
requires downtime if there are any database files in that directory.
Basically, the fix is to move all the files to a new directory, then
delete the old directory and rename the new one.  Step 2 sounds kinda
weird, but it's actually the crucial step that prevents the bug: it
changes the new directory's "file_lock" value (which you check in step
3) from OCFS_DLM_ENABLE_CACHE_LOCK to OCFS_DLM_NO_LOCK.

1. Create a new directory.
2. Create a file in the new directory and /bin/cat the file from a
different node than the one where you created the directory.  Delete the
file.
3. debugocfs -D /relative/path/to/newdir/from/mountpoint/ /dev/device
-- confirm that "file_lock = OCFS_DLM_NO_LOCK"
4. /bin/mv all the datafiles to the new directory.
5. /bin/rmdir the old directory.
6. Rename the new directory to the old directory's name (e.g. with
/bin/mv).
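Because the long-term fix needs downtime, it may help to print the whole sequence and review it before the window. The sketch below only echoes commands; it runs nothing. Assumed (all hypothetical): the volume is mounted at the same path on every node, another node is reachable with `ssh`, the new directory is named `<old>.new`, the probe file is `.lockprobe`, and no path contains spaces or quotes. The `debugocfs -D` form is copied from step 3 above.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print, not run, the six-step directory fix.
# Args: mount point, directory path relative to the mount point (no leading
# or trailing slash), block device, name of another cluster node.
# Datafiles in the directory must not be in use (database shut down).

plan_dir_fix() {
    MNT="$1" REL="$2" DEV="$3" OTHER="$4"
    old="$MNT/$REL"
    new="$old.new"
    echo "mkdir '$new'"                                      # step 1
    echo "touch '$new/.lockprobe'"                           # step 2: create on this node,
    echo "ssh $OTHER cat '$new/.lockprobe'"                  #   read it from another node,
    echo "rm '$new/.lockprobe'"                              #   then delete it
    echo "debugocfs -D '/$REL.new/' $DEV | grep file_lock"   # step 3: expect OCFS_DLM_NO_LOCK
    echo "mv '$old'/* '$new'/"                               # step 4 (glob skips dotfiles)
    echo "rmdir '$old'"                                      # step 5
    echo "mv '$new' '$old'"                                  # step 6
}
```

For example, `plan_dir_fix /ocfs data/PROD /dev/sdb1 node2` (all arguments hypothetical) prints eight commands to paste one at a time, checking the `debugocfs` output before moving any datafiles.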

Happy hacking, everyone...   ;)

/js


Jeremy Schneider
Lansing, MI

>>> Sunil Mushran <Sunil.Mushran at oracle.com> 03/11/2004 9:57:03 PM >>>
Wow... I am impressed.

I still need to test it... but it looks good otherwise.

BTW, it's mainly the Oracle developers who are
responding on this list. :-)

 
<<<<...>>>>

