[Ocfs2-users] No space left on device in one node

Tue Jan 26 07:35:36 PST 2010

Hi!

We operate a 2-node cluster running OCFS2 on top of DRBD. On both nodes, df reports about 3.6 GiB (3812388 1K-blocks) of free space and roughly 950,000 free inodes on the OCFS2 filesystem, but one node can't even write 10 MB:

df (output identical on both nodes):

  $ df -k /cluster
  Filesystem           1K-blocks      Used Available Use% Mounted on
  /dev/drbd0            83883484  80071096   3812388  96% /cluster

  $ df -i /cluster
  Filesystem             Inodes    IUsed   IFree IUse% Mounted on
  /dev/drbd0           20970871 20017778  953093   96% /cluster
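
For reference, the Available column above is in 1K-blocks. A minimal sketch of the conversion to GiB follows; the helper name kib_to_gib is only for illustration, and LC_ALL=C is set because our locale prints decimal commas:

```shell
# Illustrative helper (not part of any OCFS2 tooling): convert a count of
# 1K-blocks, as printed by "df -k", to GiB with two decimal places.
kib_to_gib() {
    # Force the C locale so the decimal separator is always a dot.
    LC_ALL=C awk -v kib="$1" 'BEGIN { printf "%.2f\n", kib / (1024 * 1024) }'
}

kib_to_gib 3812388   # Available column from the df -k output above
```

So by df's accounting there should be far more than 10 MB available on either node.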

dd test on CL1-N1 -- FAILING:

  $ dd if=/dev/zero of=`hostname`.tst bs=1M count=10
  dd: writing `cl1-n1.tst': No space left on device
  1+0 records in
  0+0 records out
  1032192 bytes (1,0 MB) copied, 1,56907 s, 658 kB/s

same dd test on CL1-N2 -- OK:

  $ dd if=/dev/zero of=`hostname`.tst bs=1M count=10
  10+0 records in
  10+0 records out
  10485760 bytes (10 MB) copied, 1,58164 s, 6,6 MB/s
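
To repeat the comparison on each node without leaving test files behind, the dd check can be wrapped roughly like this. It is only a sketch: the function name write_test and its defaults are ours, and it reports just success or failure of dd rather than the byte counts shown above.

```shell
# Illustrative helper (name and defaults are ours): try to write COUNT MiB
# of zeros into DIR, print OK or FAILED, and remove the test file afterwards.
# Usage: write_test DIR [COUNT]
write_test() {
    dir=$1
    count=${2:-10}
    file="$dir/$(uname -n).tst"
    # bs is given in bytes (1 MiB) so the call does not depend on GNU suffixes.
    if dd if=/dev/zero of="$file" bs=1048576 count="$count" 2>/dev/null; then
        echo OK
    else
        echo FAILED
    fi
    rm -f "$file"
}
```

Calling it as "write_test /cluster" on each node gives a quick per-node result that can be checked again after every change (kernel upgrade, fsck run, remount).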

We are running Debian Linux. The problem first occurred under Linux kernel 2.6.26, and based on <http://www.mail-archive.com/ocfs2-users@oss.oracle.com/msg03661.html> we hoped that a newer kernel would fix it.

Therefore we upgraded to Linux kernel 2.6.32 (using Debian package linux-image-2.6.32-trunk-amd64_2.6.32-5_amd64.deb from sid), upgraded the userland tools to ocfs2-tools 1.4.3-1, and ran fsck.ocfs2 -fy, which showed no errors. The problem still persists: one node can't write data while the other one has no problems.

  $ modinfo ocfs2
  filename:       /lib/modules/2.6.32-trunk-amd64/kernel/fs/ocfs2/ocfs2.ko
  license:        GPL
  author:         Oracle
  version:        1.5.0
  description:    OCFS2 1.5.0
  srcversion:     944B0B239B4DEBAF58A7FE1
  depends:        jbd2,ocfs2_stackglue,quota_tree,ocfs2_nodemanager
  vermagic:       2.6.32-trunk-amd64 SMP mod_unload modversions 

(Isn't the 1.5.0 version number a bit strange here?)

"fsck.ocfs2 -f" doesn't show any errors at all, and no kernel messages are logged either.

I think this is similar to bug #1167 (http://oss.oracle.com/bugzilla/show_bug.cgi?id=1167), so I added our information there as well and attached the output of the "stat_sysdir.sh" script run on the failing node.

Do you have any idea what is going wrong here?

Any workarounds?

Anything we can test to help debug this issue?

Thanks
Alex