[Ocfs2-users] Please urgent help required - OCFS2 and VPN again

Tue Dec 2 02:59:09 PST 2008

Hi all...

I have written to the list before about the setup I run at a
customer: DRBD8+OCFS2 on two remote sites connected via VPN.
The various suggestions helped improve the situation, but we are
still having serious trouble. We also replaced the old server with a new
and much more powerful one, but there was almost no improvement!
The situation can be summarized as follows:
SITE A: Dual Core 2GHz Pentium, 1 GB RAM, 1 SATA HDD for /, 3 SATA HDDs in
software RAID5, DRBD on /dev/md0.
SITE B: Quad Core 2.4GHz Pentium, 2 GB RAM, 3 SATA HDDs in software RAID5,
DRBD on /dev/md1.

The two sites are connected over two ADSL lines, with two bonded VPNs.

Both machines run Debian Etch fully updated, kernel 2.6.26-bpo.1-686 SMP
with deadline scheduler, DRBD 8.0.13, OCFS2 1.4.1-1. 
The shared data partition is 187G, of which 30G are used.

The recent upgrade to OCFS2 1.4 and kernel 2.6.26 didn't improve
performance as much as I expected.

The main problems we have are:
1. very high load average: this was previously caused by very high
iowait percentages, but with the new server the load is high while top
says the machine is 99-100% idle!
2. very slow directory browsing: Sunil pointed me to the user guide, where
he talks about inode stat. How can I increase the memory used for the
inode cache? I've searched several times without result... The server
currently uses less than 300 MB of RAM out of the 1 GB installed...
3. very long umount time: I often (not always) see an extremely long
umount. While the process is running, iftop shows heavy network
traffic. I suppose it's transferring file locks, but is it normal for it
to stay stuck for more than one hour and still be going?
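
On point 1: the Linux load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a high load on an otherwise idle CPU usually means something is blocked waiting rather than computing. The following is only a minimal sketch for spotting such tasks by reading /proc/<pid>/stat; the function names are made up for illustration, and it looks at processes only, not at their individual threads.

```python
import os


def parse_stat_line(line):
    """Return (pid, command, state) from one /proc/<pid>/stat line."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, command) for processes in uninterruptible sleep (D)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the entry is unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    print("load average:", os.getloadavg())
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

Running it a few times while the load is high should show whether the same processes keep sitting in state D.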

This is the configuration file of OCFS2. The quad-core is file-server-2.

#/etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 192.168.0.1
        number = 0
        name = file-server-1
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 192.168.2.31
        number = 1
        name = file-server-2
        cluster = ocfs2
cluster:
        node_count = 2
        name = ocfs2
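
For anyone comparing their own file against the listing above, here is a rough sketch that reads the stanza layout shown (a "node:" or "cluster:" header followed by indented "key = value" lines) and checks that node_count agrees with the number of node stanzas. It is not the parser the O2CB tools use; the function names are invented for the example, and lines starting with "#" are skipped only because the listing begins with one.

```python
SAMPLE = """\
#/etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 192.168.0.1
        number = 0
        name = file-server-1
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 192.168.2.31
        number = 1
        name = file-server-2
        cluster = ocfs2
cluster:
        node_count = 2
        name = ocfs2
"""


def parse_cluster_conf(text):
    """Return a list of (stanza_kind, fields_dict) in file order."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or label/comment (an assumption)
        if line.endswith(":") and "=" not in line:
            current = (line[:-1], {})  # new stanza header, e.g. "node:"
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[1][key.strip()] = value.strip()
    return stanzas


def check_node_count(stanzas):
    """Return a list of mismatch messages (empty if consistent)."""
    problems = []
    for kind, fields in stanzas:
        if kind != "cluster":
            continue
        name = fields.get("name")
        declared = int(fields.get("node_count", "0"))
        actual = sum(1 for k, f in stanzas
                     if k == "node" and f.get("cluster") == name)
        if declared != actual:
            problems.append("cluster %s: node_count = %d but %d node stanzas"
                            % (name, declared, actual))
    return problems


if __name__ == "__main__":
    print(check_node_count(parse_cluster_conf(SAMPLE)) or "consistent")
```

This only checks internal consistency of the file; it says nothing about whether the two addresses can actually reach each other across the VPN.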

What surprises me is that on file-server-2 we run an rsync backup during
the night on a local machine on the network, and it takes less than 20
minutes! Doing the same on the other server sends the load average
through the roof!

We're in a critical situation because this solution has been deployed for
a long time and is still not working as expected.
If nobody has suggestions, we have no problem paying for qualified support
to solve these problems. In that case please contact me directly.
Sunil, can I get Oracle support for this?

Thank you.
-- 
Lorenzo Milesi - lorenzo.milesi at yetopen.it

YetOpen S.r.l. - http://www.yetopen.it/
C.so E. Filiberto, 74 23900 Lecco - ITALY -
Tel 0341 220 205 - Fax 178 607 8199

GPG/PGP Key-Id: 0xE704E230 - http://keyserver.linux.it

-------- D.Lgs. 196/2003 --------

Please note that all information contained in this message is
confidential and for the exclusive use of the recipient. If you have
received this message in error, please delete it without copying it,
do not forward it to third parties, and notify us as soon as
possible.
Thank you.