[Ocfs2-users] The ongoing mystery of the ocfs2 memory leak

Sun Apr 8 22:45:10 PDT 2007

John,

This may or may not help: Swap memory was being consumed on my "main" node
every day between 3 and 4 PM. Memory and swap would be completely depleted
after about 2 1/2 days and the node would crash.

I'm running a 2-node OCFS2 cluster w/Oracle RAC on 2.6.9-34.0.2.ELsmp (RHEL
4.0).

Thinking it was heavy OCFS2 file system activity, I moved 95% of the file
system activity off the node to an ext3 filesystem on a different server.
The problem persisted on the OCFS2 node in the same predictable manner.

We have a web application that connects to the database and uses a
particular config setting to remove abandoned db connections. Once I removed
that setting I stopped getting the predictable afternoon drain and the node
is more stable.

Now I'm on a slower bleed. Swap is still being consumed and apparently never
released, about 100MB/day even on non-business, low-activity days. 
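
A rough way to put a number on a bleed like this is to take timestamped
readings of swap in use (SwapTotal minus SwapFree from /proc/meminfo) and
work out the average growth per day. The sketch below is only an
illustration: the function names and the sample values are made up, and it
assumes the growth is roughly linear between the first and last sample.

```python
from datetime import datetime


def swap_growth_mb_per_day(samples):
    """Average swap growth in MB/day.

    samples: list of (datetime, swap_used_kb) tuples, oldest first.
    Only the first and last samples are used, so this is a crude
    average, not a fitted trend.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    days = (t1 - t0).total_seconds() / 86400.0
    if days <= 0:
        raise ValueError("samples must span a positive amount of time")
    return (used1 - used0) / 1024.0 / days


def days_until_swap_full(swap_total_kb, swap_used_kb, mb_per_day):
    """Days left before swap is exhausted at the given growth rate."""
    if mb_per_day <= 0:
        return float("inf")  # not growing, so no projected exhaustion
    remaining_mb = (swap_total_kb - swap_used_kb) / 1024.0
    return remaining_mb / mb_per_day


# Hypothetical readings, two days apart (values are invented).
samples = [
    (datetime(2007, 4, 1, 0, 0), 0),
    (datetime(2007, 4, 3, 0, 0), 204800),
]
rate = swap_growth_mb_per_day(samples)
```

Comparing the rate across business and non-business days would show
whether the bleed tracks workload at all.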

I'm shuffling processes and settings on my two nodes to try to isolate the
problem. I'm no longer convinced it's OCFS2 doing the leaking.

One setting I'm looking at is /proc/sys/vm/swappiness though I've read from
some other folks that it was ineffective at limiting the swapping on RHEL.

If anyone has any hints or suggestions, by all means...

-----Original Message-----
From: ocfs2-users-bounces at oss.oracle.com
[mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of John Lange
Sent: Friday, March 23, 2007 2:23 PM
To: ocfs2-users
Subject: [Ocfs2-users] The ongoing mystery of the ocfs2 memory leak

If you have been watching this list you may have seen my postings about
some kind of memory leak when using ocfs2.

This is a problem that is still not solved and I'm hoping someone on
the list can help us isolate the issue.

The circumstances are very strange. After much analysis and testing what
we have been able to figure out is that there is a 400Meg drop in memory
that happens every day between 6:45am and 7:45am. This memory is never
recovered and after about 3-4 days the node starts killing processes
(oom-killer) until it self-destructs.

Now you are probably thinking (as we were) that this is some kind of
cron that kicks in at that time and causes the problem but that is not
the case. For one thing, daily cron does not run at that time. And
secondly, we logged all processes to a file every 15 minutes and then
compared what was running before the memory loss to what was running
during and after the memory loss and there is nothing new running!

And when we analyze the slabinfo for the same period there is nothing
that is taking a corresponding (400M) jump in size during the same time
period.
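
For anyone wanting to repeat the slabinfo comparison, here is a minimal
sketch of diffing two saved snapshots. It assumes the 2.x /proc/slabinfo
layout (name, active_objs, num_objs, objsize, ...) and approximates each
cache's footprint as num_objs * objsize, which ignores per-slab overhead.
The function names are hypothetical.

```python
def parse_slabinfo(text):
    """Return {cache_name: approx_bytes} from /proc/slabinfo text.

    Assumes the 2.x column order: name active_objs num_objs objsize ...
    The size is approximated as num_objs * objsize.
    """
    sizes = {}
    for line in text.splitlines():
        # Skip the version banner and the column-header comment line.
        if line.startswith("#") or line.startswith("slabinfo"):
            continue
        cols = line.split()
        if len(cols) < 4:
            continue
        name, num_objs, objsize = cols[0], int(cols[2]), int(cols[3])
        sizes[name] = num_objs * objsize
    return sizes


def slab_growth(before_text, after_text, top=10):
    """Caches ranked by growth in bytes between two snapshots."""
    before = parse_slabinfo(before_text)
    after = parse_slabinfo(after_text)
    names = set(before) | set(after)
    deltas = {n: after.get(n, 0) - before.get(n, 0) for n in names}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Feeding it the snapshot taken just before the drop and the one just after
gives a ranked list, so a cache that grew by a few hundred MB would be at
the top.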

So where the heck is our memory going?!?

Does anyone have a clue how we can diagnose this?

Currently we are capturing vmstat, slabinfo, and full process list at 15
minute intervals. Is there anything else we could be logging?
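
One candidate to add to the capture is /proc/meminfo, saved at the same 15
minute interval. The sketch below parses two saved copies and ranks the
fields by how much they moved, which might at least show which counter
absorbs the drop. The function names are hypothetical, and the set of
fields present depends on the kernel version.

```python
def read_meminfo(path="/proc/meminfo"):
    """Read the raw text of /proc/meminfo (Linux only)."""
    with open(path) as f:
        return f.read()


def parse_meminfo(text):
    """Return {field: value} for every 'Name: <number> [kB]' line."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Ignore anything that is not a simple 'Name: number' line.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def largest_changes(before_text, after_text, top=5):
    """Fields ranked by absolute change between two snapshots."""
    before = parse_meminfo(before_text)
    after = parse_meminfo(after_text)
    deltas = {k: after[k] - before[k] for k in before if k in after}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]
```

A log line could be as simple as a timestamp followed by the output of
read_meminfo(), with largest_changes() run afterwards on the two copies
that bracket the 6:45am to 7:45am window.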

Thanks,

John Lange
