[Ocfs2-users] kernel panics on sles 10 rc3

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Fri Jul 14 16:27:28 CDT 2006


I ran a few long-term tests with OCFSv2.

One used OCFSv2 as the backup area on an Oracle RAC cluster. The system
has been up for 1 - 2 months already,
without any visible problems. File traffic is, admittedly, not too heavy.

It did not work well until kernel build 255, but has been fine since. The only
problem I had was caused by an improper HugeTLB
configuration in Oracle 10.2.0.2.
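
For reference, a minimal sketch of the HugeTLB setup involved - the values
below are examples only, not our exact configuration; size vm.nr_hugepages
to the SGA, and memlock (in KB) to at least the same amount:

================
# /etc/sysctl.conf - reserve huge pages for the Oracle SGA
# (example: 2048 pages x 2 MB = 4 GB; adjust to your SGA size)
vm.nr_hugepages = 2048

# /etc/security/limits.conf - let the oracle user lock that memory
# (example value, in KB)
oracle  soft  memlock  4194304
oracle  hard  memlock  4194304
================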

So, the first experiment is positive.

================
-rw-rw----  1 oracle oinstall 7397888 2006-07-14 03:27
o1_mf_2_1826_2cgwcvqc_.arc
-rw-rw----  1 oracle oinstall 7398400 2006-07-14 06:59
o1_mf_2_1827_2ch8sn5k_.arc
-rw-rw----  1 oracle oinstall 7398400 2006-07-14 10:00
o1_mf_2_1828_2chmfk3x_.arc
-rw-rw----  1 oracle oinstall 7398400 2006-07-14 13:40
o1_mf_2_1829_2cj08zkd_.arc

/S02/lost+found:
total 8
drwxr-xr-x  2 oracle root 4096 2006-04-26 20:45 .
drwxr-xr-x  5 oracle root 4096 2006-04-26 20:45 ..
testrac11:/export/home/alex # uptime
  1:44pm  up 44 days 18:58,  1 user,  load average: 1.06, 1.16, 1.16
==================

The second experiment used OCFSv2 for document storage. It is a
'heartbeat' cluster with a Java-based SLIDE application
which stores thousands of small files (system documents in our case).

The cluster had 2 x DELL 1650 servers running SLES9 SP3, updated to kernel
and OCFS2 build # 257 (kernel number).

First attempt:
- set it all up - OK.
- installed the instance - OK.
- began application-level indexing etc. - everything works, BUT o2net
sometimes shows 50% CPU usage (in reality, 100% on 1 CPU)
when there is heavy file traffic on the second node. So far, it works faster
than NetApp NFS.

It worked this way for 2 days, then I left for the weekend.

Monday:
- the application was dead.
- node1 was dead, but it answered pings and o2cb counted it as LIVE (it was
not).

On node2, the application worked, but we had a few crashes.

Reboot showed that:
- node1 came back OK.
- node2 had its system (/) disk damaged (zeroed in a few areas).


So, the experiment ended in a big BOOM.

Attempt 2:
- rebuilt it: reinstalled node2 from the installation server and installed
the application back. (I forgot to mention - data and /usr/local were shared
on 2 different disks.)

Running - all fine, no problems.

It was left alone for the weekend (4 days, including 1 vacation day and 1
holiday).

After it:
- node1 was dead.
- node2 worked, but no OCFSv2 operation would return. OCFSv2 did not
recognize that node1 was dead and did not take over
(true, node1 was not 100% dead).
- in the end, fsck found a few inconsistencies on the OCFSv2 file system
(such as a wrong file count in a directory); a sketch of the check follows
below.
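
For anyone repeating the check, it amounts to roughly this - the mount
point and device below are placeholders, and the file system must be
unmounted on all nodes first:

================
# unmount on every node, then force a full check from one node
# (/mnt/docs and /dev/sdb1 are placeholders for your setup)
umount /mnt/docs
fsck.ocfs2 -f /dev/sdb1
================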


I needed to determine whether it was OCFS2 or Linux, so I moved the data onto
NFS. No problems at all since then.

I promised (to the OCFS team) to retest on the new version, but have not done
it yet.

In addition, I am not satisfied with the heartbeat approach in OCFSv2. A few
problems:
- it cannot use multiple interfaces (as all professional clusters can); see
the cluster.conf sketch below. Bonding does not replace multi-port
redundancy, no way.
- it has annoying self-fencing behaviour. If I have no active outstanding I/O
on the file system, I am in a _CLEAN_ state, so what is
the benefit of a _panic_? (And don't forget, by default _panic != reboot_ in
SLES - you must reconfigure it; see the sysctl sketch below.)
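
To illustrate the interface limitation: each node stanza in
/etc/ocfs2/cluster.conf takes exactly one ip_address, so there is nowhere
to declare a second heartbeat link (the node names and addresses below are
made up for illustration):

================
# hypothetical two-node cluster (example names and addresses)
node:
        ip_port = 7777
        ip_address = 192.168.1.1
        number = 0
        name = node1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.1.2
        number = 1
        name = node2
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2
================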
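And for the panic != reboot default, the reconfiguration I mean is the
standard kernel.panic sysctl (30 seconds below is just an example value):

================
# /etc/sysctl.conf - reboot N seconds after a panic instead of hanging
# (example value; 0, the default, means hang forever)
kernel.panic = 30
================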

Anyway, I will try OCFSv2 as document storage on another project, but the
first result was negative.




----- Original Message ----- 
From: "Steve Feehan" <sfeehan at gmail.com>
To: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
Cc: <ocfs2-users at oss.oracle.com>
Sent: Friday, July 14, 2006 1:26 PM
Subject: Re: [Ocfs2-users] kernel panics on sles 10 rc3


> On 7/14/06, Alexei_Roudnev <Alexei_Roudnev at exigengroup.com> wrote:
> > First of all, make o2cb dependent on iSCSI (so that it starts AFTER it
> > and STOPS before it). I recommend making sshd start BEFORE both - it
> > allows you to have emergency access to the system if you did anything
> > wrong.
>
> Hi. Thanks for the reply.
>
> The startup order for services is currently:
>
>   S09sshd
>   S13open-iscsi
>   S14o2cb
>   S15ocfs2
>
> and the shutdown order is:
>
>   K07ocfs2
>   K08o2cb
>   K09open-iscsi
>   K13sshd
>
> So I think the ordering is already as you suggest.
>
> > Second: iSCSI is very reluctant on shutdown.
> > I would rather remove the iscsi shutdown from the K* files entirely, so
> > that it never stops. You are lucky that your system did not freeze (when
> > I experimented with LVM2 on iSCSI, I had many such scenarios).
>
> I will try disabling the K*open-iscsi init script, but I don't think
> that is the problem. I can trigger the panic manually by calling
> /etc/init.d/o2cb stop (see the first transcript in my original mail).
>
> > In all other respects, such a combination works fine for me (except
> > that I was not able to make OCFSv2 work stably as document storage on
> > i386 servers).
>
> Are you saying that ocfs2 was not usable for your situation? Can you
> explain why?
>
> Thanks.
>
> -- 
> Steve Feehan
>



