[Ocfs2-users] problem with 2 host cluster

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Tue Sep 19 01:15:16 PDT 2006


Anyway, if you see a pattern, try to understand what happens on the
weekends. It looks as if your network gets saturated and
doesn't respond in time, OR the server gets saturated and doesn't respond.

If it is a critical system, you can try adding a network card and
interconnecting the servers directly. Run a continuous ping and
try to determine what happens.
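
One way to run such a continuous check is sketched below in Python. It is
only an illustration under assumptions: it uses a TCP connect probe rather
than an ICMP ping (so it needs no root privileges), the function names are
made up for this example, and the 0.5 second "slow" threshold is arbitrary.
The address in the usage comment is the one shown in the quoted log; port
22 is an assumption (any port with a listener on the peer would do).

```python
import socket
import time


def probe(host, port, timeout=1.0):
    """Return the TCP connect time in seconds, or None if the connect failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def monitor(host, port, interval=1.0, count=10):
    """Probe `count` times, `interval` seconds apart; return (timestamp, rtt) samples."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), probe(host, port)))
        time.sleep(interval)
    return samples


def find_outages(samples, slow=0.5):
    """Return timestamps of samples that failed or took longer than `slow` seconds."""
    return [ts for ts, rtt in samples if rtt is None or rtt > slow]


# Hypothetical usage, run over the weekend on one node against the other:
#   for ts in find_outages(monitor("10.1.1.110", 22, count=3600)):
#       print(time.ctime(ts))
```

If the reported timestamps cluster just before the o2net "idle for 10
seconds" message, the interconnect (or the peer) really did stop
answering; if they do not, the problem is more likely inside the node.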

As far as I have seen so far, OCFSv2 doesn't fail by itself without an
external reason. The reason can be harmless (such as a short network outage,
100% safe for any other application) or more serious (I had a system glitch
where the system looped on the locks during HugeTLB usage).

Look into the crontab (and the cron logs) and see which processes start. It
can be a disk scan (an IDS system, for example), statistics, or something
that opens some slow hardware, loops, and doesn't allow network traffic (or
disk traffic) to get through. Anyway, there is SOMETHING.
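
A rough sketch of how the cron log could be checked against the failure
time, again in Python. It assumes classic syslog-style lines (such as those
in /var/log/cron on Red Hat systems); the function name, the 15 minute
window, and the sample lines in the usage comment are invented for the
example, not taken from the real system.

```python
from datetime import datetime, timedelta


def cron_jobs_before(log_lines, failure, window=timedelta(minutes=15)):
    """Return (timestamp, rest_of_line) for log entries in the window before `failure`."""
    hits = []
    for line in log_lines:
        # Classic syslog timestamps ("Sep 17 22:05:01") carry no year,
        # so borrow the year from the failure time.
        try:
            stamp = datetime.strptime(
                f"{failure.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # not a syslog-formatted line
        if failure - window <= stamp <= failure:
            hits.append((stamp, line[16:].strip()))
    return hits


# Hypothetical usage with the time of the o2net message from the quoted log:
#   with open("/var/log/cron") as f:
#       for stamp, entry in cron_jobs_before(f, datetime(2006, 9, 17, 22, 6, 4)):
#           print(stamp, entry)
```

Anything that shows up in that window every weekend is a candidate for the
"SOMETHING" above.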

(The bad thing is that if there is something, there is no guarantee that
it cannot happen in the daytime. But let's search for the root cause
first.)

About fencing. I am 100% sure that it is the wrong design. Yes, a system
MUST fence itself if it sees a problem _AND_ (!) if it owns some resources.
But if the system has no locked resources, the FS is idle, and all buffers
are synchronized, then fencing can be done by _remounting_, without
rebooting the server. And (as many have noticed) even when fencing is
required, the timeouts must be much longer by default. Maybe (very likely)
there is a serious reason for your system to lose communication, but even in
this case you must be able to prevent a system reboot except when it is 100%
necessary for consistency (and that is not the case at night, when the
systems are idle and OCFS is doing nothing).



----- Original Message ----- 
From: "Andy Phillips" <Andrew.Phillips at betfair.com>
To: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
Cc: "Andrew Brunton" <andrew.brunton at ukfuels.co.uk>;
<ocfs2-users at oss.oracle.com>
Sent: Monday, September 18, 2006 10:03 AM
Subject: Re: [Ocfs2-users] problem with 2 host cluster


Alexei,

We don't do weekly backups. I've got tcpdumps showing what's happening on
the network, and I see ocfs2 packets. OCFS2 has a bonded pair of dual
gigabit interfaces between each node onto a pair of non-blocking
switches.

I'd rather not just live with it self-fencing by design. If DEC can get
this right in 1977 for VAXclusters, I'm sure Oracle can get it right now.

Cheers,
Andy

On Mon, 2006-09-18 at 09:56 -0700, Alexei_Roudnev wrote:
> Maybe the weekly backup slows down your network and causes packet loss,
> making OCFS self-fence? (Who designed this? If OCFS has no outstanding
> IO requests, it has NO reason to self-fence, even if the heartbeat
> connection is lost, until an IO request comes from the operating system
> and the heartbeat has still not recovered. And, of course, a cluster MUST
> have more than 1 heartbeat channel if it is not a kid's toy cluster. We
> use 3 channels (eth0, eth1, Serial) in a Linux heartbeat cluster and 4
> channels in a Veritas production cluster, compared with 1 in o2cb.)
>
> To be honest, I don't experience any problems with OCFSv2 on SLES9 SP3
> x86_64 build > 244 under easy load, but I will never use the current
> OCFSv2 for high-load servers because of this poor heartbeat algorithm
> (only 1 IP is possible) and because of this brainless self-fencing (it
> became funny when Oracle/ASM and OCFSv2 decided on the master node
> differently, thus rebooting all nodes at once).
>
> In reality, if you have OCFSv2, be 100% sure that it will self-fence
> from time to time. Just _by design_. Maybe the new integrated 'OCFSv2 +
> heartbeat' system (implemented in SLES10, but I cannot confirm that it is
> stable) will work better.
>
> ----- Original Message ----- 
> From: "Andy Phillips" <Andrew.Phillips at betfair.com>
> To: "Andrew Brunton" <andrew.brunton at ukfuels.co.uk>
> Cc: <ocfs2-users at oss.oracle.com>
> Sent: Monday, September 18, 2006 3:26 AM
> Subject: Re: [Ocfs2-users] problem with 2 host cluster
>
>
> Hi,
>
>    I've got _exactly_ the same problem. I've not had the time to dive
> through the source code and check it. We're on ES4.3 and ocfs-1.2.3.
>
>    For us the problem (same trace as below) was not that repeatable, and
> was possibly related to the i/o pattern.
>
>    What seems to happen is that the underlying "network services" of
> ocfs2 (o2net) believes that no packets are being sent. The tcp socket is
> surrounded by wrapper functions, one of which times when the last packet
> is received. It's this that decides the socket is dead, then closes the
> socket. Meanwhile, the upper layers (which are actually sending data
> regularly) find the carpet yanked out from underneath them, and decide
> to halt the cluster to protect the data.
>
>    Highly annoying. I expect it will be some signed 32-bit integer
> wrapping somewhere....
>
>    Andy
>
>
> On Mon, 2006-09-18 at 11:14 +0100, Andrew Brunton wrote:
> > Hi,
> >
> >
> >
> > We have 2 Dell 1850s in a cluster; both machines are running Red Hat
> > Enterprise Linux 4 AS, update 2.
> >
> >
> >
> > The boxes are connected to a Dell EMC CX300 using Emulex HBAs.
> >
> >
> >
> > The cluster is running an Oracle 10gR2 std edition RAC.
> >
> >
> >
> > We are using ocfs2 to store files generated by our application and not
> > to store anything to do with the database.
> >
> >
> >
> > We’ve been having a few problems where the servers appear to hang and
> > have to be shut down (using the power button) and then started up again.
> > This seems to be happening every weekend and I don’t really understand
> > what’s happening, or how to fix it.
> >
> >
> >
> > I’ve included an extract from messages in the hope someone can shed
> > some light on the matter.
> >
> >
> >
> > Kind regards
> >
> >
> >
> > Andrew
> >
> >
> >
> > Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection
> > to node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777 has been
> > idle for 10 seconds, shutting it down.
> >
> > Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1321 here are
> > some times that might help debug the situation: (tmr 1158527154.993223
> > now 1158527164.993090 dr 1158527154.993213 adv
> > 1158527154.993227:1158527154.993228 func (101e0528:505)
> > 1158527153.796194:1158527153.796200)
> >
> > Sep 17 22:06:04 argon2 kernel: (3854,0):o2net_set_nn_state:411 no
> > longer connected to node argon1.crewe.ukfuels.co.uk (num 0) at
> > 10.1.1.110:7777
> >
> > Sep 17 22:06:04 argon2 kernel:
> > (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -112
> >
> > Sep 17 22:06:04 argon2 kernel:
> > (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >
> > Sep 17 22:06:05 argon2 last message repeated 185 times
> >
> > Sep 17 22:06:05 argon2 kernel:
> > (26144,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >
> > Sep 17 22:06:05 argon2 last message repeated 154 times
> >
> > Sep 17 22:06:05 argon2 kernel:
> > (25274,2):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >
> > Sep 17 22:06:05 argon2 last message repeated 123 times
> >
> > Sep 17 22:06:05 argon2 kernel:
> > (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >
> > Sep 17 22:06:05 argon2 last message repeated 472 times
> >
> > Sep 17 22:06:05 argon2 kernel:
> > (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >
> > Sep 17 22:06:08 argon2 last message repeated 3239 times
> >
> > Sep 17 22:06:08 argon2 kernel:
> > (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >
> > Sep 17 22:06:08 argon2 last message repeated 118 times
> >
> > Sep 17 22:06:08 argon2 kernel:
> > (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >
> > Sep 18 08:40:32 argon2 syslogd 1.4.1: restart.
> >
> > Sep 18 08:40:32 argon2 syslog: syslogd startup succeeded
> >
> > Sep 18 08:40:32 argon2 kernel: klogd 1.4.1, log source = /proc/kmsg
> > started.
> >
> > Sep 18 08:40:32 argon2 kernel: Bootdata ok (command line is ro
> > root=LABEL=/ apic rhgb quiet)
> >
> > Sep 18 08:40:32 argon2 kernel: Linux version 2.6.9-22.0.1.ELsmp
> > (bhcompile at hs20-bc1-2.build.redhat.com) (gcc version 3.4.4 20050721
> > (Red Hat 3.4.4-2)) #1 SMP
> >
> >
> >
> > Andrew Brunton
> >
> > Senior Application Developer
> >
> > UK Fuels Limited
> >
> >
> >
> > Tel +44 (0)1270 655636
> >
> > Fax +44 (0)1270 655700
> >
> >
> >
> > andrew.brunton at ukfuels.co.uk
> >
> >
> >
> >
> >
> > ________________________________________________________________________
> > In order to protect our email recipients, Betfair use SkyScan from
> > MessageLabs to scan all Incoming and Outgoing mail for viruses.
> >
> > ________________________________________________________________________
> > _______________________________________________
> > Ocfs2-users mailing list
> > Ocfs2-users at oss.oracle.com
> > http://oss.oracle.com/mailman/listinfo/ocfs2-users
> -- 
> Andy Phillips, FRAS
> Systems Architect, Information Systems.
> Betfair.com
>
> Direct Line: 0208 834 8436
>
> Betfair Limited (Company No.5140986), Winslow Road, Hammersmith
> Embankment, London W6 9HP, United Kingdom, +44 208 834 8000, +44 208 834
> 8501 (direct). The information in this e-mail and any attachment is
> confidential, may contain legal advice protected by privilege and is
> intended only for the named recipient(s). The e-mail may not be
> disclosed or used by any person other than the addressee, nor may it be
> copied in any way. If you are not a named recipient please notify the
> sender immediately and delete any copies of this message. Any
> unauthorized copying, disclosure or distribution of the material in this
> e-mail is strictly forbidden. Any view or opinions presented are solely
> those of the author and do not necessarily represent those of the
> company.
>
>
-- 
Andy Phillips
Systems Architecture Manager, Betfair.com

Office: 0208 8348436

Betfair Limited | Winslow Road | Hammersmith Embankment | London | W6 9HP

Company No. 5140986


