[Ocfs2-users] Fencing options

Angelo McComis angelo at mccomis.com
Mon Jan 18 04:57:54 PST 2010


Some updates from the problem we've been having...

Thanks to Sunil for suggesting netconsole be turned on. We've enabled
netconsole, such that we've set it up on the ocfs2 cluster members,
with them reporting logs to a server on the same subnet that's outside
of the cluster. The logs are there, but nothing related to ocfs2 shows
up after the reboots. A case-insensitive grep for o2hb, o2cb, ocfs2,
etc. finds nothing. Googling, I found a reference to setting
sysctl -w kernel.printk="7 4 1 7"

but Novell's suggestions (syslog entries on the receiver side, and
/etc/modprobe.conf.local and /etc/sysconfig/kernel on the sending side)
were pretty generic.
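
For context, the sending-side setup comes down to loading the netconsole
module with one parameter string, after making sure the console loglevel
is high enough for the messages to be emitted at all. A minimal sketch,
assuming placeholder addresses (192.0.2.x), interface eth0, a made-up MAC,
and UDP port 6666 on both ends; none of these values come from our
cluster, the function names are invented for illustration, and the script
only prints the commands rather than running them:

```shell
#!/bin/sh
# Sketch only: every address, the interface name, and the MAC are placeholders.
# Builds the netconsole module parameter in the kernel's documented form:
#   netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
netconsole_param() {
    # $1=src-port $2=src-ip $3=dev $4=dst-port $5=dst-ip $6=dst-mac
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Print (do not execute) the commands for a sending node.
print_netconsole_setup() {
    # Console loglevel 7 lets messages up to KERN_INFO reach console
    # drivers, netconsole included.
    echo 'sysctl -w kernel.printk="7 4 1 7"'
    echo "modprobe netconsole $(netconsole_param "$@")"
}

print_netconsole_setup 6666 192.0.2.10 eth0 6666 192.0.2.50 00:11:22:33:44:55
```

The first field of kernel.printk is the console loglevel; if it is left
low, kernel messages below that priority never reach netconsole, whatever
the receiver is doing.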

What we've done so far:

- Mount options:  added nointr, noatime, datavolume   (removed "defaults")
- multipath.conf: added one (we had been running without a multipath.conf,
which means all the dm-multipath defaults were in use)
- O2CB_HEARTBEAT_THRESHOLD: set it to 76 (was running default of 31)
- Turned on netconsole (but it's not telling us anything useful yet)
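
On the threshold change: per the OCFS2 documentation, O2CB heartbeats to
disk every 2 seconds, and a node is considered dead after
(threshold - 1) * 2 seconds without a successful heartbeat. A small
sketch of that arithmetic (the function name is made up for
illustration):

```shell
#!/bin/sh
# Convert an O2CB_HEARTBEAT_THRESHOLD value into the disk heartbeat
# timeout in seconds, using timeout = (threshold - 1) * 2.
hb_timeout_secs() {
    echo $(( ($1 - 1) * 2 ))
}

hb_timeout_secs 31   # default threshold -> 60
hb_timeout_secs 76   # the value we set  -> 150
```

So going from 31 to 76 stretches the window from 60 seconds to 150
seconds before a node that cannot reach its storage fences itself.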

I know Sunil suggested that the logging would let us get to the bottom
of the fencing once and for all, but the above set of changes was
"best practice" enough to go ahead with, even without the specifics we
might learn from the logs.

Once we pushed the above 4 items to our non-prod cluster, it
stabilized immediately.  However, in another datacenter, we have the
same setup (six node cluster for prod, and a six node nonprod
cluster), and it's not having the same problems at all, running all
the defaults.  Saturday during our maintenance, we pushed these
changes to our prod cluster and have seen no issues since.

I tend to believe Sunil's assertion that this is storage related, and
our storage environment is getting better all the time, but I'd really
like to understand this better before I tag them as the cause.

We have backed out the "good" changes from non prod in the hope that we
would start catching log entries from ocfs2/o2hb/o2cb/etc. So far we've
seen a couple of fencing operations, but no log entries that are
helpful yet.
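
A search along these lines would cover the strings mentioned above on
the log server. This is a sketch, assuming the netconsole messages land
in a plain-text file; the function name and the path in the comment are
placeholders, not our actual setup:

```shell
#!/bin/sh
# Case-insensitive filter for the OCFS2 cluster stack strings.
# Reads the files given as arguments, or stdin when none are given.
ocfs2_log_filter() {
    grep -iE 'o2hb|o2cb|ocfs2' "$@"
}

# Example (hypothetical path):
#   ocfs2_log_filter /var/log/netconsole.log
```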

So, technically we have some stabilization, but still no
instrumentation around it.

Any ideas what we're missing on netconsole to close the circle? I
believe we can get

Angelo

On Wed, Jan 13, 2010 at 3:46 PM, Sunil Mushran <sunil.mushran at oracle.com> wrote:
> Do you have netconsole output? We have to determine the
> reason for the fencing before we can recommend any changes.
>
> Angelo McComis wrote:
>>
>> Some more about my setup, which started the discussion...
>>
>> Version info, mount options, etc. are herein.
>>
>> If there are recommended changes to this, I'm open to suggestions
>> here. This is mostly an "out of the box" configuration.
>>
>> We are not running Oracle DB, just using this for a shared place for
>> transaction files between application servers doing parallel
>> processing.
>>
>> So - Do we want the mount "datavolume, noatime" added to just _netdev
>> and heartbeat=local?  Will that help or hurt?  Also, do we want to
>> turn up the number of HEARTBEAT_THRESHOLD?
>>
>>
>>
>> BEERGOGGLES1:~# modinfo ocfs2
>> filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/ocfs2.ko
>> license:        GPL
>> author:         Oracle
>> version:        1.4.1-1-SLES
>> description:    OCFS2 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build
>> f922955d99ef972235bd0c1fc236c5ddbb368611)
>> srcversion:     986DD1EE4F5ABD8A44FF925
>> depends:        ocfs2_dlm,jbd,ocfs2_nodemanager
>> supported:      yes
>> vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1
>>
>> BEERGOGGLES1:~# modinfo ocfs2_dlm
>> filename:
>> /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
>> license:        GPL
>> author:         Oracle
>> version:        1.4.1-1-SLES
>> description:    OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008
>> (build f922955d99ef972235bd0c1fc236c5ddbb368611)
>> srcversion:     FDB660B2EB59EF106C6305F
>> depends:        ocfs2_nodemanager
>> supported:      yes
>> vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1
>> parm:           dlm_purge_interval_ms:int
>> parm:           dlm_purge_locks_max:int
>>
>> BEERGOGGLES1:~# modinfo jbd
>> filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/jbd/jbd.ko
>> license:        GPL
>> srcversion:     DCCDE02902B83F98EF81090
>> depends:
>> supported:      yes
>> vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1
>>
>> BEERGOGGLES1:~# modinfo ocfs2_nodemanager
>> filename:
>> /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
>> license:        GPL
>> author:         Oracle
>> version:        1.4.1-1-SLES
>> description:    OCFS2 Node Manager 1.4.1-1-SLES Wed Jul 23 18:33:42
>> UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611)
>> srcversion:     B87371708A8B5E1828E14CD
>> depends:        configfs
>> supported:      yes
>> vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1
>>
>> BEERGOGGLES1:~# /etc/init.d/o2cb status
>> Module "configfs": Loaded
>> Filesystem "configfs": Mounted
>> Module "ocfs2_nodemanager": Loaded
>> Module "ocfs2_dlm": Loaded
>> Module "ocfs2_dlmfs": Loaded
>> Filesystem "ocfs2_dlmfs": Mounted
>> Checking O2CB cluster ocfs2: Online
>> Heartbeat dead threshold = 31
>>  Network idle timeout: 30000
>>  Network keepalive delay: 2000
>>  Network reconnect delay: 2000
>> Checking O2CB heartbeat: Active
>>
>> BEERGOGGLES1:~# mount | grep ocfs2
>> ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
>> /dev/evms/prod_app on /opt/VendorApsp/sharedapp type ocfs2
>> (rw,_netdev,heartbeat=local)
>>
>> BEERGOGGLES1:~# cat /etc/sysconfig/o2cb
>> #
>> # This is a configuration file for automatic startup of the O2CB
>> # driver.  It is generated by running /etc/init.d/o2cb configure.
>> # On Debian based systems the preferred method is running
>> # 'dpkg-reconfigure ocfs2-tools'.
>> #
>>
>> # O2CB_ENABLED: 'true' means to load the driver on boot.
>> O2CB_ENABLED=true
>>
>> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
>> O2CB_BOOTCLUSTER=ocfs2
>>
>> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
>> O2CB_HEARTBEAT_THRESHOLD=
>>
>> # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is
>> # considered dead.
>> O2CB_IDLE_TIMEOUT_MS=
>>
>> # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is
>> # sent
>> O2CB_KEEPALIVE_DELAY_MS=
>>
>> # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
>> O2CB_RECONNECT_DELAY_MS=
>>
>> # O2CB_HEARTBEAT_MODE: Whether to use the native "kernel" or the "user"
>> # driven heartbeat (for example, for integration with heartbeat 2.0.x)
>> O2CB_HEARTBEAT_MODE="kernel"
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> Ocfs2-users at oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>
>
>