[Ocfs2-users] Getting Closer (was: Fencing options)

Angelo McComis angelo at mccomis.com
Mon Jan 18 11:42:20 PST 2010


One more follow on,

The combination of kernel.panic=60 and kernel.printk="7 4 1 7" seems to
have netted the culprit:

E01-netconsole.log:Jan 18 09:45:10 E01 (10,0):o2hb_write_timeout:137
ERROR: Heartbeat write timeout to device dm-12 after 60000
milliseconds
E01-netconsole.log:Jan 18 09:45:10 E01
(10,0):o2hb_stop_all_regions:1517 ERROR: stopping heartbeat on all
active regions.
E01-netconsole.log:Jan 18 09:45:10 E01 ocfs2 is very sorry to be
fencing this system by restarting
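
As a sanity check, that 60000 ms figure lines up with the default
heartbeat threshold, assuming the usual formula (dead timeout =
(O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds):

```shell
# o2hb dead timeout in ms = (threshold - 1) * 2000, so the 60000 ms in
# the fence message matches the stock threshold of 31:
for threshold in 31 61 76; do
  echo "threshold=$threshold -> $(( (threshold - 1) * 2000 )) ms"
done
```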

dm-12 maps to my evms volume...

iostat for dm-12 doesn't indicate that it's overly taxed.
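
For anyone repeating this, the two settings can be made persistent
roughly like so (a sketch, assuming a SLES-style /etc/sysctl.conf):

```shell
# Keep console logging at debug level so the o2hb/o2cb messages reach
# netconsole, and delay the post-panic reboot by 60s so they can flush.
cat >> /etc/sysctl.conf <<'EOF'
kernel.printk = 7 4 1 7
kernel.panic = 60
EOF
sysctl -p    # or one-shot: sysctl -w kernel.printk="7 4 1 7" kernel.panic=60
```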

Can we get some ideas from the info provided?

Thanks,

Angelo





On Mon, Jan 18, 2010 at 7:57 AM, Angelo McComis <angelo at mccomis.com> wrote:
> Some updates from the problem we've been having...
>
> Thanks to Sunil for suggesting netconsole be turned on. We've enabled
> netconsole, such that we've set it up on the ocfs2 cluster members,
> with them reporting logs to a server on the same subnet that's outside
> of the cluster. The logs are there, but nothing related to ocfs2 after
> the reboots.  A case-insensitive grep for o2hb, o2cb, ocfs2, etc.
> turns up nothing.  Googling, I noted a reference to running
> sysctl -w kernel.printk="7 4 1 7"
>
> but Novell's suggestion (syslog entries on the receiver side, and
> etc/modprobe.conf.local and etc/sysconfig/kernel on the sending side)
> were pretty generic.
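
A sketch of what that SLES-side netconsole setup can look like (all
addresses below are placeholders, not our real ones):

```shell
# /etc/modprobe.conf.local -- sender side; the module parameter syntax is
#   netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
options netconsole netconsole=6665@10.0.0.11/eth0,6666@10.0.0.99/00:11:22:33:44:55

# /etc/sysconfig/kernel -- make sure the module loads at boot
MODULES_LOADED_ON_BOOT="netconsole"
```

On the receiving server, something like `nc -u -l -p 6666 >> E01-netconsole.log`
(or a syslog-ng UDP source) captures the stream.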
>
> What we've done so far:
>
> - Mount options:  added nointr, noatime, datavolume   (removed "defaults")
> - Multipath.conf: added it (we were running without a multipath.conf
> which means use all dm- defaults)
> - O2CB_HEARTBEAT_THRESHOLD: set it to 76 (was running default of 31)
> - Turned on netconsole (but it's not telling us anything useful yet)
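
Concretely, the first and third changes amount to something like this
(device and mountpoint taken from the mount output quoted below; a
sketch, not our literal files):

```shell
# /etc/fstab -- ocfs2 mount with the added options, "defaults" removed
/dev/evms/prod_app  /opt/VendorApsp/sharedapp  ocfs2  _netdev,nointr,noatime,datavolume,heartbeat=local  0 0

# /etc/sysconfig/o2cb -- raise the heartbeat dead threshold from 31 to 76
O2CB_HEARTBEAT_THRESHOLD=76
```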
>
> I know Sunil suggested that we can get to the bottom of the fencing
> once and for all with the logging, but the above set of changes were
> "best practice" enough to ahead with those even minus the specifics we
> might get from what we'd learn from the logs.
>
> Once we pushed the above 4 items to our non-prod cluster, it
> stabilized immediately.  However, in another datacenter, we have the
> same setup (six node cluster for prod, and a six node nonprod
> cluster), and it's not having the same problems at all, running all
> the defaults.  Saturday during our maintenance, we pushed these
> changes to our prod cluster and have seen no issues since.
>
> I tend to believe Sunil's assertion that this is storage related, and
> our storage environment is getting better all the time, but I'd really
> like to understand this better before I tag them as the cause.
>
> We have backed out the "good" changes from non-prod in hopes we would
> start catching log entries from ocfs2/o2hb/o2cb/etc.  So far we've
> seen a couple of fencing operations, but no helpful log entries yet.
>
> So, technically we have some stabilization, but still no
> instrumentation around it.
>
> Any ideas what we're missing on netconsole to close the circle? I
> believe we can get
>
> Angelo
>
> On Wed, Jan 13, 2010 at 3:46 PM, Sunil Mushran <sunil.mushran at oracle.com> wrote:
>> Do you have netconsole output? We have to determine the
>> reason for the fencing before we can recommend any changes.
>>
>> Angelo McComis wrote:
>>>
>>> Some more about my setup, which started the discussion...
>>>
>>> Version info, mount options, etc. are herein.
>>>
>>> If there are recommended changes to this, I'm open to suggestions
>>> here. This is mostly an "out of the box" configuration.
>>>
>>> We are not running Oracle DB, just using this for a shared place for
>>> transaction files between application servers doing parallel
>>> processing.
>>>
>>> So: do we want the mount options "datavolume,noatime" added to the
>>> existing _netdev and heartbeat=local?  Will that help or hurt?
>>> Also, do we want to raise HEARTBEAT_THRESHOLD?
>>>
>>>
>>>
>>> BEERGOGGLES1:~# modinfo ocfs2
>>> filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/ocfs2.ko
>>> license:        GPL
>>> author:         Oracle
>>> version:        1.4.1-1-SLES
>>> description:    OCFS2 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build
>>> f922955d99ef972235bd0c1fc236c5ddbb368611)
>>> srcversion:     986DD1EE4F5ABD8A44FF925
>>> depends:        ocfs2_dlm,jbd,ocfs2_nodemanager
>>> supported:      yes
>>> vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1
>>>
>>> BEERGOGGLES1:~# modinfo ocfs2_dlm
>>> filename:
>>> /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
>>> license:        GPL
>>> author:         Oracle
>>> version:        1.4.1-1-SLES
>>> description:    OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008
>>> (build f922955d99ef972235bd0c1fc236c5ddbb368611)
>>> srcversion:     FDB660B2EB59EF106C6305F
>>> depends:        ocfs2_nodemanager
>>> supported:      yes
>>> vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1
>>> parm:           dlm_purge_interval_ms:int
>>> parm:           dlm_purge_locks_max:int
>>>
>>> BEERGOGGLES1:~# modinfo jbd
>>> filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/jbd/jbd.ko
>>> license:        GPL
>>> srcversion:     DCCDE02902B83F98EF81090
>>> depends:
>>> supported:      yes
>>> vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1
>>>
>>> BEERGOGGLES1:~# modinfo ocfs2_nodemanager
>>> filename:
>>>
>>> /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
>>> license:        GPL
>>> author:         Oracle
>>> version:        1.4.1-1-SLES
>>> description:    OCFS2 Node Manager 1.4.1-1-SLES Wed Jul 23 18:33:42
>>> UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611)
>>> srcversion:     B87371708A8B5E1828E14CD
>>> depends:        configfs
>>> supported:      yes
>>> vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1
>>>
>>> BEERGOGGLES1:~# /etc/init.d/o2cb status
>>> Module "configfs": Loaded
>>> Filesystem "configfs": Mounted
>>> Module "ocfs2_nodemanager": Loaded
>>> Module "ocfs2_dlm": Loaded
>>> Module "ocfs2_dlmfs": Loaded
>>> Filesystem "ocfs2_dlmfs": Mounted
>>> Checking O2CB cluster ocfs2: Online
>>> Heartbeat dead threshold = 31
>>>  Network idle timeout: 30000
>>>  Network keepalive delay: 2000
>>>  Network reconnect delay: 2000
>>> Checking O2CB heartbeat: Active
>>>
>>> BEERGOGGLES1:~# mount | grep ocfs2
>>> ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
>>> /dev/evms/prod_app on /opt/VendorApsp/sharedapp type ocfs2
>>> (rw,_netdev,heartbeat=local)
>>>
>>> BEERGOGGLES1:~# cat /etc/sysconfig/o2cb
>>> #
>>> # This is a configuration file for automatic startup of the O2CB
>>> # driver.  It is generated by running /etc/init.d/o2cb configure.
>>> # On Debian based systems the preferred method is running
>>> # 'dpkg-reconfigure ocfs2-tools'.
>>> #
>>>
>>> # O2CB_ENABLED: 'true' means to load the driver on boot.
>>> O2CB_ENABLED=true
>>>
>>> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
>>> O2CB_BOOTCLUSTER=ocfs2
>>>
>>> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
>>> O2CB_HEARTBEAT_THRESHOLD=
>>>
>>> # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is
>>> # considered dead.
>>> O2CB_IDLE_TIMEOUT_MS=
>>>
>>> # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet
>>> # is sent
>>> O2CB_KEEPALIVE_DELAY_MS=
>>>
>>> # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
>>> O2CB_RECONNECT_DELAY_MS=
>>>
>>> # O2CB_HEARTBEAT_MODE: Whether to use the native "kernel" or the "user"
>>> # driven heartbeat (for example, for integration with heartbeat 2.0.x)
>>> O2CB_HEARTBEAT_MODE="kernel"
>>>
>>> _______________________________________________
>>> Ocfs2-users mailing list
>>> Ocfs2-users at oss.oracle.com
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>
>>
>>
>


