[Ocfs2-users] Two-node cluster often hanging in o2hb/jbd2
Jan Wielemaker
J.Wielemaker at cs.vu.nl
Tue Dec 7 12:34:28 PST 2010
Dear Sunil,
On Tue, 2010-12-07 at 09:07 -0800, Sunil Mushran wrote:
> Check the kernel stack of the D state processes.
>
> cat /proc/PID/stack
>
> The kernel stack will tell us where it is waiting. My guess is that
> the io stack is slow. Slow ios appear as temporary hangs to the
> users.
Thanks. First need to sort out a packaging issue with the latest
Debian testing version that causes the System.map to be out-of-sync
with the running kernel, so there are no symbols available :-(
I doubt this is the problem. Both systems use one gigabit link
to the NAS device for aoe and are connected with another gigabit
link to a switch that links them to the outside world. There are
no errors/collisions in /proc on the network devices and in general
network performance between the systems is excellent.
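
The error and collision check mentioned above can be scripted. The following is only a minimal sketch: it assumes the usual Linux /proc/net/dev layout (two header lines, then eight receive and eight transmit counters per interface), and the function names are illustrative, not part of any existing tool.

```python
from pathlib import Path

# Counter names in the order the Linux kernel prints them in /proc/net/dev.
RX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "frame",
             "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "colls",
             "carrier", "compressed"]

# Counters that should stay at zero on a healthy link (illustrative choice).
WATCHED = ("rx_errs", "rx_drop", "tx_errs", "tx_drop", "tx_colls", "tx_carrier")


def parse_net_dev(text):
    """Return {interface: {counter_name: value}} from /proc/net/dev content."""
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        values = [int(v) for v in counters.split()]
        if len(values) < 16:
            continue  # unexpected layout, ignore this line
        rx = {"rx_" + f: v for f, v in zip(RX_FIELDS, values[:8])}
        tx = {"tx_" + f: v for f, v in zip(TX_FIELDS, values[8:16])}
        stats[name.strip()] = {**rx, **tx}
    return stats


def interfaces_with_problems(stats):
    """Return only the non-zero watched counters, per interface."""
    problems = {}
    for name, counters in stats.items():
        nonzero = {k: counters[k] for k in WATCHED if counters[k]}
        if nonzero:
            problems[name] = nonzero
    return problems


if __name__ == "__main__":
    path = Path("/proc/net/dev")
    if path.exists():
        # An empty result means no errors, drops or collisions were counted.
        print(interfaces_with_problems(parse_net_dev(path.read_text())))
```

Running this on both nodes before and after a hang, and comparing the output, would show whether any counter moves during the stall. Clean counters do not rule out a slow AoE target, they only rule out link-level errors.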
Let us wait for stack dumps and see whether that sheds new light.
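
Collecting the stack dumps could be automated along these lines. This is a sketch under assumptions: it relies on the standard /proc/PID/stat format (the process state is the field after the parenthesised command name) and on /proc/PID/stack, which normally requires root and a kernel built with stack trace support. The helper names and the proc_root parameter are made up for illustration.

```python
from pathlib import Path


def process_state(stat_text):
    """Extract (comm, state) from the content of /proc/PID/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')' characters, so split on the last ')'.
    """
    start = stat_text.index("(")
    end = stat_text.rindex(")")
    comm = stat_text[start + 1:end]
    state = stat_text[end + 1:].split()[0]
    return comm, state


def d_state_processes(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes in state D."""
    found = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            comm, state = process_state((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or stat was unreadable
        if state == "D":
            found.append((int(entry.name), comm))
    return sorted(found)


def kernel_stack(pid, proc_root="/proc"):
    """Return the content of /proc/PID/stack, or a note if it is unavailable."""
    try:
        return (Path(proc_root) / str(pid) / "stack").read_text()
    except OSError as exc:
        return f"<unavailable: {exc}>"


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(f"== {pid} {comm}")
        print(kernel_stack(pid))
```

Run in a loop during a hang, this would capture where o2hb and jbd2 are waiting at the moment they enter D state. Without matching symbols the stack output is of limited use, so the packaging issue still needs to be fixed first.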
Regards --- Jan
>
> On 12/07/2010 07:45 AM, Jan Wielemaker wrote:
> > Hi,
> >
> > I'm pretty new to ocfs2 and a bit stuck. I have two Debian/Squeeze
> > (testing) machines accessing an ocfs2 filesystem over aoe. The
> > filesystem sits on an lvm2 volume, but I guess that is irrelevant.
> >
> > Even when mostly idle, everything accessing the cluster sometimes hangs
> > for about 20 seconds. This happens rather frequently, say every 5
> > minutes, but the interval seems irregular while the time that it hangs
> > is quite similar. This behavior seems pretty much independent from
> > the (IO) load of the nodes (as long as not really high).
> >
> > I ran ps once per second on both nodes, grepping for processes in state D.
> > When hanging, both show this:
> >
> > 1649 D< o2hb-02BC250CDB ?
> > 3507 R+ ps -
> > 1649 D< o2hb-02BC250CDB ?
> > 3511 R+ ps -
> > 1649 D< o2hb-02BC250CDB ?
> > 3515 R+ ps -
> > 1649 D< o2hb-02BC250CDB ?
> > 3519 R+ ps -
> > 1649 D< o2hb-02BC250CDB ?
> > 3523 R+ ps -
> > 1649 D< o2hb-02BC250CDB ?
> > 3527 R+ ps -
> > 1649 D< o2hb-02BC250CDB ?
> > 3531 R+ ps -
> > 1649 D< o2hb-02BC250CDB ?
> > 1670 D jbd2/dm-4-18 ?
> > 3535 R+ ps -
> > 1649 D< o2hb-02BC250CDB ?
> > 1670 D jbd2/dm-4-18 ?
> > 3539 R+ ps -
> > 1649 D< o2hb-02BC250CDB ?
> > 1670 D jbd2/dm-4-18 ?
> > 3543 R+ ps -
> >
> > ocfs2-tools is at version 1.4.4-3. Kernel is version 2.6.32-5-amd64.
> > The kernel log of the mount at boot is here:
> >
> > [ 18.911452] aoe: AoE v47 initialised.
> > [ 19.686358] fuse init (API version 7.13)
> > [ 29.000017] eth2: no IPv6 routers present
> > [ 36.212109] aoe: 003048f28d36 e1.1 vace0 has 9767625597 sectors
> > [ 36.212218] etherd/e1.1: unknown partition table
> > [ 59.715506] OCFS2 Node Manager 1.5.0
> > [ 59.732002] OCFS2 DLM 1.5.0
> > [ 59.733343] ocfs2: Registered cluster interface o2cb
> > [ 59.749185] OCFS2 DLMFS 1.5.0
> > [ 59.749304] OCFS2 User DLM kernel interface loaded
> > [ 65.347517] o2net: accepted connection from node eculture (num 1) at 130.37.193.11:7777
> > [ 67.884256] OCFS2 1.5.0
> > [ 67.886984] ocfs2_dlm: Nodes in domain ("02BC250CDB0A4B468F845C68BE99B90E"): 0 1
> > [ 67.890075] ocfs2: Mounting device (254,4) on (node 0, slot 0) with ordered data mode.
> >
> > Installation and formatting are totally standard.
> >
> > I've been spending quite a bit of time getting a clue on what might be
> > wrong, but so far I have failed. Today I played a fair bit with debugfs,
> > but I do not have enough experience to see what is odd. Dumping all
> > the locks showed just over 100,000 of them, which I thought might be a
> > lot, but posts suggest it isn't. There are no, or very few, busy (-B) locks.
> >
> > Checked cabling and low-level network activity. Seems ok.
> >
> > Does anyone have similar experiences and/or an idea where to look?
> >
> > Thanks --- Jan