[Ocfs2-users] Two-node cluster often hanging in o2hb/jbd2

Sunil Mushran sunil.mushran at oracle.com
Tue Dec 7 09:07:34 PST 2010


Check the kernel stack of the D state processes.

cat /proc/PID/stack

The kernel stack will tell us where each process is waiting. My guess
is that the I/O stack is slow. Slow I/Os appear to users as temporary
hangs.
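
As a rough sketch of the check above, the following shell functions list
every task in D state and print its kernel stack. The function names are
made up for illustration. It assumes a procps-style ps and a kernel that
provides /proc/PID/stack; reading that file normally requires root.

```shell
#!/bin/sh
# Sketch: show the kernel stack of every task in uninterruptible sleep (D state).
# Assumptions: procps-style `ps`, /proc/PID/stack available, run as root.

# Keep only tasks whose state field starts with "D".
# Input: lines of "PID STAT COMMAND", as printed by `ps -eo pid=,stat=,comm=`.
# Output: lines of "PID COMMAND".
d_state_pids() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Print a header line and the kernel stack for each D-state task.
dump_d_stacks() {
    ps -eo pid=,stat=,comm= | d_state_pids | while read -r pid comm; do
        echo "== PID $pid ($comm) =="
        # The task may have exited or left D state by now; ignore read errors.
        cat "/proc/$pid/stack" 2>/dev/null
    done
}
```

To catch a hang in progress, call it in a loop, for example
"while sleep 1; do dump_d_stacks; done", and compare the stacks of the
o2hb and jbd2 threads across iterations.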

On 12/07/2010 07:45 AM, Jan Wielemaker wrote:
> Hi,
>
> I'm pretty new to ocfs2 and a bit stuck.  I have two Debian/Squeeze
> (testing) machines accessing an ocfs2 filesystem over aoe.  The
> filesystem sits on an lvm2 volume, but I guess that is irrelevant.
>
> Even when mostly idle, everything accessing the cluster sometimes hangs
> for about 20 seconds.  This happens rather frequently, say every 5
> minutes, but the interval seems irregular while the time that it hangs
> is quite similar.  This behavior seems pretty much independent from
> the (IO) load of the nodes (as long as not really high).
>
> I ran ps every second on both nodes, grepping for processes in state D.
> During a hang, both show this:
>
>   1649 D<    o2hb-02BC250CDB ?
>   3507 R+   ps              -
>   1649 D<    o2hb-02BC250CDB ?
>   3511 R+   ps              -
>   1649 D<    o2hb-02BC250CDB ?
>   3515 R+   ps              -
>   1649 D<    o2hb-02BC250CDB ?
>   3519 R+   ps              -
>   1649 D<    o2hb-02BC250CDB ?
>   3523 R+   ps              -
>   1649 D<    o2hb-02BC250CDB ?
>   3527 R+   ps              -
>   1649 D<    o2hb-02BC250CDB ?
>   3531 R+   ps              -
>   1649 D<    o2hb-02BC250CDB ?
>   1670 D    jbd2/dm-4-18    ?
>   3535 R+   ps              -
>   1649 D<    o2hb-02BC250CDB ?
>   1670 D    jbd2/dm-4-18    ?
>   3539 R+   ps              -
>   1649 D<    o2hb-02BC250CDB ?
>   1670 D    jbd2/dm-4-18    ?
>   3543 R+   ps              -
>
> ocfs2-tools is at version 1.4.4-3.  Kernel is version 2.6.32-5-amd64.
> The kernel log of the mount at boot is here:
>
> [   18.911452] aoe: AoE v47 initialised.
> [   19.686358] fuse init (API version 7.13)
> [   29.000017] eth2: no IPv6 routers present
> [   36.212109] aoe: 003048f28d36 e1.1 vace0 has 9767625597 sectors
> [   36.212218]  etherd/e1.1: unknown partition table
> [   59.715506] OCFS2 Node Manager 1.5.0
> [   59.732002] OCFS2 DLM 1.5.0
> [   59.733343] ocfs2: Registered cluster interface o2cb
> [   59.749185] OCFS2 DLMFS 1.5.0
> [   59.749304] OCFS2 User DLM kernel interface loaded
> [   65.347517] o2net: accepted connection from node eculture (num 1) at 130.37.193.11:7777
> [   67.884256] OCFS2 1.5.0
> [   67.886984] ocfs2_dlm: Nodes in domain ("02BC250CDB0A4B468F845C68BE99B90E"): 0 1
> [   67.890075] ocfs2: Mounting device (254,4) on (node 0, slot 0) with ordered data mode.
>
> Installation and formatting are totally standard.
>
> I've spent quite a bit of time trying to get a clue about what might be
> wrong, but so far I have failed.  Today I played a fair bit with debugfs,
> but I do not have enough experience to see what is odd.  Dumping all
> the locks showed just over 100,000 of them, which I thought might be a
> lot, but posts suggest it isn't.  There are no or very few busy (-B) locks.
>
> Checked cabling and low-level network activity.  Seems ok.
>
> Does anyone have similar experiences and/or an idea where to look?
>
> 	Thanks --- Jan
