[Ocfs2-users] Two-node cluster often hanging in o2hb/jbd2

Jan Wielemaker J.Wielemaker at cs.vu.nl
Tue Dec 7 07:45:23 PST 2010


Hi,

I'm pretty new to ocfs2 and a bit stuck.  I have two Debian/Squeeze
(testing) machines accessing an ocfs2 filesystem over aoe.  The
filesystem sits on an lvm2 volume, but I guess that is irrelevant.

Even when mostly idle, everything accessing the cluster sometimes hangs
for about 20 seconds.  This happens rather frequently, roughly every 5
minutes; the interval is irregular, but the duration of each hang is
quite consistent.  The behavior seems largely independent of the (IO)
load on the nodes, as long as that load is not really high.

On both nodes I ran ps once per second and filtered for processes in
state D (uninterruptible sleep).  During a hang, both nodes show this:

 1649 D<   o2hb-02BC250CDB ?
 3507 R+   ps              -
 1649 D<   o2hb-02BC250CDB ?
 3511 R+   ps              -
 1649 D<   o2hb-02BC250CDB ?
 3515 R+   ps              -
 1649 D<   o2hb-02BC250CDB ?
 3519 R+   ps              -
 1649 D<   o2hb-02BC250CDB ?
 3523 R+   ps              -
 1649 D<   o2hb-02BC250CDB ?
 3527 R+   ps              -
 1649 D<   o2hb-02BC250CDB ?
 3531 R+   ps              -
 1649 D<   o2hb-02BC250CDB ?
 1670 D    jbd2/dm-4-18    ?
 3535 R+   ps              -
 1649 D<   o2hb-02BC250CDB ?
 1670 D    jbd2/dm-4-18    ?
 3539 R+   ps              -
 1649 D<   o2hb-02BC250CDB ?
 1670 D    jbd2/dm-4-18    ?
 3543 R+   ps              -
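
For anyone who wants to reproduce this kind of polling, here is a minimal
sketch in Python.  It is not the exact command used above; it assumes a
procps-style ps that accepts repeated -o options, and the function names
uninterruptible() and poll() are made up for illustration.

```python
import subprocess
import time


def uninterruptible(ps_output):
    """Return (pid, stat, rest) for every line whose STAT starts with 'D'.

    Expects lines shaped like ' 1649 D<   o2hb-02BC250CDB ?', i.e. pid,
    state, then whatever other columns were requested from ps.
    """
    rows = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip headers and blank lines
        if fields[1].startswith("D"):
            rest = fields[2].strip() if len(fields) > 2 else ""
            rows.append((int(fields[0]), fields[1], rest))
    return rows


def poll(interval=1.0, rounds=60):
    """Once per interval, print a timestamped line per D-state process."""
    for _ in range(rounds):
        out = subprocess.run(
            # Empty headers ("pid=") suppress the ps header line.
            ["ps", "-e", "-o", "pid=", "-o", "stat=",
             "-o", "comm=", "-o", "wchan="],
            capture_output=True, text=True, check=True,
        ).stdout
        stamp = time.strftime("%H:%M:%S")
        for pid, stat, rest in uninterruptible(out):
            print(stamp, pid, stat, rest)
        time.sleep(interval)
```

Calling poll() on each node at the same time and comparing the timestamps
would show whether the hangs start and end together on both machines.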

ocfs2-tools is at version 1.4.4-3.  Kernel is version 2.6.32-5-amd64.
The kernel log of the mount at boot is here:

[   18.911452] aoe: AoE v47 initialised.
[   19.686358] fuse init (API version 7.13)
[   29.000017] eth2: no IPv6 routers present
[   36.212109] aoe: 003048f28d36 e1.1 vace0 has 9767625597 sectors
[   36.212218]  etherd/e1.1: unknown partition table
[   59.715506] OCFS2 Node Manager 1.5.0
[   59.732002] OCFS2 DLM 1.5.0
[   59.733343] ocfs2: Registered cluster interface o2cb
[   59.749185] OCFS2 DLMFS 1.5.0
[   59.749304] OCFS2 User DLM kernel interface loaded
[   65.347517] o2net: accepted connection from node eculture (num 1) at 130.37.193.11:7777
[   67.884256] OCFS2 1.5.0
[   67.886984] ocfs2_dlm: Nodes in domain ("02BC250CDB0A4B468F845C68BE99B90E"): 0 1
[   67.890075] ocfs2: Mounting device (254,4) on (node 0, slot 0) with ordered data mode.

Installation and formatting are totally standard.

I've spent quite a bit of time trying to work out what might be wrong,
but so far I have failed.  Today I played a fair bit with debugfs,
but I do not have enough experience to see what is odd.  Dumping all
the locks showed just over 100,000 of them, which I thought might be a
lot, but posts suggest it isn't.  There were no, or very few, busy (-B)
locks.

I checked the cabling and low-level network activity; both seem OK.

Does anyone have similar experiences and/or an idea where to look?

	Thanks --- Jan





