[Ocfs2-tools-users] heartbeat issue with ocfs2 on debian

Dameon Wagner d.wagner at ru.ac.za
Tue Nov 17 02:29:49 PST 2009


On Mon, Nov 16, 2009 at 11:35:10AM -0800, Joel Becker scribbled in
"Re: [Ocfs2-tools-users] heartbeat issue with ocfs2 on debian":
> On Mon, Nov 16, 2009 at 12:37:52PM +0200, Dameon Wagner wrote:
> > I'm not sure how many people are still on this list, as the
> > archive doesn't show there being much activity.  I was going to
> > lurk for a little while, but there've been no messages since I
> > joined, so here goes.
>
> Most folks use ocfs2-users for this sort of question, which is why
> you haven't seen much activity.  But we're happy to help anywhere
> :-)

Ahh, that could explain it ;-)  I'll probably move subscriptions in a
while, and follow what's going on over there.

> > My setup is pretty simple, using only one physical box running
> > debian lenny.  That box has a xen virtual machine that I'd like to
> > share a block device with, also running debian lenny.
> >
> > The physical box is publishing a LVM2 logical volume using AOE,
> > and both systems are mounting the ocfs2 formatted partition on
> > /mnt/ocfs.
>
> I'm not sure I understand your setup.  You have a Xen dom0 on a
> physical box.  Then you have a single domU guest.  You have a LVM2
> volume on the dom0.  That volume is exported via AOE.  You are
> mounting the volume on the dom0 and the domU so that they can share
> the filesystem.  Is that correct?

Yup, exactly right.

> Big question: is the dom0 mounting the LVM2 volume via AOE or via
> direct block device access?

I am/was mounting the LVM2 volume directly on the dom0, and via AOE on
the domU.  I had originally wanted to mount both via AOE, since that
is probably how I will move it into production (a simple storage box
that won't consume any of the volumes it publishes via AOE, and
"remote" boxes that actually mount the volumes).  However, the
aoe-tools/vblade setup I have didn't seem to make the AOE volume
available to the dom0, so I figured I'd just mount the LV.  I honestly
just thought that as a block device it wouldn't make a difference.
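
For reference, the two mounts looked roughly like this (the LV path
and the AoE shelf/slot are placeholders, not the real names):

    # dom0: mount the logical volume directly
    mount -t ocfs2 /dev/vg0/shared /mnt/ocfs

    # domU: discover the AoE export and mount the resulting device
    aoe-discover
    mount -t ocfs2 /dev/etherd/e0.0 /mnt/ocfs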

I've been playing around a little this morning, trying to get both
dom0 and domU to mount via AOE, and it seems that I can only get dom0
to see the aoe device if I use `vblade .. .. lo <vol>` rather than
`vblade .. .. eth0 <vol>`, which is annoying, but a matter for another
mailing list I think...

Anyway, long story short, running vblade twice on the same LV, once
for lo and once for eth0, seems to have solved the issue and given me
_exactly_ what I was after -- quick test edits on one host show up
(effectively) immediately on the other host, and no errors appear in
either host's logfiles.  In other words: thanks, Joel!
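
In case it helps anyone searching the archive later, the workaround
looks roughly like this (the shelf/slot numbers and LV path are
placeholders, and I'm using the same shelf/slot on both interfaces so
the device shows up under the same name on both hosts):

    # dom0: export the same LV twice, once on lo so the dom0 itself
    # can see it, and once on eth0 for the domU
    vblade 0 0 lo   /dev/vg0/shared &
    vblade 0 0 eth0 /dev/vg0/shared &

    # then, on each host:
    aoe-discover
    mount -t ocfs2 /dev/etherd/e0.0 /mnt/ocfs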

All I have to do now is work out a neater way of publishing the aoe
device to both hosts, without having two instances of vblade running.

> > All seems to work OK, and running simple commands like `ls` and
> > `cat` work nicely, but I first noticed something was off when any
> > edits to a file didn't propagate from one node to the other.
> > Looking in syslog of the physical host I see kernel entries
> > mentioning:
> >
> > (27275,0):o2hb_do_disk_heartbeat:762 ERROR: Device "dm-12": another node is heartbeating in our slot
> >
> > usually near or followed by:
> >
> > (26956,0):o2net_connect_expired:1629 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
> >
> > o2cb status on both nodes shows that heartbeat is active -- am I
> > missing a configuration option somewhere that will give each node
> > its own slot?  The list archive doesn't seem to mention this, and
> > the terms I've tried searching on Google don't dig anything up
> > either.
>
> Slots are autoconfigured, so you're not missing anything there.
> You're missing something else that we need to track down.  You are
> having trouble with network connectivity between the dom0 and the
> domU.  Is your /etc/ocfs2/cluster.conf correct between them?  But
> the bigger problem is the 'another node is in our slot' error.  This
> signifies inconsistency in how the disk is seen.  This is why I ask
> about AOE vs direct access.  We need to make sure that changes to
> the disk show up immediately to both parties.  Then they should see
> each other on the disk and choose slots correctly.

My cluster.conf files are copy-pasted between the hosts so, if I
understand correctly, they should be compatible.  Besides, with both
boxes now connecting to the block device via AOE, all seems to be
sorted.  I honestly didn't think it would make a difference, but it
seems it does.
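
For completeness, the file on both boxes looks more or less like the
stock two-node example (the names and addresses below are
placeholders; the node names do have to match the output of
`hostname` on each box):

    node:
            ip_port = 7777
            ip_address = 192.168.0.1
            number = 0
            name = dom0-host
            cluster = ocfs2

    node:
            ip_port = 7777
            ip_address = 192.168.0.2
            number = 1
            name = domu-host
            cluster = ocfs2

    cluster:
            node_count = 2
            name = ocfs2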

> > Any pointers?  Or am I just trying to do the wrong thing (I did
> > see somewhere that ocfs was more for oracle DB usage, rather than
> > as a general purpose filesystem)?
>
> ocfs2 is a general purpose filesystem.  ocfs (without the '2') was
> not, but that sucker is back in the world of linux 2.4.

Cool, good to know.  Any idea when ACLs will be working? (Probably
answered in the ocfs2-users archive, which I'm trawling through at the
moment).

Thanks again.

Dameon

-- 
><> ><> ><> ><> ><> ><> ooOoo <>< <>< <>< <>< <>< <><
Dr. Dameon Wagner,
Senior ICT Specialist,
Depts. of Computer Science & Information Systems,
Rhodes University, Grahamstown, South Africa.
><> ><> ><> ><> ><> ><> ooOoo <>< <>< <>< <>< <>< <><



