[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Sunil Mushran sunil.mushran at oracle.com
Wed Nov 26 10:52:43 PST 2008


The first step is to find out the current UUIDs on the devices.
$ mounted.ocfs2 -d

Next get a list of all running heartbeat threads.
$ ls -l /sys/kernel/config/cluster/<clustername>/heartbeat/
This will list the heartbeat regions; each region name is the UUID of the
corresponding volume.

From the second list, remove every UUID that appears in the first list;
whatever remains is a stale heartbeat region. This is simpler if you first
umount all the ocfs2 volumes on node 2. What we are trying to do is kill
off all the hb threads that should have been killed during umount. One way
to build that stale list is sketched below.
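A rough sketch of building the list, assuming your cluster is named
"mycluster" and that the UUID is the third column of the "mounted.ocfs2 -d"
output (both the cluster name and the awk field are assumptions here;
adjust them for your setup, and normalize case/dashes if the two listings
print the UUID differently):

$ mounted.ocfs2 -d | awk 'NR > 1 {print $3}' | sort > /tmp/uuids.mounted
$ find /sys/kernel/config/cluster/mycluster/heartbeat/ \
      -mindepth 1 -maxdepth 1 -type d -printf '%f\n' | sort > /tmp/uuids.heartbeat
$ comm -23 /tmp/uuids.heartbeat /tmp/uuids.mounted > /tmp/uuids.stale
$ cat /tmp/uuids.stale

The find is restricted to directories because the heartbeat/ directory may
also contain attribute files; only the subdirectories are regions.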

Once you have that list, do:
$ ocfs2_hb_ctl -I -u <UUID>
For example:
$ ocfs2_hb_ctl -I -u C43CB881C2C84B09BAC14546BF6DCAD9

This will tell you the number of hb references. It should be 1.
To kill do:
$ ocfs2_hb_ctl -K -u <UUID>

Do it one by one and verify that each hb thread actually exits. (Note: the
o2hb thread name contains the first characters of the region name, so you
can check for it with ps.) A loop that does this is sketched below.
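A minimal sketch of that loop, assuming the stale UUIDs were collected into
/tmp/uuids.stale as above (the file path is an assumption; double-check that
each UUID really has no mounted volume behind it before killing its
heartbeat):

while read -r uuid; do
    echo "== $uuid =="
    ocfs2_hb_ctl -I -u "$uuid"              # reference count, expect 1
    ocfs2_hb_ctl -K -u "$uuid"              # stop the heartbeat
    sleep 1
    # the kernel thread is named o2hb-<start of the region name>
    if ps aux | grep "[o]2hb-${uuid:0:10}"; then
        echo "o2hb thread for $uuid still running -- stop and investigate"
        break
    fi
done < /tmp/uuids.stale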

We still don't know why ocfs2_hb_ctl segfaults during umount, but we do
know that its failure to stop the heartbeat is the cause of your problem.

Sunil

Daniel Keisling wrote:
> All nodes except the node that I run snapshots on have the correct
> number of o2hb threads running.  However, node 2, the node that has
> daily snapshots taken, has _way_ too many threads:
>
> [root@ausracdb03 ~]# ps aux | grep o2hb | wc -l
> 79
>
> [root@ausracdb03 ~]# ps aux | grep o2hb | head -n 10
> root      1166  0.0  0.0      0     0 ?        S<   Nov20   0:47 [o2hb-00EFECD3FF]
> root      1216  0.0  0.0      0     0 ?        S<   Oct25   4:14 [o2hb-5E0C4AD17C]
> root      1318  0.0  0.0      0     0 ?        S<   Nov01   3:18 [o2hb-98697EE8BC]
> root      1784  0.0  0.0      0     0 ?        S<   Nov15   1:25 [o2hb-A7DBDA5C27]
> root      2293  0.0  0.0      0     0 ?        S<   Nov18   1:05 [o2hb-FBA96061AD]
> root      2410  0.0  0.0      0     0 ?        S<   Oct23   4:49 [o2hb-289FD53333]
> root      2977  0.0  0.0      0     0 ?        S<   Nov21   0:00 [o2hb-58CB9EA8F0]
> root      3038  0.0  0.0      0     0 ?        S<   Nov21   0:00 [o2hb-D33787D93D]
> root      3150  0.0  0.0      0     0 ?        S<   Oct25   4:38 [o2hb-3CB2E03215]
> root      3302  0.0  0.0      0     0 ?        S<   Nov09   2:22 [o2hb-F78E8BF89E]
>
>
> What's the best way to proceed?
>
> Is this being caused by unpresenting/presenting snapshotted LUNs back to
> the system?  Those steps include:
> - unmount the snapshot dir
> - unmap the snapshot lun
> - take a SAN-based snapshot
> - present snapshot lun (same SCSI ID/WWNN) back to server
> - force a uuid reset with tunefs.ocfs2 on the snapshot filesystem
> - change the label with tunefs.ocfs2 on the snapshot filesystem
> - fsck the snapshot filesystem
> - mount the snapshot filesystem
>
> I am using tunefs.ocfs2 v1.2.7 because the --force-uuid-reset option is
> not in the v1.4.1 release.
>
> My two-node development cluster, which is set up exactly the same as
> above, is exhibiting the same behavior.  My single-node cluster, also set
> up exactly the same as above, is NOT exhibiting the behavior.
>
> Another single-node Oracle RAC cluster that is nearly the same (using
> QLogic HBA drivers for the SCSI devices instead of device-mapper) does
> not exhibit the o2hb thread issue.
>
> Daniel


