[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition
Sunil Mushran
sunil.mushran at oracle.com
Wed Nov 26 10:52:43 PST 2008
The first step is to find out the current UUIDs on the devices.
$ mounted.ocfs2 -d
Next get a list of all running heartbeat threads.
$ ls -l /sys/kernel/config/cluster/<clustername>/heartbeat/
This will list the heartbeat regions; each region name is the same as the
volume's UUID.
From the second list, remove every UUID that also appears in the first
list; whatever remains is stale. This process is simpler if you first
umount all the ocfs2 volumes on node 2. What we are trying to do is kill
off all hb threads that should have been killed during umount.
Once you have that list, do:
$ ocfs2_hb_ctl -I -u <UUID>
For example:
$ ocfs2_hb_ctl -I -u C43CB881C2C84B09BAC14546BF6DCAD9
This will tell you the number of hb references. It should be 1.
To kill do:
$ ocfs2_hb_ctl -K -u <UUID>
Do it one by one, and verify each hb thread is actually gone. (Note: each
o2hb thread is named after the start of its region name.)
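The steps above can be sketched as a script. This is a minimal sketch, not a tested tool: the cluster name, the `awk` column for the UUID in the `mounted.ocfs2 -d` output, and the assumption that the heartbeat directory lists only region directories are all things to verify on your own system before killing anything.

```shell
#!/bin/bash
# Sketch of the stale-heartbeat cleanup described above.
# CLUSTER is an assumption -- use your cluster name from cluster.conf.
CLUSTER=mycluster

# Print entries that appear in the region list but not in the mounted list.
# $1: whitespace-separated mounted UUIDs, $2: whitespace-separated hb regions.
stale_uuids() {
    comm -13 <(printf '%s\n' $1 | sort -u) <(printf '%s\n' $2 | sort -u)
}

# Live usage on the affected node (shown commented out; dry-run first):
#   mounted=$(mounted.ocfs2 -d | awk 'NR>1 {print $3}')  # adjust column if needed
#   regions=$(ls /sys/kernel/config/cluster/$CLUSTER/heartbeat/)
#   for u in $(stale_uuids "$mounted" "$regions"); do
#       ocfs2_hb_ctl -I -u "$u"      # verify the reference count is 1
#       # ocfs2_hb_ctl -K -u "$u"    # then kill, one at a time
#   done
```

The set difference is done with `comm -13` on the two sorted lists, so a UUID is only ever reported for killing if no mounted volume still claims it.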
We still don't know why ocfs2_hb_ctl segfaults during umount. But we do
know that its failure to stop the heartbeat is the cause of your problem.
Sunil
Daniel Keisling wrote:
> All nodes except the node that I run snapshots on have the correct
> number of o2hb threads running. However, node 2, the node that has
> daily snapshots taken has _way_ too many threads:
>
> [root at ausracdb03 ~]# ps aux | grep o2hb | wc -l
> 79
>
> [root at ausracdb03 ~]# ps aux | grep o2hb | head -n 10
> root 1166 0.0 0.0 0 0 ? S< Nov20 0:47
> [o2hb-00EFECD3FF]
> root 1216 0.0 0.0 0 0 ? S< Oct25 4:14
> [o2hb-5E0C4AD17C]
> root 1318 0.0 0.0 0 0 ? S< Nov01 3:18
> [o2hb-98697EE8BC]
> root 1784 0.0 0.0 0 0 ? S< Nov15 1:25
> [o2hb-A7DBDA5C27]
> root 2293 0.0 0.0 0 0 ? S< Nov18 1:05
> [o2hb-FBA96061AD]
> root 2410 0.0 0.0 0 0 ? S< Oct23 4:49
> [o2hb-289FD53333]
> root 2977 0.0 0.0 0 0 ? S< Nov21 0:00
> [o2hb-58CB9EA8F0]
> root 3038 0.0 0.0 0 0 ? S< Nov21 0:00
> [o2hb-D33787D93D]
> root 3150 0.0 0.0 0 0 ? S< Oct25 4:38
> [o2hb-3CB2E03215]
> root 3302 0.0 0.0 0 0 ? S< Nov09 2:22
> [o2hb-F78E8BF89E]
>
>
> What's the best way to proceed?
>
> Is this being caused by unpresenting/presenting snapshotted LUNs back to
> the system? Those steps include:
> - unmount the snapshot dir
> - unmap the snapshot lun
> - take a SAN-based snapshot
> - present snapshot lun (same SCSI ID/WWNN) back to server
> - force a uuid reset with tunefs.ocfs2 on the snapshot filesystem
> - change the label with tunefs.ocfs2 on the snapshot filesystem
> - fsck the snapshot filesystem
> - mount the snapshot filesystem
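> The cycle above, as a dry-run sketch. The device path, mount point, and
> label are placeholders, and --force-uuid-reset is the v1.2.7 flag this
> thread mentions; the SAN unmap/snapshot/present steps depend on the
> array's own tooling. `run` only echoes each command -- replace it with
> direct execution on a real system.
>
> ```shell
> #!/bin/bash
> SNAP_DEV=/dev/mapper/snaplun    # hypothetical device path
> SNAP_MNT=/mnt/snapshot          # hypothetical mount point
>
> run() { echo "$*"; }            # dry-run: print instead of execute
>
> snapshot_cycle() {
>     run umount "$SNAP_MNT"                           # unmount snapshot dir
>     # (unmap LUN, take SAN snapshot, re-present LUN with the same
>     #  SCSI ID/WWNN using the array's tools -- outside this sketch)
>     run tunefs.ocfs2 --force-uuid-reset "$SNAP_DEV"  # force a new UUID
>     run tunefs.ocfs2 -L snapvol "$SNAP_DEV"          # change the label
>     run fsck.ocfs2 -y "$SNAP_DEV"                    # fsck before mounting
>     run mount -t ocfs2 "$SNAP_DEV" "$SNAP_MNT"       # remount
> }
>
> snapshot_cycle
> ```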
>
> I am using tunefs.ocfs2 v1.2.7 because the --force-uuid-reset is not in
> the v1.4.1 release.
>
> My two-node development cluster, configured exactly as above, is
> exhibiting the same behavior. My single-node cluster, also configured
> exactly as above, is NOT exhibiting the same behavior.
>
> Another single-node Oracle RAC cluster that is nearly the same (using
> Qlogic HBA drivers for SCSI devices instead of device-mapper) does not
> exhibit the o2hb thread issue.
>
> Daniel