[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Sunil Mushran sunil.mushran at oracle.com
Wed Nov 26 10:58:39 PST 2008


If you can umount all ocfs2 vols on node 2, then you can skip the
mounted.ocfs2 step. Just list all active heartbeat regions and
kill them one by one.
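
A minimal sketch of that shortcut, assuming every ocfs2 volume on node 2 is
already unmounted and that <clustername> is replaced with the name from your
/etc/ocfs2/cluster.conf:

cd /sys/kernel/config/cluster/<clustername>/heartbeat/
for region in */ ; do
    [ -d "$region" ] || continue    # skip non-region entries
    uuid="${region%/}"              # region directory name == volume UUID
    ocfs2_hb_ctl -I -u "$uuid"      # show the reference count first (expect 1)
    ocfs2_hb_ctl -K -u "$uuid"      # then drop the stray heartbeat
done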

Sunil Mushran wrote:
> The first step is to find out the current UUIDs on the devices.
> $ mounted.ocfs2 -d
>
> Next get a list of all running heartbeat threads.
> $ ls -l /sys/kernel/config/cluster/<clustername>/heartbeat/
> This will list the heartbeat regions, whose names are the same as the UUIDs.
>
> What you have to do is remove from the second list all UUIDs you get
> from the first list. This process could be made simpler if you umounted
> all the ocfs2 volumes on node 2. What we are trying to do is kill off all
> hb threads that should have been killed during umount.
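>
> As a rough sketch of that comparison (assuming the volume UUIDs appear
> verbatim in the mounted.ocfs2 -d output, and substituting your cluster
> name for <clustername>):
>
> cd /sys/kernel/config/cluster/<clustername>/heartbeat/
> for region in */ ; do
>     uuid="${region%/}"
>     # anything heartbeating that mounted.ocfs2 no longer reports is a stray
>     mounted.ocfs2 -d | grep -q "$uuid" || echo "stray hb region: $uuid"
> done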
>
> Once you have that list, do:
> $ ocfs2_hb_ctl -I -u <UUID>
> For example:
> $ ocfs2_hb_ctl -I -u C43CB881C2C84B09BAC14546BF6DCAD9
>
> This will tell you the number of hb references. It should be 1.
> To kill do:
> $ ocfs2_hb_ctl -K -u <UUID>
>
> Do it one by one, and make sure each hb thread is actually killed. (Note:
> the o2hb thread name contains the start of the region name.)
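>
> A quick check along those lines, using the example UUID above (the thread
> is named after roughly the first ten characters of the region name, so a
> prefix grep is enough):
>
> uuid=C43CB881C2C84B09BAC14546BF6DCAD9
> ocfs2_hb_ctl -K -u "$uuid"
> ps aux | grep "o2hb-${uuid:0:10}" | grep -v grep    # should print nothing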
>
> We still don't know why ocfs2_hb_ctl segfaults during umount, but we do
> know that its failure to stop the heartbeat is the cause of your problem.
>
> Sunil
>
> Daniel Keisling wrote:
>   
>> All nodes except the node that I run snapshots on have the correct
>> number of o2hb threads running.  However, node 2, the node that has
>> daily snapshots taken, has _way_ too many threads:
>>
>> [root@ausracdb03 ~]# ps aux | grep o2hb | wc -l
>> 79
>>
>> [root@ausracdb03 ~]# ps aux | grep o2hb | head -n 10
>> root      1166  0.0  0.0      0     0 ?        S<   Nov20   0:47 [o2hb-00EFECD3FF]
>> root      1216  0.0  0.0      0     0 ?        S<   Oct25   4:14 [o2hb-5E0C4AD17C]
>> root      1318  0.0  0.0      0     0 ?        S<   Nov01   3:18 [o2hb-98697EE8BC]
>> root      1784  0.0  0.0      0     0 ?        S<   Nov15   1:25 [o2hb-A7DBDA5C27]
>> root      2293  0.0  0.0      0     0 ?        S<   Nov18   1:05 [o2hb-FBA96061AD]
>> root      2410  0.0  0.0      0     0 ?        S<   Oct23   4:49 [o2hb-289FD53333]
>> root      2977  0.0  0.0      0     0 ?        S<   Nov21   0:00 [o2hb-58CB9EA8F0]
>> root      3038  0.0  0.0      0     0 ?        S<   Nov21   0:00 [o2hb-D33787D93D]
>> root      3150  0.0  0.0      0     0 ?        S<   Oct25   4:38 [o2hb-3CB2E03215]
>> root      3302  0.0  0.0      0     0 ?        S<   Nov09   2:22 [o2hb-F78E8BF89E]
>>
>>
>> What's the best way to proceed?
>>
>> Is this being caused by unpresenting and re-presenting the snapshotted
>> LUNs to the system?  Those steps (sketched in shell form below) include:
>> - unmount the snapshot dir
>> - unmap the snapshot lun
>> - take a SAN-based snapshot
>> - present snapshot lun (same SCSI ID/WWNN) back to server
>> - force a uuid reset with tunefs.ocfs2 on the snapshot filesystem
>> - change the label with tunefs.ocfs2 on the snapshot filesystem
>> - fsck the snapshot filesystem
>> - mount the snapshot filesystem
>>
>> I am using tunefs.ocfs2 v1.2.7 because the --force-uuid-reset option is
>> not in the v1.4.1 release.
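>>
>> A rough shell transcription of that sequence (illustrative only: the
>> device path, label, and mount point are placeholders, the SAN snapshot
>> and LUN mapping happen in site-specific tools, and the tunefs.ocfs2
>> flags should be checked against the v1.2.7 man page):
>>
>> SNAPDEV=/dev/mapper/snaplun      # hypothetical snapshot device
>> SNAPDIR=/u02/snap                # hypothetical mount point
>>
>> umount "$SNAPDIR"
>> # ... unmap the LUN, take the SAN snapshot, re-present the LUN ...
>> tunefs.ocfs2 --force-uuid-reset "$SNAPDEV"   # new UUID for the snapshot
>> tunefs.ocfs2 -L snap_db "$SNAPDEV"           # assumes -L sets the label
>> fsck.ocfs2 -f "$SNAPDEV"                     # force a full check
>> mount -t ocfs2 "$SNAPDEV" "$SNAPDIR"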
>>
>> My two-node development cluster, which is set up exactly the same as the
>> above, is exhibiting the same behavior.  My single-node cluster, also set
>> up exactly the same, is NOT exhibiting it.
>>
>> Another single-node Oracle RAC cluster that is nearly the same (using
>> QLogic HBA drivers for SCSI devices instead of device-mapper) does not
>> exhibit the o2hb thread issue.
>>
>> Daniel
>>     
>



