[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Daniel Keisling daniel.keisling at austin.ppdi.com
Mon Dec 1 12:23:57 PST 2008


I wrote a script to easily find the heartbeats that should have been
killed.  However, I get a segmentation fault every time I try to kill
the "dead" heartbeats:

[root at ausracdbd01 tmp]# mounted.ocfs2 -d | grep -i f5f0 | wc -l
0

[root at ausracdbd01 tmp]# ocfs2_hb_ctl -K -u
F5F0522D39FC4EB2824C3E68C0B1D589
Segmentation fault (core dumped)



The process is still active:

[root at ausracdbd01 tmp]# ps -ef | grep -i f5f0
root       620   169  0 Nov29 ?        00:00:30 [o2hb-F5F0522D39]
root     22608 18491  0 14:07 pts/4    00:00:00 grep -i f5f0

Attached is the core.
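
For reference, the script's logic is roughly the following (a sketch only;
the configfs path assumes the standard o2cb layout and <clustername> is a
placeholder for our cluster name):

#!/bin/bash
# Kill any active o2hb heartbeat region whose UUID no longer appears in
# "mounted.ocfs2 -d" output, i.e. regions left behind after a umount.
# <clustername> is a placeholder for the name in /etc/ocfs2/cluster.conf.
HBDIR=/sys/kernel/config/cluster/<clustername>/heartbeat

for region in "$HBDIR"/*/; do
    [ -d "$region" ] || continue
    uuid=$(basename "$region")
    if ! mounted.ocfs2 -d | grep -qi "$uuid"; then
        echo "killing stale heartbeat region $uuid"
        ocfs2_hb_ctl -K -u "$uuid"
    fi
done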

While I can create and mount snapshot filesystems on my development
node, a dead heartbeat on one of my production nodes is preventing me
from mounting the snapshot of a newly presented filesystem (which is
causing our backups to fail).  What else can I do?  I really don't want
to open an SR with Oracle...

Thanks,

Daniel
 

> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
> Sent: Wednesday, November 26, 2008 12:59 PM
> To: Daniel Keisling
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] Another node is heartbeating in 
> our slot! errors with LUN removal/addition
> 
> If you can umount all ocfs2 vols on node 2, then you can skip the
> mounted.ocfs2 step. Just list all active heartbeat regions and kill
> them one by one.
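> For example, with everything umounted on that node, something along these
> lines (illustrative; substitute your cluster name):
> 
> $ cd /sys/kernel/config/cluster/<clustername>/heartbeat
> $ for r in */; do ocfs2_hb_ctl -K -u ${r%/}; done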
> 
> Sunil Mushran wrote:
> > The first step is to find out the current UUIDs on the devices.
> > $ mounted.ocfs2 -d
> >
> > Next get a list of all running heartbeat threads.
> > $ ls -l /sys/kernel/config/cluster/<clustername>/heartbeat/
> > This will list the heartbeat regions, whose names are the same as the UUIDs.
> >
> > What you have to do is remove from the second list all UUIDs you get
> > from the first list. This process could be made simpler if you umounted
> > all the ocfs2 volumes on node 2. What we are trying to do is kill off
> > all hb threads that should have been killed during umount.
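> > One way to build that filtered list (a rough sketch, not a polished
> > script):
> > $ mounted.ocfs2 -d | grep -oE '[0-9A-F]{32}' | sort > /tmp/mounted.uuids
> > $ ls /sys/kernel/config/cluster/<clustername>/heartbeat/ | grep -E '^[0-9A-F]{32}$' | sort > /tmp/active.regions
> > $ comm -13 /tmp/mounted.uuids /tmp/active.regions
> > comm -13 prints only the regions that have no mounted volume behind them,
> > which are the candidates to kill.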
> >
> > Once you have that list, do:
> > $ ocfs2_hb_ctl -I -u <UUID>
> > For example:
> > $ ocfs2_hb_ctl -I -u C43CB881C2C84B09BAC14546BF6DCAD9
> >
> > This will tell you the number of hb references. It should be 1.
> > To kill do:
> > $ ocfs2_hb_ctl -K -u <UUID>
> >
> > Do it one by one. Ensure the hb thread is killed. (Note: The o2hb
> > thread name has the start of the region name.)
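> > For example, after killing C43CB881C2C84B09BAC14546BF6DCAD9, this should
> > return nothing:
> > $ ps -ef | grep '[o]2hb-C43CB881'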
> >
> > We still don't know why ocfs2_hb_ctl segfaults during umount. But we
> > know that that failure is the cause of your problem.
> >
> > Sunil
> >
> > Daniel Keisling wrote:
> >   
> >> All nodes except the node that I run snapshots on have the correct
> >> number of o2hb threads running.  However, node 2, the node that has
> >> daily snapshots taken, has _way_ too many threads:
> >>
> >> [root at ausracdb03 ~]# ps aux | grep o2hb | wc -l
> >> 79
> >>
> >> [root at ausracdb03 ~]# ps aux | grep o2hb | head -n 10
> >> root      1166  0.0  0.0      0     0 ?        S<   Nov20   0:47
> >> [o2hb-00EFECD3FF]
> >> root      1216  0.0  0.0      0     0 ?        S<   Oct25   4:14
> >> [o2hb-5E0C4AD17C]
> >> root      1318  0.0  0.0      0     0 ?        S<   Nov01   3:18
> >> [o2hb-98697EE8BC]
> >> root      1784  0.0  0.0      0     0 ?        S<   Nov15   1:25
> >> [o2hb-A7DBDA5C27]
> >> root      2293  0.0  0.0      0     0 ?        S<   Nov18   1:05
> >> [o2hb-FBA96061AD]
> >> root      2410  0.0  0.0      0     0 ?        S<   Oct23   4:49
> >> [o2hb-289FD53333]
> >> root      2977  0.0  0.0      0     0 ?        S<   Nov21   0:00
> >> [o2hb-58CB9EA8F0]
> >> root      3038  0.0  0.0      0     0 ?        S<   Nov21   0:00
> >> [o2hb-D33787D93D]
> >> root      3150  0.0  0.0      0     0 ?        S<   Oct25   4:38
> >> [o2hb-3CB2E03215]
> >> root      3302  0.0  0.0      0     0 ?        S<   Nov09   2:22
> >> [o2hb-F78E8BF89E]
> >>
> >>
> >> What's the best way to proceed?
> >>
> >> Is this being caused by unpresenting/presenting snapshotted LUNs back to
> >> the system?  Those steps include (rough command sketch after the list):
> >> - unmount the snapshot dir
> >> - unmap the snapshot lun
> >> - take a SAN-based snapshot
> >> - present snapshot lun (same SCSI ID/WWNN) back to server
> >> - force a uuid reset with tunefs.ocfs2 on the snapshot filesystem
> >> - change the label with tunefs.ocfs2 on the snapshot filesystem
> >> - fsck the snapshot filesystem
> >> - mount the snapshot filesystem
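> >> Roughly, as commands (a sketch only; the device path, mount point, and
> >> label are placeholders, and the LUN unmap/map and SAN snapshot steps
> >> depend on the array tools):
> >> umount /mnt/<snap>
> >> # unmap the old snapshot LUN, take the SAN snapshot, and re-present the
> >> # LUN (same SCSI ID/WWNN) to the server -- array/multipath specific
> >> tunefs.ocfs2 --force-uuid-reset /dev/mapper/<snap_lun>
> >> tunefs.ocfs2 -L <new_label> /dev/mapper/<snap_lun>
> >> fsck.ocfs2 -y /dev/mapper/<snap_lun>
> >> mount -t ocfs2 /dev/mapper/<snap_lun> /mnt/<snap>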
> >>
> >> I am using tunefs.ocfs2 v1.2.7 because the --force-uuid-reset is not in
> >> the v1.4.1 release.
> >>
> >> My two-node development cluster, which is exactly the same as above, is
> >> exhibiting the same behavior.  My single-node cluster, which is exactly
> >> the same as above, is NOT exhibiting the same behavior.
> >>
> >> Another single-node Oracle RAC cluster that is nearly the same (using
> >> Qlogic HBA drivers for SCSI devices instead of device-mapper) does not
> >> exhibit the o2hb thread issue.
> >>
> >> Daniel
> >>     
> >

-------------- next part --------------
A non-text attachment was scrubbed...
Name: core.21738
Type: application/octet-stream
Size: 253952 bytes
Desc: core.21738
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20081201/562fdef6/attachment-0001.obj 

