[Ocfs2-users] re: o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1" another node is heartbeating in our slot!

Sunil Mushran Sunil.Mushran at oracle.com
Fri Mar 16 13:50:36 PDT 2007


Peter Santos wrote:
> "Mar 16 13:38:02 dbo3 kernel: (3712,3):o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1": another node is
> 					      heartbeating in our slot!"	
> Usually there are a number of other errors, but this one was it.
>   
If this was one isolated error message, it could just be that the previous
heartbeat write failed for some reason. In that case, the real problem may not
be as severe as the message makes it sound.
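
One quick way to see which nodes ocfs2 detects heartbeating on that device is
mounted.ocfs2 (a rough sketch; the exact output depends on your ocfs2-tools
version):

	# full detect: lists the ocfs2 volume and the nodes heartbeating on it
	mounted.ocfs2 -f /dev/sdb1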

> Our RAC cluster is made up of 3 nodes (dbo1,dbo2,dbo3) and they use ocfs2 for the OCR/voting files, but
> ASM is where the datafiles are located. This is SUSE 9, kernel 282.
>
>
> A while back one of our SAs was trying to install ocfs2 on a couple of Red Hat machines, and didn't properly
> configure ocfs2 to add the nodes. I believe he just copied directories and the /etc/ocfs2/cluster.conf file.
> Anyway, when he turned the machines on today, they were still misconfigured, and I believe that is the
> cause of the "another node is heartbeating in our slot" error message. Would you agree?
>   
If it was just one message, then that is unlikely. But do check the config
file to see whether it is correct or not.
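
For reference, a correct 3-node cluster.conf would look roughly like the
following (the IP addresses and port are made up for illustration; only dbo1,
dbo2 and dbo3 should appear, and the file must be identical on all nodes):

	node:
		ip_port = 7777
		ip_address = 192.168.1.101
		number = 0
		name = dbo1
		cluster = ocfs2

	node:
		ip_port = 7777
		ip_address = 192.168.1.102
		number = 1
		name = dbo2
		cluster = ocfs2

	node:
		ip_port = 7777
		ip_address = 192.168.1.103
		number = 2
		name = dbo3
		cluster = ocfs2

	cluster:
		node_count = 3
		name = ocfs2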

> As I mentioned there are only 3 nodes in our cluster, but the /etc/ocfs2/cluster.conf file shows 6 and so does the
> following:
> 	oracle@dbo1:/etc/ocfs2> ls /config/cluster/ocfs2/node/
> 	dbo1  dbo2  dbo3  dbo4  dbt3  dbt4
>
> So my question is, how do I permanently remove dbt3, dbt4 and dbo4? I checked the ocfs2 guide, but it only
> has information on adding a node to an online or offline cluster.
>   
Deletion would require a cluster shutdown. But why do you have to remove
them right now? Why not schedule a cluster.conf cleanup during your next
cluster shutdown window?
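
When you do get that window, the cleanup is roughly as follows (a sketch,
assuming the standard o2cb init script, that "ocfs2" is your cluster name,
and that /u02/ocfs2 stands in for your actual mount point):

	# on each node, with CRS down and the ocfs2 volume unmounted
	umount /u02/ocfs2
	/etc/init.d/o2cb offline ocfs2     # take the cluster offline
	/etc/init.d/o2cb unload            # unload the o2cb/ocfs2 modules

	# edit /etc/ocfs2/cluster.conf on every node: delete the dbo4, dbt3
	# and dbt4 node stanzas and set node_count = 3

	/etc/init.d/o2cb load
	/etc/init.d/o2cb online ocfs2      # bring the cluster back online
	mount /u02/ocfs2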

> More important is how the Oracle clusterware behaved. After this happened, my ASM and RDBMS instances stayed
> up. None of the machines rebooted. But the CRS daemon appears to be having issues.
>
> When I run "crsctl check crs" on all 3 nodes, I get the error "Cannot communicate with CRS" on all 3 nodes.
> The cssd log directory has a core file .. yet I can log into all 3 database instances as if nothing happened.
>
> I suspect this is a bug?
>
> The CRSD log files reveal some sort of issue relating to problems writing to the OCR file, which is on ocfs2. But
> if there really was a problem, wouldn't ocfs2 have rebooted the machine? And when RAC has a problem accessing the ocfs2
> volume, there are usually a large number of I/O errors in the system log.
>   
File an SR with Oracle and let the RAC folks look at the issue. The existence
of a core file may mean that some process needs to be restarted.
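
Once support has what it needs, restarting the clusterware stack is usually
along these lines (run as root from the CRS home's bin directory; this
assumes a 10gR2 crsctl):

	crsctl check crs     # confirm the stack is really down
	crsctl stop crs      # stop whatever is left of it
	crsctl start crs     # start CSS/CRS/EVM again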


