[Ocfs2-users] re: o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1" another node is heartbeating in our slot!

Peter Santos psantos at cheetahmail.com
Fri Mar 16 13:56:50 PDT 2007



I'm not really sure how these other servers were set up. I believe disk images were used.

Now I seem to have a bigger problem. I restarted one of my nodes to see if I could clear
up this mess, and now the restarted node won't mount the ocfs2 partitions... so the RAC
cluster doesn't come up.

I restarted node3:

dbo3:~ # /etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2) ocfs2_hb_ctl: Bad magic number in superblock while reading uuid
mount.ocfs2: Error when attempting to run /sbin/ocfs2_hb_ctl: "Operation not permitted"
ocfs2_hb_ctl: Bad magic number in superblock while reading uuid
mount.ocfs2: Error when attempting to run /sbin/ocfs2_hb_ctl: "Operation not permitted"

Both node1 and node2 have the /ocfs2 cluster partition mounted, but the mounted.ocfs2 -d
command only shows /backups, which is also mounted on node1 and node2.

Any ideas on how I can get around this "Bad magic number in superblock" problem?
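
For reference, this is the kind of read-only check I plan to run first, to see whether
the superblock on sdb1 is really damaged (both tools ship with ocfs2-tools; this is a
sketch of the commands, not output I already have):

dbo3:~ # fsck.ocfs2 -n /dev/sdb1              # -n: check only, make no changes
dbo3:~ # debugfs.ocfs2 -R "stats" /dev/sdb1   # dump the superblock fields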

dbo1 and dbo2
====================================

dbo1:/ocfs2 # mount -t ocfs2
/dev/sdb1 on /ocfs2 type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)
/dev/sdb2 on /backups type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)

dbo1:~ #  mounted.ocfs2 -d
Device                FS     UUID                                  Label
/dev/sdb2             ocfs2  f35379a7-07a7-4e87-b766-5ee42f595fbf  /backups


dbo2:/ocfs2/oracrs # mount -t ocfs2
/dev/sdb1 on /ocfs2 type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)
/dev/sdb2 on /backups type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)

dbo2:/ocfs2/oracrs # mounted.ocfs2 -d
Device                FS     UUID                                  Label
/dev/sdb2             ocfs2  f35379a7-07a7-4e87-b766-5ee42f595fbf  /backups
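
Since both nodes detect /backups (sdb2) but not /ocfs2 (sdb1), the next step I have in
mind is to compare what actually sits in sdb1's superblock on each node (read-only; a
sketch, assuming blkid and the ocfs2-tools are installed):

dbo1:~ # blkid /dev/sdb1          # does the signature still say ocfs2, and which UUID?
dbo1:~ # mounted.ocfs2 -f         # full detect: which nodes have each device mounted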


-peter



Alexei_Roudnev wrote:
> Btw, upgrade the kernel to #283; 282 had a serious bug in OCFSv2 (relating to
> simultaneous appends to the same file).
> 
> Another thing - try to keep the OCR and CSS files off OCFSv2. The reason is
> that by keeping the CRS files on OCFS you de facto make one cluster (CRS)
> depend on another (OCFS), which can influence CRS decisions in faulty
> situations.
> 
> (It's usually simple to create 2 more partitions or LUNs for the OCR file and
> CSS file - 102MB and 22MB respectively.)
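> 
> A rough sketch of what I mean (device names are examples only; check the
> 10gR2 docs for exact minimum sizes):
> 
>   fdisk /dev/sdc                 # carve sdc1 (~102MB, OCR) and sdc2 (~22MB, CSS)
>   raw /dev/raw/raw1 /dev/sdc1    # bind raw devices for the clusterware
>   raw /dev/raw/raw2 /dev/sdc2
> 
> and then repoint the clusterware at them (ocrconfig for the OCR, crsctl for
> the voting disk).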
> 
> As for your case - these experiments could really have broken the heartbeat
> (did you allow access to the same disks from these new experimental
> servers?)
> 
> 
> ----- Original Message ----- 
> From: "Peter Santos" <psantos at cheetahmail.com>
> To: <ocfs2-users at oss.oracle.com>
> Sent: Friday, March 16, 2007 1:04 PM
> Subject: [Ocfs2-users] re: o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1"
> another node is heartbeating in our slot!
> 
> 
> Folks,
> 
> I'm trying to wrap my head around something that happened in our
> environment.
> Basically, we noticed the following error in /var/log/messages, with no
> other errors around it:
> 
> "Mar 16 13:38:02 dbo3 kernel: (3712,3):o2hb_do_disk_heartbeat:963 ERROR:
> Device "sdb1": another node is heartbeating in our slot!"
> 
> Usually there are a number of other errors, but this one was it.
> 
> Our RAC cluster is made up of 3 nodes (dbo1, dbo2, dbo3). They use ocfs2
> for the OCR/voting files, but the datafiles live in ASM. This is SUSE 9,
> kernel 282.
> 
> 
> A while back, one of our SAs was trying to install ocfs2 on a couple of
> Red Hat machines and didn't properly configure ocfs2 to add the nodes. I
> believe he just copied directories and the /etc/ocfs2/cluster.conf file.
> Anyway, when he turned the machines on today they were still
> misconfigured, and I believe that is the cause of the "another node is
> heartbeating in our slot" error message. Would you agree?
> As I mentioned, there are only 3 nodes in our cluster, but the
> /etc/ocfs2/cluster.conf file shows 6, and so does the following:
> oracle at dbo1:/etc/ocfs2> ls /config/cluster/ocfs2/node/
> dbo1  dbo2  dbo3  dbo4  dbt3  dbt4
> 
> So my question is: how do I permanently remove dbt3, dbt4 and dbo4? I
> checked the ocfs2 guide, but it only has information on adding a node to
> an online or offline cluster.
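> 
> (For reference, each stale entry corresponds to a node stanza like the
> one below in /etc/ocfs2/cluster.conf - the values here are illustrative
> placeholders, not our real config:
> 
> node:
>         ip_port = 7777
>         ip_address = 10.0.0.13
>         number = 4
>         name = dbt3
>         cluster = ocfs2
> 
> plus the node_count value in the "cluster:" stanza, which has to match
> the number of node stanzas.)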
> 
> 
> More important is how the Oracle clusterware behaved. After this
> happened, my ASM and RDBMS instances stayed up. None of the machines
> rebooted. But the CRS daemon appears to be having issues. When I run
> "crsctl check crs" on any of the 3 nodes, I get the error "Cannot
> communicate with CRS". The cssd log directory has a core file, yet I can
> log into all 3 database instances as if nothing happened.
> I suspect this is a bug?
> 
> The CRSD log files reveal some sort of issue with writing to the OCR
> file, which is on ocfs2. But if there really was a problem, wouldn't
> ocfs2 have rebooted the machine? And when RAC has a problem accessing the
> ocfs2 volume, there are usually a large number of I/O errors in the
> system log.
> 
> 
> Any insight is greatly appreciated.
> 
> -peter
> 
> 
> alertdbo3.log
> =============
> 2007-03-16 13:38:25.471
> [crsd(4994)]CRS-1006:The OCR location /ocfs2/oracrs/ocr.crs is inaccessible.
> Details in /data/app/crs/oracle/product/10.2.0/crs/log/dbo3/crsd/crsd.log.
> 
> 2007-03-16 13:38:43.377
> [client(13125)]CRS-1006:The OCR location /ocfs2/oracrs/ocr.crs is inaccessible.
> Details in /data/app/crs/oracle/product/10.2.0/crs/log/dbo3/client/css.log.
> 
> 
> crsd.log
> =============
> 2007-03-16 13:38:11.708: [  OCRCLI][1407371616]proac_set_value: Response message returned with failure keyname = [CRS.CUR.ora!ORACTAH!ORACTAH3!inst.REASON], retcode = 26
> 2007-03-16 13:38:11.710: [  OCRCLI][1417865568]proac_set_value: Response message returned with failure keyname = [CRS.CUR.ora!dbo3!LISTENER_DBO3!lsnr.REASON], retcode = 26
> 2007-03-16 13:38:24.159: [  OCRMSG][1407371616]prom_rpc: CLSC recv failure..ret code 7
> 2007-03-16 13:38:24.159: [  OCRMSG][1407371616]prom_rpc: possible OCR retry scenario
> 2007-03-16 13:38:24.159: [ COMMCRS][1417865568]clscsendx: (0xc80100) Physical connection (0xc7fa30) not active
> 2007-03-16 13:38:24.159: [  OCRMSG][1417865568]prom_rpc: CLSC send failure..ret code 11
> 2007-03-16 13:38:24.159: [  OCRMSG][1417865568]prom_rpc: possible OCR retry scenario
> 2007-03-16 13:38:25.036: [  OCRMAS][1182845280]th_master:13: I AM THE NEW OCR MASTER at incar 3. Node Number = 3
> 2007-03-16 13:38:25.046: [  OCRRAW][1182845280]proprioo: for disk 0 (/ocfs2/oracrs/ocr.crs), id match (1), my id set (1201294405,1028247821) total id sets (1), 1st set (1201294405,1028247821), 2nd set (0,0) my votes (2), total votes (2)
> 2007-03-16 13:38:25.102: [  OCRRAW][1182845280]rrecover:3: recovery required
> 2007-03-16 13:38:25.471: [  OCRRAW][1182845280]rtnode:3: invalid tnode 1085
> 2007-03-16 13:38:25.471: [  OCRRAW][1182845280]propropen:0: could not read tnode addrd=0
> 2007-03-16 13:38:25.471: [  OCRRAW][1182845280]proprseterror: Error in accessing physical storage [26] Marking context invalid.
> 2007-03-16 13:38:25.471: [  OCRUTL][1182845280]u_freem: INVALID PROU_BEGIN_MEMTAG for memory [99351708] Begin tag [99351170] Expected begin tag [5072426d]
> [  OCRMAS][1182845280]th_calc_av:8.1': Error reading key [SYSTEM.version.node_numbers.node3]
> 2007-03-16 13:38:25.471: [  OCRMAS][1182845280]th_master:9: Shutdown CacheMaster. prev AV [169869824] new calc av [169869824] my sv [169869824]
> 2007-03-16 13:38:39.932: [  CRSOCR][1438853472]0OCR api procr_open_key failed for key CRS.CUR. OCR error code = 3 OCR error msg:
> 2007-03-16 13:38:39.932: [  CRSOCR][1438853472][PANIC]0Failed to open key: CRS.CUR(File: caaocr.cpp, line: 472)
> 
> * The cssd directory has a core file, but nothing in the ocssd.log file.



