[Ocfs2-users] re: o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1"
another node is heartbeating in our slot!
Peter Santos
psantos at cheetahmail.com
Fri Mar 16 13:56:50 PDT 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
I'm not really sure how these other servers were set up. I believe disk images were used.
Now I seem to have a bigger problem. I restarted one of my nodes to see if I could clear
up this mess, and now the restarted node won't mount the ocfs2 partitions, so the RAC cluster
doesn't come up.
I re-started node3
dbo3:~ # /etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2) ocfs2_hb_ctl: Bad magic number in superblock while reading uuid
mount.ocfs2: Error when attempting to run /sbin/ocfs2_hb_ctl: "Operation not permitted"
ocfs2_hb_ctl: Bad magic number in superblock while reading uuid
mount.ocfs2: Error when attempting to run /sbin/ocfs2_hb_ctl: "Operation not permitted"
Both node1 and node2 have the /ocfs2 cluster partition mounted, but the mounted.ocfs2 -d command
only shows /backups, which is also mounted on node1 and node2.
Any ideas on how I can get around this "Bad magic number in superblock" problem?
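
For anyone who wants to look at the raw device before touching anything: below is a minimal, read-only sketch of the kind of check that the "Bad magic number in superblock" message suggests is failing. It assumes, from my reading of the OCFS2 on-disk layout, that the superblock sits in block 2, starts with the signature string "OCFSV2", and that the block size is one of 512, 1024, 2048 or 4096 bytes. The function names are made up for illustration, and this is not how ocfs2_hb_ctl itself is implemented.

```python
import io

# Assumptions (not taken from the post): signature string, superblock block
# number, and candidate block sizes as I understand the OCFS2 on-disk format.
OCFS2_SIGNATURE = b"OCFSV2"
SUPERBLOCK_BLKNO = 2
CANDIDATE_BLOCK_SIZES = (512, 1024, 2048, 4096)


def find_ocfs2_superblock(stream):
    """Return the block size at which the signature is found, else None.

    `stream` is any seekable binary file object (a device opened read-only,
    or an in-memory image for testing).
    """
    for block_size in CANDIDATE_BLOCK_SIZES:
        stream.seek(SUPERBLOCK_BLKNO * block_size)
        if stream.read(len(OCFS2_SIGNATURE)) == OCFS2_SIGNATURE:
            return block_size
    return None


def check_device(path):
    """Open a device or image file read-only and probe for the signature."""
    with open(path, "rb") as dev:
        return find_ocfs2_superblock(dev)


if __name__ == "__main__":
    # Self-contained demo on a fake 16 KiB image with a 4096-byte block size.
    image = bytearray(16384)
    image[2 * 4096:2 * 4096 + len(OCFS2_SIGNATURE)] = OCFS2_SIGNATURE
    print(find_ocfs2_superblock(io.BytesIO(bytes(image))))
```

On a real node this would be run as root against the partition, e.g. check_device("/dev/sdb1"). A result of None on node3 but a block size on node1 would suggest that node3 is not seeing the same bytes at that device name.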
dbo1 and dbo2
====================================
dbo1:/ocfs2 # mount -t ocfs2
/dev/sdb1 on /ocfs2 type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)
/dev/sdb2 on /backups type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)
dbo1:~ # mounted.ocfs2 -d
Device FS UUID Label
/dev/sdb2 ocfs2 f35379a7-07a7-4e87-b766-5ee42f595fbf /backups
dbo2:/ocfs2/oracrs # mount -t ocfs2
/dev/sdb1 on /ocfs2 type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)
/dev/sdb2 on /backups type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)
dbo2:/ocfs2/oracrs # mounted.ocfs2 -d
Device FS UUID Label
/dev/sdb2 ocfs2 f35379a7-07a7-4e87-b766-5ee42f595fbf /backups
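
To make the discrepancy above explicit, here is a small sketch that compares the two command outputs and lists devices that are mounted as ocfs2 but not reported by mounted.ocfs2 -d. It only parses text in the shapes pasted above; the function names are hypothetical, and the sample strings are copied from the dbo1 output.

```python
def mounted_devices(mount_output):
    """Devices from `mount -t ocfs2` lines: '<dev> on <dir> type ocfs2 (...)'."""
    devices = set()
    for line in mount_output.splitlines():
        fields = line.split()
        if (len(fields) >= 5 and fields[1] == "on"
                and fields[3] == "type" and fields[4] == "ocfs2"):
            devices.add(fields[0])
    return devices


def detected_devices(detect_output):
    """Devices from `mounted.ocfs2 -d` rows: '<dev> ocfs2 <uuid> <label>'."""
    devices = set()
    for line in detect_output.splitlines():
        fields = line.split()
        # Skip the 'Device FS UUID Label' header and anything not a device row.
        if len(fields) >= 2 and fields[0].startswith("/dev/") and fields[1] == "ocfs2":
            devices.add(fields[0])
    return devices


def undetected(mount_output, detect_output):
    """Devices that are mounted as ocfs2 but missing from the detect listing."""
    return sorted(mounted_devices(mount_output) - detected_devices(detect_output))


MOUNT_OUT = """\
/dev/sdb1 on /ocfs2 type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)
/dev/sdb2 on /backups type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)
"""

DETECT_OUT = """\
Device FS UUID Label
/dev/sdb2 ocfs2 f35379a7-07a7-4e87-b766-5ee42f595fbf /backups
"""

if __name__ == "__main__":
    print(undetected(MOUNT_OUT, DETECT_OUT))  # ['/dev/sdb1']
```

With the output from dbo1 (and dbo2, which is identical), the only device in that state is /dev/sdb1, the volume that node3 refuses to mount.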
- -peter
Alexei_Roudnev wrote:
> Btw, upgrade the kernel to #283; 282 had a serious bug in OCFSv2 (relating to
> simultaneous appends to a file).
>
> Another point: try to keep the CRS and CSS files out of OCFSv2. The reason is that
> by keeping CRS files on OCFS, you de facto make
> one cluster (CRS) depend on another (OCFS), which can influence CRS
> decisions in faulty situations.
>
> (It's usually simple to create 2 more partitions or LUNs for the OCRFile and
> CSSFile - 102MB and 22MB each).
>
> As for your case - these experiments could really have broken the heartbeat (did
> you allow access to the same disks from these new
> experimental servers?)
>
>
> ----- Original Message -----
> From: "Peter Santos" <psantos at cheetahmail.com>
> To: <ocfs2-users at oss.oracle.com>
> Sent: Friday, March 16, 2007 1:04 PM
> Subject: [Ocfs2-users] re: o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1"
> another node is heartbeating in our slot!
>
>
> Folks,
>
> I'm trying to wrap my head around something that happened in our environment.
> Basically, we noticed the error in /var/log/messages with no other errors.
>
> "Mar 16 13:38:02 dbo3 kernel: (3712,3):o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1": another node is heartbeating in our slot!"
> Usually there are a number of other errors, but this one was it.
>
> Our RAC cluster is made up of 3 nodes (dbo1,dbo2,dbo3) and they use ocfs2 for the ocr/voting file, but
> ASM is where the datafiles are located. This is suse9 kernel 282.
>
>
> A while back one of our SAs was trying to install ocfs2 on a couple of Red Hat machines, and didn't properly
> configure ocfs2 to add the nodes. I believe he just copied directories and the /etc/ocfs2/cluster.conf file.
> Anyway, when he turned the machines on today, they were still misconfigured, and I believe that is the
> cause of the "another node is heartbeating in our slot" error message. Would you agree?
> As I mentioned, there are only 3 nodes in our cluster, but the /etc/cluster.conf file shows 6 and so does the
> following:
> oracle at dbo1:/etc/ocfs2> ls /config/cluster/ocfs2/node/
> dbo1 dbo2 dbo3 dbo4 dbt3 dbt4
>
> So my question is, how do I permanently remove dbt3, dbt4 and dbo4? I checked out the ocfs2 guide, but it only
> has information on adding a node to an online or offline cluster.
>
>
> More important is how the Oracle clusterware behaved. After this happened, my ASM and RDBMS instances stayed
> up. None of the machines rebooted. But the CRS daemon appears to be having issues.
> When I run "crsctl check crs", I get the error "Cannot communicate with CRS" on all 3 nodes.
> The cssd log directory has a core file, yet I can log into all 3 database instances as if nothing happened.
> I suspect this is a bug?
>
> The CRSD log files reveal some sort of issue relating to problems writing to the ocr file, which is on ocfs2. But
> if there really was a problem, wouldn't ocfs2 have rebooted the machine? And when RAC has a problem accessing the ocfs2
> volume, there are usually a large number of I/O errors in the system log.
>
>
> Any insight is greatly appreciated.
>
> -peter
>
>
> alertdbo3.log
> =============
> 2007-03-16 13:38:25.471
> [crsd(4994)]CRS-1006:The OCR location /ocfs2/oracrs/ocr.crs is inaccessible. Details in /data/app/crs/oracle/product/10.2.0/crs/log/dbo3/crsd/crsd.log.
>
> 2007-03-16 13:38:43.377
> [client(13125)]CRS-1006:The OCR location /ocfs2/oracrs/ocr.crs is inaccessible. Details in /data/app/crs/oracle/product/10.2.0/crs/log/dbo3/client/css.log.
>
>
> crsd.log
> =============
> 2007-03-16 13:38:11.708: [ OCRCLI][1407371616]proac_set_value: Response message returned with failure keyname = [CRS.CUR.ora!ORACTAH!ORACTAH3!inst.REASON], retcode = 26
> 2007-03-16 13:38:11.710: [ OCRCLI][1417865568]proac_set_value: Response message returned with failure keyname = [CRS.CUR.ora!dbo3!LISTENER_DBO3!lsnr.REASON], retcode = 26
> 2007-03-16 13:38:24.159: [ OCRMSG][1407371616]prom_rpc: CLSC recv failure..ret code 7
> 2007-03-16 13:38:24.159: [ OCRMSG][1407371616]prom_rpc: possible OCR retry scenario
> 2007-03-16 13:38:24.159: [ COMMCRS][1417865568]clscsendx: (0xc80100) Physical connection (0xc7fa30) not active
> 2007-03-16 13:38:24.159: [ OCRMSG][1417865568]prom_rpc: CLSC send failure..ret code 11
> 2007-03-16 13:38:24.159: [ OCRMSG][1417865568]prom_rpc: possible OCR retry scenario
> 2007-03-16 13:38:25.036: [ OCRMAS][1182845280]th_master:13: I AM THE NEW OCR MASTER at incar 3. Node Number = 3
> 2007-03-16 13:38:25.046: [ OCRRAW][1182845280]proprioo: for disk 0 (/ocfs2/oracrs/ocr.crs), id match (1), my id set (1201294405,1028247821) total id sets (1), 1st set (1201294405,1028247821), 2nd set (0,0) my votes (2), total votes (2)
> 2007-03-16 13:38:25.102: [ OCRRAW][1182845280]rrecover:3: recovery required
> 2007-03-16 13:38:25.471: [ OCRRAW][1182845280]rtnode:3: invalid tnode 1085
> 2007-03-16 13:38:25.471: [ OCRRAW][1182845280]propropen:0: could not read tnode addrd=0
> 2007-03-16 13:38:25.471: [ OCRRAW][1182845280]proprseterror: Error in accessing physical storage [26] Marking context invalid.
> 2007-03-16 13:38:25.471: [ OCRUTL][1182845280]u_freem: INVALID PROU_BEGIN_MEMTAG for memory [99351708] Begin tag [99351170] Expected begin tag [5072426d]
> [ OCRMAS][1182845280]th_calc_av:8.1': Error reading key [SYSTEM.version.node_numbers.node3]
> 2007-03-16 13:38:25.471: [ OCRMAS][1182845280]th_master:9: Shutdown CacheMaster. prev AV [169869824] new calc av [169869824] my sv [169869824]
> 2007-03-16 13:38:39.932: [ CRSOCR][1438853472]0OCR api procr_open_key failed for key CRS.CUR. OCR error code = 3 OCR error msg:
> 2007-03-16 13:38:39.932: [ CRSOCR][1438853472][PANIC]0Failed to open key: CRS.CUR(File: caaocr.cpp, line: 472)
>
> * The cssd directory has a core file, but nothing in the ocssd.log file.
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFF+wSRoyy5QBCjoT0RArwrAKCZSL8PckEtKv2g7gsHazL9eUWjVgCdHM2H
KjTEYZL/nxXn+UbMDCvETVI=
=eX2O
-----END PGP SIGNATURE-----