[Oracleasm-users] Re: [suse-oracle] re: SELS 10 - Kernel 2.6.16.27.0.9 locks up - Again.

Peter Santos psantos at cheetahmail.com
Thu May 3 06:30:02 PDT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Joel,
	I really don't think this has anything to do with Oracle/ASM.
	Ever since we installed SLES 10, the machine has randomly locked up.
	This was before we ever installed the oracle clusterware/rdbms software.
	
	We updated the kernel to 2.6.16.27.0.9 to see if that would fix the problem, and
	at first it looked like it was fixed, but since we started working on the box (i.e. installing
	the different oracle pieces), the box has not stayed up for more than 12-24 hours.

	To try to understand the lockup, we thought we'd generate some load by running
	a couple of "dd" write commands to some local storage. The /z0 LUN is an ext3 filesystem
	on the local storage.  Our SA initially thought this was oracle related because the only thing
	"up" on the box was the oracleasm service (even though the database/cluster was all down).

	Yesterday, I stopped CRS, un-installed the ocfs2 packages, and stopped the ASM service. Even so, running
	a couple of "dd" commands in parallel caused the machine to lock up in about 20 minutes.
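
	A minimal sketch of that kind of parallel "dd" load, for anyone who wants to try
	the same thing. The function name run_dd_load and its arguments are illustrative
	only, not the exact commands from the test; the dd options match the ones quoted
	further down in this thread.

```shell
# run_dd_load DIR WRITERS COUNT  (hypothetical helper, for illustration)
# Start WRITERS background dd processes, each writing COUNT 4k blocks of
# zeros to DIR/testthereN, then wait for all of them to finish.
run_dd_load() {
    dir=$1
    writers=$2
    count=$3
    pids=""
    i=1
    while [ "$i" -le "$writers" ]; do
        dd if=/dev/zero of="$dir/testthere$i" bs=4k count="$count" 2>/dev/null &
        pids="$pids $!"
        i=$((i + 1))
    done
    # Wait only for the dd processes started above.
    wait $pids
}
```

	Something like "run_dd_load /z0/test 2 22000000" would approximate the original
	run (about 90 GB per file); a small count is enough for a dry run.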

	Now, the "dd" commands cause the machine to lock up faster, but I left the machine completely idle
	yesterday, and this morning the machine had locked up again (and I have to walk over to the data center to restart
	it).

	So the machine does lock up without any significant activity.
	I'll spare you all the details for now on the raw device configuration. I'm working with someone at SUSE to
	see if we can figure this out.

	I'll be sure to update this thread later.
- -peter


Joel Becker wrote:
> On Wed, May 02, 2007 at 08:36:36AM -0400, Peter Santos wrote:
> 
> Hello Peter,
> 	Don't worry about Alexei; he sometimes gets worked up about
> things.
> 
>> 	the reason we are using asmlib is because our experience with managing
>> 	raw devices is limited and we don't want to run into additional trouble
>> 	down the road.
> 
> 	Raw devices per se are already deprecated.  They are now merely
> a compatibility layer on top of opening a block device with O_DIRECT.
> That is, if you decided to not use ASMLib, you should be specifying
> disks as "/dev/sdb1" instead of "/dev/raw/raw2".  Oracle will handle it
> just fine.
> 	You probably want to use ASMLib, though.  Even if you are using
> block device names ("/dev/sdb1", etc), you still have to manage all
> device naming and permissions.  This is frustrating enough on one node,
> but a real pain across a cluster.  As you probably know already, ASMLib
> makes this much easier.  In addition, it can make some I/O more
> efficient for Oracle (though it's not a large difference, and the
> management advantages are far more compelling).
> 
>> 	we've tried these tests over and over and it seems that the machine just
>> 	locks up when we run consecutive "dd" commands .. after about an hr the
>> 	machine locks up.  When the oracleasm is down we can't reproduce this, but when
>> 	the service is up, we get the locking problem. The only thing that I'm
>> 	uncertain about is that when the raw service starts up the raw devices
>> 	are bound, but the permissions on those devices were root:root when
>> 	oracleasm started. Only after did I change the permissions.  I'm going to 	
>> 	try this test one more time in this sequence.
>> 		1. bind the raw devices.
>> 		2. set the proper permissions on those devices
>> 		3. start the oracleasm service.
>> 		4. run "/etc/init.d/oracleasm status" and "listdisks" to make sure that
>> 		   everything looks correct.
>> 		5. run a number of "dd" commands to some local storage and see if
>> 		   machine locks up.
>> 		   prompt>  dd if=/dev/zero of=/z0/test/testthere3 bs=4k count=22000000
> 
> 	I'm unclear as to what you are doing exactly here.  I get that
> ASMLib has been configured.  But what are you dd(1)ing to?  Is ASM or
> Oracle running?  Where do the raw devices point?
> 	Looking at your steps, I'd like to know:
> 
> 1) What raw devices are bound to what block devices?
> 2) What permissions are set on the raw and block devices?
> 3) What does the oracleasm configuration look like
>    (/etc/sysconfig/oracleasm)?
> 4) What is the output of "status" and "listdisks", and what ASMLib
>    disk corresponds to what block device?  On SLES10, you can run
>        blkid -t TYPE="oracleasm"
>    to see what devices are marked for ASMLib.  Let's compare that to
>    the output of "listdisks".  Note that this blkid(8) command requires
>    a libblkid.so new enough to know about oracleasm, which means SLES9,
>    SLES10, RHEL5, and probably any recent Debian/Fedora/OpenSuSE.
> 5) Where is /z0/test/testthere3?  Is it a filesystem?  What sort of
>    filesystem, mounted from what block device?
> 
> 	Again, I want to know if ASM and Oracle are running.  If they
> are not, the ASMLib driver is literally doing nothing.  It has no effect
> on your system.
> 	Also, if ASM is running, what is your asm_diskstring?  Is ASM
> accessing the disks via the raw names or the ASMLib names?
> 	One of the things I'm trying to understand is where your dd(1)
> interacts with the rest of your environment.  Does it conflict with
> ASMLib or is it just coincidence?  Does it conflict with another block
> device?  What are the raw devices doing?  Etc.  That will help us narrow
> down what's actually happening.
> 
> Joel
> 
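
	For the blkid/listdisks comparison asked about in item 4, a rough sketch follows.
	It assumes blkid prints lines of the form
	/dev/sdb1: LABEL="VOL1" TYPE="oracleasm" and that "listdisks" prints one disk
	name per line; both formats and the helper names blkid_labels and
	compare_asm_disks are assumptions for illustration, not verified on this box.

```shell
# blkid_labels: read blkid-style lines on stdin, print the LABEL values sorted.
# Assumed input format: /dev/sdb1: LABEL="VOL1" TYPE="oracleasm"
blkid_labels() {
    sed -n 's/.* LABEL="\([^"]*\)".*/\1/p' | sort
}

# compare_asm_disks BLKID_FILE LISTDISKS_FILE  (hypothetical helper)
# Print names that appear in only one of the two saved outputs:
# names only in the blkid output in the first column, names only in the
# listdisks output in the second (tab-indented). Prints nothing on a match.
compare_asm_disks() {
    a=$(mktemp)
    b=$(mktemp)
    blkid_labels < "$1" > "$a"
    sort "$2" > "$b"
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}
```

	The two input files would come from saving the output of
	blkid -t TYPE="oracleasm" and of "/etc/init.d/oracleasm listdisks".
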
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGOePZoyy5QBCjoT0RAkvDAJ45Kn3CWcSyCEaR90ZFuONZe6zaEACcDz0C
zN8eVCQYDdN4vnsLLQLmgsk=
=Qr8o
-----END PGP SIGNATURE-----


