[Ocfs2-users] Kernel Panic, Server not coming back up

Sunil Mushran sunil.mushran at oracle.com
Mon Apr 5 13:45:59 PDT 2010


It is having problems doing I/Os to the virtual devices. -5 is EIO (I/O error).
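
For reference, a quick way to confirm that mapping on one of the nodes,
assuming the kernel headers are installed in their usual location:

# grep EIO /usr/include/asm-generic/errno-base.h
#define EIO              5      /* I/O error */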

kevin at utahsysadmin.com wrote:
> I have a relatively new test environment that is set up a little differently
> from your typical scenario.  This is my first time using OCFS2, but I
> believe it should work the way I have it set up.
>
> All of this is set up on VMware virtual hosts.  I have two front-end web
> servers and one back-end administrative server.  They all share two virtual
> hard drives within VMware (independent, persistent, and thick provisioned).
>
> Everything works great and the way I want, except that occasionally one of
> the nodes will crash with errors like the following:
>
> end_request:  I/O error, dev sdc, sector 585159
> Aborting journal on device sdc1
> end_request:  I/O error, dev sdc, sector 528151
> Buffer I/O error on device sdc1,  logical block 66011
> lost page write due to I/O error on sdc1
> (2848,1):ocfs2_start_trans:240 ERROR: status = -30
> OCFS2: abort (device sdc1): ocfs2_start_trans: Detected aborted journal
> Kernel panic - not syncing: OCFS2:  (device sdc1): panic forced after error
>
>  <0>Rebooting in 30 seconds..BUG: warning at
> arch/i386/kernel/smp.c:492/smp_send_reschedule() (Tainted: G    )
>
> The server never reboots; it just sits there until I reset it.  The cluster
> ran fine without errors for a week or two, but now that I have upgraded to
> the latest kernel/ocfs2 it's happening almost daily.  The disks are fine:
> they sit on a LUN on a SAN with no reported problems, and I unmounted all
> the partitions and ran fsck.ocfs2 -f on both drives from all three nodes
> (one at a time) with no errors found.
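>
> For completeness, the check on each node was along these lines (the mount
> points here are placeholders for the real ones):
>
> # umount /mnt/data /mnt/logs     # placeholder mount points
> # fsck.ocfs2 -f /dev/sdb1        # force a full check of the "data" volume
> # fsck.ocfs2 -f /dev/sdc1        # force a full check of the "logs" volume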
>
> This morning it happened again, and now, after a reset, the server will not
> boot up at all; it just sits at "Starting Oracle Cluster File System
> (OCFS2)".  These servers are all running OEL 5.4 with the latest patches
> installed.
>
> Here's the setup:
>
> # cat /etc/ocfs2/cluster.conf
> cluster:
> 	node_count = 3
> 	name = qacluster
>
> node:
> 	ip_port = 7777
> 	ip_address = 10.10.220.30
> 	number = 0
> 	name = qa-admin
> 	cluster = qacluster
>
> node:
> 	ip_port = 7777
> 	ip_address = 10.10.220.31
> 	number = 1
> 	name = qa-web1
> 	cluster = qacluster
>
> node:
> 	ip_port = 7777
> 	ip_address = 10.10.220.32
> 	number = 2
> 	name = qa-web2
> 	cluster = qacluster
>
> # mounted.ocfs2 -d
> Device                FS     UUID                                  Label
> /dev/sdb1             ocfs2  85b050a0-a381-49d8-8353-c21b1c8b28c4  data
> /dev/sdc1             ocfs2  6a03e81a-8186-41a6-8fd8-dc23854e12d3  logs
>
> # uname -a
> Linux qa-admin.domain.com 2.6.18-164.15.1.0.1.el5 #1 SMP Wed Mar 17
> 00:56:05 EDT 2010 i686 i686 i386 GNU/Linux
>
> # rpm -qa | grep ocfs2
> ocfs2-2.6.18-164.11.1.0.1.el5-1.4.4-1.el5
> ocfs2-tools-1.4.3-1.el5
> ocfs2-2.6.18-164.15.1.0.1.el5-1.4.4-1.el5
>
> This is the latest from one of the hosts that is still up:
>
> # dmesg | tail -50
> (2869,0):ocfs2_lock_allocators:677 ERROR: status = -5
> (2869,0):__ocfs2_extend_allocation:739 ERROR: status = -5
> (2869,0):ocfs2_extend_no_holes:952 ERROR: status = -5
> (2869,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
> (2869,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
> (2869,0):ocfs2_write_begin:1860 ERROR: status = -5
> (2869,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
> (2869,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
> OCFS2: ERROR (device sdc1): ocfs2_check_group_descriptor: Group descriptor
> # 1128960 has bit count 32256 but claims that 34300 are free
> (2881,0):ocfs2_search_chain:1244 ERROR: status = -5
> (2881,0):ocfs2_claim_suballoc_bits:1433 ERROR: status = -5
> (2881,0):__ocfs2_claim_clusters:1715 ERROR: status = -5
> (2881,0):ocfs2_local_alloc_new_window:1013 ERROR: status = -5
> (2881,0):ocfs2_local_alloc_slide_window:1116 ERROR: status = -5
> (2881,0):ocfs2_reserve_local_alloc_bits:537 ERROR: status = -5
> (2881,0):__ocfs2_reserve_clusters:725 ERROR: status = -5
> (2881,0):ocfs2_lock_allocators:677 ERROR: status = -5
> (2881,0):__ocfs2_extend_allocation:739 ERROR: status = -5
> (2881,0):ocfs2_extend_no_holes:952 ERROR: status = -5
> (2881,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
> (2881,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
> (2881,0):ocfs2_write_begin:1860 ERROR: status = -5
> (2881,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
> (2881,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
> (2045,0):o2net_connect_expired:1664 ERROR: no connection established with
> node 2 after 30.0 seconds, giving up and returning errors.
> OCFS2: ERROR (device sdc1): ocfs2_check_group_descriptor: Group descriptor
> # 1128960 has bit count 32256 but claims that 34300 are free
> (2872,0):ocfs2_search_chain:1244 ERROR: status = -5
> (2872,0):ocfs2_claim_suballoc_bits:1433 ERROR: status = -5
> (2872,0):__ocfs2_claim_clusters:1715 ERROR: status = -5
> (2872,0):ocfs2_local_alloc_new_window:1013 ERROR: status = -5
> (2872,0):ocfs2_local_alloc_slide_window:1116 ERROR: status = -5
> (2872,0):ocfs2_reserve_local_alloc_bits:537 ERROR: status = -5
> (2872,0):__ocfs2_reserve_clusters:725 ERROR: status = -5
> (2872,0):ocfs2_lock_allocators:677 ERROR: status = -5
> (2872,0):__ocfs2_extend_allocation:739 ERROR: status = -5
> (2872,0):ocfs2_extend_no_holes:952 ERROR: status = -5
> (2872,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
> (2872,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
> (2872,0):ocfs2_write_begin:1860 ERROR: status = -5
> (2872,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
> (2872,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
> (2065,0):ocfs2_dlm_eviction_cb:98 device (8,33): dlm has evicted node 2
> (12701,1):dlm_get_lock_resource:844
> 6A03E81A818641A68FD8DC23854E12D3:M00000000000000000000243568d3c5: at least
> one node (2) to recover before lock mastery can begin
> (2045,0):ocfs2_dlm_eviction_cb:98 device (8,33): dlm has evicted node 2
> (12701,1):dlm_get_lock_resource:898
> 6A03E81A818641A68FD8DC23854E12D3:M00000000000000000000243568d3c5: at least
> one node (2) to recover before lock mastery can begin
> o2net: accepted connection from node qa-web2 (num 2) at 147.178.220.32:7777
> ocfs2_dlm: Node 2 joins domain 6A03E81A818641A68FD8DC23854E12D3
> ocfs2_dlm: Nodes in domain ("6A03E81A818641A68FD8DC23854E12D3"): 0 1 2 
> (12701,1):dlm_restart_lock_mastery:1216 node 2 up while restarting
> (12701,1):dlm_wait_for_lock_mastery:1040 ERROR: status = -11
>
> Any suggestions?  Is there any more data I can provide?
>
> Thanks for any help.
>
> Kevin
>



