[Ocfs2-users] [SUMMARY] Cannot mount 1 out of 3 OCFS2 filesystems

Daniel Keisling daniel.keisling at austin.ppdi.com
Fri Oct 3 14:13:23 PDT 2008


This seems to be related to bug 6719988 in OCFS2 v1.2.8-2, which is
fixed in v1.2.9-1.
 
________________________________

From: ocfs2-users-bounces at oss.oracle.com
[mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Daniel Keisling
Sent: Friday, October 03, 2008 10:21 AM
To: ocfs2-users at oss.oracle.com
Subject: [Ocfs2-users] Cannot mount 1 out of 3 OCFS2 filesystems


Greetings,
 
I have a 4-node Oracle RAC cluster sharing four OCFS2 v1.2 filesystems
on RHEL5.  Node 3 was taken down for maintenance and rebooted several
times.  During this time, the networking stack on the cluster
interconnect had issues (after changing to an active-backup bonding
method) and was experiencing high packet loss, resulting in timeouts
connecting to the cluster.  After the networking changes were reverted
(putting the bonding method back to active-active) and the server was
rebooted, I can join the cluster but can only mount 3 of the 4 OCFS2
filesystems:
 
[root at ausracdb04 /]# mount /dev/mapper/limsp_archp1
mount.ocfs2: Unknown code B 0 while mounting /dev/mapper/limsp_archp1 on
/var/opt/oracle/oradata/limsp/arch. Check 'dmesg' for more information
on this error.

dmesg reports:
(17909,1):dlm_join_domain:1301 Timed out joining dlm domain
980E9BC11D2C458B9BC8BEACC1365CAC after 90400 msecs
ocfs2: Unmounting device (253,19) on (node 3)

The other nodes do not report anything for this filesystem during the
failed join, but I do see successful domain joins for the other OCFS2
filesystems.
 
I can ping the interconnect IPs between all 4 servers.  I have rebooted
several times and restarted the entire cluster stack to no avail.  The
problem has persisted for the last 18 hours.
 
My initial thought is that there is a DLM resource lock that cannot be
released, but I'm not sure how to fix it (rebooting the other nodes is
not a good option, as this is a live production environment).
I've tried to use the debugfs tools mentioned in the FAQ/User Guides,
but it's very confusing and I'm not sure what I need to look for.
 
I can see the disk device just fine on the server, and I can browse the
filesystem using ocfs2console; I just cannot join the domain to mount it.
 
I would appreciate any advice anyone may have.
 
My details are:
 
[root at ausracdb04 /]# uname -a
Linux ausracdb04.austin.ppdi.com 2.6.18-53.el5 #1 SMP Wed Oct 10
16:34:19 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

[root at ausracdb04 /]# rpm -qa | grep -i ocfs2
ocfs2-2.6.18-53.el5-1.2.8-2.el5
ocfs2console-1.2.7-2.el5
ocfs2-tools-1.2.7-2.el5

[root at ausracdb04 /]# cat /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 192.168.0.100
        number = 0
        name = ausracdb01
        cluster = racdb
 
node:
        ip_port = 7777
        ip_address = 192.168.0.101
        number = 1
        name = ausracdb02
        cluster = racdb
 
node:
        ip_port = 7777
        ip_address = 192.168.0.102
        number = 2
        name = ausracdb03
        cluster = racdb
 
node:
        ip_port = 7777
        ip_address = 192.168.0.106
        number = 3
        name = ausracdb04
        cluster = racdb
 
cluster:
        node_count = 4
        name = racdb

[root at ausracdb04 /]# cat /etc/sysconfig/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver.  It is generated by running /etc/init.d/o2cb configure.
# Please use that method to modify this file
#
 
# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true
 
# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=racdb
 
# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=61
 
# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is
# considered dead.
O2CB_IDLE_TIMEOUT_MS=60000
 
# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is
# sent.
O2CB_KEEPALIVE_DELAY_MS=
 
# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS=
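For reference, here is a sketch of the dead times these settings imply,
assuming the formula commonly given in the OCFS2 1.2 documentation (the
disk heartbeat dead time is (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds,
with one heartbeat iteration every 2 seconds):

```python
# Hedged sketch: effective dead times implied by the o2cb settings above.
# Assumes the OCFS2 1.2 rule of thumb: disk heartbeat dead time is
# (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds.

O2CB_HEARTBEAT_THRESHOLD = 61
O2CB_IDLE_TIMEOUT_MS = 60000

disk_heartbeat_dead_secs = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2  # 120 s
network_idle_dead_secs = O2CB_IDLE_TIMEOUT_MS // 1000          # 60 s

print(disk_heartbeat_dead_secs, network_idle_dead_secs)  # → 120 60
```

Note that the dmesg join timeout above (90400 msecs) falls between these
two values.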

 
[root at ausracdb04 /]# echo "stat " | debugfs.ocfs2 -n
/dev/mapper/limsp_archp1
        Inode: 5   Mode: 0775   Generation: 1066067688 (0x3f8ae6e8)
        FS Generation: 1066067688 (0x3f8ae6e8)
        Type: Directory   Attr: 0x0   Flags: Valid System
        User: 503 (oracle)   Group: 505 (dba)   Size: 40960
        Links: 4   Clusters: 10
        ctime: 0x48e635d4 -- Fri Oct  3 10:10:12 2008
        atime: 0x48627838 -- Wed Jun 25 11:54:16 2008
        mtime: 0x48e635d4 -- Fri Oct  3 10:10:12 2008
        dtime: 0x0 -- Wed Dec 31 18:00:00 1969
        ctime_nsec: 0x3ad5b3d6 -- 987083734
        atime_nsec: 0x00000000 -- 0
        mtime_nsec: 0x3ad5b3d6 -- 987083734
        Last Extblk: 0
        Sub Alloc Slot: Global   Sub Alloc Bit: 1
        Tree Depth: 0   Count: 243   Next Free Rec: 10
        ## Offset        Clusters       Block#
        0  0             1              207
        1  1             1              485268
        2  2             1              2096789
        3  3             1              751454
        4  4             1              1782521
        5  5             1              2144728
        6  6             1              2145932
        7  7             1              1784169
        8  8             1              1601861
        9  9             1              2446400
 
[root at ausracdb04 /]# echo "slotmap" | debugfs.ocfs2 -n
/dev/mapper/limsp_archp1
        Slot#   Node#
            0       0
            1       1
            2       2

For comparison, the slotmap for another filesystem that is correctly
joined and mounted:
[root at ausracdb04 /]# echo "slotmap" | debugfs.ocfs2 -n
/dev/mapper/ph1pp1
        Slot#   Node#
            0       0
            1       1
            2       2
            3       3
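The difference between the two slotmaps can be pulled out mechanically.
A small sketch, assuming the `debugfs.ocfs2 -n` slotmap output format
shown above (two-column Slot#/Node# rows):

```python
# Hedged sketch: given slotmap output from `debugfs.ocfs2 -n`, collect the
# node numbers that hold a slot, so a node missing from one filesystem's
# slotmap stands out.

def slotmap_nodes(output: str) -> set:
    nodes = set()
    for line in output.splitlines():
        parts = line.split()
        # data rows are two integers: slot number, node number
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            nodes.add(int(parts[1]))
    return nodes

arch = """Slot#   Node#
    0       0
    1       1
    2       2"""

ph1 = """Slot#   Node#
    0       0
    1       1
    2       2
    3       3"""

# Nodes with a slot on the good filesystem but not on the bad one:
print(slotmap_nodes(ph1) - slotmap_nodes(arch))  # → {3}
```

Here that difference is node 3, the node that cannot join the domain.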
 
 
I don't know if this is the correct command to look for "busy" locks
(run from another node):
[root at ausracdb01 ~]# echo "fs_locks" | debugfs.ocfs2 -n
/dev/mapper/limsp_archp1 | grep -i busy
[root at ausracdb01 ~]#
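The grep above does the job; for completeness, the same filter in script
form. This is a sketch only, and the sample record layout below is an
assumption, not actual fs_locks output from this cluster:

```python
# Hedged sketch: filter `debugfs.ocfs2 -n "fs_locks" <dev>` output for
# records mentioning a busy lock, mirroring the `grep -i busy` above.
# The sample text is a made-up illustration of the record format.

def busy_lock_lines(output: str) -> list:
    return [ln for ln in output.splitlines() if "busy" in ln.lower()]

sample = """Lockres: M0000000000000005  Mode: Protected Read
Flags: Initialized Attached Busy
Lockres: M0000000000000006  Mode: No Lock
Flags: Initialized Attached"""

print(busy_lock_lines(sample))  # → ['Flags: Initialized Attached Busy']
```

An empty result, as in the session above, suggests no lock on that
filesystem is currently stuck busy on that node.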


 
TIA,
 
Daniel

 
 




______________________________________________________________________
This email transmission and any documents, files or previous email
messages attached to it may contain information that is confidential or
legally privileged. If you are not the intended recipient or a person
responsible for delivering this transmission to the intended recipient,
you are hereby notified that you must not read this transmission and
that any disclosure, copying, printing, distribution or use of this
transmission is strictly prohibited. If you have received this
transmission
in error, please immediately notify the sender by telephone or return
email
and delete the original transmission and its attachments without reading
or saving in any manner.
	


