[Ocfs-users] DLM Problem?

Wed Mar 11 23:40:13 PDT 2009

Hello,

I have an active, load-balanced web cluster with two SLES 10 SP2 nodes, both
running ocfs2 on top of an iSCSI target.
The ocfs2 volume is mounted on both nodes.
Everything works fine, except that sometimes the load on both systems climbs
as high as 200 and both systems freeze; only a reboot restores control.
I noticed that this behaviour does not occur when the cluster is NOT
balanced. The problem occurs only when the load on both systems is even.
The logs show strange lock errors from the DLM.
See the lines below:

Feb 19 10:31:52 web1 kernel: (31564,0):dlm_send_remote_lock_request:315 ERROR: status = -40
Feb 19 10:31:52 web1 kernel: (31564,0):dlmlock_remote:251 ERROR: dlm status = DLM_BADARGS
Feb 19 10:31:52 web1 kernel: (31564,0):dlmlock:729 ERROR: dlm status = DLM_BADARGS
Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_lock_create:901 ERROR: Dlm error "DLM_BADARGS" while calling dlmlock on resource F000000000000000155f89eb780960c: bad api args
Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_file_lock:1486 ERROR: status = -22
Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_do_flock:79 ERROR: status = -22
Feb 19 10:33:16 web1 kernel: (7071,5):dlm_send_remote_lock_request:315 ERROR: status = -40
Feb 19 10:33:16 web1 kernel: (7071,5):dlmlock_remote:251 ERROR: dlm status = DLM_BADARGS
Feb 19 10:33:16 web1 kernel: (7071,5):dlmlock:729 ERROR: dlm status = DLM_BADARGS
Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_lock_create:901 ERROR: Dlm error "DLM_BADARGS" while calling dlmlock on resource F000000000000000155f89eb780960c: bad api args
Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_file_lock:1486 ERROR: status = -22
Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_do_flock:79 ERROR: status = -22
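
To see which functions report errors and how often, the kernel messages can be summarised with a short script. This is a minimal sketch: it assumes each log entry sits on a single line (as it would in the syslog file itself, before mail wrapping), and the names `parse_errors` and `count_by_function` are made up for this illustration.

```python
import re
from collections import Counter

# Two unwrapped sample entries in the same shape as the excerpt above.
SAMPLE = """\
Feb 19 10:31:52 web1 kernel: (31564,0):dlm_send_remote_lock_request:315 ERROR: status = -40
Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_do_flock:79 ERROR: status = -22
"""

# Matches "(pid,cpu):function:source_line ERROR: message".
LINE_RE = re.compile(
    r"kernel: \((?P<pid>\d+),(?P<cpu>\d+)\):"
    r"(?P<func>\w+):(?P<line>\d+) ERROR: (?P<msg>.*)"
)


def parse_errors(text):
    """Return one dict per matching ERROR entry found in text."""
    matches = (LINE_RE.search(entry) for entry in text.splitlines())
    return [m.groupdict() for m in matches if m]


def count_by_function(errors):
    """Count how many ERROR entries each kernel function produced."""
    return Counter(e["func"] for e in errors)


if __name__ == "__main__":
    for func, n in count_by_function(parse_errors(SAMPLE)).most_common():
        print(func, n)
```

Grouping by the `pid` field instead would show that each burst of six messages comes from a single process.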

The heartbeat between the nodes stays active at all times, even while these
errors occur.
These are the only errors related to ocfs2 or the filesystem.

Cluster config is as follows:
node:
        ip_port = 7777
        ip_address = 10.0.0.1
        number = 0
        name = web1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.0.0.2
        number = 1
        name = web2
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2
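
A quick consistency check of this configuration can be scripted. The following is only a sketch: the parser handles just the `stanza:` / `key = value` layout shown above, and `parse_cluster_conf` and `check` are hypothetical helper names, not part of the ocfs2 tools.

```python
CONF = """\
node:
    ip_port = 7777
    ip_address = 10.0.0.1
    number = 0
    name = web1
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 10.0.0.2
    number = 1
    name = web2
    cluster = ocfs2

cluster:
    node_count = 2
    name = ocfs2
"""


def parse_cluster_conf(text):
    """Parse 'stanza:' headers followed by 'key = value' lines into dicts."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            current = {"type": line[:-1]}
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, value = (part.strip() for part in line.split("=", 1))
            current[key] = value
    return stanzas


def check(stanzas):
    """Return a list of human-readable problems; empty means consistent."""
    nodes = [s for s in stanzas if s["type"] == "node"]
    problems = []
    for cluster in (s for s in stanzas if s["type"] == "cluster"):
        members = [n for n in nodes if n.get("cluster") == cluster.get("name")]
        if int(cluster.get("node_count", -1)) != len(members):
            problems.append("node_count does not match the number of node stanzas")
    for field in ("number", "name", "ip_address"):
        values = [n.get(field) for n in nodes]
        if len(set(values)) != len(values):
            problems.append("duplicate node " + field)
    return problems


if __name__ == "__main__":
    print(check(parse_cluster_conf(CONF)) or "no inconsistencies found")
```

For the configuration above the check finds nothing wrong, which suggests the file itself is not the cause.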

I suspect that the nodes of the cluster have problems creating and
releasing locks and end up knocking each other out.
I have searched but cannot find anything about this problem. The only thing
that turns up on the Oracle site is this thread:
[Ocfs2-devel] [PATCH] ocfs2: fix DLM_BADARGS error in concurrent file locking
<http://oss.oracle.com/pipermail/ocfs2-devel/2008-December/003464.html>
but I'm not skilled enough to follow what the developers are discussing.

I can reproduce the freeze problem and the dlm errors at any time.
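
A minimal sketch of a lock stress test follows. It assumes the trigger is concurrent flock() calls on the same file, which is an inference from the `ocfs2_do_flock` entries in the trace and the title of the patch thread, not something confirmed. It runs locally against a temporary file with several threads; on the cluster the same loop would be started on both nodes at once against a file on the ocfs2 mount, and may well freeze the nodes as described. `locked_increment` and `run` are names invented for this example.

```python
import fcntl
import os
import tempfile
import threading


def locked_increment(path, rounds):
    """Open the file independently, then bump its counter under LOCK_EX."""
    fd = os.open(path, os.O_RDWR)
    try:
        for _ in range(rounds):
            fcntl.flock(fd, fcntl.LOCK_EX)      # blocks until the lock is free
            try:
                value = int(os.pread(fd, 32, 0) or b"0")
                os.ftruncate(fd, 0)
                os.pwrite(fd, str(value + 1).encode(), 0)
            finally:
                fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def run(path, workers=4, rounds=50):
    """Hammer one file with concurrent exclusive locks; return the final count."""
    with open(path, "w") as f:
        f.write("0")
    threads = [
        threading.Thread(target=locked_increment, args=(path, rounds))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    with open(path) as f:
        return int(f.read())


if __name__ == "__main__":
    # Local demonstration only; a path on the shared volume is not assumed here.
    with tempfile.TemporaryDirectory() as tmp:
        print(run(os.path.join(tmp, "counter")))
```

Each worker opens the file itself, so every thread holds its own open file description and the exclusive locks really do contend. If locking works, the final count equals workers times rounds.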

Has anyone encountered the same problem? Does anyone know if Novell offers
support for this kind of situation? I have a Standard subscription on both
systems.
Thanks a lot, every hint is welcome.