Hello,<br>
<br>
I have an active, load-balanced web cluster with two SLES 10 SP2 nodes, both running OCFS2 backed by an iSCSI target.<br>
The ocfs2 volume is mounted on both nodes. <br>
Everything works fine, except that sometimes the load on both systems goes as
high as 200; then both systems freeze, and only a reboot regains
control.<br>
I noticed that this behaviour does not occur when the cluster is NOT
balanced; the problem only appears when the load on both systems is even.<br>
Looking into the logs revealed strange lock problems from the DLM.<br>
See lines below:<br><br>Feb 19 10:31:52 web1 kernel: (31564,0):dlm_send_remote_lock_request:315 ERROR: status = -40<br>Feb 19 10:31:52 web1 kernel: (31564,0):dlmlock_remote:251 ERROR: dlm status = DLM_BADARGS<br>Feb 19 10:31:52 web1 kernel: (31564,0):dlmlock:729 ERROR: dlm status = DLM_BADARGS<br>
Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_lock_create:901 ERROR: Dlm error "DLM_BADARGS" while calling dlmlock on resource F000000000000000155f89eb780960c: bad api args<br>Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_file_lock:1486 ERROR: status = -22<br>
Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_do_flock:79 ERROR: status = -22<br>Feb 19 10:33:16 web1 kernel: (7071,5):dlm_send_remote_lock_request:315 ERROR: status = -40<br>Feb 19 10:33:16 web1 kernel: (7071,5):dlmlock_remote:251 ERROR: dlm status = DLM_BADARGS<br>
Feb 19 10:33:16 web1 kernel: (7071,5):dlmlock:729 ERROR: dlm status = DLM_BADARGS<br>Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_lock_create:901 ERROR: Dlm error "DLM_BADARGS" while calling dlmlock on resource F000000000000000155f89eb780960c: bad api args<br>
Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_file_lock:1486 ERROR: status = -22<br>Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_do_flock:79 ERROR: status = -22<br><br>The heartbeat between nodes is active at all times, even when those errors occur.<br>
These are the only errors related to OCFS2 or the filesystem.<br>
<br>
Cluster config is as follows:<br>
node:<br>
ip_port = 7777<br>
ip_address = 10.0.0.1<br>
number = 0<br>
name = web1<br>
cluster = ocfs2<br>
<br>
node:<br>
ip_port = 7777<br>
ip_address = 10.0.0.2<br>
number = 1<br>
name = web2<br>
cluster = ocfs2<br>
<br>
cluster:<br>
node_count = 2<br>
name = ocfs2<br>
<br>
I suspect that the nodes of the cluster have problems creating and releasing the locks, thus knocking each other out.<br>
I have searched but cannot find anything about this problem. The only thing that turns up on the Oracle page is this <a href="http://oss.oracle.com/pipermail/ocfs2-devel/2008-December/003464.html" target="_blank">[Ocfs2-devel] [PATCH] ocfs2: fix DLM_BADARGS error in concurrent file locking</a>, but I'm not skilled enough to follow the discussion there. <br>
<br>
I can reproduce the freeze problem and the dlm errors at any time. <br>
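Since the patch linked above mentions concurrent file locking, and the errors come from ocfs2_do_flock, the trigger may simply be two processes taking flock() on the same file at the same time. A minimal sketch of that scenario, assuming a hypothetical test file path (on the cluster you would point it at a file on the shared OCFS2 mount, e.g. under the web root, and run it on both nodes simultaneously):<br>

```shell
# Hypothetical reproduction sketch: two competing flock(1) holders on one
# file. LOCKFILE is an assumed path; on the cluster, use a file on the
# shared OCFS2 volume so each flock() goes through the DLM.
f="${LOCKFILE:-/tmp/locktest}"
touch "$f"

# Start two lock holders concurrently; the second blocks until the first
# releases, which on OCFS2 exercises the remote DLM lock request path.
flock "$f" -c 'sleep 1' &
flock "$f" -c 'sleep 1' &
wait
echo "both locks released"
```

On a local filesystem this just serializes the two holders; on the shared OCFS2 mount, run from both nodes at once, it should hit the same dlmlock path the errors above come from.<br>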
<br>
Has anyone encountered the same problem? Does anyone know whether Novell
offers support for this kind of situation? I have a Standard
subscription on both systems.<br>
Thanks a lot, every hint is welcome.