Hello,<br>
<br>
I have an active, load-balanced web cluster with two SLES 10 SP2 nodes, both running OCFS2 backed by an iSCSI target.<br>
The ocfs2 volume is mounted on both nodes. <br>
Everything works fine, except that sometimes the load on both systems goes as
high as 200; then both systems freeze, and only a reboot regains
control.<br>
I noticed that this behaviour does not occur when the cluster is NOT
balanced; the problem only appears when the load on both systems is even.<br>
Looking into the logs revealed strange lock problems from the DLM.<br>
See lines below:<br><br>Feb 19 10:31:52 web1 kernel: (31564,0):dlm_send_remote_lock_request:315 ERROR: status = -40<br>Feb 19 10:31:52 web1 kernel: (31564,0):dlmlock_remote:251 ERROR: dlm status = DLM_BADARGS<br>Feb 19 10:31:52 web1 kernel: (31564,0):dlmlock:729 ERROR: dlm status = DLM_BADARGS<br>
Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_lock_create:901 ERROR: Dlm error "DLM_BADARGS" while calling dlmlock on resource F000000000000000155f89eb780960c: bad api args<br>Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_file_lock:1486 ERROR: status = -22<br>
Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_do_flock:79 ERROR: status = -22<br>Feb 19 10:33:16 web1 kernel: (7071,5):dlm_send_remote_lock_request:315 ERROR: status = -40<br>Feb 19 10:33:16 web1 kernel: (7071,5):dlmlock_remote:251 ERROR: dlm status = DLM_BADARGS<br>
Feb 19 10:33:16 web1 kernel: (7071,5):dlmlock:729 ERROR: dlm status = DLM_BADARGS<br>Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_lock_create:901 ERROR: Dlm error "DLM_BADARGS" while calling dlmlock on resource F000000000000000155f89eb780960c: bad api args<br>
Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_file_lock:1486 ERROR: status = -22<br>Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_do_flock:79 ERROR: status = -22<br><br>The heartbeat between nodes is active at all times, even when those errors occur.<br>
These are the only errors related to OCFS2 or the filesystem.<br>
<br>
Cluster config is as follows:<br>
node:<br>
ip_port = 7777<br>
ip_address = 10.0.0.1<br>
number = 0<br>
name = web1<br>
cluster = ocfs2<br>
<br>
node:<br>
ip_port = 7777<br>
ip_address = 10.0.0.2<br>
number = 1<br>
name = web2<br>
cluster = ocfs2<br>
<br>
cluster:<br>
node_count = 2<br>
name = ocfs2<br>
<br>
I suspect that the nodes of the cluster have problems creating and releasing the locks, thus knocking each other out.<br>
I have searched but cannot find anything about this problem. The only thing that turns up on the Oracle page is this <a href="http://oss.oracle.com/pipermail/ocfs2-devel/2008-December/003464.html" target="_blank">[Ocfs2-devel] [PATCH] ocfs2: fix DLM_BADARGS error in concurrent file locking</a>, but I'm not skilled enough to follow the discussion there. <br>
<br>
I can reproduce the freeze problem and the dlm errors at any time. <br>
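Since the patch linked above mentions concurrent file locking, and the errors come from ocfs2_do_flock, the trigger may simply be two processes taking flock() on the same file at the same time. A minimal sketch of that scenario, assuming a hypothetical test file path (on the cluster you would point it at a file on the shared OCFS2 mount, e.g. under the web root, and run it on both nodes simultaneously):<br>

```shell
# Hypothetical reproduction sketch: two competing flock(1) holders on one
# file. LOCKFILE is an assumed path; on the cluster, use a file on the
# shared OCFS2 volume so each flock() goes through the DLM.
f="${LOCKFILE:-/tmp/locktest}"
touch "$f"

# Start two lock holders concurrently; the second blocks until the first
# releases, which on OCFS2 exercises the remote DLM lock request path.
flock "$f" -c 'sleep 1' &
flock "$f" -c 'sleep 1' &
wait
echo "both locks released"
```

On a local filesystem this just serializes the two holders; on the shared OCFS2 mount, run from both nodes at once, it should hit the same dlmlock path the errors above come from.<br>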
<br>
Has anyone encountered the same problem? Does anyone know whether Novell
offers support for this kind of situation? I have a Standard
subscription on both systems.<br>
Thanks a lot, every hint is welcome.