[Ocfs-users] DLM Problem?

Sunil Mushran sunil.mushran at oracle.com
Fri Mar 13 09:06:15 PDT 2009


This issue was fixed quite some time ago. Get the latest SLES10 SP2
kernel; it should have the fix.
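
To verify which kernel a node is actually running before and after the
update, uname(2) reports the release string; a minimal sketch (purely
illustrative, not part of any fix):

    #include <stdio.h>
    #include <sys/utsname.h>

    int main(void)
    {
            struct utsname u;

            /* Query the running kernel and print its release string. */
            if (uname(&u) != 0) {
                    perror("uname");
                    return 1;
            }
            printf("running kernel release: %s\n", u.release);
            return 0;
    }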

Sunil

On Thu, Mar 12, 2009 at 07:40:13AM +0100, Bogdan Constantin wrote:
>    Hello,
>    I have an active, load-balanced web cluster with two SLES10 SP2 nodes,
>    both running OCFS2 on an iSCSI target.
>    The OCFS2 volume is mounted on both nodes.
>    Everything works fine, except that sometimes the load on both systems
>    goes as high as 200 and then both systems freeze; only a reboot
>    regains control.
>    I noticed that this behaviour does not occur when the cluster is NOT
>    balanced; the problem only appears when the load is spread evenly
>    across both systems.
>    Looking into the logs revealed strange locking problems from the DLM;
>    see the lines below:
>    Feb 19 10:31:52 web1 kernel: (31564,0):dlm_send_remote_lock_request:315
>    ERROR: status = -40
>    Feb 19 10:31:52 web1 kernel: (31564,0):dlmlock_remote:251 ERROR: dlm
>    status = DLM_BADARGS
>    Feb 19 10:31:52 web1 kernel: (31564,0):dlmlock:729 ERROR: dlm status =
>    DLM_BADARGS
>    Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_lock_create:901 ERROR: Dlm
>    error "DLM_BADARGS" while calling dlmlock on resource
>    F000000000000000155f89eb780960c: bad api args
>    Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_file_lock:1486 ERROR:
>    status = -22
>    Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_do_flock:79 ERROR: status
>    = -22
>    Feb 19 10:33:16 web1 kernel: (7071,5):dlm_send_remote_lock_request:315
>    ERROR: status = -40
>    Feb 19 10:33:16 web1 kernel: (7071,5):dlmlock_remote:251 ERROR: dlm
>    status = DLM_BADARGS
>    Feb 19 10:33:16 web1 kernel: (7071,5):dlmlock:729 ERROR: dlm status =
>    DLM_BADARGS
>    Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_lock_create:901 ERROR: Dlm
>    error "DLM_BADARGS" while calling dlmlock on resource
>    F000000000000000155f89eb780960c: bad api args
>    Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_file_lock:1486 ERROR:
>    status = -22
>    Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_do_flock:79 ERROR: status =
>    -22
>    The heartbeat between the nodes is active at all times, even when
>    these errors occur.
>    These are the only errors related to OCFS2 or the filesystem.
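>    For what it's worth, the failures above all surface through the
>    flock() path (ocfs2_do_flock), and the patch I found below talks
>    about concurrent file locking. A minimal sketch of the kind of
>    contention I mean, assuming an illustrative lock file on the shared
>    mount (the path and the loop are my guesses, not our exact workload),
>    would be two copies of something like this, one per node:
>
>        #include <stdio.h>
>        #include <unistd.h>
>        #include <fcntl.h>
>        #include <sys/file.h>
>
>        int main(void)
>        {
>                /* Assumed path on the shared OCFS2 volume. */
>                int fd = open("/mnt/ocfs2/lockfile", O_RDWR | O_CREAT, 0644);
>
>                if (fd < 0) {
>                        perror("open");
>                        return 1;
>                }
>                for (;;) {
>                        /* flock() on OCFS2 goes through ocfs2_do_flock. */
>                        if (flock(fd, LOCK_EX) != 0) {
>                                perror("flock");
>                                break;
>                        }
>                        usleep(1000);          /* hold the lock briefly */
>                        flock(fd, LOCK_UN);    /* release and retry */
>                }
>                close(fd);
>                return 0;
>        }
>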
>    Cluster config is as follows:
>
>    node:
>            ip_port = 7777
>            ip_address = 10.0.0.1
>            number = 0
>            name = web1
>            cluster = ocfs2
>
>    node:
>            ip_port = 7777
>            ip_address = 10.0.0.2
>            number = 1
>            name = web2
>            cluster = ocfs2
>
>    cluster:
>            node_count = 2
>            name = ocfs2
>    I suspect that the cluster nodes have problems creating and releasing
>    the locks, thus knocking each other out.
>    I have searched but cannot find anything about this problem; the only
>    thing that appears on the Oracle page is [1][Ocfs2-devel] [PATCH]
>    ocfs2: fix DLM_BADARGS error in concurrent file locking, but I'm not
>    skilled enough to follow what the developers are discussing there.
>    I can reproduce the freeze problem and the DLM errors at any time.
>    Has anyone encountered the same problem? Does anyone know whether
>    Novell offers support in this kind of situation? I have a Standard
>    subscription on both systems.
>    Thanks a lot; every hint is welcome.
> 
> References
> 
>    1. http://oss.oracle.com/pipermail/ocfs2-devel/2008-December/003464.html
