[Ocfs2-users] servers blocked on ocfs2
frank
frank at si.ct.upc.edu
Thu Dec 9 23:16:05 PST 2010
Hi Sunil,
first of all, thanks for the answer, but are you saying that it is better to
use a switch than a direct cable between the nodes?
I thought that using a switch adds an extra point of failure and also
wastes a couple of switch ports unnecessarily.
What is the problem with using crossover cables?
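
For what it is worth, until this is settled we could run something like the
rough sketch below on both nodes; it only watches the dedicated interface for
carrier drops and error-counter jumps, so they can be correlated with the
o2net idle-timeout messages in the logs (the interface name eth1 and the
5-second poll interval are just assumptions here):

    #!/usr/bin/env python
    # Rough sketch, not a fix: watch the interconnect NIC and print carrier
    # drops and error-counter jumps, to correlate with o2net timeouts.
    # Assumes the crossover link is eth1 and the Python 2 shipped with RHEL 5.
    import time

    IFACE = "eth1"      # assumed name of the dedicated interconnect interface
    POLL_SECONDS = 5    # arbitrary polling interval
    COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped")

    def read_sysfs(path):
        f = open(path)
        try:
            return f.read().strip()
        finally:
            f.close()

    def snapshot():
        stats = {}
        for name in COUNTERS:
            stats[name] = int(read_sysfs(
                "/sys/class/net/%s/statistics/%s" % (IFACE, name)))
        return stats

    prev = snapshot()
    while True:
        time.sleep(POLL_SECONDS)
        try:
            carrier = read_sysfs("/sys/class/net/%s/carrier" % IFACE)
        except IOError:
            carrier = "0"   # carrier is unreadable while the interface is down
        if carrier != "1":
            print "%s: link DOWN on %s" % (time.ctime(), IFACE)
        cur = snapshot()
        for name in COUNTERS:
            if cur[name] != prev[name]:
                print "%s: %s %d -> %d" % (time.ctime(), name,
                                           prev[name], cur[name])
        prev = cur
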
Frank
On 09/12/10 20:15, Sunil Mushran wrote:
> The interconnect is the problem. Don't use crossover cables. Use a
> gige link
> with a proper switch. That's what the world uses.
>
> On 12/09/2010 02:10 AM, frank wrote:
>> Hi,
>>
>> we have recently started to use ocfs2 on some RHEL 5.5 servers
>> (ocfs2-1.4.7).
>> Some days ago, two servers sharing an ocfs2 filesystem, and running
>> quite a few virtual services, stalled, in what seems to be an ocfs2
>> issue. These are the lines from their messages files:
>>
>> =====node heraclito (0)========================================
>> Dec 4 09:15:06 heraclito kernel: o2net: connection to node
>> parmenides (num 1) at 192.168.1.2:7777 has been idle for 30.0
>> seconds, shutting it down.
>> Dec 4 09:15:06 heraclito kernel: (swapper,0,7):o2net_idle_timer:1503
>> here are some times that might help debug the situation: (tmr
>> 1291450476.228826
>> now 1291450506.229456 dr 1291450476.228760 adv
>> 1291450476.228842:1291450476.228843 func (de6e01eb:500)
>> 1291450476.228827:1291450476.228829)
>> Dec 4 09:15:06 heraclito kernel: o2net: no longer connected to node
>> parmenides (num 1) at 192.168.1.2:7777
>> Dec 4 09:15:06 heraclito kernel:
>> (vzlist,22622,7):dlm_send_remote_convert_request:395 ERROR: status = -112
>> Dec 4 09:15:06 heraclito kernel:
>> (snmpd,16452,10):dlm_send_remote_convert_request:395 ERROR: status = -112
>> Dec 4 09:15:06 heraclito kernel:
>> (snmpd,16452,10):dlm_wait_for_node_death:370
>> 0D3E49EB1F614A3EAEC0E2A74A34AFFF: waiting 5000ms for notification of
>> death of node 1
>> Dec 4 09:15:06 heraclito kernel:
>> (httpd,4615,10):dlm_do_master_request:1334 ERROR: link to 1 went down!
>> Dec 4 09:15:06 heraclito kernel:
>> (httpd,4615,10):dlm_get_lock_resource:917 ERROR: status = -112
>> Dec 4 09:15:06 heraclito kernel:
>> (python,20750,10):dlm_do_master_request:1334 ERROR: link to 1 went down!
>> Dec 4 09:15:06 heraclito kernel:
>> (python,20750,10):dlm_get_lock_resource:917 ERROR: status = -112
>> Dec 4 09:15:06 heraclito kernel:
>> (vzlist,22622,7):dlm_wait_for_node_death:370
>> 0D3E49EB1F614A3EAEC0E2A74A34AFFF: waiting 5000ms for notification of
>> death of node 1
>> Dec 4 09:15:06 heraclito kernel: o2net: accepted connection from
>> node parmenides (num 1) at 192.168.1.2:7777
>> Dec 4 09:15:11 heraclito kernel:
>> (snmpd,16452,5):dlm_send_remote_convert_request:393 ERROR: dlm status
>> = DLM_IVLOCKID
>> Dec 4 09:15:11 heraclito kernel:
>> (snmpd,16452,5):dlmconvert_remote:327 ERROR: dlm status = DLM_IVLOCKID
>> Dec 4 09:15:11 heraclito kernel:
>> (snmpd,16452,5):ocfs2_cluster_lock:1258 ERROR: DLM error DLM_IVLOCKID
>> while calling dlmlock on resource
>> M000000000000000000000b6f931666: bad lockid
>> Dec 4 09:15:11 heraclito kernel:
>> (snmpd,16452,5):ocfs2_inode_lock_full:2121 ERROR: status = -22
>> Dec 4 09:15:11 heraclito kernel: (snmpd,16452,5):_ocfs2_statfs:1266
>> ERROR: status = -22
>> Dec 4 09:15:11 heraclito kernel:
>> (vzlist,22622,9):dlm_send_remote_convert_request:393 ERROR: dlm
>> status = DLM_IVLOCKID
>> Dec 4 09:15:11 heraclito kernel:
>> (vzlist,22622,9):dlmconvert_remote:327 ERROR: dlm status = DLM_IVLOCKID
>> Dec 4 09:15:11 heraclito kernel:
>> (vzlist,22622,9):ocfs2_cluster_lock:1258 ERROR: DLM error
>> DLM_IVLOCKID while calling dlmlock on resource
>> M00000000000000000b94df00000000: bad lockid
>>
>> =====node parmenides (1)========================================
>> Dec 4 09:15:06 parmenides kernel: o2net: connection to node
>> heraclito (num 0) at 192.168.1.3:7777 has been idle for 30.0 seconds,
>> shutting it down.
>> Dec 4 09:15:06 parmenides kernel:
>> (swapper,0,9):o2net_idle_timer:1503 here are some times that might
>> help debug the situation: (tmr 1291450476.231519
>> now 1291450506.232462 dr 1291450476.231506 adv
>> 1291450476.231522:1291450476.231522 func (de6e01eb:505)
>> 1291450475.650496:1291450475.650501)
>> Dec 4 09:15:06 parmenides kernel: o2net: no longer connected to node
>> heraclito (num 0) at 192.168.1.3:7777
>> Dec 4 09:15:06 parmenides kernel:
>> (snmpd,12342,11):dlm_do_master_request:1334 ERROR: link to 0 went down!
>> Dec 4 09:15:06 parmenides kernel:
>> (minilogd,12700,0):dlm_wait_for_lock_mastery:1117 ERROR: status = -112
>> Dec 4 09:15:06 parmenides kernel:
>> (smbd,25555,12):dlm_do_master_request:1334 ERROR: link to 0 went down!
>> Dec 4 09:15:06 parmenides kernel:
>> (python,12439,9):dlm_do_master_request:1334 ERROR: link to 0 went down!
>> Dec 4 09:15:06 parmenides kernel:
>> (python,12439,9):dlm_get_lock_resource:917 ERROR: status = -112
>> Dec 4 09:15:06 parmenides kernel:
>> (smbd,25555,12):dlm_get_lock_resource:917 ERROR: status = -112
>> Dec 4 09:15:06 parmenides kernel:
>> (minilogd,12700,0):dlm_do_master_request:1334 ERROR: link to 0 went down!
>> Dec 4 09:15:06 parmenides kernel:
>> (minilogd,12700,0):dlm_get_lock_resource:917 ERROR: status = -107
>> Dec 4 09:15:06 parmenides kernel:
>> (dlm_thread,10627,4):dlm_drop_lockres_ref:2211 ERROR: status = -112
>> Dec 4 09:15:06 parmenides kernel:
>> (dlm_thread,10627,4):dlm_purge_lockres:206 ERROR: status = -112
>> Dec 4 09:15:06 parmenides kernel: o2net: connected to node heraclito
>> (num 0) at 192.168.1.3:7777
>> Dec 4 09:15:06 parmenides kernel:
>> (snmpd,12342,11):dlm_get_lock_resource:917 ERROR: status = -112
>> Dec 4 09:15:11 parmenides kernel:
>> (o2net,10545,6):dlm_convert_lock_handler:489 ERROR: did not find lock
>> to convert on grant queue! cookie=0:6
>> Dec 4 09:15:11 parmenides kernel: lockres:
>> M000000000000000000000b6f931666, owner=1, state=0
>> Dec 4 09:15:11 parmenides kernel: last used: 0, refcnt: 4, on
>> purge list: no
>> Dec 4 09:15:11 parmenides kernel: on dirty list: no, on reco list:
>> no, migrating pending: no
>> Dec 4 09:15:11 parmenides kernel: inflight locks: 0, asts reserved: 0
>> Dec 4 09:15:11 parmenides kernel: refmap nodes: [ 0 ], inflight=0
>> Dec 4 09:15:11 parmenides kernel: granted queue:
>> Dec 4 09:15:11 parmenides kernel: type=5, conv=-1, node=1,
>> cookie=1:6, ref=2, ast=(empty=y,pend=n), bast=(empty=y,pend=n),
>> pending=(conv=n,lock=n,cancel=n,unlock=n)
>> =================================================
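>>
>> The negative status codes in the lines above are plain kernel errno
>> values (-112 is EHOSTDOWN and -107 is ENOTCONN), i.e. the DLM could
>> not reach the other node over the interconnect. They can be decoded
>> on any node, for example with a couple of lines of Python:
>>
>>     import errno, os
>>     for code in (112, 107):
>>         print code, errno.errorcode[code], os.strerror(code)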
>>
>> As you can see, it seems the problem started at the same time on both
>> nodes. These nodes have a crossover cable on a dedicated eth interface
>> in the private net 192.168.1.0; we use OpenVZ on these hosts, so the
>> kernel is OpenVZ-patched and we recompiled the ocfs2 sources to get
>> the appropriate ocfs2 modules.
>>
>> The configuration file is the same on both nodes:
>>
>> node:
>>         ip_port = 7777
>>         ip_address = 192.168.1.3
>>         number = 0
>>         name = heraclito
>>         cluster = ocfs2
>>
>> node:
>>         ip_port = 7777
>>         ip_address = 192.168.1.2
>>         number = 1
>>         name = parmenides
>>         cluster = ocfs2
>>
>> cluster:
>>         node_count = 2
>>         name = ocfs2
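>>
>> For reference, the 30-second value in the "has been idle for 30.0
>> seconds" messages above matches the o2cb network idle timeout, which
>> lives together with the other o2cb timeouts in /etc/sysconfig/o2cb.
>> As far as we know these are the stock 1.4 defaults:
>>
>>         O2CB_HEARTBEAT_THRESHOLD=31
>>         O2CB_IDLE_TIMEOUT_MS=30000
>>         O2CB_KEEPALIVE_DELAY_MS=2000
>>         O2CB_RECONNECT_DELAY_MS=2000
>>
>> (Raising the idle timeout would only paper over an interconnect that
>> keeps dropping out; it would not fix the underlying problem.)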
>>
>> We are very worried because these servers host critical services, so
>> we would be very grateful for any ideas we can try to avoid another
>> crash like this. The only way we could recover from this was to
>> reboot both servers, because we had no way to log in to the nodes.
>> Please let us know if you need additional information.
>>
>> Thanks in advance and regards.
>>
>> Frank
>>
>>
>>