[Ocfs2-users] servers blocked on ocfs2

frank frank at si.ct.upc.edu
Thu Dec 9 23:16:05 PST 2010


Hi Sunil,
first of all, thanks for the answer, but are you saying that it is better to 
use a switch than a direct cable between the nodes?
I thought that using a switch adds an extra point of failure and also 
wastes a couple of switch ports unnecessarily.
What is the problem with using crossover cables?
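
On a related note, the logs below show o2net declaring the link idle after 
30 seconds. As far as we understand, that timeout is set by the o2cb init 
script and stored in /etc/sysconfig/o2cb. The snippet below is only an 
illustrative sketch with what we believe are the stock ocfs2 1.4 parameter 
names and defaults, not a copy of our nodes' file, so it should be checked 
against "service o2cb configure":

    # /etc/sysconfig/o2cb (illustrative defaults, not taken from our nodes)
    O2CB_ENABLED=true
    O2CB_BOOTCLUSTER=ocfs2            # must match the cluster name in cluster.conf
    O2CB_HEARTBEAT_THRESHOLD=31       # disk heartbeat dead threshold (iterations)
    O2CB_IDLE_TIMEOUT_MS=30000        # network idle timeout; matches the "idle for 30.0 seconds" message
    O2CB_KEEPALIVE_DELAY_MS=2000
    O2CB_RECONNECT_DELAY_MS=2000

Would raising O2CB_IDLE_TIMEOUT_MS on both nodes be a reasonable workaround 
for short interconnect hiccups, or does it just hide the real problem?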

Frank

On 09/12/10 20:15, Sunil Mushran wrote:
> The interconnect is the problem. Don't use crossover cables. Use a gige 
> link with a proper switch. That's what the world uses.
>
> On 12/09/2010 02:10 AM, frank wrote:
>> Hi,
>>
>> we have recently started to use ocfs2 on some RHEL 5.5 servers 
>> (ocfs2-1.4.7).
>> Some days ago, two servers sharing an ocfs2 filesystem, and hosting 
>> quite a few virtual services, stalled in what seems to be an ocfs2 
>> issue. These are the lines in their messages files:
>>
>> =====node heraclito (0)========================================
>> Dec  4 09:15:06 heraclito kernel: o2net: connection to node parmenides (num 1) at 192.168.1.2:7777 has been idle for 30.0 seconds, shutting it down.
>> Dec  4 09:15:06 heraclito kernel: (swapper,0,7):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1291450476.228826 now 1291450506.229456 dr 1291450476.228760 adv 1291450476.228842:1291450476.228843 func (de6e01eb:500) 1291450476.228827:1291450476.228829)
>> Dec  4 09:15:06 heraclito kernel: o2net: no longer connected to node parmenides (num 1) at 192.168.1.2:7777
>> Dec  4 09:15:06 heraclito kernel: (vzlist,22622,7):dlm_send_remote_convert_request:395 ERROR: status = -112
>> Dec  4 09:15:06 heraclito kernel: (snmpd,16452,10):dlm_send_remote_convert_request:395 ERROR: status = -112
>> Dec  4 09:15:06 heraclito kernel: (snmpd,16452,10):dlm_wait_for_node_death:370 0D3E49EB1F614A3EAEC0E2A74A34AFFF: waiting 5000ms for notification of death of node 1
>> Dec  4 09:15:06 heraclito kernel: (httpd,4615,10):dlm_do_master_request:1334 ERROR: link to 1 went down!
>> Dec  4 09:15:06 heraclito kernel: (httpd,4615,10):dlm_get_lock_resource:917 ERROR: status = -112
>> Dec  4 09:15:06 heraclito kernel: (python,20750,10):dlm_do_master_request:1334 ERROR: link to 1 went down!
>> Dec  4 09:15:06 heraclito kernel: (python,20750,10):dlm_get_lock_resource:917 ERROR: status = -112
>> Dec  4 09:15:06 heraclito kernel: (vzlist,22622,7):dlm_wait_for_node_death:370 0D3E49EB1F614A3EAEC0E2A74A34AFFF: waiting 5000ms for notification of death of node 1
>> Dec  4 09:15:06 heraclito kernel: o2net: accepted connection from node parmenides (num 1) at 192.168.1.2:7777
>> Dec  4 09:15:11 heraclito kernel: (snmpd,16452,5):dlm_send_remote_convert_request:393 ERROR: dlm status = DLM_IVLOCKID
>> Dec  4 09:15:11 heraclito kernel: (snmpd,16452,5):dlmconvert_remote:327 ERROR: dlm status = DLM_IVLOCKID
>> Dec  4 09:15:11 heraclito kernel: (snmpd,16452,5):ocfs2_cluster_lock:1258 ERROR: DLM error DLM_IVLOCKID while calling dlmlock on resource M000000000000000000000b6f931666: bad lockid
>> Dec  4 09:15:11 heraclito kernel: (snmpd,16452,5):ocfs2_inode_lock_full:2121 ERROR: status = -22
>> Dec  4 09:15:11 heraclito kernel: (snmpd,16452,5):_ocfs2_statfs:1266 ERROR: status = -22
>> Dec  4 09:15:11 heraclito kernel: (vzlist,22622,9):dlm_send_remote_convert_request:393 ERROR: dlm status = DLM_IVLOCKID
>> Dec  4 09:15:11 heraclito kernel: (vzlist,22622,9):dlmconvert_remote:327 ERROR: dlm status = DLM_IVLOCKID
>> Dec  4 09:15:11 heraclito kernel: (vzlist,22622,9):ocfs2_cluster_lock:1258 ERROR: DLM error DLM_IVLOCKID while calling dlmlock on resource M00000000000000000b94df00000000: bad lockid
>>
>> =====node parmenides (1)========================================
>> Dec  4 09:15:06 parmenides kernel: o2net: connection to node heraclito (num 0) at 192.168.1.3:7777 has been idle for 30.0 seconds, shutting it down.
>> Dec  4 09:15:06 parmenides kernel: (swapper,0,9):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1291450476.231519 now 1291450506.232462 dr 1291450476.231506 adv 1291450476.231522:1291450476.231522 func (de6e01eb:505) 1291450475.650496:1291450475.650501)
>> Dec  4 09:15:06 parmenides kernel: o2net: no longer connected to node heraclito (num 0) at 192.168.1.3:7777
>> Dec  4 09:15:06 parmenides kernel: (snmpd,12342,11):dlm_do_master_request:1334 ERROR: link to 0 went down!
>> Dec  4 09:15:06 parmenides kernel: (minilogd,12700,0):dlm_wait_for_lock_mastery:1117 ERROR: status = -112
>> Dec  4 09:15:06 parmenides kernel: (smbd,25555,12):dlm_do_master_request:1334 ERROR: link to 0 went down!
>> Dec  4 09:15:06 parmenides kernel: (python,12439,9):dlm_do_master_request:1334 ERROR: link to 0 went down!
>> Dec  4 09:15:06 parmenides kernel: (python,12439,9):dlm_get_lock_resource:917 ERROR: status = -112
>> Dec  4 09:15:06 parmenides kernel: (smbd,25555,12):dlm_get_lock_resource:917 ERROR: status = -112
>> Dec  4 09:15:06 parmenides kernel: (minilogd,12700,0):dlm_do_master_request:1334 ERROR: link to 0 went down!
>> Dec  4 09:15:06 parmenides kernel: (minilogd,12700,0):dlm_get_lock_resource:917 ERROR: status = -107
>> Dec  4 09:15:06 parmenides kernel: (dlm_thread,10627,4):dlm_drop_lockres_ref:2211 ERROR: status = -112
>> Dec  4 09:15:06 parmenides kernel: (dlm_thread,10627,4):dlm_purge_lockres:206 ERROR: status = -112
>> Dec  4 09:15:06 parmenides kernel: o2net: connected to node heraclito (num 0) at 192.168.1.3:7777
>> Dec  4 09:15:06 parmenides kernel: (snmpd,12342,11):dlm_get_lock_resource:917 ERROR: status = -112
>> Dec  4 09:15:11 parmenides kernel: (o2net,10545,6):dlm_convert_lock_handler:489 ERROR: did not find lock to convert on grant queue! cookie=0:6
>> Dec  4 09:15:11 parmenides kernel: lockres: M000000000000000000000b6f931666, owner=1, state=0
>> Dec  4 09:15:11 parmenides kernel:   last used: 0, refcnt: 4, on purge list: no
>> Dec  4 09:15:11 parmenides kernel:   on dirty list: no, on reco list: no, migrating pending: no
>> Dec  4 09:15:11 parmenides kernel:   inflight locks: 0, asts reserved: 0
>> Dec  4 09:15:11 parmenides kernel:   refmap nodes: [ 0 ], inflight=0
>> Dec  4 09:15:11 parmenides kernel:   granted queue:
>> Dec  4 09:15:11 parmenides kernel:     type=5, conv=-1, node=1, cookie=1:6, ref=2, ast=(empty=y,pend=n), bast=(empty=y,pend=n), pending=(conv=n,lock=n,cancel=n,unlock=n)
>> =================================================
>>
>> As you can see, it seems the problem started at the same time on both 
>> nodes. These nodes are connected by a crossover cable on a dedicated eth 
>> interface in the private net 192.168.1.0; we use OpenVZ on these hosts, 
>> so the kernel is OpenVZ-patched and we recompiled the ocfs2 sources to 
>> get the appropriate ocfs2 modules.
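>>
>> For completeness, this is roughly how we check that the recompiled module 
>> matches the running kernel, and what idle timeout the cluster is using at 
>> runtime (the configfs path below is our assumption for this o2cb version 
>> and may need adjusting):
>>
>>     uname -r                                              # running OpenVZ-patched kernel
>>     modinfo ocfs2 | grep -Ei 'version|vermagic'           # kernel the recompiled ocfs2 module was built against
>>     cat /sys/kernel/config/cluster/ocfs2/idle_timeout_ms  # assumed configfs location of the o2net idle timeout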
>>
>> The configuration file is the same on both nodes:
>>
>> node:
>>     ip_port = 7777
>>     ip_address = 192.168.1.3
>>     number = 0
>>     name = heraclito
>>     cluster = ocfs2
>>
>> node:
>>     ip_port = 7777
>>     ip_address = 192.168.1.2
>>     number = 1
>>     name = parmenides
>>     cluster = ocfs2
>>
>> cluster:
>>     node_count = 2
>>     name = ocfs2
>>
>> We are quite worried because these servers host critical services, so we 
>> would be very grateful for any ideas we could try to avoid another crash 
>> like this. The only way we could recover was to reboot both servers, 
>> because we had no way to log in to the nodes.
>> Please let us know if you need additional information.
>>
>> Thanks in advance and regards.
>>
>> Frank
>>
>>
>>


