[Ocfs2-users] servers blocked on ocfs2
frank
frank at si.ct.upc.edu
Sun Dec 12 23:58:17 PST 2010
After that, all operations on the nodes froze; we could not even log in.
Node 0 kept logging this kind of message until it stopped logging at 10:49:
Dec 4 10:49:34 heraclito kernel: (sendmail,19074,6):ocfs2_inode_lock_full:2121 ERROR: status = -22
Dec 4 10:49:34 heraclito kernel: (sendmail,19074,6):_ocfs2_statfs:1266 ERROR: status = -22
Dec 4 10:49:34 heraclito kernel: (sendmail,19074,6):dlm_send_remote_convert_request:393 ERROR: dlm status = DLM_IVLOCKID
Dec 4 10:49:34 heraclito kernel: (sendmail,19074,6):dlmconvert_remote:327 ERROR: dlm status = DLM_IVLOCKID
Dec 4 10:49:34 heraclito kernel: (sendmail,19074,6):ocfs2_cluster_lock:1258 ERROR: DLM error DLM_IVLOCKID while calling dlmlock on resource M000000000000000000000b6f931666: bad lockid
Node 1 kept logging this kind of message until it stopped logging at 10:00:
Dec 4 10:00:20 parmenides kernel: (o2net,10545,14):dlm_convert_lock_handler:489 ERROR: did not find lock to convert on grant queue! cookie=0:6
Dec 4 10:00:20 parmenides kernel: lockres: M000000000000000000000b6f931666, owner=1, state=0
Dec 4 10:00:20 parmenides kernel: last used: 0, refcnt: 4, on purge list: no
Dec 4 10:00:20 parmenides kernel: on dirty list: no, on reco list: no, migrating pending: no
Dec 4 10:00:20 parmenides kernel: inflight locks: 0, asts reserved: 0
Dec 4 10:00:20 parmenides kernel: refmap nodes: [ 0 ], inflight=0
Dec 4 10:00:20 parmenides kernel: granted queue:
Dec 4 10:00:20 parmenides kernel: type=5, conv=-1, node=1, cookie=1:6, ref=2, ast=(empty=y,pend=n), bast=(empty=y,pend=n), pending=(conv=n,lock=n,cancel=n,unlock=n)
Dec 4 10:00:20 parmenides kernel: converting queue:
Dec 4 10:00:20 parmenides kernel: type=0, conv=3, node=0, cookie=0:6, ref=2, ast=(empty=y,pend=n), bast=(empty=y,pend=n), pending=(conv=n,lock=n,cancel=n,unlock=n)
Dec 4 10:00:20 parmenides kernel: blocked queue:
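
If this happens again, the live lock state can be dumped while the hang
is still in progress (it is lost on reboot). A minimal sketch of how we
could capture it, assuming debugfs.ocfs2 from ocfs2-tools is installed;
/dev/sdb1 here is a placeholder for the real ocfs2 device:

  # Expose the ocfs2/o2dlm debugging files (skip if already mounted).
  mount -t debugfs debugfs /sys/kernel/debug

  # Dump the filesystem-level lock state; run on every node.
  debugfs.ocfs2 -R "fs_locks" /dev/sdb1

  # Dump the o2dlm lock resources and grep for the lockres named in the
  # logs (M000000000000000000000b6f931666) to see its owner and queues
  # from each node's point of view.
  debugfs.ocfs2 -R "dlm_locks" /dev/sdb1

The output is the same lockres/queue dump the kernel printed above, but
for every resource, so the two nodes' views of cookie 0:6 could be
compared before the error path fires.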
We rebooted both nodes at 13:03 and recovered services as usual, with no
further problems.
Frank
On 10/12/10 20:40, Joel Becker wrote:
> On Fri, Dec 10, 2010 at 11:38:04AM -0800, Joel Becker wrote:
>> On Fri, Dec 10, 2010 at 08:42:19AM +0100, frank wrote:
>>> Anyway, if there was a cut in the heartbeat or something similar, one of
>>> the nodes should have fenced itself, shouldn't it? Why did the nodes
>>> stall? Can we avoid that?
>> If both nodes saw the network go down, but the disk heartbeat
>> was still working, the higher node should have fenced. Was there no
>> fencing? Were both nodes just hung? How were they hung? All
>> operations, or just ocfs2 operations?
> Oh, I see. While node 0 was waiting for node 1 to kill itself,
> node 1 managed to reconnect. The invalid lock stuff was weird, though.
> After this, did all operations return to normal, or were many operations
> permanently frozen?
>
> Joel
>
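
As to the "can we avoid that?" question above: with the o2cb stack,
whether a node fences or just hangs waiting is governed by the disk
heartbeat threshold and the network timeouts, normally set in
/etc/sysconfig/o2cb (or via "service o2cb configure") and required to be
identical on every node. A sketch of the relevant settings, assuming the
stock defaults of that era; the values are illustrative, not a
recommendation:

  # /etc/sysconfig/o2cb (excerpt)
  # Missed 2s disk-heartbeat iterations before a node is declared dead
  # (31 is roughly 60 seconds).
  O2CB_HEARTBEAT_THRESHOLD=31
  # Network idle timeout in ms; once a link has been idle this long it
  # is declared down, and in a two-node cluster the higher-numbered
  # node should fence itself rather than stall.
  O2CB_IDLE_TIMEOUT_MS=30000
  # Keepalive and reconnect delays, in ms.
  O2CB_KEEPALIVE_DELAY_MS=2000
  O2CB_RECONNECT_DELAY_MS=2000

Lowering O2CB_IDLE_TIMEOUT_MS makes a flaky interconnect fence sooner
instead of leaving both nodes blocked, at the cost of more false
positives on a congested network.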