[Ocfs2-users] Node crash

Sérgio Surkamp sergio at gruposinternet.com.br
Wed Dec 2 10:08:53 PST 2009


Not sure if this is useful, but we hit a crash we had never seen before.

Setup:

2x SuSE SLES 10 SP2 (it's old, I know)

Problem description:

1. We had to reboot the ocfs2 master node;
2. During the reboot, umount coredumped, leaving the filesystem
   mounted or perhaps still heartbeating (?);
3. The slave node detected that the master was dead;
4. When the slave tried to assume master status, it rebooted (no
   crash, no warning, nothing, just as if the reset button had been
   pressed);
5. The master hung because it could not unmount the ocfs2 filesystem.

We could not capture many messages from the nodes, just these:

master node umount crash (from syslog):
Dec  2 14:22:08 soap02 kernel: (19573,5):dlm_empty_lockres:2783 ERROR:
lockres M00000000000000164ad60700000000 still has local locks!
Dec  2 14:22:08 soap02 kernel: ----------- [cut here ] ---------
[please bite here ] ---------
Dec  2 14:22:08 soap02 kernel: Kernel BUG at
fs/ocfs2/dlm/dlmmaster.c:2784
Dec  2 14:22:08 soap02 kernel: invalid opcode: 0000 [1] SMP
Dec  2 14:22:08 soap02 kernel: last sysfs
file: /devices/pci0000:00/0000:00:1c.0/0000:04:00.0/0000:05:00.0/power/state
Dec  2 14:22:08 soap02 kernel: CPU 5
Dec  2 14:22:08 soap02 kernel: Modules linked in: af_packet joydev st
ocfs2 jbd ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs nfsd
exportfs nfs lockd nfs_acl sunrpc ipv6 button battery ac binfmt_misc
netconsole xt_comment xt_tcpudp xt_state iptable_filter iptable_mangle
iptable_nat ip_nat ip_conntrack nfnetlink ip_tables x_tables apparmor
loop sr_mod usbhid usb_storage ide_cd uhci_hcd ehci_hcd usbcore shpchp
hw_random cdrom bnx2 pci_hotplug reiserfs ata_piix ahci libata
dm_snapshot qla2xxx firmware_class qla2xxx_conf intermodule edd dm_mod
fan thermal processor sg megaraid_sas piix sd_mod scsi_mod ide_disk
ide_core
Dec  2 14:22:08 soap02 kernel: Pid: 19573, comm: umount Tainted: G
U 2.6.16.60-0.21-smp #1
Dec  2 14:22:08 soap02 kernel: RIP: 0010:[<ffffffff885a9d6d>]
<ffffffff885a9d6d>{:ocfs2_dlm:dlm_empty_lockres+5255}
Dec  2 14:22:08 soap02 kernel: RSP: 0018:ffff810356f65c88  EFLAGS:
00010292
Dec  2 14:22:08 soap02 kernel: RAX: 000000000000006a RBX:
ffff8101f28f7880 RCX: 0000000000000292
Dec  2 14:22:08 soap02 kernel: RDX: ffffffff80359968 RSI:
0000000000000296 RDI: ffffffff80359960
Dec  2 14:22:08 soap02 kernel: RBP: ffff81025eec7e00 R08:
ffffffff80359968 R09: ffff810423f77a80
Dec  2 14:22:08 soap02 kernel: R10: ffff810001071600 R11:
0000000000000070 R12: 0000000000000184
Dec  2 14:22:08 soap02 kernel: R13: ffff8104257a5400 R14:
0000000000000184 R15: ffff8101f28f7880
Dec  2 14:22:08 soap02 kernel: FS: 00002ab1a83db6d0(0000)
GS:ffff810430654840(0000) knlGS:0000000000000000
Dec  2 14:22:08 soap02 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Dec  2 14:22:08 soap02 kernel: CR2: 00002aaaaac16000 CR3:
00000001a2c2f000 CR4: 00000000000006e0
Dec  2 14:22:08 soap02 kernel: Process umount (pid: 19573, threadinfo
ffff810356f64000, task ffff8102e78997e0)
Dec  2 14:22:08 soap02 kernel: Stack: 00000000ffffffd9 0000000000000000
01ff810400000001 ffff8102e78997e0
Dec  2 14:22:08 soap02 kernel: 0100000000000000 0000000100000003
0000000000000000 ffff8102e78997e0
Dec  2 14:22:08 soap02 kernel: ffffffff80147f3e ffff810356f65cd0
Dec  2 14:22:08 soap02 kernel: Call Trace:
<ffffffff80147f3e>{autoremove_wake_function+0}
Dec  2 14:22:08 soap02 kernel:
<ffffffff885a30e1>{:ocfs2_dlm:dlm_unregister_domain+479}
Dec  2 14:22:08 soap02 kernel:
<ffffffff8012c668>{default_wake_function+0}
<ffffffff8860bb5e>{:ocfs2:ocfs2_dlm_shutdown+190}
Dec  2 14:22:08 soap02 kernel:
<ffffffff8862fe07>{:ocfs2:ocfs2_dismount_volume+559}
Dec  2 14:22:08 soap02 kernel:
<ffffffff886302f7>{:ocfs2:ocfs2_put_super+104}
<ffffffff8018bc99>{generic_shutdown_super+148}
Dec  2 14:22:08 soap02 kernel:
<ffffffff8018bd6a>{kill_block_super+38}
<ffffffff8018be40>{deactivate_super+114}
Dec  2 14:22:08 soap02 kernel:        <ffffffff801a078e>{sys_umount+623}
<ffffffff8018e4e1>{sys_newstat+25}
Dec  2 14:22:08 soap02 kernel:
<ffffffff8010ae42>{system_call+126}
Dec  2 14:22:08 soap02 kernel: Code: 0f
0b 68 95 d0 5b 88 c2 e0 0a 48 f7 05 9e 2c fd ff 00 09 00
Dec  2 14:22:08 soap02 kernel: RIP
<ffffffff885a9d6d>{:ocfs2_dlm:dlm_empty_lockres+5255} RSP
<ffff810356f65c88>
Dec  2 14:22:08 soap02 kernel:  Badness in do_exit at kernel/exit.c:837
Dec  2 14:22:08 soap02 kernel:
Dec  2 14:22:08 soap02 kernel: Call Trace:
<ffffffff80137000>{do_exit+80}
<ffffffff802ea8b6>{_spin_unlock_irqrestore+8}
Dec  2 14:22:08 soap02 kernel:
<ffffffff8010c820>{kernel_math_error+0}
<ffffffff8010cdb5>{do_invalid_op+163}
Dec  2 14:22:09 soap02 kernel:
<ffffffff885a9d6d>{:ocfs2_dlm:dlm_empty_lockres+5255}
Dec  2 14:22:09 soap02 kernel: <ffffffff8012c10c>{activate_task+204}
<ffffffff8012c657>{try_to_wake_up+1106}
Dec  2 14:22:09 soap02 kernel:        <ffffffff801349b8>{printk+78}
<ffffffff8010bd19>{error_exit+0}
Dec  2 14:22:09 soap02 kernel:
<ffffffff885a9d6d>{:ocfs2_dlm:dlm_empty_lockres+5255}
Dec  2 14:22:09 soap02 kernel:
<ffffffff80147f3e>{autoremove_wake_function+0}
<ffffffff885a30e1>{:ocfs2_dlm:dlm_unregister_domain+479}
Dec  2 14:22:09 soap02 kernel:
<ffffffff8012c668>{default_wake_function+0}
<ffffffff8860bb5e>{:ocfs2:ocfs2_dlm_shutdown+190}
Dec  2 14:22:09 soap02 kernel:
<ffffffff8862fe07>{:ocfs2:ocfs2_dismount_volume+559}
Dec  2 14:22:09 soap02 kernel:
<ffffffff886302f7>{:ocfs2:ocfs2_put_super+104}
<ffffffff8018bc99>{generic_shutdown_super+148}
Dec  2 14:22:09 soap02 kernel:
<ffffffff8018bd6a>{kill_block_super+38}
<ffffffff8018be40>{deactivate_super+114}
Dec  2 14:22:09 soap02 kernel:        <ffffffff801a078e>{sys_umount+623}
<ffffffff8018e4e1>{sys_newstat+25}
Dec  2 14:22:09 soap02 kernel:
<ffffffff8010ae42>{system_call+126}

slave node detecting the master down, then rebooting:

Dec  2 14:23:14 soap01 kernel: o2net: connection to node soap02 (num 0)
at 192.168.0.10:7777 has been idle for 60.0 seconds, shutting it down.
Dec  2 14:23:14 soap01 kernel: (0,0):o2net_idle_timer:1422 here are
some times that might help debug the situation: (tmr 1259770934.129785
now 1259770994.132629 dr 1259770934.129779 adv
1259770934.129789:1259770934.129789 func (300d6acb:505)
1259770933.205787:1259770933.205792)
Dec  2 14:23:14 soap01 kernel: o2net: no longer connected to node
soap02 (num 0) at 192.168.0.10:7777
Dec  2 14:23:14 soap01 kernel: (7035,1):dlm_do_master_request:1409
ERROR: link to 0 went down!
Dec  2 14:23:14 soap01 kernel: (7039,0):dlm_do_master_request:1409
ERROR: link to 0 went down!
Dec  2 14:23:14 soap01 kernel: (7039,0):dlm_get_lock_resource:986
ERROR: status = -112
Dec  2 14:23:14 soap01 kernel: (7035,1):dlm_get_lock_resource:986
ERROR: status = -112
Dec  2 14:23:14 soap01 kernel: (7043,0):dlm_do_master_request:1409
ERROR: link to 0 went down!
Dec  2 14:23:14 soap01 kernel: (7043,0):dlm_get_lock_resource:986
ERROR: status = -112
Dec  2 14:23:14 soap01 kernel:
(7047,0):dlm_send_remote_convert_request:395 ERROR: status = -112
Dec  2 14:23:14 soap01 kernel: (7047,0):dlm_wait_for_node_death:370
F59B45831EEA41F384BADE6C4B7A932B: waiting 5000ms for notification of
death of node 0
Dec  2 14:24:14 soap01 kernel: (5283,0):o2net_connect_expired:1583
ERROR: no connection established with node 0 after 60.0 seconds, giving
up and returning errors.
Dec  2 14:24:14 soap01 kernel:
(7047,0):dlm_send_remote_convert_request:395 ERROR: status = -107
Dec  2 14:24:14 soap01 kernel: (7047,0):dlm_wait_for_node_death:370
F59B45831EEA41F384BADE6C4B7A932B: waiting 5000ms for notification of
death of node 0

Hope this information is useful.

Regards,
-- 
  .:''''':.
.:'        `     Sérgio Surkamp | Gerente de Rede
::    ........   sergio at gruposinternet.com.br
`:.        .:'
  `:,   ,.:'     *Grupos Internet S.A.*
    `: :'        R. Lauro Linhares, 2123 Torre B - Sala 201
     : :         Trindade - Florianópolis - SC
     :.'
     ::          +55 48 3234-4109
     :
     '           http://www.gruposinternet.com.br
