[Ocfs2-users] Mysterious server reboot

Nikola Savic niks at logik-internet.rs
Sat Mar 26 02:24:17 PDT 2011


  Hi all,

  Just to keep you informed :) After 7 days of normal operation, we again
had a server failure because of the OCFS2/DLM drop reference bug. I have
added the log at the end of this message.

  We're running CentOS 5 with the latest available Red Hat kernel,
2.6.18-238.5.1.el5, and OCFS2 1.4.7 installed from the packages provided by
Oracle for this kernel. The cluster has 3 nodes. Two nodes provide
shared storage using DRBD and OCFS2. All 3 nodes access the shared
storage over iSCSI (server 1 is the iSCSI target with DRBD as the backing
device). The cluster is used to host a single web site. All 3 nodes
run Apache web servers and access the web application files on the shared
storage. The bug happens while rsync is doing the daily backup. Interestingly,
I noticed a similar error logged on another server, but that server did
not hang with a kernel panic.

  I plan to install the latest kernel provided by Oracle for RHEL5
(2.6.32-100.0.19.el5.x86_64) from the public yum repository
(http://public-yum.oracle.com/), together with OCFS2 1.6. I hope that this
kernel includes the bug fix. However, the answer I got on the original bug
report, http://oss.oracle.com/bugzilla/show_bug.cgi?id=912, is not definite :(.
I assume compiling the kernel or OCFS2 from the latest source code is
possible, but it sounds like too much work.

  Is anyone using OCFS2 1.4 on CentOS 5 in a similar setup without
running into this bug?

  It's strange to me that the fix for a bug marked as RESOLVED 1 year ago is
not included in OCFS2 packages created a few weeks ago :( If the bug is kernel
related and RHEL kernel 2.6.18-238.5.1.el5 is not recent enough to include
the fix, then my only hope for now is installing Oracle's kernel.

Logged errors:
Mar 26 04:07:42 server3 kernel:
(dlm_thread,5142,2):dlm_drop_lockres_ref:2216 ERROR: while dropping ref
on BDB600C633D74D6B85C496D78F566879:N0000000000fc82d0 (master=1) got -22.
Mar 26 04:07:42 server3 kernel: lockres: N0000000000fc82d000fce096,
owner=1, state=64
Mar 26 04:07:42 server3 kernel:   last used: 4899685258, refcnt: 3, on
purge list: yes
Mar 26 04:07:42 server3 kernel:   on dirty list: no, on reco list: no,
migrating pending: no
Mar 26 04:07:42 server3 kernel:   inflight locks: 0, asts reserved: 0
Mar 26 04:07:42 server3 kernel:   refmap nodes: [ ], inflight=0
Mar 26 04:07:42 server3 kernel:   granted queue:
Mar 26 04:07:42 server3 kernel:   converting queue:
Mar 26 04:07:42 server3 kernel:   blocked queue:
Mar 26 04:07:44 server3 kernel: ----------- [cut here ] ---------
[please bite here ] ---------
Mar 26 04:07:44 server3 kernel: Kernel BUG at
...xiaowei/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmmaster.c:2218
Mar 26 04:07:44 server3 kernel: invalid opcode: 0000 [1] SMP
Mar 26 04:07:44 server3 kernel: last sysfs file:
/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq
Mar 26 04:07:44 server3 kernel: CPU 2
Mar 26 04:07:44 server3 kernel: Modules linked in: ocfs2(U) be2iscsi
ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i
cnic cxgb3i libiscsi_tcp libiscsi2 scsi_transport_iscsi2
scsi_transport_iscsi ipt_recent acpi_cpufreq freq_table mperf
ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U) configfs uio cxgb3
8021q iptable_nat ip_nat iptable_mangle ipt_REJECT xt_state ip_conntrack
nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter
ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh
video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery
asus_acpi acpi_memhotplug ac parport_pc lp parport sg tpm_tis tpm
i2c_i801 tpm_bios r8169 i2c_core shpchp mii serio_raw pcspkr i7core_edac
edac_mc dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot
dm_zero dm_mirror dm_log dm_mod raid10 raid456 xor raid0 sata_nv aacraid
3w_9xxx 3w_xxxx sata_sil sata_via ahci libata sd_mod scsi_mod raid1 ext3
jbd uhci_hcd ohci_hcd ehci_hcd
Mar 26 04:07:44 server3 kernel: Pid: 5142, comm: dlm_thread Tainted:
G      2.6.18-238.5.1.el5 #1
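
Since a similar error also showed up on another node without a panic, it
may help to scan each node's log for it. Below is a minimal Python sketch,
not a tested tool. It assumes every syslog entry sits on a single line (the
entries above were wrapped by the mail client). The function name
find_drop_ref_errors and the field names are made up for illustration, and
the third value inside the parentheses is assumed to be the CPU number,
which matches "CPU 2" in the oops.

```python
import errno
import re

# Pattern for the dlm_drop_lockres_ref error entry shown in the log above.
# Assumption: one syslog entry per line (unwrapped).
DROP_REF_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"\(dlm_thread,(?P<pid>\d+),(?P<cpu>\d+)\):dlm_drop_lockres_ref:\d+\s+"
    r"ERROR: while dropping ref on (?P<domain>[0-9A-F]+):(?P<lockres>\S+)\s+"
    r"\(master=(?P<master>\d+)\) got (?P<err>-?\d+)"
)


def find_drop_ref_errors(lines):
    """Return one dict per matching log entry (hypothetical helper)."""
    hits = []
    for line in lines:
        match = DROP_REF_RE.search(line)
        if match is None:
            continue
        hit = match.groupdict()
        hit["master"] = int(hit["master"])
        hit["err"] = int(hit["err"])
        # The kernel reports negative errno values; map to a symbolic name
        # when Python knows it, otherwise leave it as None.
        hit["errname"] = errno.errorcode.get(-hit["err"])
        hits.append(hit)
    return hits
```

It could be called as find_drop_ref_errors(open("/var/log/messages")), where
the path is an assumption about where kernel messages end up on the node.
For the entry logged above, it reports master 1 and error -22, which is
EINVAL.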



