<div dir="ltr"><p>Dear OCFS2 forum,<br> <br>We run ocfs2 version 1.2.9-1 as an ocfs2 cluster on four Linux servers running RHEL 4 (kernel: 2.6.9-42.0.2.ELs).<br> <br>One of the Linux servers is rebooting unexpectedly, and we are trying to determine whether the reboots are related to ocfs2.<br>
We enabled ocfs2 tracing on the node we suspected would reboot:<br> # debugfs.ocfs2 -l SUPER allow<br> # debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow<br>and then waited for the reboot to occur. A sample of the log messages around the time of each reboot is included below. I see no unusual ocfs2 messages in /var/log/messages, but I wanted to check whether anyone on the forum sees anything strange.<br>
<br>Can you confirm that ocfs2 version 1.2.9-1 is compatible with Linux kernel 2.6.9-42.0.2.ELs? Also, if ocfs2 fences a node, can you confirm that a message noting the fencing is written to /var/log/messages? Your responses may help us narrow down the cause.<br>
Can you also let us know whether there are any particular log files we should check, or anything else we can do to confirm that ocfs2 is, or is not, the cause of these reboots?<br> <br>We would appreciate any responses.<br> <br>Regards,<br>
Derek Hazell | System Administrator<br>#####################################################################<br>APPENDIX 1 : REBOOT on Friday night (ocfs2 tracing running)<br>Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:<br>
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M000000000000000c5b1914dc72d356<br>Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M000000000000000c5b1914dc72d356<br>
Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:<br>Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M0000000000000009f1bbc95e1dad74<br>Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M0000000000000009f1bbc95e1dad74<br>
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M0000000000000009f1bbc95e1dad74<br>Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M0000000000000009f1bbc95e1dad74<br>
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M0000000000000009f1bbc95e1dad74<br>Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M0000000000000009f1bbc95e1dad74<br>
Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:<br>Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M000000000000000c5bc95ddc72d357<br>Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M000000000000000c5bc95ddc72d357<br>
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M000000000000000c5bc95ddc72d357<br>Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M000000000000000c5bc95ddc72d357<br>
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M000000000000000c5bc95ddc72d357<br>Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M000000000000000c5bc95ddc72d357<br>
Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:<br>Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M00000000000000049c73bf5e1d8e29<br>Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M00000000000000049c73bf5e1d8e29<br>
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M00000000000000049c73bf5e1d8e29<br>[UNEXPECTED REBOOT]<br>Aug 15 21:05:09 Sysname syslogd 1.4.1: restart.<br>Aug 15 21:05:09 Sysname syslog: syslogd startup succeeded<br>
Aug 15 21:05:09 Sysname kernel: klogd 1.4.1, log source = /proc/kmsg started.<br>Aug 15 21:05:09 Sysname kernel: Bootdata ok (command line is ro root=/dev/VolGroup_ID_12182/LogVol1 rhgb quiet)<br>Aug 15 21:05:09 Sysname kernel: Linux version 2.6.9-42.0.2.ELsmp (bhcompile@ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1
SMP Thu Aug 17 17:57:31 EDT 2006<br>Aug 15 21:05:09 Sysname kernel: BIOS-provided physical RAM map:<br>######################################################################<br>APPENDIX 2 : REBOOT on Saturday night (ocfs2 tracing NOT running)<br>
Aug 15 21:08:12 Sysname kernel: o2net: connected to node Othersystem2.x.y (num 1) at 172.16.172.172:7777<br>Aug 15 21:08:13 Sysname kernel: o2net: accepted connection from node Othersystem1.x.y (num 3) at 172.16.172.171:7777<br>
Aug 15 21:08:16 Sysname kernel: OCFS2 1.2.9 Mon May 19 13:00:33 PDT 2008 (build a693806cb619dd7f225004092b675ede)<br>Aug 15 21:08:16 Sysname kernel: ocfs2_dlm: Nodes in domain ("46C5D4A751514E55B04786DFEC7B2175"): 1 2 3<br>
Aug 15 21:08:17 Sysname kernel: kjournald starting. Commit interval 5 seconds<br>Aug 15 21:08:17 Sysname kernel: ocfs2: Mounting device (120,1) on (node 2, slot 2)<br>Aug 15 21:08:21 Sysname kernel: ocfs2_dlm: Nodes in domain ("0D29B3C9792B46E1BD0DFF0A97E03534"): 1 2 3<br>
Aug 15 21:08:21 Sysname kernel: kjournald starting. Commit interval 5 seconds<br>Aug 15 21:08:21 Sysname kernel: ocfs2: Mounting device (120,17) on (node 2, slot 2)<br>Aug 15 21:08:31 Sysname ntpd[7076]: synchronized to 172.16.32.254, stratum 2<br>
Aug 15 21:08:31 Sysname ntpd[7076]: kernel time sync disabled 0041<br>Aug 15 21:08:38 Sysname su(pam_unix)[9656]: session opened for user digicol by root(uid=0)<br>Aug 15 21:08:41 Sysname su(pam_unix)[9656]: session closed for user digicol<br>
Aug 15 21:13:52 Sysname ntpd[7076]: kernel time sync enabled 0001<br>Aug 15 21:41:46 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000<br>Aug 15 21:41:46 Sysname kernel: end_request: I/O error, dev sdc, sector 1291272320<br>
Aug 15 21:41:46 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000<br>Aug 15 21:41:46 Sysname kernel: end_request: I/O error, dev sdc, sector 1487646848<br>Aug 15 21:41:47 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000<br>
Aug 15 21:41:47 Sysname kernel: end_request: I/O error, dev sdc, sector 1301852288<br>Aug 15 21:41:48 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000<br>Aug 15 21:41:48 Sysname kernel: end_request: I/O error, dev sdc, sector 1498484864<br>
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000<br>Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1611251840<br>Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000<br>
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1045610624<br>Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000<br>Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1234243712<br>
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000<br>Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 989614208<br>Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000<br>
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1115283584<br>Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000<br>Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1240952960<br>
Aug 15 21:45:14 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000<br>Aug 15 21:45:14 Sysname kernel: end_request: I/O error, dev sdc, sector 995807360<br>Aug 15 21:45:14 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000<br>
Aug 15 21:45:14 Sysname kernel: end_request: I/O error, dev sdc, sector 1104961664<br>Aug 15 21:45:14 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000<br>Aug 15 21:45:14 Sysname kernel: end_request: I/O error, dev sdc, sector 1008507952<br>
Aug 16 03:00:26 Sysname Server Administrator: Storage Service EventID: 2242 The Patrol Read has started.: Controller 0 (PERC 5/i Integrated)<br>Aug 16 03:00:27 Sysname snmpd[7589]: Got trap from peer on fd 13<br>Aug 16 03:52:02 Sysname Server Administrator: Storage Service EventID: 2243 The Patrol Read has stopped.: Controller 0 (PERC 5/i Integrated)<br>
Aug 16 03:52:02 Sysname snmpd[7589]: Got trap from peer on fd 13<br>Aug 16 16:38:33 Sysname sshd(pam_unix)[31901]: session opened for user root by root(uid=0)<br>Aug 16 16:55:55 Sysname sshd(pam_unix)[32254]: session opened for user root by root(uid=0)<br>
Aug 16 17:27:06 Sysname sshd(pam_unix)[966]: session opened for user root by root(uid=0)<br>[UNEXPECTED REBOOT]<br>Aug 16 23:18:31 Sysname syslogd 1.4.1: restart.<br>Aug 16 23:18:31 Sysname syslog: syslogd startup succeeded<br>
Aug 16 23:18:31 Sysname kernel: klogd 1.4.1, log source = /proc/kmsg started.<br>Aug 16 23:18:31 Sysname kernel: Bootdata ok (command line is ro root=/dev/VolGroup_ID_12182/LogVol1 rhgb quiet)<br>Aug 16 23:18:31 Sysname kernel: Linux version 2.6.9-42.0.2.ELsmp (bhcompile@ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1
SMP Thu Aug 17 17:57:31 EDT 2006<br>Aug 16 23:18:31 Sysname kernel: BIOS-provided physical RAM map:<br>#####################################################################</p></div>
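<p>For anyone wanting to repeat the check on their own logs, here is a minimal sketch of how the excerpts above could be scanned. It assumes the "syslogd 1.4.1: restart." line marks each boot and that disk errors appear as "end_request: I/O error" lines, as in the appendices. The function names (split_log, context_before_reboots, io_errors_per_device) are made up for illustration and are not part of any ocfs2 tool, and the script says nothing about what ocfs2 itself logs when it fences a node.</p>

```python
import re
from collections import Counter

# Assumption: this marker is copied from the syslogd lines in the appendices;
# other syslogd versions may print a different string.
RESTART_MARKER = "syslogd 1.4.1: restart."

# Matches the block-layer error lines seen in Appendix 2.
IO_ERROR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def split_log(text):
    """Split log text separated by newlines or <br> tags into clean lines."""
    parts = re.split(r"<br\s*/?>|\n", text)
    return [p.strip() for p in parts if p.strip()]


def context_before_reboots(lines, context=5):
    """For each syslogd restart, return the lines logged just before it."""
    results = []
    for i, line in enumerate(lines):
        if RESTART_MARKER in line:
            results.append(lines[max(0, i - context):i])
    return results


def io_errors_per_device(lines):
    """Count 'end_request: I/O error' lines per block device."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines taken from Appendix 2, used here only as sample input.
SAMPLE = """\
Aug 15 21:45:14 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:45:14 Sysname kernel: end_request: I/O error, dev sdc, sector 1008507952
Aug 16 17:27:06 Sysname sshd(pam_unix)[966]: session opened for user root by root(uid=0)
Aug 16 23:18:31 Sysname syslogd 1.4.1: restart.
Aug 16 23:18:31 Sysname syslog: syslogd startup succeeded
"""

sample_lines = split_log(SAMPLE)
for block in context_before_reboots(sample_lines, context=2):
    print("Last messages before a restart:")
    for entry in block:
        print("  " + entry)
print("I/O errors per device:", dict(io_errors_per_device(sample_lines)))
```

<p>To run it against a real log, read the contents of /var/log/messages into a string and pass it to split_log in place of SAMPLE. If the lines before each restart contain nothing from the kernel or from ocfs2, that only shows that nothing reached syslog before the machine went down. It does not rule out any particular cause.</p>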