[Ocfs2-users] ocfs2 issue? : unexplained reboots of RHEL 4 server (kernel:2.6.9-42.0.2.ELs)

Sunil Mushran sunil.mushran at oracle.com
Mon Aug 18 10:55:34 PDT 2008


Configure a netdump or netconsole server. It will capture the kernel 
messages printed just before the reboot, which may never reach 
/var/log/messages.
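
For illustration, here is a minimal sketch of the receiving side only: a 
UDP listener that appends whatever a node's netconsole sends to a file. 
It is not the netdump-server package and not an official tool. The port 
(6666), the output file name and the function names are assumptions made 
for this example; the port has to match whatever the sending node is 
configured to use.

```python
import socket


def make_listener(host="0.0.0.0", port=6666):
    """Open a UDP socket for incoming console messages.

    The port is an assumption; use the one the sending node targets.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock, logfile, max_packets=None):
    """Write each received datagram to logfile, prefixed by the sender IP.

    Runs until max_packets datagrams have been handled, or forever when
    max_packets is None. Returns the number of datagrams written.
    """
    count = 0
    while max_packets is None or count < max_packets:
        data, (addr, _port) = sock.recvfrom(65535)
        text = data.decode("utf-8", errors="replace").rstrip("\n")
        logfile.write(f"{addr} {text}\n")
        logfile.flush()  # keep the file current in case the sender dies
        count += 1
    return count


# Typical use on the log host (file name is hypothetical):
#     with open("netconsole.log", "a") as fh:
#         capture(make_listener(), fh)
```

Flushing after every datagram is deliberate: the last few lines before a 
reboot are the ones that matter, so they should not sit in a buffer.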

Derek Hazell wrote:
>
> Dear OCFS2 forum
>  
> We run an ocfs2 cluster (version 1.2.9-1) on four Linux servers 
> running RHEL 4 (kernel: 2.6.9-42.0.2.ELs).
>  
> We are getting unexpected reboots of one of the Linux servers and are 
> wondering whether they are related to ocfs2.
> We enabled ocfs2 tracing on the node we suspected would reboot:
>       # debugfs.ocfs2 -l SUPER allow
>       # debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow
> and then waited for the reboot to occur. A sample of the log messages 
> from around the time of the reboot is included below. There are no 
> strange ocfs2 messages in the /var/log/messages log file, but I thought 
> I would check with your forum in case you see anything unusual.
>  
> Can you confirm that ocfs2 version 1.2.9-1 is compatible with Linux 
> kernel 2.6.9-42.0.2.ELs? Also, if ocfs2 fences a node, can you confirm 
> that a message noting the fencing is written to the /var/log/messages 
> logfile? Your responses may help us narrow down the cause.
> Can you also let us know if there are any particular logfiles we should 
> check, or if there is anything we can do to confirm that ocfs2 is, or 
> is not, the cause of these reboots?
>  
> Appreciate any responses
>  
> regards
> Derek Hazell  |  System Administrator
> #####################################################################
> APPENDIX 1 : REBOOT on Friday night (ocfs2 tracing running)
> Aug 15 21:00:52 Sysname  kernel: (6885,0):dlm_mle_release:535 ENTRY:
> Aug 15 21:00:52 Sysname  kernel: (6885,0):__dlm_lookup_lockres:182 
> ENTRY:M000000000000000c5b1914dc72d356
> Aug 15 21:00:52 Sysname  kernel: 
> (6885,0):__dlm_lookup_lockres_full:148 
> ENTRY:M000000000000000c5b1914dc72d356
> Aug 15 21:00:52 Sysname  kernel: (6885,0):dlm_mle_release:535 ENTRY:
> Aug 15 21:00:52 Sysname  kernel: (6885,0):__dlm_lookup_lockres:182 
> ENTRY:M0000000000000009f1bbc95e1dad74
> Aug 15 21:00:52 Sysname  kernel: 
> (6885,0):__dlm_lookup_lockres_full:148 
> ENTRY:M0000000000000009f1bbc95e1dad74
> Aug 15 21:00:52 Sysname  kernel: (6885,0):__dlm_lookup_lockres:182 
> ENTRY:M0000000000000009f1bbc95e1dad74
> Aug 15 21:00:52 Sysname  kernel: 
> (6885,0):__dlm_lookup_lockres_full:148 
> ENTRY:M0000000000000009f1bbc95e1dad74
> Aug 15 21:00:52 Sysname  kernel: (6885,0):__dlm_lookup_lockres:182 
> ENTRY:M0000000000000009f1bbc95e1dad74
> Aug 15 21:00:52 Sysname  kernel: 
> (6885,0):__dlm_lookup_lockres_full:148 
> ENTRY:M0000000000000009f1bbc95e1dad74
> Aug 15 21:00:52 Sysname  kernel: (6885,0):dlm_mle_release:535 ENTRY:
> Aug 15 21:00:52 Sysname  kernel: (6885,0):__dlm_lookup_lockres:182 
> ENTRY:M000000000000000c5bc95ddc72d357
> Aug 15 21:00:52 Sysname  kernel: 
> (6885,0):__dlm_lookup_lockres_full:148 
> ENTRY:M000000000000000c5bc95ddc72d357
> Aug 15 21:00:52 Sysname  kernel: (6885,0):__dlm_lookup_lockres:182 
> ENTRY:M000000000000000c5bc95ddc72d357
> Aug 15 21:00:52 Sysname  kernel: 
> (6885,0):__dlm_lookup_lockres_full:148 
> ENTRY:M000000000000000c5bc95ddc72d357
> Aug 15 21:00:52 Sysname  kernel: (6885,0):__dlm_lookup_lockres:182 
> ENTRY:M000000000000000c5bc95ddc72d357
> Aug 15 21:00:52 Sysname  kernel: 
> (6885,0):__dlm_lookup_lockres_full:148 
> ENTRY:M000000000000000c5bc95ddc72d357
> Aug 15 21:00:52 Sysname  kernel: (6885,0):dlm_mle_release:535 ENTRY:
> Aug 15 21:00:52 Sysname  kernel: (6885,0):__dlm_lookup_lockres:182 
> ENTRY:M00000000000000049c73bf5e1d8e29
> Aug 15 21:00:52 Sysname  kernel: 
> (6885,0):__dlm_lookup_lockres_full:148 
> ENTRY:M00000000000000049c73bf5e1d8e29
> Aug 15 21:00:52 Sysname  kernel: (6885,0):__dlm_lookup_lockres:182 
> ENTRY:M00000000000000049c73bf5e1d8e29
> [UNEXPECTED REBOOT]
> Aug 15 21:05:09 Sysname  syslogd 1.4.1: restart.
> Aug 15 21:05:09 Sysname  syslog: syslogd startup succeeded
> Aug 15 21:05:09 Sysname  kernel: klogd 1.4.1, log source = /proc/kmsg 
> started.
> Aug 15 21:05:09 Sysname  kernel: Bootdata ok (command line is ro 
> root=/dev/VolGroup_ID_12182/LogVol1 rhgb quiet)
> Aug 15 21:05:09 Sysname  kernel: Linux version 2.6.9-42.0.2.ELsmp 
> (bhcompile@ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 
> 20060404 (Red Hat 3.4.6-3)) #1 SMP Thu Aug 17 17:57:31 EDT 2006
> Aug 15 21:05:09 Sysname  kernel: BIOS-provided physical RAM map:
> ######################################################################
> APPENDIX 2 : REBOOT on Saturday night (ocfs2 tracing NOT running)
> Aug 15 21:08:12 Sysname  kernel: o2net: connected to node 
> Othersystem2.x.y (num 1) at 172.16.172.172:7777
> Aug 15 21:08:13 Sysname  kernel: o2net: accepted connection from node 
> Othersystem1.x.y (num 3) at 172.16.172.171:7777
> Aug 15 21:08:16 Sysname  kernel: OCFS2 1.2.9 Mon May 19 13:00:33 PDT 
> 2008 (build a693806cb619dd7f225004092b675ede)
> Aug 15 21:08:16 Sysname  kernel: ocfs2_dlm: Nodes in domain 
> ("46C5D4A751514E55B04786DFEC7B2175"): 1 2 3
> Aug 15 21:08:17 Sysname  kernel: kjournald starting.  Commit interval 
> 5 seconds
> Aug 15 21:08:17 Sysname  kernel: ocfs2: Mounting device (120,1) on 
> (node 2, slot 2)
> Aug 15 21:08:21 Sysname  kernel: ocfs2_dlm: Nodes in domain 
> ("0D29B3C9792B46E1BD0DFF0A97E03534"): 1 2 3
> Aug 15 21:08:21 Sysname  kernel: kjournald starting.  Commit interval 
> 5 seconds
> Aug 15 21:08:21 Sysname  kernel: ocfs2: Mounting device (120,17) on 
> (node 2, slot 2)
> Aug 15 21:08:31 Sysname  ntpd[7076]: synchronized to 172.16.32.254, 
> stratum 2
> Aug 15 21:08:31 Sysname  ntpd[7076]: kernel time sync disabled 0041
> Aug 15 21:08:38 Sysname  su(pam_unix)[9656]: session opened for user 
> digicol by root(uid=0)
> Aug 15 21:08:41 Sysname  su(pam_unix)[9656]: session closed for user 
> digicol
> Aug 15 21:13:52 Sysname  ntpd[7076]: kernel time sync enabled 0001
> Aug 15 21:41:46 Sysname  kernel: SCSI error : <1 0 2 1> return code = 
> 0x20000
> Aug 15 21:41:46 Sysname  kernel: end_request: I/O error, dev sdc, 
> sector 1291272320
> Aug 15 21:41:46 Sysname  kernel: SCSI error : <1 0 2 1> return code = 
> 0x20000
> Aug 15 21:41:46 Sysname  kernel: end_request: I/O error, dev sdc, 
> sector 1487646848
> Aug 15 21:41:47 Sysname  kernel: SCSI error : <1 0 2 1> return code = 
> 0x20000
> Aug 15 21:41:47 Sysname  kernel: end_request: I/O error, dev sdc, 
> sector 1301852288
> Aug 15 21:41:48 Sysname  kernel: SCSI error : <1 0 2 1> return code = 
> 0x20000
> Aug 15 21:41:48 Sysname  kernel: end_request: I/O error, dev sdc, 
> sector 1498484864
> Aug 15 21:45:09 Sysname  kernel: SCSI error : <1 0 2 1> return code = 
> 0x20000
> Aug 15 21:45:09 Sysname  kernel: end_request: I/O error, dev sdc, 
> sector 1611251840
> Aug 15 21:45:09 Sysname  kernel: SCSI error : <1 0 2 1> return code = 
> 0x20000
> Aug 15 21:45:09 Sysname  kernel: end_request: I/O error, dev sdc, 
> sector 1045610624
> Aug 15 21:45:09 Sysname  kernel: SCSI error : <1 0 2 1> return code = 
> 0x20000
> Aug 15 21:45:09 Sysname  kernel: end_request: I/O error, dev sdc, 
> sector 1234243712
> Aug 15 21:45:09 Sysname  kernel: SCSI error : <1 0 2 1> return code = 
> 0x20000
> Aug 15 21:45:09 Sysname  kernel: end_request: I/O error, dev sdc, 
> sector 989614208
> Aug 15 21:45:09 Sysname  kernel: SCSI error : <1 0 2 1> return code = 
> 0x20000
> Aug 15 21:45:09 Sysname  kernel: end_request: I/O error, dev sdc, 
> sector 1115283584
> Aug 15 21:45:09 Sysname  kernel: SCSI error : <1 0 2 1> return code = 
> 0x20000
> Aug 15 21:45:09 Sysname  kernel: end_request: I/O error, dev sdc, 
> sector 1240952960
> Aug 15 21:45:14 Sysname  kernel: SCSI error : <1 0 2 1> return code = 
> 0x20000
> Aug 15 21:45:14 Sysname  kernel: end_request: I/O error, dev sdc, 
> sector 995807360
> Aug 15 21:45:14 Sysname  kernel: SCSI error : <1 0 2 1> return code = 
> 0x20000
> Aug 15 21:45:14 Sysname  kernel: end_request: I/O error, dev sdc, 
> sector 1104961664
> Aug 15 21:45:14 Sysname  kernel: SCSI error : <1 0 2 1> return code = 
> 0x20000
> Aug 15 21:45:14 Sysname  kernel: end_request: I/O error, dev sdc, 
> sector 1008507952
> Aug 16 03:00:26 Sysname  Server Administrator: Storage Service 
> EventID: 2242  The Patrol Read has started.:  Controller 0 (PERC 5/i 
> Integrated)
> Aug 16 03:00:27 Sysname  snmpd[7589]: Got trap from peer on fd 13
> Aug 16 03:52:02 Sysname  Server Administrator: Storage Service 
> EventID: 2243  The Patrol Read has stopped.:  Controller 0 (PERC 5/i 
> Integrated)
> Aug 16 03:52:02 Sysname  snmpd[7589]: Got trap from peer on fd 13
> Aug 16 16:38:33 Sysname  sshd(pam_unix)[31901]: session opened for 
> user root by root(uid=0)
> Aug 16 16:55:55 Sysname  sshd(pam_unix)[32254]: session opened for 
> user root by root(uid=0)
> Aug 16 17:27:06 Sysname  sshd(pam_unix)[966]: session opened for user 
> root by root(uid=0)
> [UNEXPECTED REBOOT]
> Aug 16 23:18:31 Sysname  syslogd 1.4.1: restart.
> Aug 16 23:18:31 Sysname  syslog: syslogd startup succeeded
> Aug 16 23:18:31 Sysname  kernel: klogd 1.4.1, log source = /proc/kmsg 
> started.
> Aug 16 23:18:31 Sysname  kernel: Bootdata ok (command line is ro 
> root=/dev/VolGroup_ID_12182/LogVol1 rhgb quiet)
> Aug 16 23:18:31 Sysname  kernel: Linux version 2.6.9-42.0.2.ELsmp 
> (bhcompile@ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 
> 20060404 (Red Hat 3.4.6-3)) #1 SMP Thu Aug 17 17:57:31 EDT 2006
> Aug 16 23:18:31 Sysname  kernel: BIOS-provided physical RAM map:
> #####################################################################
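
Both appendices end at a "syslogd 1.4.1: restart." line, and Appendix 2 
also shows repeated "I/O error, dev sdc" entries some hours earlier. 
When going through a longer /var/log/messages, a small script can pull 
out the same two things: the last line logged before each syslogd 
restart, and a count of I/O errors per device. The following is only a 
sketch under stated assumptions. The function name and the two patterns 
are made up for this example and are based solely on the log format 
quoted above. It reports what is in the file and says nothing about the 
cause.

```python
import re
from collections import Counter

# Patterns taken from the log excerpts quoted in this thread.
RESTART = re.compile(r"syslogd \S+: restart\.")
IO_ERROR = re.compile(r"I/O error, dev (\w+)")


def summarize(lines):
    """Return (last line before each syslogd restart, I/O errors per device)."""
    last_before_restart = []
    io_errors = Counter()
    previous = None
    for raw in lines:
        line = raw.rstrip("\n")
        if RESTART.search(line) and previous is not None:
            last_before_restart.append(previous)
        match = IO_ERROR.search(line)
        if match:
            io_errors[match.group(1)] += 1
        if line.strip():
            previous = line  # remember the most recent non-empty line
    return last_before_restart, io_errors


# Typical use (path as mentioned in the original message):
#     with open("/var/log/messages") as fh:
#         last_lines, errors = summarize(fh)
```

The script expects the unwrapped lines as they appear in the log file 
itself, not the mail-wrapped copies shown in the appendices.
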
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users



