[Ocfs2-users] hung process -- sles10 sp2
Sunil Mushran
sunil.mushran at oracle.com
Wed Jan 13 13:23:10 PST 2010
8542 Dsl vt ocfs2_wait_for_mask
Yes, most likely dlm lock related. Get a newer ocfs2-tools from Novell.
By newer I mean at least 1.4.1.
1. scanlocks2 will tell you the mount and the lock it is waiting on.
2. Dump the state of the corresponding dlm lock. Fill in lockname and device.
debugfs.ocfs2 -R "dlm_locks lockname" /dev/sdX
3. Then see who the master (owner) is.
4. Dump the dlm lock on the master. That will tell you who currently has
the lock.
5. Dump the dlm lock and the fs_lock on that node. Run the same ps there.
Easy as pie. ;)
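Roughly, steps 2-5 look like this. The lockname and device below are
placeholders -- substitute whatever scanlocks2 actually reports -- and this
only runs on a live ocfs2 cluster with debugfs.ocfs2 from ocfs2-tools 1.4.1+:

```shell
# Placeholder values -- replace with the lock and device scanlocks2 reports.
LOCKNAME=M000000000000000000000abcdef0123   # lock the process is waiting on
DEV=/dev/sdX                                # device backing that mount

# Step 2: dump the dlm lock state for that lock on the stuck node.
debugfs.ocfs2 -R "dlm_locks $LOCKNAME" "$DEV"

# Step 3: the Owner field in that output identifies the master node.

# Steps 4-5: on the master, dump the same dlm lock and the fs lock,
# then repeat the ps to see which process currently holds it.
debugfs.ocfs2 -R "dlm_locks $LOCKNAME" "$DEV"
debugfs.ocfs2 -R "fs_locks $LOCKNAME" "$DEV"
ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
```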
Charlie Sharkey wrote:
> Here's the result of the command.
> I'll check for a newer version of tools.
>
>
> PID STAT COMMAND WIDE-WCHAN-COLUMN
> 1 S init -
> 2 S migration/0 migration_thread
> 3 SN ksoftirqd/0 ksoftirqd
> 4 S migration/1 migration_thread
> 5 SN ksoftirqd/1 ksoftirqd
> 6 S migration/2 migration_thread
> 7 SN ksoftirqd/2 ksoftirqd
> 8 S migration/3 migration_thread
> 9 SN ksoftirqd/3 ksoftirqd
> 10 S migration/4 migration_thread
> 11 SN ksoftirqd/4 ksoftirqd
> 12 S migration/5 migration_thread
> 13 SN ksoftirqd/5 ksoftirqd
> 14 S migration/6 migration_thread
> 15 SN ksoftirqd/6 ksoftirqd
> 16 S migration/7 migration_thread
> 17 SN ksoftirqd/7 ksoftirqd
> 18 S< events/0 worker_thread
> 19 S< events/1 worker_thread
> 20 S< events/2 worker_thread
> 21 S< events/3 worker_thread
> 22 S< events/4 worker_thread
> 23 S< events/5 worker_thread
> 24 S< events/6 worker_thread
> 25 S< events/7 worker_thread
> 26 S< khelper worker_thread
> 27 S< kthread worker_thread
> 37 S< kblockd/0 worker_thread
> 38 S< kblockd/1 worker_thread
> 39 S< kblockd/2 worker_thread
> 40 S< kblockd/3 worker_thread
> 41 S< kblockd/4 worker_thread
> 42 S< kblockd/5 worker_thread
> 43 S< kblockd/6 worker_thread
> 44 S< kblockd/7 worker_thread
> 45 S< kacpid worker_thread
> 46 S< kacpi_notify worker_thread
> 327 S pdflush pdflush
> 328 S pdflush pdflush
> 329 S kswapd0 kswapd
> 330 S< aio/0 worker_thread
> 331 S< aio/1 worker_thread
> 332 S< aio/2 worker_thread
> 333 S< aio/3 worker_thread
> 334 S< aio/4 worker_thread
> 335 S< aio/5 worker_thread
> 336 S< aio/6 worker_thread
> 337 S< aio/7 worker_thread
> 582 S< cqueue/0 worker_thread
> 583 S< cqueue/1 worker_thread
> 584 S< cqueue/2 worker_thread
> 585 S< cqueue/3 worker_thread
> 586 S< cqueue/4 worker_thread
> 587 S< cqueue/5 worker_thread
> 588 S< cqueue/6 worker_thread
> 589 S< cqueue/7 worker_thread
> 590 S< kseriod serio_thread
> 623 S< kpsmoused worker_thread
> 1056 S< ata/0 worker_thread
> 1057 S< ata/1 worker_thread
> 1058 S< ata/2 worker_thread
> 1059 S< ata/3 worker_thread
> 1060 S< ata/4 worker_thread
> 1061 S< ata/5 worker_thread
> 1062 S< ata/6 worker_thread
> 1063 S< ata/7 worker_thread
> 1064 S< ata_aux worker_thread
> 1093 S< scsi_eh_0 scsi_error_handler
> 1218 S< scsi_eh_1 scsi_error_handler
> 1232 S< qla2xxx_1_dpc 144669341936254977
> 2061 S< scsi_eh_2 scsi_error_handler
> 2111 S< qla2xxx_2_dpc 18446604440027791361
> 2190 S kjournald kjournald
> 2251 S<s udevd -
> 3469 S< khubd hub_thread
> 4474 S< scsi_eh_3 scsi_error_handler
> 4475 S< usb-storage -
> 4620 S< kmpathd/0 worker_thread
> 4621 S< kmpathd/1 worker_thread
> 4622 S< kmpathd/2 worker_thread
> 4623 S< kmpathd/3 worker_thread
> 4624 S< kmpathd/4 worker_thread
> 4625 S< kmpathd/5 worker_thread
> 4626 S< kmpathd/6 worker_thread
> 4627 S< kmpathd/7 worker_thread
> 5768 S kjournald kjournald
> 5770 S kjournald kjournald
> 5823 S< kauditd kauditd_thread
> 6117 Ss resmgrd -
> 6249 Ss acpid -
> 6326 Ss dbus-daemon -
> 6494 Ss hald -
> 6695 S hald-addon-acpi -
> 7050 S< bond worker_thread
> 7244 S hald-addon-stor -
> 7495 Ss syslog-ng -
> 7499 Ss klogd syslog
> 7524 SLl multipathd stext
> 7529 Ss portmap -
> 7547 Ss slpd -
> 7626 Ss irqbalance 1
> 7658 SN kipmi0 -
> 7725 S snmpd -
> 7950 S< CID_control OS_cidWait
> 7951 D< CID_timer -
> 7952 S< CID_sched_0 OS_cidWait
> 7953 S< CID_sched_1 OS_cidWait
> 7975 S btitool OS_cidWait
> 7989 Ss startpar -
> 8094 Sl qlremote stext
> 8146 Ss sshd -
> 8191 S< user_dlm worker_thread
> 8206 Ss ntpd -
> 8217 S< o2net worker_thread
> 8250 Ss cron -
> 8261 S< o2hb-D5304888F9 -
> 8272 Ss httpd2-prefork -
> 8273 S httpd2-prefork -
> 8274 S httpd2-prefork -
> 8275 S httpd2-prefork -
> 8276 S httpd2-prefork -
> 8277 S httpd2-prefork -
> 8324 S< ocfs2_wq worker_thread
> 8325 S< ocfs2dc ocfs2_downconvert_thread
> 8326 S< dlm_thread -
> 8327 S< dlm_reco_thread -
> 8328 S< dlm_wq worker_thread
> 8329 S kjournald kjournald
> 8330 S< ocfs2cmt ocfs2_commit_thread
> 8336 S< o2hb-B98C95FB4B -
> 8353 S< ocfs2dc ocfs2_downconvert_thread
> 8354 S< dlm_thread -
> 8355 S< dlm_reco_thread -
> 8356 S< dlm_wq worker_thread
> 8357 S kjournald kjournald
> 8358 S< ocfs2cmt ocfs2_commit_thread
> 8364 S< o2hb-B3EE601AEB -
> 8381 S< ocfs2dc ocfs2_downconvert_thread
> 8382 S< dlm_thread -
> 8383 S< dlm_reco_thread -
> 8384 S< dlm_wq worker_thread
> 8385 S kjournald kjournald
> 8386 S< ocfs2cmt ocfs2_commit_thread
> 8392 S< o2hb-2043DFCC18 -
> 8409 S< ocfs2dc ocfs2_downconvert_thread
> 8410 S< dlm_thread -
> 8411 S< dlm_reco_thread -
> 8412 S< dlm_wq worker_thread
> 8413 S kjournald kjournald
> 8414 S< ocfs2cmt ocfs2_commit_thread
> 8420 S< o2hb-6B6685A881 -
> 8437 S< ocfs2dc ocfs2_downconvert_thread
> 8438 S< dlm_thread -
> 8439 S< dlm_reco_thread -
> 8440 S< dlm_wq worker_thread
> 8441 S kjournald kjournald
> 8442 S< ocfs2cmt ocfs2_commit_thread
> 8538 S logger pipe_wait
> 8540 Ss startpar -
> 8542 Dsl vt ocfs2_wait_for_mask
> 8555 Ss+ mingetty -
> 8556 Ss+ mingetty -
> 8557 Ss+ mingetty -
> 8558 Ss+ mingetty -
> 8559 Ss+ mingetty -
> 8560 Ss+ mingetty -
> 8615 S< dlm_thread -
> 8616 S< dlm_reco_thread -
> 8617 S< dlm_wq worker_thread
> 9369 R+ ps -
> 10405 Ss sshd -
> 10407 Ss+ vtcon -
> 26609 Ss sshd -
> 26611 Ss bash wait
> 26698 S+ gdb wait
> 26894 Ss sshd -
> 26896 Ss+ bash -
> 29881 Ss sshd -
> 29883 Ss bash wait
>
>
> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
> Sent: Wednesday, January 13, 2010 4:04 PM
> To: Charlie Sharkey
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] hung process -- sles10 sp2
>
> Charlie Sharkey wrote:
>
>> version info
>>
>> ---------------
>>
>> n1 kernel: OCFS2 Node Manager 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008
>>
>> n1 kernel: OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008
>>
>> n1 kernel: OCFS2 DLMFS 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008
>>
>> ocfs2-tools-1.4.0-0.5
>>
>> ocfs2console-1.4.0-0.5
>>
>> Linux n1 2.6.16.60-0.34-smp #1 SMP Fri Jan 16 14:59:01 UTC 2009 x86_64
>> x86_64 x86_64 GNU/Linux
>>
>> ============================================================================
>>
>> One of the nodes of a six node cluster got a hung process. The 'ps
>> -elf' command shows it as:
>>
>> 5 D vtape 8542 1 6 77 0 - 77376 ocfs2_ Jan12 ? 01:34:31
>> /opt/bti/mas/bin/vt -d -p /var/run/vt.pid
>>
>> The system isn't hung; I can ssh into the system and ls each ocfs2
>> directory. I have run the debugfs.ocfs2 command: debugfs.ocfs2 -R
>> "stats" and it shows no errors. I ran the 'scanlocks2' script and it
>> didn't show any hung locks. It did create some files (/tmp/_fsl_dm-22
>> to /tmp/_fsl_dm-26). The contents of those files are: "Debug string
>> proto 2 found, but 1 is the highest I understand."
>>
>>
>
> You have an old debugfs.ocfs2. See if sles has a newer ocfs2-tools.
> With it, rerun scanlocks2. That will tell us if dlm is involved or not.
>
> Meanwhile, what does this say?
> ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
>
>