[Ocfs2-users] hung process -- sles10 sp2

Sunil Mushran sunil.mushran at oracle.com
Wed Jan 13 13:23:10 PST 2010


 8542 Dsl  vt              ocfs2_wait_for_mask

Yes, most likely dlm lock related. Get a newer ocfs2-tools from Novell.
By newer I mean at least 1.4.1.

1. scanlocks2 will tell you the mount and the lock it is waiting on.
2. Dump the state of the corresponding dlm lock. Fill in lockname and device.
debugfs.ocfs2 -R "dlm_locks lockname" /dev/sdX
3. Then see who the master (owner) is.
4. Dump the dlm lock on the master. That will tell you who currently has
the lock.
5. Dump the dlm lock and the fs_lock on that node. Run the same ps there.
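When rerunning the ps on each node, it can help to reduce the output to
just the tasks in uninterruptible sleep. A minimal sketch, assuming the
same ps invocation quoted further down in this thread; the helper name
d_state is made up for illustration and is not part of ocfs2-tools:

```shell
# d_state: hypothetical helper, not an ocfs2-tools command.
# Reads `ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN` output on stdin
# and prints PID, command and wait channel for every task whose STAT
# starts with "D" (uninterruptible sleep). The header line is skipped.
d_state() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3, $4 }'
}

# Usage on a node:
#   ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | d_state
```

A task showing ocfs2_wait_for_mask in the last column is the one to chase
through the dlm lock dumps above.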

Easy as pie. ;)


Charlie Sharkey wrote:
> Here's the result of the command.
> I'll check for a newer version of tools. 
>
>
> PID STAT COMMAND         WIDE-WCHAN-COLUMN
>     1 S    init            -
>     2 S    migration/0     migration_thread
>     3 SN   ksoftirqd/0     ksoftirqd
>     4 S    migration/1     migration_thread
>     5 SN   ksoftirqd/1     ksoftirqd
>     6 S    migration/2     migration_thread
>     7 SN   ksoftirqd/2     ksoftirqd
>     8 S    migration/3     migration_thread
>     9 SN   ksoftirqd/3     ksoftirqd
>    10 S    migration/4     migration_thread
>    11 SN   ksoftirqd/4     ksoftirqd
>    12 S    migration/5     migration_thread
>    13 SN   ksoftirqd/5     ksoftirqd
>    14 S    migration/6     migration_thread
>    15 SN   ksoftirqd/6     ksoftirqd
>    16 S    migration/7     migration_thread
>    17 SN   ksoftirqd/7     ksoftirqd
>    18 S<   events/0        worker_thread
>    19 S<   events/1        worker_thread
>    20 S<   events/2        worker_thread
>    21 S<   events/3        worker_thread
>    22 S<   events/4        worker_thread
>    23 S<   events/5        worker_thread
>    24 S<   events/6        worker_thread
>    25 S<   events/7        worker_thread
>    26 S<   khelper         worker_thread
>    27 S<   kthread         worker_thread
>    37 S<   kblockd/0       worker_thread
>    38 S<   kblockd/1       worker_thread
>    39 S<   kblockd/2       worker_thread
>    40 S<   kblockd/3       worker_thread
>    41 S<   kblockd/4       worker_thread
>    42 S<   kblockd/5       worker_thread
>    43 S<   kblockd/6       worker_thread
>    44 S<   kblockd/7       worker_thread
>    45 S<   kacpid          worker_thread
>    46 S<   kacpi_notify    worker_thread
>   327 S    pdflush         pdflush
>   328 S    pdflush         pdflush
>   329 S    kswapd0         kswapd
>   330 S<   aio/0           worker_thread
>   331 S<   aio/1           worker_thread
>   332 S<   aio/2           worker_thread
>   333 S<   aio/3           worker_thread
>   334 S<   aio/4           worker_thread
>   335 S<   aio/5           worker_thread
>   336 S<   aio/6           worker_thread
>   337 S<   aio/7           worker_thread
>   582 S<   cqueue/0        worker_thread
>   583 S<   cqueue/1        worker_thread
>   584 S<   cqueue/2        worker_thread
>   585 S<   cqueue/3        worker_thread
>   586 S<   cqueue/4        worker_thread
>   587 S<   cqueue/5        worker_thread
>   588 S<   cqueue/6        worker_thread
>   589 S<   cqueue/7        worker_thread
>   590 S<   kseriod         serio_thread
>   623 S<   kpsmoused       worker_thread
>  1056 S<   ata/0           worker_thread
>  1057 S<   ata/1           worker_thread
>  1058 S<   ata/2           worker_thread
>  1059 S<   ata/3           worker_thread
>  1060 S<   ata/4           worker_thread
>  1061 S<   ata/5           worker_thread
>  1062 S<   ata/6           worker_thread
>  1063 S<   ata/7           worker_thread
>  1064 S<   ata_aux         worker_thread
>  1093 S<   scsi_eh_0       scsi_error_handler
>  1218 S<   scsi_eh_1       scsi_error_handler
>  1232 S<   qla2xxx_1_dpc   144669341936254977
>  2061 S<   scsi_eh_2       scsi_error_handler
>  2111 S<   qla2xxx_2_dpc   18446604440027791361
>  2190 S    kjournald       kjournald
>  2251 S<s  udevd           -
>  3469 S<   khubd           hub_thread
>  4474 S<   scsi_eh_3       scsi_error_handler
>  4475 S<   usb-storage     -
>  4620 S<   kmpathd/0       worker_thread
>  4621 S<   kmpathd/1       worker_thread
>  4622 S<   kmpathd/2       worker_thread
>  4623 S<   kmpathd/3       worker_thread
>  4624 S<   kmpathd/4       worker_thread
>  4625 S<   kmpathd/5       worker_thread
>  4626 S<   kmpathd/6       worker_thread
>  4627 S<   kmpathd/7       worker_thread
>  5768 S    kjournald       kjournald
>  5770 S    kjournald       kjournald
>  5823 S<   kauditd         kauditd_thread
>  6117 Ss   resmgrd         -
>  6249 Ss   acpid           -
>  6326 Ss   dbus-daemon     -
>  6494 Ss   hald            -
>  6695 S    hald-addon-acpi -
>  7050 S<   bond            worker_thread
>  7244 S    hald-addon-stor -
>  7495 Ss   syslog-ng       -
>  7499 Ss   klogd           syslog
>  7524 SLl  multipathd      stext
>  7529 Ss   portmap         -
>  7547 Ss   slpd            -
>  7626 Ss   irqbalance      1
>  7658 SN   kipmi0          -
>  7725 S    snmpd           -
>  7950 S<   CID_control     OS_cidWait
>  7951 D<   CID_timer       -
>  7952 S<   CID_sched_0     OS_cidWait
>  7953 S<   CID_sched_1     OS_cidWait
>  7975 S    btitool         OS_cidWait
>  7989 Ss   startpar        -
>  8094 Sl   qlremote        stext
>  8146 Ss   sshd            -
>  8191 S<   user_dlm        worker_thread
>  8206 Ss   ntpd            -
>  8217 S<   o2net           worker_thread
>  8250 Ss   cron            -
>  8261 S<   o2hb-D5304888F9 -
>  8272 Ss   httpd2-prefork  -
>  8273 S    httpd2-prefork  -
>  8274 S    httpd2-prefork  -
>  8275 S    httpd2-prefork  -
>  8276 S    httpd2-prefork  -
>  8277 S    httpd2-prefork  -
>  8324 S<   ocfs2_wq        worker_thread
>  8325 S<   ocfs2dc         ocfs2_downconvert_thread
>  8326 S<   dlm_thread      -
>  8327 S<   dlm_reco_thread -
>  8328 S<   dlm_wq          worker_thread
>  8329 S    kjournald       kjournald
>  8330 S<   ocfs2cmt        ocfs2_commit_thread
>  8336 S<   o2hb-B98C95FB4B -
>  8353 S<   ocfs2dc         ocfs2_downconvert_thread
>  8354 S<   dlm_thread      -
>  8355 S<   dlm_reco_thread -
>  8356 S<   dlm_wq          worker_thread
>  8357 S    kjournald       kjournald
>  8358 S<   ocfs2cmt        ocfs2_commit_thread
>  8364 S<   o2hb-B3EE601AEB -
>  8381 S<   ocfs2dc         ocfs2_downconvert_thread
>  8382 S<   dlm_thread      -
>  8383 S<   dlm_reco_thread -
>  8384 S<   dlm_wq          worker_thread
>  8385 S    kjournald       kjournald
>  8386 S<   ocfs2cmt        ocfs2_commit_thread
>  8392 S<   o2hb-2043DFCC18 -
>  8409 S<   ocfs2dc         ocfs2_downconvert_thread
>  8410 S<   dlm_thread      -
>  8411 S<   dlm_reco_thread -
>  8412 S<   dlm_wq          worker_thread
>  8413 S    kjournald       kjournald
>  8414 S<   ocfs2cmt        ocfs2_commit_thread
>  8420 S<   o2hb-6B6685A881 -
>  8437 S<   ocfs2dc         ocfs2_downconvert_thread
>  8438 S<   dlm_thread      -
>  8439 S<   dlm_reco_thread -
>  8440 S<   dlm_wq          worker_thread
>  8441 S    kjournald       kjournald
>  8442 S<   ocfs2cmt        ocfs2_commit_thread
>  8538 S    logger          pipe_wait
>  8540 Ss   startpar        -
>  8542 Dsl  vt              ocfs2_wait_for_mask
>  8555 Ss+  mingetty        -
>  8556 Ss+  mingetty        -
>  8557 Ss+  mingetty        -
>  8558 Ss+  mingetty        -
>  8559 Ss+  mingetty        -
>  8560 Ss+  mingetty        -
>  8615 S<   dlm_thread      -
>  8616 S<   dlm_reco_thread -
>  8617 S<   dlm_wq          worker_thread
>  9369 R+   ps              -
> 10405 Ss   sshd            -
> 10407 Ss+  vtcon           -
> 26609 Ss   sshd            -
> 26611 Ss   bash            wait
> 26698 S+   gdb             wait
> 26894 Ss   sshd            -
> 26896 Ss+  bash            -
> 29881 Ss   sshd            -
> 29883 Ss   bash            wait
>
>
> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
> Sent: Wednesday, January 13, 2010 4:04 PM
> To: Charlie Sharkey
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] hung process -- sles10 sp2
>
> Charlie Sharkey wrote:
>   
>> version info
>>
>> ---------------
>>
>> n1 kernel: OCFS2 Node Manager 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008
>>
>> n1 kernel: OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008
>>
>> n1 kernel: OCFS2 DLMFS 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008
>>
>> ocfs2-tools-1.4.0-0.5
>>
>> ocfs2console-1.4.0-0.5
>>
>> Linux n1 2.6.16.60-0.34-smp #1 SMP Fri Jan 16 14:59:01 UTC 2009 x86_64 
>> x86_64 x86_64 GNU/Linux
>>
>> ============================================================================
>>
>> One of the nodes of a six node cluster got a hung process. The 'ps 
>> -elf' command shows it as:
>>
>> 5 D vtape 8542 1 6 77 0 - 77376 ocfs2_ Jan12 ? 01:34:31 
>> /opt/bti/mas/bin/vt -d -p /var/run/vt.pid
>>
>> The system isn't hung, I can ssh into the system and ls each ocfs2 
>> directory. I have run the debugfs.ocfs2
>>
>> command: debugfs.ocfs2 -R "stats" and it shows no errors. I ran the 
>> 'scanlocks2' script and it didn't show
>>
>> any hung locks. It did create some files (/tmp/_fsl_dm-22 through 
>> /tmp/_fsl_dm-26). The contents of those files
>>
>> are: "Debug string proto 2 found, but 1 is the highest I understand."
>>
>>     
>
> You have an old debugfs.ocfs2. See if sles has a newer ocfs2-tools.
> With it, rerun scanlocks2. That will tell us if dlm is involved or not.
>
> Meanwhile, what does this say?
> ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
>
>   



