[DTrace-devel] [PATCH 2/2] examples: add more scripts, and a README file

Eugene Loh eugene.loh at oracle.com
Tue Aug 19 21:58:28 UTC 2025


It's all very wordy, but maybe that's appropriate for this sort of 
material.  I don't know.

Anyhow, the patch looks fine.  I have a lot of picky comments below.  
Incorporate whatever feedback you like.  I expect to R-b the next 
version:  none of the things I mention is major, but look them over and 
see if there are any comments you like.

On 8/14/25 18:22, eugene.loh at oracle.com wrote:

> From: Ruud van der Pas <ruud.vanderpas at oracle.com>
>
> Add six 6 more examples plus a README file with a description
> of each example.  The new scripts provide a variety in
> functionality and target various levels of experience.
>
> dtrace/ChangeLog
> 2025-08-14  Ruud van der Pas  <ruud.vanderpas at oracle.com>
>
> 	* examples/io-cast-net.d: Show how to use wildcards.
> 	* examples/io-stats.d: Display I/O characteristics.
> 	* examples/sched-simple.d: Count on/off the CPU.
> 	* examples/sched-stats.d: Extensive scheduler statistics.
> 	* examples/thread-ids.d: Mapping of thread IDs.
> 	* examples/var-scope.d: Demonstrate scoping rules.
> 	* examples/README.md: Brief description of all examples.
>
> Signed-off-by: Ruud van der Pas <ruud.vanderpas at oracle.com>
> ---
>   examples/README.md      |  71 +++++++++++
>   examples/io-cast-net.d  |  70 +++++++++++
>   examples/io-stats.d     | 268 ++++++++++++++++++++++++++++++++++++++++
>   examples/sched-simple.d |  76 ++++++++++++
>   examples/sched-stats.d  | 182 +++++++++++++++++++++++++++
>   examples/thread-ids.d   | 159 ++++++++++++++++++++++++
>   examples/var-scope.d    | 117 ++++++++++++++++++
>   7 files changed, 943 insertions(+)
>   create mode 100644 examples/README.md
>   create mode 100755 examples/io-cast-net.d
>   create mode 100755 examples/io-stats.d
>   create mode 100755 examples/sched-simple.d
>   create mode 100755 examples/sched-stats.d
>   create mode 100755 examples/thread-ids.d
>   create mode 100755 examples/var-scope.d
>
> diff --git a/examples/README.md b/examples/README.md
> new file mode 100644
> index 000000000..19c693e92
> --- /dev/null
> +++ b/examples/README.md
> @@ -0,0 +1,71 @@
> +## Synopsis
> +
> +This is the README file for the examples directory that is included in the
> +DTrace distribution directory.  These examples are meant to be learned
> +from.  They can also be a starting point to explore changes and enhancements
> +you'd like to make to these scripts.
> +
> +## Target audience
> +
> +The scripts target a wide range of users.  There are examples that are
> +ideally suited for the beginners, but there also scripts that are
> +probably easier to understand once you have some more practical experiences.
> +To make it easier to select a script suitable for your level of experience,
> +we have added a level indicator, but please do not let that withold you from
> +looking at other examples.
> +
> +| Level | Description |
> +| :---: | :----       |
> +| B     | Beginner - Only some basic knowledge of DTrace is assumed. |
> +| M	| Medium  - Some experience with DTrace will be helpful.     |
> +| A     | Advanced - Practical experiences writing scripts is recommended. |

I guess.  The BMA system is hard for me to get used to (I'm 
simple-minded like that).  A 0/1/2 system might be easier for me, but no 
big deal.

> +
> +## Installation
> +
> +There is no need for any additional installation instructions.  Just copy
> +the script to your working directory and either use it as provided, or
> +make changes as you see fit.

I'm left reading between the lines and wondering if I *have* to copy, or 
if I can use in-place.

> +## Usage
> +
> +Each script has a usage line that you can find under SYNPOSIS in the

SYNPOSIS =>
SYNOPSIS

> +top part of the script.  There are also extensive additional comments.
> +
> +In the comments, we often refer to a target application, or command,
> +but most scripts can be used for both, or also on a running process.
> +For the latter you need to change the option on the dtrace command
> +though.  Please check the man page for the `dtrace` command for the
> +details.

What's the difference between application and command here?  Why should 
a reader care?  How do I know which scripts cannot be used for both?  
(Incidentally, s/both/either/?)  And how do I know whether I can use a 
script on a running process?  Instead of referring vaguely to "the man 
page", can you just say here briefly what's involved (change -c to -p?  
hit Ctrl-C to terminate?)? Otherwise, people might not know what to look 
for.

> +## The examples
> +
> +These are the scripts included, plus a brief description.
> +
> +| Script          | Description                | Level |
> +| --------------- | -------------------------- | :---: |
> +| fcalls.d        | List the functions executed by the target application.  |B|
> +|                 | In addition to this, the number of calls to each function |
> +|                 | is printed.  This information is given on a per-thread    |
> +|                 | basis, as well as aggregated over all threads.            |
> +| io-cast-net.d   | For the target application, show a list of all the libc |B|
> +|                 | functions that include one of the selected keywords       |
> +|                 | (open,close,read,write) in their name.
> +| io-stats.d      | Show statistics for the fopen(), fread(), fwrite(),     |A|
> +|                 | pwrite() and read() functions.  There are several probes  |
> +|                 | and you are encouraged to experiment with this script.    |
> +| sched-simple.d  | Use the sched provider to count how often the target    |B|
> +|                 | application, or command, was scheduled on and off the     |
> +|                 | CPU.  The information is given for all individual threads,|
> +|                 | as well as across all the threads.                        |
> +| sched-stats.d   | Display various scheduler statistics for the target     |A|
> +|                 | application, or command.  For each thread, a histogram of |
> +|                 | the time on the CPU is shown.  Plus global statistics,    |
> +|                 | like the total time on the CPU, and the minimum and       |
> +|                 | maximum.  Those threads with an on-CPU time below a user  |
> +|                 | controllable limit are listed.  Also the CPU IDs and the  |
> +|                 | number of times a thread was on/off CPU are given.        |
> +| thread-ids.d    | For a Pthreads application, show the mapping between the|M|
> +|                 | thread IDs as created by the Pthreads system and the      |
> +|                 | thread ID as returned in the DTrace tid variable.         |

Okay, though you really use the return value of clone*, so we just have 
to accept that the D tid is this value (and not, say, the pthr tid).

> +| var-scope.d     | Demonstrate some of the scoping rules for global and    |B|
> +|                 | clause-local variables.                                   |
> diff --git a/examples/io-cast-net.d b/examples/io-cast-net.d
> new file mode 100755
> index 000000000..a8b336b3f
> --- /dev/null
> +++ b/examples/io-cast-net.d
> @@ -0,0 +1,70 @@
> +#!/usr/sbin/dtrace -s
> +
> +/*
> + *  NAME
> + *    io-cast-net.d - get an overview of read and write calls
> + *
> + *  SYNOPSIS
> + *    sudo ./io-cast-net.d -c "<name-of-application> [app options]"
> + *
> + *  DESCRIPTION
> + *    This script lists all the functions in libc that include
> + *    one of the following keywords in their name:
> + *      read, write, open, close
> + *    This is a fairly wide net that is cast to find out which
> + *    libc functions are used to open, or close a file, and which
> + *    functions from libc are used to read from, or write to a
> + *    file.
> + *    The number of calls to each of these functions is counted,
> + *    plus the total of all such calls.
> + *
> + *  NOTES
> + *    - This is an example how you can find out more about a certain
> + *    area of interest.  Wildcards and empty fields in the probe
> + *    definition are ideally suited for this, but be aware not to
> + *    overask and create a huge number of probes.
> + *
> + *    - This script could be the first step to analyze the I/O
> + *    behaviour of an application, or command.  There is another
> + *    much more elaborate example to demonstrate this.

Be more specific:  which example?  How else is the curious reader going 
to find it?

> + *    - Wildcards are used and as a result, more functions may be
> + *    listed than you might be interested in, but in this way
> + *    you will see all of them and can make a selection which
> + *    one(s) to focus on.
> + *
> + *    - There are no print statements in this example, because we
> + *    would like to show the default output from DTrace.  In
> + *    particular, the feature that, when the tracing has finished,
> + *    all aggregations are printed by default.  Since this example
> + *    only uses aggregations, it means that all results are printed
> + *    upon completion.
> + */
> +
> +/*
> + *  These are the probes we use to query the system and see those
> + *  function calls.  The first aggregation has a key that consists
> + *  of 2 fields: the name of the function and the module name.
> + *  Thanks to this key, the results are differentiated by the
> + *  function name and the name of the module.

To me, it's cleaner to keep the same order.  If probe descriptions, 
which you are using, have module before name, then keep that order for 
your aggregation keys and comments.

> + *  The latter is always going to be libc.so, e.g. libc.so.6.  We
> + *  have included it in the key for the aggregation to show how
> + *  to make a script more flexible.  In case the module name is
> + *  left out in the probe definition, or changed, the script will
> + *  still work and print the module name(s) of the probe(s) that
> + *  fired.
> + *
> + *  The second aggregation has no key.  That means that the count
> + *  is incremented each time one of the probes fires.  That will
> + *  give us the total count across all the probes that fired.
> + */
> +
> +pid$target:libc.so:*read*:entry,
> +pid$target:libc.so:*write*:entry,
> +pid$target:libc.so:*open*:entry,
> +pid$target:libc.so:*close*:entry
> +{
> +  @target_calls[probefunc,probemod] = count();
> +  @total_count                      = count();
> +}
> diff --git a/examples/io-stats.d b/examples/io-stats.d
> new file mode 100755
> index 000000000..a9f25e405
> --- /dev/null
> +++ b/examples/io-stats.d
> @@ -0,0 +1,268 @@
> +#!/usr/sbin/dtrace -s
> +
> +/*
> + *  NAME
> + *    io-stats.d - show several I/O related statistics
> + *
> + *  SYNOPSIS
> + *    sudo ./io-stats.d -c "<name-of-application> [app options]"
> + *
> + *  DESCRIPTION
> + *    This script shows several I/O statistics for the target
> + *    application, or command.  Among others, the filename, the
> + *    file descriptor, or stream it is connected to, the number
> + *    of bytes read, written, or both, are printed.
> + *
> + *  NOTES
> + *    - All the probes in this script are based upon the pid
> + *    provider.

fwiw, the END probe is not from the pid provider.

> + *    - This script is quite elaborate and has several probes.
> + *    More specifically, in alphabetical order, these are the
> + *    functions from the libc library that are traced:
> + *     close(),
> + *     fclose(),
> + *     fopen(),
> + *     fread(),
> + *     fwrite()
> + *     open(),
> + *     pwrite(),
> + *     read(),
> + *    In the probes, we rely on the information from the man pages
> + *    for these functions to identify arguments of interest and
> + *    the return values.
> + *
> + *    - Not all of the probes may be relevant to your case.  While
> + *    no harm is done if a probe does not fire, the associated
> + *    aggregation(s) will be empty, but still printed.  You can of

Hmm, okay.  Normally, empty aggregations are not printed.  In this case, 
however, you are explicitly using printa().  So, that's different.

> + *    course always remove such probes and related print statements.
> + *
> + *    - It may also be that some functions that you're interested
> + *    in are missing.  If so, probably an existing probe can be
> + *    used as a basis to add one or more new probes.

The last sentence confused me:  it felt like it was telling me that when 
an existing probe fires some action can be taken to get to missing 
probes.  Now I'm thinking what you meant is that the existing script can 
be modified to add more probe descriptions.

> + *    - A related script is io-cast-net.d.  You may want to run
> + *    this first to cast a fairly wide net and explore which open,
> + *    close, read and write functions are called when executing the
> + *    target.
> + *
> + *    - All print statements are in the END probe.  On purpose
> + *    several different format strings are used.  This is done to
> + *    demonstrate the flexibility in presenting the results.

Well, and END/printa() is not needed at all, since aggregation data will 
be printed by default.

> + */
> +
> +/*
> + *  Suppress the default output from the dtrace command and
> + *  have printa() print the data sorted by the first field.

s/data/aggregation data/

> + */
> +#pragma D option quiet
> +#pragma D option aggsortkey=1
> +#pragma D option aggsortkeypos=0
> +
> +/*
> + *  Capture the name of the file that needs to be opened in the call
> + *  to any function that starts with fopen.  The first argument to
> + *  this function contains the name of the file to be opened.
> + *  As arg0 is of type integer, it is converted to a string and

Yeah, it's also copying from user space.

> + *  stored in the thread-local variable fname_fopen.
> + */
> +pid$target:libc.so:fopen*:entry
> +{
> +  self->fname_fopen = copyinstr(arg0);
> +}
> +/*
> + *  This probe is nearly identical to the one above, but note that
> + *  for this probe, the name of the thread-local variable is slightly
> + *  different and called fname_open.
> + */
> +pid$target:libc.so:open:entry
> +{
> +  self->fname_open = copyinstr(arg0);
> +}
> +
> +/*
> + *  This is the return probe for any of the fopen calls.  The
> + *  predicate ensures that fname_fopen has been set and if so,
> + *  the probe updates the count and stores this in an aggregation
> + *  with 3 fields: the name of the function, the name of the
> + *  file and the return value of the function, which is stored
> + *  in arg1.
> + *  Note that we can indeed access this file name.  Although
> + *  set in the corresponding entry probe, a thread-local
> + *  variable is accessible in the data space of the thread
> + *  and so can be read, or written, here.
> + *  The third field in the aggregation is arg1.  This is
> + *  the return value of the function and is the pointer to the
> + *  stream.
> + *  Since we no longer need variable fname_fopen it is released.
> + */
> +pid$target:libc.so:fopen*:return
> +/ self->fname_fopen != 0/
> +{
> +  @file_pointer[probefunc,self->fname_fopen,arg1] = count();
> +  self->fname_fopen = 0;
> +}
> +
> +/*
> + *  This probe is nearly identical to the previous probe.
> + *  Other than the name of the thread-local variable, the
> + *  difference is in the name of the aggregation.  This is
> + *  because the return value, which is the file descriptor, is of
> + *  type integer.  This means that we need a different format
> + *  string *  when printing this field.
> + */
> +pid$target:libc.so:open:return
> +/ self->fname_open != 0/
> +{
> +  @file_descriptor[probefunc,self->fname_open,arg1] = count();
> +  self->fname_open = 0;
> +}
> +
> +/*
> + *  This probe stores the name of the function, which is fclose,
> + *  and the argument, the pointer to the stream.
> + */
> +pid$target:libc.so:fclose:entry
> +{
> +  @fclose_stream[probefunc,arg0] = count();
> +}
> +
> +/*
> + *  This probe is very similar to the one above.  Again, the
> + *  difference is in the name of the aggregation, and the argument
> + *  which is the file descriptor and of type integer.
> + */
> +pid$target:libc.so:close:entry
> +{
> +  @close_fd[probefunc,arg0] = count();
> +}
> +
> +/*
> + *  One of the functions traced, is pwrite().  The first
> + *  argument, arg0, is the file descriptor.  The third
> + *  argument is arg2.  This contains the number of bytes
> + *  written.

You don't know the number of bytes written until return.  Here, arg2 is 
simply the max requested.

> + *  The aggregation function sum() is used to sum up all
> + *  those byte counts, and as a result, the aggregation contains
> + *  the total number of bytes written to the file connected
> + *  to the file descriptor.
> + */
> +pid$target:libc.so:pwrite:entry
> +{
> +  @pwrite_bytes[probefunc,arg0] = sum(arg2);
> +}
> +
> +/*
> + *  The pwrite() function returns the number of bytes written.
> + *  This value is added to the existing contents of the
> + *  aggregation.  The name of the function, pwrite, is the only
> + *  field in the key, so this aggregation contains the total
> + *  number of bytes written by pwrite().
> + */

Right.  So entry has the requested number and return has the actual number.

> +pid$target:libc.so:pwrite:return
> +{
> +  @pwrite_total_bytes[probefunc] = sum(arg1);
> +}
> +
> +/*
> + *  This probe accumulates the number of bytes the read() function
> + *  requests to be read.  The aggregation is differentiated by the
> + *  file descriptor.  This means that we get the totals on a per
> + *  file descriptor basis.
> + *  We also store the file descriptor in a thread-local variable
> + *  called fd, because we would like to use it in the return probe
> + *  for this function.
> + */
> +pid$target:libc.so:read:entry
> +{
> +  self->fd = arg0;
> +  @bytes_requested[probefunc,self->fd] = sum(arg2);
> +}
> +
> +/*
> + *  This is where things come together.  The read() function returns
> + *  the actual number of bytes read.  These values are accumulated in
> + *  an aggregation that includes the file descriptor in the key.  In
> + *  this way we can print both the number of bytes requested and the
> + *  actual number of bytes read.
> + *  We also accumulate the total number of bytes actually read across
> + *  all the file descriptors.
> + *  Since thread-local variable fd is no longer needed, the storage is
> + *  released.
> + */
> +pid$target:libc.so:read:return
> +{
> +  @bytes_actually_read[probefunc,self->fd] = sum(arg1);
> +  @total_bytes_read                        = sum(arg1);
> +  self->fd = 0;
> +}
> +
> +/*
> + *  This is a shared clause for the probes for the fread() and
> + *  fwrite() functions.  We can share the clause, because the
> + *  arguments and return value are the same for both functions.
> + *  The clause-local variable comment is used to store a dynamically
> + *  generated string.  This string is then concatenated with the
> + *  name of the function, resulting in a string that depends on the
> + *  function name in the probe.
> + *  The aggregation has a key that consists of the pointer to the
> + *  stream and the generated string.  This aggregation accumulates
> + *  the product of the byte size and the number of elements, which
> + *  is the total number of bytes read, or written.
> + */
> +pid$target:libc.so:fread:entry,
> +pid$target:libc.so:fwrite:entry
> +{
> +  this->comment = (probefunc == "fread") ?
> +                   "read by " : "written by ";
> +  this->rw = strjoin(this->comment,probefunc);
> +  @total_byte_count_frw[arg3,this->rw] = sum(arg1*arg2);
> +}
> +
> +/*
> + *  This probe accumulates the return value of the fread() function,
> + *  which is the number of elements read.  This number is accumulated
> + *  across all pointers to streams.
> + */
> +pid$target:libc.so:fread:return
> +{
> +  @fread_total_elements[probefunc] = sum(arg1);
> +}
> +
> +/*
> + *  The output section where all the results are printed.  Note that
> + *  in one case, we use 2 aggregations in the call to printa().
> + *  This is supported if the key is the same.
> + */
> +END
> +{
> +  printf("%8s %20s   %-15s %6s\n",
> +               "Function","Filename","File pointer","Count");
> +  printa("%8s %20s   0x%-13p %@6d\n", at file_pointer);
> +
> +  printf("\n%8s %20s   %-15s %6s\n",
> +               "Function","Filename","File descriptor","Count");
> +  printa("%8s %20s   %-15d %@6d\n", at file_descriptor);
> +
> +  printf("\n%8s   %-16s %7s\n","Function","Stream/FD","Count");
> +  printa("%8s   0x%-14p %@7d\n", at fclose_stream);
> +  printa("%8s   %-16d %@7d\n", at close_fd);
> +
> +  printf("\n");
> +  printf("%8s %5s %14s\n","Function","FD","Bytes written");
> +  printa("%8s %5d %@14d\n", at pwrite_bytes);
> +  printa("Total bytes written by %s: %@ld\n", at pwrite_total_bytes);
> +
> +  printa("\nOn stream %p - Total bytes %s = %@ld\n",
> +                                  @total_byte_count_frw);
> +
> +  printa("Total elements read by %s = %@ld\n",
> +                                  @fread_total_elements);
> +
> +  printf("\n%33s\n","Bytes read");
> +  printf("%8s %6s %10s %10s\n","Function","FD","Requested","Actual");
> +  printa("%8s %6d %@10d %@10d\n", at bytes_requested,
> +                                 @bytes_actually_read);
> +  printa("\nTotal number of bytes read: %@d\n", at total_bytes_read);
> +}
> diff --git a/examples/sched-simple.d b/examples/sched-simple.d
> new file mode 100755
> index 000000000..6f93a5d50
> --- /dev/null
> +++ b/examples/sched-simple.d
> @@ -0,0 +1,76 @@
> +#!/usr/sbin/dtrace -s
> +/*
> + *  NAME
> + *    sched-simple.d - count how often the target was on/off the CPU
> + *
> + *  SYNOPSIS
> + *    sudo ./sched-simple.d -c <target>
> + *
> + *  DESCRIPTION
> + *    This script uses the sched provider to count how often the
> + *    target proces was scheduled on and off the CPU.  The target
> + *    can be an application, or any command.
> + *    The information is given for all individual threads used by
> + *    the target, as well as across all the threads.
> + *
> + *  NOTES
> + *    There are 2 aggregations.  The one with 2 fields in the
> + *    key, thr_count_states, uses the thread ID as the first field.
> + *    By configuring printa() through pragmas to print the data
> + *    sorted by the first field, the results for this aggregation
> + *    are shown on a per thread basis.
> + */
> +
> +/*
> + *  Suppress the default output from the dtrace command and
> + *  have printa() print the data sorted by the first field.

s/data/aggregation data/.  Hopefully that will provide one more cue to 
link text with pragma names.

> + */
> +#pragma D option quiet
> +#pragma D option aggsortkey=1
> +#pragma D option aggsortkeypos=0
> +
> +/*
> + *  These are 2 probes from the sched provider that share a
> + *  clause.  Without a predicate, these probes will fire as
> + *  soon as any process goes on/off the CPU.  That will work
> + *  fine, but in this case we use a predicate to restrict the
> + *  probing to the target application, or command.
> + *
> + *  The clause-local variable state is thread safe.  It is used

The clause-local variable is thread safe?  Uh, I guess, but the 
implication is confusing.  If you want thread-safe, use self->.

The variable state?  What is the state of the variable?  Confusing. 
Maybe, to indicate that "state" is a name of a variable, it should be 
set off in quote marks?  Or rename it to be tmpstr or something, which 
looks more like a variable name and not like a generic English word.

> + *  to demonstrate how to construct a string to be used as a field
> + *  in an aggregation, or for any other purpose.

A clause-local variable is used to demonstrate how to construct a string 
for any purpose?

> + *  We know that the probe name is either on-cpu or off-cpu.  This
> + *  is used to assign a different string to variable state.
> + *
> + *  The 2 aggregations are used to count how often the probes fired.
> + *  The difference is in the key.  The first aggregation,
> + *  count_states, does not use the thread ID in the key.  It counts
> + *  across all threads.  In contrast with this, the second
> + *  aggregation, thr_count_states, differentiates the counts by the
> + *  thread ID.
> + */
> +sched:::on-cpu,
> +sched:::off-cpu
> +/ pid == $target /
> +{
> +  this->state = (probename == "on-cpu") ?
> +                  "scheduled on" : "taken off";
> +  @count_states[this->state]         = count();
> +  @thr_count_states[tid,this->state] = count();
> +}
> +
> +/*
> + *  Print the results.  We use printf() to print a header.
> + *  For both aggregations, a formatted printa() statement
> + *  is used.  The printf() in between the 2 printa()
> + *  statements is used to separate the 2 blocks with results.
> + */

Personally, I think it's great that DTrace prints aggregations by 
default, and I would remind people here that the END clause is not 
included because it's needed to print results but that it is used only 
to illustrate how to control the output format.

> +END
> +{
> +  printf("%8s %-20s %6s\n","TID","State","Count");
> +  printf("====================================\n");
> +  printa("%8d %-12s the CPU %@6d\n", at thr_count_states);
> +  printf("\n");
> +  printa("Total count %12s: %@d\n", at count_states);
> +}
> diff --git a/examples/sched-stats.d b/examples/sched-stats.d
> new file mode 100755
> index 000000000..50b3b263a
> --- /dev/null
> +++ b/examples/sched-stats.d
> @@ -0,0 +1,182 @@
> +#!/usr/sbin/dtrace -s
> +
> +/*
> + *  NAME
> + *    sched-stats.d - display various scheduler statistics
> + *
> + *  SYNOPSIS
> + *    sudo ./sched-stats.d -c "<name-of-app [app options]" <threshold>

"<name-of-app" is missing a closing ">"?

> + *  DESCRIPTION
> + *    This script shows several thread-scheduling statistics that
> + *    were collected while the target application, or command, was
> + *    running.
> + *
> + *    For good performance, a thread should run on the same CPU,
> + *    or at least in the same NUMA node for as long as possible.
> + *    In case it is taken off the CPU, it should be scheduled
> + *    back on the same CPU, or at least in the same NUMA node.
> + *
> + *    The information printed shows the scheduling decisions
> + *    for any (multithreaded) application:
> + *
> + *    - For each thread, a histogram of the on-cpu times is shown.
> + *
> + *    - This is followed by a table with the overall timing
> + *    statistics.  For each thread, the total on-cpu time is
> + *    given, plus the minimum and maximum on-cpu time.
> + *
> + *    - The script takes one parameter.  This is a threshold
> + *    value.  It is a lower limit on the on-cpu times.  For each

What are the units of the value?

> + *    thread it is reported how often the on-cpu time was at, or
> + *    below, this value.
> + *    This table might point at threads that are falling behind
> + *    on a (heavily loaded) system
> + *
> + *    - The last 2 tables give information on the thread affinity.
> + *    The first table of these 2 is sorted by thread ID and
> + *    shows the CPU(s) used by each thread, as well how often
> + *    the thread was scheduled on the CPU, and taken off again.
> + *    The second table of these 2 shows the same information,
> + *    but now from a CPU centric view.
> + *
> + *  NOTES
> + *    This script works with any target application or command.
> + */
> +
> +/*
> + *  Suppress the default output from the dtrace command and
> + *  have printa() print the data sorted by the first field.


s/data/aggregation data/.  Hopefully that will provide one more cue to 
link text with pragma names, printa() name, etc.

> + */
> +
> +#pragma D option quiet
> +#pragma D option aggsortkey=1
> +#pragma D option aggsortkeypos=0
> +
> +/*
> + *  Read the threshold value from the command line and store it in
> + *  a global variable.  By using a global variable, all probes can
> + *  read/write it.  In this case it is used as a read-only variable.
> + */

How does one specify that it is a read-only variable?  Oh, we cannot do 
that.  Oh, and all probes can access $1 anyhow.  How about just 
replacing the whole comment with something like, "Place the argument 
from the command line in a global variable for better readability." Or 
something like that.

> +BEGIN
> +{
> +  time_threshold = $1;
> +}
> +
> +/*
> + *  An on-cpu probe with a predicate to ensure that the probe only
> + *  fires when the target is scheduled on the CPU.
> + *
> + *  Set the base value of the timer.  Since each thread needs

Might as well refer to ts_start as the start time.  Besides, (at least 
to me), "base" time is a little unclear.

> + *  to have its own timer, we use a thread-local variable.
> + *
> + *  There are 2 aggregations.  Both increment their counter each
> + *  time that this probe fires.  The thread ID (in tid) and CPU ID
> + *  (in curcpu->cpu_id) are used to define the key.
> + *
> + *  The difference between the 2 aggregations is that the fields in
> + *  the key have been swapped.  Given that printa() has been
> + *  configured to sort the data by the first column, the first
> + *  aggregation shows a thread centric view, while the second one
> + *  has a CPU ID centric view.
> + */
> +sched:::on-cpu
> +/ pid == $target /
> +{
> +  self->ts_start = timestamp;
> +
> +  @thr_cpu_id_on_cpu[tid,curcpu->cpu_id] = count();
> +  @cpu_id_thr_on_cpu[curcpu->cpu_id,tid] = count();

Okay, but instead of curcpu->cpu_id you could just say "cpu".  In 
libdtrace/sched.d, we see "cpu = curcpu->cpu_id;".  Oddly, we used to 
document "cpu" as a built-in variable, but we no longer do. Dunno.

> +}
> +
> +/*
> + *  This is the first of 2 probes using off-cpu.  In this probe,
> + *  again a predicate is used to restrict the data to the target
> + *  command.
> + *
> + *  Since the process/thread is about to go off CPU, now is the time
> + *  to record how long the thread has been running on the CPU.
> + *  The value is converted to microseconds (us) and stored in a

"converted to microseconds" =>
"converted from nanoseconds to microseconds"

> + *  clause-local variable.
> + *
> + *  This variable is then used to update 4 aggregations: the
> + *  histogram of the on-CPU times, the total on CPU time, plus the
> + *  minimum and maximum time on the CPU.
> + *
> + *  The next 2 aggregations are very similar to those used in the
> + *  on-cpu probe.
> + *
> + *  Last, but not least, the storage for the thread-local variable
> + *  is released, because it is no longer needed.
> + */

You have a monolithic comment block with a concatenated series of 
comments, followed by a bunch of statements.  This means my eye has to 
move back and forth between comments and statements, which I have to 
search out.  I might even have to scroll.  How about breaking the 
comments up so that each comment is by the statement it describes. You 
could then drop some descriptive text like "The next 2 aggregations."

> +sched:::off-cpu
> +/ pid == $target /
> +{
> +  this->time_on_cpu = (timestamp - self->ts_start)/1000;
> +
> +  @timings[tid]    = quantize(this->time_on_cpu);
> +  @total_time[tid] = sum(this->time_on_cpu);
> +  @min_time[tid]   = min(this->time_on_cpu);
> +  @max_time[tid]   = max(this->time_on_cpu);
> +
> +  @thr_cpu_id_off_cpu[tid,curcpu->cpu_id] = count();
> +  @cpu_id_thr_off_cpu[curcpu->cpu_id,tid] = count();
> +
> +  self->ts_start = 0;
> +}
> +
> +/*
> + *  This is the second off-cpu probe.  It has a different and
> + *  somewhat more complicated predicate.
> + *  This probe fires for the target command, but only in case the
> + *  time on the CPU is below the user controllable threshold value.
> + *  We basically record and count the event, but of course any
> + *  other action can be taken here.
> + *  An aggregation is used to store the data.  The thread ID (tid)
> + *  is used as a key and we increment the counter.  This produces
> + *  a count for each thread.
> + */
> +sched:::off-cpu
> +/ pid == $target && this->time_on_cpu <= time_threshold /
> +{
> +  @thr_below[tid] = count();
> +}
> +
> +/*
> + *  This is where all results are printed.  It is all quite
> + *  straightforward, but note that in three cases, multiple
> + *  aggregations are printed with a single printa().  As long as the
> + *  keys are the same, this is supported and makes it possible to
> + *  print multiple columns with values.
> + */

As I say in a different example, it'd be good to remind users that this 
END probe is to pretty-print the format and illustrate printa() for 
pedagogical purposes;  but DTrace by default will print aggregations.

> +END
> +{
> +  printf("Thread times on the CPU (us)\n");
> +  printa(@timings);
> +
> +  printf("\n=====================================\n");
> +  printf("%10s %26s\n","TID", "Times on the CPU (us)");
> +  printf("%21s %6s %8s\n","Total","Min","Max");
> +  printf("=====================================\n");
> +  printa("%10d %@10d %@6d %@8d\n", at total_time, at min_time, at max_time);
> +
> +  printf("\n=====================================\n");
> +  printf("Thread ON CPU time threshold: %d us\n",time_threshold);
> +  printf("=====================================\n");
> +  printf("%8s %26s\n","TID","Count below threshold");
> +  printa("%8d %@14d\n", at thr_below);
> +
> +  printf("\n=====================================\n");
> +  printf("%10s  %8s %11s\n","TID","CPU ID","Count");
> +  printf("%37s\n","On CPU  Off CPU");
> +  printf("=====================================\n");
> +  printa("%10d  %8d %@7d %@8d\n", at thr_cpu_id_on_cpu,
> +                                 @thr_cpu_id_off_cpu);
> +
> +  printf("\n=====================================\n");
> +  printf("%8s  %10s %11s\n","CPU ID","TID","Count");
> +  printf("%37s\n","On CPU  Off CPU");
> +  printf("=====================================\n");
> +  printa("%8d  %10d %@7d %@8d\n", at cpu_id_thr_on_cpu,
> +                                 @cpu_id_thr_off_cpu);
> +}
> diff --git a/examples/thread-ids.d b/examples/thread-ids.d
> new file mode 100755
> index 000000000..38210c4d6
> --- /dev/null
> +++ b/examples/thread-ids.d
> @@ -0,0 +1,159 @@
> +#!/usr/sbin/dtrace -s
> +
> +/*
> + *  NAME
> + *    thread-ids.d - show the mapping between Pthread IDs and tid values
> + *
> + *  SYNOPSIS
> + *    sudo ./thread-ids.d -c "<name-of-application> [app options]"
> + *
> + *  DESCRIPTION
> + *    This script assumes that the target uses the Pthreads library
> + *    to create one or more threads.  It shows the mapping of the
> + *    Pthread thread IDs and the thread ID, as returned in the tid
> + *    built-in variable.
> + *
> + *  NOTES
> + *    - In addition to showing how to uncover this mapping, this
> + *    script also shows a technique how to retrieve a value from
> + *    a pointer argument in a function call.
> + *
> + *    In this case, this is the thread ID that pthread_create()
> + *    returns in its first argument.
> + *
> + *    This is from the man page for pthread_create():
> + *
> + *    int pthread_create(pthread_t *restrict thread,
> + *                        const pthread_attr_t *restrict attr,
> + *                        void *(*start_routine)(void *),
> + *                        void *restrict arg);
> + *
> + *    We need to capture the contents of *thread.
> + *
> + *    In this case, we cannot use the built-in tid variable
> + *    within pthread_create(), because typically, this function
> + *    is executed by one thread, the main thread.  This means
> + *    that the value in tid is the thread ID of this main thread,
> + *    but we need to have the value of the thread that is created
> + *    as a result of calling pthread_create().  As shown below,
> + *    this can be done by tracing clone3, which is called by
> + *    pthread_create().
> +
> + *    - It is assumed that a function called main is executed.
> + *    If this is not the case, this is not a critical error.
> + *    The first probe is used to capture the tid value for the main
> + *    program.  This thread is however not created by function
> + *    pthread_create() and therefore this part of the script is
> + *    not essential.
> + *    This is why this probe and printf statement in the END probe
> + *    can safely be removed, or replaced by a suitable alternative.
> + */
> +
> +/*
> + *  Suppress the default output from the dtrace command.
> + */
> +#pragma D option quiet

FWIW, you could add q to the -s in the shebang line instead.

> +
> +/*
> + *  Declare a global variable.  This is to ensure that the compiler

self->clone_tid is not a global variable.

> + *  sees it before it is referenced.
> + */
> +self int clone_tid;
> +
> +/*
> + *  Store the thread ID of the main thread.
> + */
> +pid$target:a.out:main:entry
> +{
> +  tid_main = tid;
> +}
> +
> +/*
> + *  Two thread-local variables are set here.  The first one is used
> + *  in a predicate on the second probe to ensure that we are only
> + *  tracing the clone* calls executed while within pthread_create().
> + *
> + *  We actually know that clone3 is called.  By using the wildcard
> + *  here, the script still works should this number change in the
> + *  future.
> + *
> + *  Variable pthr_id_p captures the first argument of the call to
> + *  pthread_create().  This is a pointer, but we can't dereference
> + *  it here, because the contents this pointer points to, is only

contents is =>
contents are

> + *  available upon return.
> + *  The way this is solved, is by storing the pointer in this probe.

The way this is solved, is by =>
This is solved by

> + *  In the return probe for this function, we can then dereference
> + *  the pointer.
> + */
> +pid$target:libc.so:pthread_create:entry
> +{
> +  self->in_pthr_create = 1;

Why in_pthr_create?  Why not test for pthr_id_p (!=NULL) instead? Get 
rid of an unnecessary variable.

> +  self->pthr_id_p = (int64_t *) arg0;
> +}
> +
> +/*
> + *  We know that one of the clone functions is called from within
> + *  pthread_create() and it returns the thread ID that DTrace stores
> + *  in the tid variable.
> + *  This is why the return value, which is stored in arg1, is copied
> + *  into a thread-local variable called clone_tid.  This variable is
> + *  then referenced in the return probe for pthread_create().
> + */
> +pid$target:libc.so:clone*:return
> +/ self->in_pthr_create == 1 /
> +{
> +  self->clone_tid = arg1;
> +}
> +
> +/*
> + *  This is where things come together.
> + *
> + *  We already have the value for the thread ID as used by DTrace.
> + *  It is stored in thread-local variable clone_tid.
> + *
> + *  Now we can capture the Pthreads thread ID.
> + *
> + *  We are about to return from pthread_create() and can dereference
> + *  the pointer.
> + *  Before we do so, the data needs to be copied from userland
> + *  into the kernel.  Since this is an address, 8 bytes are copied.
> + *  The value is what pthread_create() returns in its *thread
> + *  first argument, which is the Pthreads thread ID.
> + *  This gives us both thread IDs and they are used in the key
> + *  for aggregation @thread_mapping.
> + *
> + *  There is one more thing to this though.  It is not strictly
> + *  necessary to use an aggregation.  An associative array can
> + *  be used, but that is going to be a problem when it comes
> + *  to printing the results.  There is no printa() equivalent
> + *  for associative arrays, but since we do not know the key(s)
> + *  upfront, we don't know what to print.
> + *  Instead, we use an aggregation, but do not print the result.
> + *
> + *  Last, but not least, at this point we can and should return

s/return/free/

> + *  the storage for the thread-local variables.
> + */
> +pid$target:libc.so:pthread_create:return
> +{
> +  this->pthr_id = *(int64_t *) copyin(*self->pthr_id_p,8);
> +  @thread_mapping[this->pthr_id,self->clone_tid] = count();

This is interesting -- and I, for one, have learned something from this 
-- but it's pretty unnatural.  Count, when the number does not interest 
you?  Instead of using an aggregation, how about simply printing here 
and now the mapping between the two IDs?  Just printf(), no 
aggregation.  No need for an associative array either.

> +  self->pthr_id_p      = 0;
> +  self->clone_tid      = 0;
> +  self->in_pthr_create = 0;
> +}
> +
> +/*
> + *  The aggregation is printed in the END probe.  We use printf()
> + *  statements to print the thread ID of the main program and the
> + *  table header.
> + *  Note that there is no format field for the value for the
> + *  aggregation.  As explained above, the value is not relevant
> + *  in this case.
> + */
> +END
> +{
> +  printf("Thread ID of main program: %d\n\n",tid_main);
> +  printf("%16s <=> %-9s\n\n","Pthreads ID","Thread ID");
> +  printa("%16d <=> %-9d\n", at thread_mapping);
> +}
> diff --git a/examples/var-scope.d b/examples/var-scope.d
> new file mode 100755
> index 000000000..b5317afd2
> --- /dev/null
> +++ b/examples/var-scope.d
> @@ -0,0 +1,117 @@
> +#!/usr/sbin/dtrace -s
> +
> +/*
> + *  NAME
> + *    var-scope.d - initialize, update and print variables
> + *
> + *  SYNOPSIS
> + *    sudo ./var-scope.d -c <target>
> + *
> + *  DESCRIPTION
> + *    This script demonstrates some of the scoping rules.  It
> + *    shows the behaviour of global and clause-local variables.
> + *
> + *  NOTES
> + *    Other than the BEGIN and END probe used here, there are
> + *    2 more probes for the fopen() function, but you can use any
> + *    probe you like, or no additional probes at all.  You are
> + *    encouraged to experiment with this example and make changes.
> + *
> + *    In case this example is used with the date command, you will
> + *    see something like this:
> + *
> + *    $ sudo ./var-scope.d -c date
> + *    In BEGIN
> + *        var_A = 10 var_B = 20

Why var_A and var_B?  How about just A and B?  Alternatively, globalA 
and globalB, but I think A and B would be better.  It emphasizes that 
variables are "by default" global.

> + *        this->C = 60
> + *    In fopen - return
> + *        var_A = 11 var_B = 21
> + *    <output from the date command>
> + *    In END
> + *        var_A = 11 var_B = 21
> + *    $
> + *
> + *    Below, the workings of each probe is explained.  Since the
> + *    results depend on the command that is traced, we base the
> + *    explanations on the output from the above example.
> + */
> +
> +/*
> + *  Suppress the default output from the dtrace command.
> + */
> +#pragma D option quiet
> +
> +/*
> + *  This demonstrates how to declare a variable outside of a clause.
> + *  Variable var_B is declared to be of type int64_t.  This is a
> + *  global variable.  It is initialized to zero by default.
> + */
> +int64_t var_B;
> +
> +/*
> + *  In this clause for the BEGIN probe, var_A is initialized to 10
> + *  and var_B is given a value of 20.  Both are global variables.
> + *  Variable C is a clause-local variable initialized to 30.
> + */
> +BEGIN
> +{
> +  var_A = 10;
> +  var_B = 20;
> +  this->C = 30;
> +}
> +
> +/*
> + *  In this clause for the BEGIN probe, clause-local variable C is
> + *  updated to a value of 60.  This demonstrates that clause-local
> + *  variables are local to the set of clauses executed for a
> + *  particular probe firing.  This set is executed in lexical order.

What is lexical order?  Isn't it the order in some lexicon... probably 
alphabetical order?  Maybe you have to say "in the order they appear in 
the program"?

Demonstrating stuff about clause-local variables is tricky, since they 
start with undefined values.  So, e.g., you could have one clause say 
this->a=31415926 and have a different clause print this->a and see the 
value 31415926.  That would seem to demonstrate that the clause-local 
variable's value was persistent from one clause to an entirely unrelated 
clause, when in fact it simply means that the second clause reused some 
value without resetting it.  Bottom line:  "demonstrating" stuff with 
clause-local variables is tricky.

There are no thread-local variables.

Mostly, what you are doing is demonstrating global variables, which 
might feel rather underwhelming to many readers.  Personally, I am not a 
fan of this example.

> + *  Here, there are 2 sets for the BEGIN probe.
> + *  The printed values in the example above confirm this update
> + *  for C, as well as the initial values for var_A and var_B.
> + */
> +BEGIN
> +{
> +  this->C += 30;
> +  printf("In %s\n",probename);
> +  printf("\tvar_A = %d var_B = %d\n",var_A,var_B);
> +  printf("\tthis->C = %d\n",this->C);
> +}
> +
> +/*
> + *  The clause in this probe updates var_A.  Since this is a global
> + *  variable, this change is visible to all probes that fire after
> + *  this probe has fired.
> + */
> +pid$target::fopen:entry
> +{
> +  var_A += 1;
> +}
> +
> +/*
> + *  The clause in this probe updates var_B and prints both var_A
> + *  and var_B.  The probefunc (fopen) and probename (return) are
> + *  printed first, followed by the values for var_A and var_B.
> + *  We are in the return probe here, so we know that the entry
> + *  probe has fired too.  This means that var_A has been updated
> + *  as well.
> + *  This is exactly what we see in the output above: var_A is 11
> + *  and var_B is 21, as it has just been updated.
> + *  Note that C is not visible here, because it is local to the
> + *  BEGIN probe.
> + */
> +pid$target::fopen:return
> +{
> +  var_B += 1;
> +  printf("In %s - %s\n",probefunc,probename);
> +  printf("\tvar_A = %d var_B = %d\n",var_A,var_B);
> +}
> +
> +/*
> + *  Both var_A and var_B are global variables and are visible
> + *  in the END probe.  The final values are printed here.
> + */
> +END
> +{
> +  printf("In %s\n",probename);
> +  printf("\tvar_A = %d var_B = %d\n",var_A,var_B);
> +}



More information about the DTrace-devel mailing list