Friday, October 31, 2008

Dtrace Basics

DTrace is a comprehensive dynamic tracing facility that can be used by administrators and developers to examine the behavior of both user programs and of the operating system itself. With DTrace we can explore our system to understand how it works, track down performance problems across many layers of software, or locate the cause of aberrant behavior. It is safe to use on production systems and does not require restarting/recompiling either the system or applications.

We write D scripts which consist of the probe desctiption , predicates and actions to be taken :
probe description
/predicate/
{
actions
}

When we run the D script , we get results based on the probe desciptions ( the actions are executaed based on the predicate filter ). Think of probes as events: a probe fires when the event happens. Let's take a simple D script example,example.d :

syscall::write:entry
/execname == "bash"/
{
printf("bash with pid %d called write system call\n",pid);
}

Here the probe description is syscall::write:entry , which describes the write system call.The predicate is execname == bash , execname is a builtin variable which contains the executable name and here we proceed with the actions only when the string matches.The action statements contain a builtin function print.


Providers/Probes

To list all of the available probes on your system, type the command:
# sudo dtrace -l

It might take some time to display all of the output. To count up all your probes, you can type the command:

#sudo dtrace -l | wc -l
22567

If you look at the output from dtrace -l in your terminal window,each probe has two names , an integer ID and a human-readable name. The human readable name is composed of four parts.When writing out the full human-readable name of a probe, we write all four parts of the name separated by colons like this:

provider:module:function:name

You might note that some fields are left blank.A blank field is a wildcard and matches all of the probes that have matching values in the parts of the name that you do specify.

Now let's look a little deeper. The probe is described using four fields, the provider, module, function, and name.

* provider—Specifies the instrumentation method to be used. For example, the syscall provider is used to monitor system calls while the io provider is used to monitor the disk io.
* module and function—Describes the module and function you want to observe
* name—Typically represents the location in the function. For example, use entry for name to instrument when you enter the function.

Note that wild cards like * and ? can be used. Blank fields are interpreted as wildcards.Below table shows a few examples :

Probe Description Explanation
syscall::open:entry entry into open system call
syscall::open*:entry entry into any system call that starts with open (open and open64)
syscall:::entry entry into any system called
syscall::: all probes published by the system call provider

A predicate can be any D expression.The action is executed only when the predicate evaluates to true.Below table shows some examples :
Predicate Explanation
cpu == 0 true if the probe executes on cpu0
pid == 1029 true if the pid of the process that caused the probe to fire is 1029
execname != "sched" true if the process is not the scheduler (sched)
ppid !=0 && arg0 == 0 true if the parent process id is not 0 and first argument is 0

The action section can contain a series of action commands separated by semi-colons (;).Below table provides some examples :
Action Explanation
printf() print something using C-style printf() command
ustack() print the user level stack
trace print the given variable

Note that predicates and action statements are optional. If the predicate is missing, then the action is always executed. If the action is missing, then the name of the probe which fired is printed.

Below links provide references for different parts of a probe.
List of providers
List of functions
List of aggregating functions
List of variables
List of built-in variables

Examples

pid provider
------------
Example Explanation
pid2439:libc:malloc:entry entry into the malloc()in libc for process id 2439
pid1234:a.out:main:return return from main for process id 1234
pid1234:a.out::entry entry into any func in 1234 that is main exec
pid1234:::entry entry into any function in any library for pid 1234

You can limit the number of probes enabled by modifying the probe description.
Probe Description Explanation
pid$1:libc::entry/div> Limit to only a given library
pid$1:a.out::entry/div> Limit probes to non-library functions
pid$1:libc:printf:entry Limit probes to just one function

Here is the command you can run to print all the functions that process id 1234 calls:
# dtrace -n pid1234:::entry

Modify the script to take the process id as a parameter. Your script will now look like:

#!/usr/sbin/dtrace -s
pid$1:::entry
{}

script to find the stack trace when the program makes the write system call. Note that you need to run this with the -c option.

#!/usr/sbin/dtrace -s
syscall::write:entry
{
@[ustack()]=count();
}

The syscall Provider
--------------------
This is probably the most important provider to learn and use because system calls are the main communication channel between user level applications and the kernel.

To list all the occurrences of the probe when it was fired and give information about the system calls at entry into the system that are performing a close(2) system call, use the following script:

# dtrace -n syscall::close:entry

To start to identify the process which sent a kill(2) signal to a particular process, use the following script:

#!/usr/sbin/dtrace -s
syscall::kill:entry
{
trace(pid);
trace(execname);
}

The proc Provider
-----------------
Trace all the signals sent to all the processes currently running on the system:

#!/usr/sbin/dtrace -wqs
proc:::signal-send
{
printf("%d was sent to %s by ", args[2], args[1]->pr_fname);
system("getent passwd %d | cut -d: -f5", uid);
}

Add the conditional statement (/args[2] == SIGKILL/) into the script and send SIGKILL signals to different processes from different users.

#!/usr/sbin/dtrace -wqs
proc:::signal-send
/args[2] == SIGKILL/
{
printf("SIGKILL was sent to %s by ", args[1]->pr_fname);
system("getent passwd %d | cut -d: -f5", uid);
}

Here you can see the introduction of pr_fname, which is part of the structure of psinfo_t of the receiving process.

References :

Dtrace @ OpenSolaris
Dtrace inventor blogs
Big Admin Page
Dtrace Guide