NAME
collect - command used to collect program performance data
SYNOPSIS
collect collect-arguments target target-arguments
collect
collect -V
collect -R
DESCRIPTION
The collect command runs the target process and records per-
formance data and global data for the process. Performance
data is collected using profiling or tracing techniques.
The data can be examined with a GUI program (analyzer) or
a command-line program (er_print). The data collection
software run by the collect command is referred to here as
the Collector.
The data from a single run of the collect command is called
an experiment. The experiment is represented in the file
system as a directory, with various files inside that direc-
tory.
The target is the path name of the executable, Java(TM) .jar
file, or Java .class file for which you want to collect per-
formance data. (For more information about Java profiling,
see JAVA PROFILING, below.) Executables that are targets
for the collect command can be compiled with any level of
optimization, but must use dynamic linking. If a program is
statically linked, the collect command prints an error mes-
sage. In order to see annotated source using analyzer or
er_print, targets should be compiled with the -g flag, and
should not be stripped.
In order to enable dataspace profiling, executables must be
compiled with the -xhwcprof -xdebugformat=dwarf -g flags.
These flags are valid only for the C compiler, and only on
SPARC[R] platforms. See the section "DATASPACE PROFILING",
below.
The collect command uses the following strategy to find its
target:
- If there is a file with the name of the target that is
marked executable, the file is verified as an ELF execut-
able that can run on the target machine. If the file is
not such a valid ELF executable, the collect command
fails.
- If there is a file with the name of the target, and the
file is not executable, collect checks whether the file is
a Java[TM] jar file or class file. If the file is a Java
jar file or class file, the Java[TM] virtual machine (JVM)
software is inserted as the target, with any necessary
flags, and data is collected on that JVM machine. (The
terms "Java virtual machine" and "JVM" mean a virtual
machine for the Java[TM] platform.) See the section on
"JAVA PROFILING", below.
- If there is no file with the name of the target, the
user's path is searched to find an executable; if an exe-
cutable is found, it is verified as described above.
- If no file of the current name is found, the command looks
for a file with that name and the string .class appended;
if a file is found, the target of a JVM machine is
inserted, with the appropriate flags, as above.
- If none of these procedures can find the target, the com-
mand fails.
OPTIONS
If invoked with no arguments, print a usage summary, includ-
ing the default configuration of the experiment. If the pro-
cessor supports hardware counter overflow profiling, print
two lists containing information about hardware counters.
The first list contains "well known" hardware counters; the
second list contains raw hardware counters. For more
details, see the "Hardware Counter Overflow Profiling" sec-
tion below.
Data Specifications
-p option
Collect clock-based profiling data. The allowed values
of option are:
Value Meaning
off Turn off clock-based profiling
on Turn on clock-based profiling with the
default profiling interval of approximately
10 milliseconds.
lo[w] Turn on clock-based profiling with the low-
resolution profiling interval of approxi-
mately 100 milliseconds.
hi[gh] Turn on clock-based profiling with the high-
resolution profiling interval of approxi-
mately 1 millisecond.
n Turn on clock-based profiling with a
profiling interval of n. The value n can be
an integer or a floating-point number, with a
suffix of u for values in microseconds, or m
for values in milliseconds. If no suffix is
used, assume the value to be in milliseconds.
If the value is smaller than the clock pro-
filing minimum, set it to the minimum; if it
is not a multiple of the clock profiling
resolution, round down to the nearest multi-
ple of the clock resolution. If it exceeds
the clock profiling maximum, report an error.
If it is negative or zero, report an error.
If invoked with no arguments, report the
clock-profiling intervals.
If no explicit -p off argument is given, and no
hardware counter overflow profiling is specified, turn
on clock-based profiling.
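For example, assuming a target program a.out (a placeholder
for your own executable), a command such as the following
collects clock profiles at an interval of approximately 2
milliseconds:
collect -p 2m a.out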
-h ctr_def...[,ctr_n_def]
Collect hardware counter overflow profiles. The number
of counter definitions (ctr_def through ctr_n_def) is
processor-dependent. For example, on an UltraSPARC III
system, up to two counters may be programmed; on an
Intel Pentium IV with Hyperthreading, up to 18 counters
are available. The user can ascertain the maximum
number of hardware counters definitions for profiling
on a target system, and the full list of available
hardware counters, by running the collect command
without any arguments.
This option is now available on systems running the
Linux OS. The user is responsible for installing the
required perfctr patch on the system; that patch can be
downloaded from:
http://user.it.uu.se/~mikpe/linux/perfctr/2.6/perfctr-2.6.15.tar.gz
Instructions for installation are contained within that
tar file.
Each counter definition takes one of the following
forms, depending on whether attributes for hardware
counters are supported on the processor:
1. [+]ctr[/reg#][,interval]
2. [+]ctr[~attr=val]...[~attrN=valN][/reg#][,interval]
The meanings of the counter definition options are as
follows:
Value Meaning
+ Optional parameter that can be applied to
memory-related counters. Causes collect to
collect dataspace data by backtracking to
find the instruction that triggered the over-
flow, and to find the virtual and physical
addresses of the memory reference. Backtracking
works only on SPARC processors, and only with
counters of type load, store, or load-store,
as displayed in the counter list obtained by
running the collect command without any
command-line arguments. See the section
"DATASPACE PROFILING", below.
ctr Processor-specific counter name. The user can
ascertain the list of counter names by run-
ning the collect command without any
command-line arguments.
attr=val On some processors, attribute options can be
associated with a hardware counter. If the
processor supports attribute options, running
the collect command without any command-line
arguments shows the counter definition,
ctr_def, in the second form listed above, and
provides a list of attribute names to use for
attr. Value val can be in decimal or hexa-
decimal format. Hexadecimal numbers use C
program format, prefixed by a zero and a
lower-case x (0xhex_number).
reg# Hardware register to use for the counter. If
not specified, collect attempts to place the
counter into the first available register and
as a result, might be unable to place subse-
quent counters due to register conflicts. If
the user specifies more than one counter, the
counters must use different registers. The
list of allowable register numbers can be
ascertained by running the collect command
without any command-line arguments.
interval Sampling frequency, set by defining the
counter overflow value. Valid values are as
follows:
Value Meaning
on Select the default rate, which can
be determined by running the col-
lect command without any command-
line arguments. Note that the
default value for all raw counters
is the same, and might not be the
most suitable value for a specific
counter.
hi Set the interval to approximately one
tenth of the on interval.
lo Set the interval to approximately ten
times the on interval.
value Set interval to a specific value,
specified in decimal or hexadecimal
format.
An experiment can specify both hardware counter over-
flow profiling and clock-based profiling. If hardware
counter overflow profiling is specified, but clock-
based profiling is not explicitly specified, turn off
clock-based profiling.
For more information on hardware counters, see the
"Hardware Counter Overflow Profiling" section below.
-s option
Collect synchronization tracing data. This option is
not available on systems running the Linux OS.
The minimum delay threshold for tracing events is set
using option. The allowed values of option are:
Value Meaning
on Turn on synchronization delay tracing and set
the threshold value by calibration at runtime
calibrate Same as on
off Turn off synchronization delay tracing
n Turn on synchronization delay tracing with a
threshold value of n microseconds. If n is
zero, trace all events.
all Turn on synchronization delay tracing and
trace all synchronization events.
By default, turn off synchronization delay tracing.
Record synchronization events for Java monitors, but
not for native synchronization within the JVM machine.
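For example, assuming a placeholder multithreaded target
a.out, a command such as the following records synchroniza-
tion events whose real-time delay exceeds 100 microseconds:
collect -s 100 a.out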
-H option
Collect heap trace data. The allowed values of option
are:
Value Meaning
on Turn on tracing of memory allocation requests
off Turn off tracing of memory allocation
requests
By default, turn off heap tracing.
Record heap-tracing events for any native calls. Treat
calls to mmap as memory allocations.
Heap profiling is not supported for Java programs.
Specifying it is treated as an error.
Note that heap tracing may produce a very large experi-
ment. Such experiments are very slow to load and
browse.
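For example, assuming a placeholder target a.out, a command
such as the following traces memory allocation requests in
addition to the default clock-based profiling:
collect -H on a.out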
-m option
Collect MPI tracing data. This option is not available
on systems running the Linux OS.
The allowed values of option are:
Value Meaning
on Turn on tracing of MPI calls
off Turn off tracing of MPI calls
By default, turn off MPI tracing.
-S interval
Collect periodic samples at the interval specified (in
seconds). Record data samples from the process, and
include a timestamp and execution statistics from the
kernel, among other things. The allowed values of
interval are:
Value Meaning
off Turn off periodic sampling
on Turn on periodic sampling with the default
sampling interval (1 second)
n Turn on periodic sampling with a sampling
interval of n in seconds; n must be positive.
By default, turn on periodic sampling.
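For example, assuming a placeholder target a.out, a command
such as the following records a sample point every 5
seconds:
collect -S 5 a.out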
If no data specification arguments are supplied, col-
lect clock-based profiling data, using the default
resolution.
If clock-based profiling is explicitly disabled, and
neither hardware-counter overflow profiling nor any
kind of tracing is enabled, display a warning that no
function-level data is being collected, then execute
the target and record global data.
Note: With this release, data discrepancies are high
when running the collect command with this option
on systems running the Linux OS.
Experiment Controls
-L size
Limit the amount of profiling and tracing data recorded
to size megabytes. The limit applies to the sum of all
profiling data and tracing data, but not to sample
points. The limit is only approximate, and can be
exceeded. When the limit is reached, stop profiling
and tracing data, but keep the experiment open and
record samples until the target process terminates.
The allowed values of size are:
Value Meaning
unlimited or none
Do not impose a size limit on the experiment
n Impose a limit of n MB; n must be greater
than zero.
The default limit on the amount of data recorded is
2000 Mbytes.
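For example, assuming a placeholder target a.out, a command
such as the following stops recording profiling and tracing
data once approximately 500 MB have been written:
collect -L 500 a.out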
-F option
Control whether or not descendant processes should have
their data recorded. The allowed values of option are:
Value Meaning
on Record experiments on descendant processes
from fork and exec
all Record experiments on all descendant
processes
off Do not record experiments on descendant
processes
By default, do not record descendant processes. For
more details, users should read the section "FOLLOWING
DESCENDANT PROCESSES", below.
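For example, assuming a placeholder target a.out that
creates descendant processes, a command such as the follow-
ing records an experiment for each descendant created by
fork or exec:
collect -F on a.out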
-A option
Control whether or not load objects used by the target
process should be archived or copied into the recorded
experiment. The allowed values of option are:
Value Meaning
on Archive load objects into the experiment.
off Do not archive load objects into the experi-
ment.
copy Copy and archive load objects into the exper-
iment.
A user who copies experiments onto a different
machine, or reads the experiments from a different
machine, should specify -A copy. Note that doing so
does not copy any sources or object files; it is the
user's responsibility to ensure that those files are
accessible on the machine where the experiment is
examined.
The default setting for -A is on.
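For example, assuming a placeholder target a.out, a command
such as the following copies load objects into the experi-
ment so that it can be examined on another machine:
collect -A copy a.out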
-j option
Control Java profiling when the target is a JVM
machine. The allowed values of option are:
Value Meaning
on Record profiling data for the JVM machine,
and recognize methods compiled by the Java
HotSpot[TM] virtual machine, and also record
Java callstacks.
off Do not record Java profiling data.
<path> Record profiling data for the JVM, and use
the JVM as installed in <path>.
See the section "JAVA PROFILING", below.
The user must use -j on to obtain profiling data if the
target is a JVM machine. The -j on option is not
needed if the target is a class or jar file. To pro-
file a 64-bit JVM machine, specify its path explicitly
as the target; do not use the -d64 option with a
32-bit JVM machine. If the -j on option is specified,
but the target is not a JVM machine, an invalid argu-
ment might be passed to the target, and no data would
be recorded.
The collect command validates the version of the JVM
machine specified for Java profiling.
-J java_args
Specify arguments to be passed to the JVM used for pro-
filing. If -J is specified, but Java profiling is not
specified, an error is generated, and no experiment is
run.
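For example, assuming placeholder paths for the JVM and for
MyApp.jar, a command such as the following profiles a Java
application and passes a maximum-heap argument to the JVM:
collect -j on -J "-Xmx512m" /usr/java/bin/java -jar MyApp.jar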
-l signal
Record a sample point whenever the given signal is
delivered to the process.
-y signal[,r]
Control recording of data with signal. Whenever the
given signal is delivered to the process, switch
between paused (no data is recorded) and resumed (data
is recorded) states. Start in the resumed state if the
optional ,r flag is given, otherwise start in the
paused state. This option does not affect the record-
ing of sample points.
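For example, assuming a placeholder target a.out, and assum-
ing that signal names such as SIGUSR1 are accepted for sig-
nal, a command such as the following records a sample point
on each SIGUSR1 and toggles data recording on each SIGUSR2
delivered to the process (recording starts paused because
,r is omitted); pid stands for the process ID of the running
target:
collect -l SIGUSR1 -y SIGUSR2 a.out
kill -USR1 pid
kill -USR2 pid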
Output Controls
-o experiment_name
Use experiment_name as the name of the experiment to be
recorded. The experiment_name string must end in the
string .er; if not, print an error message and do not
run the experiment.
If -o is not specified, give the experiment a name of
the form stem.n.ern, where stem is a string, and n is a
number. If a group name has been specified with -g,
set stem to the group name without the .erg suffix. If
no group name has been specified, set stem to the
string "test".
If invoked from one of the commands used to run MPI
jobs and -o is not specified, take the value of n used
in the name from the environment variable used to
define the MPI rank of that process. Otherwise, set n
to one greater than the highest integer currently in
use.
If the name is not specified in the form stem.n.er, and
the given name is in use, print an error message and do
not run the experiment. If the name is of the form
stem.n.er and the name supplied is in use, record the
experiment under a name corresponding to one greater
than the highest value of n that is currently in use.
Print a warning if the name is changed.
-d directory_name
Place the experiment in directory directory_name. If
no directory is given, place the experiment in the
current working directory. If a group is specified
(see -g, below), the group file is also written to the
directory named by -d.
-g group_name
Add the experiment to the experiment group group_name.
The group_name string must end in the string .erg; if
not, report an error and do not run the experiment.
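For example, assuming a placeholder target a.out, a command
such as the following records the experiment as
/var/tmp/myexps/run.1.er and adds it to the group file
myruns.erg in the same directory:
collect -o run.1.er -d /var/tmp/myexps -g myruns.erg a.out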
-O file
Append all output from collect itself to the named
file, but do not redirect the output from the spawned
target. If file is set to /dev/null, suppress all out-
put from collect, including any error messages.
Other Arguments
-C comment
Put the comment into the notes file for the experiment.
Up to ten -C arguments may be supplied.
-n Dry run: do not run the target, but print all the
details of the experiment that would be run. Turn on
-v.
-R Display the text version of the performance tools
README in the terminal window. If the README is not
found, print a warning. Do not examine further argu-
ments and do no further processing.
-V Print the current version. Do not examine further
arguments and do no further processing.
-v Print the current version and further detailed informa-
tion about the experiment being run.
-x Leave the target process stopped on the exit from the
exec system call, in order to allow a debugger to
attach to it.
To attach a debugger to the target once it is stopped
by collect, the user must follow the procedure below.
This option is not available on systems running the
Linux OS.
- Determine the PID of the process
- Start the debugger
- Configure the debugger to ignore SIGPROF (and SIGEMT,
if you chose to collect hardware counter data)
- Attach to the process using the PID.
As the process runs under the control of the debugger,
the Collector records an experiment.
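For example, assuming a placeholder target a.out, a sketch
of the procedure using dbx might look like the following;
pid stands for the process ID determined in the first step,
and after attaching, the debugger must still be told to
ignore SIGPROF (and SIGEMT) as described above:
collect -x -p on a.out &
dbx - pid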
FOLLOWING DESCENDANT PROCESSES
Processes can create descendant processes by calling a sys-
tem library function. The Collector can collect data for
descendant processes initiated by calls to fork(2),
fork1(2), fork(3F), vfork(2), and exec(2) and its variants.
The call to vfork is replaced internally by a call to fork1.
The Collector ignores calls to system(3C), system(3F),
sh(3F), popen(3C), and similar functions, and their associ-
ated descendant processes. If the -F on argument is used,
the Collector opens a new experiment for each descendant
process inside the parent experiment. These new experiments
are named with their lineage as follows:
- An underscore is appended to the creator's experiment
name.
- A code letter is added: either "f" for a fork, or "x" for
an exec.
- A number is added after the code letter, which is the
index of the fork or exec. The number is assigned
whether or not the process was started successfully.
For example, if the experiment name for the initial process
is "test.1.er", the experiment for the descendant process
created by its third fork is "test.1.er/_f3.er". If that
descendant process execs a new image, the corresponding
experiment name is "test.1.er/_f3_x1.er".
If the -F all argument is used, all descendants are pro-
cessed, including those from system(3C), system(3F), sh(3F),
popen(3C), and similar functions. Those descendants that
are processed by -F all but not by -F on are named with the
code letter "c".
The Analyzer and er_print automatically read experiments for
descendant processes when the founder experiment is read,
but the experiments for the descendant processes are not
selected for data display.
To select the data for display from the command line,
specify the path name explicitly to either er_print or
Analyzer. The specified path must include the founder exper-
iment name, and the descendant experiment's name inside the
founder directory.
For example, to see the data for the third fork of the
test.1.er experiment:
er_print test.1.er/_f3.er
analyzer test.1.er/_f3.er
You can prepare an experiment group file with the explicit
names of descendant experiments of interest.
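For example, assuming that experiment group files begin with
the usual header line, a group file such as the following
(named, say, fork3.erg) selects only the descendant experi-
ment of interest:
#analyzer experiment group
test.1.er/_f3.er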
To examine descendant processes in the Analyzer, load the
founder experiment and select "Filter data" from the View
menu. The analyzer will display a list of experiments with
only the founder experiment checked. Uncheck the founder
experiment and check the descendant experiment of interest.
PROFILING MULTITHREADED APPLICATIONS
The collect command works on all multithreaded applications,
but the default threads library on the Solaris 8 Operating
System (Solaris OS), known as T1, has several problems when
profiled. It might discard profiling interrupts when no
thread is scheduled onto an LWP; in such cases, the Total
LWP Time reported may seriously underestimate the true LWP
time. Under some circumstances, it may also get a segmenta-
tion violation accessing an internal library mutex, causing
the application to crash.
On the Solaris 8 OS, the workaround is to use the alternate
threads library (known as T2), by prepending /usr/lib/lwp to
your LD_LIBRARY_PATH setting. On the Solaris 9 OS and
later, the default library is T2.
The Collector detects the use of T1, and puts a warning in
the experiment.
While multithreaded profiling is available under Linux,
this release shows large data discrepancies on RedHat
Linux systems.
JAVA PROFILING
Java profiling consists of collecting a performance experi-
ment on the JVM machine as it runs the user's .class or .jar
files. If possible, callstacks are collected in both the
Java model and in the machine model.
Data can be shown with view mode set to User, Expert, or
Machine. User mode shows each method by name, with data for
interpreted and HotSpot-compiled methods aggregated
together; it also suppresses data for non-user-Java threads.
Expert mode separates HotSpot-compiled methods from inter-
preted methods, and does not suppress non-user Java threads.
Machine mode shows data for interpreted Java methods against
the JVM machine as it does the interpreting, while data for
methods compiled with the Java HotSpot virtual machine is
reported for named methods. All threads are shown. In all
three modes, data is reported in the usual way for any C,
C++, or Fortran code called by a Java target. Such code
corresponds to Java native methods. The Analyzer and the
er_print utility can switch between the view mode User, view
mode Expert, and view mode Machine, with User being the
default.
Clock-based profiling and hardware counter overflow profil-
ing are supported. Synchronization tracing collects data
only on the Java monitor calls, and synchronization calls
from native code; it does not collect data about internal
synchronization calls within the JVM machine.
Heap tracing is not supported for Java, and generates an
error if specified.
When collect inserts a target name of java into the argument
list, it examines environment variables for a path to the
java target, in the order JDK_HOME, and then JAVA_PATH. For
the first of these environment variables that is set, the
resultant target is verified as an ELF executable. If it is
not, collect fails with an error indicating which environ-
ment variable was used, and the full path name that was
tried.
JDK_1_4_HOME is obsolete, and, if set, is ignored with a
warning.
If none of those environment variables is set, the collect
command uses the path of the Java[TM] 2 Platform, Standard
Edition technology installed with this release, if any; if
it was not installed, the java target is found using the
user's PATH.
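For example, assuming a placeholder JDK installation path
and target class file, commands such as the following
(Bourne shell syntax) select a specific JVM through JDK_HOME
before profiling:
JDK_HOME=/usr/java/jdk1.4.2; export JDK_HOME
collect MyApp.class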
Java Profiling requires the Java[TM] 2 SDK, version 1.4.2_02
or later.
OPENMP PROFILING
Data collection for OpenMP programs collects data that can
be displayed in any of the three view modes, just as for
Java programs. The presentation is identical for user mode
and expert mode. Slave threads are shown as if they were
really forked from the master thread, and have call stacks
matching the master thread. Frames in the call stack coming
from the OpenMP runtime code (libmtsk.so) are suppressed.
For machine mode, the actual native stacks are shown.
In user mode, various artificial functions are introduced as
the leaf function of a callstack whenever the runtime
library is in one of several states. These functions are
<OMP-overhead>, <OMP-idle>, <OMP-reduction>, <OMP-
implicit_barrier>, <OMP-explicit_barrier>, <OMP-lock_wait>,
<OMP-critical_section_wait>, and <OMP-ordered_section_wait>.
Two additional clock-profiling metrics are added to the data
for clock-profiling experiments: OMP Work, and OMP Wait.
The inclusive metrics are visible by default; the exclusive
are not. Together, the sum of those two metrics equals the
Total LWP Time metric. No additional metrics are added for
other experiments.
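For example, assuming a placeholder source file prog.c and
that the program is built with the Sun compilers' -xopenmp
flag, commands such as the following collect a clock-
profiling experiment on an OpenMP program:
cc -xopenmp -g -o prog prog.c
collect -p on ./prog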
DATASPACE PROFILING
A dataspace profile is a data collection in which memory-
related events, such as cache misses, are reported against
the data object references that cause the events rather than
just the instructions where the memory-related events occur.
Dataspace profiling is not available on systems running the
Linux OS, nor on x86-based systems running the Solaris OS.
To allow dataspace profiling, the target must be a C pro-
gram, compiled for SPARC architecture, with the -xhwcprof
-xdebugformat=dwarf -g flags, as described above. Further-
more, the data collected must be hardware counter profiles
and the optional + must be prepended to the counter name.
If the optional + is prepended to one memory-related
counter, but not all, the counters without the + will report
dataspace data against the <Unknown> data object, with sub-
type (Dataspace data not requested during data collection).
With the data collected, the er_print utility allows three
additional commands: data_objects, data_single, and
data_layout, as well as various commands relating to Memory
Objects. See the er_print(1) man page for more information.
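For example, assuming a placeholder C source file prog.c,
and assuming that er_print accepts its commands as leading-
dash options as described in er_print(1), a sequence such as
the following compiles for dataspace profiling, collects
hardware counter profiles with backtracking on the dcrm
counter shown in the counter list below, and prints the
data-object list:
cc -xhwcprof -xdebugformat=dwarf -g -o prog prog.c
collect -h +dcrm,on ./prog
er_print -data_objects test.1.er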
In addition, the Analyzer now includes two tabs related to
dataspace profiling, labeled DataObjects and DataLayout, as
well as a set of tabs relating to Memory Objects. See the
analyzer(1) man page for more information.
USING COLLECT WITH MPI
The collect command can be used with MPI by simply prefacing
the target and its arguments with the collect command and
its arguments in the command line that starts the MPI job.
For example, on an SMP machine,
% mprun -np 16 a.out 3 5
can be replaced by
% mprun -np 16 collect -m on -d /tmp/mydirectory -g
run1.erg a.out 3 5
This command runs an MPI tracing experiment on each of the
16 MPI processes, collecting them all in a specific direc-
tory, and collecting them as a group. The individual exper-
iments are named by the MPI rank, as described above under
the -o option. The experiments, as specified above, contain
clock-based profiling data, which is turned on by default,
and MPI tracing data.
On a cluster, local file systems like /tmp may be private to
a node. If experiments are collected on node-private file
systems, you should gather those experiments to a globally
visible file system after the experiments have completed,
and edit any group file to reflect the new location of those
experiments.
USING COLLECT WITH PPGSZ
The collect command can be used with ppgsz by running the
collect command on the ppgsz command, and specifying the -F
on flag. The founder experiment is on the ppgsz executable
and is uninteresting. If your path finds the 32-bit version
of ppgsz, and the experiment is being run on a system that
supports 64-bit processes, the first thing the collect com-
mand does is execute an exec function on its 64-bit version,
creating _x1.er. That executable forks, creating _x1_f1.er.
The descendant process attempts to execute an exec function
on the named target, in the first directory on your path,
then in the second, and so forth, until one of the exec
functions succeeds. If, for example, the third attempt
succeeds, the first two descendant experiments are named
_x1_f1_x1.er and _x1_f1_x2.er, and both are completely
empty. The experiment on the target is the one from the
successful exec, the third one in the example, and is named
_x1_f1_x3.er, stored under the founder experiment. It can
be processed directly by invoking the Analyzer or the
er_print utility on test.1.er/_x1_f1_x3.er.
If the 64-bit ppgsz is the initial process run, or if the
32-bit ppgsz is invoked on a 32-bit kernel, the fork descen-
dant that executes exec on the real target has its data in
_f1.er, and the real target's experiment is in _f1_x3.er,
assuming the same path properties as in the example above.
See the section "FOLLOWING DESCENDANT PROCESSES", above.
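For example, assuming a placeholder target a.out and typical
ppgsz options as described in ppgsz(1), a command such as
the following records the founder experiment on ppgsz and
descendant experiments down to the real target:
collect -F on ppgsz -o heap:4M ./a.out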
USING COLLECT ON SETUID TARGETS
The collect command operates by inserting a shared library,
libcollector.so, into the target's address space
(LD_PRELOAD), and by using a second shared library,
collaudit.so, to record shared-object use with the runtime
linker's audit interface (LD_AUDIT). Those two shared
libraries write the files that constitute the experiment.
Several problems may arise if collect is invoked on execut-
ables that call setuid or setgid, or that create descendant
processes that call setuid or setgid. If the user running
the experiment is not root, collection fails because the
shared libraries are not installed in a trusted directory.
The workaround is to run the experiments as root.
In addition, the umask for the user running the collect com-
mand must be set to allow write permission for that user,
for any users or groups that are set by the setuid/setgid
attributes of a program on which exec is run, and for any
user or group to which such a program sets itself.
may not be written to the experiment, and processing of the
experiment is not possible. If the log file can be written,
an error is shown when the user attempts to process the
experiment.
Other problems can arise if the target itself makes any of
the system calls to set UID or GID, if it changes its
umask and then forks or runs exec on some other process, or
if crle was used to configure how the runtime linker
searches for shared objects.
DATA COLLECTED
Three types of data are collected: profiling data, tracing
data and sampling data. The data packets recorded in profil-
ing and tracing include the callstack of each LWP, the LWP,
thread and CPU IDs, and some event-specific data. The data
packets recorded in sampling contain global data such as
execution statistics, but no program-specific or event-
specific data. All data packets include a timestamp.
Clock-based Profiling
The event-specific data recorded in clock-based profil-
ing is an array of counts for each accounting micro-
state. The microstate array is incremented by the sys-
tem at a prescribed frequency, and is recorded by the
Collector when a profiling signal is processed.
Clock-based profiling can run at a range of frequencies,
which must be multiples of the clock resolution used
for the profiling timer. In the Solaris 7 OS and some
updates of the Solaris 8 OS, the system clock is used.
To do high-resolution profiling on these systems, the
operating system on the machine must be running with a
high-resolution clock routine, which can be done by
putting the following line in the file /etc/system and
rebooting:
set hires_tick=1
If you try to do high-resolution profiling on a machine
with an operating system that does not support it, the
command prints a warning message and uses the highest
resolution supported. Similarly, a custom setting that
is not a multiple of the resolution supported by the
system is rounded down to the nearest non-zero multiple
of that resolution, and a warning message is printed.
Clock-based profiling data is converted into the fol-
lowing metrics:
User CPU Time
Wall Time
Total LWP Time
System CPU Time
Wait CPU Time
User Lock Time
Text Page Fault Time
Data Page Fault Time
Other Wait Time
For experiments on multithreaded applications, all of
the times, other than Wall Time, are summed across all
LWPs in the process; Wall Time is the time spent in
all states for LWP 1 only. Total LWP Time adds up to
the real elapsed time, multiplied by the average number
of LWPs in the process.
Hardware Counter Overflow Profiling
Hardware counter overflow profiling records the number
of events counted by the hardware counter at the time
the overflow signal was processed. This type of profil-
ing is now available on systems running the Linux OS,
provided that they have the Perfctr patch installed.
Hardware counter overflow profiling can be done on sys-
tems that support overflow profiling and that include
the hardware counter shared library, libcpc.so(3). You
must use a version of the Solaris OS no earlier than
the Solaris 8 OS. On UltraSPARC[R] computers, you must
use a version of the hardware no earlier than the
UltraSPARC III hardware. On computers that do not sup-
port overflow profiling, an attempt to select hardware
counter overflow profiling generates an error.
The counters available depend on the specific CPU pro-
cessor and operating system. Running the collect com-
mand with no arguments prints out a usage message that
contains the names of the counters. The counters that
are considered well-known are displayed first in the
list, followed by a list of the raw hardware counters.
The lines of output are formatted similarly to the fol-
lowing:
Well known HW counters available for profiling:
cycles[/{0|1}],9999991 ('CPU Cycles', alias for Cycle_cnt; CPU-cycles)
insts[/{0|1}],9999991 ('Instructions Executed', alias for Instr_cnt; events)
dcrm[/1],100003 ('D$ Read Misses', alias for DC_rd_miss; load events)
...
Raw HW counters available for profiling:
Cycle_cnt[/{0|1}],1000003 (CPU-cycles)
Instr_cnt[/{0|1}],1000003 (events)
DC_rd[/0],1000003 (load events)
SI_snoop[/0],1000003 (not-program-related events)
...
In the first line of the well-known counter output, the
first field, "cycles", gives the well-known counter
name that can be used in the -h counter... argument. It
is followed by a specification of which registers can
be used for that counter. The next field, "9999991",
is the default overflow value for that counter. The
following field in parentheses, "CPU Cycles", is the
metric name, followed by the raw hardware counter name.
The last field, "CPU-cycles", specifies the type of
units being counted. There can be up to two words for
the type of information. The second or only word of
the type information may be either "CPU-cycles" or
"events". If the counter can be used to provide a
time-based metric, the value is CPU-cycles; otherwise
it is events.
The second output line of the well-known counter output
above has "events" instead of "CPU-cycles" at the end
of the line, indicating that it counts events, and can
not be converted to a time.
The third output line above has two words of type
information, "load events", at the end of the line. The
first word of type information may have the value of
"load", "store", "load-store", or "not-program-
related". The first three of these type values indicate
that the counter is memory-related and the counter name
can be preceded by the "+" sign when used in the col-
lect -h command. The "+" sign indicates the request
for data collection to attempt to find the precise
instruction and virtual address that caused the event
on the counter that overflowed.
The "not-program-related" value indicates that the
counter captures events initiated by some other pro-
gram, such as CPU-to-CPU cache snoops. Using the
counter for profiling generates a warning and profiling
does not record a call stack. It does, however, show
the time being spent in an artificial function called
"collector_not_program_related". Thread IDs and LWP IDs
are recorded, but are meaningless.
The information included in the raw hardware counter
list is a subset of the well-known counter list. Each
line includes the internal counter name as used by cpu-
track(1), the register number(s) on which that counter
can be used, the default overflow value, and the
counter units, which is either CPU-cycles or Events.
EXAMPLES:
Example 1: Using the well-known counter information
listed in the above sample output, the following com-
mand:
collect -h cycles/0,hi,+dcrm,9999
enables CPU Cycles profiling on register 0. The "hi"
value sets the overflow interval to approximately one
tenth of the default value of 9999991, so samples are
taken roughly 10 times more frequently. The "dcrm"
value enables D$ Read Misses profiling on register 1,
and the preceding "+" enables dataspace profiling for
dcrm. The "9999" value causes a sample to be taken
every 9999 read misses, instead of the default of
every 100003 read misses.
Example 2:
Running the collect command with no arguments on an AMD
Opteron machine would produce a raw hardware counter
output similar to the following :
FP_dispatched_fpu_ops[/{0|1|2|3}],1000003 (events)
FP_cycles_no_fpu_ops_retired[/{0|1|2|3}],1000003 (CPU-cycles)
...
Using the above raw hardware counter output, the fol-
lowing command:
collect -h FP_dispatched_fpu_ops~umask=0x3/2,10007
enables tracking of floating-point add and multiply
operations, selected by the ~umask=0x3 attribute, with
one sample captured every 10007 events. (For details
on valid attribute values, refer to the processor
documentation.) The "/2" value specifies that the
counter uses hardware register 2.
Synchronization Delay Tracing
Synchronization delay tracing records all calls to the
various thread synchronization routines where the
real-time delay in the call exceeds a specified thres-
hold. The data packet contains timestamps for entry and
exit to the synchronization routines, the thread ID,
and the LWP ID at the time the request is initiated.
(Synchronization requests from a thread can be ini-
tiated on one LWP, but complete on another.) Synchron-
ization delay tracing is not available on systems run-
ning the Linux OS.
Synchronization delay tracing data is converted into
the following metrics:
Synchronization Delay Events
Synchronization Wait Time
Heap Tracing
Heap tracing records all calls to malloc, free, real-
loc, memalign, and valloc with the size of the block
requested, its address, and for realloc, the previous
address.
Heap tracing data is converted into the following
metrics:
Leaks
Bytes Leaked
Allocations
Bytes Allocated
Leaks are defined as allocations that are not freed.
If a zero-length block is allocated, it counts as an
allocation with zero bytes allocated. If a zero-length
block is not freed, it counts as a leak with zero bytes
leaked.
For applications written in the Java[TM] programming
language, leaks are defined as allocations that have
not been garbage-collected. Heap profiling for such
applications is obsolescent and will not be supported
in future releases.
Heap tracing experiments can be very large, and may be
slow to process.
MPI Tracing
MPI tracing records calls to the MPI library for
functions that can take a significant amount of time to
complete. MPI tracing is not available on systems run-
ning the Linux OS.
The following functions from the MPI library are
traced:
MPI_Allgather
MPI_Allgatherv
MPI_Allreduce
MPI_Alltoall
MPI_Alltoallv
MPI_Barrier
MPI_Bcast
MPI_Bsend
MPI_Gather
MPI_Gatherv
MPI_Irecv
MPI_Isend
MPI_Recv
MPI_Reduce
MPI_Reduce_scatter
MPI_Rsend
MPI_Scan
MPI_Scatter
MPI_Scatterv
MPI_Send
MPI_Sendrecv
MPI_Sendrecv_replace
MPI_Ssend
MPI_Wait
MPI_Waitall
MPI_Waitany
MPI_Waitsome
MPI_Win_fence
MPI_Win_lock
MPI tracing data is converted into the following
metrics:
MPI Time
MPI Sends
MPI Bytes Sent
MPI Receives
MPI Bytes Received
Other MPI Calls
MPI Time is the total LWP time spent in the MPI func-
tion.
The MPI Bytes Received metric uses the actual number of
bytes for blocking calls, but uses the buffer size for
non-blocking calls. Metrics that are computed for col-
lective operations such as gather, scatter, and reduce
have the maximum possible values for these operations.
No reduction in the values is made due to optimization
of the collective operations.
Note that the MPI Bytes Received as reported may seri-
ously overestimate the actual transmissions whenever
the buffer used for a non-blocking receive is signifi-
cantly larger than the size needed for the receive.
Sampling and Global Data
Sampling refers to the process of generating markers
along the time line of execution. At each sample point,
execution statistics are recorded. All of the data
recorded at sample points is global to the program, and
does not map to function-level metrics.
Samples are always taken at the start of the process,
and at its termination. By default, or if a non-zero -S
argument is specified, samples are taken periodically
at the specified interval. In addition, samples can be
taken by using the libcollector(3) API.
The data recorded at each sample point consists of
microstate accounting information from the kernel,
along with various other statistics maintained within
the kernel.
RESTRICTIONS
The Collector interposes on some signal-handling routines to
ensure that its use of SIGPROF signals for clock-based pro-
filing and SIGEMT for hardware counter overflow profiling is
not disrupted by the target program. The Collector library
re-installs its own signal handler if the target program
installs a signal handler. The Collector's signal handler
sets a flag that ensures that system calls are not inter-
rupted to deliver signals. This setting could change the
behavior of the target program.
The Collector interposes on setitimer(2) to ensure that the
profiling timer is not available to the target program if
clock-based profiling is enabled.
The Collector interposes on functions in the hardware
counter library, libcpc.so, so that an application cannot
use hardware counters while the Collector is collecting per-
formance data. The interposed functions return a value of
-1.
Hardware counter profiling, MPI profiling, dataspace profil-
ing, and synchronization tracing are not available on Linux
systems.
The -x option, to leave the target stopped on exit, is not
available on systems running the Linux OS.
For this release, the data from collecting periodic samples
is not reliable on systems running the Linux OS.
For this release, wide data discrepancies are observed when
profiling multithreaded applications on systems running the
RedHat Enterprise Linux OS.
Hardware counter overflow profiling cannot be run on a sys-
tem where cpustat is running, because cpustat takes control
of the counters, and does not let a user process use them.
Java Profiling requires the Java[TM] 2 SDK, version 1.4.2_02
or later.
Data is not collected on descendant processes that are
created to use the setuid attribute, nor on any descendant
processes created with an exec function run on an executable
that is not dynamically linked. Furthermore, subsequent
descendant processes may produce corrupted or unreadable
experiments. The workaround is to ensure that all processes
spawned are dynamically-linked and do not have the setuid
attribute.
Applications that call vfork(2) have these calls replaced by
a call to fork1(2).
SEE ALSO
collector(1), dbx(1), er_archive(1), er_cp(1), er_export(1),
er_mv(1), er_print(1), er_rm(1), libcollector(3), and the
Performance Analyzer manual.