Man Page collect.1




NAME

     collect - command used to collect program performance data


SYNOPSIS

     collect collect-arguments target target-arguments
     collect
     collect -V
     collect -R


DESCRIPTION

     The collect command runs the target process and records per-
     formance  data and global data for the process.  Performance
     data is collected using  profiling  or  tracing  techniques.
     The data can be examined with a GUI program (analyzer) or a
     command-line program (er_print).  The data collection
     software  run  by the collect command is referred to here as
     the Collector.

     The data from a single run of the collect command is  called
     an  experiment.   The  experiment is represented in the file
     system as a directory, with various files inside that direc-
     tory.

     The target is the path name of the executable, Java(TM) .jar
     file, or Java .class file for which you want to collect per-
     formance data.  (For more information about Java  profiling,
     see  JAVA  PROFILING,  below.)  Executables that are targets
     for the collect command can be compiled with  any  level  of
     optimization, but must use dynamic linking.  If a program is
     statically linked, the collect command prints an error  mes-
     sage.   In  order  to see annotated source using analyzer or
     er_print, targets should be compiled with the -g  flag,  and
     should not be stripped.

     In order to enable dataspace profiling, executables must  be
     compiled  with  the  -xhwcprof -xdebugformat=dwarf -g flags.
     These flags are valid only for the C compiler, and  only  on
     SPARC[R]  platforms.  See the section "DATASPACE PROFILING",
     below.

     The collect command uses the following strategy to find  its
     target:

     - If there is a file with the name of  the  target  that  is
       marked  executable, the file is verified as an ELF execut-
       able that can run on the target machine. If  the  file  is
       not  such  a  valid  ELF  executable,  the collect command
       fails.

     - If there is a file with the name of the  target,  and  the
       file is not executable, collect checks whether the file is
       a Java[TM] jar file or class file. If the file is  a  Java
       jar file or class file, the Java[TM] virtual machine (JVM)
       software is inserted as the  target,  with  any  necessary
       flags,  and  data  is collected on that JVM machine.  (The
       terms "Java virtual machine"  and  "JVM"  mean  a  virtual
       machine  for  the  Java[TM] platform.)  See the section on
       "JAVA PROFILING", below.

     - If there is no file with  the  name  of  the  target,  the
       user's  path is searched to find an executable; if an exe-
       cutable is found, it is verified as described above.

      - If no file with that name is found, the command looks for
        a file with the name and the string .class appended; if
        such a file is found, a JVM machine is inserted as the
        target, with the appropriate flags, as above.

     - If none of these procedures can find the target, the  com-
       mand fails.
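
     For example, given the search strategy above, the following
     command runs a dynamically linked executable a.out (an
     illustrative name) from the current directory and records a
     default experiment, named test.1.er unless that name is
     already in use:

          collect a.out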



OPTIONS

     If invoked with no arguments, print a usage summary, includ-
     ing the default configuration of the experiment. If the pro-
     cessor supports hardware counter overflow  profiling,  print
     two  lists  containing  information about hardware counters.
     The first list contains "well known" hardware counters;  the
     second   list  contains  raw  hardware  counters.  For  more
     details, see the "Hardware Counter Overflow Profiling"  sec-
     tion below.

  Data Specifications
     -p option
          Collect clock-based profiling data.  The allowed values
          of option are:

          Value     Meaning

          off       Turn off clock-based profiling

          on        Turn  on  clock-based  profiling   with   the
                    default  profiling  interval of approximately
                    10 milliseconds.

          lo[w]     Turn on clock-based profiling with  the  low-
                    resolution  profiling  interval  of  approxi-
                    mately 100 milliseconds.

          hi[gh]    Turn on clock-based profiling with the  high-
                    resolution  profiling  interval  of  approxi-
                    mately 1 millisecond.

          n         Turn  on   clock-based   profiling   with   a
                    profiling  interval of n.  The value n can be
                    an integer or a floating-point number, with a
                    suffix  of u for values in microseconds, or m
                    for values in milliseconds.  If no suffix  is
                    used, assume the value to be in milliseconds.

                    If the value is smaller than the  clock  pro-
                    filing  minimum, set it to the minimum; if it
                    is not a  multiple  of  the  clock  profiling
                    resolution,  round down to the nearest multi-
                    ple of the clock resolution.  If  it  exceeds
                    the clock profiling maximum, report an error.
                    If it is negative or zero, report  an  error.
                    If  invoked  with  no  arguments,  report the
                    clock-profiling intervals.

          If no  explicit  -p  off  argument  is  given,  and  no
          hardware  counter overflow profiling is specified, turn
          on clock-based profiling.
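
           For example, the following command (a.out is an illus-
           trative target) turns on clock-based profiling with a
           profiling interval of approximately 5 milliseconds:

                collect -p 5m a.out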

     -h ctr_def...[,ctr_n_def]
          Collect hardware counter overflow profiles. The  number
          of  counter definitions, (ctr_def through ctr_n_def) is
          processor-dependent. For example, on an UltraSPARC  III
          system,  up  to  two  counters may be programmed; on an
          Intel Pentium IV with Hyperthreading, up to 18 counters
          are  available.  The  user  can  ascertain  the maximum
           number of hardware counter definitions for profiling
          on  a  target  system,  and  the full list of available
          hardware  counters,  by  running  the  collect  command
          without any arguments.

          This option is now available  on  systems  running  the
          Linux  OS.  The  user is responsible for installing the
          required perfctr patch on the system; that patch can be
          downloaded from:
          http://user.it.uu.se/~mikpe/linux/perfctr/2.6/perfctr-2.6.15.tar.gz
          Instructions for installation are contained within that
          tar file.

          Each counter definition  takes  one  of  the  following
          forms,  depending  on  whether  attributes for hardware
          counters are supported on the processor:

          1. [+]ctr[/reg#][,interval]

          2. [+]ctr[~attr=val]...[~attrN=valN][/reg#][,interval]

          The meanings of the counter definition options  are  as
          follows:

          Value     Meaning
          +         Optional parameter that  can  be  applied  to
                    memory-related  counters.  Causes  collect to
                    collect dataspace  data  by  backtracking  to
                    find the instruction that triggered the over-
                    flow, and to find the  virtual  and  physical
                     addresses of the memory reference.  Back-
                     tracking works only on SPARC processors, and
                     only with counters of type load, store, or
                     load-store, as displayed in the counter list
                     obtained by
                    running   the  collect  command  without  any
                    command-line  arguments.   See  the   section
                    "DATASPACE PROFILING", below.

          ctr       Processor-specific counter name. The user can
                    ascertain  the  list of counter names by run-
                    ning  the  collect   command    without   any
                    command-line arguments.

          attr=val  On some processors, attribute options can  be
                    associated  with  a  hardware counter. If the
                     processor supports attribute options, run-
                     ning collect without any command-line argu-
                     ments shows the counter definition, ctr_def,
                     in the second form listed above, and pro-
                     vides a list of attribute names to use for
                     attr.  The value val can be in decimal or
                     hexadecimal format.  Hexadecimal numbers use
                     C notation, with the number prefixed by a
                     zero and a lowercase x (0xhex_number).

          reg#      Hardware register to use for the counter.  If
                    not  specified, collect attempts to place the
                    counter into the first available register and
                    as  a result, might be unable to place subse-
                    quent counters due to register conflicts.  If
                    the user specifies more than one counter, the
                    counters must use different  registers.   The
                    list  of  allowable  register  numbers can be
                    ascertained by running  the  collect  command
                    without any command-line arguments.

          interval  Sampling  frequency,  set  by  defining   the
                    counter  overflow value.  Valid values are as
                    follows:

                    Value     Meaning

                    on        Select the default rate, which  can
                              be  determined  by running the col-
                              lect command without  any  command-
                              line   arguments.   Note  that  the
                              default value for all raw  counters
                              is  the  same, and might not be the
                              most suitable value for a  specific
                              counter.

                     hi        Set the interval to approximately
                               one tenth of the default (on)
                               interval.

                     lo        Set the interval to approximately
                               10 times the default (on) inter-
                               val.

                    value     Set interval to a  specific  value,
                              specified in decimal or hexadecimal
                              format.

          An experiment can specify both hardware  counter  over-
          flow  profiling and clock-based profiling.  If hardware
          counter overflow profiling  is  specified,  but  clock-
          based  profiling  is not explicitly specified, turn off
          clock-based profiling.

          For more information  on  hardware  counters,  see  the
          "Hardware Counter Overflow Profiling" section below.

     -s option
          Collect synchronization tracing data.  This  option  is
          not available on systems running the Linux OS.

          The minimum delay threshold for tracing events  is  set
          using option.  The allowed values of option are:

          Value     Meaning

          on        Turn on synchronization delay tracing and set
                    the threshold value by calibration at runtime

          calibrate Same as on

          off       Turn off synchronization delay tracing

          n         Turn on synchronization delay tracing with  a
                    threshold  value  of  n microseconds. If n is
                    zero, trace all events.

          all       Turn on  synchronization  delay  tracing  and
                    trace all synchronization events.

          By default, turn off synchronization delay tracing.

          Record synchronization events for  Java  monitors,  but
          not for native synchronization within the JVM machine.
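
           For example, the following command (a.out is an illus-
           trative target) traces only those synchronization
           events whose delay exceeds 100 microseconds:

                collect -s 100 a.out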

     -H option
          Collect heap trace data. The allowed values  of  option
          are:

          Value     Meaning

          on        Turn on tracing of memory allocation requests

          off       Turn  off  tracing   of   memory   allocation
                    requests

          By default, turn off heap tracing.

          Record heap-tracing events for any native calls.  Treat
          calls to mmap as memory allocations.

          Heap profiling is  not  supported  for  Java  programs.
          Specifying it is treated as an error.

          Note that heap tracing may produce a very large experi-
          ment.   Such  experiments  are  very  slow  to load and
          browse.
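
           For example, the following command (a.out is an illus-
           trative target) records memory allocation requests in
           addition to the default clock-based profiles:

                collect -H on a.out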

     -m option
          Collect MPI tracing data. This option is not  available
          on systems running the Linux OS.

          The allowed values of option are:

          Value     Meaning

          on        Turn on tracing of MPI calls

          off       Turn off tracing of MPI calls

          By default, turn off MPI tracing.

     -S interval
          Collect periodic samples at the interval specified  (in
          seconds).   Record  data  samples from the process, and
          include a timestamp and execution statistics  from  the
          kernel,  among  other  things.   The  allowed values of
          interval are:

          Value     Meaning

          off       Turn off periodic sampling

          on        Turn on periodic sampling  with  the  default
                    sampling interval (1 second)

          n         Turn on periodic  sampling  with  a  sampling
                    interval of n in seconds; n must be positive.

          By default, turn on periodic sampling.

          If no data specification arguments are  supplied,  col-
          lect  clock-based  profiling  data,  using  the default
          resolution.

          If clock-based profiling is  explicitly  disabled,  and
          neither  hardware-counter  overflow  profiling  nor any
          kind of tracing is enabled, display a warning  that  no
          function-level  data  is  being collected, then execute
          the target and record global data.

          Note: With this release, data  discrepancies  are  high
                when running the collect command with this option
                on systems running the Linux OS.
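
           For example, the following command (a.out is an illus-
           trative target) records a sample point every 10
           seconds:

                collect -S 10 a.out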

  Experiment Controls
     -L size
          Limit the amount of profiling and tracing data recorded
          to size megabytes.  The limit applies to the sum of all
          profiling data and tracing  data,  but  not  to  sample
          points.  The  limit  is  only  approximate,  and can be
           exceeded.  When the limit is reached, stop recording
           profiling and tracing data, but keep the experiment
           open and record samples until the target process ter-
           minates.
          The allowed values of size are:

          Value     Meaning

          unlimited or none
                    Do not impose a size limit on the experiment

           n         Impose a limit of n megabytes; n must be
                     greater than zero.

          The default limit on the amount  of  data  recorded  is
          2000 Mbytes.
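
           For example, the following command (a.out is an illus-
           trative target) stops recording profiling and tracing
           data once approximately 500 megabytes have been writ-
           ten:

                collect -L 500 a.out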

     -F option
          Control whether or not descendant processes should have
          their data recorded.  The allowed values of option are:

          Value     Meaning

          on        Record experiments  on  descendant  processes
                    from fork and exec

          all       Record   experiments   on   all    descendant
                    processes

          off       Do  not  record  experiments  on   descendant
                    processes

          By default, do not record  descendant  processes.   For
          more  details, users should read the section "FOLLOWING
          DESCENDANT PROCESSES", below.
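
           For example, the following command (a.out is an illus-
           trative target) records an experiment for every des-
           cendant process, including those created through
           system(3C), popen(3C), and similar functions:

                collect -F all a.out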

     -A option
          Control whether or not load objects used by the  target
          process  should be archived or copied into the recorded
          experiment.  The allowed values of option are:

          Value     Meaning

          on        Archive load objects into the experiment.

          off       Do not archive load objects into the  experi-
                    ment.

          copy      Copy and archive load objects into the exper-
                    iment.

          A  user  that  copies  experiments  onto  a   different
          machine,  or  reads  the  experiments  from a different
          machine, should specify -A copy.  Note  that  doing  so
          does  not  copy  any sources or object files; it is the
          user's responsibility to ensure that  those  files  are
          accessible on the machine where the experiment is being
          run.

          The default setting for -A is on.

     -j option
          Control  Java  profiling  when  the  target  is  a  JVM
          machine. The allowed values of option are:

          Value     Meaning

          on        Record profiling data for  the  JVM  machine,
                    and  recognize  methods  compiled by the Java
                    HotSpot[TM] virtual machine, and also  record
                    Java callstacks.

          off       Do not record Java profiling data.

          <path>    Record profiling data for the  JVM,  and  use
                    the JVM as installed in <path>.

          See the section "JAVA PROFILING", below.

          The user must use -j on to obtain profiling data if the
          target  is  a  JVM  machine.   The  -j on option is not
           needed if the target is a class or jar file.  To use
           a 64-bit JVM machine, specify its path explicitly as
           the target; do not use the -d64 option for a 32-bit
           JVM machine.  If the -j on option is specified, but the
          target is not a JVM machine, an invalid  argument might
          be passed to the target, and no data would be recorded.
          The collect command validates the version  of  the  JVM
          machine specified for Java profiling.

     -J java_args
          Specify arguments to be passed to the JVM used for pro-
           filing.  If -J is specified, but Java profiling is not
           specified, an error is generated, and no experiment is
           run.
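
           For example, the following command profiles a Java
           application; the path to the JVM, the jar name, and
           the JVM argument are all illustrative:

                collect -j on -J "-Xmx256m" /usr/java/bin/java -jar MyApp.jar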

     -l signal
          Record a sample point  whenever  the  given  signal  is
          delivered to the process.

     -y signal[,r]
          Control recording of data with  signal.   Whenever  the
          given  signal  is  delivered  to  the  process,  switch
          between paused (no data is recorded) and resumed  (data
          is  recorded) states. Start in the resumed state if the
          optional ,r flag  is  given,  otherwise  start  in  the
          paused  state.  This option does not affect the record-
          ing of sample points.
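
           For example, the following command (the target and
           signal choices are illustrative) starts with recording
           paused, toggles recording whenever SIGUSR1 is
           delivered, and records an extra sample point at each
           SIGUSR2:

                collect -y SIGUSR1 -l SIGUSR2 a.out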

  Output Controls
     -o experiment_name
          Use experiment_name as the name of the experiment to be
          recorded.   The  experiment_name string must end in the
          string .er; if not, print an error message and  do  not
          run the experiment.

           If -o is not specified, give the experiment a name of
           the form stem.n.er, where stem is a string, and n is a
          number.  If a group name has been  specified  with  -g,
          set  stem to the group name without the .erg suffix. If
          no group name has  been  specified,  set  stem  to  the
          string "test".

          If invoked from one of the commands  used  to  run  MPI
          jobs  and -o is not specified, take the value of n used
          in the name  from  the  environment  variable  used  to
          define  the  MPI rank of that process. Otherwise, set n
          to one greater than the highest  integer  currently  in
          use.

          If the name is not specified in the form stem.n.er, and
          the given name is in use, print an error message and do
          not run the experiment.  If the name  is  of  the  form
          stem.n.er  and  the name supplied is in use, record the
          experiment under a name corresponding  to  one  greater
          than  the  highest value of n that is currently in use.
          Print a warning if the name is changed.

     -d directory_name
          Place the experiment in directory  directory_name.   If
          no  directory  is  given,  place  the experiment in the
          current working directory.  If  a  group  is  specified
          (see  -g, below), the group file is also written to the
          directory named by -d.

     -g group_name
          Add the experiment to the experiment group  group_name.
          The  group_name  string must end in the string .erg; if
          not, report an error and do not run the experiment.

     -O file
          Append all output from  collect  itself  to  the  named
          file,  but  do not redirect the output from the spawned
           target.  If file is set to /dev/null, suppress all
           output from collect, including any error messages.
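
           For example, the following command places an experi-
           ment named run1.er in /tmp/mydirectory and adds it to
           the experiment group suite.erg (all names are illus-
           trative):

                collect -o run1.er -d /tmp/mydirectory -g suite.erg a.out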

  Other Arguments
     -C comment
          Put the comment into the notes file for the experiment.
          Up to ten -C arguments may be supplied.

     -n   Dry run: do not run  the  target,  but  print  all  the
          details  of  the experiment that would be run.  Turn on
          -v.

     -R   Display the  text  version  of  the  performance  tools
          README  in  the  terminal  window. If the README is not
          found, print a warning.  Do not examine  further  argu-
          ments and do no further processing.

     -V   Print the current  version.   Do  not  examine  further
          arguments and do no further processing.

     -v   Print the current version and further detailed informa-
          tion about the experiment being run.

     -x   Leave the target process stopped on the exit  from  the
          exec  system  call,  in  order  to  allow a debugger to
          attach to it.

           This option is not available on systems running the
           Linux OS.

           To attach a debugger to the target once it is stopped
           by collect, use the following procedure:

          - Determine the PID of the process

          - Start the debugger

          - Configure the debugger to ignore SIGPROF (and SIGEMT,
            if you chose to collect hardware counter data)

          - Attach to the process using the PID.

          As the process runs under the control of the  debugger,
          the Collector records an experiment.


FOLLOWING DESCENDANT PROCESSES

     Processes can create descendant processes by calling a  sys-
     tem  library  function.  The  Collector can collect data for
     descendant  processes  initiated  by   calls   to   fork(2),
     fork1(2),  fork(3F), vfork(2), and exec(2) and its variants.
     The call to vfork is replaced internally by a call to fork1.
     The  Collector  ignores  calls  to  system(3C),  system(3F),
     sh(3F), popen(3C), and similar functions, and their  associ-
     ated  descendant  processes.  If the -F on argument is used,
     the Collector opens a new  experiment  for  each  descendant
     process  inside the parent experiment. These new experiments
     are named with their lineage as follows:

     - An underscore is  appended  to  the  creator's  experiment
       name.

     - A code letter is added: either "f" for a fork, or "x"  for
       an exec.

      - A number is added after the code letter, which is the
        index of the fork or exec. The number is assigned whether
        or not the process was started successfully.

     For example, if the experiment name for the initial  process
     is  "test.1.er",  the  experiment for the descendant process
     created by its third fork  is  "test.1.er/_f3.er".  If  that
     descendant  process  execs  a  new  image, the corresponding
     experiment name is "test.1.er/_f3_x1.er".

     If the -F all argument is used,  all  descendants  are  pro-
     cessed, including those from system(3C), system(3F), sh(3F),
     popen(3C), and similar functions.   Those  descendants  that
     are  processed by -F all but not by -F on are named with the
     code letter "c".

     The Analyzer and er_print automatically read experiments for
     descendant  processes  when  the founder experiment is read,
     but the experiments for the  descendant  processes  are  not
     selected for data display.

     To select the  data  for  display  from  the  command  line,
     specify  the  path  name  explicitly  to  either er_print or
     Analyzer. The specified path must include the founder exper-
     iment  name, and the descendant experiment's name inside the
     founder directory.

     For example, to see the data  for  the  third  fork  of  the
     test.1.er experiment:
               er_print test.1.er/_f3.er
               analyzer test.1.er/_f3.er

     You can prepare an experiment group file with  the  explicit
     names of descendant experiments of interest.

     To examine descendant processes in the  Analyzer,  load  the
     founder  experiment  and  select "Filter data" from the View
     menu. The analyzer will display a list of  experiments  with
     only  the  founder  experiment  checked. Uncheck the founder
     experiment and check the descendant experiment of interest.


PROFILING MULTITHREADED APPLICATIONS

     The collect command works on all multithreaded applications,
     but the default threads library on the Solaris 8 Operating
     System (Solaris OS), known as T1, has several problems under
     profiling.  It might discard profiling interrupts when no
     thread is scheduled onto an LWP; in such  cases,  the  Total
     LWP  Time  reported may seriously underestimate the true LWP
     time.  Under some circumstances, it may also get a segmenta-
     tion  violation accessing an internal library mutex, causing
     the application to crash.

     On the Solaris 8 OS, the workaround is to use the  alternate
     threads library (known as T2), by prepending /usr/lib/lwp to
     your LD_LIBRARY_PATH setting.   On  the  Solaris  9  OS  and
     later, the default library is T2.
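
     For example, to select the T2 library on the Solaris 8 OS
     before recording an experiment (a.out is an illustrative
     target), using a csh-style shell:

          % setenv LD_LIBRARY_PATH /usr/lib/lwp:${LD_LIBRARY_PATH}
          % collect a.out

     In a Bourne-style shell, set and export LD_LIBRARY_PATH
     instead.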

     The Collector detects the use of T1, and puts a  warning  in
     the experiment.

     While multithreaded profiling is available under Linux,
     large data discrepancies have been observed on RedHat Linux
     systems with this release.


JAVA PROFILING

     Java profiling consists of collecting a performance  experi-
     ment on the JVM machine as it runs the user's .class or .jar
     files.  If possible, callstacks are collected  in  both  the
     Java model and in the machine model.

     Data can be shown with view mode set  to  User,  Expert,  or
     Machine.  User mode shows each method by name, with data for
     interpreted   and   HotSpot-compiled   methods    aggregated
     together; it also suppresses data for non-user-Java threads.
     Expert mode separates HotSpot-compiled methods  from  inter-
     preted methods, and does not suppress non-user Java threads.
     Machine mode shows data for interpreted Java methods against
     the  JVM machine as it does the interpreting, while data for
     methods compiled with the Java HotSpot  virtual  machine  is
     reported  for named methods.  All threads are shown.  In all
     three modes, data is reported in the usual way  for  any  C,
     C++,  or  Fortran  code  called by a Java target.  Such code
     corresponds to Java native methods.  The  Analyzer  and  the
     er_print utility can switch between the view mode User, view
     mode Expert, and view mode  Machine,  with  User  being  the
     default.

     Clock-based profiling and hardware counter overflow  profil-
     ing  are  supported.   Synchronization tracing collects data
     only on the Java monitor calls,  and  synchronization  calls
     from  native  code;  it does not collect data about internal
     synchronization calls within the JVM machine.

     Heap tracing is not supported for Java, and generates an
     error if specified.

     When collect inserts a target name of java into the argument
     list,  it  examines  environment variables for a path to the
     java target, in the order JDK_HOME, and then JAVA_PATH.  For
     the  first  of  these environment variables that is set, the
     resultant target is verified as an ELF executable. If it  is
     not,  collect  fails with an error indicating which environ-
     ment variable was used, and the  full  path  name  that  was
     tried.

     JDK_1_4_HOME is obsolete, and, if set,  is  ignored  with  a
     warning.

     If none of those environment variables is set, the collect
     command uses the default path to the Java[TM] 2 Platform,
     Standard Edition technology installed with the release, if
     any; if none was installed, the path is taken from the
     user's PATH.

     Java Profiling requires the Java[TM] 2 SDK, version 1.4.2_02
     or later.
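
     For example, either of the following commands (file and
     path names are illustrative) collects a Java profiling
     experiment; the first relies on the target-finding rules
     described in the DESCRIPTION section, and the second names
     the JVM explicitly:

          collect MyApp.jar
          collect -j on /usr/java/bin/java -jar MyApp.jar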


OPENMP PROFILING

     Data collection for OpenMP programs collects data  that  can
     be  displayed  in  any  of the three view modes, just as for
     Java programs.  The presentation is identical for user  mode
     and  expert  mode.   Slave threads are shown as if they were
     really forked from the master thread, and have  call  stacks
     matching  the master thread. Frames in the call stack coming
     from the OpenMP runtime code  (libmtsk.so)  are  suppressed.
     For machine mode, the actual native stacks are shown.

     In user mode, various artificial functions are introduced as
     the  leaf  function  of  a  callstack  whenever  the runtime
     library is in one of several states.   These  functions  are
     <OMP-overhead>,     <OMP-idle>,    <OMP-reduction>,    <OMP-
     implicit_barrier>, <OMP-explicit_barrier>,  <OMP-lock_wait>,
     <OMP-critical_section_wait>, and <OMP-ordered_section_wait>.

     Two additional clock-profiling metrics are added to the data
     for  clock-profiling  experiments:  OMP  Work, and OMP Wait.
     The inclusive metrics are visible by default; the  exclusive
     are  not.  Together, the sum of those two metrics equals the
     Total LWP Time metric.  No additional metrics are added  for
     other experiments.


DATASPACE PROFILING

     A dataspace profile is a data collection  in  which  memory-
     related  events,  such as cache misses, are reported against
     the data object references that cause the events rather than
     just the instructions where the memory-related events occur.
     Dataspace profiling is not available on systems running  the
     Linux OS, nor on x86 based systems running the Solaris OS.

     To allow dataspace profiling, the target must be  a  C  pro-
     gram,  compiled  for  SPARC architecture, with the -xhwcprof
     -xdebugformat=dwarf -g flags, as described above.   Further-
     more,  the  data collected must be hardware counter profiles
     and the optional + must be prepended to  the  counter  name.
     If  the  optional  +  is  prepended  to  one  memory-related
     counter, but not all, the counters without the + will report
     dataspace  data against the <Unknown> data object, with sub-
     type (Dataspace data not requested during data collection).
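
     For example, assuming the memory-related dcrm counter from
     the sample counter list shown in the "Hardware Counter Over-
     flow Profiling" section below (file names are illustrative),
     the following commands compile a C program for dataspace
     profiling and then collect memory-related counter data with
     backtracking enabled:

          cc -xhwcprof -xdebugformat=dwarf -g -o myprog myprog.c
          collect -h +dcrm,on myprog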

     With the data collected, the er_print utility  allows  three
     additional   commands:    data_objects,   data_single,   and
     data_layout, as well as various commands relating to  Memory
     Objects.  See the er_print(1) man page for more information.

     In addition, the Analyzer now includes two tabs  related  to
     dataspace  profiling, labeled DataObjects and DataLayout, as
     well as a set of tabs relating to Memory Objects.   See  the
     analyzer(1) man page for more information.


USING COLLECT WITH MPI

     The collect command can be used with MPI by simply prefacing
     the  target  and  its arguments with the collect command and
     its arguments in the command line that starts the  MPI  job.
     For example, on an SMP machine,
          % mprun -np 16 a.out 3 5
     can be replaced by
          % mprun -np 16 collect -m  on  -d  /tmp/mydirectory  -g
          run1.erg a.out 3 5
     This command runs an MPI tracing experiment on each  of  the
     16  MPI  processes, collecting them all in a specific direc-
     tory, and collecting them as a group.  The individual exper-
     iments  are  named by the MPI rank, as described above under
     the -o option.  The experiments, as specified above, contain
     clock-based  profiling  data, which is turned on by default,
     and MPI tracing data.

     On a cluster, local file systems like /tmp may be private to
     a  node.   If experiments are collected on node-private file
     systems, you should gather those experiments to  a  globally
     visible  file  system  after the experiments have completed,
     and edit any group file to reflect the new location of those
     experiments.


USING COLLECT WITH PPGSZ

     The collect command can be used with ppgsz  by  running  the
     collect  command on the ppgsz command, and specifying the -F
     on flag.  The founder experiment is on the ppgsz  executable
     and is uninteresting.  If your path finds the 32-bit version
     of ppgsz, and the experiment is being run on a  system  that
     supports  64-bit processes, the first thing the collect com-
     mand does is execute an exec function on its 64-bit version,
     creating _x1.er.  That executable forks, creating _x1_f1.er.
     The descendant process attempts to execute an exec  function
     on  the  named  target, in the first directory on your path,
     then in the second, and so forth,  until  one  of  the  exec
     functions  succeeds.   If,  for  example,  the third attempt
     succeeds, the first two  descendant  experiments  are  named
     _x1_f1_x1.er  and  _x1_f1_x2.er,  and  both  are  completely
     empty.  The experiment on the target is  the  one  from  the
     successful  exec, the third one in the example, and is named
     _x1_f1_x3.er, stored under the founder experiment.   It  can
     be  processed  directly  by  invoking  the  Analyzer  or the
     er_print utility on test.1.er/_x1_f1_x3.er.

     If the 64-bit ppgsz is the initial process run,  or  if  the
     32-bit ppgsz is invoked on a 32-bit kernel, the fork descen-
     dant that executes exec on the real target has its  data  in
     _f1.er,  and  the  real target's experiment is in _f1_x3.er,
     assuming the same path properties as in the example above.

     See the section  "FOLLOWING  DESCENDANT  PROCESSES",  above.
     For more information on hardware counters, see the "Hardware
     Counter Overflow Profiling" section below.


USING COLLECT ON SETUID TARGETS

     The collect command operates by inserting a shared library,
     libcollector.so,    into    the   target's   address   space
     (LD_PRELOAD),  and  by  using  a  second   shared   library,
     collaudit.so,  to  record shared-object use with the runtime
     linker's  audit  interface  (LD_AUDIT).   Those  two  shared
     libraries write the files that constitute the experiment.

     Several problems may arise if collect is invoked on  execut-
     ables  that call setuid or setgid, or that create descendant
     processes that call setuid or setgid.  If the  user  running
     the  experiment  is  not  root, collection fails because the
     shared libraries are not installed in a  trusted  directory.
     The workaround is to run the experiments as root.

     In addition, the umask for the user running the collect com-
     mand  must  be  set to allow write permission for that user,
     and for any users or groups that are set by the setuid or
     setgid attributes of a program on which exec is being run,
     and for any user or group to which such a program sets
     itself.  If the mask is not set properly, some files
     may not be written to the experiment, and processing of  the
     experiment is not possible.  If the log file can be written,
     an error is shown when the  user  attempts  to  process  the
     experiment.
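
     For example, commands like the following (a.out is an illus-
     trative target) relax the umask before recording; a setuid
     target that switches to another user or group may need a
     still more permissive mask so that the experiment files
     remain writable:

          % umask 002
          % collect a.out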

     Other problems can arise if the target itself makes any of
     the system calls to set UID or GID, if it changes its umask
     and then forks or runs exec on some other process, or if
     crle was used to configure how the runtime linker searches
     for shared objects.


DATA COLLECTED

     Three types of data are collected: profiling  data,  tracing
     data and sampling data. The data packets recorded in profil-
     ing and tracing include the callstack of each LWP, the  LWP,
     thread  and  CPU IDs, and some event-specific data. The data
     packets recorded in sampling contain  global  data  such  as
     execution  statistics,  but  no  program-specific  or event-
     specific data. All data packets include a timestamp.

     Clock-based Profiling
          The event-specific data recorded in clock-based profil-
          ing  is  an  array of counts for each accounting micro-
          state. The microstate array is incremented by the  sys-
          tem  at  a prescribed frequency, and is recorded by the
          Collector when a profiling signal is processed.

          Clock-based profiling can run at a range of frequencies
          which  must  be  multiples of the clock resolution used
          for the profiling timer. In the Solaris 7 OS  and  some
          updates  of the Solaris 8 OS, the system clock is used.
          To do high-resolution profiling on these  systems,  the
          operating  system on the machine must be running with a
          high-resolution clock routine, which  can  be  done  by
          putting  the following line in the file /etc/system and
          rebooting:

               set hires_tick=1

          If you try to do high-resolution profiling on a machine
          with  an operating system that does not support it, the
          command prints a warning message and uses  the  highest
          resolution  supported. Similarly, a custom setting that
          is not a multiple of the resolution  supported  by  the
          system is rounded down to the nearest non-zero multiple
          of that resolution, and a warning message is printed.

          Clock-based profiling data is converted into  the  fol-
          lowing metrics:

               User CPU Time
               Wall Time
               Total LWP Time
               System CPU Time
               Wait CPU Time
               User Lock Time
               Text Page Fault Time
               Data Page Fault Time
               Other Wait Time

          For experiments on multithreaded applications,  all  of
          the  times, other than Wall Time, are summed across all
          LWPs in the process;  Wall Time is the  time  spent  in
          all  states  for LWP 1 only.  Total LWP Time adds up to
          the real elapsed time, multiplied by the average number
          of LWPs in the process.


     Hardware Counter Overflow Profiling
          Hardware counter overflow profiling records the  number
          of  events  counted by the hardware counter at the time
          the overflow signal was processed. This type of profil-
          ing  is  now available on systems running the Linux OS,
          provided that they have the Perfctr patch installed.

          Hardware counter overflow profiling can be done on sys-
          tems  that  support overflow profiling and that include
          the hardware counter shared library, libcpc.so(3).  You
           must use a version of the Solaris OS no earlier than
           the Solaris 8 OS.  On UltraSPARC[R] computers, you
           must use UltraSPARC III or later hardware.  On comput-
           ers that do not support overflow profiling, an attempt
           to select hardware counter overflow profiling gen-
           erates an error.

           The counters available depend on the specific proces-
           sor and operating system.  Running the collect com-
          mand with no arguments prints out a usage message  that
          contains  the names of the counters.  The counters that
          are considered well-known are displayed  first  in  the
          list, followed by a list of the raw hardware counters.

           The lines of output are formatted similarly to the
           following:

            Well known HW counters available for profiling:
              cycles[/{0|1}],9999991 ('CPU Cycles', alias for Cycle_cnt; CPU-cycles)
              insts[/{0|1}],9999991 ('Instructions Executed', alias for Instr_cnt; events)
              dcrm[/1],100003 ('D$ Read Misses', alias for DC_rd_miss; load events)
              ...
            Raw HW counters available for profiling:
              Cycle_cnt[/{0|1}],1000003 (CPU-cycles)
              Instr_cnt[/{0|1}],1000003 (events)
              DC_rd[/0],1000003 (load events)
              SI_snoop[/0],1000003 (not-program-related events)
              ...

          In the first line of the well-known counter output, the
          first  field,  "cycles",  gives  the well-known counter
          name that can be used in the -h counter... argument. It
          is  followed  by a specification of which registers can
          be used for that counter.  The next  field,  "9999991",
          is  the  default  overflow value for that counter.  The
          following field in parentheses, "CPU  Cycles",  is  the
          metric name, followed by the raw hardware counter name.
           The last field, "CPU-cycles", specifies the type of
          units  being counted.  There can be up to two words for
          the type of information.  The second or  only  word  of
          the  type  information  may  be  either "CPU-cycles" or
          "events".  If the counter can  be  used  to  provide  a
          time-based  metric,  the value is CPU-cycles; otherwise
          it is events.

          The second output line of the well-known counter output
          above  has  "events" instead of "CPU-cycles" at the end
          of the line, indicating that it counts events, and  can
          not be converted to a time.

          The third output line  above  has  two  words  of  type
          information, "load events", at the end of the line. The
          first word of type information  may have the  value  of
          "load",   "store",   "load-store",   or   "not-program-
          related". The first three of these type values indicate
          that the counter is memory-related and the counter name
          can be preceded by the "+" sign when used in the   col-
          lect  -h   command.  The "+" sign indicates the request
          for data collection to  attempt  to  find  the  precise
          instruction  and  virtual address that caused the event
          on the counter that overflowed.

          The  "not-program-related"  value  indicates  that  the
          counter  captures  events  initiated by some other pro-
          gram,  such  as  CPU-to-CPU  cache  snoops.  Using  the
          counter for profiling generates a warning and profiling
          does not record a call stack. It  does,  however,  show
          the  time  being spent in an artificial function called
          "collector_not_program_related". Thread IDs and LWP IDs
          are recorded, but are meaningless.

          The information included in the  raw  hardware  counter
          list  is  a subset of the well-known counter list. Each
          line includes the internal counter name as used by cpu-
          track(1),  the register number(s) on which that counter
           can be used, the default overflow value, and the
           counter units, which are either CPU-cycles or events.

          EXAMPLES:

          Example 1: Using  the  well-known  counter  information
          listed  in  the above sample output, the following com-
          mand:

               collect -h cycles/0,hi,+dcrm,9999

           enables CPU Cycles profiling on register 0.  The "hi"
           value selects an overflow interval approximately 10
           times shorter than the default of 9999991 counts.  The
           "dcrm" value enables D$ Read Miss profiling on regis-
           ter 1, and the preceding "+" enables dataspace profil-
           ing for dcrm.  The "9999" value sets the sampling to
           be done every 9999 read misses, instead of the default
           of every 100003 read misses.

          Example 2:

           Running the collect command with no arguments on an
           AMD Opteron machine would produce raw hardware counter
           output similar to the following:

                FP_dispatched_fpu_ops[/{0|1|2|3}],1000003 (events)
                FP_cycles_no_fpu_ops_retired[/{0|1|2|3}],1000003 (CPU-cycles)
                ...

          Using the above raw hardware counter output,  the  fol-
          lowing command:

            collect -h FP_dispatched_fpu_ops~umask=0x3/2,10007

           enables tracking of the Floating Point Add and Multi-
           ply operations, captured once every 10007 events.
           (For more details on valid attribute values, refer to
           the processor documentation.)  The "/2" value speci-
           fies that hardware register 2 is used for the counter,
           and the "10007" value sets the counter overflow inter-
           val to 10007 events.

     Synchronization Delay Tracing
          Synchronization delay tracing records all calls to  the
          various   thread  synchronization  routines  where  the
          real-time delay in the call exceeds a specified  thres-
          hold. The data packet contains timestamps for entry and
          exit to the synchronization routines,  the  thread  ID,
          and  the  LWP  ID at the time the request is initiated.
          (Synchronization requests from a  thread  can  be  ini-
          tiated on one LWP, but complete on another.)  Synchron-
          ization delay tracing is not available on systems  run-
          ning the Linux OS.

          Synchronization delay tracing data  is  converted  into
          the following metrics:

               Synchronization Delay Events
               Synchronization Wait Time

     Heap Tracing
          Heap tracing records all calls to malloc,  free,  real-
          loc,  memalign,  and  valloc with the size of the block
          requested, its address, and for realloc,  the  previous
          address.

          Heap tracing  data  is  converted  into  the  following
          metrics:

               Leaks
               Bytes Leaked
               Allocations
               Bytes Allocated

          Leaks are defined as allocations that  are  not  freed.
          If  a  zero-length  block is allocated, it counts as an
          allocation with zero bytes allocated. If a  zero-length
          block is not freed, it counts as a leak with zero bytes
          leaked.

          For applications written in  the  Java[TM]  programming
          language,  leaks  are  defined as allocations that have
          not been garbage-collected.  Heap  profiling  for  such
          applications  is  obsolescent and will not be supported
          in future releases.

          Heap tracing experiments can be very large, and may  be
          slow to process.

     MPI Tracing
          MPI tracing  records  calls  to  the  MPI  library  for
          functions that can take a significant amount of time to
          complete. MPI tracing is not available on systems  run-
          ning the Linux OS.

          The  following  functions  from  the  MPI  library  are
          traced:

               MPI_Allgather
               MPI_Allgatherv
               MPI_Allreduce
               MPI_Alltoall
               MPI_Alltoallv
               MPI_Barrier
               MPI_Bcast
               MPI_Bsend
               MPI_Gather
               MPI_Gatherv
               MPI_Irecv
               MPI_Isend
               MPI_Recv
               MPI_Reduce
               MPI_Reduce_scatter
               MPI_Rsend
               MPI_Scan
               MPI_Scatter
               MPI_Scatterv
               MPI_Send
               MPI_Sendrecv
               MPI_Sendrecv_replace
               MPI_Ssend
               MPI_Wait
               MPI_Waitall
               MPI_Waitany
               MPI_Waitsome
               MPI_Win_fence
               MPI_Win_lock

          MPI  tracing  data  is  converted  into  the  following
          metrics:

               MPI Time
               MPI Sends
               MPI Bytes Sent
               MPI Receives
               MPI Bytes Received
               Other MPI Calls

           MPI Time is the total LWP time spent in the MPI func-
           tions.

          The MPI Bytes Received metric uses the actual number of
          bytes  for blocking calls, but uses the buffer size for
          non-blocking calls. Metrics that are computed for  col-
          lective  operations such as gather, scatter, and reduce
          have the maximum possible values for these  operations.
          No  reduction in the values is made due to optimization
          of the collective operations.

          Note that the MPI Bytes Received as reported may  seri-
          ously  overestimate  the  actual transmissions whenever
          the buffer used for a non-blocking receive is  signifi-
          cantly larger than the size needed for the receive.


     Sampling and Global Data
          Sampling refers to the process  of  generating  markers
          along the time line of execution. At each sample point,
          execution statistics are  recorded.  All  of  the  data
          recorded at sample points is global to the program, and
          does not map to function-level metrics.

          Samples are always taken at the start of  the  process,
          and  at its termination. By default or if a non-zero -S
          argument is specified, samples are  taken  periodically
          at the specified interval.  In addition, samples can be
          taken by using the libcollector(3) API.

          The data recorded at  each  sample  point  consists  of
          microstate  accounting  information  from  the  kernel,
          along with various other statistics  maintained  within
          the kernel.


RESTRICTIONS

     The Collector interposes on some signal-handling routines to
     ensure  that its use of SIGPROF signals for clock-based pro-
     filing and SIGEMT for hardware counter overflow profiling is
     not  disrupted by the target program.  The Collector library
     re-installs its own signal handler  if  the  target  program
     installs  a  signal  handler. The Collector's signal handler
     sets a flag that ensures that system calls  are  not  inter-
     rupted  to  deliver  signals.  This setting could change the
     behavior of the target program.

     The Collector interposes on setitimer(2) to ensure that  the
     profiling  timer  is  not available to the target program if
     clock-based profiling is enabled.

     The  Collector  interposes  on  functions  in  the  hardware
     counter  library,  libcpc.so,  so that an application cannot
     use hardware counters while the Collector is collecting per-
     formance  data.  The  interposed functions return a value of
     -1.


     Hardware counter profiling, MPI profiling, dataspace profil-
     ing,  and synchronization tracing are not available on Linux
     systems.

     The -x option, to leave the target stopped on exit,  is  not
     available on systems running the Linux OS.

     For this release, the data from collecting periodic  samples
     is not reliable on systems running the Linux OS.

     For this release, wide data discrepancies are observed  when
     profiling  multithreaded applications on systems running the
     RedHat Enterprise Linux OS.

     Hardware counter overflow profiling cannot be run on a  sys-
     tem  where cpustat is running, because cpustat takes control
     of the counters, and does not let a user process use them.

     Java Profiling requires the Java[TM] 2 SDK, version 1.4.2_02
     or later.

     Data is not collected on descendant processes that are
     created to use the setuid attribute, nor on any descendant
     processes created with an exec function run on an executable
     that is not dynamically linked.  Furthermore, subsequent
     descendant processes may produce corrupted or unreadable
     experiments.  The workaround is to ensure that all spawned
     processes are dynamically linked and do not have the setuid
     attribute.

     Applications that call vfork(2) have these calls replaced by
     a call to fork1(2).


SEE ALSO

     collector(1), dbx(1), er_archive(1), er_cp(1), er_export(1),
     er_mv(1),  er_print(1),  er_rm(1),  libcollector(3), and the
     Performance Analyzer manual.