mirror of
https://sourceware.org/git/binutils-gdb.git
synced 2024-11-21 01:12:32 +08:00
d25ba4596e
gprofng application names have a gp- prefix (gp-display-text, gp-archive, etc.). But our man pages use the gprofng- prefix (gprofng-display-text, gprofng-archive, etc.). I renamed the gprofng binaries and temporarily created the gp-* links for compatibility with the old gprofng-gui. We plan to remove these links in version 2.46. gprofng/ChangeLog 2024-10-31 Vladimir Mezentsev <vladimir.mezentsev@oracle.com> * doc/gprofng-archive.texi: Rename gprofng application names. * doc/gprofng-collect-app.texi: Likewise. * doc/gprofng-display-html.texi: Likewise. * doc/gprofng-display-src.texi: Likewise. * doc/gprofng-display-text.texi: Likewise. * doc/gprofng.texi: Likewise. * doc/gprofng_ug.texi: Likewise. * gp-display-html/Makefile.am: Likewise. * gp-display-html/gp-display-html.in: Likewise. * libcollector/collector.c: Likewise. * src/Application.cc: Likewise. * src/Experiment.cc: Likewise. * src/Makefile.am: Likewise. * src/gp-archive.cc: Likewise. * src/gp-collect-app.cc: Likewise. * src/gp-display-src.cc: Likewise. * src/gp-display-text.cc: Likewise. * src/gprofng.cc: Likewise. * src/Makefile.in: Rebuild. * gp-display-html/Makefile.in: Rebuild.
4445 lines
161 KiB
Plaintext
4445 lines
161 KiB
Plaintext
\input texinfo @c -*-texinfo-*-
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@c This is the Texinfo source file for the GPROFNG manual. This manual
|
|
@c includes the man pages for the various tools.
|
|
@c
|
|
@c Author: Ruud van der Pas
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@c %**start of header
|
|
|
|
@setfilename gprofng.info
|
|
@settitle GNU gprofng
|
|
|
|
@c -- Set the indent for the @example command to 1 space, not 5 ---------------
|
|
@exampleindent 1
|
|
|
|
@paragraphindent 3
|
|
|
|
@c %**end of header
|
|
|
|
@c -- Start a new chapter on a new, odd numbered, page ------------------------
|
|
@setchapternewpage odd
|
|
|
|
@c -- Merge all index entries into the Concepts Index -------------------------
|
|
@syncodeindex fn cp
|
|
@syncodeindex ky cp
|
|
@syncodeindex pg cp
|
|
@syncodeindex vr cp
|
|
|
|
@c -- Macros specific to gprofng ----------------------------------------------
|
|
@include gp-macros.texi
|
|
|
|
@c -- Get the version information ---------------------------------------------
|
|
@include version.texi
|
|
|
|
@c -- Entry for the Info dir structure ----------------------------------------
|
|
@ifnottex
|
|
@dircategory Software development
|
|
@direntry
|
|
* gprofng: (gprofng). The next generation profiling tool for Linux
|
|
@end direntry
|
|
@end ifnottex
|
|
|
|
@c -- Copyright stuff ---------------------------------------------------------
|
|
@copying
|
|
This document is the manual for @ProductName{}, last updated @value{UPDATED}.
|
|
|
|
Copyright @copyright{} 2022-2024 Free Software Foundation, Inc.
|
|
|
|
@c -- @quotation
|
|
Permission is granted to copy, distribute and/or modify this document
|
|
under the terms of the GNU Free Documentation License,
|
|
Version 1.3 or any later version published by the Free Software
|
|
Foundation; with no Invariant Sections, with no Front-Cover texts,
|
|
and with no Back-Cover Texts. A copy of the license is included in the
|
|
section entitled ``GNU Free Documentation License.''
|
|
|
|
@c -- @end quotation
|
|
@end copying
|
|
|
|
@finalout
|
|
@smallbook
|
|
|
|
@c -- Define the title page ---------------------------------------------------
|
|
@titlepage
|
|
@title GNU gprofng
|
|
@subtitle The next generation profiling tool for Linux
|
|
@subtitle version @value{VERSION} (last updated @value{UPDATED})
|
|
@author Ruud van der Pas
|
|
@page
|
|
@vskip 0pt plus 1filll
|
|
@insertcopying
|
|
@end titlepage
|
|
|
|
@c -- Generate the Table of Contents ------------------------------------------
|
|
@contents
|
|
|
|
@c -- The Top node ------------------------------------------------------------
|
|
@c Should contain a short summary, copying permissions and a master menu.
|
|
@c ----------------------------------------------------------------------------
|
|
@ifnottex
|
|
@node Top
|
|
@top GNU Gprofng
|
|
|
|
@insertcopying
|
|
@end ifnottex
|
|
|
|
@ifinfo
|
|
@c -- The menu entries --------------------------------------------------------
|
|
|
|
@c * Display Source Code:: Display the source code and disassembly.
|
|
@c * Archive Experiment Data:: Archive an experiment.
|
|
|
|
@menu
|
|
* Introduction:: About this manual.
|
|
* Overview:: A brief overview of @ProductName{}.
|
|
* A Mini Tutorial:: A short tutorial with the key features.
|
|
* The gprofng Tools:: An overview of the tools supported.
|
|
* Performance Data Collection:: Record the performance information.
|
|
* View the Performance Information:: Different ways to view the data.
|
|
* Terminology:: Concepts and terminology explained.
|
|
* Other Document Formats:: Create this document in other formats.
|
|
* The gprofng Man Pages:: The gprofng man pages.
|
|
* Index:: The index.
|
|
|
|
@detailmenu
|
|
|
|
--- The Detailed Node Listing ---
|
|
|
|
Introduction
|
|
|
|
Overview
|
|
|
|
* Main Features:: A high level overview.
|
|
* Sampling versus Tracing:: The pros and cons of both approaches.
|
|
* Steps Needed to Create a Profile:: How to create a profile.
|
|
|
|
A Mini Tutorial
|
|
|
|
* Getting Started:: The basics of profiling with gprofng.
|
|
* Support for Multithreading:: Commands for multithreaded applications.
|
|
* View Multiple Experiments:: Analyze multiple experiments.
|
|
* Profile Hardware Event Counters:: How to use hardware event counters.
|
|
* Java Profiling:: How to profile a Java application.
|
|
|
|
The gprofng Tools
|
|
|
|
* Tools Overview:: A brief description of the tools.
|
|
* The gprofng.rc file with default settings:: Customize the settings.
|
|
* Filters:: Filters.
|
|
* Supported Environment Variables:: The supported environment variables.
|
|
|
|
Performance Data Collection
|
|
|
|
* The @command{gprofng collect app} Tool:: Collect application performance data.
|
|
|
|
View the Performance Information
|
|
|
|
* The @command{gprofng display text} Tool:: View the performance data in plain text.
|
|
|
|
Terminology
|
|
|
|
* The Program Counter:: What is a Program Counter?
|
|
* Inclusive and Exclusive Metrics:: An explanation of inclusive and exclusive metrics.
|
|
* Metric Definitions:: Definitions associated with metrics.
|
|
* The Viewmode:: Select the way call stacks are presented.
|
|
* The Selection List:: How to define a selection.
|
|
* Load Objects and Functions:: The components in an application.
|
|
* The Concept of a CPU in gprofng:: The definition of a CPU.
|
|
* Hardware Event Counters Explained:: What are event counters?
|
|
* apath:: Our generic definition of a path.
|
|
|
|
The gprofng Man Pages
|
|
|
|
* Man page for @command{gprofng collect app}:: The man page for gprofng collect app.
|
|
* Man page for @command{gprofng display text}:: The man page for gprofng display text.
|
|
* Man page for @command{gprofng display html}:: The man page for gprofng display html.
|
|
* Man page for @command{gprofng display src}:: The man page for gprofng display src.
|
|
* Man page for @command{gprofng archive}:: The man page for gprofng archive.
|
|
|
|
@c -- Index
|
|
|
|
@end detailmenu
|
|
@end menu
|
|
@end ifinfo
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Introduction
|
|
@chapter Introduction
|
|
@c ----------------------------------------------------------------------------
|
|
The @ProductName{} tool is the next generation profiler for Linux. It consists
|
|
of various commands to generate and display profile information.
|
|
|
|
This manual starts with a tutorial how to create and interpret a profile. This
|
|
part is highly practical and has the goal to get users up to speed as quickly
|
|
as possible. As soon as possible, we would like to show you how to get your
|
|
first profile on your screen.
|
|
|
|
This is followed by more examples, covering many of the features. At the
|
|
end of this tutorial, you should feel confident enough to tackle the more
|
|
complex tasks.
|
|
|
|
In a future update a more formal reference manual will be included as well.
|
|
Since even in this tutorial we use certain terminology, we have included a
|
|
chapter with descriptions at the end. In case you encounter unfamiliar
|
|
wordings or terminology, please check this chapter.
|
|
|
|
One word of caution. In several cases we had to somewhat tweak the screen
|
|
output in order to make it fit. This is why the output may look somewhat
|
|
different when you try things yourself.
|
|
|
|
For now, we wish you a smooth profiling experience with @ProductName{} and
|
|
good luck tackling performance bottlenecks.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@c cccccc @node A Brief Overview of @ProductName{}
|
|
@node Overview
|
|
@chapter A Brief Overview of gprofng
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@menu
|
|
* Main Features:: A high level overview.
|
|
* Sampling versus Tracing:: The pros and cons of sampling versus tracing.
|
|
* Steps Needed to Create a Profile:: How to create a profile.
|
|
@end menu
|
|
|
|
Before we cover this tool in quite some detail, we start with a brief overview
|
|
of what it is, and the main features. Since we know that many of you would
|
|
like to get started rightaway, already in this first chapter we explain the
|
|
basics of profiling with @ToolName{}.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@c TBD Review this text. Probably be more specific on the gcc releases and
|
|
@c processor specifics.
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Main Features
|
|
@section Main Features
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@noindent
|
|
These are the main features of the @ProductName{} tool:
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
Profiling is supported for an application written in C, C++, Java, or Scala.
|
|
|
|
@c TBD Java: up to 1.8 full support, support other than for modules
|
|
|
|
@item
|
|
Shared libraries are supported. The information is presented at the
|
|
instruction level.
|
|
|
|
@item
|
|
The following multithreading programming models are supported: Pthreads,
|
|
OpenMP, and Java threads.
|
|
|
|
@item
|
|
This tool works with unmodified production level executables. There is no need
|
|
to recompile the code, but if the @samp{-g} option has been used when building
|
|
the application, source line level information is available.
|
|
|
|
@item
|
|
The focus is on support for code generated with the @command{gcc} compiler, but
|
|
there is some limited support for the @command{icc} compiler as well. Future
|
|
improvements and enhancements will focus on @command{gcc} though.
|
|
|
|
@item
|
|
Processors from Intel, AMD, and Arm are supported, but the level of support
|
|
depends on the architectural details. In particular, hardware event counters
|
|
may not be supported. If this is the case, all views not related to these
|
|
counters still ought to work though.
|
|
|
|
@item
|
|
Several views into the data are supported. For example, a function overview
|
|
where the time is spent, but also a source line, disassembly, call tree and
|
|
a caller-callees overview are available.
|
|
|
|
@item
|
|
Through filters, the user can zoom in on an area of interest.
|
|
|
|
@item
|
|
Two or more profiles can be aggregated, or used in a comparison. This
|
|
comparison can be obtained at the function, source line, and disassembly
|
|
level.
|
|
|
|
@item
|
|
Through a simple scripting language, and customization of the metrics shown,
|
|
the generation and creation of a profile can be fully automated and provide
|
|
tailored output.
|
|
|
|
@end itemize
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Sampling versus Tracing
|
|
@section Sampling versus Tracing
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
A key difference with some other profiling tools is that the main data
|
|
collection command @CollectApp{} mostly uses
|
|
@cindex Program Counter sampling
|
|
@cindex PC sampling
|
|
Program Counter (PC) sampling
|
|
under the hood.
|
|
|
|
With @emph{sampling}, the executable is interrupted at regular intervals. Each
|
|
time it is halted, key information is gathered and stored. This includes the
|
|
Program Counter that keeps track of where the execution is. Hence the name.
|
|
|
|
Together with operational data, this information is stored in the experiment
|
|
directory and can be viewed in the second phase.
|
|
|
|
For example, the PC information is used to derive where the program was when
|
|
it was halted. Since the sampling interval is known, it is relatively easy to
|
|
derive how much time was spent in the various parts of the program.
|
|
|
|
The opposite technique is generally referred to as @emph{tracing}. With
|
|
tracing, the target is instrumented with specific calls that collect the
|
|
requested information.
|
|
|
|
These are some of the pros and cons of PC sampling verus tracing:
|
|
|
|
@itemize
|
|
|
|
@item
|
|
Since there is no need to recompile, existing executables can be used
|
|
and the profile measures the behaviour of exactly the same executable that is
|
|
used in production runs.
|
|
|
|
With sampling, one inherently profiles a different executable, because
|
|
the calls to the instrumentation library may affect the compiler optimizations
|
|
and run time behaviour.
|
|
|
|
@item
|
|
With sampling, there are very few restrictions on what can be profiled and
|
|
even without access to the source code, a basic profile can be made.
|
|
|
|
@item
|
|
A downside of sampling is that, depending on the sampling frequency, small
|
|
functions may be missed or not captured accurately. Although this is rare,
|
|
this may happen and is the reason why the user has control over the sampling
|
|
rate.
|
|
|
|
@item
|
|
While tracing produces precise information, sampling is statistical in nature.
|
|
As a result, small variations may occur across seemingly identical runs. We
|
|
have not observed more than a few percent deviation though. Especially if
|
|
the target job executed for a sufficiently long time.
|
|
|
|
@item
|
|
With sampling, it is not possible to get an accurate count how often
|
|
functions are called.
|
|
|
|
@end itemize
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Steps Needed to Create a Profile
|
|
@section Steps Needed to Create a Profile
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
Creating a profile takes two steps. First the profile data needs to be
|
|
generated. This is followed by a viewing step to create a report from the
|
|
information that has been gathered.
|
|
|
|
Every @ProductName{} command starts with @ToolName{}, the name of the driver.
|
|
This is followed by a keyword to define the high level functionality.
|
|
Depending on this keyword, a third qualifier may be needed to further narrow
|
|
down the request.
|
|
This combination is then followed by options that are specific to the
|
|
functionality desired.
|
|
|
|
The command to gather, or ``collect'', the performance data is called
|
|
@CollectApp{}. Aside from numerous options, this command takes the name
|
|
of the target executable as an input parameter.
|
|
|
|
Upon completion of the run, the performance data can be
|
|
found in the newly created
|
|
@cindex Experiment directory
|
|
experiment directory.
|
|
|
|
Unless explicitly specified otherwise, a default
|
|
name for this directory is chosen. The name is @file{test.<n>.er} where
|
|
@var{<n>} is the first integer number not in use yet for such a name.
|
|
|
|
For example, the first time @CollectApp{} is invoked, an experiment
|
|
directory with the name @file{test.1.er} is created.
|
|
Upon a subsequent invocation of @CollectApp{} in the same directory,
|
|
an experiment directory with the name @file{test.2.er} will be created,
|
|
and so forth.
|
|
|
|
Note that @CollectApp{} supports an option to explicitly name the experiment
|
|
directory.
|
|
Aside from the restriction that the name of this directory has to end
|
|
with @samp{.er}, any valid directory name can be used for this.
|
|
|
|
Now that we have the performance data, the next step is to display it.
|
|
|
|
@IndexSubentry{@code{gprofng}, @code{display text}}
|
|
The most commonly used command to view the performance information is
|
|
@DisplayText{}. This is a very extensive and customizable tool that
|
|
produces the information in ASCII format.
|
|
|
|
@IndexSubentry{@code{gprofng}, @code{display html}}
|
|
Another option is to use @DisplayHTML{}. This tool generates a directory with
|
|
files in html format. These can be viewed in a browser, allowing for easy
|
|
navigation through the profile data.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node A Mini Tutorial
|
|
@chapter A Mini Tutorial
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
In this chapter we present and discuss the main functionality of @ToolName{}.
|
|
This will be a practical approach, using an example code to generate profile
|
|
data and show how to get various performance reports.
|
|
|
|
@menu
|
|
* Getting Started:: The basics of profiling with @ProductName().
|
|
* Support for Multithreading:: Commands specific to multithreaded applications.
|
|
* View Multiple Experiments:: Analyze multiple experiments simultaneously.
|
|
* Profile Hardware Event Counters:: How to use hardware event counters.
|
|
* Java Profiling:: How to profile a Java application.
|
|
@end menu
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Getting Started
|
|
@section Getting Started
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The information presented here provides a good and common basis for many
|
|
profiling tasks, but there are more features that you may want to leverage.
|
|
|
|
These are covered in subsequent sections in this chapter.
|
|
|
|
@menu
|
|
* The Example Program:: A description of the example program used.
|
|
* A First Profile:: How to get the first profile.
|
|
* The Source Code View:: Display the metrics in the source code.
|
|
* The Disassembly View:: Display the metrics at the instruction level.
|
|
* Display and Define the Metrics:: An example how to customize the metrics.
|
|
* Customization of the Output:: An example how to customize the output.
|
|
* Name the Experiment Directory:: Change the name of the experiment directory.
|
|
* Control the Number of Lines in the Output:: Change the number of lines in the tables.
|
|
* Sorting the Performance Data:: How to set the metric to sort by.
|
|
* Scripting:: Use a script to execute the commands.
|
|
* A More Elaborate Example:: An example of customization.
|
|
* The Call Tree:: Display the dynamic call tree.
|
|
* More Information on the Experiment:: How to get additional statistics.
|
|
* Control the Sampling Frequency:: How to control the sampling granularity.
|
|
* Information on Load Objects:: How to get more information on load objects.
|
|
@end menu
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node The Example Program
|
|
@subsection The Example Program
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
Throughout this guide we use the same example C code that implements the
|
|
multiplication of a vector of length @math{n} by an @math{m} by @math{n}
|
|
matrix. The result is stored in a vector of length @math{m}.
|
|
@cindex Pthreads
|
|
@cindex Posix Threads
|
|
The algorithm has been parallelized using Posix Threads, or Pthreads for short.
|
|
|
|
The code was built using the @code{gcc} compiler and the name of the executable
|
|
is
|
|
@cindex mxv-pthreads
|
|
@command{mxv-pthreads}.
|
|
|
|
The matrix sizes can be set through the @code{-m} and @code{-n} options. The
|
|
number of threads is set with the @code{-t} option. These are additional
|
|
threads that are used in the multiplication. To increase the duration of
|
|
the run, the computations are executed repeatedly.
|
|
|
|
This is an example that multiplies a @math{8000} by @math{4000} matrix with
|
|
a vector of length @math{4000}. Although this is a multithreaded application,
|
|
initially we will be using @math{1} threads. Later on we will show examples
|
|
using multiple threads.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
$ ./mxv-pthreads -m 8000 -n 4000 -t 1
|
|
mxv: error check passed - rows = 8000 columns = 4000 threads = 1
|
|
$
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The program performs an internal check to verify that the computed results
|
|
are correct. The result of this check is printed, as well as the matrix
|
|
sizes and the number of threads used.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node A First Profile
|
|
@subsection A First Profile
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The first step is to collect the performance data. It is important to remember
|
|
that much more information is gathered than may be shown by default. Often a
|
|
single data collection run is sufficient to get a lot of insight.
|
|
|
|
The @CollectApp{} command is used for the data collection. Nothing needs to be
|
|
changed in the way the application is executed. The only difference is that it
|
|
is now run under control of the tool, as shown below:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng collect app ./mxv-pthreads -m 8000 -n 4000 -t 1
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@noindent
|
|
This produces the following output:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Creating experiment directory test.1.er (Process ID: 2749878) ...
|
|
mxv: error check passed - rows = 8000 columns = 4000 threads = 1
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
We see a message that an experiment directory with the name @file{test.1.er}
|
|
has been created. The process id is also echoed. The application completes
|
|
as usual and we have our first experiment directory that can be analyzed.
|
|
|
|
The tool we use for this is called @DisplayText{}. It takes the name of
|
|
the experiment directory as an argument.
|
|
|
|
@cindex Interpreter mode
|
|
If invoked this way, the tool starts in the interactive @emph{interpreter}
|
|
mode.
|
|
While in this environment, commands can be given and the tool responds. This
|
|
is illustrated below:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
$ gprofng display text test.1.er
|
|
Warning: History and command editing is not supported on this system.
|
|
(gprofng-display-text) quit
|
|
$
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@cindex Command line mode
|
|
While useful in certain cases, we prefer to use this tool in command line mode
|
|
by specifying the commands to be issued when invoking the tool. The way to do
|
|
this is to prepend the command(s) with a hyphen (@samp{-}) if used on the
|
|
command line.
|
|
|
|
Since this makes the command appear to be an option, they are also sometimes
|
|
referred to as such, but technically they are commands. This is the
|
|
terminology we will use in this user guide, but for convenience the commands
|
|
are also listed as options in the index.
|
|
|
|
For example,
|
|
@IndexSubentry{Options, @code{-functions}}
|
|
@IndexSubentry{Commands, @code{functions}}
|
|
below we use the @command{functions} command to request a list of the functions
|
|
that have been executed, plus their respective CPU times:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -functions test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Incl. Total Name
|
|
CPU CPU
|
|
sec. % sec. %
|
|
9.367 100.00 9.367 100.00 <Total>
|
|
8.926 95.30 8.926 95.30 mxv_core
|
|
0.210 2.24 0.420 4.49 init_data
|
|
0.080 0.85 0.210 2.24 drand48
|
|
0.070 0.75 0.130 1.39 erand48_r
|
|
0.060 0.64 0.060 0.64 __drand48_iterate
|
|
0.010 0.11 0.020 0.21 _int_malloc
|
|
0.010 0.11 0.010 0.11 sysmalloc
|
|
0. 0. 8.926 95.30 <static>@0x47960 (<libgp-collector.so>)
|
|
0. 0. 0.440 4.70 __libc_start_main
|
|
0. 0. 0.020 0.21 allocate_data
|
|
0. 0. 8.926 95.30 driver_mxv
|
|
0. 0. 0.440 4.70 main
|
|
0. 0. 0.020 0.21 malloc
|
|
0. 0. 8.926 95.30 start_thread
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@noindent
|
|
As easy and simple as these steps are, we do have a first profile of our
|
|
program!
|
|
|
|
There are five columns. The first four contain the
|
|
@cindex Total CPU time
|
|
''Total CPU Time'', which
|
|
is the sum of the user and system time. @xref{Inclusive and Exclusive Metrics}
|
|
for an explanation of ``exclusive'' and ``inclusive'' times.
|
|
|
|
The first line echoes the metric that is used to sort the output. By default,
|
|
this is the exclusive CPU time, but through the @command{sort} command, the
|
|
sort metric can be changed by the user.
|
|
|
|
Next, there are four columns with the exclusive and inclusive CPU times and the
|
|
respective percentages. This is followed by the name of the function.
|
|
|
|
@IndexSubentry{Miscellaneous, @code{<Total>}}
|
|
The function with the name @code{<Total>} is not a user function. It is a
|
|
pseudo function introduced by @ToolName{}. It is used to display the
|
|
accumulated measured metric values. In this example, we see that the total
|
|
CPU time of this job was 9.367 seconds and it is scaled to 100%. All
|
|
other percentages in the same column are relative to this number.
|
|
|
|
@c -- If the metric is derived, for example the @code{IPC}, the value shown
|
|
@c -- under
|
|
@c -- @code{<Total>} is based upon the total values of the that are metrics
|
|
@c -- used to
|
|
@c -- compute the derived metric.
|
|
@c -- @IndexSubentry{Hardware event counters, IPC}
|
|
|
|
With 8.926 seconds, function @code{mxv_core} takes 95.30% of the
|
|
total time and is by far the most time consuming function.
|
|
The exclusive and inclusive metrics are identical, which means that it
|
|
is a leaf function not calling any other functions.
|
|
|
|
The next function in the list is @code{init_data}. Although with 4.49%,
|
|
the CPU time spent in this part is modest, this is an interesting entry because
|
|
the inclusive CPU time of 0.420 seconds is twice the exclusive CPU time
|
|
of 0.210 seconds. Clearly this function is calling another function,
|
|
or even more than one function and collectively this takes 0.210 seconds.
|
|
Below we show the call tree feature that provides more details on the call
|
|
structure of the application.
|
|
|
|
The function @code{<static>@@0x47960 (<libgp-collector.so>)} does odd and
|
|
certainly not familiar. It is one of the internal functions used by
|
|
@CollectApp{} and can be ignored. Also, while the inclusive time is high,
|
|
the exclusive time is zero. This means it doesn't contribute to the
|
|
performance.
|
|
|
|
The question is how we know where this function originates from. There are
|
|
several commands to dig deeper an get more details on a function.
|
|
@xref{Information on Load Objects}.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node The Source Code View
|
|
@subsection The Source Code View
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
In general, the tuning efforts are best focused on the most time consuming
|
|
part(s) of an application. In this case that is easy, since over 95% of
|
|
the total CPU time is spent in function @code{mxv_core}.
|
|
It is now time to dig deeper and look
|
|
@cindex Source level metrics
|
|
at the metrics distribution at the source code level. Since we measured
|
|
CPU times, these are the metrics shown.
|
|
|
|
@IndexSubentry{Options, @code{-source}}
|
|
@IndexSubentry{Commands, @code{source}}
|
|
The @code{source} command is used to accomplish this. It takes the name of the
|
|
function, not the source filename, as an argument. This is demonstrated
|
|
below, where the @DisplayText{} command is used to show the annotated
|
|
source listing of function @code{mxv_core}.
|
|
|
|
Be aware that when using the @command{gcc} compiler, the source code has to
|
|
be compiled with the @code{-g} option in order for the source code feature
|
|
to work. Otherwise the location(s) can not be determined. For other compilers
|
|
we recommend to check the documentation for such an option.
|
|
|
|
Below the command to display the source code of a function is shown. Since at
|
|
this point we are primarily interested in the timings only, we use the
|
|
@code{metrics} command to request the exclusive and inclusive total CPU
|
|
timings only. @xref{Display and Define the Metrics} for more information
|
|
how to define the metrics to be displayed.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -metrics ei.totalcpu -source mxv_core test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The output is shown below. It has been somewhat modified to fit the formatting
|
|
constraints and reduce the number of lines.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Current metrics: e.totalcpu:i.totalcpu:name
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
|
|
Source file: <apath>/mxv.c
|
|
Object file: mxv-pthreads (found as test.1.er/archives/...)
|
|
Load Object: mxv-pthreads (found as test.1.er/archives/...)
|
|
|
|
Excl. Incl.
|
|
Total Total
|
|
CPU sec. CPU sec.
|
|
|
|
<lines deleted>
|
|
<Function: mxv_core>
|
|
43. void __attribute__ ((noinline))
|
|
mxv_core (int64_t row_index_start,
|
|
44. int64_t row_index_end,
|
|
45. int64_t m,
|
|
46. int64_t n,
|
|
47. double **restrict A,
|
|
48. double *restrict b,
|
|
49. double *restrict c)
|
|
50. {
|
|
0. 0. 50. {
|
|
0. 0. 51. for (int64_t i=row_index_start;
|
|
i<=row_index_end; i++)
|
|
52. {
|
|
0. 0. 53. double row_sum = 0.0;
|
|
## 4.613 4.613 54. for (int64_t j=0; j<n; j++)
|
|
## 4.313 4.313 55. row_sum += A[i][j] * b[j];
|
|
0. 0. 56. c[i] = row_sum;
|
|
57. }
|
|
0. 0. 58. }
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The first line echoes the metrics that have been selected. The second line
|
|
is not very meaningful when looking at the source code listing, but it shows
|
|
the metric that is used to sort the data.
|
|
|
|
The next three lines provide information on the location of the source file,
|
|
the object file and the load object (@xref{Load Objects and Functions}).
|
|
|
|
Function @code{mxv_core} is part of a source file that has other functions
|
|
as well. These functions will be shown with the values for the metrics, but
|
|
for lay-out purposes they have been removed in the output shown above.
|
|
|
|
The header is followed by the annotated source code listing. The selected
|
|
metrics are shown first, followed by a source line number, and the source code.
|
|
@IndexSubentry{Miscellaneous ,@code{##}}
|
|
The most time consuming line(s) are marked with the @code{##} symbol. In
|
|
this way they are easier to identify and find with a search.
|
|
|
|
What we see is that all of the time is spent in lines 54-55.
|
|
|
|
@IndexSubentry{Options, @code{-lines}}
|
|
@IndexSubentry{Commands, @code{lines}}
|
|
A related command sometimes comes handy as well. It is called @code{lines}
|
|
and displays a list of the source lines and their metrics, ordered according
|
|
to the current sort metric (@xref{Sorting the Performance Data}).
|
|
|
|
Below the command and the output. For lay-out reasons, only the top 10 is
|
|
shown here and the last part of the text on some lines has been replaced
|
|
by dots. The full text is @samp{instructions without line numbers} and
|
|
means that the line number information for that function was not found.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -lines test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Lines sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Incl. Total Name
|
|
CPU CPU
|
|
sec. % sec. %
|
|
9.367 100.00 9.367 100.00 <Total>
|
|
4.613 49.25 4.613 49.25 mxv_core, line 54 in "mxv.c"
|
|
4.313 46.05 4.313 46.05 mxv_core, line 55 in "mxv.c"
|
|
0.160 1.71 0.370 3.95 init_data, line 118 in "manage_data.c"
|
|
0.080 0.85 0.210 2.24 <Function: drand48, instructions ...>
|
|
0.070 0.75 0.130 1.39 <Function: erand48_r, instructions ...>
|
|
0.060 0.64 0.060 0.64 <Function: __drand48_iterate, ...>
|
|
0.040 0.43 0.040 0.43 init_data, line 124 in "manage_data.c"
|
|
0.010 0.11 0.020 0.21 <Function: _int_malloc, instructions ...>
|
|
0.010 0.11 0.010 0.11 <Function: sysmalloc, instructions ...>
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
What this overview immediately highlights is that the third most time consuming
|
|
source line takes 0.370 seconds only. This means that the inclusive time is
|
|
only 3.95% and clearly this branch of the code hardly impacts the performance.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node The Disassembly View
|
|
@subsection The Disassembly View
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The source view is very useful to obtain more insight where the time is spent,
|
|
but sometimes this is not sufficient. The disassembly view provides more
|
|
details since it shows the metrics at the instruction level.
|
|
|
|
This view is displayed with the
|
|
@IndexSubentry{Options, @code{-disasm}}
|
|
@IndexSubentry{Commands, @code{disasm}}
|
|
@command{disasm}
|
|
command and as with the source view, it displays an annotated listing. In this
|
|
@cindex Instruction level metrics
|
|
case it shows the instructions with the metrics, interleaved with the
|
|
source lines. The
|
|
instructions have a reference in square brackets (@code{[} and @code{]})
|
|
to the source line they correspond to.
|
|
|
|
@noindent
|
|
We again focus on the tmings only and set the metrics accordingly:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -metrics ei.totalcpu -disasm mxv_core test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Current metrics: e.totalcpu:i.totalcpu:name
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
|
|
Source file: <apath>/src/mxv.c
|
|
Object file: mxv-pthreads (found as test.1.er/archives/...)
|
|
Load Object: mxv-pthreads (found as test.1.er/archives/...)
|
|
|
|
Excl. Incl.
|
|
Total Total
|
|
CPU sec. CPU sec.
|
|
|
|
<lines deleted>
|
|
43. void __attribute__ ((noinline))
|
|
mxv_core (int64_t row_index_start,
|
|
44. int64_t row_index_end,
|
|
45. int64_t m,
|
|
46. int64_t n,
|
|
47. double **restrict A,
|
|
48. double *restrict b,
|
|
49. double *restrict c)
|
|
50. {
|
|
<Function: mxv_core>
|
|
0. 0. [50] 401d56: mov 0x8(%rsp),%r10
|
|
51. for (int64_t i=row_index_start;
|
|
i<=row_index_end; i++)
|
|
0. 0. [51] 401d5b: cmp %rsi,%rdi
|
|
0. 0. [51] 401d5e: jg 0x47
|
|
0. 0. [51] 401d60: add $0x1,%rsi
|
|
0. 0. [51] 401d64: jmp 0x36
|
|
52. {
|
|
53. double row_sum = 0.0;
|
|
54. for (int64_t j=0; j<n; j++)
|
|
55 row_sum += A[i][j] * b[j];
|
|
0. 0. [55] 401d66: mov (%r8,%rdi,8),%rdx
|
|
0. 0. [54] 401d6a: mov $0x0,%eax
|
|
0. 0. [53] 401d6f: pxor %xmm1,%xmm1
|
|
0.110 0.110 [55] 401d73: movsd (%rdx,%rax,8),%xmm0
|
|
1.921 1.921 [55] 401d78: mulsd (%r9,%rax,8),%xmm0
|
|
2.282 2.282 [55] 401d7e: addsd %xmm0,%xmm1
|
|
## 4.613 4.613 [54] 401d82: add $0x1,%rax
|
|
0. 0. [54] 401d86: cmp %rax,%rcx
|
|
0. 0. [54] 401d89: jne 0xffffffffffffffea
|
|
56. c[i] = row_sum;
|
|
0. 0. [56] 401d8b: movsd %xmm1,(%r10,%rdi,8)
|
|
0. 0. [51] 401d91: add $0x1,%rdi
|
|
0. 0. [51] 401d95: cmp %rsi,%rdi
|
|
0. 0. [51] 401d98: je 0xd
|
|
0. 0. [53] 401d9a: pxor %xmm1,%xmm1
|
|
0. 0. [54] 401d9e: test %rcx,%rcx
|
|
0. 0. [54] 401da1: jg 0xffffffffffffffc5
|
|
0. 0. [54] 401da3: jmp 0xffffffffffffffe8
|
|
57. }
|
|
58. }
|
|
0. 0. [58] 401da5: ret
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
For each instruction, the timing values are given and we can immediately
|
|
identify the most expensive instructions. As with the source level view,
|
|
these are marked with the @code{##} symbol.
|
|
|
|
It comes as no surprise that the time consuming instructions originate from
|
|
the source code at lines 54-55.
|
|
One thing to note is that the source line numbers no longer appear in
|
|
sequential order.
|
|
This is because the compiler has re-ordered the instructions as part of
|
|
the code optimizations it has performed.
|
|
|
|
As illustrated below and similar to the @command{lines} command, we can get
|
|
an overview of the instructions executed by using the
|
|
@IndexSubentry{Options, @code{-pcs}}
|
|
@IndexSubentry{Commands, @code{pcs}}
|
|
@command{pcs}
|
|
command.
|
|
|
|
@noindent
|
|
Below the command and the output, which again has been restricted
|
|
to 10 lines. As before, some lines have been shortened for lay-out
|
|
purposes.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -pcs test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
PCs sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Incl. Total Name
|
|
CPU CPU
|
|
sec. % sec. %
|
|
9.367 100.00 9.367 100.00 <Total>
|
|
4.613 49.25 4.613 49.25 mxv_core + 0x0000002C, line 54 in "mxv.c"
|
|
2.282 24.36 2.282 24.36 mxv_core + 0x00000028, line 55 in "mxv.c"
|
|
1.921 20.51 1.921 20.51 mxv_core + 0x00000022, line 55 in "mxv.c"
|
|
0.150 1.60 0.150 1.60 init_data + 0x000000AC, line 118 in ...
|
|
0.110 1.18 0.110 1.18 mxv_core + 0x0000001D, line 55 in "mxv.c"
|
|
0.040 0.43 0.040 0.43 drand48 + 0x00000022
|
|
0.040 0.43 0.040 0.43 init_data + 0x000000F1, line 124 in ...
|
|
0.030 0.32 0.030 0.32 __drand48_iterate + 0x0000001E
|
|
0.020 0.21 0.020 0.21 __drand48_iterate + 0x00000038
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@noindent
|
|
What we see is that the top three instructions take 94% of the total CPU time
|
|
and any optimizations should focus on this part of the code..
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Display and Define the Metrics
|
|
@subsection Display and Define the Metrics
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The metrics shown by @DisplayText{} are useful, but there is more recorded
|
|
than displayed by default. We can customize the values shown by defining the
|
|
metrics ourselves.
|
|
|
|
There are two commands related to changing the metrics shown:
|
|
@IndexSubentry{Options, @code{-metric_list}}
|
|
@IndexSubentry{Commands, @code{metric_list}}
|
|
@command{metric_list} and
|
|
@IndexSubentry{Options, @code{-metrics}}
|
|
@IndexSubentry{Commands, @code{metrics}}
|
|
@command{metrics}.
|
|
|
|
The first command shows the currently selected metrics, plus all the metrics
|
|
that have been stored as part of the experiment. The second command may be
|
|
used to define the metric list.
|
|
|
|
@noindent
|
|
This is the way to get the information about the metrics:
|
|
|
|
@IndexSubentry{Options, @code{-metric_list}}
|
|
@IndexSubentry{Commands, @code{metric_list}}
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -metric_list test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@noindent
|
|
This is the output:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Current metrics: e.%totalcpu:i.%totalcpu:name
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
|
|
Available metrics:
|
|
Exclusive Total CPU Time: e.%totalcpu
|
|
Inclusive Total CPU Time: i.%totalcpu
|
|
Size: size
|
|
PC Address: address
|
|
Name: name
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
This shows the metrics that are currently used, the metric that is used to sort
|
|
the data and all the metrics that have been recorded, but are not necessarily
|
|
shown.
|
|
|
|
@cindex Default metrics
|
|
In this case, the current metrics are set to the exclusive and inclusive
|
|
total CPU times, the respective percentages, and the name of the function,
|
|
or load object.
|
|
|
|
@IndexSubentry{Options, @code{-metrics}}
|
|
@IndexSubentry{Commands, @code{metrics}}
|
|
The @code{metrics} command is used to define the metrics that need to be
|
|
displayed.
|
|
|
|
For example, to swap the exclusive and inclusive metrics, use the following
|
|
metric definition: @code{i.%totalcpu:e.%totalcpu}.
|
|
|
|
Since the metrics can be tailored for different views, there is also a way
|
|
to reset them to the default. This is done through the special keyword
|
|
@code{default} for the metrics definition (@command{-metrics default}).
|
|
@IndexSubentry{Metrics, Reset to default}
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Customization of the Output
|
|
@subsection Customization of the Output
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
With the information just given, the function overview can be customized.
|
|
For sake of the example, we would like to display the name of the function
|
|
first, only followed by the exclusive CPU time, given as an absolute number
|
|
and a percentage.
|
|
|
|
Note that the commands are parsed in order of appearance. This is why we
|
|
need to define the metrics @emph{before} requesting the function overview:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -metrics name:e.%totalcpu -functions test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Current metrics: name:e.%totalcpu
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Name Excl. Total
|
|
CPU
|
|
sec. %
|
|
<Total> 9.367 100.00
|
|
mxv_core 8.926 95.30
|
|
init_data 0.210 2.24
|
|
drand48 0.080 0.85
|
|
erand48_r 0.070 0.75
|
|
__drand48_iterate 0.060 0.64
|
|
_int_malloc 0.010 0.11
|
|
sysmalloc 0.010 0.11
|
|
<static>@0x47960 (<libgp-collector.so>) 0. 0.
|
|
__libc_start_main 0. 0.
|
|
allocate_data 0. 0.
|
|
driver_mxv 0. 0.
|
|
main 0. 0.
|
|
malloc 0. 0.
|
|
start_thread 0. 0.
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
This was a first and simple example how to customize the output. Note that we
|
|
did not rerun our profiling job and merely modified the display settings.
|
|
Below we will show other and also more advanced examples of customization.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Name the Experiment Directory
|
|
@subsection Name the Experiment Directory
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
When using @CollectApp{}, the default names for experiments work fine, but
|
|
they are quite generic. It is often more convenient to select a more
|
|
descriptive name. For example, one that reflects conditions for the experiment
|
|
conducted, like the number of threads used.
|
|
|
|
For this, the mutually exclusive @code{-o} and @code{-O} options come in handy.
|
|
Both may be used to provide a name for the experiment directory, but the
|
|
behaviour of @CollectApp{} is different.
|
|
|
|
With the
|
|
@IndexSubentry{Options, @code{-o}}
|
|
@samp{-o}
|
|
option, an existing experiment directory is not overwritten. Any directory
|
|
with the same name either needs to be renamed, moved, or removed, before the
|
|
experiment can be conducted.
|
|
|
|
This is in contrast with the behaviour for the
|
|
@IndexSubentry{Options, @code{-O}}
|
|
@samp{-O}
|
|
option. Any existing directory with the same name is silently overwritten.
|
|
|
|
Be aware that the name of the experiment directory has to end with @file{.er}.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Control the Number of Lines in the Output
|
|
@subsection Control the Number of Lines in the Output
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@IndexSubentry{Options, @code{-limit}}
|
|
@IndexSubentry{Commands, @code{limit}}
|
|
The @command{limit} @var{<n>} command can be used to control the number of
|
|
lines printed in various views. For example it impacts the function view, but
|
|
also takes effect for other display commands, like @command{lines}.
|
|
|
|
The argument @var{<n>} should be a positive integer number. It sets the number
|
|
of lines in the (function) view. A value of zero resets the limit to the
|
|
default.
|
|
|
|
Be aware that the pseudo-function @code{<Total>} counts as a regular function.
|
|
For example @command{limit 10} displays nine user level functions.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Sorting the Performance Data
|
|
@subsection Sorting the Performance Data
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@IndexSubentry{Options, @code{-sort}}
|
|
@IndexSubentry{Commands, @code{sort}}
|
|
The @command{sort} @var{<key>} command sets the key to be used when sorting the
|
|
performance data.
|
|
|
|
The key is a valid metric definition, but the
|
|
@IndexSubentry{Metrics, Visibility field}
|
|
visibility field
|
|
(@xref{Metric Definitions})
|
|
in the metric
|
|
definition is ignored, since this does not affect the outcome of the sorting
|
|
operation.
|
|
For example if the sort key is set to @code{e.totalcpu}, the values
|
|
will be sorted in descending order with respect to the exclusive total
|
|
CPU time.
|
|
|
|
@IndexSubentry{Sort, Reverse order}
|
|
The data can be sorted in reverse order by prepending the metric definition
|
|
with a minus (@samp{-}) sign. For example @command{sort -e.totalcpu}.
|
|
|
|
@IndexSubentry{Sort, Reset to default}
|
|
A default metric for the sort operation has been defined and since this is
|
|
a persistent command, this default can be restored with @code{default} as
|
|
the key (@command{sort default}).
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Scripting
|
|
@subsection Scripting
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@cindex Script files
|
|
The list with commands for @DisplayText{} can be very long. This is tedious
|
|
and also error prone. Luckily, there is an easier and elegant way to control
|
|
the output of this tool.
|
|
|
|
@IndexSubentry{Options, @code{-script}}
|
|
@IndexSubentry{Commands, @code{script}}
|
|
Through the @command{script} command, the name of a file with commands can be
|
|
passed in. These commands are parsed and executed as if they appeared on
|
|
the command line in the same order as encountered in the file. The commands
|
|
in this script file can actually be mixed with commands on the command line
|
|
and multiple script files may be used.
|
|
The difference between the commands in the script file and those used on the
|
|
command line is that the latter require a leading dash (@samp{-}) symbol.
|
|
|
|
Comment lines in a script file are supported. They need to start with the
|
|
@samp{#} symbol.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node A More Elaborate Example
|
|
@subsection A More Elaborate Example
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
With the information presented so far, we can customize our data
|
|
gathering and display commands.
|
|
|
|
As an example, we would like to use @file{mxv.1.thr.er} as the name for the
|
|
experiment directory. In this way, the name of the algorithm and the
|
|
number of threads that were used are included in the name.
|
|
We also don't mind to overwrite an existing
|
|
experiment directory with the same name.
|
|
|
|
All that needs to be done is to use the
|
|
@IndexSubentry{Options, @code{-O}}
|
|
@samp{-O}
|
|
option, followed by the directory name of choice when running @CollectApp{}:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exe=mxv-pthreads
|
|
$ m=8000
|
|
$ n=4000
|
|
$ gprofng collect app -O mxv.1.thr.er ./$exe -m $m -n $n -t 1
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
Since we want to customize the profile and prefer to keep the command line
|
|
short, the commands to generate the profile are put into a file with the
|
|
name @file{my-script}:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
$ cat my-script
|
|
# This is my first gprofng script
|
|
# Set the metrics
|
|
metrics i.%totalcpu:e.%totalcpu:name
|
|
# Use the exclusive time to sort
|
|
sort e.totalcpu
|
|
# Limit the function list to 5 lines
|
|
limit 5
|
|
# Show the function list
|
|
functions
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
This script file is specified as input to the @DisplayText{} command
|
|
that is used to display the performance information stored in experiment
|
|
directory @file{mxv.1.thr.er}:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -script my-script mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
This command produces the following output:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
# This is my first gprofng script
|
|
# Set the metrics
|
|
Current metrics: i.%totalcpu:e.%totalcpu:name
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
|
|
# Use the exclusive time to sort
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
|
|
# Limit the function list to 5 lines
|
|
Print limit set to 5
|
|
# Show the function list
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Incl. Total Excl. Total Name
|
|
CPU CPU
|
|
sec. % sec. %
|
|
9.703 100.00 9.703 100.00 <Total>
|
|
9.226 95.09 9.226 95.09 mxv_core
|
|
0.455 4.69 0.210 2.17 init_data
|
|
0.169 1.75 0.123 1.26 erand48_r
|
|
0.244 2.52 0.075 0.77 drand48
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
In the first part of the output the comment lines in the script file are
|
|
echoed. These are interleaved with an acknowledgement message for the
|
|
commands.
|
|
|
|
This is followed by a profile consisting of 5 lines only. For both metrics,
|
|
the percentages plus the timings are given. The numbers are sorted with
|
|
respect to the exclusive total CPU time. Although this is the default, for
|
|
demonstration purposes we use the @command{sort} command to explicitly define
|
|
the metric for the sort.
|
|
|
|
While we executed the same job as before and only changed the name of the
|
|
experiment directory, the results are somewhat different. This is sampling
|
|
in action. The numbers are not all that different though.
|
|
It is seen that function @code{mxv_core} is responsbile for
|
|
95% of the CPU time and @code{init_data} takes 4.5% only.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node The Call Tree
|
|
@subsection The Call Tree
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The call tree shows the dynamic structure of the application by displaying the
|
|
functions executed and their parent. The CPU time attributed to each function
|
|
is shown as well. This view helps to find the most expensive
|
|
execution path in the program.
|
|
|
|
@IndexSubentry{Options, @code{-calltree}}
|
|
@IndexSubentry{Commands, @code{calltree}}
|
|
This feature is enabled through the @command{calltree} command. For example,
|
|
this is how to get the call tree for our current experiment:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -calltree mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
This displays the following structure:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Functions Call Tree. Metric: Attributed Total CPU Time
|
|
|
|
Attr. Total Name
|
|
CPU
|
|
sec. %
|
|
9.703 100.00 +-<Total>
|
|
9.226 95.09 +-start_thread
|
|
9.226 95.09 | +-<static>@0x47960 (<libgp-collector.so>)
|
|
9.226 95.09 | +-driver_mxv
|
|
9.226 95.09 | +-mxv_core
|
|
0.477 4.91 +-__libc_start_main
|
|
0.477 4.91 +-main
|
|
0.455 4.69 +-init_data
|
|
0.244 2.52 | +-drand48
|
|
0.169 1.75 | +-erand48_r
|
|
0.047 0.48 | +-__drand48_iterate
|
|
0.021 0.22 +-allocate_data
|
|
0.021 0.22 | +-malloc
|
|
0.021 0.22 | +-_int_malloc
|
|
0.006 0.06 | +-sysmalloc
|
|
0.003 0.03 | +-__default_morecore
|
|
0.003 0.03 | +-sbrk
|
|
0.003 0.03 | +-brk
|
|
0.001 0.01 +-pthread_create
|
|
0.001 0.01 +-__pthread_create_2_1
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
At first sight this may not be what is expected and some explanation is in
|
|
place.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@c TBD: Revise this text when we have user and machine mode.
|
|
@c ----------------------------------------------------------------------------
|
|
The top function is the pseudo-function @code{<Total>} that we have seen
|
|
before. It is introduced and shown here to provide the total value of the
|
|
metric(s).
|
|
|
|
We also see function @code{<static>@@0x47960} in the call tree and apparently
|
|
it is from @code{libgp-collector.so}, a library that is internal to
|
|
@ToolName{}.
|
|
The @code{<static>} marker, followed by the program counter, is shown if the
|
|
name of the function cannot be found. This function is part of the
|
|
implementation of the data collection process and should be hidden to the
|
|
user. This is part of a planned future enhancement.
|
|
|
|
In general, if a view has a function that does not appear to be part of the
|
|
user code, or seems odd anyhow, the @command{objects} and @command{fsingle}
|
|
@IndexSubentry{Options, @code{-objects}}
|
|
@IndexSubentry{Commands, @code{objects}}
|
|
@IndexSubentry{Options, @code{-fsingle}}
|
|
@IndexSubentry{Commands, @code{fsingle}}
|
|
commands are very useful
|
|
to find out more about load objects in general, but also to help identify
|
|
an unknown entry in the function overview. @xref{Load Objects and Functions}.
|
|
|
|
Another thing to note is that there are two main branches. The one under
|
|
@code{<static>@@0x47960} and the second one under @code{__libc_start_main}.
|
|
This reflects the fact that this is a multithreaded program and the
|
|
threaded part shows up as a separate branch in the call tree.
|
|
|
|
The way to interpret this structure is as follows. The program starts
|
|
under control of @code{__libc_start_main}. This executes the main program
|
|
called @code{main}, which at the top level executes functions
|
|
@code{init_data}, @code{allocate_data}, and @code{pthread_create}.
|
|
The latter function creates and executes the additional thread(s).
|
|
|
|
For this multithreaded part of the code, we need to look at the branch
|
|
under function @code{start_thread} that calls the driver code for the
|
|
matrix-vector multiplication (@code{driver_mxv}), which executes the function
|
|
that performs the actual multiplication (@code{mxv_core}).
|
|
|
|
There are two things worth noting for the call tree feature:
|
|
|
|
@itemize
|
|
|
|
@item
|
|
This is a dynamic tree and since sampling is used, it most likely looks
|
|
slighlty different across seemingly identical profile runs. In case the
|
|
run times are short, it is worth considering to use a high resolution
|
|
through the
|
|
@IndexSubentry{Options, @code{-p}}
|
|
@samp{-p}
|
|
option. For example use @samp{-p hi} to increase the sampling rate.
|
|
|
|
@item
|
|
In case hardware event counters have been enabled
|
|
(@xref{Profile Hardware Event Counters}), these values are also displayed
|
|
in the call tree view.
|
|
|
|
@end itemize
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node More Information on the Experiment
|
|
@subsection More Information on the Experiment
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The experiment directory not only contains performance related data. Several
|
|
system characteristics, the profiling command executed, plus some global
|
|
performance statistics are stored and can be displayed.
|
|
|
|
@IndexSubentry{Options, @code{-header}}
|
|
@IndexSubentry{Commands, @code{header}}
|
|
The @command{header} command displays information about the experiment(s).
|
|
For example, this is command is used to extract this data from for our
|
|
experiment directory:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -header mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The above command prints the following information. Note that some of the
|
|
lay-out and the information has been modified. Directory paths have been
|
|
replaced @code{<apath>} for example. Textual changes are
|
|
marked with the @samp{<} and @samp{>} symbols.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Experiment: mxv.1.thr.er
|
|
No errors
|
|
No warnings
|
|
Archive command ` /usr/bin/gprofng-archive -n -a on --outfile
|
|
<apath>/archive.log <apath>/mxv.1.thr.er'
|
|
|
|
Target command (64-bit): './mxv-pthreads -m 8000 -n 4000 -t 1'
|
|
Process pid 2750071, ppid 2750069, pgrp 2749860, sid 2742080
|
|
Current working directory: <apath>
|
|
Collector version: `2.40.00'; experiment version 12.4 (64-bit)
|
|
Host `<the-host-name>', OS `Linux <version>', page size 4096,
|
|
architecture `x86_64'
|
|
4 CPUs, clock speed 2294 MHz.
|
|
Memory: 3506491 pages @ 4096 = 13697 MB.
|
|
Data collection parameters:
|
|
Clock-profiling, interval = 997 microsecs.
|
|
Periodic sampling, 1 secs.
|
|
Follow descendant processes from: fork|exec|combo
|
|
|
|
Experiment started <date and time>
|
|
|
|
Experiment Ended: 9.801216173
|
|
Data Collection Duration: 9.801216173
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The output above may assist in troubleshooting, or to verify some of the
|
|
operational conditions and we recommend to include this command when
|
|
generating a profile.
|
|
|
|
@IndexSubentry{Options, @code{-C}}
|
|
Related to this command there is a useful option to record comment(s) in
|
|
an experiment.
|
|
To this end, use the @samp{-C} option on the @CollectApp{} tool to
|
|
specify a comment string. Up to ten comment lines can be included.
|
|
These comments are displayed with the @command{header} command on
|
|
the @DisplayText{} tool.
|
|
|
|
@IndexSubentry{Options, @code{-overview}}
|
|
@IndexSubentry{Commands, @code{overview}}
|
|
The @command{overview} command displays information on the experiment(s) and
|
|
also shows a summary of the values for the metric(s) used. This is an example
|
|
how to use it on the newly created experiment directory:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -overview mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Experiment(s):
|
|
|
|
Experiment :mxv.1.thr.er
|
|
Target : './mxv-pthreads -m 8000 -n 4000 -t 1'
|
|
Host : <hostname> (<ISA>, Linux <version>)
|
|
Start Time : <date and time>
|
|
Duration : 9.801 Seconds
|
|
|
|
Metrics:
|
|
|
|
Experiment Duration (Seconds): [9.801]
|
|
Clock Profiling
|
|
[X]Total CPU Time - totalcpu (Seconds): [*9.703]
|
|
|
|
Notes: '*' indicates hot metrics, '[X]' indicates currently enabled
|
|
metrics.
|
|
The metrics command can be used to change selections. The
|
|
metric_list command lists all available metrics.
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
This command provides a dashboard overview that helps to easily identify
|
|
where the time is spent and in case hardware event counters are used, it
|
|
shows their total values.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Control the Sampling Frequency
|
|
@subsection Control the Sampling Frequency
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@cindex Sampling frequency
|
|
So far we did not go into details on the frequency of the sampling process,
|
|
but in some cases it is useful to change the default of 10 milliseconds.
|
|
|
|
The advantage of increasing the sampling frequency is that functions that
|
|
do not take much time per invocation are more accurately captured. The
|
|
downside is that more data is gathered. This has an impact on the overhead
|
|
of the collection process and more disk space is required.
|
|
|
|
In general this is not an immediate concern, but with heavily threaded
|
|
applications that run for an extended period of time, increasing the
|
|
frequency may have a more noticeable impact.
|
|
|
|
@IndexSubentry{Options, @code{-p}}
|
|
The @code{-p} option on the @CollectApp{} tool is used to enable or disable
|
|
clock based profiling, or to explicitly set the sampling rate.
|
|
@cindex Sampling interval
|
|
This option takes one of the following keywords:
|
|
|
|
@table @code
|
|
|
|
@item off
|
|
Disable clock based profiling.
|
|
|
|
@item on
|
|
Enable clock based profiling with a per thread sampling interval of 10 ms.
|
|
This is the default.
|
|
|
|
@item lo
|
|
Enable clock based profiling with a per thread sampling interval of 100 ms.
|
|
|
|
@item hi
|
|
Enable clock based profiling with a per thread sampling interval of 1 ms.
|
|
|
|
@item @var{value}
|
|
@cindex Sampling interval
|
|
Enable clock based profiling with a per thread sampling interval of
|
|
@var{value}.
|
|
|
|
@end table
|
|
|
|
It may seem unnecessary to have an option to disable clock based profiling,
|
|
but there is a good reason to support this.
|
|
By default, clock profiling is enabled when conducting hardware event counter
|
|
experiments (@xref{Profile Hardware Event Counters}).
|
|
With the @code{-p off} option, this can be disabled.
|
|
|
|
If an explicit value is set for the sampling, the number can be an integer or
|
|
a floating-point number.
|
|
A suffix of @samp{u} for microseconds, or @samp{m} for milliseconds is
|
|
supported. If no suffix is used, the value is assumed to be in milliseconds.
|
|
For example, the following command sets the sampling rate to
|
|
5123.4 microseconds:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng collect app -p 5123.4u ./mxv-pthreads -m 8000 -n 4000 -t 1
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
If the value is smaller than the clock profiling minimum, a warning message
|
|
is issued and it is set to the minimum.
|
|
In case it is not a multiple of the clock profiling resolution, it is
|
|
silently rounded down to the nearest multiple of the clock resolution.
|
|
If the value exceeds the clock profiling maximum, is negative, or zero, an
|
|
error is reported.
|
|
|
|
@IndexSubentry{Options, @code{-header}}
|
|
@IndexSubentry{Commands, @code{header}}
|
|
@noindent
|
|
Note that the @code{header} command echoes the sampling rate used.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Information on Load Objects
|
|
@subsection Information on Load Objects
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
It may happen that the function view shows a function that is not known to
|
|
the user. This can easily happen with library functions for example.
|
|
Luckily there are three commands that come in handy then.
|
|
|
|
@IndexSubentry{Options, @code{-objects}}
|
|
@IndexSubentry{Commands, @code{objects}}
|
|
@IndexSubentry{Options, @code{-fsingle}}
|
|
@IndexSubentry{Commands, @code{fsingle}}
|
|
@IndexSubentry{Options, @code{-fsummary}}
|
|
@IndexSubentry{Commands, @code{fsummary}}
|
|
These commands are @command{objects}, @command{fsingle}, and
|
|
@command{fsummary}. They provide details on
|
|
@cindex Load objects
|
|
load objects (@xref{Load Objects and Functions}).
|
|
|
|
The @command{objects} command lists all load objects that have been referenced
|
|
during the performance experiment.
|
|
Below we show the command and the result for our profile job. Like before,
|
|
some path names in the output have been shortened and replaced by the
|
|
@IndexSubentry{Miscellaneous, @code{<apath>}}
|
|
@code{<apath>} symbol that represents an absolute directory path.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -objects mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The output includes the name and path of the target executable:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
<Unknown> (<Unknown>)
|
|
<mxv-pthreads> (<apath>/mxv-pthreads)
|
|
<libdl-2.28.so> (/usr/lib64/libdl-2.28.so)
|
|
<librt-2.28.so> (/usr/lib64/librt-2.28.so)
|
|
<libc-2.28.so> (/usr/lib64/libc-2.28.so)
|
|
<libpthread-2.28.so> (/usr/lib64/libpthread-2.28.so)
|
|
<libm-2.28.so> (/usr/lib64/libm-2.28.so)
|
|
<libgp-collector.so> (/usr/lib64/gprofng/libgp-collector.so)
|
|
<ld-2.28.so> (/usr/lib64/ld-2.28.so)
|
|
<DYNAMIC_FUNCTIONS> (DYNAMIC_FUNCTIONS)
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@IndexSubentry{Options, @code{-fsingle}}
|
|
@IndexSubentry{Commands, @code{fsingle}}
|
|
The @command{fsingle} command may be used to get more details on a specific
|
|
entry in the function view, say. For example, the command below provides
|
|
additional information on the @code{pthread_create} function shown in the
|
|
function overview.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -fsingle pthread_create mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
Below the output from this command. It has been somewhat modified to match
|
|
the display requirements.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
+ gprofng display text -fsingle pthread_create mxv.1.thr.er
|
|
pthread_create
|
|
Exclusive Total CPU Time: 0. ( 0. %)
|
|
Inclusive Total CPU Time: 0.001 ( 0.0%)
|
|
Size: 258
|
|
PC Address: 8:0x00049f60
|
|
Source File: (unknown)
|
|
Object File: (unknown)
|
|
Load Object: /usr/lib64/gprofng/libgp-collector.so
|
|
Mangled Name:
|
|
Aliases:
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
In this table we not only see how much time was spent in this function, we
|
|
also see where it originates from. In addition to this, the size and start
|
|
address are given as well. If the source code location is known it is also
|
|
shown here.
|
|
|
|
@IndexSubentry{Options, @code{-fsummary}}
|
|
@IndexSubentry{Commands, @code{fsummary}}
|
|
The related @code{fsummary} command displays the same information as
|
|
@code{fsingle}, but for all functions in the function overview,
|
|
including @code{<Total>}:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -fsummary mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
<Total>
|
|
Exclusive Total CPU Time: 9.703 (100.0%)
|
|
Inclusive Total CPU Time: 9.703 (100.0%)
|
|
Size: 0
|
|
PC Address: 1:0x00000000
|
|
Source File: (unknown)
|
|
Object File: (unknown)
|
|
Load Object: <Total>
|
|
Mangled Name:
|
|
Aliases:
|
|
|
|
mxv_core
|
|
Exclusive Total CPU Time: 9.226 ( 95.1%)
|
|
Inclusive Total CPU Time: 9.226 ( 95.1%)
|
|
Size: 80
|
|
PC Address: 2:0x00001d56
|
|
Source File: <apath>/src/mxv.c
|
|
Object File: mxv.1.thr.er/archives/mxv-pthreads_ss_pf53V__5
|
|
Load Object: <apath>/mxv-pthreads
|
|
Mangled Name:
|
|
Aliases:
|
|
|
|
... etc ...
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Support for Multithreading
|
|
@section Support for Multithreading
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
In this chapter the support for multithreading is introduced and discussed.
|
|
As is shown below, nothing needs to be changed when collecting the performance
|
|
data.
|
|
|
|
The difference is that additional commands are available to get more
|
|
information on the multithreading details, plus that several filters allow
|
|
the user to zoom in on specific threads.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Creating a Multithreading Experiment
|
|
@subsection Creating a Multithreading Experiment
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
We demonstrate the support for multithreading using the same code and settings
|
|
as before, but this time 2 threads are used:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exe=mxv-pthreads
|
|
$ m=8000
|
|
$ n=4000
|
|
$ gprofng collect app -O mxv.2.thr.er ./$exe -m $m -n $n -t 2
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
First of all, in as far as @ProductName{} is concerned, no changes are needed.
|
|
Nothing special is needed to profile a multithreaded job when using
|
|
@ToolName{}.
|
|
|
|
The same is true when displaying the performance results. The same commands
|
|
that were used before work unmodified. For example, this is all that is
|
|
needed to get a function overview:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -limit 5 -functions mxv.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@noindent
|
|
This produces the following familiar looking output:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Print limit set to 5
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Incl. Total Name
|
|
CPU CPU
|
|
sec. % sec. %
|
|
9.464 100.00 9.464 100.00 <Total>
|
|
8.961 94.69 8.961 94.69 mxv_core
|
|
0.224 2.37 0.469 4.95 init_data
|
|
0.105 1.11 0.177 1.88 erand48_r
|
|
0.073 0.77 0.073 0.77 __drand48_iterate
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Commands Specific to Multithreading
|
|
@subsection Commands Specific to Multithreading
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The function overview shown above shows the results aggregated over all the
|
|
threads. The interesting new element is that we can also look at the
|
|
performance data for the individual threads.
|
|
|
|
@IndexSubentry{Options, @code{-thread_list}}
|
|
@IndexSubentry{Commands, @code{thread_list}}
|
|
The @command{thread_list} command displays how many threads have been used:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -thread_list mxv.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
This produces the following output, showing that three threads have
|
|
been used:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 all 3
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The output confirms there is one experiment and that by default all
|
|
threads are selected.
|
|
|
|
It may seem surprising to see three threads here, since we used the
|
|
@code{-t 2} option, but it is common for a Pthreads program to use one
|
|
additional thread.
|
|
Typically, there is one main thread that runs from start to finish.
|
|
It handles the sequential portions of the code, as well as thread
|
|
management related tasks.
|
|
It is no different in the example code. At some point, the main thread
|
|
creates and activates the two threads that perform the multiplication
|
|
of the matrix with the vector. Upon completion of this computation,
|
|
the main thread continues.
|
|
|
|
@IndexSubentry{Options, @code{-threads}}
|
|
@IndexSubentry{Commands, @code{threads}}
|
|
The @command{threads} command is simple, yet very powerful. It shows the
|
|
total value of the metrics for each thread.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -threads mxv.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@noindent
|
|
The command above produces the following overview:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Objects sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
9.464 100.00 <Total>
|
|
4.547 48.05 Process 1, Thread 3
|
|
4.414 46.64 Process 1, Thread 2
|
|
0.502 5.31 Process 1, Thread 1
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The first line gives the total CPU time accumulated over the threads
|
|
selected. This is followed by the metric value(s) for each thread.
|
|
|
|
From this it is clear that the main thread is responsible for a
|
|
little over 5% of the total CPU time, while the other two threads
|
|
take 47-48% each.
|
|
|
|
This view is ideally suited to verify if there are any load balancing
|
|
issues and also to find the most time consuming thread(s).
|
|
|
|
@IndexSubentry{Filters, Thread selection}
|
|
While useful, often more information than this is needed. This is
|
|
@IndexSubentry{Options, @code{-thread_select}}
|
|
@IndexSubentry{Commands, @code{thread_select}}
|
|
where the thread selection filter comes in. Through the
|
|
@command{thread_select}
|
|
command, one or more threads may be selected.
|
|
@xref{The Selection List} how to define the selection list.
|
|
|
|
Since it is most common to use this command in a script, we do so as
|
|
well here. Below the script we are using:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
# Define the metrics
|
|
metrics e.%totalcpu
|
|
# Limit the output to 5 lines
|
|
limit 5
|
|
# Get the function overview for thread 1
|
|
thread_select 1
|
|
functions
|
|
# Get the function overview for thread 2
|
|
thread_select 2
|
|
functions
|
|
# Get the function overview for thread 3
|
|
thread_select 3
|
|
functions
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The definition of the metrics and the output limit have been shown and
|
|
explained earlier. The new command to focus on is @command{thread_select}.
|
|
|
|
This command takes a list (@xref{The Selection List}) to select specific
|
|
threads. In this case, the individual thread numbers that were
|
|
obtained earlier with the @command{thread_list} command are selected.
|
|
|
|
This restricts the output of the @command{functions} command to the thread
|
|
number(s) specified. This means that the script above shows which
|
|
function(s) each thread executes and how much CPU time they consumed.
|
|
Both the exclusive timings and their percentages are given.
|
|
|
|
Note that technically this command is a filter and persistent. The
|
|
selection remains active until changed through another thread selection
|
|
command, or when it is reset with the @samp{all} selection list.
|
|
|
|
@noindent
|
|
This is the relevant part of the output for the first thread:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 1 3
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
0.502 100.00 <Total>
|
|
0.224 44.64 init_data
|
|
0.105 20.83 erand48_r
|
|
0.073 14.48 __drand48_iterate
|
|
0.067 13.29 drand48
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
As usual, the comment lines are echoed. This is followed by a confirmation
|
|
of the selection. The first table shows that one experiment is loaded and
|
|
that thread 1 out of the three threads has been selected. What is
|
|
displayed next is the function overview for this particular thread. Due to
|
|
the @code{limit 5} command, there are only five functions in this list.
|
|
|
|
Clearly, this thread handles the data initialization part and as we know
|
|
from the call tree output, function @code{init_data} executes the 3 other
|
|
functions shown in this profile.
|
|
|
|
Below are the overviews for threads 2 and 3 respectively. It is seen that all
|
|
of the CPU time is spent in function @code{mxv_core} and that this time
|
|
is approximately the same for both threads.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
# Get the function overview for thread 2
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 2 3
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
4.414 100.00 <Total>
|
|
4.414 100.00 mxv_core
|
|
0. 0. <static>@0x48630 (<libgp-collector.so>)
|
|
0. 0. driver_mxv
|
|
0. 0. start_thread
|
|
|
|
# Get the function overview for thread 3
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 3 3
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
4.547 100.00 <Total>
|
|
4.547 100.00 mxv_core
|
|
0. 0. <static>@0x48630 (<libgp-collector.so>)
|
|
0. 0. driver_mxv
|
|
0. 0. start_thread
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
When analyzing the performance of a multithreaded application, it is sometimes
|
|
useful to know whether threads have mostly executed on the same core, say, or
|
|
if they have wandered across multiple cores. This sort of stickiness is
|
|
usually referred to as
|
|
@cindex Thread affinity
|
|
@emph{thread affinity}.
|
|
|
|
Similar to the commands for the threads, there are several commands related
|
|
to the usage of the cores, or @emph{CPUs} as they are called in @ToolName{}
|
|
(@xref{The Concept of a CPU in gprofng}).
|
|
|
|
@IndexSubentry{Options, @code{-cpu_list}}
|
|
@IndexSubentry{Commands, @code{cpu_list}}
|
|
Similar to the @command{thread_list} command, the @command{cpu_list} command
|
|
displays how many CPUs have been used.
|
|
@IndexSubentry{Options, @code{-cpus}}
|
|
@IndexSubentry{Commands, @code{cpus}}
|
|
The equivalent of the @command{threads} threads command, is the @command{cpus}
|
|
command, which shows the numbers of the CPUs that were used and the metric
|
|
values for each one of them. Both commands are demonstrated below.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -cpu_list -cpus mxv.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@noindent
|
|
This command produces the following output:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
+ gprofng display text -cpu_list -cpus mxv.2.thr.er
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 all 4
|
|
Objects sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
9.464 100.00 <Total>
|
|
4.414 46.64 CPU 2
|
|
2.696 28.49 CPU 0
|
|
1.851 19.56 CPU 1
|
|
0.502 5.31 CPU 3
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The first table shows that there is only one experiment and that all of the
|
|
four CPUs have been selected. The second table shows the exclusive metrics
|
|
for each of the CPUs that have been used.
|
|
|
|
As also echoed in the output, the data is sorted with respect to the
|
|
exclusive CPU time, but it is very easy to sort the data by the CPU id
|
|
@IndexSubentry{Options, -sort}
|
|
@IndexSubentry{Commands, sort}
|
|
by using the @command{sort} command:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -cpu_list -sort name -cpus mxv.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@noindent
|
|
With the @command{sort} added, the output is as follows:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 all 4
|
|
Current Sort Metric: Name ( name )
|
|
Objects sorted by metric: Name
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
9.464 100.00 <Total>
|
|
2.696 28.49 CPU 0
|
|
1.851 19.56 CPU 1
|
|
4.414 46.64 CPU 2
|
|
0.502 5.31 CPU 3
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
While the table with thread times shown earlier may point at a load imbalance
|
|
in the application, this overview has a different purpose.
|
|
|
|
For example, we see that 4 CPUs have been used, but we know that the
|
|
application uses 3 threads only.
|
|
We will now demonstrate how filters can be used to help answer the
|
|
question why 4 CPUs are used, while the application has 3 threads only.
|
|
This means that at least one thread has executed on more than one CPU.
|
|
|
|
Recall the thread level timings:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
9.464 100.00 <Total>
|
|
4.547 48.05 Process 1, Thread 3
|
|
4.414 46.64 Process 1, Thread 2
|
|
0.502 5.31 Process 1, Thread 1
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
Compared to the CPU timings above, it seems very likely that thread 3 has
|
|
used more than one CPU, because the thread and CPU timings are the same
|
|
for both other threads.
|
|
|
|
The command below selects thread number 3 and then requests the CPU
|
|
utilization for this thread:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -thread_select 3 -sort name -cpus mxv.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The output shown below confirms that thread 3 is selected and then displays
|
|
the CPU(s) that have been used by this thread:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 3 3
|
|
|
|
Objects sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
4.547 100.00 <Total>
|
|
2.696 59.29 CPU 0
|
|
1.851 40.71 CPU 1
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The results show that this thread has used CPU 0 nearly 60% of the time
|
|
and CPU 1 for the remaining 40%.
|
|
|
|
To confirm that this is the only thread that has used more than one CPU, the
|
|
same approach can be used for threads 1 and 2:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
$ gprofng display text -thread_select 1 -cpus mxv.2.thr.er
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 1 3
|
|
Objects sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
0.502 100.00 <Total>
|
|
0.502 100.00 CPU 3
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@smallexample
|
|
@verbatim
|
|
$ gprofng display text -thread_select 2 -cpus mxv.2.thr.er
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 2 3
|
|
Objects sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
4.414 100.00 <Total>
|
|
4.414 100.00 CPU 2
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@noindent
|
|
The output above shows that indeed threads 1 and 2 each have used a single
|
|
CPU only.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node View Multiple Experiments
|
|
@section View Multiple Experiments
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
One thing we did not cover sofar is that @ToolName{} fully supports the
|
|
analysis of multiple experiments. The @DisplayText{} tool accepts a list of
|
|
experiments. The data can either be aggregated across the experiments, or
|
|
used in a comparison.
|
|
|
|
The default is to aggregate the metric values across the experiments that have
|
|
been loaded. The @command{compare} command can be used to enable the
|
|
@IndexSubentry{Options, @code{-compare}}
|
|
@IndexSubentry{Commands, @code{compare}}
|
|
comparison of results.
|
|
|
|
In this section both modes are illustrated with an example.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Aggregation of Experiments
|
|
@subsection Aggregation of Experiments
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
If the data for multiple experiments is aggregrated, the @DisplayText{} tool
|
|
shows the combined results.
|
|
For example, below is the script to show the function view for the data
|
|
aggregated over two experiments, drop the first experiment and then show
|
|
the function view for the second experiment only.
|
|
We will call it @file{my-script-agg}.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
# Define the metrics
|
|
metrics e.%totalcpu
|
|
# Limit the output to 5 lines
|
|
limit 5
|
|
# Get the list with experiments
|
|
experiment_list
|
|
# Get the function overview for all
|
|
functions
|
|
# Drop the first experiment
|
|
drop_exp mxv.2.thr.er
|
|
# Get the function overview for exp #2
|
|
functions
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@IndexSubentry{Options, @code{-experiment_list}}
|
|
@IndexSubentry{Commands, @code{experiment_list}}
|
|
With the exception of the @command{experiment_list} command, all commands
|
|
used have been discussed earlier.
|
|
|
|
The @command{experiment_list} command provides a list of the experiments
|
|
that have been loaded. This may be used to get the experiment IDs and
|
|
to verify the correct experiments are loaded for the aggregation.
|
|
|
|
@noindent
|
|
Below is an example that loads two experiments and uses the above
|
|
script to display different function views.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -script my-script-agg mxv.2.thr.er mxv.4.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@noindent
|
|
This produces the following output:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
# Define the metrics
|
|
Current metrics: e.%totalcpu:name
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
|
|
# Limit the output to 5 lines
|
|
Print limit set to 5
|
|
# Get the list with experiments
|
|
ID Sel PID Experiment
|
|
== === ======= ============
|
|
1 yes 1339450 mxv.2.thr.er
|
|
2 yes 3579561 mxv.4.thr.er
|
|
# Get the function overview for all
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
20.567 100.00 <Total>
|
|
19.553 95.07 mxv_core
|
|
0.474 2.30 init_data
|
|
0.198 0.96 erand48_r
|
|
0.149 0.72 drand48
|
|
|
|
# Drop the first experiment
|
|
Experiment mxv.2.thr.er has been dropped
|
|
# Get the function overview for exp #2
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
11.104 100.00 <Total>
|
|
10.592 95.39 mxv_core
|
|
0.249 2.24 init_data
|
|
0.094 0.84 erand48_r
|
|
0.082 0.74 drand48
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The first five lines should look familiar. The five lines following echo
|
|
the comment line in the script and show the overview of the experiments.
|
|
This confirms two experiments have been loaded and that both are active.
|
|
This is followed by the function overview. The timings have been summed
|
|
up and the percentages are adjusted accordingly.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Comparison of Experiments
|
|
@subsection Comparison of Experiments
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The support for multiple experiments really shines in comparison mode.
|
|
@cindex Compare experiments
|
|
In comparison mode, the data for the various experiments is shown side by
|
|
side, as illustrated below where we compare the results for the multithreaded
|
|
experiments using two and four threads respectively.
|
|
|
|
This
|
|
feature is controlled through the
|
|
@IndexSubentry{Options, @code{-compare}}
|
|
@IndexSubentry{Commands, @code{compare}}
|
|
@code{compare} command.
|
|
|
|
The comparison mode is enabled through @command{compare on} and with
|
|
@command{compare off} it is disabled again.
|
|
In addition to @samp{on}, or @samp{off}, this command also supports
|
|
the @samp{delta} and @samp{ratio} keywords.
|
|
|
|
This is the script that will be used in our example. It sets the comparison
|
|
mode to @samp{on}:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
# Define the metrics
|
|
metrics e.%totalcpu
|
|
# Limit the output to 5 lines
|
|
limit 5
|
|
# Set the comparison mode to differences
|
|
compare on
|
|
# Get the function overview
|
|
functions
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
Assuming this script file is called @file{my-script-comp}, this is how
|
|
it is used to display the differences:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -script my-script-comp mxv.2.thr.er mxv.4.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@noindent
|
|
This produces the output shown below. The data for the first experiment
|
|
is shown as absolute numbers. The timings for the other experiment are
|
|
shown as a delta relative to these reference numbers:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
|
|
mxv.2.thr.er mxv.4.thr.er
|
|
Excl. Total Excl. Total Name
|
|
CPU CPU
|
|
sec. % sec. %
|
|
9.464 100.00 11.104 100.00 <Total>
|
|
8.961 94.69 10.592 95.39 mxv_core
|
|
0.224 2.37 0.249 2.24 init_data
|
|
0.105 1.11 0.094 0.84 erand48_r
|
|
0.073 0.77 0.060 0.54 __drand48_iterate
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
This table is already helpful to more easily compare (two) profiles, but
|
|
there is more that we can do here.
|
|
|
|
By default, in comparison mode, all measured values are shown. Often
|
|
profiling is about comparing performance data. It is therefore
|
|
sometimes more useful to look at differences or ratios, using one
|
|
experiment as a reference.
|
|
|
|
The values shown are relative to this difference. For example if a ratio
|
|
is below one, it means the reference value was higher.
|
|
|
|
In the example below, we use the same two experiments used in the comparison
|
|
above. The script is also nearly identical. The only change is that we now
|
|
use the @samp{delta} keyword.
|
|
|
|
As before, the number of lines is restricted to 5 and we focus on
|
|
the exclusive timings plus percentages. For the comparison part we are
|
|
interested in the differences.
|
|
|
|
This is the script that produces such an overview:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
# Define the metrics
|
|
metrics e.%totalcpu
|
|
# Limit the output to 5 lines
|
|
limit 5
|
|
# Set the comparison mode to differences
|
|
compare delta
|
|
# Get the function overview
|
|
functions
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
Assuming this script file is called @file{my-script-comp2}, this is how we
|
|
get the table displayed on our screen:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -script my-script-comp2 mxv.2.thr.er mxv.4.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
Leaving out some of the lines printed, but we have seen before, we get
|
|
the following table:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
mxv.2.thr.er mxv.4.thr.er
|
|
Excl. Total Excl. Total Name
|
|
CPU CPU
|
|
sec. % delta %
|
|
9.464 100.00 +1.640 100.00 <Total>
|
|
8.961 94.69 +1.631 95.39 mxv_core
|
|
0.224 2.37 +0.025 2.24 init_data
|
|
0.105 1.11 -0.011 0.84 erand48_r
|
|
0.073 0.77 -0.013 0.54 __drand48_iterate
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
It is now easier to see that the CPU times for the most time consuming
|
|
functions in this code are practically the same.
|
|
|
|
It is also possible to show ratio's through the @command{compare ratio}
|
|
@IndexSubentry{Options, @code{-compare}}
|
|
@IndexSubentry{Commands, @code{compare}}
|
|
command. The first colum is used as a reference and the values for
|
|
the other columns with metrics are derived by dividing the value by
|
|
the reference. The result for such a comparison is shown below:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
mxv.2.thr.er mxv.4.thr.er
|
|
Excl. Total Excl. Total CPU Name
|
|
CPU
|
|
sec. % ratio %
|
|
9.464 100.00 x 1.173 100.00 <Total>
|
|
8.961 94.69 x 1.182 95.39 mxv_core
|
|
0.224 2.37 x 1.111 2.24 init_data
|
|
0.105 1.11 x 0.895 0.84 erand48_r
|
|
0.073 0.77 x 0.822 0.54 __drand48_iterate
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
Note that the comparison feature is supported at the function, source, and
|
|
disassembly level. There is no practical limit on the number of experiments
|
|
that can be used in a comparison.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Profile Hardware Event Counters
|
|
@section Profile Hardware Event Counters
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
Many processors provide a set of hardware event counters and @ToolName{}
|
|
provides support for this feature.
|
|
@xref{Hardware Event Counters Explained} for those readers that are not
|
|
familiar with such counters and like to learn more.
|
|
|
|
In this section we explain how to get the details on the event counter
|
|
support for the processor used in the experiment(s), and show several
|
|
examples.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Getting Information on the Counters Supported
|
|
@subsection Getting Information on the Counters Supported
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The first step is to check if the processor used for the experiments is
|
|
supported by @ToolName{}.
|
|
@IndexSubentry{Options, @code{-h}}
|
|
The @code{-h} option on @CollectApp{} will show the event counter
|
|
information:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng collect app -h
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
In case the counters are supported, a list with the events is printed.
|
|
Otherwise, a warning message will be issued.
|
|
|
|
For example, below we show this command and the output on an Intel Xeon
|
|
Platinum 8167M (aka ``Skylake'') processor. The output has been split
|
|
into several sections and each section is commented upon separately.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Run "gprofng collect app --help" for a usage message.
|
|
|
|
Specifying HW counters on `Intel Arch PerfMon v2 on Family 6 Model 85'
|
|
(cpuver=2499):
|
|
|
|
-h {auto|lo|on|hi}
|
|
turn on default set of HW counters at the specified rate
|
|
-h <ctr_def> [-h <ctr_def>]...
|
|
-h <ctr_def>[,<ctr_def>]...
|
|
specify HW counter profiling for up to 4 HW counters
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The first line shows how to get a usage overview. This is followed by
|
|
some information on the target processor.
|
|
The next five lines explain in what ways the @code{-h} option can be
|
|
used to define the events to be monitored.
|
|
|
|
The first version shown above enables a default set of counters. This
|
|
default depends on the processor this command is executed on. The
|
|
keyword following the @code{-h} option defines the sampling rate:
|
|
|
|
@table @code
|
|
|
|
@item auto
|
|
Match the sample rate of used by clock profiling. If the latter is disabled,
|
|
Use a per thread sampling rate of approximately 100 samples per second.
|
|
This setting is the default and preferred.
|
|
|
|
@item on
|
|
Use a per thread sampling rate of approximately 100 samples per second.
|
|
|
|
@item lo
|
|
Use a per thread sampling rate of approximately 10 samples per second.
|
|
|
|
@item hi
|
|
Use a per thread sampling rate of approximately 1000 samples per second.
|
|
|
|
@end table
|
|
|
|
The second and third variant define the events to be monitored. Note
|
|
that the number of simultaneous events supported is printed. In this
|
|
case we can monitor four events in a single profiling job.
|
|
|
|
It is a matter of preference whether you like to use the @code{-h}
|
|
option for each event, or use it once, followed by a comma separated
|
|
list.
|
|
|
|
There is one slight catch though. The counter definition below has
|
|
mandatory comma (@code{,}) between the event and the rate. While a
|
|
default can be used for the rate, the comma cannot be omitted.
|
|
This may result in a somewhat awkward counter definition in case
|
|
the default sampling rate is used.
|
|
|
|
For example, the following two commands are equivalent. Note
|
|
the double comma in the second command. This is not a typo.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng collect app -h cycles -h insts ...
|
|
$ gprofng collect app -h cycles,,insts ...
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
In the first command this comma is not needed, because a
|
|
comma (``@code{,}'') immediately followed by white space may
|
|
be omitted.
|
|
|
|
This is why we prefer the this syntax and in the remainder will
|
|
use the first version of this command.
|
|
|
|
@IndexSubentry{Hardware event counters, counter definition}
|
|
The counter definition takes an event name, plus optionally one or
|
|
more attributes, followed by a comma, and optionally the sampling rate.
|
|
The output section below shows the formal definition.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
<ctr_def> == <ctr>[[~<attr>=<val>]...],[<rate>]
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The printed help then explains this syntax. Below we have summarized
|
|
and expanded this output:
|
|
|
|
@table @code
|
|
|
|
@item @var{<ctr>}
|
|
The counter name must be selected from the available counters listed
|
|
as part of the output printed with the @code{-h} option.
|
|
On most systems, if a counter is not listed, it may still be specified
|
|
by its numeric value.
|
|
|
|
@item @var{~<attr>=<val>}
|
|
This is an optional attribute that depends on the processor. The list
|
|
of supported attributes is printed in the output. Examples of
|
|
attributes are ``user'', or ``system''. The value can given in decimal
|
|
or hexadecimal format.
|
|
Multiple attributes may be specified, and each must be preceded
|
|
by a ~.
|
|
|
|
@item @var{<rate>}
|
|
|
|
The sampling rate is one of the following:
|
|
|
|
@table @code
|
|
|
|
@item auto
|
|
This is the default and matches the rate used by clock profiling.
|
|
If clock profiling is disabled, use @samp{on}.
|
|
|
|
@item on
|
|
Set the per thread maximum sampling rate to ~100 samples/second
|
|
|
|
@item lo
|
|
Set the per thread maximum sampling rate to ~10 samples/second
|
|
|
|
@item hi
|
|
Set the per thread maximum sampling rate to ~1000 samples/second
|
|
|
|
@item @var{<interval>}
|
|
Define the sampling interval.
|
|
@xref{Control the Sampling Frequency} how to define this.
|
|
|
|
@end table
|
|
|
|
@end table
|
|
|
|
After the section with the formal definition of events and counters, a
|
|
processor specific list is displayed. This part starts with an overview
|
|
of the default set of counters and the aliased names supported
|
|
@emph{on this specific processor}.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Default set of HW counters:
|
|
|
|
-h cycles,,insts,,llm
|
|
|
|
Aliases for most useful HW counters:
|
|
|
|
alias raw name type units regs description
|
|
|
|
cycles unhalted-core-cycles CPU-cycles 0123 CPU Cycles
|
|
insts instruction-retired events 0123 Instructions Executed
|
|
llm llc-misses events 0123 Last-Level Cache Misses
|
|
br_msp branch-misses-retired events 0123 Branch Mispredict
|
|
br_ins branch-instruction-retired events 0123 Branch Instructions
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@noindent
|
|
The definitions given above may or may not be available on other processors.
|
|
|
|
The table above shows the default set of counters defined for this processor,
|
|
and the aliases. For each alias the full ``raw'' name is given, plus the
|
|
unit of the number returned by the counter (CPU cycles, or a raw count),
|
|
the hardware counter the event is allowed to be mapped onto, and a short
|
|
description.
|
|
|
|
The last part of the output contains all the events that can be monitored:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Raw HW counters:
|
|
|
|
name type units regs description
|
|
|
|
unhalted-core-cycles CPU-cycles 0123
|
|
unhalted-reference-cycles events 0123
|
|
instruction-retired events 0123
|
|
llc-reference events 0123
|
|
llc-misses events 0123
|
|
branch-instruction-retired events 0123
|
|
branch-misses-retired events 0123
|
|
ld_blocks.store_forward events 0123
|
|
ld_blocks.no_sr events 0123
|
|
ld_blocks_partial.address_alias events 0123
|
|
dtlb_load_misses.miss_causes_a_walk events 0123
|
|
dtlb_load_misses.walk_completed_4k events 0123
|
|
|
|
<many lines deleted>
|
|
|
|
l2_lines_out.silent events 0123
|
|
l2_lines_out.non_silent events 0123
|
|
l2_lines_out.useless_hwpf events 0123
|
|
sq_misc.split_lock events 0123
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
As can be seen, these names are not always easy to correlate to a specific
|
|
event of interest. The processor manual should provide more clarity on this.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Examples Using Hardware Event Counters
|
|
@subsection Examples Using Hardware Event Counters
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The previous section may give the impression that these counters are hard to
|
|
use, but as we will show now, in practice it is quite simple.
|
|
|
|
With the information from the @code{-h} option, we can easily set up our first
|
|
event counter experiment.
|
|
|
|
We start by using the default set of counters defined for our processor and we
|
|
use 2 threads:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exe=mxv-pthreads
|
|
$ m=8000
|
|
$ n=4000
|
|
$ exp=mxv.hwc.def.2.thr.er
|
|
$ gprofng collect app -O $exp -h auto ./$exe -m $m -n $n -t 2
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@IndexSubentry{Options, @code{-h}}
|
|
@IndexSubentry{Hardware event counters, @code{auto} option}
|
|
The new option here is @code{-h auto}. The @code{auto} keyword enables
|
|
hardware event counter profiling and selects the default set of counters
|
|
defined for this processor.
|
|
|
|
As before, we can display the information, but there is one practical hurdle
|
|
to take. Unless we like to view all metrics recorded, we would need to know
|
|
the names of the events that have been enabled. This is tedious and also not
|
|
portable in case we would like to repeat this experiment on another processor.
|
|
|
|
@IndexSubentry{Hardware event counters, @code{hwc} metric}
|
|
This is where the special @code{hwc} metric comes very handy. It
|
|
automatically expands to the active set of hardware event counters used
|
|
in the experiment(s).
|
|
|
|
With this, it is very easy to display the event counter values. Note that
|
|
although the regular clock based profiling was enabled, we only want to see
|
|
the counter values. We also request to see the percentages and limit the
|
|
output to the first 5 lines:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exp=mxv.hwc.def.2.thr.er
|
|
$ gprofng display text -metrics e.%hwc -limit 5 -functions $exp
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Current metrics: e.%cycles:e+%insts:e+%llm:name
|
|
Current Sort Metric: Exclusive CPU Cycles ( e.%cycles )
|
|
Print limit set to 5
|
|
Functions sorted by metric: Exclusive CPU Cycles
|
|
|
|
Excl. CPU Excl. Instructions Excl. Last-Level Name
|
|
Cycles Executed Cache Misses
|
|
sec. % % %
|
|
2.691 100.00 7906475309 100.00 122658983 100.00 <Total>
|
|
2.598 96.54 7432724378 94.01 121745696 99.26 mxv_core
|
|
0.035 1.31 188860269 2.39 70084 0.06 erand48_r
|
|
0.026 0.95 73623396 0.93 763116 0.62 init_data
|
|
0.018 0.66 76824434 0.97 40040 0.03 drand48
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
As we have seen before, the first few lines echo the settings.
|
|
This includes a list with the hardware event counters used by
|
|
default.
|
|
|
|
The table that follows makes it very easy to get an overview where the
|
|
time is spent and how many of the target events have occurred.
|
|
|
|
As before, we can drill down deeper and see the same metrics at the source
|
|
line and instruction level. Other than using @code{hwc} in the metrics
|
|
definitions, nothing has changed compared to the previous examples:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exp=mxv.hwc.def.2.thr.er
|
|
$ gprofng display text -metrics e.hwc -source mxv_core $exp
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
This is the relevant part of the output. Since the lines get very long,
|
|
we have somewhat modified the lay-out:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Excl. CPU Excl. Excl.
|
|
Cycles Instructions Last-Level
|
|
sec. Executed Cache Misses
|
|
<Function: mxv_core>
|
|
0. 0 0 32. void __attribute__ ((noinline))
|
|
mxv_core(...)
|
|
0. 0 0 33. {
|
|
0. 0 0 34. for (uint64_t i=...) {
|
|
0. 0 0 35. double row_sum = 0.0;
|
|
## 1.872 7291879319 88150571 36. for (int64_t j=0; j<n; j++)
|
|
0.725 140845059 33595125 37. row_sum += A[i][j]*b[j];
|
|
0. 0 0 38. c[i] = row_sum;
|
|
39. }
|
|
0. 0 0 40. }
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
In a smiliar way we can display the event counter values at the instruction
|
|
level. Again we have modified the lay-out due to page width limitations:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exp=mxv.hwc.def.2.thr.er
|
|
$ gprofng display text -metrics e.hwc -disasm mxv_core $exp
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Excl. CPU Excl. Excl.
|
|
Cycles Instructions Last-Level
|
|
sec. Executed Cache Misses
|
|
<Function: mxv_core>
|
|
0. 0 0 [33] 4021ba: mov 0x8(%rsp),%r10
|
|
34. for (uint64_t i=...) {
|
|
0. 0 0 [34] 4021bf: cmp %rsi,%rdi
|
|
0. 0 0 [34] 4021c2: jbe 0x37
|
|
0. 0 0 [34] 4021c4: ret
|
|
35. double row_sum = 0.0;
|
|
36. for (int64_t j=0; j<n; j++)
|
|
37. row_sum += A[i][j]*b[j];
|
|
0. 0 0 [37] 4021c5: mov (%r8,%rdi,8),%rdx
|
|
0. 0 0 [36] 4021c9: mov $0x0,%eax
|
|
0. 0 0 [35] 4021ce: pxor %xmm1,%xmm1
|
|
0.002 12804230 321394 [37] 4021d2: movsd (%rdx,%rax,8),%xmm0
|
|
0.141 60819025 3866677 [37] 4021d7: mulsd (%r9,%rax,8),%xmm0
|
|
0.582 67221804 29407054 [37] 4021dd: addsd %xmm0,%xmm1
|
|
## 1.871 7279075109 87989870 [36] 4021e1: add $0x1,%rax
|
|
0.002 12804210 80351 [36] 4021e5: cmp %rax,%rcx
|
|
0. 0 0 [36] 4021e8: jne 0xffffffffffffffea
|
|
38. c[i] = row_sum;
|
|
0. 0 0 [38] 4021ea: movsd %xmm1,(%r10,%rdi,8)
|
|
0. 0 0 [34] 4021f0: add $0x1,%rdi
|
|
0. 0 0 [34] 4021f4: cmp %rdi,%rsi
|
|
0. 0 0 [34] 4021f7: jb 0xd
|
|
0. 0 0 [35] 4021f9: pxor %xmm1,%xmm1
|
|
0. 0 0 [36] 4021fd: test %rcx,%rcx
|
|
0. 0 80350 [36] 402200: jne 0xffffffffffffffc5
|
|
0. 0 0 [36] 402202: jmp 0xffffffffffffffe8
|
|
39. }
|
|
40. }
|
|
0. 0 0 [40] 402204: ret
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
So far we have used the default settings for the event counters. It is
|
|
quite straightforward to select specific counters. For sake of the
|
|
example, let's assume we would like to count how many branch instructions
|
|
and retired memory load instructions that missed in the L1 cache have been
|
|
executed. We also want to count these events with a high resolution.
|
|
|
|
This is the command to do so:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exe=mxv-pthreads
|
|
$ m=8000
|
|
$ n=4000
|
|
$ exp=mxv.hwc.sel.2.thr.er
|
|
$ hwc1=br_ins,hi
|
|
$ hwc2=mem_load_retired.l1_miss,hi
|
|
$ gprofng collect app -O $exp -h $hwc1 -h $hwc2 $exe -m $m -n $n -t 2
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
As before, we get a table with the event counts. Due to the very
|
|
long name for the second counter, we have somewhat modified the
|
|
output.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -limit 10 -functions mxv.hwc.sel.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
Excl. Incl. Excl. Branch Excl. Name
|
|
Total Total Instructions mem_load_retired.l1_miss
|
|
CPU sec. CPU sec. Events
|
|
2.597 2.597 1305305319 4021340 <Total>
|
|
2.481 2.481 1233233242 3982327 mxv_core
|
|
0.040 0.107 19019012 9003 init_data
|
|
0.028 0.052 23023048 15006 erand48_r
|
|
0.024 0.024 19019008 9004 __drand48_iterate
|
|
0.015 0.067 11011009 2998 drand48
|
|
0.008 0.010 0 3002 _int_malloc
|
|
0.001 0.001 0 0 brk
|
|
0.001 0.002 0 0 sysmalloc
|
|
0. 0.001 0 0 __default_morecore
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@IndexSubentry{Options, @code{-compare}}
|
|
@IndexSubentry{Commands, @code{compare}}
|
|
When using event counters, the values could be very large and it is not easy
|
|
to compare the numbers. As we will show next, the @code{ratio} feature is
|
|
very useful when comparing such profiles.
|
|
|
|
To demonstrate this, we have set up another event counter experiment where
|
|
we would like to compare the number of last level cache miss and the number
|
|
of branch instructions executed when using a single thread, or two threads.
|
|
|
|
These are the commands used to generate the experiment directories:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exe=./mxv-pthreads
|
|
$ m=8000
|
|
$ n=4000
|
|
$ exp1=mxv.hwc.comp.1.thr.er
|
|
$ exp2=mxv.hwc.comp.2.thr.er
|
|
$ gprofng collect app -O $exp1 -h llm -h br_ins $exe -m $m -n $n -t 1
|
|
$ gprofng collect app -O $exp2 -h llm -h br_ins $exe -m $m -n $n -t 2
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The following script has been used to get the tables. Due to lay-out
|
|
restrictions, we have to create two tables, one for each counter.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
# Limit the output to 5 lines
|
|
limit 5
|
|
# Define the metrics
|
|
metrics name:e.llm
|
|
# Set the comparison to ratio
|
|
compare ratio
|
|
functions
|
|
# Define the metrics
|
|
metrics name:e.br_ins
|
|
# Set the comparison to ratio
|
|
compare ratio
|
|
functions
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
Note that we print the name of the function first, followed by the counter
|
|
data.
|
|
The new element is that we set the comparison mode to @code{ratio}. This
|
|
divides the data in a column by its counterpart in the reference experiment.
|
|
|
|
This is the command using this script and the two experiment directories as
|
|
input:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -script my-script-comp-counters \
|
|
mxv.hwc.comp.1.thr.er \
|
|
mxv.hwc.comp.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
By design, we get two tables, one for each counter:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Functions sorted by metric: Exclusive Last-Level Cache Misses
|
|
|
|
mxv.hwc.comp.1.thr.er mxv.hwc.comp.2.thr.er
|
|
Name Excl. Last-Level Excl. Last-Level
|
|
Cache Misses Cache Misses
|
|
ratio
|
|
<Total> 122709276 x 0.788
|
|
mxv_core 121796001 x 0.787
|
|
init_data 723064 x 1.055
|
|
erand48_r 100111 x 0.500
|
|
drand48 60065 x 1.167
|
|
|
|
Functions sorted by metric: Exclusive Branch Instructions
|
|
|
|
mxv.hwc.comp.1.thr.er mxv.hwc.comp.2.thr.er
|
|
Name Excl. Branch Excl. Branch
|
|
Instructions Instructions
|
|
ratio
|
|
<Total> 1307307316 x 0.997
|
|
mxv_core 1235235239 x 0.997
|
|
erand48_r 23023033 x 0.957
|
|
drand48 20020009 x 0.600
|
|
__drand48_iterate 17017028 x 0.882
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
A ratio less than one in the second column, means that this counter
|
|
value was smaller than the value from the reference experiment shown
|
|
in the first column.
|
|
|
|
This kind of presentation of the results makes it much easier to
|
|
quickly interpret the data.
|
|
|
|
We conclude this section with thread-level event counter overviews,
|
|
but before we go into this, there is an important metric we need to
|
|
mention.
|
|
|
|
@c -- TBD Explain <Total> for IPC
|
|
|
|
@IndexSubentry{Hardware event counters, IPC}
|
|
In case it is known how many instructions and CPU cycles have been executed,
|
|
the value for the IPC (``Instructions Per Clockycle'') can be computed.
|
|
@xref{Hardware Event Counters Explained}.
|
|
This is a derived metric that gives an indication how well the processor
|
|
is utilized. The inverse of the IPC is called CPI.
|
|
@IndexSubentry{Hardware event counters, CPI}
|
|
|
|
The @DisplayText{} command automatically computes the IPC and CPI values
|
|
if an experiment contains the event counter values for the instructions
|
|
and CPU cycles executed. These are part of the metric list and can be
|
|
displayed, just like any other metric.
|
|
|
|
@IndexSubentry{Options, @code{-metric_list}}
|
|
@IndexSubentry{Commands, @code{metric_list}}
|
|
This can be verified through the @command{metric_list} command. If we go
|
|
back to our earlier experiment with the default event counters, we get
|
|
the following result.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -metric_list mxv.hwc.def.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Current metrics: e.totalcpu:i.totalcpu:e.cycles:e+insts:e+llm:name
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
|
|
Available metrics:
|
|
Exclusive Total CPU Time: e.%totalcpu
|
|
Inclusive Total CPU Time: i.%totalcpu
|
|
Exclusive CPU Cycles: e.+%cycles
|
|
Inclusive CPU Cycles: i.+%cycles
|
|
Exclusive Instructions Executed: e+%insts
|
|
Inclusive Instructions Executed: i+%insts
|
|
Exclusive Last-Level Cache Misses: e+%llm
|
|
Inclusive Last-Level Cache Misses: i+%llm
|
|
Exclusive Instructions Per Cycle: e+IPC
|
|
Inclusive Instructions Per Cycle: i+IPC
|
|
Exclusive Cycles Per Instruction: e+CPI
|
|
Inclusive Cycles Per Instruction: i+CPI
|
|
Size: size
|
|
PC Address: address
|
|
Name: name
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
Among the other metrics, we see the new metrics for the IPC and CPI
|
|
listed.
|
|
|
|
In the script below, we use this information and add the IPC and CPI
|
|
to the metrics to be displayed. We also use a the thread filter to
|
|
display these values for the individual threads.
|
|
|
|
This is the complete script we have used. Other than a different selection
|
|
of the metrics, there are no new features.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
# Define the metrics
|
|
metrics e.insts:e.%cycles:e.IPC:e.CPI
|
|
# Sort with respect to cycles
|
|
sort e.cycles
|
|
# Limit the output to 5 lines
|
|
limit 5
|
|
# Get the function overview for all threads
|
|
functions
|
|
# Get the function overview for thread 1
|
|
thread_select 1
|
|
functions
|
|
# Get the function overview for thread 2
|
|
thread_select 2
|
|
functions
|
|
# Get the function overview for thread 3
|
|
thread_select 3
|
|
functions
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
In the metrics definition on the second line, we explicitly request the
|
|
counter values for the instructions (@code{e.insts}) and CPU cycles
|
|
(@code{e.cycles}) executed. These names can be found in output from the
|
|
@IndexSubentry{Options, @code{-metric_list}}
|
|
@IndexSubentry{Commands, @code{metric_list}}
|
|
@command{metric_list} command above.
|
|
In addition to these metrics, we also request the IPC and CPI to be shown.
|
|
|
|
@IndexSubentry{Options, @code{-limit}}
|
|
@IndexSubentry{Commands, @code{limit}}
|
|
As before, we used the @command{limit} command to control the number of
|
|
functions displayed. We then request an overview for all the threads,
|
|
followed by three sets of two commands to select a thread and display the
|
|
function overview.
|
|
|
|
The script above is used as follows:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -script my-script-ipc mxv.hwc.def.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@noindent
|
|
This script produces four tables. We list them separately below,
|
|
and have left out the additional output.
|
|
|
|
@noindent
|
|
The first table shows the accumulated values across the three
|
|
threads that have been active.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Functions sorted by metric: Exclusive CPU Cycles
|
|
|
|
Excl. Excl. CPU Excl. Excl. Name
|
|
Instructions Cycles IPC CPI
|
|
Executed sec. %
|
|
7906475309 2.691 100.00 1.473 0.679 <Total>
|
|
7432724378 2.598 96.54 1.434 0.697 mxv_core
|
|
188860269 0.035 1.31 2.682 0.373 erand48_r
|
|
73623396 0.026 0.95 1.438 0.696 init_data
|
|
76824434 0.018 0.66 2.182 0.458 drand48
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@noindent
|
|
This shows that IPC of this program is completely dominated
|
|
by function @code{mxv_core}. It has a fairly low IPC value
|
|
of 1.43.
|
|
|
|
@noindent
|
|
The next table is for thread 1 and shows the values for the
|
|
main thread.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 1 3
|
|
Functions sorted by metric: Exclusive CPU Cycles
|
|
|
|
Excl. Excl. CPU Excl. Excl. Name
|
|
Instructions Cycles IPC CPI
|
|
Executed sec. %
|
|
473750931 0.093 100.00 2.552 0.392 <Total>
|
|
188860269 0.035 37.93 2.682 0.373 erand48_r
|
|
73623396 0.026 27.59 1.438 0.696 init_data
|
|
76824434 0.018 18.97 2.182 0.458 drand48
|
|
134442832 0.013 13.79 5.250 0.190 __drand48_iterate
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@noindent
|
|
Although this thread hardly uses any CPU cycles, the overall IPC
|
|
of 2.55 is not all that bad.
|
|
|
|
@noindent
|
|
Last, we show the tables for threads 2 and 3:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 2 3
|
|
Functions sorted by metric: Exclusive CPU Cycles
|
|
|
|
Excl. Excl. CPU Excl. Excl. Name
|
|
Instructions Cycles IPC CPI
|
|
Executed sec. %
|
|
3716362189 1.298 100.00 1.435 0.697 <Total>
|
|
3716362189 1.298 100.00 1.435 0.697 mxv_core
|
|
0 0. 0. 0. 0. collector_root
|
|
0 0. 0. 0. 0. driver_mxv
|
|
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 3 3
|
|
Functions sorted by metric: Exclusive CPU Cycles
|
|
|
|
Excl. Excl. CPU Excl. Excl. Name
|
|
Instructions Cycles IPC CPI
|
|
Executed sec. %
|
|
3716362189 1.300 100.00 1.433 0.698 <Total>
|
|
3716362189 1.300 100.00 1.433 0.698 mxv_core
|
|
0 0. 0. 0. 0. collector_root
|
|
0 0. 0. 0. 0. driver_mxv
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
It is seen that both execute the same number of instructions and
|
|
take about the same number of CPU cycles. As a result, the IPC is
|
|
the same for both threads.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@c TBD @node Additional Features
|
|
@c TBD @section Additional Features
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@c TBD @node More Filtering Capabilities
|
|
@c TBD @subsection More Filtering Capabilities
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@c TBD Cover @code{samples} and @code{seconds}
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Java Profiling
|
|
@section Java Profiling
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@IndexSubentry{Options, @code{-j}}
|
|
@IndexSubentry{Java profiling, @code{-j on/off}}
|
|
The @CollectApp{} command supports Java profiling. The @code{-j on} option
|
|
can be used for this, but since this feature is enabled by default, there is
|
|
no need to set this explicitly. Java profiling may be disabled through the
|
|
@code{-j off} option.
|
|
|
|
The program is compiled as usual and the experiment directory is created
|
|
similar to what we have seen before. The only difference with a C/C++
|
|
application is that the program has to be explicitly executed by java.
|
|
|
|
For example, this is how to generate the experiment data for a Java
|
|
program that has the source code stored in file @code{Pi.java}:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ javac Pi.java
|
|
$ gprofng collect app -j on -O pi.demo.er java Pi < pi.in
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
Regarding which java is selected to generate the data, @ToolName{}
|
|
first looks for the JDK in the path set in either the
|
|
@IndexSubentry{Java profiling, @code{JDK_HOME}}
|
|
@code{JDK_HOME} environment variable, or in the
|
|
@IndexSubentry{Java profiling, @code{JAVA_PATH}}
|
|
@code{JAVA_PATH} environment variable. If neither of these variables is
|
|
set, it checks for a JDK in the search path (set in the PATH
|
|
environment variable). If there is no JDK in this path, it checks for
|
|
the java executable in @code{/usr/java/bin/java}.
|
|
|
|
In case additional options need to be passed on to the JVM, the
|
|
@IndexSubentry{Options, @code{-J}}
|
|
@IndexSubentry{Java profiling, @code{-J <string>}}
|
|
@code{-J <string>} option can be used. The string with the
|
|
option(s) has to be delimited by quotation marks in case
|
|
there is more than one argument.
|
|
|
|
The @DisplayText{} command may be used to view the performance data. There is
|
|
no need for any special options and the same commands as previously discussed
|
|
are supported.
|
|
|
|
@IndexSubentry{Options, @code{-viewmode}}
|
|
@IndexSubentry{Commands, @code{viewmode}}
|
|
@IndexSubentry{Java profiling, different view modes}
|
|
The @code{viewmode} command
|
|
@xref{The Viewmode}
|
|
is very useful to examine the call stacks.
|
|
|
|
For example, this is how one can see the native call stacks. For
|
|
lay-out purposes we have restricted the list to the first five entries:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -limit 5 -viewmode machine -calltree pi.demo.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Print limit set to 5
|
|
Viewmode set to machine
|
|
Functions Call Tree. Metric: Attributed Total CPU Time
|
|
|
|
Attr. Name
|
|
Total
|
|
CPU sec.
|
|
1.381 +-<Total>
|
|
1.171 +-Pi.calculatePi(double)
|
|
0.110 +-collector_root
|
|
0.110 | +-JavaMain
|
|
0.070 | +-jni_CallStaticVoidMethod
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@noindent
|
|
Note that the selection of the viewmode is echoed in the output.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node The gprofng Tools
|
|
@chapter The gprofng Tools
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
Several tools are included in @ProductName{}. In subsequent chapters these
|
|
are discussed in detail. Below a brief description is given, followed by an
|
|
overview of the environment variables that are supported.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Tools Overview
|
|
@section Tools Overview
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The following tools are supported by @ProductName{}:
|
|
|
|
@table @code
|
|
|
|
@item @CollectApp{}
|
|
@IndexSubentry{@code{gprofng}, @code{collect app}}
|
|
|
|
Collects the performance data and stores the results in an experiment
|
|
directory. There are many options on this tool, but quite often the
|
|
defaults are sufficient.
|
|
An experiment directory is required for the subsequent analysis of
|
|
the results.
|
|
|
|
@item @DisplayText{}
|
|
@IndexSubentry{@code{gprofng}, @code{display text}}
|
|
|
|
Generates performance reports in ASCII format. Commandline
|
|
options, and/or commands in a script file are used to control the contents
|
|
and lay-out of the generated report(s).
|
|
|
|
@item @DisplayHTML{}
|
|
@IndexSubentry{@code{gprofng}, @code{display html}}
|
|
|
|
Takes one or more experiment directories and generates a directory with
|
|
HTML files. Starting from the index.html file, the performance data
|
|
may be examined in a browser.
|
|
|
|
@item @DisplaySRC{}
|
|
@IndexSubentry{@code{gprofng}, @code{display src}}
|
|
|
|
Displays the source code, interleaved with the disassembled instructions.
|
|
|
|
@item @Archive{}
|
|
@IndexSubentry{@code{gprofng}, @code{archive}}
|
|
|
|
Archives an experiment directory by (optionally) including source code and
|
|
object files, as well as the shared libraries that have been used.
|
|
|
|
@item @GUI{}
|
|
@IndexSubentry{@code{gprofng}, @code{display gui}}
|
|
|
|
This is an optional component that can be installed in addition to the
|
|
command line gprofng tools listed above. It supports the graphical
|
|
analysis of one or more experiments that have been created using
|
|
@CollectApp{}.
|
|
|
|
The GUI part of @ProductName{} is a GNU project. This is the link to the
|
|
@url{https://savannah.gnu.org/projects/gprofng-gui, gprofng GUI page}.
|
|
This page contains more information (e.g. how to clone the repo).
|
|
There is also a
|
|
@url{https://ftp.gnu.org/gnu/gprofng-gui, tar file distribution directory}
|
|
with tar files that include everything that is needed to build and install
|
|
the GUI code. Various versions are available here.
|
|
Be aware that in order to build and use the gprofng GUI, Java needs to be
|
|
installed first. The minimum Java version required is Java 8.
|
|
|
|
@end table
|
|
|
|
@c -- A new section -----------------------------------------------------------
|
|
@node The gprofng.rc file with default settings
|
|
@section The gprofng.rc file with default settings
|
|
@c ----------------------------------------------------------------------------
|
|
The @file{gprofng.rc}
|
|
@cindex gprofng.rc
|
|
file is used to define default settings for the @DisplayText{}, @Archive{},
|
|
and @DisplaySRC{} tools, but the user can override these defaults through
|
|
local configuration settings when building and installing from the source
|
|
code.
|
|
|
|
There are three files that are checked when the tool starts up. The first
|
|
file has pre-defined settings and comes with the installation, but through
|
|
a hidden file called @file{.gprofng.rc}, the user can (re)define the defaults.
|
|
|
|
These are the locations and files that are checked upon starting the above
|
|
mentioned tools:
|
|
|
|
@enumerate
|
|
|
|
@item
|
|
The system-wide filename is called @file{gprofng.rc} and is located in
|
|
the @file{/etc} subdirectory in case an RPM was used for the installation.
|
|
|
|
If @ProductName{} has been built from the source, this file is in
|
|
subdirectory @file{etc} in the top level installation directory.
|
|
|
|
@item
|
|
The user's home directory may have a hidden file called @file{.gprofng.rc}.
|
|
|
|
@item
|
|
The directory where @DisplayText{} (or @DisplaySRC{}) is invoked from, may
|
|
have a hidden file called @file{.gprofng.rc}.
|
|
|
|
@end enumerate
|
|
|
|
The settings of each file override the settings of the file(s) read before it.
|
|
Defaults in the system-wide file are overruled by the file in the user home
|
|
directory (if any) and any settings in the @file{.gprofng.rc} file in the
|
|
current directory override those.
|
|
|
|
Note that the settings in these files only affect the defaults. Unlike
|
|
the commands used in a script file, they are not commands for the tools.
|
|
|
|
@c -- TBD indxobj_define,
|
|
@c -- TBD object_show,
|
|
@c -- TBD object_hide,
|
|
@c -- TBD object_api,
|
|
|
|
The @file{.gprofng.rc} configuration files can contain the
|
|
@command{addpath},
|
|
@command{compare},
|
|
@command{dthresh},
|
|
@command{name},
|
|
@command{pathmap},
|
|
@command{printmode},
|
|
@command{sthresh},
|
|
and
|
|
@command{viewmode}
|
|
commands as described in this user guide.
|
|
|
|
They can also contain the following commands, @emph{which cannot be used on
|
|
either the command line, or in a script file}:
|
|
|
|
@table @code
|
|
|
|
@item dmetrics @var{metric-spec}
|
|
@IndexSubentry{Commands, @code{dmetrics}}
|
|
|
|
Specify the default metrics to be displayed or printed in the function list.
|
|
The syntax and use of the metric list is described in section
|
|
@ref{Metric Definitions}.
|
|
The order of the metric keywords in the list determines the order in which
|
|
the metrics are presented.
|
|
|
|
Default metrics for the @code{callers-callees} list are derived from the
|
|
function list default metrics by adding the corresponding attributed metric
|
|
before the first occurrence of each metric name in the list.
|
|
|
|
@item dsort @var{metric-spec}
|
|
@IndexSubentry{Commands, @code{dsort}}
|
|
|
|
Specify the default metric by which the function list is sorted. The sort
|
|
metric is the first metric in this list that matches a metric in any loaded
|
|
experiment, subject to the following conditions:
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
If the entry in @var{metric-spec} has a visibility string of an exclamation
|
|
point (@samp{!}), the first metric whose name matches is used, regardless of
|
|
whether it is visible.
|
|
|
|
@item
|
|
If the entry in @var{metric-spec} has any other visibility string, the first
|
|
visible metric whose name matches is used.
|
|
|
|
@end itemize
|
|
|
|
The syntax and use of the metric list is described in section
|
|
@ref{Metric Definitions}.
|
|
The default sort metric for the @code{callers-callees} list is the attributed
|
|
metric corresponding to the default sort metric for the function list.
|
|
|
|
@item en_desc @{on | off | =@var{regex}@}
|
|
@IndexSubentry{Commands, @code{en_desc}}
|
|
|
|
Set the mode for reading descendant experiments to @samp{on} (enable all
|
|
descendants) or @samp{off} to disable all descendants. If
|
|
@samp{=}@var{regex} is used, enable data from those experiments whose
|
|
executable name matches the regular expression.
|
|
|
|
The default setting is @samp{on} to follow all descendants. In reading
|
|
experiments with descendants, any sub-experiments that contain little or
|
|
no performance data are ignored by @DisplayText{}.
|
|
|
|
@end table
|
|
|
|
@c -- A new section -----------------------------------------------------------
|
|
@node Filters
|
|
@section Filters
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
Various filter commands are supported by @DisplayText{}.
|
|
Thanks to the use of filters, the user can zoom in on a certain area of
|
|
interest. With filters, it is possible to select one or more threads to
|
|
focus on, define a window in time, select specific call stacks, etc.
|
|
@IndexSubentry{Filters, Intro}
|
|
|
|
While already powerful by themselves, filters may be combined to further
|
|
narrow down the view into the data.
|
|
|
|
@IndexSubentry{Filters, Persistence}
|
|
It is important to note that filters are @emph{persistent}. A filter is
|
|
active until it is reset. This means that successive filter commands
|
|
increasingly narrow down the view until one or more are reset.
|
|
|
|
@noindent
|
|
An example is the following:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -thread_select 1 -functions \
|
|
-cpu_select 2 -functions @dots{}
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
This command selects thread 1 and requests the function view for this thread.
|
|
The third (@command{cpu_select 2}) command @emph{adds} the
|
|
constraint that only the events on CPU 2 are to be selected. This means
|
|
that the next function view selects events that were executed by thread 1 and
|
|
have been running on CPU 2.
|
|
|
|
@noindent
|
|
In contrast with this single command line, the two commands below look similar,
|
|
but behave very differently:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -thread_select 1 -functions @dots{}
|
|
$ gprofng display text -cpu_select 2 -functions @dots{}
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The first command displays the function view for thread 1. The second
|
|
command shows the function view for CPU 2 for @emph{all} threads that have
|
|
been running on this CPU.
|
|
|
|
As the following example demonstrates, things get a little more tricky in
|
|
case a script file is used. Consider the following script file:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
thread_select 1
|
|
functions
|
|
cpu_select 2
|
|
functions
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
This script file displays the function view for thread 1 first. This is
|
|
followed by those functions that were executed by thread 1 @emph{and} have
|
|
been run on CPU 2.
|
|
|
|
If however, the script should behave like the two command line invocations
|
|
shown above, the thread selection filter needs to be reset before CPU 2 is
|
|
selected:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
thread_select 1
|
|
functions
|
|
# Reset the thread selection filter:
|
|
thread_select all
|
|
cpu_select 2
|
|
functions
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
In general, filters behave differently than commands or options. In
|
|
particular there may be an interaction between different filter definitions.
|
|
|
|
For example, as explained above, in the first script file the
|
|
@command{thread_select} and @command{cpu_select} commands interact.
|
|
|
|
For a list of all the predefined filters see @ref{Predefined Filters}.
|
|
|
|
@c -- A new section -----------------------------------------------------------
|
|
@node Supported Environment Variables
|
|
@section Supported Environment Variables
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
Various environment variables are supported. We refer to the man page for
|
|
gprofng(1) for an overview and description
|
|
(@xref{Man page for @command{gprofng}}).
|
|
|
|
@c -- A new chapter -----------------------------------------------------------
|
|
@node Performance Data Collection
|
|
@chapter Performance Data Collection
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The @CollectApp{} command is used to gather the application performance data
|
|
while the application executes.
|
|
At regular intervals, program execution is halted and the required data is
|
|
recorded.
|
|
@cindex Experiment directory
|
|
An experiment directory is created when the tool starts. This directory is
|
|
used to store the relevant information and forms the basis for a subsequent
|
|
analysis with one of the viewing tools.
|
|
|
|
@c -- A new section -----------------------------------------------------------
|
|
@node The @command{gprofng collect app} Tool
|
|
@section The @command{gprofng collect app} Tool
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
This is the command to collect the performance information for the target
|
|
application. The usage is as follows:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng collect app [OPTION(S)] TARGET [TARGET_ARGUMENTS]
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@noindent
|
|
Options to the command are passed in first. This is followed by the name of
|
|
the target, which is typically a binary executable or a script, followed by
|
|
any options that may be required by the target.
|
|
|
|
@c -- A new section -----------------------------------------------------------
|
|
@node View the Performance Information
|
|
@chapter View the Performance Information
|
|
@c ----------------------------------------------------------------------------
|
|
Various tools to view the performance data stored in one or more experiment
|
|
directories are available. In this chapter, these will all be covered in
|
|
detail.
|
|
|
|
@c -- A new section -----------------------------------------------------------
|
|
@node The @command{gprofng display text} Tool
|
|
@section The @command{gprofng display text} Tool
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
This tool displays the performance information in ASCII format. It supports
|
|
a variety of views into the data recorded.
|
|
These views can be specified in two ways and both may be used simultaneously:
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
Command line options start with a dash (@samp{-}) symbol and may take an
|
|
argument.
|
|
|
|
@item
|
|
Options may also be included in a file, the ``script file''. In this case,
|
|
the dash symbol should @emph{not} be included. Multiple script files can
|
|
be used on the same command line.
|
|
|
|
@end itemize
|
|
|
|
While they may appear as an option, they are really commands and this is
|
|
why they will be referred to as @emph{commands}
|
|
@cindex Commands
|
|
in the documentation.
|
|
|
|
As a general rule, @emph{the order of options matters} and if the same option,
|
|
or command, occurs multiple times, the rightmost setting is selected.
|
|
|
|
@c -- A new sub section -------------------------------------------------------
|
|
@node The @code{gprofng display text} Commands
|
|
@subsection The @code{gprofng display text} Commands
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The most commonly used commands are documented in the man page for this tool
|
|
(@xref{Man page for @command{gprofng display text}}). In this section we list
|
|
and describe all other commands that are supported.
|
|
|
|
@c -- A new sub subsection ----------------------------------------------------
|
|
@node Commands that List Experiment Details
|
|
@unnumberedsubsubsec Commands that List Experiment Details
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@table @code
|
|
|
|
@item experiment_ids
|
|
@ifclear man
|
|
@IndexSubentry{Options, @code{-experiment_ids}}
|
|
@IndexSubentry{Commands, @code{experiment_ids}}
|
|
@end ifclear
|
|
|
|
For each experiment that has been loaded, show the totals of the metrics
|
|
recorded, plus some other operational characteristics like the name of
|
|
the executable, PID, etc. The top line contains the accumulated totals
|
|
for the metrics.
|
|
|
|
@item experiment_list
|
|
@ifclear man
|
|
@IndexSubentry{Options, @code{-experiment_list}}
|
|
@IndexSubentry{Commands, @code{experiment_list}}
|
|
@end ifclear
|
|
|
|
Display the list of experiments that are loaded. Each experiment is listed
|
|
with an index, which is used when selecting samples, threads, or LWPs, and
|
|
a process id (PID), which can be used for advanced filtering.
|
|
|
|
@item cpu_list
|
|
@IndexSubentry{Options, @code{-cpu_list}}
|
|
@IndexSubentry{Commands, @code{cpu_list}}
|
|
|
|
Display the total number of CPUs that have been used during the experiment(s).
|
|
|
|
@item cpus
|
|
@IndexSubentry{Options, @code{-cpus}}
|
|
@IndexSubentry{Commands, @code{cpus}}
|
|
|
|
Show a list of CPUs that were used by the application, along with the metrics
|
|
that have been recorded. The CPUs are represented by a CPU number and show
|
|
the Total CPU time by default.
|
|
|
|
Note that since the data is sorted with respect to the default metric, it may
|
|
be useful to use the @command{sort name} command to show the list sorted with
|
|
respect to the CPU id.
|
|
|
|
@item GCEvents
|
|
@IndexSubentry{Options, @code{-GCEvents}}
|
|
@IndexSubentry{Commands, @code{GCEvents}}
|
|
|
|
This commands is for Java applications only. It shows any Garbage Collection
|
|
(GC) events that have occurred while the application was executing.
|
|
|
|
@item lwp_list
|
|
@IndexSubentry{Options, @code{-lwp_list}}
|
|
@IndexSubentry{Commands, @code{lwp_list}}
|
|
|
|
Displays the list of LWPs processed during the experiment(s).
|
|
|
|
@item processes
|
|
@IndexSubentry{Options, @code{-processes}}
|
|
@IndexSubentry{Commands, @code{processes}}
|
|
|
|
For each experiment that has been loaded, this command displays a list of
|
|
processes that were created by the application, along with their metrics.
|
|
The processes are represented by process ID (PID) numbers and show the
|
|
Total CPU time metric by default. If additional metrics are recorded in
|
|
an experiment, these are shown as well.
|
|
|
|
@item samples
|
|
@IndexSubentry{Options, @code{-samples}}
|
|
@IndexSubentry{Commands, @code{samples}}
|
|
|
|
Display a list of sample points and their metrics, which reflect the
|
|
microstates recorded at each sample point in the loaded experiment.
|
|
The samples are represented by sample numbers and show the Total CPU time
|
|
by default. Other metrics might also be displayed if enabled.
|
|
|
|
@item sample_list
|
|
@IndexSubentry{Options, @code{-sample_list}}
|
|
@IndexSubentry{Commands, @code{sample_list}}
|
|
|
|
For each experiment loaded, display the list of samples currently selected.
|
|
|
|
@item seconds
|
|
@IndexSubentry{Options, @code{-seconds}}
|
|
@IndexSubentry{Commands, @code{seconds}}
|
|
|
|
Show each second of the profiling run that was captured in the experiment,
|
|
along with the metrics collected in that second. The seconds view differs
|
|
from the samples view in that it shows periodic samples that occur every
|
|
second beginning at 0 and the interval cannot be changed.
|
|
|
|
The seconds view lists the seconds of execution with the Total CPU time by
|
|
default. Other metrics might also be displayed if the metrics are present
|
|
in the loaded experiments.
|
|
|
|
@item threads
|
|
@IndexSubentry{Options, @code{-threads}}
|
|
@IndexSubentry{Commands, @code{threads}}
|
|
|
|
Show a list of threads and their metrics. The threads are represented
|
|
by a process and thread pair and show the Total CPU time by default.
|
|
Other metrics might also be displayed by default if the metrics are
|
|
present in the loaded experiment.
|
|
|
|
@item thread_list
|
|
@IndexSubentry{Options, @code{-thread_list}}
|
|
@IndexSubentry{Commands, @code{thread_list}}
|
|
|
|
Display the list of threads currently selected for the analysis.
|
|
|
|
@end table
|
|
|
|
@noindent
|
|
@emph{The commands below are for use in scripts and interactive mode only.
|
|
They are not allowed on the command line.}
|
|
|
|
@table @code
|
|
|
|
@item add_exp @var{exp-name}
|
|
@IndexSubentry{Commands, @code{add_exp}}
|
|
|
|
Add the named experiment to the current session.
|
|
|
|
@item drop_exp @var{exp-name}
|
|
@IndexSubentry{Commands, @code{drop_exp}}
|
|
|
|
Drop the named experiment from the current session.
|
|
|
|
@item open_exp @var{exp-name}
|
|
@IndexSubentry{Commands, @code{open_exp}}
|
|
|
|
Drop all loaded experiments from the session, and then load the named
|
|
experiment.
|
|
|
|
@end table
|
|
|
|
@c -- A new sub subsection ----------------------------------------------------
|
|
@node Commands that Affect Listings and Output
|
|
@unnumberedsubsubsec Commands that Affect Listings and Output
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@table @code
|
|
|
|
@item dthresh @var{value}
|
|
@IndexSubentry{Options, @code{-dthresh}}
|
|
@IndexSubentry{Commands, @code{dthresh}}
|
|
|
|
Specify the threshold percentage for highlighting metrics in the annotated
|
|
disassembly code. If the value of any metric is equal to or greater than
|
|
@var{value} as a percentage of the maximum value of that metric for any
|
|
instruction line in the file, the line on which the metrics occur has a
|
|
@samp{##} marker inserted at the beginning of the line. The default is 75.
|
|
|
|
@item printmode @{text | html | @var{single-char}@}
|
|
@IndexSubentry{Options, @code{-printmode}}
|
|
@IndexSubentry{Commands, @code{printmode}}
|
|
|
|
Set the print mode. If the keyword is @code{text}, printing will be done in
|
|
tabular form using plain text. In case the @code{html} keyword is selected,
|
|
the output is formatted as an HTML table.
|
|
|
|
Alternatively, @var{single-char} may be used in a delimiter separated list,
|
|
with the single character @var{single-char} as the delimiter.
|
|
|
|
The printmode setting is used only for those commands that generate tables,
|
|
such as @command{functions}. The setting is ignored for other printing
|
|
commands, including those showing source and disassembly listings.
|
|
|
|
@item sthresh @var{value}
|
|
@IndexSubentry{Options, @code{-sthresh}}
|
|
@IndexSubentry{Commands, @code{sthresh}}
|
|
|
|
Specify the threshold percentage for highlighting metrics in the annotated
|
|
source code. If the value of any metric is equal to or greater than
|
|
@var{value} (as a percentage) of the maximum value of that metric for any
|
|
source line in the file, the line on which the metrics occur has a @samp{##}
|
|
marker inserted at the beginning of the line. The default is 75.
|
|
|
|
@end table
|
|
|
|
@c -- A new sub subsection ----------------------------------------------------
|
|
@node Predefined Filters
|
|
@unnumberedsubsubsec Predefined Filters
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The filters below use a list, the selection list, to define a sequence of
|
|
numbers. @xref{The Selection List}.
|
|
Note that this selection is persistent, but the filter can be reset by using
|
|
@samp{all} as the @var{selection-list}.
|
|
|
|
@table @code
|
|
|
|
@item cpu_select @var{selection-list}
|
|
@IndexSubentry{Options, @code{-cpu_select}}
|
|
@IndexSubentry{Commands, @code{cpu_select}}
|
|
|
|
Select the CPU ids specified in the @var{selection-list}.
|
|
|
|
@item lwp_select @var{selection-list}
|
|
@IndexSubentry{Options, @code{-lwp_select}}
|
|
@IndexSubentry{Commands, @code{lwp_select}}
|
|
|
|
Select the LWPs specified in the @var{selection-list}.
|
|
|
|
@item sample_select @var{selection-list}
|
|
@IndexSubentry{Options, @code{-sample-select}}
|
|
@IndexSubentry{Commands, @code{sample-select}}
|
|
|
|
@item thread_select @var{selection-list}
|
|
@IndexSubentry{Options, @code{-thread_select}}
|
|
@IndexSubentry{Commands, @code{thread_select}}
|
|
|
|
Select a series of threads, or just one, to be used in subsequent views.
|
|
The @var{selection-list} consists of a sequence of comma separated numbers.
|
|
This may include a range of the form @samp{n-m}.
|
|
|
|
@end table
|
|
|
|
|
|
@c -- A new sub subsection ----------------------------------------------------
|
|
@node Commands to Set and Change Search Paths
|
|
@unnumberedsubsubsec Commands to Set and Change Search Paths
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@table @code
|
|
|
|
@item addpath @var{path-list}
|
|
@IndexSubentry{Options, @code{-addpath}}
|
|
@IndexSubentry{Commands, @code{addpath}}
|
|
|
|
Append @var{path-list} to the current setpath settings. Note that multiple
|
|
@command{addpath} commands can be used in @file{.gprofng.rc} files, and will
|
|
be concatenated.
|
|
|
|
@item pathmap @var{old-prefix} @var{new-prefix}
|
|
@IndexSubentry{Options, @code{-pathmap}}
|
|
@IndexSubentry{Commands, @code{pathmap}}
|
|
|
|
If a file cannot be found using the path list set by @command{addpath}, or
|
|
the @command{setpath} command, one or more path remappings may be set with the
|
|
@command{pathmap} command.
|
|
|
|
With path mapping, the user can specify how to replace the leading component
|
|
in a full path by a different string.
|
|
|
|
With this command, any path name for a source file, object file, or shared
|
|
object that begins with the prefix specified with @var{old-prefix}, the
|
|
old prefix is replaced by the prefix specified with @var{new-prefix}.
|
|
The resulting path is used to find the file.
|
|
|
|
For example, if a source file located in directory @file{/tmp}
|
|
is shown in the @DisplayText{} output, but should instead be taken from
|
|
@file{/home/demo}, the following @file{pathmap} command redefines the
|
|
path:
|
|
|
|
@smallexample
|
|
$ gprofng diplay text -pathmap /tmp /home/demo -source ...
|
|
@end smallexample
|
|
|
|
Note that multiple @command{pathmap} commands can be supplied, and each is
|
|
tried until the file is found.
|
|
|
|
@item setpath @var{path-list}
|
|
@IndexSubentry{Options, @code{-setpath}}
|
|
@IndexSubentry{Commands, @code{setpath}}
|
|
|
|
Set the path used to find source and object files. The path is defined
|
|
through the @var{path-list} keyword. It is a colon separated list of
|
|
directories, jar files, or zip files.
|
|
If any directory has a colon character in it, escape it with a
|
|
backslash (@samp{\}).
|
|
|
|
The special directory name @code{$expts}, refers
|
|
to the set of current experiments in the order in which they were loaded.
|
|
You can abbreviate it with a single @samp{$} character.
|
|
|
|
The default path is @samp{$expts:..} which is the directories of the
|
|
loaded experiments and the current working directory.
|
|
|
|
Use @command{setpath} with no argument to display the current path.
|
|
|
|
Note that @command{setpath} commands @emph{are not allowed .gprofng.rc
|
|
configuration files}.
|
|
|
|
@end table
|
|
|
|
@c -- A new subsection --------------------------------------------------------
|
|
@c -- TBD @node Usage examples for @code{gprofng display text}
|
|
@c -- TBD @subsection Usage examples for @code{gprofng display text}
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@c -- TBD In this section we present usage examples.
|
|
|
|
@c -- A new chapter -----------------------------------------------------------
|
|
@c TBD @node The @code{gprofng display html} Tool
|
|
@c TBD @section The @code{gprofng display html} Tool
|
|
@c ----------------------------------------------------------------------------
|
|
@c TBD The options are documented in the man page for this tool. In this section we
|
|
@c TBD present usage examples.
|
|
|
|
@c -- A new chapter -----------------------------------------------------------
|
|
@c TBD @node Display Source Code
|
|
@c TBD @chapter Display Source Code
|
|
@c ----------------------------------------------------------------------------
|
|
@c TBD The options are documented in the man page for this tool. In this section we
|
|
@c TBD present usage examples.
|
|
|
|
|
|
@c -- A new chapter -----------------------------------------------------------
|
|
@c TBD @node Archive Experiment Data
|
|
@c TBD @chapter Archive Experiment Data
|
|
@c ----------------------------------------------------------------------------
|
|
@c TBD The options are documented in the man page for this tool. In this section we
|
|
@c TBD present usage examples.
|
|
|
|
@c -- A new chapter -----------------------------------------------------------
|
|
@node Terminology
|
|
@chapter Terminology
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
Throughout this manual, certain terminology specific to profiling tools,
|
|
or @ToolName{}, or even to this document only, is used. In this chapter
|
|
this terminology is explained in detail.
|
|
|
|
@menu
|
|
* The Program Counter:: What is a Program Counter?
|
|
* Inclusive and Exclusive Metrics:: An explanation of inclusive and exclusive metrics.
|
|
* Metric Definitions:: Definitions associated with metrics.
|
|
* The Viewmode:: Select the way call stacks are presented.
|
|
* The Selection List:: How to define a selection.
|
|
* Load Objects and Functions:: The components in an application.
|
|
* The Concept of a CPU in gprofng:: The definition of a CPU.
|
|
* Hardware Event Counters Explained:: What are event counters?
|
|
* apath:: Our generic definition of a path.
|
|
@end menu
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node The Program Counter
|
|
@section The Program Counter
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@cindex PC
|
|
@cindex Program Counter
|
|
The @emph{Program Counter}, or PC for short, keeps track where program
|
|
execution is. The address of the next instruction to be executed is stored
|
|
in a special purpose register in the processor, or core.
|
|
|
|
@cindex Instruction pointer
|
|
The PC is sometimes also referred to as the @emph{instruction pointer}, but
|
|
we will use Program Counter or PC throughout this document.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node Inclusive and Exclusive Metrics
|
|
@section Inclusive and Exclusive Metrics
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
In the remainder, these two concepts occur quite often and for lack of a better
|
|
place, they are explained here.
|
|
|
|
@cindex Inclusive metric
|
|
The @emph{inclusive} value for a metric includes all values that are part of
|
|
the dynamic extent of the target function. For example if function @code{A}
|
|
calls functions @code{B} and @code{C}, the inclusive CPU time for @code{A}
|
|
includes the CPU time spent in @code{B} and @code{C}.
|
|
|
|
@cindex Exclusive metric
|
|
In contrast with this, the @emph{exclusive} value for a metric is computed
|
|
by excluding the metric values used by other functions called. In our
|
|
imaginary example, the exclusive CPU time for function @code{A} is the
|
|
time spent outside calling functions @code{B} and @code{C}.
|
|
|
|
@cindex Leaf function
|
|
In case of a @emph{leaf function}, the inclusive and exclusive values for the
|
|
metric are the same since by definition, it is not calling any other
|
|
function(s).
|
|
|
|
Why do we use these two different values? The inclusive metric shows the most
|
|
expensive path, in terms of this metric, in the application. For example, if
|
|
the metric is cache misses, the function with the highest inclusive metric
|
|
tells you where most of the cache misses come from.
|
|
|
|
Within this branch of the application, the exclusive metric points to the
|
|
functions that contribute and help to identify which part(s) to consider
|
|
for further analysis.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node Metric Definitions
|
|
@section Metric Definitions
|
|
@c ----------------------------------------------------------------------------
|
|
The metrics displayed in the various views are highly customizable. In this
|
|
section it is explained how to construct the metrics definition(s).
|
|
|
|
@IndexSubentry{Options, @code{-metrics}}
|
|
@IndexSubentry{Commands, @code{metrics}}
|
|
The @command{metrics} command takes a colon (@samp{:}) separated list, where
|
|
each item in the list consists of the following three fields:
|
|
@var{<flavor>}@var{<visibility>}@var{<metric-name>}.
|
|
|
|
@cindex Flavor field
|
|
@cindex Visibility field
|
|
@cindex Metric name field
|
|
@IndexSubentry{Metrics, Flavor field}
|
|
@IndexSubentry{Metrics, Visibility field}
|
|
@IndexSubentry{Metrics, Metric name field}
|
|
The @var{<flavor>} field is either @samp{e} for ``exclusive'', and/or
|
|
@samp{i} for ``inclusive''. The @var{<metric-name>} field is the name of
|
|
the metric and the @var{<visibility>} field consists of one ore more characters
|
|
from the following table:
|
|
|
|
@table @code
|
|
|
|
@item .
|
|
Show the metric as time. This applies to timing metrics and hardware event
|
|
counters that measure cycles. Interpret as @samp{+} for other metrics.
|
|
|
|
@item %
|
|
Show the metric as a percentage of the total value for this metric.
|
|
|
|
@item +
|
|
Show the metric as an absolute value. For hardware event counters this is
|
|
the event count. Interpret as @samp{.} for timing metrics.
|
|
|
|
@item !
|
|
Do not show any metric value. Cannot be used with other visibility characters.
|
|
This visibility is meant to be used in a @command{dmetrics} command to set
|
|
default metrics that override the built-in visibility defaults
|
|
for each type of metric.
|
|
|
|
@end table
|
|
|
|
Both the @var{<flavor>} and @var{<visibility>} strings may have more than one
|
|
character. If both strings have more than one character, the @var{<flavor>}
|
|
string is expanded first. For example, @code{ie.%user} is first expanded to
|
|
@code{i.%user:e.%user}, which is then expanded into
|
|
@code{i.user:i%user:e.user:e%user}.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node The Viewmode
|
|
@section The Viewmode
|
|
|
|
@cindex Viewmode
|
|
@IndexSubentry{Options, @code{-viewmode}}
|
|
@IndexSubentry{Commands, @code{viewmode}}
|
|
|
|
There are different ways to view a call stack in Java. In @ToolName{}, this
|
|
is called the @emph{viewmode} and the setting is controlled through a command
|
|
with the same name.
|
|
|
|
The @code{viewmode} command takes one of the following keywords:
|
|
|
|
@table @code
|
|
|
|
@item user
|
|
This is the default and shows the Java call stacks for Java threads.
|
|
No call stacks for any housekeeping threads are shown. The function
|
|
list contains a function
|
|
@IndexSubentry{Java profiling, @code{<JVM-System>}}
|
|
@code{<JVM-System>} that represents the aggregated time from non-Java
|
|
threads.
|
|
When the JVM software does not report a Java call stack, time is reported
|
|
against the function
|
|
@IndexSubentry{Java profiling, @code{<no Java callstack recorded>}}
|
|
@code{<no Java callstack recorded>}.
|
|
|
|
@item expert
|
|
Show the Java call stacks for Java threads when the Java code from the
|
|
user is executed and machine call stacks when JVM code is executed, or
|
|
when the JVM software does not report a Java call stack.
|
|
Show the machine call stacks for housekeeping threads.
|
|
|
|
@item machine
|
|
Show the actual native call stacks for all threads.
|
|
|
|
@end table
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node The Selection List
|
|
@section The Selection List
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@cindex Selection list
|
|
@cindex List specification
|
|
Several commands allow the user to specify a sequence of numbers called the
|
|
@emph{selection list}. Such a list may for example be used to select specific
|
|
threads from all the threads that have been used when conducting the
|
|
experiment(s).
|
|
|
|
A selection list (or ``list'' in the remainder of this section) can be a
|
|
single number, a contiguous range of numbers with the start and end numbers
|
|
separated by a hyphen (@samp{-}), a comma-separated list of numbers and
|
|
ranges, or the @code{all} keyword that resets the filter.
|
|
@IndexSubentry{Filters, Reset to default}
|
|
Lists must not contain spaces.
|
|
|
|
Each list can optionally be preceded by an experiment list with a similar
|
|
format, separated from the list by a colon (:).
|
|
If no experiment list is included, the list applies to all experiments.
|
|
|
|
Multiple lists can be concatenated by separating the individual lists
|
|
by a plus sign.
|
|
|
|
These are some examples of various filters using a list:
|
|
|
|
@table @code
|
|
|
|
@item thread_select 1
|
|
Select thread 1 from all experiments.
|
|
|
|
@item thread_select all:1
|
|
Select thread 1 from all experiments.
|
|
|
|
@item thread_select 1:all
|
|
Select all the threads from the first experiment loaded.
|
|
|
|
@item thread_select 1:2+3:4
|
|
Select thread 2 from experiment 1 and thread 4 from experiment 3.
|
|
|
|
@item cpu_select all:1,3,5
|
|
Selects cores 1, 3, and 5 from all experiments.
|
|
|
|
@item cpu_select 1,2:all
|
|
Select all cores from experiments 1 and 2.
|
|
|
|
@end table
|
|
|
|
Recall that there are several list commands that show the mapping between the
|
|
numbers and the targets.
|
|
|
|
@IndexSubentry{Options, @code{-experiment_list}}
|
|
@IndexSubentry{Commands, @code{experiment_list}}
|
|
For example, the @command{experiment_list} command shows the name(s) of the
|
|
experiment(s) loaded and the associated number. In this example it is used
|
|
to get this information for a range of experiments:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -experiment_list mxv.?.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@noindent
|
|
This is the output, showing for each experiment the ID, the PID, and the name:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
ID Sel PID Experiment
|
|
== === ======= ============
|
|
1 yes 2750071 mxv.1.thr.er
|
|
2 yes 1339450 mxv.2.thr.er
|
|
3 yes 3579561 mxv.4.thr.er
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node Load Objects and Functions
|
|
@section Load Objects and Functions
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
An application consists of various components. The source code files are
|
|
compiled into object files. These are then glued together at link time to form
|
|
the executable.
|
|
During execution, the program may also dynamically load objects.
|
|
|
|
@cindex Load object
|
|
A @emph{load object} is defined to be an executable, or shared object. A
|
|
shared library is an example of a load object in @ToolName{}.
|
|
|
|
Each load object, contains a text section with the instructions generated by
|
|
the compiler, a data section for data, and various symbol tables.
|
|
All load objects must contain an
|
|
@cindex ELF
|
|
ELF
|
|
symbol table, which gives the names and addresses of all the globally known
|
|
functions in that object.
|
|
|
|
Load objects compiled with the -g option contain additional symbolic
|
|
information that can augment the ELF symbol table and provide information
|
|
about functions that are not global, additional information about object
|
|
modules from which the functions came, and line number information relating
|
|
addresses to source lines.
|
|
|
|
The term
|
|
@cindex Function
|
|
@emph{function}
|
|
is used to describe a set of instructions that represent a high-level operation
|
|
described in the source code. The term also covers methods as used in C++
|
|
and in the Java programming language.
|
|
|
|
In the @ToolName{} context, functions are provided in source code format.
|
|
Normally their names appear in the symbol table representing a set of
|
|
addresses.
|
|
@cindex Program Counter
|
|
@cindex PC
|
|
If the Program Counter (PC) is within that set, the program is executing within
|
|
that function.
|
|
|
|
In principle, any address within the text segment of a load object can be
|
|
mapped to a function. Exactly the same mapping is used for the leaf PC and
|
|
all the other PCs on the call stack.
|
|
|
|
Most of the functions correspond directly to the source model of the program,
|
|
but there are exceptions. This topic is however outside of the scope of this
|
|
guide.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node The Concept of a CPU in gprofng
|
|
@section The Concept of a CPU in gprofng
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@cindex CPU
|
|
In @ProductName{}, there is the concept of a CPU. Admittedly, this is not the
|
|
best word to describe what is meant here and may be replaced in the future.
|
|
|
|
The word CPU is used in many of the displays.
|
|
In the context of @ProductName{}, it is meant to denote a part of the
|
|
processor that is capable of executing instructions and with its own state,
|
|
like the program counter.
|
|
|
|
For example, on a contemporary processor, a CPU could be a core. In case
|
|
hardware threads are supported within a core, a CPU is one of those
|
|
hardware threads.
|
|
|
|
To see which CPUs have been used in the experiment, use the @command{cpu}
|
|
command in @DisplayText{}.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node Hardware Event Counters Explained
|
|
@section Hardware Event Counters Explained
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@IndexSubentry{Hardware event counters, description}
|
|
For quite a number of years now, many microprocessors have supported hardware
|
|
event counters.
|
|
|
|
On the hardware side, this means that in the processor there are one or more
|
|
registers dedicated to count certain activities, or ``events''.
|
|
Examples of such events are the number of instructions executed, or the
|
|
number of cache misses at level 2 in the memory hierarchy.
|
|
|
|
While there is a limited set of such registers, the user can map events onto
|
|
them. In case more than one register is available, this allows for the
|
|
simultaenous measurement of various events.
|
|
|
|
A simple, yet powerful, example is to simultaneously count the number of CPU
|
|
cycles and the number of instructions excuted. These two numbers can then be
|
|
used to compute the
|
|
@cindex IPC
|
|
@emph{IPC} value. IPC stands for ``Instructions Per Clockcycle'' and each
|
|
processor has an architecturally defined maximum. For example, if this
|
|
maximum number is 2, it means the processor is capable of executing two
|
|
instructions every clock cycle.
|
|
|
|
Whether this is actually achieved, depends on several factors, including the
|
|
instruction characteristics.
|
|
However, in case the IPC value is well below this maximum in a time critical
|
|
part of the application and this cannot be easily explained, further
|
|
investigation is probably warranted.
|
|
|
|
@cindex CPI
|
|
A related metric is called @emph{CPI}, or ``Clockcycles Per Instruction''.
|
|
It is the inverse of the CPI and can be compared against the theoretical
|
|
value(s) of the target instruction(s). A significant difference may point
|
|
at a bottleneck.
|
|
|
|
One thing to keep in mind is that the value returned by a counter can either
|
|
be the number of times the event occured, or a CPU cycle count. In case of
|
|
the latter it is possible to convert this number to time.
|
|
|
|
@IndexSubentry{Hardware event counters, variable CPU frequency}
|
|
This is often easier to interpret than a simple count, but there is one
|
|
caveat to keep in mind. The CPU frequency may not have been constant while
|
|
the experimen was recorded and this impacts the time reported.
|
|
|
|
These event counters, or ``counters'' for short, provide great insight into
|
|
what happens deep inside the processor. In case higher level information does
|
|
not provide the insight needed, the counters provide the information to get
|
|
to the bottom of a performance problem.
|
|
|
|
There are some things to consider though.
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
The event definitions and names vary across processors and it may even happen
|
|
that some events change with an update.
|
|
Unfortunately and this is luckily rare, there are sometimes bugs causing the
|
|
wrong count to be returned.
|
|
|
|
@IndexSubentry{Hardware event counters, alias name}
|
|
In @ToolName{}, some of the processor specific event names have an alias
|
|
name. For example @code{insts} measures the instructions executed.
|
|
These aliases not only makes it easier to identify the functionality, but also
|
|
provide portability of certain events across processors.
|
|
|
|
@item
|
|
Another complexity is that there are typically many events one can monitor.
|
|
There may up to hundreds of events available and it could require several
|
|
experiments to zoom in on the root cause of a performance problem.
|
|
|
|
@item
|
|
There may be restrictions regarding the mapping of event(s) onto the
|
|
counters. For example, certain events may be restricted to specific
|
|
counters only. As a result, one may have to conduct additional experiments
|
|
to cover all the events of interest.
|
|
|
|
@item
|
|
The names of the events may also not be easy to interpret. In such cases,
|
|
the description can be found in the architecture manual for the processor.
|
|
|
|
@end itemize
|
|
|
|
Despite these drawbacks, hardware event counters are extremely useful and
|
|
may even turn out to be indispensable.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node apath
|
|
@section What is <apath>?
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
In most cases, @ToolName{} shows the absolute pathnames of directories. These
|
|
tend to be rather long, causing display issues in this document.
|
|
|
|
Instead of wrapping these long pathnames over multiple lines, we decided to
|
|
represent them by the @code{<apath>} symbol, which stands for ``an absolute
|
|
pathname''.
|
|
|
|
Note that different occurrences of @code{<apath>} may represent different
|
|
absolute pathnames.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Other Document Formats
|
|
@chapter Other Document Formats
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@emph{This chapter is applicable when building gprofng from the
|
|
binutils source.}
|
|
|
|
This document is written in Texinfo and the source text is made available as
|
|
part of the binutils distribution. The file name is @code{gprofng.texi} and
|
|
can be found in subdirectory @code{gprofng/doc} of the top level binutils
|
|
directory.
|
|
|
|
The default installation procedure creates a file in the @code{info} format and
|
|
stores it in the documentation section of binutils.
|
|
This source file can however also be used to generate the document in the
|
|
@code{html} and @code{pdf} formats. These may be easier to read and search.
|
|
|
|
To generate this documentation file in a different format, go to the directory
|
|
that was used to build the tools. The make file to build the other formats is
|
|
in the @code{gprofng/doc} subdirectory.
|
|
|
|
For example, if you have set the build directory to be @var{<my-build-dir>},
|
|
go to subdirectory @var{<my-build-dir>/gprofng/doc}.
|
|
|
|
This subdirectory has a single file called @file{Makefile} that can be used to
|
|
build the documentation in various formats. We recommend to use these
|
|
commands.
|
|
|
|
There are four commands to generate the documentation in the @code{html} or
|
|
@code{pdf} format. It is assumed that you are in directory @code{gprofng/doc}
|
|
under the main directory @var{<my-build-dir>}.
|
|
|
|
@table @code
|
|
|
|
@item make html
|
|
Create the html file in the current directory.
|
|
|
|
@item make pdf
|
|
Create the pdf file in the current directory.
|
|
|
|
@item make install-html
|
|
Create and install the html file in the binutils documentation directory.
|
|
|
|
@item make install-pdf
|
|
Create and install the pdf file in the binutils documentation directory.
|
|
|
|
@end table
|
|
|
|
For example, to install this document in the binutils documentation directory,
|
|
the commands below may be executed. In this notation, @var{<format>}
|
|
is one of @code{html}, or @code{pdf}:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
$ cd <my-build-dir>/gprofng/doc
|
|
$ make install-<format>
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The binutils installation directory is either the default @code{/usr/local} or
|
|
the one that has been set with the @code{--prefix} option as part of the
|
|
@code{configure} command. In this example we symbolize this location with
|
|
@code{<install>}.
|
|
|
|
The documentation directory is @code{<install>/share/doc/gprofng} in case
|
|
@code{html} or @code{pdf} is selected and @code{<install>/share/info} for the
|
|
file in the @code{info} format.
|
|
|
|
@noindent
|
|
Some things to note:
|
|
|
|
@itemize
|
|
|
|
@item
|
|
For the @code{pdf} file to be generated, the @code{texi2dvi} tool is required.
|
|
@cindex texi2dvi
|
|
It is for example available as part of the @code{texinfo-tex} package.
|
|
|
|
@item
|
|
Instead of generating a single file in the @code{html} format, it is also
|
|
possible to create a directory with individual files for the various chapters.
|
|
To do so, remove the use of @code{--no-split} in variable @code{MAKEINFOHTML}
|
|
in the make file in the @code{<my-build-dir/gprofng/doc} directory.
|
|
|
|
@end itemize
|
|
|
|
@c -- An appendix -------------------------------------------------------------
|
|
@node The gprofng Man Pages
|
|
@appendix The gprofng Man Pages
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
In this appendix the man pages for the various @ProductName{} tools are listed.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@c @node gprofng driver
|
|
@node Man page for @command{gprofng}
|
|
@section Man page for @command{gprofng}
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@include gprofng.texi
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@page
|
|
@node Man page for @command{gprofng collect app}
|
|
@section Man page for @command{gprofng collect app}
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@include gprofng-collect-app.texi
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@page
|
|
@node Man page for @command{gprofng display text}
|
|
@section Man page for @command{gprofng display text}
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@include gprofng-display-text.texi
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@page
|
|
@node Man page for @command{gprofng display html}
|
|
@section Man page for @command{gprofng display html}
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@include gprofng-display-html.texi
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@page
|
|
@node Man page for @command{gprofng display src}
|
|
@section Man page for @command{gprofng display src}
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@include gprofng-display-src.texi
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@page
|
|
@node Man page for @command{gprofng archive}
|
|
@section Man page for @command{gprofng archive}
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@include gprofng-archive.texi
|
|
|
|
@ifnothtml
|
|
@node Index
|
|
@unnumbered Index
|
|
@printindex cp
|
|
@end ifnothtml
|
|
|
|
@bye
|