Cray Performance Measurement and Analysis Tools (CPMAT)

An overview of how Cray Performance Measurement and Analysis Tools help users optimize their code for high performance computing.

The Cray Performance Measurement and Analysis Tools (CPMAT) suite reduces the time needed to port and tune applications. It provides an integrated infrastructure for measurement, analysis, and visualization of computation, communication, I/O, and memory utilization to help users optimize programs for faster execution and more efficient computing resource usage.

The toolset allows developers to perform sampling, profiling, and tracing experiments on executables, extracting information at the program, function, loop, and line level. It supports programs written in Fortran and C/C++ (including UPC) with MPI, OpenMP, CUDA, or a combination of these programming languages and models. Profiling of applications built with the Cray and GNU compilers is supported.

Performance analysis consists of three basic steps:

Instrument the program, to specify what kind of data to collect under what conditions.
Execute the instrumented executable to generate and capture data.
Analyze the resulting data.
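
The three steps above can be sketched as a small Python helper. It only assembles the command lines and executes nothing, so it can be read or run on any machine. This is an illustration under stated assumptions, not a definitive recipe: it assumes the perftools modules are already loaded, that the site job launcher is srun, and that pat_build writes the instrumented binary as the original name with a +pat suffix. The program name my_app, the rank count, and the experiment data directory name are hypothetical placeholders.

```python
import shlex


def plan_experiment(exe, data_dir, ranks=4, launcher="srun"):
    """Return the three workflow steps as argv lists (nothing is executed).

    Assumptions: pat_build names its output "<exe>+pat"; `launcher`,
    `ranks`, and `data_dir` are site- and run-specific placeholders.
    """
    instrumented = exe + "+pat"
    return [
        ["pat_build", exe],                                  # 1. instrument
        [launcher, "-n", str(ranks), "./" + instrumented],   # 2. execute
        ["pat_report", data_dir],                            # 3. analyze
    ]


if __name__ == "__main__":
    # Print the planned commands so they can be reviewed before use.
    for step in plan_experiment("my_app", "my_app+pat+experiment"):
        print(shlex.join(step))
```

The data directory passed to pat_report is whatever the instrumented run actually creates; check the job's working directory for the real name rather than relying on the placeholder used here.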

There are three programming interfaces available:

perftools-lite-* - Simple interface that produces reports to stdout. There are four perftools-lite submodules:
- perftools-lite - Lowest overhead sampling experiment identifies key program bottlenecks.
- perftools-lite-events - Produces a summarized trace; a good tool for detailed MPI statistics, including synchronization overhead.
- perftools-lite-loops - Provides loop work estimates (must be used with CCE).
- perftools-lite-hbm - Reports memory traffic information (CCE, x86-64 systems only).
See the perftools-lite(4) manpage for details.
perftools - Advanced interface provides full-featured data collection and analysis capability, including full traces with timeline displays. It includes the following components:
- pat_build - Utility instruments programs for performance data collection.
- pat_report - After instrumenting the program with pat_build, setting runtime environment variables, and executing the program, use pat_report to generate text reports from the resulting data and to export the data for use in other applications. See the pat_report(1) manpage for details.
- CrayPat runtime library - Collects specified performance data during program execution. See the intro_craypat(1) manpage for details.
pat_run - Launches a dynamically linked program instrumented for performance analysis. Once successfully run, collected data may be explored further with the pat_report and Cray Apprentice2 tools. See the pat_run(1) manpage for details.
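
As a rough sketch of how the two lighter-weight interfaces in the list above are typically invoked, the Python helper below returns the relevant shell lines without running them. Several details are assumptions rather than facts taken from this overview: that the program is rebuilt after the chosen perftools-lite module is loaded, that pat_run is placed between the job launcher and the program, that no prior pat_build step is needed for pat_run, and that the launcher is srun. The name my_app is hypothetical.

```python
def lite_session(exe, module="perftools-lite", launcher="srun"):
    """Shell lines for a perftools-lite style run (returned, not executed).

    Assumption: the program is rebuilt after the module is loaded, and the
    report is written to stdout when the job finishes.
    """
    return [
        f"module load {module}",
        f"# rebuild {exe} here with the usual build command",
        f"{launcher} ./{exe}",
    ]


def pat_run_command(exe, launcher="srun"):
    """argv for launching a dynamically linked program under pat_run.

    Assumption: pat_run sits between the launcher and the program.
    """
    return [launcher, "pat_run", "./" + exe]


if __name__ == "__main__":
    print("\n".join(lite_session("my_app", module="perftools-lite-events")))
    print(" ".join(pat_run_command("my_app")))
```

Passing a different submodule name, such as perftools-lite-loops, changes only the module line; the compiler and platform restrictions noted in the list still apply.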

Also included:

PAPI - The PAPI library, from the Innovative Computing Laboratory at the University of Tennessee in Knoxville, is distributed with performance tools. PAPI allows applications or custom tools to interface with hardware performance counters made available by the processor, network, or accelerator vendor. Performance tools components use PAPI internally for CPU, GPU, and network performance counter collection for derived metrics, observations, and performance reporting. A simplified user interface is also provided for accessing counters; unlike calling PAPI directly, it does not require modifying source code.
Cray Apprentice2 - An interactive X Window System tool for visualizing and manipulating performance analysis data captured during program execution.
pat_view - Aggregates and presents multiple sampling experiments for program scaling analysis. See the pat_view(1) manpage for more information.
Reveal - Extends performance tools technology by combining performance statistics and program source code visualization with compiler optimization feedback to better identify and exploit parallelism, and to pinpoint memory bandwidth sensitivities in an application. Reveal lets users navigate source code to highlighted dependencies or bottlenecks during optimization. Using the program library provided by CCE and the collected performance data, the user can see which high-level loops could benefit from OpenMP parallelism or from loop-level optimizations such as exposing vector parallelism. Reveal provides dependency and variable scoping information for those loops and assists the user with creating parallel directives.

Use performance tools to:

Identify bottlenecks.
Find load-balance and synchronization issues.
Find communication overhead issues.
Identify loops for parallelization.
Map memory bandwidth utilization.
Optimize vectorization within application code.
Collect application energy consumption information.
Collect scaling information for application code.
Interpret performance data.