Cray Performance Measurement and Analysis Tools (CPMAT)

An overview of how Performance Measurement and Analysis Tools help users optimize their code for high performance computing.

The Cray Performance Measurement and Analysis Tools (CPMAT) suite reduces the time needed to port and tune applications. It provides an integrated infrastructure for measurement, analysis, and visualization of computation, communication, I/O, and memory utilization to help users optimize programs for faster execution and more efficient computing resource usage.

The toolset allows developers to perform sampling, profiling, and tracing experiments on executables, extracting information at the program, function, loop, and line level. It supports programs written in Fortran and C/C++ (including UPC) that use MPI, OpenMP, or a combination of these programming models.

Three user interfaces are available:
  • perftools-lite-* - Simple interface that produces reports to stdout. There are three perftools-lite sub-modules:
    • perftools-lite - Lowest-overhead sampling experiment; identifies key program bottlenecks.
    • perftools-lite-events - Produces a summarized trace; a good tool for detailed MPI statistics, including synchronization overhead.
    • perftools-lite-loops - Provides loop work estimates (must be used with CCE).
    See the perftools-lite(4) man page for details.
  • perftools - Advanced interface provides full-featured data collection and analysis capability, including full traces with timeline displays. It includes the following components:
    • pat_build - Utility instruments programs for performance data collection.
    • pat_report - After instrumenting the program with pat_build, setting the runtime environment variables as desired, and executing the program, use the pat_report command to generate text reports from the resulting data and to export the data for use in other applications. See the pat_report(1) man page for details.
    • CrayPat runtime library - Collects specified performance data during program execution. See the intro_craypat(1) man page for details.
  • pat_run - Launches a dynamically linked program instrumented for performance analysis. After a successful run, the collected data can be explored further with the pat_report and Cray Apprentice2 tools. See the pat_run(1) man page for details.
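A typical session with each of the three interfaces might look like the following sketch. Module, compiler-wrapper, and launcher names (cc, srun) and the my_app binary are illustrative and vary by site and programming environment; consult the perftools-lite(4), pat_build(1), pat_report(1), and pat_run(1) man pages for the authoritative options.

```shell
# perftools-lite: load before compiling; the compiler wrappers instrument
# the build, and a report is written to stdout when the job completes.
module load perftools-base perftools-lite
cc -o my_app my_app.c
srun -n 64 ./my_app            # summary report appears at end of stdout

# perftools (advanced): instrument an existing binary with pat_build,
# run it, then post-process the collected data with pat_report.
module unload perftools-lite
module load perftools
pat_build -g mpi -u my_app     # trace MPI and user functions; writes my_app+pat
srun -n 64 ./my_app+pat        # run produces an experiment data directory
pat_report my_app+pat+*        # generate a text report from that data

# pat_run: launch a dynamically linked, uninstrumented binary directly.
srun -n 64 pat_run ./my_app
```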
Also included in CPMAT:
  • PAPI - The PAPI library, from the Innovative Computing Laboratory at the University of Tennessee, Knoxville, is distributed with CPMAT. PAPI allows applications or custom tools to interface with hardware performance counters made available by the processor, network, or accelerator vendor. CPMAT components use PAPI internally to collect CPU, GPU, and network performance counters for derived metrics, observations, and performance reporting. CPMAT also provides a simplified user interface for accessing counters that, unlike direct use of PAPI, requires no source code modification.
  • Cray Apprentice2 - An interactive X Window System tool for visualizing and manipulating performance analysis data captured during program execution.
  • pat_view - Aggregates and presents multiple sampling experiments for program scaling analysis. See the pat_view(1) man page for more information.
  • Cray Reveal - Extends Cray's existing performance measurement, analysis, and visualization technology by combining performance statistics and program source code visualization with compiler optimization feedback. During the optimization phase of program development or porting, Reveal lets the user navigate through source code to highlighted dependencies or bottlenecks. Using the program library provided by CCE and the performance data collected by the CPMAT tools, the user can identify which high-level loops could benefit from OpenMP parallelism. Reveal provides dependency and variable scoping information for those loops and assists the user in creating parallel directives.
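As one sketch of the simplified counter interface, CrayPat's runtime reads counter requests from the PAT_RT_PERFCTR environment variable (see intro_craypat(1)), so no source changes are needed. The event names below are standard PAPI preset events, but which events are actually available depends on the processor; the papi_avail utility shipped with PAPI lists them.

```shell
# Request hardware counters through the CrayPat runtime; values are
# attributed to functions in the subsequent pat_report output.
export PAT_RT_PERFCTR=PAPI_TOT_CYC,PAPI_L2_DCM
srun -n 64 ./my_app+pat        # my_app+pat: a pat_build-instrumented binary

papi_avail                     # list PAPI preset events this node supports
```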
Use CPMAT utilities to:
  • Identify bottlenecks
  • Find load-balance and synchronization issues
  • Find communication overhead issues
  • Identify loops for parallelization
  • Map memory bandwidth utilization
  • Optimize vectorization within application code
  • Collect scaling information for application code
  • Interpret performance data
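For the scaling task in particular, one plausible workflow is to collect a sampling experiment at several process counts and hand the resulting experiment directories to pat_view for side-by-side comparison; the directory names below are illustrative, and the exact invocation is documented in the pat_view(1) man page.

```shell
# Compare sampling experiments from runs at different scales
# (experiment-directory names are illustrative).
pat_view my_app+pat+64pes my_app+pat+128pes my_app+pat+256pes
```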