The energy Data Plugin

Collect compute node energy usage.

The energy plugin collects compute node energy usage data. The amount of data reported and the format in which it is written are determined by the value of arg set for the energy plugin within the cray_rur service settings.

If arg is not set or is set to json-dict (the default), the plugin reports the following extended energy data, written in JSON dictionary format:
cpu_energy_used
The total energy (joules) used by each node's CPU energy domain.

This statistic is nonzero only for nodes with Intel® Xeon Phi™ ("KNL"), Xeon® Scalable ("Skylake"), or later generation processors.

error
If a Python exception occurs during the post or staging scripts, the following data is reported:
traceback
Stack frame list
type
Python exception type
value
Python exception parameter
nid
NID on which exception occurred
cname
cname on which exception occurred
memory_energy_used
The total energy (joules) used by each node's memory energy domain.

This statistic is nonzero only for nodes with KNL, Skylake, or later generation processors.

nodes
Number of nodes in job.
nodes_cpu_throttled
Number of nodes experiencing CPU power/thermal throttling.
nodes_memory_throttled
Number of nodes experiencing memory power/thermal throttling.
nodes_power_capped
Number of nodes with nonzero power cap.
nodes_throttled
Number of nodes experiencing any of the following types of throttling:
  • CPU power/thermal throttling
  • Memory power/thermal throttling
nodes_with_changed_power_cap
Number of nodes with power caps that changed during execution.

On nodes with accelerators, this value includes the number of accelerators with power caps that changed.

max_power_cap
Maximum nonzero power cap.
energy_used
The total energy (joules) used across all nodes.

On nodes with accelerators, this value includes accel_energy_used, the total energy used by the accelerators.

max_power_cap_count
Number of nodes with the maximum nonzero power cap.
min_power_cap
Minimum nonzero power cap.
min_power_cap_count
Number of nodes with the minimum nonzero power cap.
On nodes with accelerators, the extended data also includes the following:
accel_energy_used
Total accelerator energy (joules) used.
nodes_accel_power_capped
Number of accelerators with nonzero power cap.
max_accel_power_cap
Maximum nonzero accelerator power cap.
max_accel_power_cap_count
Number of accelerators with the maximum nonzero power cap.
min_accel_power_cap
Minimum nonzero accelerator power cap.
min_accel_power_cap_count
Number of accelerators with the minimum nonzero power cap.
If arg contains the verbose option, a log per node is generated in addition to the standard summary log. The verbose logs include the following data:
cname
The cname of the node.
nid
The NID of the node.
energy_used
The total energy (joules) used on the node.

On nodes with an accelerator, this value includes accel_energy_used.

On nodes with KNL, Skylake, or later generation processors, this value includes cpu_energy_used and memory_energy_used.

cpu_energy_used
The total energy (joules) used in the node's CPU energy domain.

This statistic is nonzero only for nodes with KNL, Skylake, or later generation processors.

memory_energy_used
The total energy (joules) used in the node's memory energy domain.

This statistic is nonzero only for nodes with KNL, Skylake, or later generation processors.

cpu_throttled
Nonzero if the node experienced CPU power/thermal throttling.
memory_throttled
Nonzero if the node experienced memory power/thermal throttling.
start_power_cap
Power cap at start of execution, if set.
stop_power_cap
Power cap at end of execution, if set.
accel_energy_used
Total accelerator energy (joules) used.
start_accel_power_cap
Accelerator power cap at start of execution, if set.
stop_accel_power_cap
Accelerator power cap at end of execution, if set.
changed_power_cap
A power cap changed (includes changed accelerator power cap).
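
The per-node verbose fields correspond to several of the summary statistics. The following Python sketch illustrates that relationship by rolling per-node records up into summary-style counts. It assumes the verbose records have already been parsed into dictionaries, and the record values, the cnames and NIDs, and the summarize_nodes helper are all hypothetical. It also assumes the summary values are plain sums and counts over the nodes, which this document does not state explicitly.

```python
def summarize_nodes(records):
    """Roll per-node verbose records up into summary-style statistics.

    Assumption: summary values are simple sums/counts over the nodes.
    """
    def count(pred):
        return sum(1 for r in records if pred(r))

    return {
        "nodes": len(records),
        "energy_used": sum(r["energy_used"] for r in records),
        "cpu_energy_used": sum(r.get("cpu_energy_used", 0) for r in records),
        "memory_energy_used": sum(r.get("memory_energy_used", 0) for r in records),
        "nodes_cpu_throttled": count(lambda r: r.get("cpu_throttled", 0) != 0),
        "nodes_memory_throttled": count(lambda r: r.get("memory_throttled", 0) != 0),
        # A node counts once even if it was throttled in both domains.
        "nodes_throttled": count(
            lambda r: r.get("cpu_throttled", 0) != 0
            or r.get("memory_throttled", 0) != 0
        ),
    }


# Hypothetical per-node records, for illustration only.
records = [
    {"cname": "c0-0c0s7n0", "nid": 28, "energy_used": 300,
     "cpu_energy_used": 180, "memory_energy_used": 30,
     "cpu_throttled": 0, "memory_throttled": 0},
    {"cname": "c0-0c0s7n1", "nid": 29, "energy_used": 320,
     "cpu_energy_used": 200, "memory_energy_used": 32,
     "cpu_throttled": 1, "memory_throttled": 0},
]

summary = summarize_nodes(records)
print(summary)
```

Power cap statistics are left out of this sketch because nodes_with_changed_power_cap also counts accelerators, so it cannot be derived from a simple per-node count.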

RUR extended energy output

This example shows extended energy data as written to /var/opt/cray/log/partition-current/messages-date on the SMW:
2017-02-03T15:44:23.583598-05:00 c0-0c0s7n1 RUR 6048 p1-20160906t093257 [RUR@34] uid: 12345, apid: 18554, jobid: 0, cmdname: /bin/cat, plugin: energy {"nodes_throttled": 0, "memory_energy_used": 120,"min_accel_power_cap_count": 0, "nodes_with_changed_power_cap": 0,"max_power_cap_count": 0, "energy_used": 1214, "max_power_cap": 0,"nodes_memory_throttled": 0, "accel_energy_used": 0,"max_accel_power_cap_count": 0, "nodes_accel_power_capped": 0,"min_power_cap": 0, "max_accel_power_cap": 0, "min_power_cap_count": 0,"min_accel_power_cap": 0, "nodes_power_capped": 0, "nodes": 4, "cpu_energy_used": 752, "nodes_cpu_throttled": 0}
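
Because the json-dict payload is a JSON dictionary, a record such as the one above can be read with a standard JSON parser. The following Python sketch shows one possible way to do that. The parse_energy_record helper is hypothetical, and splitting the line on the text "plugin: energy " is an assumption based only on the layout of this example, not a documented record format.

```python
import json

# Sample record copied from the extended energy output example.
SAMPLE = (
    '2017-02-03T15:44:23.583598-05:00 c0-0c0s7n1 RUR 6048 p1-20160906t093257 '
    '[RUR@34] uid: 12345, apid: 18554, jobid: 0, cmdname: /bin/cat, '
    'plugin: energy {"nodes_throttled": 0, "memory_energy_used": 120,'
    '"min_accel_power_cap_count": 0, "nodes_with_changed_power_cap": 0,'
    '"max_power_cap_count": 0, "energy_used": 1214, "max_power_cap": 0,'
    '"nodes_memory_throttled": 0, "accel_energy_used": 0,'
    '"max_accel_power_cap_count": 0, "nodes_accel_power_capped": 0,'
    '"min_power_cap": 0, "max_accel_power_cap": 0, "min_power_cap_count": 0,'
    '"min_accel_power_cap": 0, "nodes_power_capped": 0, "nodes": 4, '
    '"cpu_energy_used": 752, "nodes_cpu_throttled": 0}'
)


def parse_energy_record(line):
    """Return the JSON dictionary that follows 'plugin: energy ' in a record.

    Assumption: the payload is everything after that marker text.
    """
    marker = "plugin: energy "
    _, found, payload = line.partition(marker)
    if not found:
        raise ValueError("not an energy plugin record")
    return json.loads(payload)


data = parse_energy_record(SAMPLE)

# Energy not attributed to the CPU, memory, or accelerator energy domains.
other = (
    data["energy_used"]
    - data["cpu_energy_used"]
    - data["memory_energy_used"]
    - data["accel_energy_used"]
)
print(data["nodes"], "nodes,", data["energy_used"], "J total,", other, "J other")
```

The same helper could be applied line by line to a messages file after filtering for records that contain the marker text.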

If arg is set to json-list (deprecated), the plugin reports the following, written in JavaScript Object Notation (JSON) list format:

energy_used
The total energy (joules) used across all nodes.

On nodes with accelerators, this value includes accel_energy_used, the total energy used by the accelerators.

On nodes with KNL, Skylake, or later generation processors, this value includes cpu_energy_used and memory_energy_used, the total energy used by the CPU and memory energy domains.

RUR energy output using json-list (deprecated)

This example shows json-list energy data as written to /var/opt/cray/log/partition-current/messages-date on the SMW:
2017-01-30T11:19:06.545114-05:00 c0-0c0s2n2 RUR 18657 p2-20130829t090349 [RUR@34] uid: 12345, apid: 10963, jobid: 0, cmdname: /opt/intel/vtune_xe_2013/bin64/amplxe-cl plugin: energy ['energy_used', 318]
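
The json-list payload in this example uses single-quoted strings, which a strict JSON parser rejects. The following Python sketch reads it as a Python literal instead and converts the flat list into a dictionary. The parse_energy_list helper is hypothetical, and treating the list as alternating names and values is an assumption drawn from this single example.

```python
import ast

# Sample record copied from the json-list output example.
SAMPLE_LIST = (
    "2017-01-30T11:19:06.545114-05:00 c0-0c0s2n2 RUR 18657 p2-20130829t090349 "
    "[RUR@34] uid: 12345, apid: 10963, jobid: 0, "
    "cmdname: /opt/intel/vtune_xe_2013/bin64/amplxe-cl "
    "plugin: energy ['energy_used', 318]"
)


def parse_energy_list(line):
    """Return a dict built from the list that follows 'plugin: energy '.

    Assumption: the list alternates statistic names and values.
    """
    marker = "plugin: energy "
    _, found, payload = line.partition(marker)
    if not found:
        raise ValueError("not an energy plugin record")
    # literal_eval accepts single-quoted strings and evaluates literals only.
    items = ast.literal_eval(payload.strip())
    return dict(zip(items[0::2], items[1::2]))


legacy = parse_energy_list(SAMPLE_LIST)
print(legacy["energy_used"])
```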