The energy Data Plugin
Collect compute node energy usage.
The energy plugin collects compute node energy usage data. The amount of data reported and the format in which it is written are determined by the value of arg set for the energy plugin within the cray_rur service settings.
If arg is not set or is set to json-dict (default), the plugin reports the following extended energy data, written in JSON dictionary format:
cpu_energy_used- The total energy (joules) used by each node's CPU energy domain.
This statistic is nonzero only for nodes with Intel® Xeon Phi™ ("KNL"), Xeon® Scalable ("Skylake"), or later generation processors.
error- If a Python exception occurs during the post or staging scripts, the following data is reported:
traceback- Stack frame list
type- Python exception type
value- Python exception parameter
nid- NID on which exception occurred
cname- cname on which exception occurred
memory_energy_used- The total energy (joules) used by each node's memory energy domain.
This statistic is nonzero only for nodes with KNL, Skylake, or later generation processors.
nodes- Number of nodes in job.
nodes_cpu_throttled- Number of nodes experiencing CPU power/thermal throttling.
nodes_memory_throttled- Number of nodes experiencing memory power/thermal throttling.
nodes_power_capped- Number of nodes with nonzero power cap.
nodes_throttled- Number of nodes experiencing any of the following types of throttling:
- CPU power/thermal throttling
- Memory power/thermal throttling
nodes_with_changed_power_cap- Number of nodes with power caps that changed during execution.
On nodes with accelerators, this value includes the number of accelerators with power caps that changed.
max_power_cap- Maximum nonzero power cap.
energy_used- The total energy (joules) used across all nodes.
On nodes with accelerators, this value includes accel_energy_used, the total energy used by the accelerators.
max_power_cap_count- Number of nodes with the maximum nonzero power cap.
min_power_cap- Minimum nonzero power cap.
min_power_cap_count- Number of nodes with the minimum nonzero power cap.
accel_energy_used- Total accelerator energy (joules) used.
nodes_accel_power_capped- Number of accelerators with nonzero power cap.
max_accel_power_cap- Maximum nonzero accelerator power cap.
max_accel_power_cap_count- Number of accelerators with the maximum nonzero power cap.
min_accel_power_cap- Minimum nonzero accelerator power cap.
min_accel_power_cap_count- Number of accelerators with the minimum nonzero power cap.
If arg contains the verbose option, a log per node is generated in addition to the standard summary log. The verbose logs include the following data:
cname- The cname of the node.
nid- The NID of the node.
energy_used- The total energy (joules) used on the node.
On nodes with an accelerator, this value includes accel_energy_used.
On nodes with KNL, Skylake, or later generation processors, this value includes cpu_energy_used and memory_energy_used.
cpu_energy_used- The total energy (joules) used in the node's CPU energy domain.
This statistic is nonzero only for nodes with KNL, Skylake, or later generation processors.
memory_energy_used- The total energy (joules) used in the node's memory energy domain.
This statistic is nonzero only for nodes with KNL, Skylake, or later generation processors.
cpu_throttled- Nonzero if the node experienced CPU power/thermal throttling.
memory_throttled- Nonzero if the node experienced memory power/thermal throttling.
start_power_cap- Power cap at start of execution, if set.
stop_power_cap- Power cap at end of execution, if set.
accel_energy_used- Total accelerator energy (joules) used.
start_accel_power_cap- Accelerator power cap at start of execution, if set.
stop_accel_power_cap- Accelerator power cap at end of execution, if set.
changed_power_cap- A power cap changed (includes changed accelerator power cap).
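The relationship between the per-node verbose fields and the job-level summary counters can be sketched as follows. This is an illustration only, not the plugin's implementation: the summarize function is hypothetical, the per-node records are assumed to be available as Python dictionaries keyed by the field names above, and the sample cname, nid, and energy values are made up.

```python
def summarize(per_node):
    """Derive a few summary counters from per-node verbose records.

    Hypothetical helper: assumes each record is a dict using the
    verbose field names (energy_used, cpu_throttled, memory_throttled).
    """
    return {
        "nodes": len(per_node),
        "energy_used": sum(n["energy_used"] for n in per_node),
        "nodes_cpu_throttled": sum(
            1 for n in per_node if n.get("cpu_throttled", 0)
        ),
        "nodes_memory_throttled": sum(
            1 for n in per_node if n.get("memory_throttled", 0)
        ),
        # A node counts once here even if it saw both kinds of throttling.
        "nodes_throttled": sum(
            1
            for n in per_node
            if n.get("cpu_throttled", 0) or n.get("memory_throttled", 0)
        ),
    }


# Made-up sample records for illustration only.
sample_nodes = [
    {"cname": "c0-0c0s7n0", "nid": 28, "energy_used": 300,
     "cpu_throttled": 0, "memory_throttled": 0},
    {"cname": "c0-0c0s7n1", "nid": 29, "energy_used": 310,
     "cpu_throttled": 1, "memory_throttled": 0},
    {"cname": "c0-0c0s7n2", "nid": 30, "energy_used": 290,
     "cpu_throttled": 1, "memory_throttled": 1},
]

summary = summarize(sample_nodes)
print(summary)
```

With these sample records, nodes_throttled is 2 rather than 3, because the third node is counted once although it experienced both CPU and memory throttling.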
RUR extended energy output
This example shows extended energy data as written to /var/opt/cray/log/partition-current/messages-date on the SMW:
2017-02-03T15:44:23.583598-05:00 c0-0c0s7n1 RUR 6048 p1-20160906t093257 [RUR@34] uid: 12345, apid: 18554, jobid: 0, cmdname: /bin/cat, plugin: energy {"nodes_throttled": 0, "memory_energy_used": 120,"min_accel_power_cap_count": 0, "nodes_with_changed_power_cap": 0,"max_power_cap_count": 0, "energy_used": 1214, "max_power_cap": 0,"nodes_memory_throttled": 0, "accel_energy_used": 0,"max_accel_power_cap_count": 0, "nodes_accel_power_capped": 0,"min_power_cap": 0, "max_accel_power_cap": 0, "min_power_cap_count": 0,"min_accel_power_cap": 0, "nodes_power_capped": 0, "nodes": 4, "cpu_energy_used": 752, "nodes_cpu_throttled": 0}
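A record like the one above could be post-processed with a short script. The sketch below assumes that the JSON dictionary is everything following the text "plugin: energy " on the line, as in this example; the function names parse_energy_dict and unattributed_energy are hypothetical and are not part of RUR.

```python
import json

# Log line copied from the example above (a single line in the log file).
SAMPLE = (
    '2017-02-03T15:44:23.583598-05:00 c0-0c0s7n1 RUR 6048 p1-20160906t093257 '
    '[RUR@34] uid: 12345, apid: 18554, jobid: 0, cmdname: /bin/cat, '
    'plugin: energy {"nodes_throttled": 0, "memory_energy_used": 120,'
    '"min_accel_power_cap_count": 0, "nodes_with_changed_power_cap": 0,'
    '"max_power_cap_count": 0, "energy_used": 1214, "max_power_cap": 0,'
    '"nodes_memory_throttled": 0, "accel_energy_used": 0,'
    '"max_accel_power_cap_count": 0, "nodes_accel_power_capped": 0,'
    '"min_power_cap": 0, "max_accel_power_cap": 0, "min_power_cap_count": 0,'
    '"min_accel_power_cap": 0, "nodes_power_capped": 0, "nodes": 4, '
    '"cpu_energy_used": 752, "nodes_cpu_throttled": 0}'
)


def parse_energy_dict(line):
    """Return the JSON dictionary that follows 'plugin: energy ' on the line."""
    _, found, payload = line.partition("plugin: energy ")
    if not found:
        raise ValueError("line does not contain an RUR energy record")
    return json.loads(payload)


def unattributed_energy(record):
    """Energy (joules) not accounted for by the CPU, memory, or accelerator fields."""
    domains = ("cpu_energy_used", "memory_energy_used", "accel_energy_used")
    return record["energy_used"] - sum(record.get(key, 0) for key in domains)


record = parse_energy_dict(SAMPLE)
print(record["nodes"], record["energy_used"], unattributed_energy(record))
```

For this record the script prints 4 1214 342: of the 1214 joules reported for the four nodes, 752 are attributed to the CPU domain, 120 to the memory domain, and 0 to accelerators.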
If arg is set to json-list (deprecated), the plugin reports the following, written in JavaScript Object Notation (JSON) list format:
energy_used- The total energy (joules) used across all nodes.
On nodes with accelerators, this value includes accel_energy_used, the total energy used by the accelerators.
On nodes with KNL, Skylake, or later generation processors, this value includes cpu_energy_used and memory_energy_used, the total energy used by the CPU and memory energy domains.
RUR energy output using json-list (deprecated)
This example shows default energy data as written to /var/opt/cray/log/partition-current/messages-date on the SMW:
2017-01-30T11:19:06.545114-05:00 c0-0c0s2n2 RUR 18657 p2-20130829t090349 [RUR@34] uid: 12345, apid: 10963, jobid: 0, cmdname: /opt/intel/vtune_xe_2013/bin64/amplxe-cl plugin: energy ['energy_used', 318]
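The json-list payload in this example uses single-quoted strings, so a strict JSON parser such as Python's json.loads rejects it. One way to read it is ast.literal_eval, as in the sketch below. The function name parse_energy_list is hypothetical, and treating the list as alternating names and values is an assumption based on this single example.

```python
import ast

# Log line copied from the example above (a single line in the log file).
SAMPLE = (
    "2017-01-30T11:19:06.545114-05:00 c0-0c0s2n2 RUR 18657 p2-20130829t090349 "
    "[RUR@34] uid: 12345, apid: 10963, jobid: 0, "
    "cmdname: /opt/intel/vtune_xe_2013/bin64/amplxe-cl plugin: energy "
    "['energy_used', 318]"
)


def parse_energy_list(line):
    """Return the list payload as a dict of name/value pairs."""
    _, found, payload = line.partition("plugin: energy ")
    if not found:
        raise ValueError("line does not contain an RUR energy record")
    # literal_eval accepts the single-quoted strings that json.loads rejects.
    items = ast.literal_eval(payload)
    # Assumption: the list alternates field names and values.
    return dict(zip(items[0::2], items[1::2]))


result = parse_energy_list(SAMPLE)
print(result)
```

Converting the list to a dictionary gives the same access pattern as the json-dict format, which may simplify scripts that must handle both.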