The taskstats Data Plugin

Collect process accounting data.

The taskstats plugin collects process accounting data. The amount of data reported and the format in which it is written are determined by the value of arg set for the taskstats plugin in the cray_rur service settings.

If arg is not set or is set to json-dict (the default), the plugin reports the following basic process accounting data, similar to that provided by UNIX process accounting or getrusage. This data is written in JSON dictionary format. If arg is set to json-list (deprecated), the same data is written in JSON list format.

These values are sums across all nodes, except for max_rss, which is the maximum value for any individual process across all nodes.

core: Set to 1 if core dump occurred
exitcode: Lists all unique exit codes
max_rss: Maximum resident set size (RSS) used by any individual process across all nodes
rchar: Characters read by process
stime: System time
utime: User time
wchar: Characters written by process
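
The aggregation rule described above (sums across nodes, with max_rss taken as a maximum) can be illustrated with a minimal Python sketch. The per-node dictionaries and the function name aggregate_basic are hypothetical and exist only for illustration; the sketch does not show how RUR itself combines data internally.

```python
def aggregate_basic(per_node):
    """Combine hypothetical per-node taskstats dicts into one summary."""
    # Counters that are summed across all nodes.
    summed = ("utime", "stime", "rchar", "wchar")
    summary = {key: sum(node[key] for node in per_node) for key in summed}

    # max_rss is the largest value seen on any node, not a sum.
    summary["max_rss"] = max(node["max_rss"] for node in per_node)

    # core is 1 if a core dump occurred anywhere.
    summary["core"] = int(any(node["core"] for node in per_node))

    # exitcode lists each unique exit code once, in order of first appearance.
    unique_codes = []
    for node in per_node:
        for code in node["exitcode"]:
            if code not in unique_codes:
                unique_codes.append(code)
    summary["exitcode"] = unique_codes
    return summary


# Made-up values for two nodes (assumption, not real RUR data).
nodes = [
    {"utime": 6000, "stime": 1000, "max_rss": 940, "rchar": 100,
     "wchar": 10, "core": 0, "exitcode": [0]},
    {"utime": 4000, "stime": 3000, "max_rss": 7336, "rchar": 50,
     "wchar": 5, "core": 1, "exitcode": [0, 139]},
]
summary = aggregate_basic(nodes)
print(summary)
```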

RUR `taskstats` output

This example shows taskstats output as written to /var/opt/cray/log/partition-current/messages-date on the SMW.

For a job that exits normally:

2017-02-02T11:09:49.457770-05:00 c0-0c1s1n2 RUR 2417 p0-20161101t153028 [RUR@34] uid: 12345, apid: 86989, jobid: 0, cmdname: /lus/tmp/rur01.2338/./CPU01-2338 plugin: taskstats {"utime": 10000000, "stime": 0, "max_rss": 940, "rchar": 107480, "wchar": 90, "exitcode:signal": ["0:0"], "core": 0}

For a job that core dumps:

2017-02-02T11:12:45.020716-05:00 c0-0c1s1n2 RUR 3731 p0-20131101t153028 [RUR@34] uid: 12345, apid: 86996, jobid: 0, cmdname: /lus/tmp/rur01.3657/./exit04-3657 plugin: taskstats {"utime": 4000, "stime": 144000, "max_rss": 7336, "rchar": 252289, "wchar": 741, "exitcode:signal": ["0:9", "139:0", "0:11", "0:0"], "core": 1}
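
Because each record ends in a JSON dictionary, a log line like the ones above can be split into its header fields and its taskstats payload with a short script. The following Python sketch assumes the header layout seen in these examples (uid, apid, jobid, cmdname, plugin); the regular expression and the function name parse_rur_line are illustrative and not part of RUR.

```python
import json
import re

# Header layout inferred from the sample lines (assumption). The comma
# before "plugin:" is optional because the samples differ on this point.
RUR_LINE = re.compile(
    r"uid: (?P<uid>\d+), apid: (?P<apid>\d+), jobid: (?P<jobid>[^,]+), "
    r"cmdname: (?P<cmdname>.+?),? plugin: (?P<plugin>\S+) "
    r"(?P<payload>\{.*\})\s*$"
)


def parse_rur_line(line):
    """Return header fields plus the decoded JSON payload, or None."""
    match = RUR_LINE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["payload"] = json.loads(record["payload"])
    return record


sample = (
    '2017-02-02T11:09:49.457770-05:00 c0-0c1s1n2 RUR 2417 '
    'p0-20161101t153028 [RUR@34] '
    'uid: 12345, apid: 86989, jobid: 0, '
    'cmdname: /lus/tmp/rur01.2338/./CPU01-2338 '
    'plugin: taskstats {"utime": 10000000, "stime": 0, "max_rss": 940, '
    '"rchar": 107480, "wchar": 90, "exitcode:signal": ["0:0"], "core": 0}'
)
record = parse_rur_line(sample)
print(record["apid"], record["payload"]["max_rss"])
```

Header values are returned as strings; only the payload is decoded, so numeric header fields such as apid need an explicit int() conversion if arithmetic is required.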

If arg is set to xpacct, the plugin also provides the following extended process accounting data, similar to the data that was collected by the deprecated Cray System Accounting (CSA).

abortinfo: If abnormal termination occurs, a list of abort_info fields is reported
apid: Application ID as defined by application launcher
bkiowait: Total delay time (ns) waiting for synchronous block I/O to complete
btime: UNIX time when process started
comm: String containing process name. May be different than the header, which is the process run by the launcher.
coremem: Integral of RSS used by process in MB-usec¹
ecode: Process exit code
etime: Total elapsed time in microseconds
gid: Group ID
jid: Job ID - the PAGG job container used on the compute node
majfault: Number of major page faults
minfault: Number of minor page faults
nice: POSIX nice value of process
nid: String containing node ID
pgswapcnt: Number of pages swapped; should be 0 on Cray compute nodes
pid: Process ID
pjid: Parent job ID - the PAGG job container on the MOM node
ppid: Parent process ID
prid: Job project ID
rcalls: Number of read system calls
rchar: Characters read by process
rss: RSS highwater mark
sched: Scheduling discipline used on node
uid: User ID
vm: Integral of virtual memory used by process in MB-usec²
wcalls: Number of write system calls
wchar: Characters written by process

RUR extended `taskstats` output

This example shows RUR extended taskstats output:

2017-02-03T10:29:38.285378-05:00 c0-0c0s1n1 RUR 24393 p1-20131018t081133 [RUR@34] uid: 12345, apid: 370583, jobid: 0, cmdname: /bin/cat, plugin: taskstats {"btime": 1386061749, "etime": 8000, "utime": 0, "stime": 4000, "coremem": 442, "max_rss": 564, "max_vm": 564, "pgswapcnt": 63, "minfault": 15, "majfault": 48, "rchar": 2608, "wchar": 686, "rcalls": 19, "wcalls": 7, "bkiowait": 1000, "exitcode:signal": [0], "core": 0}

If arg is set to xpacct, per-process, the plugin reports extended accounting data for every compute node process instead of a single summary of all processes in an application. per-process has an effect only when it is set in combination with xpacct.

CAUTION: If per-process is set and many processes are run on each node, the volume of data generated and stored on disk can become an issue.

RUR per-process `taskstats` output

This example shows RUR per-process taskstats output.

2017-02-03T13:25:34.446167-06:00 c0-0c2s0n2 RUR 7623 p3-20131202t090205 [RUR@34] uid: 12345, apid: 1560, jobid: 0, cmdname: ./it.sh, plugin: taskstats {"uid": 12345, "wcalls": 37, "pid": 2997, "vm": 16348, "jid": 395136991233, "bkiowait": 1201616, "majfault": 1, "etime": 0, "btime": 1386098731, "gid": 0, "ppid": 2992, "utime": 0, "nice": 0, "sched": 0, "nid": "92", "prid": 0, "comm": "mount", "stime": 4000, "wchar": 3465, "rss": 1028, "minfault": 352, "coremem": 1109, "ecode": 0, "rcalls": 22, "pjid": 7045, "pgswapcnt": 0, "rchar": 12208} 

2017-02-03T13:25:34.949138-06:00 c0-0c2s0n2 RUR 7623 p3-20131202t090205 [RUR@34] uid: 12345, apid: 1560, jobid: 0, cmdname: ./it.sh, plugin: taskstats {"uid": 12345, "wcalls": 0, "pid": 2998, "vm": 20268, "jid": 395136991233, "bkiowait": 0, "majfault": 0, "etime": 0, "btime": 1386098731, "gid": 0, "ppid": 2992, "utime": 0, "nice": 0, "sched": 0, "nid": "92", "prid": 0, "apid": 1560, "comm": "ls", "stime": 4000, "wchar": 0, "rss": 1040, "minfault": 360, "coremem": 3140, "ecode": 0, "rcalls": 19, "pjid": 7045, "pgswapcnt": 0, "rchar": 10629}

¹ The current memory usage is added to these counters (that is, coremem and vm) every time a tick is charged to a task's system time. At the end, each counter therefore holds memory usage multiplied by system time, and an average usage per unit of system time can be calculated from it.

² See note ¹.
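
The averaging mentioned in the notes can be sketched in Python as a simple division of the integral by the system time. This is only an illustration: it assumes that stime is reported in microseconds, matching the MB-usec unit of coremem and vm, and the function name average_mb is hypothetical. The input values are taken from the first per-process record shown earlier.

```python
def average_mb(integral_mb_usec, stime_usec):
    """Average memory (MB) per unit of charged system time.

    Returns None when no system time was charged, because the
    integral carries no usable information in that case.
    """
    if stime_usec <= 0:
        return None
    return integral_mb_usec / stime_usec


# Values from the per-process record for "mount" (pid 2997).
mount = {"coremem": 1109, "vm": 16348, "stime": 4000}

avg_rss_mb = average_mb(mount["coremem"], mount["stime"])
avg_vm_mb = average_mb(mount["vm"], mount["stime"])
print(avg_rss_mb, avg_vm_mb)
```

A record with stime equal to 0 yields None rather than a division error, which matters for short-lived processes that are never charged a system-time tick.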
