The taskstats Data Plugin

Collect process accounting data.

The amount of data the taskstats plugin reports, and the format in which that data is written, are determined by the value of arg set for the taskstats plugin within the cray_rur service settings.

If arg is not set or is set to json-dict (the default), the plugin reports the following basic process accounting data, similar to that provided by UNIX process accounting or getrusage. This data is written in JSON dictionary format. If arg is set to json-list (deprecated), the data is written in JSON list format.

These values are sums across all nodes, except for max_rss, which is the maximum value for any individual process across all nodes.

core
Set to 1 if core dump occurred
exitcode
Lists all unique exit codes
max_rss
Maximum resident set size (RSS) used by any individual process across all nodes
rchar
Characters read by process
stime
System time
utime
User time
wchar
Characters written by process
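
The aggregation rule described above (sums across all nodes, with max_rss taken as a maximum) can be sketched as follows. This is a minimal illustration, not the plugin's own code: the function name summarize, the shape of the per-process input records, and the first-seen ordering of exit codes are all assumptions made for the example.

```python
def summarize(per_process):
    """Combine hypothetical per-process records into one summary record."""
    summed = ("utime", "stime", "rchar", "wchar")
    # Times and character counts are summed over every process.
    summary = {key: sum(rec[key] for rec in per_process) for key in summed}
    # max_rss is the one numeric field that is a maximum rather than a sum.
    summary["max_rss"] = max(rec["max_rss"] for rec in per_process)
    # Each distinct exit code is listed once (ordering is an assumption).
    summary["exitcode"] = list(dict.fromkeys(rec["exitcode"] for rec in per_process))
    # core is 1 if any process dumped core.
    summary["core"] = int(any(rec["core"] for rec in per_process))
    return summary


# Made-up input values, for illustration only.
records = [
    {"utime": 4000, "stime": 1000, "max_rss": 940, "rchar": 100, "wchar": 10, "exitcode": 0, "core": 0},
    {"utime": 6000, "stime": 2000, "max_rss": 7336, "rchar": 200, "wchar": 20, "exitcode": 139, "core": 1},
    {"utime": 1000, "stime": 0, "max_rss": 500, "rchar": 50, "wchar": 5, "exitcode": 0, "core": 0},
]
summary = summarize(records)
print(summary)
```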

RUR taskstats output

This example shows taskstats output as written to /var/opt/cray/log/partition-current/messages-date on the SMW.

For a job that exits normally:

2017-02-02T11:09:49.457770-05:00 c0-0c1s1n2 RUR 2417 p0-20161101t153028 [RUR@34] uid: 12345, apid: 86989, jobid: 0, cmdname: /lus/tmp/rur01.2338/./CPU01-2338 plugin: taskstats {"utime": 10000000, "stime": 0, "max_rss": 940, "rchar": 107480, "wchar": 90, "exitcode:signal": ["0:0"], "core": 0}

For a job that core dumps:

2017-02-02T11:12:45.020716-05:00 c0-0c1s1n2 RUR 3731 p0-20131101t153028 [RUR@34] uid: 12345, apid: 86996, jobid: 0, cmdname: /lus/tmp/rur01.3657/./exit04-3657 plugin: taskstats {"utime": 4000, "stime": 144000, "max_rss": 7336, "rchar": 252289, "wchar": 741, "exitcode:signal": ["0:9", "139:0", "0:11", "0:0"], "core": 1}
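
For post-processing, a record like the ones above can be split into its header fields and its JSON payload. The sketch below is one possible approach and not part of RUR: the regular expression is inferred only from the sample lines shown here, and the names RECORD_RE, parse_rur_line, and split_exit_status are hypothetical. It also assumes each "exitcode:signal" entry is an exit code and a signal number separated by a colon, as the key name suggests.

```python
import json
import re

# Pattern inferred from the sample lines; the comma before "plugin:" is
# optional because the samples show it both present and absent.
RECORD_RE = re.compile(
    r"uid: (?P<uid>\d+), apid: (?P<apid>\d+), jobid: (?P<jobid>[^,]+), "
    r"cmdname: (?P<cmdname>.+?),? plugin: (?P<plugin>\S+) (?P<payload>\{.*\})"
)


def parse_rur_line(line):
    """Return header fields plus the decoded JSON payload, or None if no match."""
    match = RECORD_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["data"] = json.loads(record.pop("payload"))
    return record


def split_exit_status(entry):
    """Split an 'exitcode:signal' entry such as '139:0' into two integers."""
    code, signal = entry.split(":")
    return int(code), int(signal)


line = (
    '2017-02-02T11:09:49.457770-05:00 c0-0c1s1n2 RUR 2417 p0-20161101t153028 '
    '[RUR@34] uid: 12345, apid: 86989, jobid: 0, '
    'cmdname: /lus/tmp/rur01.2338/./CPU01-2338 plugin: taskstats '
    '{"utime": 10000000, "stime": 0, "max_rss": 940, "rchar": 107480, '
    '"wchar": 90, "exitcode:signal": ["0:0"], "core": 0}'
)
rec = parse_rur_line(line)
statuses = [split_exit_status(e) for e in rec["data"]["exitcode:signal"]]
print(rec["apid"], rec["cmdname"], rec["data"]["max_rss"], statuses)
```

Keeping the header fields as strings avoids guessing at the type of jobid, which may not always be numeric.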

If arg is set to xpacct, the plugin also provides the following extended process accounting data, similar to that collected by the deprecated Cray System Accounting (CSA).

abortinfo
If abnormal termination occurs, a list of abort_info fields is reported
apid
Application ID as defined by application launcher
bkiowait
Total delay time (ns) waiting for synchronous block I/O to complete
btime
UNIX time when process started
comm
String containing the process name. May differ from the cmdname in the record header, which is the process run by the launcher.
coremem
Integral of RSS used by process in MB-usec
ecode
Process exit code
etime
Total elapsed time in microseconds
gid
Group ID
jid
Job ID - the PAGG job container used on the compute node
majfault
Number of major page faults
minfault
Number of minor page faults
nice
POSIX nice value of process
nid
String containing node ID
pgswapcnt
Number of pages swapped; should be 0 on Cray compute nodes
pid
Process ID
pjid
Parent job ID - the PAGG job container on the MOM node
ppid
Parent process ID
prid
Job project ID
rcalls
Number of read system calls
rchar
Characters read by process
rss
RSS highwater mark
sched
Scheduling discipline used on node
uid
User ID
vm
Integral of virtual memory used by process in MB-usec
wcalls
Number of write system calls
wchar
Characters written by process

RUR extended taskstats output

This example shows RUR extended taskstats output:

2017-02-03T10:29:38.285378-05:00 c0-0c0s1n1 RUR 24393 p1-20131018t081133 [RUR@34] uid: 12345, apid: 370583, jobid: 0, cmdname: /bin/cat, plugin: taskstats {"btime": 1386061749, "etime": 8000, "utime": 0, "stime": 4000, "coremem": 442, "max_rss": 564, "max_vm": 564, "pgswapcnt": 63, "minfault": 15, "majfault": 48, "rchar": 2608, "wchar": 686, "rcalls": 19, "wcalls": 7, "bkiowait": 1000, "exitcode:signal": [0], "core": 0}

If arg is set to xpacct, per-process, the plugin reports extended accounting data for every compute node process rather than a summary of all processes for an application. per-process must be set in combination with xpacct.

CAUTION: If per-process is set and many processes run on each node, the volume of data generated and stored on disk can grow large.

RUR per-process taskstats output

This example shows RUR per-process taskstats output.
2017-02-03T13:25:34.446167-06:00 c0-0c2s0n2 RUR 7623 p3-20131202t090205 [RUR@34] uid: 12345, apid: 1560, jobid: 0, cmdname: ./it.sh, plugin: taskstats {"uid": 12345, "wcalls": 37, "pid": 2997, "vm": 16348, "jid": 395136991233, "bkiowait": 1201616, "majfault": 1, "etime": 0, "btime": 1386098731, "gid": 0, "ppid": 2992, "utime": 0, "nice": 0, "sched": 0, "nid": "92", "prid": 0, "comm": "mount", "stime": 4000, "wchar": 3465, "rss": 1028, "minfault": 352, "coremem": 1109, "ecode": 0, "rcalls": 22, "pjid": 7045, "pgswapcnt": 0, "rchar": 12208} 

2017-02-03T13:25:34.949138-06:00 c0-0c2s0n2 RUR 7623 p3-20131202t090205 [RUR@34] uid: 12345, apid: 1560, jobid: 0, cmdname: ./it.sh, plugin: taskstats {"uid": 12345, "wcalls": 0, "pid": 2998, "vm": 20268, "jid": 395136991233, "bkiowait": 0, "majfault": 0, "etime": 0, "btime": 1386098731, "gid": 0, "ppid": 2992, "utime": 0, "nice": 0, "sched": 0, "nid": "92", "prid": 0, "apid": 1560, "comm": "ls", "stime": 4000, "wchar": 0, "rss": 1040, "minfault": 360, "coremem": 3140, "ecode": 0, "rcalls": 19, "pjid": 7045, "pgswapcnt": 0, "rchar": 10629}

Note on coremem and vm: The current memory usage is added to these counters every time a tick is charged to a task's system time. At the end, each counter therefore holds memory usage multiplied by system time, from which an average usage per unit of system time can be calculated.
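
The average described in the note on coremem and vm can be worked through with the values from the first per-process record above. This is a hedged sketch: the helper name average_memory_mb is hypothetical, and it assumes stime is expressed in microseconds, which the field list does not state explicitly.

```python
def average_memory_mb(integral_mb_usec, stime_usec):
    """Average memory (MB) per unit of system time, or None if no time was charged."""
    if stime_usec == 0:
        # No ticks were charged, so the integral carries no usable average.
        return None
    return integral_mb_usec / stime_usec


# Values taken from the first per-process record (the "mount" process).
record = {"coremem": 1109, "vm": 16348, "stime": 4000}

avg_rss_mb = average_memory_mb(record["coremem"], record["stime"])
avg_vm_mb = average_memory_mb(record["vm"], record["stime"])
print(avg_rss_mb, avg_vm_mb)
```

Because the counters only advance when system time is charged, the result is an average over system time, not over the elapsed time of the process.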