Guidance for the Filesystem Test
Quick guide for test.
The NHC Filesystem test can take an explicit argument (the mount point of the file system) or no argument. If an argument is provided, the Filesystem test is referred to as an explicit Filesystem test. If no argument is given, the Filesystem test is referred to as an implicit Filesystem test.
The explicit Filesystem test will test the file system located at the specified mount point.
The implicit Filesystem test will test each file system listed in the /etc/fstab file on each compute node. The implicit Filesystem test is enabled by default in the NHC configuration file.
The Filesystem test determines whether a file system is mounted read-only or read-write. If the file system is mounted read-write, NHC attempts to write to it. If it is mounted read-only, NHC attempts to read the directory entries "." and ".." in the file system to confirm, at a minimum, that the file system is readable.
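The read-or-write decision described above can be illustrated with a minimal Python sketch. This is not NHC's implementation: the function name check_filesystem, the read_only parameter, and the use of a temporary probe file are all assumptions made for illustration.

```python
import os
import tempfile


def check_filesystem(mount_point, read_only):
    """Hypothetical probe: return True if the file system passes, else False.

    read_only is assumed to come from the mount options (for example,
    the presence of "ro"); how NHC determines it is not shown here.
    """
    try:
        if read_only:
            # Read-only mount: look up "." and ".." and list the directory
            # to confirm, at a minimum, that the file system is readable.
            for name in (".", ".."):
                os.stat(os.path.join(mount_point, name))
            os.listdir(mount_point)
        else:
            # Read-write mount: create, write, and flush a temporary probe
            # file; it is deleted automatically when the block exits.
            with tempfile.NamedTemporaryFile(dir=mount_point) as probe:
                probe.write(b"probe")
                probe.flush()
        return True
    except OSError:
        # Covers missing paths, permission errors, and I/O errors.
        return False
```

In this sketch, a read-write mount whose permissions forbid writing returns False, which is the false failure that the exclusion mechanism is meant to avoid.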
Some file systems are mounted on the compute nodes as read-write while their underlying permissions are read-only. For example, the base mount point of an auto-mounted file system may have read-only permissions yet be mounted read-write so that the auto-mounted sub-mount-points can be mounted read-write; the read-only permissions prevent tampering with the base mount point. In this case, the Filesystem test sees that the base mount point is mounted read-write and tries to write to it, but the write fails because of the read-only permissions. The Filesystem test therefore fails, and NHC incorrectly decides that the compute node is unhealthy. For this reason, file systems that are mounted read-write on compute nodes but are effectively read-only should be excluded from the implicit Filesystem test.
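To find candidates for exclusion, an administrator might look for mounts that report read-write options but deny writes. The Python sketch below is a hypothetical helper, not part of NHC. It assumes mount data in the /proc/mounts layout (device, mount point, type, options) and uses os.access as the default write check. When run as root, os.access may report write access regardless of permission bits, so the check is best run as an unprivileged user.

```python
import os


def unwritable_rw_mounts(mounts_text, can_write=None):
    """Return mount points mounted read-write that the caller cannot write to.

    mounts_text is text in /proc/mounts layout. can_write is an optional
    callable taking a path; it defaults to an os.access write check.
    """
    if can_write is None:
        can_write = lambda path: os.access(path, os.W_OK)
    candidates = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "rw" in options and not can_write(mount_point):
            candidates.append(mount_point)
    return candidates
```

On a live node, the input could be read with open("/proc/mounts").read(); each returned mount point is only a candidate, and the administrator still decides whether to exclude it.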
The administrator can exclude a file system from the implicit Filesystem test by adding an "Excluding: file system mount point" entry to the NHC configuration file. See the NHC configuration file for further details and an example.
A file system is deemed critical if it is needed to run applications. All systems will likely need at least one shared file system for reading and writing input and output data; such a file system is critical. File systems that are not needed to run applications or to read and write data are deemed noncritical. The administrator must determine the criticality of each file system.
Cray recommends the following:
- Exclude noncritical file systems from the implicit Filesystem test. See the NHC configuration file for further details and an example.
- If there are critical file systems that do not appear in the /etc/fstab file on the compute nodes (such file systems would not be tested by the implicit Filesystem test), check these critical file systems through explicit Filesystem tests. Add explicit Filesystem tests to the NHC configuration file by providing the mount point of the file system as the final argument to the Filesystem test. See the NHC configuration file for further details and an example.
- If a file system is mounted as read-write but has read-only permissions, exclude it from the implicit Filesystem test. NHC does not support such file systems.
- Client mounts may fail as a system is booting because not all routes have had sufficient time to be established. The retry ensures a mount attempt will be made after all routes are up.
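The first three recommendations amount to a selection step: start from the mount points in /etc/fstab, drop the excluded ones, then add any critical file systems that need explicit tests. The Python sketch below illustrates that logic only. The function names, the sample fstab contents, and the rule of skipping entries whose mount point is not an absolute path (such as swap) are assumptions, and the sketch does not reproduce NHC configuration syntax.

```python
# Hypothetical sample in /etc/fstab layout: device, mount point, type, options.
SAMPLE_FSTAB = """\
# device        mount point   type    options  dump pass
/dev/sda1       /             ext4    rw       0    1
/dev/sda2       none          swap    sw       0    0
server:/lus     /lus/scratch  lustre  rw       0    0
server:/home    /home         nfs     ro       0    0
auto.master     /autofs       autofs  rw       0    0
"""


def fstab_mount_points(fstab_text):
    """Return the mount points listed in fstab-formatted text, in order."""
    mount_points = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # Assumption: entries without an absolute mount point are not tested.
        if len(fields) >= 2 and fields[1].startswith("/"):
            mount_points.append(fields[1])
    return mount_points


def select_mount_points(fstab_text, excluded, explicit):
    """Combine implicit (fstab minus exclusions) and explicit mount points."""
    selected = [mp for mp in fstab_mount_points(fstab_text) if mp not in excluded]
    for mp in explicit:
        if mp not in selected:  # avoid testing the same mount point twice
            selected.append(mp)
    return selected
```

For example, select_mount_points(SAMPLE_FSTAB, excluded={"/autofs"}, explicit=["/lus/extra"]) keeps the three regular fstab mount points, drops the auto-mounted base, and appends the critical file system that is missing from fstab.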