DVS Troubleshooting

Troubleshooting procedures for common Cray Data Virtualization Service (DVS) issues.

Here are some issues that could arise when using DVS.

DVS Does Not Start after Data Store Moved to External Lustre File System

If DVS fails after the Cray system's data store is moved to a shared external Lustre file system, verify that DVS has the correct lnd_name.

lnd_name uniquely identifies the LNet network that DVS will use. DVS passes it to the LNet service during DVS initialization. For DVS to start properly, lnd_name must match the cray_lnet.settings.local_lnet.data.lnet_name value set in the cray_lnet service. To find that value, search the CLE config set. This example searches config set p0 and finds lnet_name = gni4:
smw# cfgset search --term lnet_name \
--state all --service cray_lnet p0
# 1 match for 'lnet_name' from cray_lnet_config.yaml
#----------------------------------------------------
cray_lnet.settings.local_lnet.data.lnet_name: gni4
If lnd_name does not match the lnet_name value from the cray_lnet service, change lnd_name. Because lnd_name is a kernel module parameter that cannot be set using the configurator, add these lines to <simple sync path>/etc/modprobe.d/dvs-local.conf, replacing gnix with the value found by the config set search:
# Set identifier of LNet network DVS will use
options dvsipc_lnet lnd_name=gnix
For information about what <simple sync path> should be, see the procedure Change Kernel Module Parameters Prior to Boot using Modprobe.d Files and Simple Sync, which is found in Configure DVS using Modprobe or Proc Files.
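The comparison described above can also be scripted as a quick cross-check. The following is a minimal sketch, not a Cray-supplied tool: the function names get_lnd_name and check_lnd_name are hypothetical, and the sketch assumes that the file contains an "options dvsipc_lnet lnd_name=..." line in the format shown above.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of DVS or CLE.

# Print the lnd_name value found in a modprobe.d file (prints nothing if unset).
get_lnd_name() {
    # Match "options dvsipc_lnet ... lnd_name=VALUE" and print VALUE.
    # If several lines match, the last one is reported.
    sed -n 's/^[[:space:]]*options[[:space:]][[:space:]]*dvsipc_lnet[[:space:]].*lnd_name=\([^[:space:]]*\).*/\1/p' "$1" | tail -n 1
}

# Return 0 if lnd_name in the file equals the expected lnet_name, else 1.
check_lnd_name() {
    conf_file=$1
    expected=$2
    actual=$(get_lnd_name "$conf_file")
    if [ "$actual" = "$expected" ]; then
        echo "OK: lnd_name=$actual matches lnet_name"
    else
        echo "MISMATCH: lnd_name='$actual', lnet_name='$expected'" >&2
        return 1
    fi
}
```

To use it, pass the path to dvs-local.conf and the lnet_name value reported by the config set search (gni4 in the example above) to check_lnd_name.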

ALPS Kills a Process to Avoid Potential Data Loss

DVS forwards file system writes from clients to servers. The data written on the DVS server may reside in the server's page cache for an indeterminate time before the Linux kernel writes the data to backing store. If the server crashes before the data is written to backing store, this data is lost. To prevent silent data loss, DVS kills the processes on the clients that wrote the data. If the Application Level Placement Scheduler (ALPS) was used to launch the application, the system displays the following message to the terminal before aprun exits: "DVS server failure detected: killing process to avoid potential data loss."
To avoid this error message, do one of the following:
  • Add the datasync option to the options field of the client_mount setting for that client mount in the cray_dvs service, or use the DVS_DATASYNC user environment variable. This avoids the error message because each write operation is followed by fsync before it is considered complete. However, be aware that this exacts a substantial performance penalty.
  • Add the nokillprocess option to the options field of the client_mount setting for that client mount in the cray_dvs service, or set the DVS_KILLPROCESS user environment variable to off. With this option, processes that have written data to a server are not killed when that server fails. If a process continues to perform operations with an open file descriptor that had been used to write data to the server, those operations fail (with errno set to EHOSTDOWN). A new open of the file is allowed, and subsequent operations with the corresponding file descriptor function normally.
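
For a single job, the environment-variable forms of these options can be set in the launching shell. The sketch below is illustrative only: the wrapper names are hypothetical, and the value on for DVS_DATASYNC is an assumption (the text above gives a value only for DVS_KILLPROCESS). User environment variables override mount options only if the client mount was not configured with nouserenv.

```shell
#!/bin/sh
# Hypothetical wrappers for illustration; not part of DVS.

# Run a command with DVS_KILLPROCESS=off in its environment, so that
# processes are not killed when a DVS server fails.
run_dvs_nokill() {
    env DVS_KILLPROCESS=off "$@"
}

# Run a command with DVS_DATASYNC enabled in its environment.
# Assumption: "on" is the enabling value; expect slower writes.
run_dvs_datasync() {
    env DVS_DATASYNC=on "$@"
}
```

For example, run_dvs_nokill aprun -n 16 ./my_app launches a placeholder application (./my_app) with DVS_KILLPROCESS set to off for that launch only, leaving the rest of the shell session unchanged.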

Application Hangs as a Result of NFS File Locking

Applications may hang when NFS file systems are projected through DVS and file locking is used. To avoid this issue, add the nolock option to the options field of the client_mount setting for the NFS client mount in the cray_dvs service. See the nfs(5) man page for more information on the nolock option.

DVS Ignores User Environment Variables

If the nouserenv option was not specified when the DVS client mount was configured, yet a DVS user environment variable that has been set does not override the associated DVS mount option, DVS appears to be ignoring user environment variables. This can happen when a user sets a large number of environment variables. If there are enough of them that the Linux kernel must store that information somewhere other than the usual location, DVS may not be able to find and apply the DVS user environment variables, producing unexpected results.

To define a large number of user environment variables, Cray recommends that users include those definitions in the user's shell so that they are available at startup and stored where DVS can always locate them.
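
One way to follow this recommendation is to keep the definitions in a shell startup file. The following sketch assumes a Bourne-style shell and a startup file such as ~/.profile; the file name is an assumption, because the text above does not name one, and the helper name add_startup_export is hypothetical. The helper appends an export line only if the file does not already export that variable.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of DVS.

# Append "export NAME=VALUE" to a startup file unless NAME is already exported there.
add_startup_export() {
    rc_file=$1
    name=$2
    value=$3
    # Leave the file unchanged if it already exports this variable.
    if [ -f "$rc_file" ] && grep -q "^export ${name}=" "$rc_file"; then
        return 0
    fi
    printf 'export %s=%s\n' "$name" "$value" >> "$rc_file"
}
```

For example, add_startup_export "$HOME/.profile" DVS_KILLPROCESS off records the setting once, and repeated calls leave the file unchanged.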