Start an Analytics Cluster and Run Open Source Analytics (OSA) Jobs Using the start_analytics Command
Purpose and uses of the start_analytics script.
The start_analytics command starts an analytics cluster, which can be used to run OSA components; it serves as the entry point to those components. The start_analytics command normally starts an analytics cluster within the nodes of a user's job allocation.
Executing the start_analytics command opens a Bash shell on one of the cluster nodes, from which Spark and the analytics programming environment commands can be run.
For more information, see the start_analytics man page.
Useful Environment Variables
MINERVA_USE_LOGIN - If this environment variable is set, the interactive shell runs on the login node rather than on a compute node. In some environments, this allows better external connectivity for build and environment tools that need to download new packages.

SPARK_LOOPBACK_SIZE - Sets the size of the per-node loopback-mounted local file system used by Spark for local storage. The default value of this variable is 256 GB.

SPARK_EVENT_DIR - Sets the location for Spark event logs.
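For example, these variables might be set before launching start_analytics so that the session inherits them. The SPARK_LOOPBACK_SIZE value format and the event-log path below are assumptions for illustration; check the start_analytics man page for the exact syntax accepted on your system.

```shell
# Illustrative settings; the exact value formats are an assumption here.
export MINERVA_USE_LOGIN=1                        # run the interactive shell on the login node
export SPARK_LOOPBACK_SIZE=512GB                  # grow per-node Spark local storage (default is 256 GB)
export SPARK_EVENT_DIR=/lus/scratch/spark-events  # assumed path on a shared file system
echo "Spark loopback size: $SPARK_LOOPBACK_SIZE"
```

start_analytics would then be launched from the same shell so the cluster picks up these settings.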
Dependencies
Necessary libraries and drivers should already be installed in a default location, such as /opt; certain OSA components may depend on this location. Some of the libraries and/or drivers may instead be installed at alternate locations, with their paths specified at run time via command-line flags passed to start_analytics, such as --nccl and --cudnn-libs. See start_analytics --help or the start_analytics man page for more information.

$ start_analytics --nccl /alternate/path/to/nccl/lib
Development Mode
The start_analytics script also features the -d option, which starts a single analytics container on the current login node. No job allocation is required in this case, and Spark can still be used in local mode. This is useful for performing development work, such as creating Conda environments, building applications, and running single-node tests, all with full access to the analytics environment and without having to wait for a job allocation. Since this option may provide better access to the external network in some environments, it can also be useful for downloading new packages for builds.
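A typical development-mode session might look like the following. Only the -d option comes from start_analytics itself; the Conda environment name and the test script are hypothetical, and spark-submit with a local master is standard Spark usage for single-node runs.

```shell
$ start_analytics -d                               # single container on the login node; no job allocation needed
$ conda create -n myenv python=3.10                # build a Conda environment for later cluster runs
$ conda activate myenv
$ spark-submit --master 'local[*]' my_test_job.py  # run a single-node Spark test in local mode
```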
Resource Allocation
Before an analytics cluster can be started, the desired number of nodes must be allocated using the system's workload manager. If N nodes are allocated, one node serves as the master and one node serves as the interactive node. The remaining capacity is used for N-2 worker containers, which are launched when the start_analytics command is run.
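The node accounting above can be sketched as a small helper. This is hypothetical illustration code, not part of start_analytics; it only restates the master/interactive/worker split, and the minimum of three nodes (so that at least one worker exists) is an assumption.

```python
def cluster_layout(n_nodes: int) -> dict:
    """Split an N-node job allocation the way start_analytics does:
    one master node, one interactive node, and N-2 worker containers.
    Requiring at least one worker (N >= 3) is an assumption here."""
    if n_nodes < 3:
        raise ValueError("need at least 3 nodes: master, interactive, one worker")
    return {"master": 1, "interactive": 1, "workers": n_nodes - 2}

print(cluster_layout(8))  # an 8-node allocation yields 6 worker containers
```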