Apache Spark Support
Overview of using Spark
Spark core and ecosystem components currently supported on the Urika-GX and Urika-XC systems include:
- Spark Core, DataFrames, and Resilient Distributed Datasets (RDDs) - Spark Core provides distributed task dispatching, scheduling, and basic I/O functionalities.
- Spark SQL, DataSets, and DataFrames - The Spark SQL component is a layer on top of Spark Core for processing structured data.
- Spark Streaming - The Spark Streaming component leverages Spark Core's fast scheduling capabilities to perform streaming analytics.
- MLlib Machine Learning Library - MLlib is a distributed machine learning framework on top of Spark.
- GraphX - GraphX is a distributed graph processing framework on top of Spark. It provides an API for expressing graph computations.
When executing start_analytics, it is necessary to create the directory ~/.urikacs/sparkHistory for each user. An example directory: /home/users/username/.urikacs/sparkHistory
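The directory can be created with a single command, shown here for the current user:

```shell
# Create the per-user Spark history directory required by start_analytics.
# The -p flag creates parent directories as needed and is a no-op if the
# directory already exists.
mkdir -p "$HOME/.urikacs/sparkHistory"
```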
Urika-XC and Urika-CS ship with Spark 2.4.3.
Run Spark Applications
The Urika-GX software stack includes Spark configured and deployed to run under Mesos. On Urika-XC, Spark runs in a Shifter container with a per-node cache for local temporary storage; on Urika-CS, it runs in a Singularity container, also with a per-node cache for local temporary storage. The Mesos master URL, mesos://zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos, is the default setting on Urika-GX and is configured via the Spark startup scripts installed on the system.
By default, Spark runs on Mesos in coarse-grained mode. To use fine-grained mode instead, set spark.mesos.coarse to false in SparkConf. Spark applications can be started with the following commands:
- spark-shell
- spark-submit
- spark-sql
- pyspark
- sparkR
- run-example
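As a quick smoke test, the SparkPi example bundled with Spark can be launched through the run-example wrapper; the master URL and container setup are supplied by the system's startup scripts, so no extra flags are needed:

```shell
# Run the SparkPi example shipped with Spark. The argument (1000) is the
# number of tasks used to estimate pi, not a core count.
run-example SparkPi 1000
```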
By default, the Spark startup scripts start a Spark instance across all worker nodes and cores of the node allocation. To request a smaller or larger instance, pass the --total-executor-cores No_of_Desired_cores command-line flag. Memory allocated to Spark executors and drivers can be controlled with the --driver-memory and --executor-memory flags. By default, 16 gigabytes are allocated to the driver and 96 gigabytes to each executor, but these defaults are overridden if a different value is specified via the command line, or if a property file is used.
By default, spark-shell starts a small, 32-core interactive Spark instance to allow small-scale experimentation and debugging. To create a larger instance, pass the --total-executor-cores No_of_Desired_cores command-line flag to spark-shell. As with other Spark commands, the --driver-memory and --executor-memory flags control the memory allocated to the driver (16 gigabytes by default) and to each executor (96 gigabytes by default).
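For example, the following invocation requests a larger shell with non-default memory settings (the sizes shown are illustrative, not recommendations):

```shell
# Start an interactive shell with 256 executor cores, a 32 GB driver,
# and 64 GB per executor (all sizes illustrative).
spark-shell --total-executor-cores 256 \
            --driver-memory 32g \
            --executor-memory 64g
```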
Further details about starting and running Spark applications are available at http://spark.apache.org.
Build Spark Applications
Spark 2.4.3 builds with Scala 2.11.8. Urika-GX, Urika-XC, and Urika-CS ship with Maven installed for building Java applications (including applications utilizing Spark’s Java APIs), and the Scala Build Tool (sbt) for building Scala applications (including applications using Spark’s Scala APIs). To build a Spark application with these tools, add a dependency on Spark to the build file. For Scala applications built with sbt, add this dependency to the .sbt file, as in the following example:
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3"
For Java applications built with Maven, add the dependency to the project's pom.xml file, as in the following example:
<dependencies>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.3</version>
</dependency>
</dependencies>
Conda Environments
When the system is running in the default mode, PySpark on Urika-GX, Urika-XC, and Urika-CS is aware of Conda environments. If there is an active Conda environment (the name of the environment is prepended to the Unix shell prompt), the PySpark shell will detect and utilize the environment's Python. To override this behavior, manually set the PYSPARK_PYTHON environment variable to point to the preferred Python. For more information, see Enable Anaconda Python and the Conda Environment Manager.
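For instance, to force PySpark to use a specific interpreter regardless of the active Conda environment (the interpreter path below is illustrative):

```shell
# Override Conda-environment detection by pointing PySpark at an explicit
# interpreter before launching the shell (path is illustrative).
export PYSPARK_PYTHON=/usr/bin/python3
pyspark
```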
When the system is running in the secure mode, Spark jobs (running on Kubernetes) are not aware of Conda environments or user Python versions.
Spark Configuration Differences
Spark’s default configurations on Urika-GX, Urika-XC, and Urika-CS have a few differences from the standard Spark configuration:
- Changes to improve execution over a high-speed interconnect - The presence of the high-speed network on the system changes some of the tradeoffs between compute time and communication time. Because of this, the default setting of spark.shuffle.compress has been changed to false, and that of spark.locality.wait has been changed to 1. This results in improved execution times for some applications. If an application is running out of memory or temporary space, try changing spark.shuffle.compress back to true.
- Increases to default memory allocation - Spark’s standard default memory allocation is 1 gigabyte to each executor and 1 gigabyte to the driver. Because the system has large-memory nodes, these defaults were changed to 96 gigabytes for each executor and 16 gigabytes for the driver.
- Mesos coarse-grained mode - Urika-GX ships with this mode enabled because coarse-grained mode significantly lowers startup overheads.
- Local temporary cache - Spark on Urika-XC is configured to use a per-node loopback filesystem provided by Shifter for its local temporary storage.
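Collected in spark-defaults.conf form, the non-default settings described above would look roughly like the following (a sketch for reference; the exact file shipped on the system may differ):

```
spark.shuffle.compress  false
spark.locality.wait     1
spark.driver.memory     16g
spark.executor.memory   96g
```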
Limitations
Spark shells using Kubernetes (that is, those launched under the secure service mode) are limited to 16 cores and 60 GiB of memory, and this limit cannot be overridden at the command line. This is due to the lack of native Spark shell support in the Spark on Kubernetes project; Cray has provided a workaround for this limitation in this release.