Apache Spark Support
Overview of using Spark
Spark core and ecosystem components currently supported on the Urika-GX and Urika-XC systems include:
- Spark Core, DataFrames, and Resilient Distributed Datasets (RDDs) - Spark Core provides distributed task dispatching, scheduling, and basic I/O functionalities.
- Spark SQL, DataSets, and DataFrames - The Spark SQL component is a layer on top of Spark Core for processing structured data.
- Spark Streaming - The Spark Streaming component leverages Spark Core's fast scheduling capabilities to perform streaming analytics.
- MLlib Machine Learning Library - MLlib is a distributed machine learning framework on top of Spark.
- GraphX - GraphX is a distributed graph processing framework on top of Spark. It provides an API for expressing graph computations.
When executing start_analytics, it is necessary to create the directory ~/.urikacs/sparkHistory for each user. An example directory: /home/users/username/.urikacs/sparkHistory
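The directory can be created with a single command, shown here for the current user:

```shell
# Create the per-user Spark history directory required by start_analytics.
# The -p flag creates parent directories as needed and is a no-op if the
# directory already exists.
mkdir -p "$HOME/.urikacs/sparkHistory"
```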
Urika-XC and Urika-CS ship with Spark 2.4.3.
Run Spark Applications
The Urika-GX software stack includes Spark configured and deployed to run under Mesos. On Urika-XC, Spark runs in a Shifter container with a per-node cache for local temporary storage; on Urika-CS, it runs in a Singularity container, also with a per-node cache for local temporary storage. The Mesos master URL, mesos://zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos, is the default setting on Urika-GX and is configured via the Spark startup scripts installed on the system.
By default, Spark runs on Mesos in coarse-grained mode. To use fine-grained mode instead, set spark.mesos.coarse to false in SparkConf. Spark applications can be started with the following commands:
- spark-shell
- spark-submit
- spark-sql
- pyspark
- sparkR
- run-example
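As a quick smoke test, the SparkPi example bundled with Spark can be launched through the run-example wrapper; the master URL and container setup are supplied by the system's startup scripts, so no extra flags are needed:

```shell
# Run the SparkPi example shipped with Spark. The argument (1000) is the
# number of tasks used to estimate pi, not a core count.
run-example SparkPi 1000
```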
By default, the Spark startup scripts start a Spark instance across all worker nodes and cores of the node allocation. To request a smaller or larger instance, pass the --total-executor-cores No_of_Desired_cores command-line flag. Memory allocated to Spark executors and drivers can be controlled with the --driver-memory and --executor-memory flags. By default, 16 gigabytes are allocated to the driver and 96 gigabytes to each executor, but these defaults are overridden if a different value is specified via the command line, or if a property file is used.
By default, spark-shell starts a small, 32-core interactive Spark instance to allow small-scale experimentation and debugging. To create a larger instance, pass the --total-executor-cores No_of_Desired_cores command-line flag to spark-shell. As with other Spark commands, the --driver-memory and --executor-memory flags control the memory allocated to the driver (16 gigabytes by default) and to each executor (96 gigabytes by default).
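For example, the following invocation requests a larger shell with non-default memory settings (the sizes shown are illustrative, not recommendations):

```shell
# Start an interactive shell with 256 executor cores, a 32 GB driver,
# and 64 GB per executor (all sizes illustrative).
spark-shell --total-executor-cores 256 \
            --driver-memory 32g \
            --executor-memory 64g
```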
Further details about starting and running Spark applications are available at http://spark.apache.org.
Build Spark Applications
Spark 2.4.3 builds with Scala 2.11.8. Urika-GX, Urika-XC, and Urika-CS ship with Maven installed for building Java applications (including applications utilizing Spark’s Java APIs), and the Scala Build Tool (sbt) for building Scala applications (including applications using Spark’s Scala APIs). To build a Spark application with these tools, add a dependency on Spark to the build file. For Scala applications built with sbt, add this dependency to the .sbt file, as in the following example:
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3"
For Java applications built with Maven, add the dependency to the project's pom.xml file, as in the following example:
<dependencies>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.3</version>
</dependency>
</dependencies>
Conda Environments
When the system is running in the default mode, PySpark on Urika-GX, Urika-XC, and Urika-CS is aware of Conda environments. If there is an active Conda environment (the name of the environment is prepended to the Unix shell prompt), the PySpark shell will detect and utilize the environment's Python. To override this behavior, manually set the PYSPARK_PYTHON environment variable to point to the preferred Python. For more information, see Enable Anaconda Python and the Conda Environment Manager.
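For instance, to force PySpark to use a specific interpreter regardless of the active Conda environment (the interpreter path below is illustrative):

```shell
# Override Conda-environment detection by pointing PySpark at an explicit
# interpreter before launching the shell (path is illustrative).
export PYSPARK_PYTHON=/usr/bin/python3
pyspark
```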
When the system is running in the secure mode, Spark jobs (running on Kubernetes) are not aware of Conda environments or user Python versions.
Spark Configuration Differences
Spark’s default configurations on Urika-GX, Urika-XC, and Urika-CS have a few differences from the standard Spark configuration:
- Changes to improve execution over a high-speed interconnect - The presence of the high-speed network on the system changes some of the tradeoffs between compute time and communication time. Because of this, the default setting of spark.shuffle.compress has been changed to false, and that of spark.locality.wait has been changed to 1. This results in improved execution times for some applications. If an application is running out of memory or temporary space, try changing spark.shuffle.compress back to true.
- Increases to default memory allocation - Spark’s standard default memory allocation is 1 gigabyte to each executor and 1 gigabyte to the driver. Because the system has large-memory nodes, these defaults were changed to 96 gigabytes for each executor and 16 gigabytes for the driver.
- Mesos coarse-grained mode - Urika-GX ships with this mode enabled because coarse-grained mode significantly lowers startup overheads.
- Local temporary cache - Spark on Urika-XC is configured to use a per-node loopback filesystem provided by Shifter for its local temporary storage.
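Collected in spark-defaults.conf form, the non-default settings described above would look roughly like the following (a sketch for reference; the exact file shipped on the system may differ):

```
spark.shuffle.compress  false
spark.locality.wait     1
spark.driver.memory     16g
spark.executor.memory   96g
```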
Limitations
Spark shells using Kubernetes (that is, those launched under the secure service mode) are limited to 16 cores and 60 GiB of memory, and this limit cannot be overridden at the command line. This is due to the lack of native Spark shell support in the Spark on Kubernetes project; Cray has provided a workaround for this limitation in this release.