Xiaodong DENG, http://XD-DENG.com
This article shares, step by step, how to package your Spark application into a JAR file so that you can submit it to your Spark cluster.
I have also added their bin directories to my system PATH using PATH=$PATH:/Users/XD/spark-2.2.1-bin-hadoop2.7/bin/, so that I can invoke the commands conveniently.
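If you want this PATH change to persist across shell sessions, one option (assuming a bash shell on macOS, as the path above suggests) is to append the export to ~/.bash_profile:
echo 'export PATH=$PATH:/Users/XD/spark-2.2.1-bin-hadoop2.7/bin' >> ~/.bash_profile
source ~/.bash_profile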
Before we use sbt to package our application, make sure you have the correct directory structure. At a minimum, there must be two components (a sample layout is sketched below the list):
build.sbt
src/main/scala/[your script]
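For reference, a minimal layout for the example in this article could look like the following; the script name SparkPi.scala is only an assumption (any .scala file under src/main/scala/ works):
./build.sbt
./src/main/scala/SparkPi.scala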
build.sbt
You need to be a bit careful about the configurations in build.sbt. A minimal build.sbt looks like the one below.
name := "Test"
version := "1.0"
scalaVersion := "2.11.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-sql" % "2.3.0"
)
Do make sure your scalaVersion and your Spark API version are compatible. For example, if you set scalaVersion to 2.10.0 while using spark-sql of version 2.3.0, you will encounter the error sbt.librarymanagement.ResolveException: unresolved dependency: org.apache.spark#spark-sql_2.10;2.3.0: not found when you try to package your application later. To check the compatibility between Scala and the Spark API, Maven Repository may help.
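To illustrate why the Scala version matters here: the %% operator appends the Scala binary version to the artifact name, so the two declarations below request the same artifact. If the suffixed artifact (for example spark-sql_2.10 at version 2.3.0) was never published, sbt fails with the unresolved-dependency error above. This is only an illustration of the naming convention:
"org.apache.spark" %% "spark-sql" % "2.3.0"     // resolves to spark-sql_2.11 when scalaVersion is 2.11.x
"org.apache.spark" % "spark-sql_2.11" % "2.3.0" // same artifact, with the Scala suffix written out explicitly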
In addition, it's also good to make sure the Spark API version you specified in build.sbt is consistent with the Spark version you have on your machines, even though in most cases it would just work fine (I could submit a JAR file packaged with Spark API version 2.3.0 to a cluster on which Spark 2.1.0 is installed).
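If you are unsure which Spark version is installed on a machine, spark-submit can print it (a quick check, assuming the Spark bin directory is on your PATH as above):
spark-submit --version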
My sample code is based on the SparkPi example in the Spark example code.
import scala.math.random
import org.apache.spark.sql.SparkSession

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("Spark Pi")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y <= 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / (n - 1))
    spark.stop()
  }
}
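Before packaging, you can also sanity-check the core logic interactively in spark-shell, which already provides a SparkSession named spark. This is just a quick sketch of the same computation, not part of the packaged application:
val n = 100000
val count = spark.sparkContext.parallelize(1 until n, 2).map { i =>
  val x = scala.math.random * 2 - 1
  val y = scala.math.random * 2 - 1
  if (x*x + y*y <= 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / (n - 1))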
After you set up your application properly, you can run sbt package in your application root directory. If nothing goes wrong, a few new folders will be created, including project and target, and your JAR file will be created under target/scala-{Scala version you chose}/.
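For example, with the build.sbt above (name Test, version 1.0, Scala 2.11), the workflow would look roughly like this; the JAR name follows sbt's {name}_{Scala binary version}-{version} convention, and the project path is only a placeholder:
cd /path/to/your/application
sbt package
ls target/scala-2.11/
# test_2.11-1.0.jar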
An application relying ONLY on Spark and Scala built-in libraries, like the one in the last section, is easy to handle. However, if our application depends on other projects, we will need to package them alongside our application in order to distribute the code to a Spark cluster. Otherwise, our executor(s) may not be able to find the required code [2] and we may encounter errors like 'Exception in thread "main" java.lang.NoClassDefFoundError'.
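For instance, if your application read its settings with the Typesafe Config library (purely a hypothetical dependency for illustration, assuming this version is available on Maven Central), you would declare it in build.sbt like any other dependency, and it would then need to end up in the assembly JAR described below:
libraryDependencies += "com.typesafe" % "config" % "1.3.3"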
To create an assembly jar containing our code and its dependencies, we can use sbt's plugin sbt-assembly.
Using the sbt-assembly plugin is quite simple. For sbt versions from 1.0.0-M6 onwards, you just need to add it as a dependency in project/assembly.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
(For older sbt versions, you can refer to https://github.com/sbt/sbt-assembly#setup.)
We need to note that Spark and Hadoop should NOT be bundled into the assembly jar file, since they are provided by the cluster manager at runtime [2]. We still need to list them in the build.sbt file, but label them as "provided", like the example below. As a positive side effect, this also helps reduce the jar file size.
In addition, we can exclude the Scala library jars (JARs that start with "scala-" and are included in the binary Scala distribution; they are also provided in the Spark environment) by adding a statement to build.sbt like the example below [3]. This can also help reduce the jar file size by at least a few megabytes.
name := "Test"
version := "1.0"
scalaVersion := "2.11.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"
)
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
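Optionally, sbt-assembly also lets you set the output file name explicitly in build.sbt, which can make scripting the spark-submit step easier; the name below is only an example:
assemblyJarName in assembly := "spark-pi-assembly-1.0.jar"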
Then we can run the command sbt assembly in the root directory of your application. The JAR file will be created and located under target/scala-{Scala version you chose}/.
Ready to go! Now you can submit your application using spark-submit.
spark-submit --class SparkPi --master local[*] target/scala-2.11/test_2.11-1.0.jar 100
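The command above runs the application locally. To submit to an actual cluster instead, point --master at your cluster manager; the master URL below is only a placeholder:
spark-submit --class SparkPi --master spark://your-master-host:7077 target/scala-2.11/test_2.11-1.0.jar 100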
[1] Quick Start - Self-Contained Applications
[2] Submitting Applications - Bundling Your Application’s Dependencies
[3] sbt/sbt-assembly