Xiaodong DENG, http://XD-DENG.com
This article shares, step by step, how to package your Spark application into a JAR file so that you can submit it to your Spark cluster.
I have also added their bin directories to my system PATH using PATH=$PATH:/Users/XD/spark-2.2.1-bin-hadoop2.7/bin/, so that I can invoke the commands conveniently.
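If you want this PATH change to persist across shell sessions, one option (assuming a bash shell on macOS, as the path above suggests) is to append the export to ~/.bash_profile:
echo 'export PATH=$PATH:/Users/XD/spark-2.2.1-bin-hadoop2.7/bin' >> ~/.bash_profile
source ~/.bash_profile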
Before we use sbt to package our application, make sure you have the correct directory structure. At a minimum, there must be two components (a sample layout is sketched below the list):
build.sbt
src/main/scala/[your script]
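For reference, a minimal layout for the example in this article could look like the following; the script name SparkPi.scala is only an assumption (any .scala file under src/main/scala/ works):
./build.sbt
./src/main/scala/SparkPi.scala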
build.sbt
You need to be a bit careful about the configurations in build.sbt. A minimal build.sbt looks like the one below.
name := "Test"
version := "1.0"
scalaVersion := "2.11.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-sql" % "2.3.0"
)
Do make sure your scalaVersion and your Spark API version are compatible. For example, if you set scalaVersion to 2.10.0 while using spark-sql of version 2.3.0, you will encounter the error sbt.librarymanagement.ResolveException: unresolved dependency: org.apache.spark#spark-sql_2.10;2.3.0: not found when you try to package your application later. To check the compatibility between Scala and the Spark API, Maven Repository may help.
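To illustrate why the Scala version matters here: the %% operator appends the Scala binary version to the artifact name, so the two declarations below request the same artifact. If the suffixed artifact (for example spark-sql_2.10 at version 2.3.0) was never published, sbt fails with the unresolved-dependency error above. This is only an illustration of the naming convention:
"org.apache.spark" %% "spark-sql" % "2.3.0"     // resolves to spark-sql_2.11 when scalaVersion is 2.11.x
"org.apache.spark" % "spark-sql_2.11" % "2.3.0" // same artifact, with the Scala suffix written out explicitly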
In addition, it's also good to make sure the Spark API version you specified in build.sbt is consistent with the Spark version you have on your machines, even though in most cases it would just work fine (I could submit a JAR file packaged with Spark API version 2.3.0 to a cluster on which Spark 2.1.0 is installed).
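If you are unsure which Spark version is installed on a machine, spark-submit can print it (a quick check, assuming the Spark bin directory is on your PATH as above):
spark-submit --version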
My sample code is based on the SparkPi example in the Spark example code.
import scala.math.random
import org.apache.spark.sql.SparkSession

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("Spark Pi")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y <= 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / (n - 1))
    spark.stop()
  }
}
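Before packaging, you can also sanity-check the core logic interactively in spark-shell, which already provides a SparkSession named spark. This is just a quick sketch of the same computation, not part of the packaged application:
val n = 100000
val count = spark.sparkContext.parallelize(1 until n, 2).map { i =>
  val x = scala.math.random * 2 - 1
  val y = scala.math.random * 2 - 1
  if (x*x + y*y <= 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / (n - 1))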
After you set up your application properly, you can run sbt package in your application root directory. If nothing goes wrong, a few new folders will be created, including project and target, and your JAR file will be created under target/scala-{Scala version you chose}/.
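For example, with the build.sbt above (name Test, version 1.0, Scala 2.11), the workflow would look roughly like this; the JAR name follows sbt's {name}_{Scala binary version}-{version} convention, and the project path is only a placeholder:
cd /path/to/your/application
sbt package
ls target/scala-2.11/
# test_2.11-1.0.jar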
An application relying ONLY on Spark and Scala built-in libraries, like the one in the last section, is easy to handle. However, if our application depends on other projects, we will need to package them alongside our application in order to distribute the code to a Spark cluster. Otherwise, our executor(s) may not be able to find the required code [2] and we may encounter errors like 'Exception in thread "main" java.lang.NoClassDefFoundError'.
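For instance, if your application read its settings with the Typesafe Config library (purely a hypothetical dependency for illustration, assuming this version is available on Maven Central), you would declare it in build.sbt like any other dependency, and it would then need to end up in the assembly JAR described below:
libraryDependencies += "com.typesafe" % "config" % "1.3.3"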
To create an assembly jar containing our code and its dependencies, we can use sbt's plugin sbt-assembly.
Using the sbt-assembly plugin is quite simple. For sbt versions from 1.0.0-M6 onwards, you just need to add it as a dependency in project/assembly.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
(For older sbt versions, you can refer to https://github.com/sbt/sbt-assembly#setup.)
We need to note that Spark and Hadoop should NOT be bundled into the assembly jar file, since they are provided by the cluster manager at runtime [2]. We still need to list them in the build.sbt file, but label them as "provided", like the example below. As a positive side effect, this also helps reduce the jar file size.
In addition, we can exclude the Scala library jars (JARs that start with "scala-" and are included in the binary Scala distribution; they are also provided in the Spark environment) by adding a statement to build.sbt like the example below [3]. This can also help reduce the jar file size by at least a few megabytes.
name := "Test"
version := "1.0"
scalaVersion := "2.11.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"
)
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
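Optionally, sbt-assembly also lets you set the output file name explicitly in build.sbt, which can make scripting the spark-submit step easier; the name below is only an example:
assemblyJarName in assembly := "spark-pi-assembly-1.0.jar"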
Then we can run the command sbt assembly in the root directory of your application. The JAR file will be created and located under target/scala-{Scala version you chose}/.
Ready to go! Now you can submit your application using spark-submit.
spark-submit --class SparkPi --master local[*] target/scala-2.11/test_2.11-1.0.jar 100
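The command above runs the application locally. To submit to an actual cluster instead, point --master at your cluster manager; the master URL below is only a placeholder:
spark-submit --class SparkPi --master spark://your-master-host:7077 target/scala-2.11/test_2.11-1.0.jar 100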
[1] Quick Start - Self-Contained Applications
[2] Submitting Applications - Bundling Your Application’s Dependencies
[3] sbt/sbt-assembly