TechnologyMay 20, 2017

Spark Application Dependency Management

Spark Application Dependency Management

This blog post was written for DataStax Enterprise 5.1.0. Refer to the DataStax documentation for your specific version of DSE.

Compiling and executing Apache Spark™ applications with custom dependencies can be a challenging task. Spark beginners can feel overwhelmed by the number of different solutions to this problem. Diversity of library versions, the number of different build tools and finally the build techniques, such as assembling fat JARs and dependency shading, can cause a headache.

In this blog post, we shed light on how to manage compile-time and runtime dependencies of a Spark Application that is compiled and executed against DataStax Enterprise (DSE) or open source Apache Spark (OSS).

Along the way we use a set of predefined bootstrap projects that can be adopted and used as a starting point for developing a new Spark Application. These examples are all about connecting, reading, and writing to and from a DataStax Enterprise or Apache Cassandra(R) system.

Quick Glossary:

Spark Driver: A user application that contains a Spark Context.
Spark Context: A Scala class that functions as the control mechanism for distributed work.
Spark Executor: A remote Java Virtual Machine (JVM) that performs work as orchestrated by the Spark Driver.
Runtime classpath: A list of all dependencies available during execution (in execution environment such as Apache Spark cluster). It's important to note that the runtime classpath of the Spark Driver is not necessarily identical to the runtime classpath of the Spark Executor.
Compile classpath: A full list of all dependencies available during compilation (specified with build tool syntax in a build file).

Choose language and build tool

First, git clone the DataStax repository https://github.com/datastax/SparkBuildExamples that provides the code that you are going to work with. Within cloned directories there are Spark Application bootstrap projects for Java and Scala, and for the most frequently used build tools:

  • Scala Build Tool (sbt)
  • Apache Maven™
  • Gradle

In the context of managing dependencies for the Spark Application, these build tools are equivalent. It is up to you to select the language and build tool that best fits you and your team.

For each build tool, the way the application is built is defined with declarative syntax embedded in files in the application’s directory:

  • Sbt: build.sbt
  • Apache Maven: pom.xml
  • Gradle: build.gradle

From now on we are going to refer to those files as a build files.

Choose execution environment

Two different execution environments are supported in the repository: DSE and OSS.

DSE

If you are planning to execute your Spark Application on a DSE cluster, use the dse bootstrap project which greatly simplifies dependency management.

It leverages the dse-spark-dependencies library which instructs a build tool to include all dependency JAR files that are distributed with DSE and are available in the DSE cluster runtime classpath. These JAR files include Apache Spark JARs and their dependencies, Apache Cassandra JARs, Spark Cassandra Connector JAR, and many others. Everything that is needed to build your bootstrap Spark Application is supplied by the dse-spark-dependencies dependency. To view the list of all dse-spark-dependencies dependencies, visit our public repo and inspect the pom files that are relevant to your DSE cluster version.

An example of an DSE built.sbt:

libraryDependencies += "com.datastax.dse" % "dse-spark-dependencies" % "5.1.1" % "provided"

Using this managed dependency will automatically match your compile time dependencies with the DSE dependencies on the runtime classpath. This means there is no possibility in the execution environment for dependency version conflicts, unresolved dependencies etc.

Note: The DSE version must match the one in your cluster, please see “Execution environment version” section for details.

DSE projects templates are built with sbt 0.13.13 or later. In case of unresolved dependencies errors, update sbt and then clean ivy cache (with rm ~/.ivy2/cache/com.datastax.dse/dse-spark-dependencies/ command).

OSS

If you are planning to execute your Spark Application on an open source Apache Spark cluster, use the oss bootstrap project. For the oss bootstrap project, all compilation classpath dependencies must be manually specified in build files.

An example of an OSS built.sbt:

libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
"com.datastax.spark" %% "spark-cassandra-connector" % connectorVersion % "provided"
)

For OSS, you must specify these four dependencies for the compilation classpath.

During execution, the Spark runtime classpath already contains the org.apache.spark.* dependencies, so all we need to do is to add spark-cassandra-connector as an extra dependency. The DataStax spark-cassandra-connector doesn’t exist in the Spark cluster by default. The most common method to include this additional dependency is to use --packages argument for the spark-submit command. An example of --packages argument usage is shown in the “Execute” section below.

The Apache Spark versions in the build file must match the Spark version in your Spark cluster. See next section for details.

Execution environment versions

It is possible that your DSE or OSS cluster version is different than the one specified in bootstrap project.

DSE

If you are a DSE user then checkout the SparkBuildExamples version that matches your DSE cluster version, for example:

git checkout <DSE_version>
# example: git checkout 5.0.6

If you are a DSE 4.8.x user then checkout 4.8.13 or newer 4.8.x version.

OSS

If you are planning to execute your application against a Spark cluster different than the one specified in a bootstrap project build file, adjust all dependencies version listed there. Fortunately, the main component versions are variables. See the example below and adjust following according to your needs.

Sbt

val sparkVersion = "2.0.2"
val connectorVersion = "2.0.0"

 

Maven

<properties>
  <spark.version>2.0.2</spark.version>
  <connector.version>2.0.0</connector.version>
</properties>

 

Gradle

def sparkVersion = "2.0.2"
def connectorVersion = "2.0.0"

Let’s say that your Spark cluster has 1.5.1 version. Go to version compatibility table, there you can see compatible Apache Cassandra versions and Spark Cassandra Connector versions. In this example, our Apache Spark 1.5.1 cluster is compatible with 1.5.x Spark Cassandra Connector, the newest one is 1.5.2 (newest versions can be found on Releases page). Adjust the variables accordingly and you are good to go!

Build

The build command differs for each build tool. The bootstrap projects can be built with the following commands.

Sbt

sbt clean assembly
# produces jar in path: target/scala-2.11/writeRead-assembly-0.1.jar

Maven

mvn clean package
# produces jar in path: target/writeRead-0.1.jar

Gradle

gradle clean shadowJar
# produces jar in path: build/libs/writeRead-0.1.jar

Execute

The spark-submit command differs between environments. In DSE environment, the command is simplified to autodetect parameters like --master. In addition, various other Apache Cassandra and DSE specific parameters are added to the default SparkConf. Use the following commands to execute the JAR that you built. Refer to the Spark docs for details about spark-submit command.

DSE

dse spark-submit --class com.datastax.spark.example.WriteRead <path_to_produced_jar>

OSS

spark-submit --conf spark.cassandra.connection.host=<cassandra_host> --class com.datastax.spark.example.WriteRead --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0 --master <master_url> <path_to_produced_jar>

Note the usage of --packages to include the spark-cassandra-connector on the runtime classpath for all application JVMs.

Provide additional dependencies

Now that you have successfully built and executed this simple application, it’s time to see how extra dependencies can be added to your Spark Application.

Let’s say your application grows with time and there is a need to incorporate an external dependency to add functionality to your application. For this argument, let the new dependency  be commons-math3.

To supply this dependency to the compilation classpath, we must provide proper configuration entries in build files.

There are two ways to provide additional dependencies to runtime classpath assembling or manually providing all dependencies with the spark-submit command.

Assembly

Assembling is a way of directly including dependencies classes in the resulting JAR file (sometimes called fat-jar or uber-jar) as if these dependency classes were developed along with your application. When the user code is shipped to Apache Spark Executors, these dependency classes are included in the application JAR on the runtime classpath. To see an example, uncomment the following sections in any of your build files.

Sbt

libraryDependencies += "org.apache.commons" %% "commons-math3" % "3.6.1"

 

Maven

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-math3</artifactId>
  <version>3.6.1</version>
</dependency>

 

Gradle

assembly "org.apache.commons:commons-math3:3.6.1"

Now you can use commons-math3 classes in your application code. When your development is finished, you can create a JAR file using the build command and submit it without any modifications to the spark-submit command. If you are curious to see where the additional dependency is, use any archive application to open the produced JAR to see that commons-math3 classes are included.

When assembling, you might run into conflicts where multiple jars attempt to include a file with the same filename but different contents. There are several solutions to this problem, most common are: removing one of the conflicting dependencies or shading (which is described later in this blog post). If all else fails, most plugins have a variety of other merge strategies for handling these situations. For example, the  https://github.com/sbt/sbt-assembly#merge-strategy.

Manually adding JARs to the runtime classpath

If you don’t want to assembly a fat JAR (maybe the number of additional dependencies produced a 100MB JAR file and you consider this size unusable), use an alternate way to provide additional dependencies to runtime classpath.

Mark some of the dependencies with provided keyword to exclude them from the assembly JAR.

Sbt

libraryDependencies += "org.apache.commons" %% "commons-math3" % "3.6.1" % "provided"

 

Maven

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-math3</artifactId>
  <version>3.6.1</version>
  <scope>provided</scope>
</dependency>

 

Gradle

provided "org.apache.commons:commons-math3:3.6.1"

After building a JAR, manually specify additional dependencies with spark-submit command during application submission. Add or extend existing --packages argument of spark-submit command. Note that multiple dependencies are separated by commas. For example:

--packages org.apache.commons:commons-math3:3.6.1,com.datastax.spark:spark-cassandra-connector_2.11:2.0.0

User dependencies conflicting with Spark dependencies

What if you want to use different version of a dependency than the version that is present in the execution environment?

For example, a Spark cluster already has commons-csv in its runtime classpath and the developer needs a different version in their application. Maybe the Spark version is old and doesn’t contain all the needed functionality. Maybe the new version is not backward compatible and breaks Spark Application execution.

This is a common problem and there is a solution: shading.

Shading

Shading is a build technique where dependency classes are packaged with application JAR files (like in assembling) but additionally package structure of this classes is altered. This process happens at compile time and is transparent to the developer. Shading simply substitutes all dependency references in a Spark Application with the same (functionality-wise) classes but located in different packages. For example, the class org.apache.commons.csv.CSVParser for Spark Application becomes shaded.org.apache.commons.csv.CSVParser.

To see shading in action uncomment following sections in build file of your choice. This will embed old commons-csv in resulting jar but with prepended package “shaded”.

Sbt

assembly "org.apache.commons:commons-csv:1.0"

and

assemblyShadeRules in assembly := Seq( 
 ShadeRule.rename("org.apache.commons.csv.**" -> "shaded.org.apache.commons.csv.@1").inAll 
)

Maven

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-csv</artifactId>
  <version>1.0</version>
</dependency>

and

<relocations>
  <relocation>
    <pattern>org.apache.commons.csv</pattern>
    <shadedPattern>shaded.org.apache.commons.csv</shadedPattern>
  </relocation>
</relocations>

 

Gradle

libraryDependencies += "org.apache.commons" % "commons-csv" % "1.0"

and

shadowJar {
  relocate 'org.apache.commons.csv', 'shaded.org.apache.commons.csv'
}

After building the JAR, you can look into its content and see that commons-csv is embedded in shaded directory.

Summary

In this article, you learned how to manage compile-time and runtime dependencies of a simple Apache Spark application that connects to an Apache Cassandra database by using the Spark Cassandra Connector. You learned how Scala and Java projects are structured with sbt, gradle, and maven build tools. You also learned different ways to provide additional dependencies and how to resolve dependency conflicts with shading.



 

One-Stop Data API for Production GenAI

Astra DB gives developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.