Job Dependencies
Overview
The container images provided by Stackable include Apache Spark and PySpark applications and libraries.
In addition, they include commonly used libraries to connect to storage systems supporting the hdfs://, s3a:// and abfs:// protocols.
These systems are commonly used to store data processed by Spark applications.
Sometimes the applications need to integrate with additional systems or use processing algorithms not included in the Apache Spark distribution. This guide explains how you can provision your Spark jobs with additional dependencies to support these requirements.
Dependency provisioning
There are multiple ways to submit Apache Spark jobs with external dependencies. Each has its own advantages and disadvantages and the choice of one over the other depends on existing technical and managerial constraints.
To provision job dependencies in Spark workloads, you construct the SparkApplication with one of the following dependency specifications:
- Custom Spark images
- Dependency volumes
- Maven/Java packages
- Python packages
Custom Spark images
With this method, you submit a SparkApplication for which the sparkImage refers to the full custom image name.
It is recommended to base the custom image on one of the Stackable Spark images to ensure compatibility with the operator.
Below is an example of a custom image that includes a JDBC driver:
FROM oci.stackable.tech/sdp/spark-k8s:4.0.0-stackable0.0.0-dev (1)
RUN curl --fail -o /stackable/spark/jars/postgresql-42.6.0.jar "https://jdbc.postgresql.org/download/postgresql-42.6.0.jar" (2)
1 | Start from an existing Stackable image. |
2 | Download the JDBC driver and place it in the Spark JARs directory. |
Build your custom image and push it to your container registry.
docker build -t my-registry/spark-k8s:4.0.0-psql .
docker push my-registry/spark-k8s:4.0.0-psql
And the following snippet showcases an application that uses the custom image:
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-jdbc
spec:
  sparkImage:
    custom: "my-registry/spark-k8s:4.0.0-psql" (1)
    productVersion: "4.0.0" (2)
  ...
1 | Reference to your custom image. |
2 | Apache Spark version bundled in your custom image. |
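Once the custom image is referenced like this, the application is submitted like any other SparkApplication, for example (the file name is illustrative):
kubectl apply -f spark-jdbc.yaml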
Dependency volumes
With this method, the job dependencies are provisioned from a PersistentVolume as shown in this example:
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-pvc
  namespace: default
spec:
  sparkImage:
    productVersion: 4.0.0
  mode: cluster
  mainApplicationFile: s3a://my-bucket/app.jar (1)
  mainClass: org.example.App (2)
  sparkConf: (3)
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps (4)
      persistentVolumeClaim:
        claimName: pvc-ksv
  job:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)
  driver:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (6)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (7)
1 | Job artifact located on S3. |
2 | Name of the main class to run. |
3 | The job dependencies provisioned from the volume below are added to the class path of the driver and executors. |
4 | A PersistentVolumeClaim created by the user prior to submitting the Spark job. |
5 | The volume containing the dependencies is mounted in the job pod. |
6 | The volume containing the dependencies is mounted in the driver pod. |
7 | The volume containing the dependencies is mounted in the executor pods. |
The Spark operator has no control over the contents of the dependency volume. It is your responsibility to make sure all required dependencies are installed in the correct versions.
A PersistentVolumeClaim and the associated PersistentVolume can be defined and provisioned like this:
---
apiVersion: v1
kind: PersistentVolume (1)
metadata:
  name: pv-ksv
spec:
  storageClassName: standard
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 2Gi
  hostPath:
    path: /some-host-location
---
apiVersion: v1
kind: PersistentVolumeClaim (2)
metadata:
  name: pvc-ksv
spec:
  volumeName: pv-ksv (3)
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: batch/v1
kind: Job (4)
metadata:
  name: aws-deps
spec:
  template:
    spec:
      restartPolicy: Never
      volumes:
        - name: job-deps
          persistentVolumeClaim:
            claimName: pvc-ksv (5)
      containers:
        - name: aws-deps
          # Any image that provides curl can be used to populate the volume.
          # The artifact below is only an example; download whatever your job needs.
          image: docker.io/curlimages/curl:latest
          command:
            - sh
            - -c
            - >-
              mkdir -p /stackable/spark/dependencies/jars &&
              curl --fail -L -o /stackable/spark/dependencies/jars/aws-java-sdk-bundle-1.12.262.jar
              https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
          volumeMounts:
            - name: job-deps
              mountPath: /stackable/spark/dependencies
1 | Create a volume. The definition, size and type of the volume depend heavily on the type of cluster you are using. |
2 | Create a persistent volume claim. This allows the volume to be populated with the necessary dependencies and later referenced by the Spark job. |
3 | The volume name referenced by the PersistentVolumeClaim. |
4 | Create a job that mounts the volume and populates it with the necessary dependencies. This job must be run to completion before submitting the Spark job, as shown in the sequence below. |
5 | The job references the PersistentVolumeClaim created above. |
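Putting it together, one possible sequence is to create the volume resources, run the provisioning job to completion and only then submit the Spark job. Assuming the manifests above are saved as volumes.yaml, deps-job.yaml and spark-app.yaml (the file names are illustrative):
kubectl apply -f volumes.yaml
kubectl apply -f deps-job.yaml
kubectl wait --for=condition=complete --timeout=300s job/aws-deps
kubectl apply -f spark-app.yaml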
Maven packages
A flexible way to provision dependencies is to use the built-in spark-submit support for Maven package coordinates.
The downside of this method is that the job dependencies are downloaded every time the job is submitted, which has several implications you must be aware of.
For example, the job submission time is longer than with the other methods.
Network connectivity problems may lead to job submission failures.
And finally, not all types of dependencies can be provisioned this way.
Most notably, JDBC drivers cannot be provisioned like this, since the JVM only looks for them at startup time.
The snippet below showcases how to add Apache Iceberg support to a Spark application. Note that the Iceberg runtime artifact must match the Spark and Scala versions of your image.
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-iceberg
spec:
  sparkConf:
    spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
    spark.sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkSessionCatalog
    spark.sql.catalog.spark_catalog.type: hive
    spark.sql.catalog.local: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.local.type: hadoop
    spark.sql.catalog.local.warehouse: /tmp/warehouse
  deps:
    packages:
      - org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1 (1)
  ...
1 | Maven package coordinates for Apache Iceberg. This is downloaded from the central Maven repository and made available to the Spark application. |
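Once the Iceberg runtime is on the classpath, the application can use the local catalog configured in sparkConf above. A minimal PySpark sketch, with illustrative table and column names:
from pyspark.sql import SparkSession

# The session picks up the Iceberg extension and catalog settings from sparkConf.
spark = SparkSession.builder.appName("spark-iceberg").getOrCreate()

# Create an Iceberg table in the "local" hadoop catalog and append a few rows.
spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, payload STRING) USING iceberg")
spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "payload"]).writeTo("local.db.events").append()
spark.table("local.db.events").show()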
As mentioned above, not all dependencies can be provisioned this way. JDBC drivers are notorious for not being supported by this method, but other types of dependencies may not work either. Whether a JAR file can be provisioned via its Maven coordinates depends largely on how it is loaded by the JVM. In such cases, consider building your own custom Spark image as shown above.
Python packages
When submitting PySpark jobs, users can specify additional Python requirements that are installed before the driver and executor pods are created.
Here is an example:
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-report
spec:
  mainApplicationFile: /app/run.py (1)
  deps:
    requirements:
      - tabulate==0.8.9 (2)
  ...
1 | The main application file. In this example it is assumed that the file is part of a custom image. |
2 | A Python package that is used by the application and installed when the application is submitted. |
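For illustration, a hypothetical run.py could use the provisioned package like this (the data and column names are made up):
from pyspark.sql import SparkSession
from tabulate import tabulate  # installed via deps.requirements

spark = SparkSession.builder.appName("pyspark-report").getOrCreate()

# Collect a small result set and render it as a text table with tabulate.
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
rows = [row.asDict() for row in df.collect()]
print(tabulate(rows, headers="keys"))

spark.stop()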