Job Dependencies
Overview
The container images provided by Stackable include Apache Spark and PySpark applications and libraries.
In addition, they include commonly used libraries to connect to storage systems supporting the hdfs://, s3a:// and abfs:// protocols.
These systems are commonly used to store data processed by Spark applications.
Sometimes the applications need to integrate with additional systems or use processing algorithms not included in the Apache Spark distribution. This guide explains how you can provision your Spark jobs with additional dependencies to support these requirements.
Dependency provisioning
There are multiple ways to submit Apache Spark jobs with external dependencies. Each has its own advantages and disadvantages and the choice of one over the other depends on existing technical and managerial constraints.
To provision job dependencies in Spark workloads, you construct the SparkApplication with one of the following dependency specifications:
- Custom Spark images
- Dependency volumes
- Maven/Java packages
- Python packages
Custom Spark images
With this method, you submit a SparkApplication for which the sparkImage refers to the full custom image name.
It is recommended to base the custom image on one of the Stackable Spark images to ensure compatibility with the operator.
Below is an example of a custom image that includes a JDBC driver:
FROM oci.stackable.tech/sdp/spark-k8s:4.0.0-stackable0.0.0-dev (1)
RUN curl --fail -o /stackable/spark/jars/postgresql-42.6.0.jar "https://jdbc.postgresql.org/download/postgresql-42.6.0.jar" (2)
1 | Start from an existing Stackable image. |
2 | Download the JDBC driver and place it in the Spark JARs directory. |
Build your custom image and push it to your container registry.
docker build -t my-registry/spark-k8s:4.0.0-psql .
docker push my-registry/spark-k8s:4.0.0-psql
And the following snippet showcases an application that uses the custom image:
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-jdbc
spec:
  sparkImage:
    custom: "my-registry/spark-k8s:4.0.0-psql" (1)
    productVersion: "4.0.0" (2)
  ...
1 | Reference to your custom image. |
2 | Apache Spark version bundled in your custom image. |
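Once the custom image is referenced like this, the application is submitted like any other SparkApplication, for example (the file name is illustrative):
kubectl apply -f spark-jdbc.yaml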
Dependency volumes
With this method, the job dependencies are provisioned from a PersistentVolume as shown in this example:
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-pvc
  namespace: default
spec:
  sparkImage:
    productVersion: 4.0.0
  mode: cluster
  mainApplicationFile: s3a://my-bucket/app.jar (1)
  mainClass: org.example.App (2)
  sparkConf: (3)
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps (4)
      persistentVolumeClaim:
        claimName: pvc-ksv
  job:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)
  driver:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (6)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (7)
1 | Job artifact located on S3. |
2 | Name of the main class to run. |
3 | The job dependencies provisioned from the volume below are added to the class path of the driver and executors. |
4 | A PersistentVolumeClaim created by the user prior to submitting the Spark job. |
5 | The volume containing the dependencies is mounted in the job pod. |
6 | The volume containing the dependencies is mounted in the driver pod. |
7 | The volume containing the dependencies is mounted in the executor pods. |
The Spark operator has no control over the contents of the dependency volume. It is your responsibility to make sure all required dependencies are installed in the correct versions.
A PersistentVolumeClaim and the associated PersistentVolume can be defined and provisioned like this:
---
apiVersion: v1
kind: PersistentVolume (1)
metadata:
  name: pv-ksv
spec:
  storageClassName: standard
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 2Gi
  hostPath:
    path: /some-host-location
---
apiVersion: v1
kind: PersistentVolumeClaim (2)
metadata:
  name: pvc-ksv
spec:
  volumeName: pv-ksv (3)
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: batch/v1
kind: Job (4)
metadata:
  name: aws-deps
spec:
  template:
    spec:
      restartPolicy: Never
      volumes:
        - name: job-deps
          persistentVolumeClaim:
            claimName: pvc-ksv (5)
      containers:
        - name: aws-deps
          # Any image that provides curl can be used to populate the volume.
          # The artifact below is only an example; download whatever your job needs.
          image: docker.io/curlimages/curl:latest
          command:
            - sh
            - -c
            - >-
              mkdir -p /stackable/spark/dependencies/jars &&
              curl --fail -L -o /stackable/spark/dependencies/jars/aws-java-sdk-bundle-1.12.262.jar
              https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
          volumeMounts:
            - name: job-deps
              mountPath: /stackable/spark/dependencies
1 | Create a volume. The definition, size and type of the volume depend heavily on the type of cluster you are using. |
2 | Create a persistent volume claim. This allows the volume to be populated with the necessary dependencies and later referenced by the Spark job. |
3 | The volume name referenced by the PersistentVolumeClaim. |
4 | Create a job that mounts the volume and populates it with the necessary dependencies. This job must be run to completion before submitting the Spark job, as shown in the sequence below. |
5 | The job references the PersistentVolumeClaim created above. |
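Putting it together, one possible sequence is to create the volume resources, run the provisioning job to completion and only then submit the Spark job. Assuming the manifests above are saved as volumes.yaml, deps-job.yaml and spark-app.yaml (the file names are illustrative):
kubectl apply -f volumes.yaml
kubectl apply -f deps-job.yaml
kubectl wait --for=condition=complete --timeout=300s job/aws-deps
kubectl apply -f spark-app.yaml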
Maven packages
A flexible way to provision dependencies is to use the built-in spark-submit support for Maven package coordinates.
The downside of this method is that the job dependencies are downloaded every time the job is submitted, which has several implications you must be aware of.
For example, the job submission time is longer than with the other methods.
Network connectivity problems may lead to job submission failures.
And finally, not all types of dependencies can be provisioned this way.
Most notably, JDBC drivers cannot be provisioned like this, since the JVM only looks for them at startup time.
The snippet below showcases how to add Apache Iceberg support to a Spark application. Note that the Iceberg runtime artifact must match the Spark and Scala versions of your image.
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-iceberg
spec:
  sparkConf:
    spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
    spark.sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkSessionCatalog
    spark.sql.catalog.spark_catalog.type: hive
    spark.sql.catalog.local: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.local.type: hadoop
    spark.sql.catalog.local.warehouse: /tmp/warehouse
  deps:
    packages:
      - org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1 (1)
  ...
1 | Maven package coordinates for Apache Iceberg. This is downloaded from the central Maven repository and made available to the Spark application. |
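Once the Iceberg runtime is on the classpath, the application can use the local catalog configured in sparkConf above. A minimal PySpark sketch, with illustrative table and column names:
from pyspark.sql import SparkSession

# The session picks up the Iceberg extension and catalog settings from sparkConf.
spark = SparkSession.builder.appName("spark-iceberg").getOrCreate()

# Create an Iceberg table in the "local" hadoop catalog and append a few rows.
spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, payload STRING) USING iceberg")
spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "payload"]).writeTo("local.db.events").append()
spark.table("local.db.events").show()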
As mentioned above, not all dependencies can be provisioned this way. JDBC drivers are notorious for not being supported by this method, but other types of dependencies may not work either. Whether a JAR file can be provisioned via its Maven coordinates depends largely on how it is loaded by the JVM. In such cases, consider building your own custom Spark image as shown above.
Python packages
When submitting PySpark jobs, users can specify additional Python requirements that are installed before the driver and executor pods are created.
Here is an example:
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-report
spec:
  mainApplicationFile: /app/run.py (1)
  deps:
    requirements:
      - tabulate==0.8.9 (2)
  ...
1 | The main application file. In this example it is assumed that the file is part of a custom image. |
2 | A Python package that is used by the application and installed when the application is submitted. |
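For illustration, a hypothetical run.py could use the provisioned package like this (the data and column names are made up):
from pyspark.sql import SparkSession
from tabulate import tabulate  # installed via deps.requirements

spark = SparkSession.builder.appName("pyspark-report").getOrCreate()

# Collect a small result set and render it as a text table with tabulate.
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
rows = [row.asDict() for row in df.collect()]
print(tabulate(rows, headers="keys"))

spark.stop()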