The Spark Python template image serves as a base image to build your own Python application to run on a Spark cluster. See big-data-europe/docker-spark README for a description how to setup a Spark cluster.
You can build and launch your Python application on a Spark cluster by extending this image with your sources. The template uses pip to manage the dependencies of your
project, so make sure you have a requirements.txt
file in the root of your application specifying all the dependencies.
- Create a Dockerfile in the root folder of your project (which also contains a
requirements.txt
) - Extend the Spark Python template Docker image
- Configure the following environment variables (unless the default value satisfies):
SPARK_MASTER_NAME
(default: spark-master)SPARK_MASTER_PORT
(default: 7077)SPARK_APPLICATION_PYTHON_LOCATION
(default: /app/app.py)SPARK_APPLICATION_ARGS
- Build and run the image
docker build --rm -t bde/spark-app .
docker run --name my-spark-app -e ENABLE_INIT_DAEMON=false --link spark-master:spark-master --net dockerspark_default -d bde/spark-app
The sources in the project folder will be automatically added to /app
if you directly extend the Spark Python template image. Otherwise you will have to add the sources by yourself in your Dockerfile with the command:
COPY . /app
If you overwrite the template's CMD
in your Dockerfile, make sure to execute the /template.sh
script at the end.
FROM bde2020/spark-python-template:3.3.0-hadoop3.3
MAINTAINER You <[email protected]>
ENV SPARK_APPLICATION_PYTHON_LOCATION /app/entrypoint.py
ENV SPARK_APPLICATION_ARGS "foo bar baz"