
Spark on AWS


These instructions show how to deploy a Spark application on Amazon Web Services.

Easy way

  1. Compile your application as an uber .jar file and upload it to an S3 bucket.
  2. In the EMR console select Create cluster.
  3. Set launch mode to Step execution.
  4. Choose Spark application type and click Configure:
    • Select your .jar file that is located on S3
    • Set spark-submit options to --class <package.class>
  5. Set the rest of the parameters (instance type & count, roles) and create the cluster.
  6. You can track the driver's progress and output under the Application history tab.

It takes ~10 mins to spin up the cluster before your application is executed.
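
The same flow can also be scripted with the AWS CLI instead of the console. The following is only an illustrative sketch: the bucket name, jar path, class name, EMR release label and instance settings are placeholders, not values prescribed by this guide.

# Upload the uber jar to S3 (bucket name is a placeholder)
aws s3 cp target/app.jar s3://<your-bucket>/app.jar

# Create a transient EMR cluster that runs the Spark step and then terminates
aws emr create-cluster \
    --name spark-test \
    --release-label emr-5.27.0 \
    --applications Name=Spark \
    --use-default-roles \
    --instance-type m4.large \
    --instance-count 3 \
    --steps Type=Spark,Name=SparkApp,ActionOnFailure=TERMINATE_CLUSTER,Args=[--class,<package.class>,s3://<your-bucket>/app.jar] \
    --auto-terminate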

Alternative way

This appears to be a deprecated way of deploying a Spark application: instead of using the EMR solution, it manages the cluster environment with the spark-ec2 scripts (this was the only option for Spark 1.6).

Although more tedious and inconvenient, this option lets you take advantage of the AWS free tier by using t2.micro instances.

0. Create an AWS account.
This tutorial uses the free tier provided for the first year of AWS usage. However, it is recommended to create a billing alarm in case the usage accidentally exceeds the free tier limits.
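
For illustration, a billing alarm can also be created from the AWS CLI. This is only a sketch: the alarm name, threshold and SNS topic are placeholders, and billing metrics are only available in the us-east-1 region.

# Alarm fires when the month-to-date estimated charges exceed the threshold (USD)
aws cloudwatch put-metric-alarm \
    --region us-east-1 \
    --alarm-name billing-alarm \
    --namespace AWS/Billing \
    --metric-name EstimatedCharges \
    --dimensions Name=Currency,Value=USD \
    --statistic Maximum \
    --period 21600 \
    --evaluation-periods 1 \
    --threshold 5 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions <SNS topic ARN that sends the e-mail notification>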

1. Create a key pair.

  • Go to EC2 Console → Key pairs → Create key pair
  • Download the private key .pem file and set the correct permissions:
chmod 400 <path to the private key>
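
If you already have the AWS CLI configured, the same key pair can be created from the command line (the key pair name spark-key is just an example):

# Create the key pair and save the private key locally
aws ec2 create-key-pair --key-name spark-key \
    --query 'KeyMaterial' --output text > spark-key.pem
chmod 400 spark-key.pem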

2. Add an IAM role.

  • Go to Security Credentials → Roles → Create role
  • Select role type AWS Service → EC2 → EC2
  • Continue without selecting any security policy
  • Set the role name to spark-app
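
The console typically creates an EC2 instance profile with the same name automatically. If you script this step with the AWS CLI instead (a sketch, assuming the CLI is already configured with sufficient permissions), the instance profile has to be created explicitly, since spark-ec2 refers to it via --instance-profile-name:

# Role that EC2 instances are allowed to assume
aws iam create-role --role-name spark-app \
    --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

# Matching instance profile for EC2
aws iam create-instance-profile --instance-profile-name spark-app
aws iam add-role-to-instance-profile --instance-profile-name spark-app --role-name spark-app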

3. Create an IAM group.

  • Go to Security Credentials → Groups → Add new group
  • Select Administrator access policy
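
A CLI equivalent, assuming the group name spark-admins (a placeholder):

aws iam create-group --group-name spark-admins
aws iam attach-group-policy --group-name spark-admins \
    --policy-arn arn:aws:iam::aws:policy/AdministratorAccess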

4. Add a user.

  • Go to Security Credentials → Users → Add user
  • Select Programmatic access
  • Add the new user to the previously created group
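
A CLI sketch of the same step (the user name spark-user and the group name spark-admins are placeholders):

aws iam create-user --user-name spark-user
aws iam add-user-to-group --user-name spark-user --group-name spark-admins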

5. Generate access keys.

  • Go to Security Credentials → Access keys (access key ID and secret access key)
  • Generate the keys and export them as shell variables:
export AWS_ACCESS_KEY_ID=…
export AWS_SECRET_ACCESS_KEY=… 
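
The keys can also be generated from the CLI; the response contains the AccessKeyId and SecretAccessKey values to export (the user name is a placeholder):

aws iam create-access-key --user-name spark-user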

6. Get start-up scripts.

git clone -b java8 https://github.com/igorpisarev/spark-ec2.git 

NOTE: this is a patched version that adds Java 8 support; it will be switched back to the main repository once the patch is merged there.

7. Launch Spark cluster.

./spark-ec2 \
    --key-pair=<name of the key pair> \
    --identity-file=<path to the private key> \
    --user-data=java8.sh \
    --spark-ec2-git-repo=https://github.com/igorpisarev/spark-ec2 \
    --spark-ec2-git-branch=java8 \
    --slaves=1 \
    --instance-type=t2.micro \
    --spark-version=2.1.0 \
    --instance-profile-name=spark-app \
    launch spark-test

It may take ~10 minutes for the cluster nodes to enter the ssh-ready state and for everything to be set up.

8. Copy the application binaries to the cluster.

  • Build a fat/uber jar that contains all dependencies
  • Copy the resulting .jar file to the master node:
scp \
    -i <path to the private key> \
    <path to the uber jar> \
    root@<master's public IP>:/root/app.jar

9. Log in to the cluster.

./spark-ec2 \
    --key-pair=<name of the key pair> \
    --identity-file=<path to the private key> \
    login spark-test

10. Run the application.

/root/spark/bin/spark-submit \
    --class <package.class> \
    --master <master URL> \
    /root/app.jar \
    <parameters for your app>
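
The master URL has the form spark://<master's hostname>:7077 (7077 is the default port of the standalone master). If you are unsure of the hostname, the spark-ec2 script can print it when run from your local machine:

./spark-ec2 get-master spark-test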

11. Terminate the cluster.

./spark-ec2 --delete-groups destroy spark-test
