
Spark on AWS


These instructions show how to deploy a Spark application on Amazon Web Services.

Easy way

  1. Compile your application as an uber .jar file and upload it to an S3 bucket.
  2. In the EMR console select Create cluster.
  3. Set launch mode to Step execution.
  4. Choose Spark application type and click Configure:
    • Select your .jar file that is located on S3
    • Set spark-submit options to --class <package.class>
  5. Set the rest of the parameters (instance type & count, roles) and create the cluster.
  6. You can track the driver's progress and output under the Application history tab.

It takes ~10 mins to spin up the cluster before your application is executed.
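
The same flow can also be scripted with the AWS CLI instead of the console. The following is only an illustrative sketch: the bucket name, jar path, class name, EMR release label and instance settings are placeholders, not values prescribed by this guide.

# Upload the uber jar to S3 (bucket name is a placeholder)
aws s3 cp target/app.jar s3://<your-bucket>/app.jar

# Create a transient EMR cluster that runs the Spark step and then terminates
aws emr create-cluster \
    --name spark-test \
    --release-label emr-5.27.0 \
    --applications Name=Spark \
    --use-default-roles \
    --instance-type m4.large \
    --instance-count 3 \
    --steps Type=Spark,Name=SparkApp,ActionOnFailure=TERMINATE_CLUSTER,Args=[--class,<package.class>,s3://<your-bucket>/app.jar] \
    --auto-terminate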

Alternative way

This appears to be a deprecated way of deploying a Spark application: instead of using the EMR solution, it manages the cluster environment with the spark-ec2 scripts (this was the only option for Spark 1.6).

Although more tedious and inconvenient, this option lets you take advantage of the AWS free tier by using t2.micro instances.

0. Create an AWS account.
This tutorial uses the free tier provided for the first year of AWS usage. However, it is recommended to create a billing alarm in case the usage accidentally exceeds the free tier limits.
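
For illustration, a billing alarm can also be created from the AWS CLI. This is only a sketch: the alarm name, threshold and SNS topic are placeholders, and billing metrics are only available in the us-east-1 region.

# Alarm fires when the month-to-date estimated charges exceed the threshold (USD)
aws cloudwatch put-metric-alarm \
    --region us-east-1 \
    --alarm-name billing-alarm \
    --namespace AWS/Billing \
    --metric-name EstimatedCharges \
    --dimensions Name=Currency,Value=USD \
    --statistic Maximum \
    --period 21600 \
    --evaluation-periods 1 \
    --threshold 5 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions <SNS topic ARN that sends the e-mail notification>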

1. Create a key pair.

  • Go to EC2 Console → Key pairs → Create key pair
  • Download the private key .pem file and set the correct permissions:
chmod 400 <path to the private key>
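
If you already have the AWS CLI configured, the same key pair can be created from the command line (the key pair name spark-key is just an example):

# Create the key pair and save the private key locally
aws ec2 create-key-pair --key-name spark-key \
    --query 'KeyMaterial' --output text > spark-key.pem
chmod 400 spark-key.pem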

2. Add an IAM role.

  • Go to Security Credentials → Roles → Create role
  • Select role type AWS Service → EC2 → EC2
  • Continue without selecting any security policy
  • Set the role name to spark-app
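
The console typically creates an EC2 instance profile with the same name automatically. If you script this step with the AWS CLI instead (a sketch, assuming the CLI is already configured with sufficient permissions), the instance profile has to be created explicitly, since spark-ec2 refers to it via --instance-profile-name:

# Role that EC2 instances are allowed to assume
aws iam create-role --role-name spark-app \
    --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

# Matching instance profile for EC2
aws iam create-instance-profile --instance-profile-name spark-app
aws iam add-role-to-instance-profile --instance-profile-name spark-app --role-name spark-app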

3. Create an IAM group.

  • Go to Security Credentials → Groups → Add new group
  • Select Administrator access policy
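
A CLI equivalent, assuming the group name spark-admins (a placeholder):

aws iam create-group --group-name spark-admins
aws iam attach-group-policy --group-name spark-admins \
    --policy-arn arn:aws:iam::aws:policy/AdministratorAccess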

4. Add a user.

  • Go to Security Credentials → Users → Add user
  • Select Programmatic access
  • Add the new user to the previously created group
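
A CLI sketch of the same step (the user name spark-user and the group name spark-admins are placeholders):

aws iam create-user --user-name spark-user
aws iam add-user-to-group --user-name spark-user --group-name spark-admins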

5. Generate access keys.

  • Go to Security Credentials → Access keys (access key ID and secret access key)
  • Generate the keys and export them as shell variables:
export AWS_ACCESS_KEY_ID=…
export AWS_SECRET_ACCESS_KEY=… 
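
The keys can also be generated from the CLI; the response contains the AccessKeyId and SecretAccessKey values to export (the user name is a placeholder):

aws iam create-access-key --user-name spark-user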

6. Get start-up scripts.

git clone -b java8 https://github.com/igorpisarev/spark-ec2.git 

NOTE: this is a patched version that adds Java 8 support; it will be switched back to the main repository once the patch is merged there.

7. Launch Spark cluster.

./spark-ec2 \
    --key-pair=<name of the key pair> \
    --identity-file=<path to the private key> \
    --user-data=java8.sh \
    --spark-ec2-git-repo=https://github.com/igorpisarev/spark-ec2 \
    --spark-ec2-git-branch=java8 \
    --slaves=1 \
    --instance-type=t2.micro \
    --spark-version=2.1.0 \
    --instance-profile-name=spark-app \
    launch spark-test

It may take ~10 minutes for the cluster nodes to enter the ssh-ready state and for everything to be set up.

8. Copy the application binaries to the cluster.

  • Build a fat/uber jar that contains all dependencies
  • Copy the resulting .jar file to the master node:
scp \
    -i <path to the private key> \
    <path to the uber jar> \
    root@<master's public IP>:/root/app.jar

9. Log in to the cluster.

./spark-ec2 \
    --key-pair=<name of the key pair> \
    --identity-file=<path to the private key> \
    login spark-test

10. Run the application.

/root/spark/bin/spark-submit \
    --class <package.class> \
    --master <master URL> \
    /root/app.jar \
    <parameters for your app>
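
The master URL has the form spark://<master's hostname>:7077 (7077 is the default port of the standalone master). If you are unsure of the hostname, the spark-ec2 script can print it when run from your local machine:

./spark-ec2 get-master spark-test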

11. Terminate the cluster.

./spark-ec2 --delete-groups destroy spark-test
