R on Spark

SparkR is an R package that provides a lightweight frontend for using Spark from R.

Installing SparkR

Requirements

SparkR requires Scala 2.10 and Spark version >= 0.9.0. Because Spark 0.9.0 has not yet been released, the current build uses the latest release candidate from the Apache staging repositories. You can also build SparkR against a different Spark version (>= 0.9) by modifying pkg/src/build.sbt.

SparkR also requires the R package rJava to be installed. To install rJava, you can run the following command in R:

install.packages("rJava")
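Because rJava needs a Java installation alongside R, it can help to confirm that both are on the PATH before installing. The following is a minimal sketch rather than part of SparkR: the helper name missing_tools is made up for illustration, and checking for the commands R and java is an assumption about a typical setup.

```shell
#!/bin/sh
# Hypothetical helper: print each named command that is not on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumption: R and java are the commands a local rJava install relies on.
missing=$(missing_tools R java)
if [ -n "$missing" ]; then
  echo "Missing before installing rJava:" $missing
else
  echo "R and java found on PATH"
fi
```

If anything is reported missing, install it first and then run install.packages("rJava") from R.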

Package installation

To develop SparkR, you can build the Scala package and the R package using:

./install-dev.sh

If you wish to try out the package directly from GitHub, you can use install_github from devtools:

library(devtools)
install_github("amplab-extras/SparkR-pkg", subdir="pkg")

By default, SparkR links to Hadoop 1.0.4. To use SparkR with another Hadoop version, you will need to rebuild SparkR against the same Hadoop version that your Spark installation is linked to. For example, to use SparkR with a CDH 4.2.0 MR1 cluster, you can run:

SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 ./install-dev.sh

By default, SparkR uses sbt to build an assembly jar. If you wish to use Maven instead, you can set the environment variable USE_MAVEN=1. For example:

USE_MAVEN=1 ./install-dev.sh

If you are building SparkR from behind a proxy, you can set up Maven to use the right proxy server.
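Maven reads proxy settings from the proxies section of its settings.xml file. As a rough sketch, the script below writes such a section to a scratch file. The host proxy.example.com, port 8080, the id build-proxy, and the variable names PROXY_HOST, PROXY_PORT and SETTINGS_FILE are all placeholders chosen for illustration, not values SparkR defines.

```shell
#!/bin/sh
# Placeholder proxy values; replace them with your own proxy server.
PROXY_HOST="${PROXY_HOST:-proxy.example.com}"
PROXY_PORT="${PROXY_PORT:-8080}"
# Write to a scratch file by default so an existing ~/.m2/settings.xml is not overwritten.
SETTINGS_FILE="${SETTINGS_FILE:-${TMPDIR:-/tmp}/settings-proxy.xml}"

cat > "$SETTINGS_FILE" <<EOF
<settings>
  <proxies>
    <proxy>
      <id>build-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>${PROXY_HOST}</host>
      <port>${PROXY_PORT}</port>
    </proxy>
  </proxies>
</settings>
EOF
echo "Wrote $SETTINGS_FILE"
```

Maven picks up ~/.m2/settings.xml by default, so merging the generated proxies section into that file should make it apply to a USE_MAVEN=1 build.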

Running sparkR

If you have cloned and built SparkR, you can start using it by launching the SparkR shell with:

./sparkR

If you have installed it directly from GitHub, you can load the SparkR package and then initialize a SparkContext. For example, to run with a local Spark master, you can launch R and then run:

library(SparkR)
sc <- sparkR.init(master="local")

To increase the memory used by the driver, you can export the SPARK_MEM environment variable. For example, to use 1g, you can run:

SPARK_MEM=1g ./sparkR

In a cluster setting, to set the amount of memory used by the executors, you can pass spark.executor.memory in the sparkEnvir argument of sparkR.init when creating the SparkContext:

library(SparkR)
sc <- sparkR.init(master="spark://<master>:7077",
                  sparkEnvir=list(spark.executor.memory="1g"))

Examples, Unit tests

SparkR comes with several sample programs in the examples directory. To run one of them, use ./sparkR <filename> <args>. For example:

./sparkR examples/pi.R local[2]

You can also run the unit tests for SparkR with:

./run-tests.sh

Running on EC2

Instructions for running SparkR on EC2 can be found in the SparkR wiki.
