GitHub - CoolDude53/TwitterAnalysis: A simple spark program to analyze tweet data.

TwitterAnalysis

This is a simple Spark program. It demonstrates how to take Tweet JSON files as input, map them to Tweet objects, analyze them, and then collect the results and write them to output. The implementation is built specifically for AWS (S3 and EMR): it reads its input from, and writes its output to, an S3 bucket. It is meant to be run on an EMR cluster, the setup of which is discussed in more depth below.
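The flow described above (JSON lines in, Tweet objects, an analysis, collected results out) can be sketched in plain Java 8. This is not the project's code and it does not use Spark: Java streams stand in for Spark's distributed map and aggregate operations so that the example runs on its own. The `Tweet` class, its `lang` and `text` fields, the per-language count, and the regex-based field extraction are all assumptions made for illustration; a real job would rely on a proper JSON parser or Spark's JSON reader.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch: plain Java 8 streams standing in for Spark transformations.
class TweetPipelineSketch {

    // Hypothetical Tweet model; the project's own Tweet class may hold other fields.
    static final class Tweet {
        final String lang;
        final String text;

        Tweet(String lang, String text) {
            this.lang = lang;
            this.text = text;
        }
    }

    // Naive extraction of a string field from flat, single-line JSON without
    // escaped quotes. This is a stand-in for a real JSON parser.
    static String field(String json, String name) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    // Map step: one JSON line becomes one Tweet, or null if a field is missing.
    static Tweet parse(String json) {
        String lang = field(json, "lang");
        String text = field(json, "text");
        return (lang == null || text == null) ? null : new Tweet(lang, text);
    }

    // Analyze and collect steps: count tweets per language, sorted by language code.
    static Map<String, Long> countByLanguage(List<String> jsonLines) {
        return jsonLines.stream()
                .map(TweetPipelineSketch::parse)
                .filter(Objects::nonNull) // skip lines that could not be parsed
                .collect(Collectors.groupingBy(t -> t.lang, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "{\"lang\":\"en\",\"text\":\"hello\"}",
                "{\"lang\":\"es\",\"text\":\"hola\"}",
                "{\"lang\":\"en\",\"text\":\"good morning\"}");
        // Write step: the real program writes to S3; here the result goes to stdout.
        countByLanguage(input).forEach((lang, n) -> System.out.println(lang + "\t" + n));
    }
}
```

In a Spark job the same shape would typically be expressed with distributed transformations over the input files rather than an in-memory list, with the final write targeting the S3 output path.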

Installation and Usage

1. Clone this repo.
2. (Optional) To make edits, open the Maven project in any IDE (Java 8 required!).
3. Build the Maven project into a jar file. As long as Spark is provided on EMR, you do not need to include the extracted Spark libraries in the jar.
4. Upload the jar to an S3 bucket and save its path for later.
5. Ensure the input JSON files are located in the input path and that the output directory has been created in the output path, as specified in the application.properties file.
6. Deploy an EMR cluster.
7. Add a step to the cluster with the following specifications:
   - Step type: choose Spark Application
   - Deploy mode = cluster
   - Spark submit args: --class pickle.plaza.TwitterMain (the main class of the application; pickle.plaza is the package containing TwitterMain)
   - Spark application location = the S3 path saved when uploading the jar (e.g. s3://twitter-redshift-json/apps/twitteranalysis.jar)
8. Let the application run. You can view its progress if you are SSH'd into the EMR cluster and proxied.
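The step that checks the input and output paths depends on what application.properties contains. The README does not show that file, so the sketch below is only an illustration of how such a file could be read and validated with `java.util.Properties` before submitting the job. The key names `input.path` and `output.path`, the class name `JobPathsSketch`, and the bucket `example-bucket` are hypothetical; check the actual keys in the project's application.properties.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: reads two assumed keys from a properties source.
class JobPathsSketch {
    final String inputPath;
    final String outputPath;

    JobPathsSketch(String inputPath, String outputPath) {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
    }

    // Loads the properties and fails fast if either assumed key is missing.
    static JobPathsSketch load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new JobPathsSketch(require(props, "input.path"), require(props, "output.path"));
    }

    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        // Placeholder content; a real run would read the project's application.properties.
        String example = "input.path=s3://example-bucket/input/\n"
                + "output.path=s3://example-bucket/output/\n";
        JobPathsSketch paths = load(new StringReader(example));
        System.out.println("input:  " + paths.inputPath);
        System.out.println("output: " + paths.outputPath);
    }
}
```

Failing fast on a missing key is a design choice for this sketch: a misconfigured path is cheaper to catch before the EMR step starts than after the cluster has been provisioned.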
