- Dockerized Search Engine GUI
- Docker to Cluster Communication
- MapReduce Inverted Index Implementation
- Search Term and Top-N Search Implementation
- The way my application copies files to the input directory relies on the following structure, obtained by uncompressing the provided data files.
- The application also assumes that the InvertedIndex.jar file is uploaded to the root directory of the bucket. To do this, you can use either the `gcloud` command-line utility or the Google Cloud Console. The InvertedIndex.jar can be found here.
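For example, assuming the bucket name configured below (`cloud-comp-final-proj-data`) and that the Cloud SDK's `gsutil` utility is installed and authenticated, the upload could be done from the terminal like this:

```shell
# Copy the job JAR to the bucket root; swap in your own bucket name
gsutil cp InvertedIndex.jar gs://cloud-comp-final-proj-data/

# Verify it landed at the root of the bucket
gsutil ls gs://cloud-comp-final-proj-data/InvertedIndex.jar
```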
My application relies on the following environment variables, which are already configured in the Dockerfile:
- DISPLAY=10.0.0.242:0
- The address of my en0 inet network adapter, which socat and Xquartz use to display the containerized GUI.
- To find this address, run `ifconfig en0 | grep inet | awk '$1=="inet" {print $2}'` from the terminal and append the display number `:0` to the result.
- FILE_LIST_PATH=frontend/src/main/resources/srcFiles.txt
- Path to a hard-coded list of the input files used for input selection
- GOOGLE_APPLICATION_CREDENTIALS=frontend/src/main/resources/credentials/credentials.json
- Path to the Google Authentication Credentials JSON file for a service account
- PROJECT_ID=cloud-comp-dhfs-cluster
- The GCP Project ID (and as I'm writing this I see the typo, but that can't be changed now)
- REGION=us-central1
- The GCP Project Region
- CLUSTER_NAME=cloud-comp-final-proj-cluster
- The GCP cluster name
- BUCKET_NAME=cloud-comp-final-proj-data
- The GCP Storage Bucket Name
NOTE: The application depends on the /assets, /input, and /output directories in the following variables to ensure that it does not read results from previous jobs. When editing these, change only the bucket name so the data goes to the proper place.
- BUCKET_ASSET_PATH=cloud-comp-final-proj-data/assets
- The path to the 'assets' dir on the bucket, as seen in the Directory Structure section above
- JOB_INPUT_DIR=cloud-comp-final-proj-data/input
- The path to the InvertedIndex job input dir on the bucket
- JOB_OUTPUT_DIR=cloud-comp-final-proj-data/output
- The path to the InvertedIndex job output dir on the bucket
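Taken together, the three path variables above resolve to the following `gs://` URIs, which you can sanity-check with `gsutil` (assuming it is installed and only the bucket name has been swapped for your own):

```shell
# List the three directories the application reads from and writes to
gsutil ls gs://cloud-comp-final-proj-data/assets
gsutil ls gs://cloud-comp-final-proj-data/input
gsutil ls gs://cloud-comp-final-proj-data/output
```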
NOTE: The SearchEngineGUI.jar by itself will not run from the command line since it depends on these environment variables, which my IDE injects during development. The GUI will run from the Docker image, since the variables are configured there. To get my application running on your own GCP cluster, change the ENV settings in the Dockerfile to point to your cluster, credentials path, etc. Also, the Maven build copies my credentials into the JAR, so I have not uploaded that in this repo.
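As an alternative to editing the Dockerfile, the same variables can be overridden at container start-up with Docker's standard `-e` flag. This is a sketch only; the project ID, cluster, and bucket names below are placeholders for your own values:

```shell
# Compute the host address Xquartz listens on and append the display number
HOST_IP=$(ifconfig en0 | grep inet | awk '$1=="inet" {print $2}')

# Override the baked-in ENV values at run time (placeholder values; use your own)
docker run -it --rm \
  -e DISPLAY="$HOST_IP:0" \
  -e PROJECT_ID=your-project-id \
  -e CLUSTER_NAME=your-cluster-name \
  -e BUCKET_NAME=your-bucket-name \
  stevenmonty/gui:latest
```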
Requirements:
- Java 8
- Apache Maven 3.6.3
- Docker: I have Docker Desktop for Mac Version 2.3.0.1 (46911)
- Docker for Mac requires socat and Xquartz to render a GUI application from within a container
- Homebrew for macOS package management
- To install socat using Homebrew, run `brew install socat`.
- To install Xquartz using Homebrew, run `brew install xquartz`.
- Launch socat:
socat TCP-LISTEN:6000,reuseaddr,fork UNIX-CLIENT:\"$DISPLAY\"
- Launch Xquartz:
open -a Xquartz
- In the Xquartz window that opens, navigate to the Security tab and check "Allow Connections from Network Clients".
- Clone this repository:
git clone https://github.com/StevenMonty/MapReduceSearchEngine.git
- Change directory into the cloned repo
- Add your GCP Credentials JSON file to frontend/src/main/resources/credentials and make sure to update its path in the Dockerfile
- Build and run the GUI:
cd frontend
mvn install
mvn package
docker build --rm -t stevenmonty/gui .
docker run -it --rm stevenmonty/gui:latest
All of my application logic leverages the InvertedIndex results. I use the results to perform both the TopN and SearchTerm operations.
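As an illustration of how those results can be consumed, here is a small shell sketch. It assumes each line of the reducer output takes the form `word<TAB>doc:count doc:count ...` (the actual format depends on the InvertedIndex reducer); it looks up a single term, then ranks words by total occurrences for a TopN query:

```shell
# Sample reducer output; the word<TAB>doc:count format is an assumption
printf 'cloud\tdoc1.txt:4 doc3.txt:1\nhadoop\tdoc2.txt:7\nsearch\tdoc1.txt:2 doc2.txt:3 doc3.txt:5\n' > part-r-00000

# SearchTerm: print the full postings line for one word
awk -F'\t' '$1 == "search"' part-r-00000

# TopN (N=2): sum each word's counts across documents, sort descending
awk -F'\t' '{
  total = 0
  n = split($2, postings, " ")
  for (i = 1; i <= n; i++) {
    split(postings[i], kv, ":")
    total += kv[2]
  }
  print total, $1
}' part-r-00000 | sort -rn | head -n 2
# -> 10 search
# -> 7 hadoop
```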
Most Common English Words used to construct StopWord list
Using HDFS from the Java Client Library
Google Source Code for Submitting Hadoop Jobs