This project sets up an automated NBA data lake on AWS to support sports analytics.
- Data Collection: Collects NBA data via SportsData.io and stores it in an AWS S3 bucket.
- Data Preparation: Prepares raw data stored in S3 for analytics using AWS Glue.
- Data Querying: Enables querying and analytics on the data using AWS Athena.
- Visualization: Visualizes the data using AWS QuickSight for intuitive dashboards and reports.
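As an illustration of the collection step, here is a minimal sketch of how a list of player records could be written to S3. It is not taken from `src/setup_nba_data_lake.py`. The helper names, bucket name, and object key are hypothetical, and the line-delimited JSON layout is an assumption, chosen because Athena's JSON SerDes expect one JSON object per line. The S3 client is passed in as an argument so the functions can be exercised without AWS credentials.

```python
import json


def to_line_delimited_json(records):
    """Serialize a list of dicts as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


def upload_records(s3_client, bucket, key, records):
    """Upload records to S3 as line-delimited JSON and return the record count.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    body = to_line_delimited_json(records)
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return len(records)


if __name__ == "__main__":
    import boto3  # imported here so the helpers above work without boto3

    # Placeholder records; in the project these would come from SportsData.io.
    sample = [
        {"FirstName": "Example", "LastName": "Player", "Position": "PG", "Team": "XXX"},
    ]
    # Hypothetical bucket and key; replace with your own.
    upload_records(boto3.client("s3"), "your-bucket-name", "raw-data/nba_player_data.json", sample)
```

Keeping the serialization separate from the upload makes it easy to inspect the exact bytes that Glue and Athena will read later.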
- API Key: Get an API key from https://sportsdata.io and save it in the .env file.
- AWS Permissions: Ensure the required AWS permissions are set up for S3, Glue, and Athena:
  - S3 Permissions:
    - s3:CreateBucket
    - s3:PutObject
    - s3:DeleteBucket
    - s3:ListBucket
  - Glue Permissions:
    - glue:CreateDatabase
    - glue:CreateTable
    - glue:DeleteDatabase
    - glue:DeleteTable
  - Athena Permissions:
    - athena:StartQueryExecution
    - athena:GetQueryResults
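The permissions listed above can be collected into a single IAM policy document. The following is a minimal sketch rather than the project's own policy: the `build_policy` helper, the `Sid` values, and the wildcard `Resource` are assumptions made for illustration, and in practice the resources would likely be narrowed to specific ARNs.

```python
import json

# Actions copied from the permissions list above, grouped by service prefix.
REQUIRED_ACTIONS = {
    "s3": ["CreateBucket", "PutObject", "DeleteBucket", "ListBucket"],
    "glue": ["CreateDatabase", "CreateTable", "DeleteDatabase", "DeleteTable"],
    "athena": ["StartQueryExecution", "GetQueryResults"],
}


def build_policy(actions_by_service):
    """Build an IAM policy document with one Allow statement per service."""
    statements = []
    for service, actions in actions_by_service.items():
        statements.append(
            {
                "Sid": f"{service.capitalize()}Access",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": [f"{service}:{action}" for action in actions],
                "Resource": "*",  # assumption: restrict to specific ARNs in practice
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Print JSON that can be pasted into the IAM console's policy editor.
    print(json.dumps(build_policy(REQUIRED_ACTIONS), indent=2))
```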
- Log in to your AWS Console, open CloudShell, and create a file with the following command:
  nano setup_nba_data_lake.py
- Paste your Python script from src/setup_nba_data_lake.py.
- Create a .env file and paste your .env details:
  nano .env
- After saving the two files, run the script:
  python3 setup_nba_data_lake.py
- Confirm that the S3 bucket was created.
- Go to Athena, select the correct Glue database, and run the following query:
  SELECT FirstName, LastName, Position, Team FROM nba_players WHERE Position = 'PG';
- To visualize data from Athena using AWS QuickSight, open a standard account and create a new analysis.
- Select New dataset and select Athena as the source.
- Select the database and proceed to visualize the data.
- You can also use an SQL query to create a specific dataset and then visualize the results.
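The Athena query from the steps above can also be run from Python instead of the console. The following is a minimal sketch using the boto3 Athena client calls `start_query_execution`, `get_query_execution`, and `get_query_results`. The `run_query` helper, the database name, and the results location are hypothetical placeholders, and the sketch reads only the first page of results. Note that polling with `get_query_execution` needs the `athena:GetQueryExecution` permission in addition to the ones listed earlier.

```python
import time

QUERY = "SELECT FirstName, LastName, Position, Team FROM nba_players WHERE Position = 'PG';"


def run_query(athena, query, database, output_location, poll_seconds=1.0):
    """Run a query on Athena and return the first page of rows as dicts.

    `athena` is expected to behave like boto3.client("athena").
    """
    started = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    # For SELECT queries, the first row of the first page holds the column names.
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, [cell.get("VarCharValue") for cell in row["Data"]]))
        for row in rows[1:]
    ]


if __name__ == "__main__":
    import boto3  # imported here so run_query can be used with any compatible client

    # Hypothetical database name and results location; replace with your own.
    for player in run_query(
        boto3.client("athena"),
        QUERY,
        database="your_glue_database",
        output_location="s3://your-bucket-name/athena-results/",
    ):
        print(player)
```

Passing the client in as an argument keeps the polling logic independent of credentials and region configuration, which stay in the `__main__` block.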
For more detailed documentation of the project, visit https://medium.com/@violasangut/creating-data-lake-for-nba-analytics-using-amazon-s3-aws-glue-and-amazon-athena-39710e4ca523