dataflow is a specialized issue tracker designed to streamline and enhance your data science and data analysis projects. The platform offers a unique approach to project management through the concept of flows, while also providing an array of additional features tailored to empower your data-related tasks. Whether you're a data scientist, analyst, or enthusiast, dataflow is here to optimize your workflow.
Please note that current development is focused on the backend, core architecture, and internal developer tooling, so a frontend will not be released in the near future. As such, this repository documents application architecture, APIs, and other non-user-facing concepts.
- Clone the repository and install backend dependencies:
git clone https://github.com/RyanHUNGry/dataflow.git && cd ./dataflow/backend && npm install
- Create an environment variables file inside the backend directory:
touch .env
- Fill out the following environment variables inside .env (a sketch of how these values might be loaded follows these steps):
NODE_ENV=... # development, test, or production
PG_DEV_DATABASE=... # development database
PG_DEV_USERNAME=...
PG_DEV_PASSWORD=...
PG_DEV_HOST=...
DEV_PORT=...
DEV_JWT_SECRET=...
AWS_PUBLIC_KEY=...
AWS_SECRET_KEY=...
- Start a server on the default http://localhost:8000:
npm start
# run nodemon process for development
npm run watch
- Ping the API with a tool such as Postman
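As a minimal sketch, assuming the common dotenv package (not specified in this repository), the values listed above could be loaded at startup like this:

```js
// config.js: illustrative sketch only; assumes the dotenv package loads .env
require('dotenv').config();

// NODE_ENV selects between development, test, and production settings
const env = process.env.NODE_ENV || 'development';

// Development values taken from the .env listing above
const port = process.env.DEV_PORT || 8000;
const jwtSecret = process.env.DEV_JWT_SECRET;

module.exports = { env, port, jwtSecret };
```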
dataflow has a traditional three-environment setup, using environment variables to dictate development, test, and production settings.
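As a rough sketch of that idea (the project's actual configuration module is not documented here, and the test and production variable names are assumptions), per-environment settings might be selected like this:

```js
// environments.js: hypothetical sketch of settings keyed by NODE_ENV;
// PG_TEST_* and PG_PROD_* names are assumed, only PG_DEV_* appear above
const settings = {
  development: { database: process.env.PG_DEV_DATABASE, port: process.env.DEV_PORT },
  test:        { database: process.env.PG_TEST_DATABASE, port: process.env.TEST_PORT },
  production:  { database: process.env.PG_PROD_DATABASE, port: process.env.PROD_PORT },
};

// Fall back to development when NODE_ENV is unset
module.exports = settings[process.env.NODE_ENV] || settings.development;
```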
dataflow uses AWS RDS PostgreSQL instances for data storage. The instance contains three databases for development, test, and production, and dataflow connects to it over the PostgreSQL connection protocol with SSL encryption.
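For illustration, assuming Knex's standard knexfile layout, the development connection could look like the sketch below; the SSL option shown is one common way to require encrypted connections to RDS, not necessarily the project's exact setting:

```js
// knexfile.js: illustrative only; uses the PG_DEV_* variables listed above
module.exports = {
  development: {
    client: 'pg',
    connection: {
      host: process.env.PG_DEV_HOST,
      database: process.env.PG_DEV_DATABASE,
      user: process.env.PG_DEV_USERNAME,
      password: process.env.PG_DEV_PASSWORD,
      ssl: { rejectUnauthorized: false }, // SSL-encrypted connection to RDS
    },
  },
  // test and production entries would follow the same shape
};
```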
dataflow uses AWS S3 buckets to store the datasets attached to a flow, as well as summary statistics for each dataset. There are three folders within each bucket for development, test, and production.
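A minimal sketch of that layout, assuming the AWS SDK for JavaScript (the bucket name, key structure, and helper below are hypothetical):

```js
// s3.js: hypothetical sketch; bucket name and key layout are assumptions
const AWS = require('aws-sdk');

const s3 = new AWS.S3({
  accessKeyId: process.env.AWS_PUBLIC_KEY,
  secretAccessKey: process.env.AWS_SECRET_KEY,
});

// Datasets live under an environment-specific folder prefix,
// e.g. development/<flowId>/<fileName>
function uploadDataset(flowId, fileName, body) {
  const key = `${process.env.NODE_ENV}/${flowId}/${fileName}`;
  return s3.putObject({ Bucket: 'dataflow-datasets', Key: key, Body: body }).promise();
}

module.exports = { uploadDataset };
```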
To compute summary statistics for datasets uploaded to S3, AWS Lambda runs a Python script that uses Pandas and writes the results to a second bucket. Lambda infers the environment from the object's folder prefix.
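The Lambda itself is a Python/Pandas script; purely to illustrate the folder-prefix convention, the JavaScript sketch below (not the actual implementation) shows how the environment could be derived from an incoming S3 object key:

```js
// Illustrative only: the real function is a Python script using Pandas.
exports.handler = async (event) => {
  // e.g. "development/42/data.csv"
  const key = decodeURIComponent(event.Records[0].s3.object.key);
  const environment = key.split('/')[0]; // "development", "test", or "production"

  // Summary statistics would be computed here and written to the second
  // bucket under the same environment prefix.
  return { environment, key };
};
```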
The dataflow API is powered by Node.js and Express.js. Passport.js provides authentication middleware using JWTs, and Knex.js is used as a query builder against the AWS RDS PostgreSQL databases. The NODE_ENV environment variable configures how the API connects to external services. The API listens on port 8000 by default.
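As a minimal sketch of how these pieces typically fit together (route, table, and payload names are hypothetical, not the project's documented API):

```js
// app.js: hypothetical sketch of the Express/Passport/Knex stack described above
const express = require('express');
const passport = require('passport');
const { Strategy: JwtStrategy, ExtractJwt } = require('passport-jwt');
const knex = require('knex')(require('./knexfile')[process.env.NODE_ENV || 'development']);

passport.use(new JwtStrategy(
  {
    jwtFromRequest: ExtractJwt.fromAuthHeaderAsBearerToken(),
    secretOrKey: process.env.DEV_JWT_SECRET,
  },
  async (payload, done) => {
    // Look up the authenticated user with the Knex query builder
    const user = await knex('users').where({ id: payload.id }).first();
    return done(null, user || false);
  }
));

const app = express();
app.use(express.json());
app.use(passport.initialize());

// Example protected route; the real route structure is not documented here
app.get('/flows', passport.authenticate('jwt', { session: false }), async (req, res) => {
  res.json(await knex('flows').where({ user_id: req.user.id }));
});

app.listen(process.env.DEV_PORT || 8000);
```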
The dataflow API comes with full unit and integration test suites. These tests should be run with NODE_ENV set to test so that the correct connections to external services are used. The tests are built on a Mocha, Chai, and Sinon stack.
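A unit test in that stack typically looks like the sketch below (the modules under test are hypothetical, carried over from the S3 sketch above):

```js
// test/flows.test.js: hypothetical Mocha/Chai/Sinon sketch;
// run with NODE_ENV=test so test connections to external services are used
const { expect } = require('chai');
const sinon = require('sinon');
const s3 = require('../src/s3'); // hypothetical module from the S3 sketch
const flows = require('../src/flows'); // hypothetical module under test

describe('flow dataset upload', () => {
  afterEach(() => sinon.restore());

  it('uploads the dataset to S3 exactly once', async () => {
    // Stub the S3 call so the unit test never touches AWS
    const upload = sinon.stub(s3, 'uploadDataset').resolves({});

    await flows.attachDataset(1, 'data.csv', 'col1,col2\n1,2');

    expect(upload.calledOnceWith(1, 'data.csv')).to.equal(true);
  });
});
```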
Like most REST APIs, the dataflow API is stateless, so containerizing the application only requires installing the application itself and connecting to external services with the proper environment variables and credentials.
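A minimal containerization sketch, assuming the standard Node.js base image (the published Docker Hub image may be built differently):

```dockerfile
# Illustrative Dockerfile sketch, not necessarily the published image's build
FROM node:18-alpine
WORKDIR /app

COPY package*.json ./
RUN npm install

COPY . .

# Environment variables and credentials are supplied at run time,
# e.g. docker run --env-file .env -p 8000:8000 dataflow
EXPOSE 8000
CMD ["npm", "start"]
```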
WIP
- Production application: Docker Hub
- Production API: http://54.215.249.98:8000/