This repository is a fork of Nicolas Roux's package. You can see the base package at this link:
Ref: https://gitlab.com/nicolas217/modern-open-source-data-stack
- Airflow
URL: https://localhost:8080
Username: airflow
Password: airflow
- Airbyte
URL: https://localhost:8000
Username: airbyte
Password: password
- URL: https://localhost:8088
Username: admin
Password: admin
Cartelis is an operational data & CRM consulting firm. We improve marketing ROI and streamline the customer's digital journey. We work with our clients to first design and implement martech solutions and then help them capitalize on their data to prioritize impactful marketing actions.
Designed to be self-sufficient on a local or remote machine (limited to Ubuntu 18.04 as of 30/06/2021), this project brings together best-of-breed open-source frameworks in order to provide automated data migration, hosting, and transformation features in a single consolidated and fully customizable tool. Specifically, it currently puts to use:
- Airflow: to orchestrate and automate the entire data pipeline
- Airbyte: to integrate data, relying on a growing set of connectors
- DBT: to transform data hosted in a data warehouse with models based largely on SQL syntax
- Postgresql: to host and transform data
An objective of this project is to avoid version conflicts due to the rapidly evolving nature of these open-source frameworks, so the tool is designed to rely on specific versions of each software package:
- Airflow: 2.7.3
- Airbyte: 0.50.38
- DBT: 1.7.4
- PostgreSQL: 13
- Clone the repository
- Enter the repository:
cd data-integration
- Run
./setup.sh
Once the build has terminated:
- List the running docker containers:
sudo docker ps
- Find the container named airflow-worker and run:
sudo docker exec -it <container ID> /bin/bash
Inside the docker container:
- Check if dbt is installed (not fully working as of 19/12/2023):
dbt --version
- If an error is returned, install dbt and the Postgres adapter:
pip install dbt-core==1.7.4 dbt-postgres==1.7.4
- Airflow webserver UI: localhost:8080
- Airbyte webserver UI: localhost:8000
- Postgresql server: localhost:5400
- Postgresql server: localhost:5399
- docker exec -it <postgres container ID> psql -U airflow
- Useful commands:
\l : list databases
\c <database_name> : connect to a database
\dt : view database relations
TABLE <table_name> : view table data
SELECT * FROM <table_name> : view table data
In the docker folder:
- Stop and remove containers:
docker-compose down -v
- Stop containers:
docker-compose stop
- Run containers:
docker-compose up -d
- Stop and delete containers, volumes, and images:
docker-compose down --volumes --rmi all
- Access the docker container:
sudo docker exec -it <airflow worker> /bin/bash
- Install vim:
apt-get update
apt-get install vim
- Check which dbt version is installed:
dbt --version
- Check which dbt python packages are installed:
pip list
- If the required packages are not all present, install them:
pip install dbt-core==1.7.4 dbt-postgres==1.7.4
In the docker folder:
- Create a local dbt project in the dbt folder:
dbt init <project_name>
- Enter the dbt project:
cd <project_name>
- Edit the dbt_project.yml file:
vim dbt_project.yml
- Replace "my_new_project", "default", and the my_new_project key under models with <project_name>
- Navigate to the .dbt folder (/root/.dbt) and edit the profiles.yml file to set up a Postgresql connection (the host is the host machine's IP address and the port number is 5400); a sketch follows the reference link below.
Ref: https://docs.getdbt.com/reference/warehouse-profiles/postgres-profile
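For reference, a minimal profiles.yml sketch for a Postgresql connection might look like the following; the profile name must match the profile value in dbt_project.yml, and the host IP, credentials, database name, and schema shown here are placeholders, not values defined by this repository:

```yaml
# /root/.dbt/profiles.yml -- minimal sketch with placeholder values
<project_name>:            # must match the "profile" set in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: postgres
      host: 192.168.1.10   # host machine IP address (placeholder)
      port: 5400           # Postgresql port exposed by the stack
      user: airflow        # placeholder credentials
      password: airflow
      dbname: airflow      # placeholder database name
      schema: public       # placeholder schema
      threads: 4
```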
Airflow's purpose in this implementation is to orchestrate Airbyte, DBT, and Postgresql tasks in order to provide as close to full automation as possible in this data pipeline.
DAGs written to run integration tasks between external sources and the local Postgresql server make use of the Airbyte operators provided by the apache-airflow-providers-airbyte Python package.
Airbyte-Airflow connection: set up in Airflow under Admin > Connections, using the HTTP connection type.
The Airbyte task needs the Airbyte connection ID, which can be found in the connection's URL in the Airbyte UI.
Connect Airflow to Airbyte with a new connection.
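As an illustrative sketch (not code from this repository), a minimal DAG using the provider's AirbyteTriggerSyncOperator might look like this; the airbyte_conn Airflow connection ID, the DAG ID, and the UUID are placeholders to replace with your own values:

```python
# Minimal sketch of a DAG that triggers an Airbyte sync via the
# apache-airflow-providers-airbyte package. IDs below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="airbyte_sync_example",          # placeholder DAG ID
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    trigger_sync = AirbyteTriggerSyncOperator(
        task_id="trigger_airbyte_sync",
        airbyte_conn_id="airbyte_conn",      # HTTP connection defined in Airflow Admin > Connections
        connection_id="00000000-0000-0000-0000-000000000000",  # Airbyte connection ID taken from its URL
        asynchronous=False,                  # wait for the sync to complete
        timeout=3600,
    )
```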
While it may be easier to run all DBT models with a single task in a DAG (Directed Acyclic Graph) file, this method runs the risk of performing excessive computational tasks despite a potential lack of updated data and is inflexible when performing debugging operations.
A better way is to create a DAG file that parses the manifest.json file and creates individual tasks for the identified nodes, which are then run or tested while taking their dependencies into account.
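A rough sketch of that approach is below, assuming the dbt project's target/manifest.json (produced by dbt compile or dbt run) is readable from the Airflow worker; the project path and DAG ID are placeholders:

```python
# Sketch: build one Airflow task per dbt model found in manifest.json,
# wiring model-to-model dependencies. Paths and IDs are placeholders.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/dbt/<project_name>"      # placeholder project path
MANIFEST = f"{DBT_DIR}/target/manifest.json"

with open(MANIFEST) as f:
    manifest = json.load(f)

with DAG(
    dag_id="dbt_manifest_example",               # placeholder DAG ID
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    tasks = {}

    # One "dbt run" task per model node in the manifest.
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] == "model":
            tasks[node_id] = BashOperator(
                task_id=node["name"],
                bash_command=f"dbt run --select {node['name']} --project-dir {DBT_DIR}",
            )

    # Recreate model-to-model dependencies from the manifest.
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue
        for upstream_id in node["depends_on"].get("nodes", []):
            if upstream_id in tasks:
                tasks[upstream_id] >> tasks[node_id]
```

Because each model becomes its own task, a single failed model can be retried or debugged on its own instead of re-running the entire project.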