This project demonstrates a real-time end-to-end (E2E) data pipeline designed to handle clickstream data. It shows how to ingest, process, store, query, and visualize streaming data using open-source tools, all containerized with Docker for easy deployment.
Technologies Used:
- Data Ingestion: Apache Kafka
- Stream Processing: Apache Flink
- Object Storage: MinIO (S3-compatible)
- Data Lake Table Format: Apache Iceberg
- Query Engine: Trino
- Visualization: Apache Superset
This pipeline is perfect for data engineers and students interested in learning how to design real-time data systems.
- Clickstream Data Generator simulates real-time user events and pushes them to a Kafka topic.
- Apache Flink processes Kafka streams and writes clean data to Iceberg tables stored on MinIO.
- Trino connects to Iceberg for querying the processed data.
- Apache Superset visualizes the data by connecting to Trino.
Component | Technology | Purpose |
---|---|---|
Data Generator | Python (Faker) | Simulate clickstream events |
Data Ingestion | Apache Kafka | Real-time event streaming |
Coordination Service | Apache ZooKeeper | Kafka broker coordination and metadata management |
Stream Processing | Apache Flink | Real-time data processing and transformation |
Data Lake Storage | Apache Iceberg | Data storage and schema management |
Object Storage | MinIO | S3-compatible storage for Iceberg tables |
Query Engine | Trino | Distributed SQL querying on Iceberg data |
Visualization | Apache Superset | Interactive dashboards and data visualization |
real-time-data-pipeline/
├── docker-compose.yml   # Docker setup for all services
├── flink/               # Flink SQL client and streaming jobs
├── producer/            # Clickstream data producer using Faker
├── superset/            # Superset setup and configuration
└── trino/               # Trino configuration for Iceberg
- Docker and Docker Compose installed.
- Minimum 16GB RAM recommended.
git clone https://github.com/abeltavares/real-time-data-pipeline.git
cd real-time-data-pipeline
docker-compose up -d
Service | URL | Credentials |
---|---|---|
Kafka Control Center | http://localhost:9021 | No Auth |
Flink Dashboard | http://localhost:18081 | No Auth |
MinIO Console | http://localhost:9001 | admin / password |
Trino UI | http://localhost:8080/ui | No Auth |
Superset | http://localhost:8088 | admin / admin |
Clickstream events are simulated using Python's Faker library. Here's the event structure:
{
"event_id": fake.uuid4(),
"user_id": fake.uuid4(),
"event_type": fake.random_element(elements=("page_view", "add_to_cart", "purchase", "logout")),
"url": fake.uri_path(),
"session_id": fake.uuid4(),
"device": fake.random_element(elements=("mobile", "desktop", "tablet")),
"timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
"geo_location": {
"lat": float(fake.latitude()),
"lon": float(fake.longitude())
},
"purchase_amount": float(random.uniform(0.0, 500.0)) if fake.boolean(chance_of_getting_true=30) else None
}
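The producer service pushes events with this structure to Kafka. As a rough illustration of that step (not the exact code in producer/), here is a minimal sketch using the kafka-python library with a trimmed-down event; the topic name clickstream and the broker address localhost:9092 are assumptions, so adjust them to match the Docker Compose setup:

```python
import json
import random
import time

from faker import Faker
from kafka import KafkaProducer  # assumed client library; producer/ may use a different one

fake = Faker()

# Broker address and topic name are assumptions for this sketch.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # A trimmed-down version of the event structure shown above.
    event = {
        "event_id": fake.uuid4(),
        "user_id": fake.uuid4(),
        "event_type": fake.random_element(elements=("page_view", "add_to_cart", "purchase", "logout")),
        "url": fake.uri_path(),
        "device": fake.random_element(elements=("mobile", "desktop", "tablet")),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "purchase_amount": float(random.uniform(0.0, 500.0)) if fake.boolean(chance_of_getting_true=30) else None,
    }
    producer.send("clickstream", event)  # hypothetical topic name
    producer.flush()
    time.sleep(1)
```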
Apache Flink consumes the clickstream events from the Kafka topic for real-time processing.
You can monitor the Kafka topic through the Kafka Control Center:
- Kafka Control Center URL: http://localhost:9021
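Besides the Control Center UI, you can sanity-check that events are landing in the topic from the host. A minimal sketch with the kafka-python library, again assuming a clickstream topic and a broker reachable at localhost:9092:

```python
import json

from kafka import KafkaConsumer  # assumed library for this quick check

# Topic name and broker address are assumptions; match them to your compose file.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    print(message.value)  # prints each clickstream event as a dict
```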
- State Backend: RocksDB
- Checkpointing: Enabled for fault tolerance
- Connectors: Kafka → Iceberg (via Flink SQL)
The `sql-client` service in Docker Compose automatically submits the Flink SQL job once the JobManager and TaskManager are running. It uses the `clickstream-filtering.sql` script to process Kafka streams and write to Iceberg:
/opt/flink/bin/sql-client.sh -f /opt/flink/clickstream-filtering.sql
Monitor real-time data processing jobs at:
http://localhost:18081
Processed data from Flink is stored in Iceberg tables on MinIO. This enables:
- Efficient Querying with Trino
- Schema Evolution and Time Travel
To list the contents of the MinIO warehouse, you can use the following command:
docker exec mc bash -c "mc ls -r minio/warehouse/"
Alternatively, you can access the MinIO console via the web at http://localhost:9001.
- Username: admin
- Password: password
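Because MinIO is S3-compatible, the warehouse can also be browsed programmatically. Here is a minimal sketch with boto3; the API endpoint port 9000 is an assumption (the console on 9001 is a separate port), and the credentials are the defaults listed above:

```python
import boto3

# Port 9000 is assumed for the S3 API; the web console runs on 9001.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="admin",
    aws_secret_access_key="password",
)

# List objects under the Iceberg warehouse bucket (data and metadata files).
for obj in s3.list_objects_v2(Bucket="warehouse").get("Contents", []):
    print(obj["Key"])
```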
1. Run Trino CLI
docker-compose exec trino trino
2. Connect to Iceberg Catalog
USE iceberg.db;
3. Query Processed Data
SELECT * FROM iceberg.db.clickstream_sink
WHERE purchase_amount > 100
LIMIT 10;
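The same data can be queried programmatically from the host with the Trino Python client. A minimal sketch, assuming the trino package is installed and using the catalog, schema, and table names shown above:

```python
import trino

# Connect to the Trino coordinator exposed on localhost:8080.
conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="trino",
    catalog="iceberg",
    schema="db",
)

cur = conn.cursor()
cur.execute(
    """
    SELECT event_type,
           count(*)             AS events,
           avg(purchase_amount) AS avg_purchase
    FROM clickstream_sink
    GROUP BY event_type
    ORDER BY events DESC
    """
)
for row in cur.fetchall():
    print(row)
```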
- Access Superset: http://localhost:8088
  - Username: admin
  - Password: admin
- Connect Superset to Trino:
  - SQLAlchemy URI: trino://trino@trino:8080/iceberg/db
- Configure in Superset:
  - Open http://localhost:8088
  - Go to Data → Databases → +
  - Use the above SQLAlchemy URI.
- Create dashboards and charts on top of the Iceberg data exposed through Trino.
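Before wiring up Superset, the SQLAlchemy URI can be sanity-checked outside the UI. A minimal sketch, assuming the trino package is installed with its SQLAlchemy extra and run from the host (so the hostname becomes localhost instead of the in-network trino):

```python
from sqlalchemy import create_engine, text

# Same URI shape as in Superset, but pointed at the host-mapped port.
# Requires: pip install "trino[sqlalchemy]"
engine = create_engine("trino://trino@localhost:8080/iceberg/db")

with engine.connect() as conn:
    count = conn.execute(text("SELECT count(*) FROM clickstream_sink")).scalar()
    print(f"clickstream_sink rows: {count}")
```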
- Stream processing with Apache Flink.
- Clickstream events are transformed and filtered in real time.
- Data is stored in Apache Iceberg tables on S3-compatible MinIO storage, supporting schema evolution and time travel.
- Trino provides fast, distributed SQL queries on Iceberg data.
- Apache Superset delivers real-time visual analytics.
- Simplified deployment using Docker and Docker Compose for seamless integration across all services.
- Implement alerting and monitoring with Grafana and Prometheus.
- Introduce machine learning pipelines for predictive analytics.
- Optimize Iceberg partitioning for faster queries.
Action | Command |
---|---|
Start Services | docker-compose up --build -d |
Stop Services | docker-compose down |
View Running Containers | docker ps |
Check Logs | docker-compose logs -f |
Rebuild Containers | docker-compose up --build --force-recreate -d |
Contributions are welcome! Feel free to submit issues or pull requests to improve this project.
This project is licensed under the MIT License.
Enjoy exploring real-time data pipelines!