# CI/CD Pipeline for Data Processing Infrastructure

A comprehensive CI/CD pipeline for deploying and managing distributed data processing systems, with a focus on web scraping, data streaming, and database management.

## Features
- Scrapy spiders with auto-scaling capabilities
- Kafka streaming for real-time data processing
- MongoDB for raw data storage
- PostgreSQL for processed/analyzed data
- Redis for caching and job queues
- API endpoints for data access and monitoring
- Automated testing for data pipelines
- Docker containerization with volume management
- Kubernetes orchestration for distributed systems
- Multiple environment configurations
- Database migration automation
- Monitoring for data quality and pipeline health
- Backup and recovery procedures
## Project Structure

```
data-pipeline-deployment/
├── scraping/
│   ├── spiders/
│   │   ├── base_spider.py
│   │   └── specific_spiders/
│   ├── middlewares/
│   ├── pipelines/
│   └── settings/
├── streaming/
│   ├── kafka_producers/
│   ├── kafka_consumers/
│   └── stream_processors/
├── storage/
│   ├── mongodb/
│   │   ├── schemas/
│   │   └── indexes/
│   ├── postgresql/
│   │   ├── migrations/
│   │   └── models/
│   └── redis/
│       └── cache_configs/
├── api/
│   ├── endpoints/
│   ├── models/
│   └── services/
├── pipeline/
│   ├── github_actions/
│   │   ├── test_pipeline.yml
│   │   └── deploy_pipeline.yml
│   └── scripts/
│       ├── health_checks.sh
│       └── rollback.sh
├── kubernetes/
│   ├── scrapy/
│   │   ├── deployment.yaml
│   │   └── scaler.yaml
│   ├── kafka/
│   │   ├── statefulset.yaml
│   │   └── service.yaml
│   ├── mongodb/
│   ├── postgresql/
│   └── redis/
├── monitoring/
│   ├── prometheus/
│   │   └── scraping_metrics.yaml
│   ├── grafana/
│   │   └── dashboards/
│   │       ├── pipeline_health.json
│   │       ├── data_quality.json
│   │       └── system_metrics.json
│   └── alerts/
├── docker/
│   ├── scrapy/
│   ├── stream_processor/
│   └── api/
└── tests/
    ├── spiders/
    ├── processors/
    └── integration/
```
## Architecture

```mermaid
graph TB
    A[Scrapy Spiders] --> B[Kafka Topics]
    B --> C[Stream Processors]
    C --> D[MongoDB Raw Data]
    C --> E[PostgreSQL Processed Data]
    F[Redis Cache] --> G[API Layer]
    D --> G
    E --> G
```
## Prerequisites

- Python 3.8+
- Docker and Docker Compose
- Kubernetes cluster
- Kafka cluster
- MongoDB instance
- PostgreSQL database
- Redis instance
## Quick Start

- Clone and set up the project:

```bash
git clone https://github.com/username/data-pipeline
cd data-pipeline
python -m venv venv
source venv/bin/activate  # or .\venv\Scripts\activate on Windows
pip install -r requirements.txt
```

- Configure environments:

```bash
cp .env.example .env
# Edit .env with your configurations
```
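The same `.env` values are consumed locally and, via manifests and secrets, in Kubernetes. As a rough sketch, a small settings module can centralize access to them; only `KAFKA_BOOTSTRAP_SERVERS` and `MONGODB_URI` appear in the manifests later in this README, the other names and defaults are assumptions:

```python
# config.py (illustrative sketch, not part of the repository tree above)
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read .env from the project root if present

# Variable names mirror the Kubernetes env vars where possible; the rest are assumed.
KAFKA_BOOTSTRAP_SERVERS = os.getenv("KAFKA_BOOTSTRAP_SERVERS", "kafka:9092")
MONGODB_URI = os.getenv("MONGODB_URI", "mongodb://mongodb:27017")
POSTGRES_DSN = os.getenv("POSTGRES_DSN", "postgresql://postgres@postgresql:5432/processed")
REDIS_URL = os.getenv("REDIS_URL", "redis://redis:6379/0")
```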
## Configuration

### Scrapy

```python
# settings.py
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 1.0
ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'pipelines.KafkaPipeline': 100,
    'pipelines.MongoDBPipeline': 200,
}

KAFKA_PRODUCER_CONFIG = {
    'bootstrap.servers': 'kafka:9092',
    'client.id': 'scrapy-producer'
}
```
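`ITEM_PIPELINES` references `pipelines.KafkaPipeline` and `pipelines.MongoDBPipeline`, whose code is not shown here. A minimal sketch of the Kafka side, assuming the confluent-kafka client and the `KAFKA_PRODUCER_CONFIG` above, could look like this:

```python
# scraping/pipelines.py sketch (illustrative; the repository's actual pipeline
# classes are not shown in this README).
import json

from confluent_kafka import Producer


class KafkaPipeline:
    """Publish scraped items to a Kafka topic ("raw_data" is assumed here)."""

    def __init__(self, producer_config, topic="raw_data"):
        self.producer = Producer(producer_config)
        self.topic = topic

    @classmethod
    def from_crawler(cls, crawler):
        # Reuse the KAFKA_PRODUCER_CONFIG defined in settings.py
        return cls(crawler.settings.get("KAFKA_PRODUCER_CONFIG"))

    def process_item(self, item, spider):
        payload = json.dumps(dict(item)).encode("utf-8")  # assumes JSON-serializable fields
        self.producer.produce(self.topic, value=payload,
                              key=item.get("url", "").encode("utf-8"))
        self.producer.poll(0)  # serve delivery callbacks without blocking
        return item

    def close_spider(self, spider):
        self.producer.flush()
```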
### MongoDB

```yaml
# mongodb.yaml
mongodb:
  uri: mongodb://mongodb:27017
  database: raw_data
  collections:
    scraped_data:
      indexes:
        - keys:
            timestamp: -1
        - keys:
            url: 1
          unique: true
```
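These index definitions can be applied at startup. A sketch with pymongo, reusing the database and collection names from `mongodb.yaml`:

```python
# Ensure the indexes declared in mongodb.yaml exist (assumes pymongo).
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://mongodb:27017")
collection = client["raw_data"]["scraped_data"]

# Newest-first scans on timestamp, plus a unique index on url to deduplicate pages.
collection.create_index([("timestamp", DESCENDING)])
collection.create_index([("url", ASCENDING)], unique=True)
```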
### Kafka Topics

```yaml
# kafka-topics.yaml
topics:
  raw_data:
    partitions: 6
    replication_factor: 3
    configs:
      retention.ms: 604800000
  processed_data:
    partitions: 6
    replication_factor: 3
```
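If topics are created from application code rather than by cluster tooling, the same settings map directly onto confluent-kafka's admin client; a rough sketch:

```python
# Create the raw_data and processed_data topics programmatically
# (assumes confluent-kafka; many deployments create topics via tooling instead).
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka:9092"})

topics = [
    NewTopic("raw_data", num_partitions=6, replication_factor=3,
             config={"retention.ms": "604800000"}),
    NewTopic("processed_data", num_partitions=6, replication_factor=3),
]

# create_topics returns a dict of topic -> Future; wait on each to surface errors.
for name, future in admin.create_topics(topics).items():
    try:
        future.result()
        print(f"created {name}")
    except Exception as exc:
        print(f"{name}: {exc}")
```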
### PostgreSQL

```sql
-- init.sql
CREATE TABLE processed_data (
    id SERIAL PRIMARY KEY,
    source_id VARCHAR(255),
    processed_at TIMESTAMPTZ DEFAULT NOW(),
    data JSONB,
    metadata JSONB,
    CONSTRAINT unique_source UNIQUE (source_id)
);

CREATE INDEX idx_processed_data_metadata ON processed_data USING GIN (metadata);
```
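The `unique_source` constraint lets processors write idempotently via an upsert. An illustrative sketch with psycopg2 (the DSN and helper function are placeholders, not code from this repository):

```python
# Idempotent write into processed_data using the unique_source constraint (assumes psycopg2).
import psycopg2
from psycopg2.extras import Json

UPSERT_SQL = """
    INSERT INTO processed_data (source_id, data, metadata)
    VALUES (%s, %s, %s)
    ON CONFLICT (source_id) DO UPDATE
    SET data = EXCLUDED.data,
        metadata = EXCLUDED.metadata,
        processed_at = NOW();
"""

def store_processed(conn, source_id, data, metadata):
    """Insert or refresh a processed record keyed by source_id."""
    with conn.cursor() as cur:
        cur.execute(UPSERT_SQL, (source_id, Json(data), Json(metadata)))
    conn.commit()

conn = psycopg2.connect("postgresql://postgres@postgresql:5432/processed")  # placeholder DSN
store_processed(conn, "example-source-1", {"value": 42}, {"spider": "example_spider"})
```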
## Kubernetes Deployment

```yaml
# kubernetes/scrapy/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-spiders
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scrapy-spider
  template:
    metadata:
      labels:
        app: scrapy-spider
    spec:
      containers:
        - name: spider
          image: registry/spider:latest
          env:
            - name: KAFKA_BOOTSTRAP_SERVERS
              value: kafka:9092
          resources:
            limits:
              memory: 512Mi
              cpu: 500m
```
```yaml
# kubernetes/stream-processor/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stream-processor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: stream-processor
  template:
    metadata:
      labels:
        app: stream-processor
    spec:
      containers:
        - name: processor
          image: registry/processor:latest
          env:
            - name: MONGODB_URI
              valueFrom:
                secretKeyRef:
                  name: mongodb-credentials
                  key: uri
```
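The stream-processor code itself is not listed above. As an illustrative sketch (the topic, consumer group, and collection names come from the configs in this README; everything else is an assumption), a processor that consumes `raw_data` and upserts into MongoDB might look like:

```python
# streaming/processors/main.py sketch (assumes confluent-kafka and pymongo;
# PostgreSQL writes would reuse an upsert like the one shown earlier).
import json
import os

from confluent_kafka import Consumer
from pymongo import MongoClient

consumer = Consumer({
    "bootstrap.servers": os.getenv("KAFKA_BOOTSTRAP_SERVERS", "kafka:9092"),
    "group.id": "processor-group",        # matches the group checked in Troubleshooting
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["raw_data"])

raw_collection = MongoClient(
    os.getenv("MONGODB_URI", "mongodb://mongodb:27017")
)["raw_data"]["scraped_data"]

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue

        record = json.loads(msg.value())
        raw_collection.update_one(           # idempotent on url, per the unique index
            {"url": record.get("url")},
            {"$set": record},
            upsert=True,
        )
        # ...enrich/validate here, then write to PostgreSQL...
        consumer.commit(msg)
finally:
    consumer.close()
```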
## Monitoring

```yaml
# prometheus/pipeline-metrics.yaml
- job_name: 'scrapy-metrics'
  static_configs:
    - targets: ['scrapy:8000']
  metrics_path: '/metrics'
  scrape_interval: 30s
```
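This scrape job expects each spider pod to serve `/metrics` on port 8000. One way to do that (a sketch, not the repository's actual exporter) is a small Scrapy extension built on prometheus_client; note that recent prometheus_client versions expose counters with a `_total` suffix, so the exposed series name may differ slightly from the dashboard query below.

```python
# Sketch of a Scrapy extension exposing Prometheus metrics on :8000
# (assumes prometheus_client; register the class via EXTENSIONS in settings.py).
from prometheus_client import Counter, start_http_server
from scrapy import signals

ITEMS_SCRAPED = Counter(
    "scrapy_items_scraped_count", "Items scraped per spider", ["spider"]
)


class PrometheusMetrics:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def spider_opened(self, spider):
        start_http_server(8000)  # serve /metrics for the Prometheus scrape job

    def item_scraped(self, item, response, spider):
        ITEMS_SCRAPED.labels(spider=spider.name).inc()
```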
Grafana dashboard panels track the scraping rate and Kafka consumer lag:

```json
{
  "title": "Data Pipeline Overview",
  "panels": [
    {
      "title": "Scraping Rate",
      "type": "graph",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "rate(scrapy_items_scraped_count[5m])",
          "legendFormat": "{{spider}}"
        }
      ]
    },
    {
      "title": "Processing Lag",
      "type": "gauge",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "kafka_consumer_group_lag"
        }
      ]
    }
  ]
}
```
## Local Development

```bash
# Start local infrastructure
docker-compose -f docker-compose.dev.yml up -d

# Run spider locally
cd scraping
scrapy crawl example_spider

# Process stream locally
python -m streaming.processors.main

# Run tests
pytest tests/
```
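The `tests/` tree is not shown in detail. As an illustrative spider unit test (the import path, and the assumption that the spider yields items containing a `url` field, are hypothetical), a fake response can be passed straight to `parse()`:

```python
# tests/spiders/test_example_spider.py sketch (assumes pytest and that
# example_spider.parse yields items with a "url" field).
from scrapy.http import HtmlResponse, Request

from scraping.spiders.specific_spiders.example_spider import ExampleSpider  # hypothetical path


def test_parse_yields_items_with_url():
    spider = ExampleSpider()
    request = Request(url="https://example.com/listing")
    response = HtmlResponse(
        url=request.url,
        request=request,
        body=b"<html><body><a href='/item/1'>Item 1</a></body></html>",
        encoding="utf-8",
    )

    items = list(spider.parse(response))

    assert items, "parse() should yield at least one item"
    assert all("url" in item for item in items)
```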
## Troubleshooting

- Scraping issues:

```bash
# Check spider logs
kubectl logs -l app=scrapy-spider

# Verify Kafka connectivity
kafkacat -L -b kafka:9092
```

- Processing issues:

```bash
# Check consumer group lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --group processor-group

# Verify MongoDB connectivity
mongosh --eval "db.stats()"
```

- Database issues:

```bash
# Check PostgreSQL connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Monitor Redis memory
redis-cli info memory
```
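The same checks can be scripted. A rough Python equivalent of `pipeline/scripts/health_checks.sh` (the real script is not shown, and the connection strings are placeholders) might probe each backing service:

```python
# Combined health check across the backing services
# (assumes confluent-kafka, pymongo, psycopg2, and redis-py are installed).
from confluent_kafka.admin import AdminClient
from pymongo import MongoClient
import psycopg2
import redis


def _safe(probe):
    """Run a probe, returning True on success and False on any connection error."""
    try:
        return bool(probe())
    except Exception:
        return False


def check_all():
    return {
        "kafka": _safe(lambda: AdminClient(
            {"bootstrap.servers": "kafka:9092"}).list_topics(timeout=5)),
        "mongodb": _safe(lambda: MongoClient(
            "mongodb://mongodb:27017", serverSelectionTimeoutMS=5000
        ).admin.command("ping")),
        "postgresql": _safe(lambda: psycopg2.connect(
            "postgresql://postgres@postgresql:5432/processed", connect_timeout=5
        ).close() or True),
        "redis": _safe(lambda: redis.Redis(host="redis", port=6379).ping()),
    }


if __name__ == "__main__":
    for service, ok in check_all().items():
        print(f"{service}: {'ok' if ok else 'FAILED'}")
```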
## License

This project is licensed under the MIT License; see the LICENSE file for details.