
CI/CD Pipeline for Data Processing Infrastructure


A comprehensive CI/CD pipeline specifically designed for deploying and managing distributed data processing systems, with a focus on web scraping, data streaming, and database management.

📋 Table of Contents

  • Features
  • Project Structure
  • Architecture
  • Prerequisites
  • Getting Started
  • Configuration Examples
  • Deployment Configurations
  • Monitoring Setup
  • Local Development
  • Troubleshooting
  • License

🎯 Features

Data Pipeline Components

  • Scrapy spiders with auto-scaling capabilities
  • Kafka streaming for real-time data processing
  • MongoDB for raw data storage
  • PostgreSQL for processed/analyzed data
  • Redis for caching and job queues
  • API endpoints for data access and monitoring (see the API sketch after this list)
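
This README does not include an example for the api/ layer, so here is a minimal sketch of what a module under api/endpoints/ could look like. It assumes FastAPI with pymongo and redis-py; the module name, route paths, cache key, and collection names are illustrative, not something the project prescribes.

# api/endpoints/items.py -- hypothetical module; the framework choice is an assumption
import json
import os

import redis
from fastapi import FastAPI, HTTPException
from pymongo import MongoClient

app = FastAPI(title="Data Pipeline API")
mongo = MongoClient(os.getenv("MONGODB_URI", "mongodb://mongodb:27017"))
cache = redis.Redis.from_url(os.getenv("REDIS_URL", "redis://redis:6379/0"))


@app.get("/health")
def health() -> dict:
    """Connectivity check used by monitoring probes."""
    return {
        "mongodb": mongo.admin.command("ping")["ok"] == 1,
        "redis": cache.ping(),
    }


@app.get("/items/{source_id}")
def get_item(source_id: str) -> dict:
    """Return one scraped document, with Redis as a read-through cache."""
    cached = cache.get(f"item:{source_id}")
    if cached:
        return json.loads(cached)
    doc = mongo["raw_data"]["scraped_data"].find_one({"source_id": source_id}, {"_id": 0})
    if doc is None:
        raise HTTPException(status_code=404, detail="item not found")
    cache.setex(f"item:{source_id}", 300, json.dumps(doc, default=str))
    return doc

Running it locally with a standard ASGI server (for example uvicorn api.endpoints.items:app) would be one way to try it out.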

CI/CD Features

  • Automated testing for data pipelines
  • Docker containerization with volume management
  • Kubernetes orchestration for distributed systems
  • Multiple environment configurations
  • Database migration automation (see the migration sketch after this list)
  • Monitoring for data quality and pipeline health
  • Backup and recovery procedures
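
The README states that migrations are automated but does not name a tool. The snippet below is a sketch that assumes Alembic manages the files under storage/postgresql/migrations/ and is invoked from the deploy pipeline before new pods roll out; the helper path and alembic.ini location are assumptions.

# pipeline/scripts/run_migrations.py -- hypothetical helper; Alembic is an assumption
import os

from alembic import command
from alembic.config import Config


def run_migrations() -> None:
    """Apply all pending PostgreSQL migrations before rolling out new images."""
    cfg = Config("alembic.ini")  # assumed to point at storage/postgresql/migrations/
    cfg.set_main_option(
        "sqlalchemy.url",
        os.getenv("POSTGRES_DSN", "postgresql://postgres@postgresql:5432/pipeline"),
    )
    command.upgrade(cfg, "head")


if __name__ == "__main__":
    run_migrations()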

πŸ“ Project Structure

data-pipeline-deployment/
├── scraping/
│   ├── spiders/
│   │   ├── base_spider.py
│   │   └── specific_spiders/
│   ├── middlewares/
│   ├── pipelines/
│   └── settings/
├── streaming/
│   ├── kafka_producers/
│   ├── kafka_consumers/
│   └── stream_processors/
├── storage/
│   ├── mongodb/
│   │   ├── schemas/
│   │   └── indexes/
│   ├── postgresql/
│   │   ├── migrations/
│   │   └── models/
│   └── redis/
│       └── cache_configs/
├── api/
│   ├── endpoints/
│   ├── models/
│   └── services/
├── pipeline/
│   ├── github_actions/
│   │   ├── test_pipeline.yml
│   │   └── deploy_pipeline.yml
│   └── scripts/
│       ├── health_checks.sh
│       └── rollback.sh
├── kubernetes/
│   ├── scrapy/
│   │   ├── deployment.yaml
│   │   └── scaler.yaml
│   ├── kafka/
│   │   ├── statefulset.yaml
│   │   └── service.yaml
│   ├── mongodb/
│   ├── postgresql/
│   └── redis/
├── monitoring/
│   ├── prometheus/
│   │   └── scraping_metrics.yaml
│   ├── grafana/
│   │   └── dashboards/
│   │       ├── pipeline_health.json
│   │       ├── data_quality.json
│   │       └── system_metrics.json
│   └── alerts/
├── docker/
│   ├── scrapy/
│   ├── stream_processor/
│   └── api/
└── tests/
    ├── spiders/
    ├── processors/
    └── integration/

πŸ— Architecture

graph TB
    A[Scrapy Spiders] --> B[Kafka Topics]
    B --> C[Stream Processors]
    C --> D[MongoDB Raw Data]
    C --> E[PostgreSQL Processed Data]
    F[Redis Cache] --> G[API Layer]
    D --> G
    E --> G

πŸ“ Prerequisites

  • Python 3.8+
  • Docker and Docker Compose
  • Kubernetes cluster
  • Kafka cluster
  • MongoDB instance
  • PostgreSQL database
  • Redis instance

🚀 Getting Started

  1. Clone and set up:

git clone https://github.com/username/data-pipeline
cd data-pipeline
python -m venv venv
source venv/bin/activate  # or .\venv\Scripts\activate on Windows
pip install -r requirements.txt

  2. Configure environments:

cp .env.example .env
# Edit .env with your configurations
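
The contents of .env.example are not shown here, so the variable names below are assumptions. This is only a sketch of how the services might read their connection settings after step 2, using python-dotenv.

# config.py -- sketch only; variable names are assumptions, since .env.example is not shown
import os
from dataclasses import dataclass

from dotenv import load_dotenv  # pip install python-dotenv


@dataclass(frozen=True)
class Settings:
    kafka_bootstrap_servers: str
    mongodb_uri: str
    postgres_dsn: str
    redis_url: str


def load_settings() -> Settings:
    """Read connection settings from the environment, with local-dev defaults."""
    load_dotenv()  # copies values from .env into os.environ
    return Settings(
        kafka_bootstrap_servers=os.getenv("KAFKA_BOOTSTRAP_SERVERS", "kafka:9092"),
        mongodb_uri=os.getenv("MONGODB_URI", "mongodb://mongodb:27017"),
        postgres_dsn=os.getenv("POSTGRES_DSN", "postgresql://postgres@postgresql:5432/pipeline"),
        redis_url=os.getenv("REDIS_URL", "redis://redis:6379/0"),
    )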

βš™οΈ Configuration Examples

Scrapy Settings

# settings.py
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 1.0
ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'pipelines.KafkaPipeline': 100,
    'pipelines.MongoDBPipeline': 200,
}

KAFKA_PRODUCER_CONFIG = {
    'bootstrap.servers': 'kafka:9092',
    'client.id': 'scrapy-producer'
}
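
ITEM_PIPELINES references pipelines.KafkaPipeline, which is not shown in this README. A minimal sketch of such a pipeline, reusing KAFKA_PRODUCER_CONFIG and publishing every item to the raw_data topic with confluent-kafka, could look like this; the topic name and JSON serialization are assumptions.

# pipelines.py -- sketch of the KafkaPipeline referenced in ITEM_PIPELINES
import json

from confluent_kafka import Producer
from itemadapter import ItemAdapter


class KafkaPipeline:
    def __init__(self, producer_config: dict, topic: str = "raw_data"):
        self.producer_config = producer_config
        self.topic = topic
        self.producer = None

    @classmethod
    def from_crawler(cls, crawler):
        # Reuse the KAFKA_PRODUCER_CONFIG dict from settings.py.
        return cls(crawler.settings.getdict("KAFKA_PRODUCER_CONFIG"))

    def open_spider(self, spider):
        self.producer = Producer(self.producer_config)

    def process_item(self, item, spider):
        payload = json.dumps(ItemAdapter(item).asdict(), default=str).encode("utf-8")
        self.producer.produce(self.topic, value=payload)
        self.producer.poll(0)  # serve delivery callbacks without blocking
        return item

    def close_spider(self, spider):
        self.producer.flush()  # block until all queued messages are delivered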

MongoDB Configuration

# mongodb.yaml
mongodb:
  uri: mongodb://mongodb:27017
  database: raw_data
  collections:
    scraped_data:
      indexes:
        - keys:
            timestamp: -1
        - keys:
            url: 1
          unique: true
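
The indexes declared above can also be created with pymongo; the script below is a sketch of that step (the module path is hypothetical, and applying it as part of deployment is an assumption).

# storage/mongodb/create_indexes.py -- hypothetical helper applying the indexes above
import os

from pymongo import ASCENDING, DESCENDING, MongoClient


def ensure_indexes() -> None:
    client = MongoClient(os.getenv("MONGODB_URI", "mongodb://mongodb:27017"))
    collection = client["raw_data"]["scraped_data"]
    # { timestamp: -1 } for time-range queries over recent scrapes.
    collection.create_index([("timestamp", DESCENDING)])
    # { url: 1 }, unique, to deduplicate re-scraped pages.
    collection.create_index([("url", ASCENDING)], unique=True)


if __name__ == "__main__":
    ensure_indexes()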

Kafka Topics

# kafka-topics.yaml
topics:
  raw_data:
    partitions: 6
    replication_factor: 3
    configs:
      retention.ms: 604800000
  processed_data:
    partitions: 6
    replication_factor: 3
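
One way to apply kafka-topics.yaml is confluent-kafka's AdminClient; the helper below is a sketch of that approach, not necessarily the tool this project uses.

# streaming/create_topics.py -- hypothetical helper creating the topics declared above
from confluent_kafka.admin import AdminClient, NewTopic

TOPICS = [
    NewTopic("raw_data", num_partitions=6, replication_factor=3,
             config={"retention.ms": "604800000"}),  # 7 days
    NewTopic("processed_data", num_partitions=6, replication_factor=3),
]


def create_topics(bootstrap_servers: str = "kafka:9092") -> None:
    admin = AdminClient({"bootstrap.servers": bootstrap_servers})
    for topic, future in admin.create_topics(TOPICS).items():
        try:
            future.result()  # raises if creation failed
            print(f"created topic {topic}")
        except Exception as exc:  # e.g. topic already exists
            print(f"skipping {topic}: {exc}")


if __name__ == "__main__":
    create_topics()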

PostgreSQL Schema

-- init.sql
CREATE TABLE processed_data (
    id SERIAL PRIMARY KEY,
    source_id VARCHAR(255),
    processed_at TIMESTAMPTZ DEFAULT NOW(),
    data JSONB,
    metadata JSONB,
    CONSTRAINT unique_source UNIQUE (source_id)
);

CREATE INDEX idx_processed_data_metadata ON processed_data USING GIN (metadata);
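
Writers that target this table can lean on the unique_source constraint for idempotent upserts. The helper below is a sketch using psycopg2; the DSN and module path are assumptions.

# storage/postgresql/upsert.py -- hypothetical helper writing processed records
import os

import psycopg2
from psycopg2.extras import Json

UPSERT_SQL = """
INSERT INTO processed_data (source_id, data, metadata)
VALUES (%s, %s, %s)
ON CONFLICT (source_id) DO UPDATE
SET data = EXCLUDED.data,
    metadata = EXCLUDED.metadata,
    processed_at = NOW();
"""


def upsert_record(source_id: str, data: dict, metadata: dict) -> None:
    dsn = os.getenv("POSTGRES_DSN", "postgresql://postgres@postgresql:5432/pipeline")
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(UPSERT_SQL, (source_id, Json(data), Json(metadata)))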

🔄 Deployment Configurations

Scrapy Deployment

# kubernetes/scrapy/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-spiders
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scrapy-spider
  template:
    metadata:
      labels:
        app: scrapy-spider
    spec:
      containers:
      - name: spider
        image: registry/spider:latest
        env:
          - name: KAFKA_BOOTSTRAP_SERVERS
            value: kafka:9092
        resources:
          limits:
            memory: 512Mi
            cpu: 500m

Stream Processor Deployment

# kubernetes/stream-processor/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stream-processor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: stream-processor
  template:
    metadata:
      labels:
        app: stream-processor
    spec:
      containers:
      - name: processor
        image: registry/processor:latest
        env:
          - name: MONGODB_URI
            valueFrom:
              secretKeyRef:
                name: mongodb-credentials
                key: uri
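
The processor image's entry point is not shown in this README. A minimal sketch of streaming/processors/main.py could consume raw_data, archive each raw document in MongoDB, and upsert the processed row in PostgreSQL via the upsert helper sketched earlier; the field names (url, spider) and the consumer group name are assumptions, though processor-group matches the troubleshooting command below.

# streaming/processors/main.py -- hypothetical entry point for the processor container
import json
import os

from confluent_kafka import Consumer
from pymongo import MongoClient

from storage.postgresql.upsert import upsert_record  # hypothetical helper, sketched above


def run() -> None:
    consumer = Consumer({
        "bootstrap.servers": os.getenv("KAFKA_BOOTSTRAP_SERVERS", "kafka:9092"),
        "group.id": "processor-group",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["raw_data"])
    raw_store = MongoClient(os.getenv("MONGODB_URI"))["raw_data"]["scraped_data"]

    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            record = json.loads(msg.value())
            # Keep the raw document, deduplicated by URL, then write the processed row.
            raw_store.update_one({"url": record["url"]}, {"$set": record}, upsert=True)
            upsert_record(record["url"], data=record,
                          metadata={"spider": record.get("spider")})
    finally:
        consumer.close()


if __name__ == "__main__":
    run()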

📊 Monitoring Setup

Pipeline Metrics

# prometheus/pipeline-metrics.yaml
- job_name: 'scrapy-metrics'
  static_configs:
    - targets: ['scrapy:8000']
  metrics_path: '/metrics'
  scrape_interval: 30s
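
The scrape job above expects the spiders to expose /metrics on port 8000. A sketch of a Scrapy extension doing that with prometheus_client is shown here; registering it under EXTENSIONS in settings.py is assumed. Note that recent prometheus_client versions append _total to counter names, so the exposed series may be scrapy_items_scraped_count_total rather than the exact name used in the dashboard query below.

# monitoring sketch -- hypothetical Scrapy extension exposing scrape metrics
from prometheus_client import Counter, start_http_server
from scrapy import signals

ITEMS_SCRAPED = Counter(
    "scrapy_items_scraped_count", "Items scraped per spider", ["spider"]
)


class PrometheusExtension:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def spider_opened(self, spider):
        # Serve /metrics on the port the scrape job above targets.
        start_http_server(8000)

    def item_scraped(self, item, response, spider):
        ITEMS_SCRAPED.labels(spider=spider.name).inc()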

Grafana Dashboard Example

{
  "title": "Data Pipeline Overview",
  "panels": [
    {
      "title": "Scraping Rate",
      "type": "graph",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "rate(scrapy_items_scraped_count[5m])",
          "legendFormat": "{{spider}}"
        }
      ]
    },
    {
      "title": "Processing Lag",
      "type": "gauge",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "kafka_consumer_group_lag"
        }
      ]
    }
  ]
}

🔧 Local Development

# Start local infrastructure
docker-compose -f docker-compose.dev.yml up -d

# Run spider locally
cd scraping
scrapy crawl example_spider

# Process stream locally
python -m streaming.processors.main

# Run tests
pytest tests/
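
Spider tests under tests/spiders/ can run entirely offline by feeding a hand-built response into parse(). The test below is a sketch: the import path, fixture HTML, and asserted fields are placeholders for whatever example_spider really yields.

# tests/spiders/test_example_spider.py -- sketch; paths and assertions are hypothetical
from scrapy.http import HtmlResponse, Request

from scraping.spiders.specific_spiders.example_spider import ExampleSpider  # hypothetical path


def fake_response(url: str, body: str) -> HtmlResponse:
    """Build an in-memory response so the test never touches the network."""
    return HtmlResponse(url=url, body=body.encode("utf-8"), encoding="utf-8",
                        request=Request(url=url))


def test_parse_yields_items():
    spider = ExampleSpider()
    response = fake_response("https://example.com", "<html><body><h1>Hi</h1></body></html>")
    items = list(spider.parse(response))
    assert items, "parse() should yield at least one item for a well-formed page"
    assert all("url" in item for item in items)  # adjust to the real item schema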

πŸ” Troubleshooting

Common Issues

  1. Scraping Issues

# Check spider logs
kubectl logs -l app=scrapy-spider
# Verify Kafka connectivity
kafkacat -L -b kafka:9092

  2. Processing Issues

# Check consumer group lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --group processor-group

# Verify MongoDB connectivity
mongosh --eval "db.stats()"

  3. Database Issues

# Check PostgreSQL connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Monitor Redis memory
redis-cli info memory
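
scripts/health_checks.sh itself is not shown in this README; the snippet below sketches the same connectivity checks in Python so they can be reused by CI or a readiness probe. The hostnames and environment variable names are the in-cluster defaults assumed elsewhere in this document.

# health_check.py -- hypothetical helper collecting the checks above in one place
import os
import sys

import psycopg2
import redis
from confluent_kafka.admin import AdminClient
from pymongo import MongoClient


def check_all() -> dict:
    results = {}
    try:
        MongoClient(os.getenv("MONGODB_URI", "mongodb://mongodb:27017"),
                    serverSelectionTimeoutMS=2000).admin.command("ping")
        results["mongodb"] = True
    except Exception:
        results["mongodb"] = False
    try:
        psycopg2.connect(os.getenv("POSTGRES_DSN",
                                   "postgresql://postgres@postgresql:5432/pipeline"),
                         connect_timeout=2).close()
        results["postgresql"] = True
    except Exception:
        results["postgresql"] = False
    try:
        results["redis"] = redis.Redis.from_url(
            os.getenv("REDIS_URL", "redis://redis:6379/0"), socket_timeout=2).ping()
    except Exception:
        results["redis"] = False
    try:
        AdminClient({"bootstrap.servers": os.getenv("KAFKA_BOOTSTRAP_SERVERS",
                                                    "kafka:9092")}).list_topics(timeout=2)
        results["kafka"] = True
    except Exception:
        results["kafka"] = False
    return results


if __name__ == "__main__":
    status = check_all()
    print(status)
    sys.exit(0 if all(status.values()) else 1)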

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
