
ETL Pipeline

Current MVP

The containerized ETL pipeline — from raw PDF ingestion through PostgreSQL storage — is implemented in Phase 2.


Overview


Pipeline Diagram

flowchart LR
    A[PDF Files] --> B[Parser]
    B --> C[Transformer / Cleaner]
    C --> D{Valid?}
    D -- Yes --> E[PostgreSQL Loader]
    D -- No --> F[Dead Letter Queue / Logs]
    E --> G[(PostgreSQL)]
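
The routing in the diagram can be sketched in a few lines of Python. This is only an illustration of the control flow, not the project's implementation: `run_pipeline`, `PipelineResult`, and the four stage callables (`parse`, `transform`, `validate`, `load`) are hypothetical names, and the dead letter queue is modeled as an in-memory list.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class PipelineResult:
    """Outcome of one run: rows that were loaded and rows that were rejected."""
    loaded: list = field(default_factory=list)
    dead_letters: list = field(default_factory=list)  # (record, reason) pairs


def run_pipeline(
    sources: Iterable[str],
    parse: Callable[[str], Iterable[dict]],      # Parser: file path -> raw rows
    transform: Callable[[dict], dict],           # Transformer / Cleaner
    validate: Callable[[dict], Optional[str]],   # returns None if valid, else a reason
    load: Callable[[dict], None],                # PostgreSQL Loader (stand-in)
) -> PipelineResult:
    result = PipelineResult()
    for source in sources:
        for raw in parse(source):
            record = transform(raw)
            reason = validate(record)
            if reason is None:
                # "Valid? -- Yes" branch: hand the record to the loader.
                load(record)
                result.loaded.append(record)
            else:
                # "Valid? -- No" branch: keep the record and the reason for review.
                result.dead_letters.append((record, reason))
    return result
```

Passing the stages in as callables keeps the routing logic testable without PDF files or a running database; the real parser and loader can be substituted later.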

Orchestration

Airflow DAG Structure

Prefect Flow (Alternative)


Transform Logic

Financial Data Normalization

Deduplication

Incremental Loading


Docker Setup

# Partial docker-compose.yml structure
services:
  pipeline:
    build: ./pipeline
    volumes:
      - ./data:/app/data
    environment:
      - DATABASE_URL=postgresql://...
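
Inside the container, the pipeline would typically read `DATABASE_URL` from its environment. The sketch below shows one way to do that with the standard library only; `database_settings` is a hypothetical helper, and the fallback to port 5432 (the PostgreSQL default) is an assumption about how a missing port should be handled.

```python
import os
from urllib.parse import urlsplit


def database_settings(env=os.environ) -> dict:
    """Split DATABASE_URL into connection parameters (hypothetical helper)."""
    url = env.get("DATABASE_URL")
    if not url:
        raise RuntimeError("DATABASE_URL is not set")

    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        # Do not echo the URL itself: it may contain credentials.
        raise ValueError("DATABASE_URL must use the postgresql:// scheme")

    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # assumed fallback: PostgreSQL default port
        "user": parts.username,
        "password": parts.password,
        "dbname": parts.path.lstrip("/"),
    }
```

Failing fast when the variable is missing makes a misconfigured container stop at startup instead of partway through a load.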

Monitoring and Alerting


Data Lineage