ETL Pipeline
Current MVP
The containerized ETL pipeline, covering every stage from raw PDF ingestion through PostgreSQL storage, is implemented in Phase 2.
Overview
Pipeline Diagram
flowchart LR
A[PDF Files] --> B[Parser]
B --> C[Transformer / Cleaner]
C --> D{Valid?}
D -- Yes --> E[PostgreSQL Loader]
D -- No --> F[Dead Letter Queue / Logs]
E --> G[(PostgreSQL)]
Orchestration
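Independent of any particular orchestrator, the flow in the pipeline diagram can be illustrated in plain Python. This is only a sketch under stated assumptions: the names `parse`, `transform`, `is_valid`, and `run_pipeline` and the record shape are hypothetical, the parser is a stand-in that reads text lines instead of PDF files, and the loader is a callback rather than a real PostgreSQL writer.

```python
from typing import Callable, Iterable

Record = dict  # hypothetical record shape: one dict per extracted row


def parse(raw_lines: Iterable[str]) -> list[Record]:
    """Stand-in for the PDF parser: splits 'name,amount' text lines."""
    records = []
    for line in raw_lines:
        name, _, amount = line.partition(",")
        records.append({"name": name, "amount": amount})
    return records


def transform(record: Record) -> Record:
    """Stand-in for the transformer/cleaner: trims surrounding whitespace."""
    return {key: value.strip() for key, value in record.items()}


def is_valid(record: Record) -> bool:
    """Assumed validation rule: name present and amount parses as a number."""
    if not record["name"]:
        return False
    try:
        float(record["amount"])
    except ValueError:
        return False
    return True


def run_pipeline(raw_lines: Iterable[str], load: Callable[[Record], None]):
    """Route valid records to the loader and the rest to a dead-letter list."""
    dead_letters = []
    loaded = 0
    for record in map(transform, parse(raw_lines)):
        if is_valid(record):
            load(record)
            loaded += 1
        else:
            dead_letters.append(record)
    return loaded, dead_letters


# Usage with made-up input: one clean row and one that fails validation.
stored = []
loaded, rejected = run_pipeline([" Rent , 1200.50", "Bogus,n/a"], stored.append)
```

Passing the loader in as a callable keeps the routing logic testable without a database; an Airflow or Prefect definition would wrap the same stages as tasks.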
Airflow DAG Structure
Prefect Flow (Alternative)
Transform Logic
Financial Data Normalization
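The normalization rules are not spelled out here, so the following is a minimal sketch under assumptions: amounts arrive as printed strings, commas are thousands separators, and a leading minus or surrounding parentheses mark a negative value. The function name `normalize_amount` is hypothetical. `Decimal` is used rather than `float` so that monetary values are not subject to binary floating-point rounding.

```python
import re
from decimal import Decimal, InvalidOperation


def normalize_amount(raw: str) -> Decimal:
    """Convert a printed amount such as '$1,234.50' or '(300.00)' to a Decimal.

    Assumed conventions: commas are thousands separators, and a leading
    minus or surrounding parentheses mark a negative value.
    """
    text = raw.strip()
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Drop everything except digits, the decimal point, and a minus sign
    # (currency symbols, spaces, thousands separators).
    text = re.sub(r"[^\d.\-]", "", text)
    try:
        value = Decimal(text)
    except InvalidOperation as exc:
        raise ValueError(f"not a recognizable amount: {raw!r}") from exc
    if negative:
        value = -value
    # Quantize to two decimal places using Decimal's default rounding mode.
    return value.quantize(Decimal("0.01"))
```

Inputs that do not reduce to a number raise `ValueError`, which a caller could use to route the record to the dead letter queue instead of loading it.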
Deduplication
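One common way to deduplicate records is to fingerprint the fields that identify a row and keep only the first occurrence of each fingerprint. The sketch below assumes that approach; the helper names and the key fields (`date`, `description`, `amount`) are hypothetical and would need to match the real schema.

```python
import hashlib
import json


def record_key(record: dict, fields: tuple[str, ...]) -> str:
    """Hash the chosen fields into a stable fingerprint for one record."""
    payload = json.dumps([record.get(f) for f in fields], default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(records, fields=("date", "description", "amount")):
    """Keep the first occurrence of each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = record_key(record, fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Usage with made-up rows: the second row repeats the first.
rows = [
    {"date": "2024-01-05", "description": "Rent", "amount": "1200.00"},
    {"date": "2024-01-05", "description": "Rent", "amount": "1200.00"},
    {"date": "2024-01-06", "description": "Groceries", "amount": "84.10"},
]
unique_rows = deduplicate(rows)
```

A caveat of this design: two genuinely separate transactions with identical key fields would be collapsed into one, so the key may need a source-specific field such as a page or line position.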
Incremental Loading
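Incremental loading can be approached in several ways; the sketch below assumes one of them, skipping any PDF whose content checksum has already been recorded. The function names and the in-memory `processed` set are hypothetical. In a real deployment the set of processed checksums would have to be persisted, for example in a PostgreSQL table.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path: Path) -> str:
    """SHA-256 of the file contents, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pending_files(data_dir: Path, processed: set[str]) -> list[tuple[Path, str]]:
    """Return (path, checksum) pairs for PDFs whose content is not yet loaded."""
    pending = []
    for path in sorted(data_dir.glob("*.pdf")):
        checksum = file_checksum(path)
        if checksum not in processed:
            pending.append((path, checksum))
    return pending


# Usage with throwaway files: the second run finds nothing new.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "jan.pdf").write_bytes(b"january statement")
    (data_dir / "feb.pdf").write_bytes(b"february statement")
    processed = set()
    first_run = pending_files(data_dir, processed)
    processed.update(checksum for _, checksum in first_run)
    second_run = pending_files(data_dir, processed)
```

Keying on content rather than file name means a renamed file is not reloaded, while a file that is edited in place is picked up again.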
Docker Setup
# Partial docker-compose.yml structure
services:
  pipeline:
    build: ./pipeline
    volumes:
      - ./data:/app/data
    environment:
      - DATABASE_URL=postgresql://...
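Inside the container, the pipeline code would need to read the `DATABASE_URL` variable set above. A minimal sketch of that step, using only the standard library, is shown here; the function name `database_settings` is hypothetical, and the URL in the usage lines is a made-up placeholder, not the project's actual connection string.

```python
import os
from urllib.parse import urlsplit


def database_settings(env=os.environ) -> dict:
    """Read DATABASE_URL and split it into connection parameters."""
    url = env.get("DATABASE_URL")
    if not url:
        raise RuntimeError("DATABASE_URL is not set")
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise RuntimeError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # 5432 is PostgreSQL's default port
        "user": parts.username,
        "password": parts.password,
        "dbname": parts.path.lstrip("/"),
    }


# Usage with a placeholder URL (all values here are invented for illustration).
settings = database_settings(
    {"DATABASE_URL": "postgresql://etl_user:secret@db:5432/finance"}
)
```

Failing fast when the variable is missing or has the wrong scheme surfaces a misconfigured compose file at startup rather than at the first query.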