Data Pipeline Architecture
Current MVP
The current pipeline covers automated PDF download, PyMuPDF parsing, data cleaning, and PostgreSQL ingestion, orchestrated via Airflow or Prefect.
Pipeline Overview
flowchart LR
A[Scheduled Trigger] --> B[Scraper]
B -->|PDF files| C[PDF Parser]
C -->|Raw JSON| D[Data Cleaner]
D -->|Clean JSON| E[Postgres Loader]
E --> F[(PostgreSQL)]
F -->|Text chunks| G[Embedding Generator]
G -->|Vectors| H[(pgvector)]
Stages
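As a rough illustration of the flow in the diagram above, the first four stages can be modelled as plain functions chained in order. This is a minimal sketch under stated assumptions, not the project's code: every function name, the record fields, and the placeholder file path are hypothetical, and the stubs stand in for the real scraper, PyMuPDF parser, and PostgreSQL loader.

```python
from typing import Any, Callable

# Hypothetical stage type: each stage consumes the previous stage's output.
Stage = Callable[[Any], Any]


def scrape(trigger: dict) -> list[str]:
    """Stub scraper: returns paths of downloaded PDFs (placeholder value)."""
    return [f"{trigger['source']}/report.pdf"]


def parse_pdfs(paths: list[str]) -> list[dict]:
    """Stub parser: emits one raw JSON-like record per PDF."""
    return [{"path": p, "text": "  Example   text "} for p in paths]


def clean(records: list[dict]) -> list[dict]:
    """Stub cleaner: collapses runs of whitespace in each record's text."""
    return [{**r, "text": " ".join(r["text"].split())} for r in records]


def load(records: list[dict]) -> int:
    """Stub loader: stands in for PostgreSQL inserts and returns the row count."""
    return len(records)


# Stage order mirrors the flowchart: Scraper, PDF Parser, Data Cleaner, Postgres Loader.
STAGES: list[tuple[str, Stage]] = [
    ("scrape", scrape),
    ("parse", parse_pdfs),
    ("clean", clean),
    ("load", load),
]


def run_pipeline(trigger: dict, stages: list[tuple[str, Stage]] = STAGES) -> Any:
    """Feed the trigger payload through each stage in order."""
    payload: Any = trigger
    for _name, stage in stages:
        payload = stage(payload)
    return payload
```

Calling `run_pipeline({"source": "downloads"})` pushes one placeholder PDF path through all four stubs. In an orchestrated deployment, each function would presumably become its own Airflow or Prefect task so that a failed stage can be retried without rerunning the earlier ones.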
Stage 1 — Scraping
Stage 2 — PDF Parsing
Stage 3 — Data Cleaning
Stage 4 — Database Ingestion
Stage 5 — Embedding Generation
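A minimal sketch of what a PyMuPDF-based parsing step could look like, assuming the goal is one raw JSON record per PDF with plain text per page. The function names and the record layout (`source_file`, `page_count`, `pages`) are hypothetical; only `fitz.open` and `Page.get_text` are PyMuPDF calls.

```python
import json
from pathlib import Path


def build_record(pdf_path: str, page_texts: list[str]) -> dict:
    """Assemble a raw record (hypothetical layout) from per-page text."""
    return {
        "source_file": Path(pdf_path).name,
        "page_count": len(page_texts),
        # Pages are numbered from 1 to match how readers cite them.
        "pages": [{"page": i + 1, "text": t} for i, t in enumerate(page_texts)],
    }


def parse_pdf(pdf_path: str) -> dict:
    """Extract plain text from every page of a PDF using PyMuPDF."""
    # Imported here so build_record stays usable where PyMuPDF is not installed.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        page_texts = [page.get_text() for page in doc]
    return build_record(pdf_path, page_texts)


def to_raw_json(record: dict) -> str:
    """Serialize a record for the next stage; keeps non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False)
```

Separating `build_record` from the PyMuPDF call keeps the record shape testable without a real PDF on disk.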
Planned Architecture (Future Phases)
Embedding generation and pgvector indexing are scheduled for implementation in Phase 3.