Data Pipeline Architecture

Current MVP

The current pipeline covers four steps: automated PDF download, PDF parsing with PyMuPDF, data cleaning, and PostgreSQL ingestion. The steps are orchestrated with Airflow or Prefect.


Pipeline Overview

flowchart LR
    A[Scheduled Trigger] --> B[Scraper]
    B -->|PDF files| C[PDF Parser]
    C -->|Raw JSON| D[Data Cleaner]
    D -->|Clean JSON| E[Postgres Loader]
    E --> F[(PostgreSQL)]
    F -->|Text chunks| G[Embedding Generator]
    G -->|Vectors| H[(pgvector)]
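
As a rough illustration of how the first four boxes in the flowchart hand data to one another, the sketch below chains them as plain Python functions. Every name in it (`scrape`, `parse_pdf`, `clean`, `load`, `run_pipeline`) is hypothetical and not taken from the project. The scraper returns fake bytes, the parser only decodes them where a real one would call PyMuPDF, and the loader appends to an in-memory list instead of writing to PostgreSQL. The embedding stage is left out.

```python
import re


def scrape() -> list[bytes]:
    """Scraper stand-in (hypothetical): returns fake PDF payloads."""
    return [b"%PDF-1.7 fake  report   one ", b"%PDF-1.7 fake report two"]


def parse_pdf(pdf: bytes) -> dict:
    """PDF Parser stand-in: a real version would extract text with PyMuPDF."""
    return {"text": pdf.decode("latin-1")}


def clean(record: dict) -> dict:
    """Data Cleaner: collapse runs of whitespace and trim both ends."""
    return {"text": re.sub(r"\s+", " ", record["text"]).strip()}


def load(records: list[dict], table: list[dict]) -> int:
    """Postgres Loader stand-in: appends rows to an in-memory list."""
    table.extend(records)
    return len(records)


def run_pipeline(table: list[dict]) -> int:
    """Run scrape -> parse -> clean -> load and return the rows loaded."""
    raw = [parse_pdf(pdf) for pdf in scrape()]
    cleaned = [clean(record) for record in raw]
    return load(cleaned, table)


if __name__ == "__main__":
    rows: list[dict] = []
    print(run_pipeline(rows), "rows loaded")
```

In an orchestrated deployment, each function would typically become its own Airflow task or Prefect task so that a failed stage can be retried without rerunning the earlier ones.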

Stages

Stage 1 — Scraping

Stage 2 — PDF Parsing

Stage 3 — Data Cleaning

Stage 4 — Database Ingestion

Stage 5 — Embedding Generation

Planned Architecture (Future Phases)

Embedding generation and pgvector indexing are planned for Phase 3 and are not part of the current MVP.


Orchestration


Output Schema