
PDF Parsing

Current MVP

PyMuPDF-based parsing for text extraction and financial table detection is implemented in Phase 2.


Overview


Technology Stack

| Tool | Purpose |
| --- | --- |
| PyMuPDF (fitz) | Primary PDF text and table extraction |
| Unstructured (optional) | Complex layout parsing for non-standard reports |
| pdfplumber (optional) | Fallback for table-heavy documents |
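
The primary/fallback split in the stack can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the function names (`pymupdf_tables`, `pdfplumber_tables`, `extract_tables`) and the `min_rows` fallback rule are assumptions made for the example. It relies on `page.find_tables()` from PyMuPDF 1.23 or later and on `page.extract_tables()` from pdfplumber.

```python
from typing import Callable, List, Optional

Row = List[Optional[str]]
Table = List[Row]
Extractor = Callable[[str], List[Table]]


def pymupdf_tables(pdf_path: str) -> List[Table]:
    """Primary extractor: PyMuPDF's table finder (requires PyMuPDF >= 1.23)."""
    import fitz  # PyMuPDF; imported lazily so the module loads without it

    tables: List[Table] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for found in page.find_tables().tables:
                tables.append(found.extract())  # rows of cell strings
    return tables


def pdfplumber_tables(pdf_path: str) -> List[Table]:
    """Optional fallback extractor for table-heavy documents."""
    import pdfplumber  # optional dependency, imported lazily

    tables: List[Table] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


def extract_tables(
    pdf_path: str,
    primary: Extractor = pymupdf_tables,
    fallback: Optional[Extractor] = pdfplumber_tables,
    min_rows: int = 2,  # assumed rule: a usable table has a header plus one row
) -> List[Table]:
    """Try the primary extractor; use the fallback if it finds nothing usable."""
    usable = [t for t in primary(pdf_path) if len(t) >= min_rows]
    if usable:
        return usable
    if fallback is None:
        return []
    return [t for t in fallback(pdf_path) if len(t) >= min_rows]
```

Passing the extractors as parameters keeps pdfplumber optional (set `fallback=None` when it is not installed) and makes the fallback rule easy to exercise with stub extractors.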

Extraction Targets

Financial Tables

Narrative Text

Metadata


Output JSON Schema

{
  "company": "Maybank",
  "period": "Q3 2025",
  "filing_date": "2025-11-15",
  "tables": {
    "income_statement": {},
    "balance_sheet": {},
    "cash_flow": {}
  },
  "narrative": {
    "management_discussion": "",
    "outlook": ""
  }
}
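
As a rough sketch of how a parser might assemble and sanity-check a document of this shape, the snippet below builds the structure with empty defaults and reports missing or mistyped fields. The helper names (`build_output`, `validate_output`) and the specific checks are hypothetical; only the key names and the `YYYY-MM-DD` date format come from the schema above.

```python
import json
from datetime import date
from typing import Dict, List, Optional

TABLE_KEYS = ("income_statement", "balance_sheet", "cash_flow")
NARRATIVE_KEYS = ("management_discussion", "outlook")


def build_output(
    company: str,
    period: str,
    filing_date: str,
    tables: Optional[Dict[str, dict]] = None,
    narrative: Optional[Dict[str, str]] = None,
) -> dict:
    """Assemble the output document, filling absent sections with empty values."""
    tables = tables or {}
    narrative = narrative or {}
    return {
        "company": company,
        "period": period,
        "filing_date": filing_date,
        "tables": {key: tables.get(key, {}) for key in TABLE_KEYS},
        "narrative": {key: narrative.get(key, "") for key in NARRATIVE_KEYS},
    }


def validate_output(doc: dict) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: List[str] = []
    for key in ("company", "period", "filing_date", "tables", "narrative"):
        if key not in doc:
            problems.append(f"missing key: {key}")
    try:
        date.fromisoformat(doc.get("filing_date", ""))
    except (TypeError, ValueError):
        problems.append("filing_date is not an ISO date (YYYY-MM-DD)")
    tables = doc.get("tables")
    tables = tables if isinstance(tables, dict) else {}
    for key in TABLE_KEYS:
        if not isinstance(tables.get(key), dict):
            problems.append(f"tables.{key} must be an object")
    narrative = doc.get("narrative")
    narrative = narrative if isinstance(narrative, dict) else {}
    for key in NARRATIVE_KEYS:
        if not isinstance(narrative.get(key), str):
            problems.append(f"narrative.{key} must be a string")
    return problems


if __name__ == "__main__":
    doc = build_output("Maybank", "Q3 2025", "2025-11-15")
    print(json.dumps(doc, indent=2))
    print(validate_output(doc))
```

Returning a list of problems rather than raising on the first one lets a caller log every issue found in a single filing.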

Quality Checks


Known Limitations