PDF Parsing

Current MVP

PyMuPDF-based parsing for text extraction and financial table detection is implemented in Phase 2.

Overview

Technology Stack

Tool	Purpose
PyMuPDF (fitz)	Primary PDF text and table extraction
Unstructured (optional)	Complex layout parsing for non-standard reports
pdfplumber (optional)	Fallback for table-heavy documents
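
The primary/fallback arrangement in the table above might be wired together roughly as follows. This is a minimal sketch, not the project's actual implementation: the function names (`extract_with_pymupdf`, `extract_with_pdfplumber`, `extract`), the per-page dict layout, and the rule "fall back when PyMuPDF finds no tables" are all assumptions made for illustration. It also assumes a PyMuPDF version that provides `Page.find_tables()`.

```python
from typing import Any, Callable

PageList = list[dict[str, Any]]


def extract_with_pymupdf(path: str) -> PageList:
    """Return one dict per page with its text and any detected tables."""
    import fitz  # PyMuPDF; imported lazily so this module loads without it

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            # find_tables() returns a TableFinder; each table extracts to rows of cells
            tables = [t.extract() for t in page.find_tables().tables]
            pages.append({
                "number": page.number + 1,  # page.number is 0-based
                "text": page.get_text("text"),
                "tables": tables,
            })
    return pages


def extract_with_pdfplumber(path: str) -> PageList:
    """Optional fallback for table-heavy documents."""
    import pdfplumber  # optional dependency, imported lazily

    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            pages.append({
                "number": page.page_number,  # already 1-based
                "text": page.extract_text() or "",
                "tables": page.extract_tables(),
            })
    return pages


def extract(
    path: str,
    primary: Callable[[str], PageList] = extract_with_pymupdf,
    fallback: Callable[[str], PageList] = extract_with_pdfplumber,
) -> PageList:
    """Try the primary parser; use the fallback only if no tables were found.

    The "no tables found" trigger is a hypothetical policy, not a documented one.
    """
    pages = primary(path)
    if any(p["tables"] for p in pages):
        return pages
    try:
        return fallback(path)
    except ImportError:
        # The optional fallback is not installed; keep the primary result.
        return pages
```

Passing the parsers in as arguments keeps the optional dependencies truly optional and lets the selection logic be exercised without any PDF files on disk.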

Extraction Targets

Financial Tables

Narrative Text

Metadata

Output JSON Schema

{
  "company": "Maybank",
  "period": "Q3 2025",
  "filing_date": "2025-11-15",
  "tables": {
    "income_statement": {},
    "balance_sheet": {},
    "cash_flow": {}
  },
  "narrative": {
    "management_discussion": "",
    "outlook": ""
  }
}
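
As an illustration of how a parser might produce and sanity-check a document of this shape, here is a small standard-library sketch. The helper names `new_report` and `schema_errors` are hypothetical, and treating every key shown above as required is an assumption, since the schema does not say which fields are optional.

```python
import json
from datetime import date

# Keys taken from the schema above; treating all of them as required is an assumption.
REQUIRED_TABLES = ("income_statement", "balance_sheet", "cash_flow")
REQUIRED_NARRATIVE = ("management_discussion", "outlook")


def new_report(company: str, period: str, filing_date: str) -> dict:
    """Build an empty report skeleton with the same layout as the schema."""
    date.fromisoformat(filing_date)  # raises ValueError if not a valid ISO date
    return {
        "company": company,
        "period": period,
        "filing_date": filing_date,
        "tables": {name: {} for name in REQUIRED_TABLES},
        "narrative": {name: "" for name in REQUIRED_NARRATIVE},
    }


def schema_errors(report: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the shape is OK."""
    errors = []
    for key in ("company", "period", "filing_date"):
        if not isinstance(report.get(key), str):
            errors.append(f"{key} missing")
    for section, names in (("tables", REQUIRED_TABLES), ("narrative", REQUIRED_NARRATIVE)):
        body = report.get(section)
        if not isinstance(body, dict):
            errors.append(f"{section} missing")
            continue
        for name in names:
            if name not in body:
                errors.append(f"{section}.{name} missing")
    return errors


report = new_report("Maybank", "Q3 2025", "2025-11-15")
print(json.dumps(report, indent=2))
```

Returning a list of problems rather than raising on the first one makes it easier to log every structural gap in a single pass over a parsed filing.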

Quality Checks

Known Limitations