PDF Parsing
Current MVP
PyMuPDF-based parsing for text extraction and financial table detection is implemented in Phase 2.
Overview
Technology Stack
| Tool | Purpose |
|---|---|
| PyMuPDF (fitz) | Primary PDF text and table extraction |
| Unstructured (optional) | Complex layout parsing for non-standard reports |
| pdfplumber (optional) | Fallback for table-heavy documents |
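
The primary-plus-fallback arrangement in the table could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the function names, the `has_usable_tables` heuristic, and its `min_rows` threshold are all hypothetical. The library calls used are `fitz.open`, `page.get_text`, and `page.find_tables` from PyMuPDF (the last one requires version 1.23.0 or later), and `pdfplumber.open` and `page.extract_tables` from pdfplumber.

```python
from typing import List, Optional

# A table is a list of rows; each row is a list of cell strings (None for empty cells).
Table = List[List[Optional[str]]]


def has_usable_tables(tables: List[Table], min_rows: int = 2) -> bool:
    """Hypothetical heuristic: at least one table has min_rows or more rows."""
    return any(len(table) >= min_rows for table in tables)


def extract_with_pymupdf(pdf_path: str) -> dict:
    """Primary path: page text and tables via PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so this module loads without it

    text_parts: List[str] = []
    tables: List[Table] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text_parts.append(page.get_text())
            # find_tables() is available in PyMuPDF 1.23.0 and later
            for table in page.find_tables().tables:
                tables.append(table.extract())
    return {"text": "\n".join(text_parts), "tables": tables}


def extract_tables_with_pdfplumber(pdf_path: str) -> List[Table]:
    """Optional fallback for table-heavy documents."""
    import pdfplumber  # optional dependency

    tables: List[Table] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


def parse_pdf(pdf_path: str) -> dict:
    """Try PyMuPDF first; fall back to pdfplumber only if no usable table was found."""
    result = extract_with_pymupdf(pdf_path)
    if not has_usable_tables(result["tables"]):
        try:
            result["tables"] = extract_tables_with_pdfplumber(pdf_path)
        except ImportError:
            pass  # pdfplumber is not installed; keep the PyMuPDF result
    return result
```

Importing the optional library inside the fallback function keeps PyMuPDF as the only hard dependency. The Unstructured path for non-standard layouts is omitted from this sketch.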
Extraction Targets
Financial Tables
Narrative Text
Metadata
Output JSON Schema
{
  "company": "Maybank",
  "period": "Q3 2025",
  "filing_date": "2025-11-15",
  "tables": {
    "income_statement": {},
    "balance_sheet": {},
    "cash_flow": {}
  },
  "narrative": {
    "management_discussion": "",
    "outlook": ""
  }
}
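
As an illustration of how a parser might assemble a record in this shape, here is a minimal sketch using only the Python standard library. The `build_report` helper and its arguments are hypothetical; the key names come from the schema above, and the assumption that `filing_date` is an ISO 8601 date is inferred from the example value.

```python
import json
from datetime import date

TABLE_KEYS = ("income_statement", "balance_sheet", "cash_flow")
NARRATIVE_KEYS = ("management_discussion", "outlook")


def build_report(company, period, filing_date, tables=None, narrative=None):
    """Hypothetical helper: return a dict shaped like the output schema."""
    # Assumption: filing_date is an ISO 8601 date; invalid strings raise ValueError.
    date.fromisoformat(filing_date)
    tables = tables or {}
    narrative = narrative or {}
    return {
        "company": company,
        "period": period,
        "filing_date": filing_date,
        # Sections that were not extracted default to empty values, as in the schema.
        "tables": {key: tables.get(key, {}) for key in TABLE_KEYS},
        "narrative": {key: narrative.get(key, "") for key in NARRATIVE_KEYS},
    }


report = build_report("Maybank", "Q3 2025", "2025-11-15")
print(json.dumps(report, indent=2))
```

Filling in every key, even when a section was not found, keeps the output shape stable for downstream consumers.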