Email Processing
Overview
Once an inbox is linked and active, Unspend processes its emails to find and extract invoice data. The pipeline fetches emails from the past year, identifies which ones contain invoices, extracts financial details (amount, currency, dates), and stores the resulting invoice records.
Pipeline
1. Fetch Emails
The system refreshes its access credentials and lists all emails from the past year for the linked inbox. Each email is then queued for individual processing, with emails from the same sender processed sequentially to avoid conflicts.
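The fetch step can be sketched roughly as follows. This is a minimal illustration, not Unspend's actual implementation: the `Email` dataclass, the `plan_queues` function, the 365-day lookback, lowercase sender normalization, and oldest-first ordering are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Email:
    # Hypothetical minimal email record used only in this sketch.
    id: str
    sender: str
    received_at: datetime


def plan_queues(emails, now, lookback_days=365):
    """Group emails inside the lookback window into one queue per sender."""
    cutoff = now - timedelta(days=lookback_days)
    queues = defaultdict(list)
    for email in emails:
        if email.received_at >= cutoff:
            # Assumption: sender addresses are compared case-insensitively.
            queues[email.sender.lower()].append(email)
    for queue in queues.values():
        # Assumption: each sender's emails are handled oldest first.
        queue.sort(key=lambda e: e.received_at)
    return dict(queues)
```

Each queue would then be drained one email at a time, while different senders' queues could run in parallel.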
2. Deduplication
If an email has already been successfully processed, it is skipped. This makes the pipeline safe to re-run.
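A minimal sketch of this check, assuming statuses are kept in a mapping from email ID to status string (the function name and the mapping are hypothetical). The text only says that successfully processed emails are skipped; whether emails in other statuses are retried on a re-run is an assumption of this example.

```python
def needs_processing(email_id: str, status_by_id: dict) -> bool:
    """Return False only for emails already marked "Processed"."""
    # Assumption: emails with any other status (or no status) are processed again.
    return status_by_id.get(email_id) != "Processed"
```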
3. Filter
Two checks determine whether to continue processing:
- Ignored senders: If the sender is on the ignored list, the email is marked as "Ignored Sender" and skipped.
- Keyword matching: If the sender is not a known vendor, the email subject must contain at least one of the keywords "invoice", "receipt", or "bill" to proceed. An email from an unknown sender with none of these keywords in its subject is marked "No Match".
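The two checks above might look like the sketch below. The function name, the use of plain sets for the ignored and known-vendor senders, and case-insensitive substring matching (so "billing" also matches "bill") are assumptions; the source does not say how matching is performed.

```python
from typing import Optional, Set

INVOICE_KEYWORDS = ("invoice", "receipt", "bill")


def filter_email(
    sender: str,
    subject: str,
    ignored_senders: Set[str],
    known_vendor_senders: Set[str],
) -> Optional[str]:
    """Return a terminal status, or None if processing should continue."""
    sender = sender.lower()
    if sender in ignored_senders:
        return "Ignored Sender"
    if sender in known_vendor_senders:
        # Known vendors bypass the keyword check.
        return None
    subject_lower = subject.lower()
    if any(keyword in subject_lower for keyword in INVOICE_KEYWORDS):
        return None
    return "No Match"
```

Checking the ignored list first means an ignored sender is skipped even if it also happens to be a known vendor, which matches the order the checks are listed in.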
4a. Known Vendor Path
If the sender matches a known vendor, the system uses that vendor's most recent extraction template to parse the email directly. No LLM involvement is needed.
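As an illustration of picking the "most recent" template, the sketch below assumes each template carries a creation timestamp. The `Template` dataclass and its fields are hypothetical; the source does not describe how templates are versioned.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Template:
    # Hypothetical shape: one regex pattern per extracted field.
    created_at: datetime
    patterns: Dict[str, str]


def latest_template(templates: List[Template]) -> Optional[Template]:
    """Return the newest template, or None if the vendor has none."""
    return max(templates, key=lambda t: t.created_at, default=None)
```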
4b. Unknown Vendor Path
If the sender is not a known vendor but the email passed keyword filtering, the following steps run in order:
- Invoice extraction: An LLM reads the email content and extracts the amount, currency, invoice date, and billing period. If the LLM determines it is not an invoice, the email is marked "Not Invoice".
- Vendor identification: The LLM identifies the vendor name, website, and description. If it cannot, the email is marked "Failed — Unknown Vendor".
- Template generation: The system iteratively generates a regex-based extraction template and validates it against the LLM's initial extraction. If the template produces different values, it is refined and retried (up to 4 iterations). If all iterations fail, the email is marked "Failed — Couldn't Generate Template".
- Vendor creation: A new vendor record is created with the identified information. The vendor's logo is downloaded from the vendor's website domain. The LLM assigns the vendor to one of these categories: Miscellaneous, Marketing, Sales, Engineering, Operations.
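The generate-validate-refine loop from the template generation step can be sketched as below. This is an assumption-laden outline: `propose` stands in for whatever component produces a candidate template (its signature is invented here), a template is modeled as a dict of field name to regex with exactly one capture group, and the previous attempt's output is passed back as feedback.

```python
import re
from typing import Callable, Dict, Optional

MAX_ITERATIONS = 4  # the limit stated in the text


def apply_template(template: Dict[str, str], content: str) -> Optional[Dict[str, str]]:
    """Apply every pattern; return None if any field fails to match."""
    result = {}
    for field, pattern in template.items():
        match = re.search(pattern, content)
        if match is None:
            return None
        result[field] = match.group(1)
    return result


def generate_template(
    content: str,
    expected: Dict[str, str],
    propose: Callable[[str, Dict[str, str], Optional[Dict[str, str]]], Dict[str, str]],
) -> Optional[Dict[str, str]]:
    """Return a template that reproduces `expected`, or None after 4 tries."""
    feedback = None
    for _ in range(MAX_ITERATIONS):
        template = propose(content, expected, feedback)
        extracted = apply_template(template, content)
        if extracted == expected:
            return template
        # Hand the mismatching values back so the next proposal can be refined.
        feedback = extracted
    return None
```

A `None` result corresponds to the "Failed — Couldn't Generate Template" status.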
5. Parse
The vendor's extraction template is applied to the email content (HTML body or PDF attachment). The template uses regex patterns to extract:
- Invoice amount
- Currency
- Invoice date
- Billing period (start and end dates)
If parsing fails, the email is marked "Failed — Parsing" and a detailed failure record is stored for debugging.
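To make the regex-based extraction concrete, here is a small sketch. The template layout, the specific patterns, the ISO date format, and the `ParsingError` exception are all invented for illustration; real templates are generated per vendor and their format is not described in this document.

```python
import re
from datetime import date
from decimal import Decimal

# Hypothetical template: one regex with a single capture group per field.
TEMPLATE = {
    "amount": r"Total:\s*[A-Z]{3}\s*([\d,]+\.\d{2})",
    "currency": r"Total:\s*([A-Z]{3})",
    "invoice_date": r"Invoice date:\s*(\d{4}-\d{2}-\d{2})",
    "period_start": r"Period:\s*(\d{4}-\d{2}-\d{2})",
    "period_end": r"Period:\s*\d{4}-\d{2}-\d{2}\s+to\s+(\d{4}-\d{2}-\d{2})",
}


class ParsingError(Exception):
    """Raised when a field's pattern does not match the content."""

    def __init__(self, field: str):
        super().__init__(f"could not extract {field!r}")
        self.field = field


def parse_invoice(template: dict, content: str) -> dict:
    raw = {}
    for field, pattern in template.items():
        match = re.search(pattern, content)
        if match is None:
            # Recording the failing field is what a failure record could store.
            raise ParsingError(field)
        raw[field] = match.group(1)
    return {
        "amount": Decimal(raw["amount"].replace(",", "")),
        "currency": raw["currency"],
        "invoice_date": date.fromisoformat(raw["invoice_date"]),
        "period_start": date.fromisoformat(raw["period_start"]),
        "period_end": date.fromisoformat(raw["period_end"]),
    }
```

Using `Decimal` rather than `float` for the amount avoids binary rounding surprises with currency values.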
6. Store
On successful parsing:
- The invoice document is uploaded to object storage: a PDF attachment is uploaded as-is, and an HTML email body is converted to PDF first
- An invoice record is created with the extracted financial data
- The email is marked "Processed"
Email Statuses
| Status | Meaning |
|---|---|
| Processing | Currently being processed |
| Processed | Invoice successfully extracted and stored |
| Ignored Sender | Sender is on the ignored list |
| No Match | Unknown sender and no invoice keywords in subject |
| Not Invoice | LLM determined the email is not an invoice |
| Failed — Unknown Vendor | Could not identify the vendor |
| Failed — Couldn't Generate Template | Template generation exhausted all iterations |
| Failed — Unknown Format | Attachment could not be read as a PDF |
| Failed — Parsing | Extraction template could not parse the content |
| Failed — Unknown Error | An unexpected error occurred during processing |
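
For code that consumes these statuses, one possible representation is an enumeration whose values are the display strings from the table. The class name, member names, and the `is_failure` helper are assumptions for this sketch.

```python
from enum import Enum


class EmailStatus(Enum):
    PROCESSING = "Processing"
    PROCESSED = "Processed"
    IGNORED_SENDER = "Ignored Sender"
    NO_MATCH = "No Match"
    NOT_INVOICE = "Not Invoice"
    FAILED_UNKNOWN_VENDOR = "Failed — Unknown Vendor"
    FAILED_TEMPLATE = "Failed — Couldn't Generate Template"
    FAILED_UNKNOWN_FORMAT = "Failed — Unknown Format"
    FAILED_PARSING = "Failed — Parsing"
    FAILED_UNKNOWN_ERROR = "Failed — Unknown Error"

    @property
    def is_failure(self) -> bool:
        # All failure statuses in the table share the "Failed" prefix.
        return self.value.startswith("Failed")
```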
Invoice Sources
Invoices can come from two sources within an email:
- HTML body: The invoice content is embedded in the email itself. Converted to PDF for storage.
- PDF attachment: The invoice is attached as a PDF file. Stored as-is.
When an email has attachments, the first PDF attachment is used as the invoice source.
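Source selection could be sketched as follows. The dataclasses, detecting PDFs by the `application/pdf` MIME type, and falling back to the HTML body when no PDF attachment exists are assumptions; the source only states that the first PDF attachment is used when attachments are present.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class Attachment:
    filename: str
    content_type: str
    data: bytes


@dataclass
class Email:
    html_body: Optional[str]
    attachments: List[Attachment] = field(default_factory=list)


def select_invoice_source(email: Email) -> Optional[Tuple[str, Union[Attachment, str]]]:
    """Pick the first PDF attachment, else the HTML body, else None."""
    for attachment in email.attachments:
        if attachment.content_type == "application/pdf":
            return ("pdf_attachment", attachment)  # stored as-is
    if email.html_body:
        return ("html_body", email.html_body)  # converted to PDF before storage
    return None
```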