Email Processing

Overview

Once an inbox is linked and active, Unspend processes its emails to find and extract invoice data. The pipeline fetches emails from the past year, identifies which ones contain invoices, extracts financial details (amount, currency, dates), and stores the resulting invoice records.

Pipeline

1. Fetch Emails

The system refreshes its access credentials and lists all emails from the past year for the linked inbox. Each email is then queued for individual processing, with emails from the same sender processed sequentially to avoid conflicts.
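One way to honour the per-sender ordering is to group the fetched emails into one queue per sender and drain each queue in order, while different senders run in parallel. The sketch below is illustrative only: the email shape (dicts with `id` and `sender`), the `handle` callback, and the thread pool are assumptions, not Unspend's actual implementation.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_sender(emails):
    """Build one FIFO queue per sender, keeping arrival order inside each queue."""
    queues = defaultdict(list)
    for email in emails:
        # Assumption: sender addresses are compared case-insensitively.
        queues[email["sender"].lower()].append(email)
    return dict(queues)


def process_inbox(emails, handle, max_workers=4):
    """Run `handle` on every email: sequential per sender, parallel across senders."""
    queues = group_by_sender(emails)

    def drain(queue):
        # Emails from one sender are handled strictly one after another.
        return [handle(email) for email in queue]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the queues it was given.
        per_sender = list(pool.map(drain, queues.values()))
    return dict(zip(queues.keys(), per_sender))
```

Because each sender's queue is drained by a single worker, two emails from the same sender never run at the same time.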

2. Deduplication

If an email has already been successfully processed, it is skipped. This makes the pipeline safe to re-run.
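The skip rule can be sketched as a status lookup before any work is done. This assumes a hypothetical `statuses` mapping from email ID to status string, and it assumes that only the "Processed" status counts as "successfully processed", so emails in any other state are attempted again on a re-run.

```python
def run_pipeline(email_ids, statuses, process):
    """Process each email unless it is already marked "Processed".

    `statuses` maps email ID -> status string and is updated in place.
    `process` is a caller-supplied function returning the final status.
    Returns the IDs that were actually handled during this run.
    """
    handled = []
    for email_id in email_ids:
        if statuses.get(email_id) == "Processed":
            continue  # already done; a re-run leaves it untouched
        statuses[email_id] = process(email_id)
        handled.append(email_id)
    return handled
```

Running the function twice over the same IDs does no extra work for emails that succeeded the first time.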

3. Filter

Two checks determine whether to continue processing:

  • Ignored senders: If the sender is on the ignored list, the email is marked as "Ignored Sender" and skipped.
  • Keyword matching: If the sender is not a known vendor, the email subject must contain at least one of the keywords "invoice", "receipt", or "bill" to proceed. An email from an unknown sender with none of these keywords in its subject is marked "No Match".
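The two checks can be read as a small routing function. The sketch below makes several assumptions that the text above does not settle: senders are matched by exact, case-insensitive address; keywords are matched as case-insensitive substrings (so "billing" would match "bill"); and the return values "known_vendor" and "unknown_vendor" are hypothetical labels for the two downstream paths.

```python
KEYWORDS = ("invoice", "receipt", "bill")


def route_email(sender, subject, ignored_senders, known_vendor_senders):
    """Return a terminal status or the name of the path the email should take."""
    sender = sender.lower()
    if sender in ignored_senders:
        return "Ignored Sender"
    if sender in known_vendor_senders:
        # Known vendors skip the keyword check entirely.
        return "known_vendor"
    subject = subject.lower()
    if any(keyword in subject for keyword in KEYWORDS):
        return "unknown_vendor"
    return "No Match"
```

For example, `route_email("a@b.test", "Weekly newsletter", set(), set())` returns `"No Match"` because the sender is unknown and the subject contains none of the keywords.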

4a. Known Vendor Path

If the sender matches a known vendor, the system uses that vendor's most recent extraction template to parse the email directly. No LLM involvement is needed.
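Picking the "most recent" template might look like the following, assuming each stored template carries a `created_at` timestamp. That field name and the dict shape are hypothetical.

```python
from datetime import datetime


def latest_template(templates):
    """Return the template with the newest `created_at`, or None if there are none."""
    if not templates:
        return None
    return max(templates, key=lambda template: template["created_at"])
```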

4b. Unknown Vendor Path

If the sender is not a known vendor but the email passed keyword filtering:

  1. Invoice extraction: An LLM reads the email content and extracts the amount, currency, invoice date, and billing period. If the LLM determines it is not an invoice, the email is marked "Not Invoice".
  2. Vendor identification: The LLM identifies the vendor name, website, and description. If it cannot, the email is marked "Failed — Unknown Vendor".
  3. Template generation: The system iteratively generates a regex-based extraction template and validates it against the LLM's initial extraction. If the template produces different values, it is refined and retried (up to 4 iterations). If all iterations fail, the email is marked "Failed — Couldn't Generate Template".
  4. Vendor creation: A new vendor record is created with the identified information. The vendor's logo is downloaded from their website domain. The vendor is assigned to a category selected by the LLM from: Miscellaneous, Marketing, Sales, Engineering, Operations.
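The generate-and-validate loop in step 3 can be sketched as follows. The `propose` and `apply_template` callables stand in for the LLM call and the regex engine and are hypothetical, as is the `feedback` structure passed to the next attempt. The sketch also assumes that "up to 4 iterations" means four attempts in total.

```python
MAX_ITERATIONS = 4


def generate_template(content, expected, propose, apply_template):
    """Return a template whose output equals `expected`, or None if none is found.

    propose(content, expected, feedback) -> candidate template
    apply_template(template, content)    -> extracted values
    """
    feedback = None  # nothing to refine on the first attempt
    for _ in range(MAX_ITERATIONS):
        template = propose(content, expected, feedback)
        actual = apply_template(template, content)
        if actual == expected:
            return template
        # Hand the mismatch back so the next proposal can correct it.
        feedback = {"template": template, "actual": actual}
    return None  # caller marks the email "Failed — Couldn't Generate Template"
```

Validating against the LLM's initial extraction means a template is only accepted when it reproduces the same values on the email it was generated from.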

5. Parse

The vendor's extraction template is applied to the email content (HTML body or PDF attachment). The template uses regex patterns to extract:

  • Invoice amount
  • Currency
  • Invoice date
  • Billing period (start and end dates)

If parsing fails, the email is marked "Failed — Parsing" and a detailed failure record is stored for debugging.
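As an illustration of regex-based extraction, the sketch below applies one pattern per field and raises an error naming the first field that did not match. The template format, the example patterns, and the `ParsingFailed` exception are all assumptions; the real template schema is not described here.

```python
import re
from datetime import date
from decimal import Decimal


class ParsingFailed(Exception):
    """Raised when a field's pattern has no match; keeps the field name for debugging."""

    def __init__(self, field):
        super().__init__(f"no match for field {field!r}")
        self.field = field


# Hypothetical template: one regex per field, each with a single capture group.
EXAMPLE_TEMPLATE = {
    "amount": r"Total:\s*([\d,]+\.\d{2})",
    "currency": r"Currency:\s*([A-Z]{3})",
    "invoice_date": r"Invoice date:\s*(\d{4}-\d{2}-\d{2})",
    "period_start": r"Period:\s*(\d{4}-\d{2}-\d{2})\s+to",
    "period_end": r"Period:\s*\d{4}-\d{2}-\d{2}\s+to\s+(\d{4}-\d{2}-\d{2})",
}


def parse_invoice(template, text):
    """Extract all five fields from `text` and convert them to typed values."""
    raw = {}
    for field, pattern in template.items():
        match = re.search(pattern, text)
        if match is None:
            raise ParsingFailed(field)
        raw[field] = match.group(1)
    return {
        "amount": Decimal(raw["amount"].replace(",", "")),  # drop thousands separators
        "currency": raw["currency"],
        "invoice_date": date.fromisoformat(raw["invoice_date"]),
        "period_start": date.fromisoformat(raw["period_start"]),
        "period_end": date.fromisoformat(raw["period_end"]),
    }
```

A caller would catch `ParsingFailed`, mark the email "Failed — Parsing", and store the failing field name as part of the failure record.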

6. Store

On successful parsing:

  • The email content or PDF attachment is uploaded to object storage
  • HTML email bodies are converted to PDF before upload
  • An invoice record is created with the extracted financial data
  • The email is marked "Processed"

Email Statuses

Status                               Meaning
Processing                           Currently being processed
Processed                            Invoice successfully extracted and stored
Ignored Sender                       Sender is on the ignored list
No Match                             Unknown sender and no invoice keywords in subject
Not Invoice                          LLM determined the email is not an invoice
Failed — Unknown Vendor              Could not identify the vendor
Failed — Couldn't Generate Template  Template generation exhausted all iterations
Failed — Unknown Format              Attachment could not be read as a PDF
Failed — Parsing                     Extraction template could not parse the content
Failed — Unknown Error               An unexpected error occurred during processing
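For code that consumes these statuses, an enumeration keeps the labels in one place. The class below is a sketch: the member names and the `is_failure` helper are hypothetical, and only the string values come from the table above.

```python
from enum import Enum


class EmailStatus(Enum):
    PROCESSING = "Processing"
    PROCESSED = "Processed"
    IGNORED_SENDER = "Ignored Sender"
    NO_MATCH = "No Match"
    NOT_INVOICE = "Not Invoice"
    FAILED_UNKNOWN_VENDOR = "Failed — Unknown Vendor"
    FAILED_TEMPLATE = "Failed — Couldn't Generate Template"
    FAILED_UNKNOWN_FORMAT = "Failed — Unknown Format"
    FAILED_PARSING = "Failed — Parsing"
    FAILED_UNKNOWN_ERROR = "Failed — Unknown Error"

    @property
    def is_failure(self):
        """True for the statuses whose label starts with "Failed"."""
        return self.value.startswith("Failed")
```

Looking a status up by its label, for example `EmailStatus("No Match")`, returns the matching member and raises `ValueError` for an unknown label.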

Invoice Sources

Invoices can come from two sources within an email:

  • HTML body: The invoice content is embedded in the email itself. Converted to PDF for storage.
  • PDF attachment: The invoice is attached as a PDF file. Stored as-is.

When attachments are present, the first PDF attachment is used.
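The source selection can be sketched as below. It assumes attachments arrive as dicts with `content_type` and `data` keys, that PDFs are recognised by the `application/pdf` MIME type, and that a PDF attachment takes precedence over the HTML body. These details are illustrative assumptions.

```python
def pick_invoice_source(html_body, attachments):
    """Return (kind, payload, needs_pdf_conversion), or None if nothing is usable."""
    for attachment in attachments:
        if attachment["content_type"] == "application/pdf":
            # The first PDF attachment wins and is stored as-is.
            return ("pdf_attachment", attachment["data"], False)
    if html_body:
        # HTML bodies are converted to PDF before storage.
        return ("html_body", html_body, True)
    return None
```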