OCR Pipeline: From Blurry Scan to Structured Data in Under 3 Seconds

Maxime Champoux · February 5, 2026 · 10 min read

Eighty percent of the world's financial data starts as an image. An invoice photographed on a loading dock. A receipt faxed from a machine manufactured in 2003. A PDF that looks digital but is actually a scanned printout of a screenshot. If you want to build accounting software that works in the real world, you need to process all of it.

We have processed tens of thousands of invoices through Well's OCR pipeline. The median processing time is under three seconds. But the interesting part is not speed. It is what happens between the raw image and the structured record, specifically how a language model catches the errors that traditional OCR cannot.

The Problem with Real Documents

OCR technology has existed for decades. The basic version works fine on clean, well-lit, properly oriented documents. The problem is that almost no document in the wild meets those criteria.

Here is what actually arrives in an accounts payable inbox: a phone photo taken at an angle on a dimly lit warehouse floor. A faxed invoice where the toner was running low, so half the characters are ghosted. A PDF that has been printed, signed with a pen, scanned back in, emailed, printed again by the recipient, and then photographed. Each generation degrades the text further. By the time it reaches your system, the letter "O" and the number "0" are indistinguishable. The euro sign has merged with the first digit. A coffee ring obscures the invoice date.

Traditional OCR engines treat each character as an isolated classification problem. They look at a cluster of pixels and ask: is this a "5" or an "S"? Without context, the answer is often wrong. An OCR engine does not know that "Janury" is not a word, that "€1,0OO" contains a letter where a digit should be, or that a tax rate of 190% is implausible in any jurisdiction.

This is the gap we set out to close.

The Pipeline

Well's document processing pipeline has five stages. Each one exists because we discovered, usually through a production failure, that skipping it produces bad data.

Stage 1: Pre-processing

Before any text extraction happens, the raw document goes through image conditioning. Auto-rotation corrects documents photographed sideways or upside down. Noise reduction cleans up artifacts from low-quality cameras and fax machines. Contrast enhancement makes faded text readable.
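As a rough illustration of the contrast step, here is a minimal sketch of min-max contrast stretching on a grayscale image held as nested lists. The article does not say which enhancement method Well uses, so the technique, the `stretch_contrast` name, and the sample pixel values are all assumptions chosen for illustration.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values


def stretch_contrast(image: Gray) -> Gray:
    """Linearly rescale pixels so the darkest maps to 0 and the brightest to 255."""
    flat = [p for row in image for p in row]
    if not flat:
        return []
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A uniform image has no contrast to recover; return an unchanged copy.
        return [row[:] for row in image]
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in image]


# Hypothetical faded scan: every value sits in a narrow 120-150 band.
faded = [
    [120, 130, 125],
    [128, 150, 122],
    [121, 149, 127],
]
enhanced = stretch_contrast(faded)
```

After stretching, the narrow band of gray values spans the full 0-255 range, which gives the OCR engines a clearer separation between ink and paper.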

For documents that are already digital PDFs with embedded text layers, we skip the image path entirely and extract text directly. This sounds obvious, but a surprising number of OCR systems run pixel-based extraction on digital PDFs, throwing away perfectly good text data and introducing errors in the process.

The pre-processing stage also handles password-protected files, multi-page documents, and the occasional PDF that claims to have a text layer but actually contains garbage Unicode from a corrupted export.
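The routing decision between the text path and the image path could be sketched as follows. This is a heuristic under stated assumptions, not Well's actual check: the function names, the 20-character minimum, and the 5% junk threshold are all hypothetical.

```python
import unicodedata


def text_layer_is_usable(text: str, min_chars: int = 20,
                         max_bad_ratio: float = 0.05) -> bool:
    """Reject text layers that are too short or dominated by junk code points."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    bad = 0
    for ch in stripped:
        category = unicodedata.category(ch)
        # U+FFFD is the replacement character left behind by failed decoding.
        # Co = private use, Cn = unassigned, Cc = control characters.
        if ch == "\ufffd" or category in ("Co", "Cn"):
            bad += 1
        elif category == "Cc" and ch not in "\n\r\t":
            bad += 1
    return bad / len(stripped) <= max_bad_ratio


def choose_path(embedded_text: str) -> str:
    """Return "text" to extract directly, or "image" to fall back to OCR."""
    return "text" if text_layer_is_usable(embedded_text) else "image"


clean = "Invoice 2024-001\nTotal: €1,000.00\nDue: 15 January 2024"
corrupted = "\ufffd\ue000\ue001" * 10 + "Invoice"
```

Here `choose_path(clean)` selects the text path, while `choose_path(corrupted)` and `choose_path("")` both fall back to the image path.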

Stage 2: OCR Extraction

The conditioned image goes through optical character recognition. We use a multi-engine approach, running multiple OCR models and comparing their outputs. Where the engines agree, confidence is high. Where they disagree, we flag the text region for closer inspection in the next stage.
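The agreement check can be sketched with Python's standard `difflib`. This assumes two engines that each return a plain string and compares them token by token; the function name and the sample outputs are illustrative, not Well's implementation.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def find_disagreements(engine_a: str, engine_b: str) -> List[Tuple[List[str], List[str]]]:
    """Return pairs of token runs where the two OCR outputs differ."""
    a_tokens, b_tokens = engine_a.split(), engine_b.split()
    matcher = SequenceMatcher(a=a_tokens, b=b_tokens, autojunk=False)
    flagged = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            # Keep both readings so the next stage can weigh them in context.
            flagged.append((a_tokens[a_start:a_end], b_tokens[b_start:b_end]))
    return flagged


# Hypothetical outputs from two engines reading the same line.
output_a = "Total due: €2,4S7.80 Date: 02/I1/2024"
output_b = "Total due: €2,457.80 Date: 02/11/2024"
disagreements = find_disagreements(output_a, output_b)
```

Tokens on which both engines agree pass through untouched; only the amount and the date are flagged here, each with both candidate readings attached.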

This is standard practice in production OCR systems. The accuracy at this stage typically lands between 85% and 95%, depending on document quality, with outliers on both sides. For a clean digital invoice, it is above 99%. For a photo of a crumpled receipt, it might be as low as 80%.

The gap between 95% and 99.5% is where most OCR projects stall. Closing it with traditional techniques requires disproportionately more effort: custom training data, domain-specific models, hand-tuned post-processing rules. We took a different approach.

Stage 3: LLM Validation

This is the step that changed everything for us.

After OCR extraction produces raw text, we pass it through a large language model. Not to re-extract the text, but to validate and correct it. The difference matters.

The LLM receives the OCR output along with confidence scores for ambiguous regions. It then applies something that traditional OCR completely lacks: contextual understanding.
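To make the input concrete, here is a sketch of how the OCR text and its low-confidence regions might be packaged into a validation prompt. The article does not describe the actual prompt or schema, so `FlaggedRegion`, `build_validation_prompt`, the instruction wording, and the JSON keys are all hypothetical. No model is called here.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class FlaggedRegion:
    label: str         # hypothetical field label, e.g. "total"
    text: str          # raw OCR text for the region
    confidence: float  # 0.0-1.0, lower means the engines were less sure


def build_validation_prompt(ocr_text: str, regions: List[FlaggedRegion]) -> str:
    """Combine fixed instructions with the OCR output as a JSON payload."""
    instructions = (
        "You are validating OCR output from a financial document. "
        "Do not re-extract the text. Correct only characters that are "
        "implausible in context and return the corrected fields as JSON."
    )
    payload = {
        "ocr_text": ocr_text,
        "low_confidence_regions": [asdict(r) for r in regions],
    }
    # ensure_ascii=False keeps symbols such as the euro sign readable.
    return instructions + "\n\n" + json.dumps(payload, ensure_ascii=False, indent=2)


prompt = build_validation_prompt(
    "Total: €1,0OO  Date: Janury 15, 2024",
    [FlaggedRegion(label="total", text="€1,0OO", confidence=0.41)],
)
```

Passing the confidence scores alongside the text tells the model where to look first, while the instruction to validate rather than re-extract keeps it from rewriting regions the OCR engines already agreed on.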

When the OCR engine outputs "€1,0OO", the LLM recognizes that the letter "O" appearing in a numeric amount field is almost certainly a zero. It corrects it to "€1,000". When the extracted date reads "Janury 15, 2024", the LLM knows there is no month called "Janury" and corrects it to "January". When the vendor name comes through as "Amaz0n Web Servces", the LLM recognizes the company and fixes both the zero-for-O substitution and the missing "i".

These are not pattern-matching rules. We did not write a dictionary of common OCR errors. The LLM understands language, numbers, dates, company names, and the structure of financial documents. It catches errors that would take thousands of hand-coded rules to address, and it handles novel errors we have never seen before.

The LLM also infers missing fields. If the OCR failed to extract a currency symbol but the vendor address is in Germany and the invoice applies a 19% tax rate, the model infers EUR. If a line item quantity is illegible but the unit price and total are clear, it calculates the quantity.
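The quantity case is plain arithmetic: quantity equals line total divided by unit price. In the pipeline the model does this inference itself; the sketch below only shows the calculation, using `Decimal` to avoid floating-point drift. The function name and the whole-unit restriction are assumptions made for illustration.

```python
from decimal import Decimal
from typing import Optional


def infer_quantity(unit_price: Decimal, line_total: Decimal) -> Optional[Decimal]:
    """Recover a missing quantity, or return None when it cannot be trusted."""
    if unit_price <= 0:
        return None
    quantity = line_total / unit_price
    # Assumption: quantities are whole units. A fractional result suggests
    # that the unit price or the total was misread, so it is not accepted.
    if quantity != quantity.to_integral_value():
        return None
    return quantity.to_integral_value()


recovered = infer_quantity(Decimal("12.50"), Decimal("87.50"))
suspicious = infer_quantity(Decimal("12.50"), Decimal("90.00"))
```

With a unit price of 12.50 and a total of 87.50 the quantity comes out as 7. A total of 90.00 would imply 7.2 units, so the sketch declines to guess and returns `None`.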

Before we added this stage, our structured data accuracy was around 93%. After, it climbed to 99.2%. The residual errors are almost exclusively from documents so degraded that even a human would struggle to read them.

Stage 4: Entity Mapping

Clean, validated text is not the end goal. The extracted data needs to map onto Well's data model, which spans 12 typed entities covering everything a financial document might contain.

The mapping stage takes validated fields and assigns them to the correct entities: vendor information, invoice metadata, line items with quantities and unit prices, tax breakdowns, payment terms, bank details, purchase order references. Each field lands in a structured record that the rest of Well's system can query, reconcile, and process.

This is where domain specificity matters. A general-purpose document extraction tool might give you key-value pairs. We need data that fits a financial data model, with the relationships between entities preserved. The line items need to sum to the subtotal. The tax needs to match the rate applied to the correct base amount. The vendor needs to link to an existing record or trigger the creation of a new one.
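The consistency rules named above (line items sum to the subtotal, tax matches the rate applied to the base) can be sketched as a small checker. The `LineItem` shape, the one-cent tolerance, and the sample figures are hypothetical and are not Well's data model.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal
    total: Decimal


def check_invoice(items: List[LineItem], subtotal: Decimal, tax_rate: Decimal,
                  tax_amount: Decimal, tolerance: Decimal = Decimal("0.01")) -> List[str]:
    """Return a list of consistency problems; an empty list means all checks pass."""
    problems = []
    for item in items:
        if abs(item.quantity * item.unit_price - item.total) > tolerance:
            problems.append(f"line total mismatch: {item.description}")
    items_sum = sum((item.total for item in items), Decimal("0"))
    if abs(items_sum - subtotal) > tolerance:
        problems.append("line items do not sum to subtotal")
    if abs(subtotal * tax_rate - tax_amount) > tolerance:
        problems.append("tax does not match rate applied to subtotal")
    return problems


# Made-up invoice: two line items, 19% tax on a 1,000.00 subtotal.
items = [
    LineItem("Cement bags", Decimal("10"), Decimal("50.00"), Decimal("500.00")),
    LineItem("Steel beams", Decimal("4"), Decimal("125.00"), Decimal("500.00")),
]
ok = check_invoice(items, Decimal("1000.00"), Decimal("0.19"), Decimal("190.00"))
bad = check_invoice(items, Decimal("1000.00"), Decimal("0.19"), Decimal("109.00"))
```

The first call returns an empty list. The second, where two digits of the tax amount are transposed, reports that the tax does not match the rate, which is exactly the kind of cross-field error that isolated key-value extraction would miss.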

Stage 5: Storage and Linking

The original document is preserved on CDN storage and linked to the structured record. We call the final output an "invoice composite": the structured data plus the source document, permanently connected.
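One way to picture the composite is as an immutable pair of structured record and source link. This is only a sketch: the class shape, the field names, and the example URL are hypothetical, and the article does not describe how Well stores the link.

```python
from dataclasses import FrozenInstanceError, dataclass
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class InvoiceComposite:
    record: Mapping[str, Any]  # structured data extracted by the pipeline
    source_url: str            # location of the preserved original document


def make_composite(record: dict, source_url: str) -> InvoiceComposite:
    """Copy the record into a read-only view and bind it to its source."""
    return InvoiceComposite(MappingProxyType(dict(record)), source_url)


composite = make_composite(
    {"vendor": "Example Building Supplies", "total": "2457.80"},
    "https://cdn.example.com/invoices/original-photo.jpg",  # placeholder URL
)
```

Freezing the dataclass and wrapping the record in a read-only mapping means that neither the data nor the link can be reassigned after creation, which mirrors the "permanently connected" property described above.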

This matters for audit trails. When an accountant questions a line item, they can click through to the original document and see exactly what the pipeline extracted. If the pipeline made an error, the original is always available for manual correction.

A Document's Journey

To make this concrete, here is what happens when a construction company's site manager photographs an invoice on a job site.

The photo arrives at 14:23:07. It is a 3.2 megapixel image, slightly tilted, with a shadow across the bottom third. The invoice is from a building materials supplier, printed on yellow paper with a dot-matrix printer.

At 14:23:07.3, pre-processing kicks in. The image rotates 3 degrees clockwise. Contrast enhancement brightens the shadowed region. Noise reduction smooths the dot-matrix artifacts.

At 14:23:07.8, OCR extraction runs. Two engines process the image. They agree on most of the text but diverge on several characters in the amount column and the invoice date. The raw extraction reads the total as "€2,4S7.80" and the date as "02/I1/2024".

At 14:23:08.4, the LLM validation stage receives the raw text. It identifies that "€2,4S7.80" contains a letter "S" where the digit "5" should be and corrects the total to "€2,457.80". It recognizes "02/I1/2024" as a date where the letter "I" replaced the digit "1" and corrects it to "02/11/2024". It also catches that the vendor's postal code was extracted as "8O1O3" and corrects it to "80103".

At 14:23:08.9, entity mapping assigns the corrected data to Well's data model. Vendor name, address, and tax ID populate the vendor entity. Invoice number, date, and payment terms populate the invoice metadata. Three line items with descriptions, quantities, unit prices, and totals populate the line item entities. Tax breakdown populates the tax entity.

At 14:23:09.1, the invoice composite is stored. The original photo goes to CDN. The structured record goes to the database. Total elapsed time: 2.1 seconds from arrival, 1.8 seconds from the start of pre-processing.

The site manager sees a confirmation with the extracted data. Three line items, correct totals, matched vendor. If anything looks wrong, one tap opens the original photo for comparison.
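Laying the timestamps from this walkthrough side by side shows where the time goes. The snippet below only does the subtraction; the timestamps are the ones quoted above, and the variable names are illustrative.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

# Moments quoted in the walkthrough, in order.
timeline = [
    ("arrival", "14:23:07.0"),
    ("pre-processing", "14:23:07.3"),
    ("OCR extraction", "14:23:07.8"),
    ("LLM validation", "14:23:08.4"),
    ("entity mapping", "14:23:08.9"),
    ("storage", "14:23:09.1"),
]

stamps = [(name, datetime.strptime(ts, FMT)) for name, ts in timeline]

# Seconds between each moment and the one before it.
gaps = {
    name: (moment - stamps[i][1]).total_seconds()
    for i, (name, moment) in enumerate(stamps[1:])
}

from_arrival = (stamps[-1][1] - stamps[0][1]).total_seconds()
from_preprocessing = (stamps[-1][1] - stamps[1][1]).total_seconds()

for name, seconds in gaps.items():
    print(f"{name:>15}: +{seconds:.1f}s")
print(f"arrival to storage: {from_arrival:.1f}s")
print(f"pre-processing to storage: {from_preprocessing:.1f}s")
```

Measured from arrival, the journey takes 2.1 seconds; measured from the moment pre-processing starts, it takes 1.8 seconds. The longest single gap, 0.6 seconds, falls between the start of OCR extraction and the start of LLM validation.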

What Makes This Different

The OCR market is not short on solutions. Every major cloud provider offers document extraction APIs. Specialized vendors have built invoice-specific OCR products. The technology is mature.

What most of these solutions share is a common architecture: image processing, character recognition, template matching, and rule-based post-processing. They work well on documents that match their training data. They struggle with the long tail of formats, languages, and degradation levels that appear in real-world accounts payable.

The LLM validation layer changes the failure mode. Traditional OCR fails silently. It extracts "€1,0OO" and reports it as a successful extraction with high confidence. The downstream system records a payment of €10,000 or rejects the document entirely, depending on how it parses the garbled amount. Either way, someone has to catch the error manually.

With LLM validation, the system understands what it is reading. It does not just recognize characters; it comprehends invoices. It knows that amounts should be numbers, dates should be valid, and vendor names should be real companies. When the OCR output violates these expectations, the LLM flags and corrects it.

This is not a minor improvement in accuracy. It is a categorical change in how errors are handled. The system moves from pattern matching to comprehension.

The Foundation Layer

Every financial workflow at Well starts with a document. Expense management begins when someone photographs a receipt. Accounts payable begins when an invoice arrives. Bank reconciliation begins when a statement is imported.

The accuracy of everything downstream depends on the accuracy of document ingestion. If the OCR pipeline extracts the wrong amount, the wrong date, or the wrong vendor, every subsequent process inherits that error. Matching fails. Approvals route incorrectly. Reports contain wrong numbers.

Running at 99.2% field-level accuracy means that errors in downstream processes caused by document ingestion are nearly eliminated. For a company processing 500 invoices per month, and counting roughly one error per affected invoice, that is the difference between 35 manual corrections per month at 93% accuracy (7% of 500) and 4 at 99.2% (0.8% of 500).

The three-second processing time matters too, but differently. Speed does not just save time in the extraction step. It changes the workflow. When document processing takes minutes or hours, it becomes a batch job. Someone collects invoices, submits them, and checks results later. When it takes three seconds, it becomes interactive. The person holding the invoice gets immediate feedback. They can verify the extraction while the physical document is still in their hands.

What We Learned

Building this pipeline taught us that the hard part of document processing is not the OCR. Character recognition is a solved problem for clean inputs. The hard part is handling the infinite variety of real-world document quality and catching errors before they propagate.

The LLM validation layer is not a feature we planned from the start. It emerged from frustration with rule-based post-processing. Every time we added a rule to catch one class of error, three new error types appeared in production. The rules were playing whack-a-mole with the long tail of OCR failures.

Replacing hundreds of hand-coded rules with a model that understands language and context was the turning point. The error rate dropped. The maintenance burden dropped. And the system started handling document types it had never seen before, because comprehension generalizes in a way that rules do not.

Three seconds. Twelve entities. 99.2% accuracy. From a blurry photo on a loading dock to a structured financial record, ready for processing.

That is the pipeline.

Maxime Champoux

CEO & co-founder, Well

Maxime is the CEO and co-founder of Well. He built Well to rebuild finance around AI-native data, not spreadsheets.
