Chrome Extension Vturfu: When AI Takes the Browser Wheel to Fetch Your Invoices

Maxime Champoux | March 20, 2026 | 10 min read

OpenAI shipped Operator in January 2025. Google announced Project Mariner. Anthropic released computer use. The biggest AI labs are racing to build agents that control web browsers. We filed a patent on this approach for invoice retrieval in 2023, before any of them shipped.

I say this not to score points, but because we arrived at this solution from the problem side, not the technology side. The path we took explains why browser-controlling AI agents aren't just a cool demo. They're the only architecture that works for a specific, painful problem: getting invoices out of supplier portals.

Forty Portals, Forty Logins, Every Month

The average European SMB works with 40 to 60 suppliers. Each supplier has a portal. Each portal has invoices. Each portal has its own login flow, its own UI, its own way of burying the PDF download button three clicks deep behind a menu labeled "Documents" or "Billing" or, inexplicably, "Services."

I spent three years watching businesses deal with this. The accountant opens Chrome. Logs into portal number one. Clicks around. Downloads the invoice. Saves it. Opens portal number two. Repeats. Forty times a month.

This is not an edge case. It is the daily reality for the 500,000+ SMBs on Qonto, for every business using Pennylane, for every company that buys things from other companies. The invoices exist. They sit on web pages. There is no API to get them.

Why APIs Will Never Solve This

Every engineer who hears this problem says: "Just use the APIs." I had the same instinct. It's wrong.

Most supplier portals don't have APIs. The ones that do have APIs that don't expose invoice downloads. The ones that expose invoice downloads have authentication flows that change without notice. Even if you could build integrations for the top 100 suppliers, you'd cover maybe 30% of the invoices a typical business needs.

The long tail is infinite. A local office supply company in Lyon. A SaaS tool with 500 customers. A freelance designer who sends invoices through a custom WordPress plugin. These suppliers will never build APIs for invoice retrieval. They barely maintain their web portals.

We tried the API route. We built connectors. Each took days to build, broke within months, and covered a single supplier. After building 50, we'd covered a fraction of what our users needed. Approaching complete coverage would have taken thousands of connectors, each requiring ongoing maintenance. The math simply didn't work.

So we asked a different question: what if we stopped integrating with each portal individually and instead built something that could navigate any portal?

The LLM Reads the Page Like a Human Would

In early 2023, I fed GPT-4 a simplified DOM of a supplier portal login page. Not a screenshot. The actual HTML structure, stripped to the meaningful elements: input fields, buttons, links, labels.

I asked: "You're looking at a login page. The user's email is test@example.com. What should you do?"

It answered: "Type the email into the input field labeled 'Email' and click the button labeled 'Sign In.'"

The LLM didn't need a hardcoded selector. It didn't need a mapping file. It didn't need a Selenium script. It read the page the way a human would and decided what to do.

I tested it on ten login pages. Then fifty. Then a hundred. Different layouts, frameworks, languages. The LLM handled them all. Not perfectly, not every time, but with a success rate that made the connector approach look absurd.

The question shifted from "can this work?" to "how do we make it reliable, fast, and secure enough for production?"

How Vturfu Works

Vturfu is a Chrome extension that runs locally in the user's browser. The architecture is a perception-action loop, the same pattern as autonomous driving, but with DOM trees instead of cameras and click events instead of steering.

Step 1: Observe. The extension reads the current page's DOM and extracts a simplified representation: interactive elements, their labels, positions, and states. The page's skeleton. Everything a human would use to decide what to click, but structured as data.

Step 2: Decide. This simplified DOM goes to an LLM along with a goal ("Download the latest invoice from this supplier"). The model returns a single action: click this button, type in this field, scroll down, wait for this element.

Step 3: Act. The extension executes the action. The page updates.

Step 4: Loop. New DOM state. New LLM decision. New action. Repeat until the invoice is downloaded or the agent determines it cannot proceed.
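
The four steps above can be sketched as a small loop. This is a minimal illustration, not Vturfu's actual code: the `AgentAction` shape, the `AgentHooks` interface, and the synchronous `decide` stand-in for the LLM call are all assumptions made so the example runs outside a browser. A real model call would be asynchronous.

```typescript
// One action per model call. This union is an illustrative assumption.
type AgentAction =
  | { kind: "click"; targetId: string }
  | { kind: "type"; targetId: string; text: string }
  | { kind: "scroll"; direction: "up" | "down" }
  | { kind: "wait"; targetId: string }
  | { kind: "done" }
  | { kind: "give_up"; reason: string };

// Hypothetical hooks for the three moving parts of the loop.
interface AgentHooks {
  observe: () => string; // Step 1: simplified DOM, serialized as text
  decide: (goal: string, dom: string) => AgentAction; // Step 2: stand-in for the LLM call
  act: (action: AgentAction) => void; // Step 3: apply the action to the page
}

type LoopResult =
  | { outcome: "downloaded"; actions: number }
  | { outcome: "stopped"; actions: number; reason: string };

function runAgentLoop(goal: string, hooks: AgentHooks, maxActions = 15): LoopResult {
  let actions = 0;
  while (actions < maxActions) {
    const action = hooks.decide(goal, hooks.observe());
    if (action.kind === "done") return { outcome: "downloaded", actions };
    if (action.kind === "give_up") return { outcome: "stopped", actions, reason: action.reason };
    hooks.act(action);
    actions += 1; // Step 4: loop again with the new page state
  }
  return { outcome: "stopped", actions, reason: "action budget exhausted" };
}

// Usage: a scripted decision function driving a fake four-state portal.
const portalStates = ["login", "dashboard", "invoices", "downloaded"];
let stateIndex = 0;
const nextClick: Record<string, string> = {
  login: "sign-in",
  dashboard: "billing-link",
  invoices: "download-btn",
};

const loopResult = runAgentLoop("Download the latest invoice from this supplier", {
  observe: () => portalStates[stateIndex] ?? "unknown",
  decide: (_goal, dom) => {
    if (dom === "downloaded") return { kind: "done" };
    const targetId = nextClick[dom];
    return targetId ? { kind: "click", targetId } : { kind: "give_up", reason: `no plan for ${dom}` };
  },
  act: () => { stateIndex += 1; },
});
console.log(loopResult);
```

In this scripted run the loop finishes after three actions. The action budget is what keeps a confused agent from clicking forever: once it is spent, the loop reports that it cannot proceed.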

Each step takes one to two seconds. A typical invoice download requires five to fifteen actions: navigate to the portal, log in, find the invoices section, filter by date, click download. Total: 15 to 30 seconds per supplier. A human doing this manually takes two to five minutes.
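
To make the arithmetic concrete, here is the same comparison for a month of 40 suppliers. Every figure comes from the text above; only the helper and variable names are made up for the example.

```typescript
// Figures quoted in the article.
const suppliersPerMonth = 40;
const secondsPerStep = { low: 1, high: 2 };
const actionsPerInvoice = { low: 5, high: 15 };
const agentSecondsPerSupplier = { low: 15, high: 30 };
const manualMinutesPerSupplier = { low: 2, high: 5 };

// The per-step figures alone bound a run between 5 and 30 seconds;
// the quoted 15 to 30 seconds sits inside that range.
const stepBounds = {
  low: secondsPerStep.low * actionsPerInvoice.low,
  high: secondsPerStep.high * actionsPerInvoice.high,
};

function monthlyMinutes(secondsPerSupplier: number, suppliers: number): number {
  return (secondsPerSupplier * suppliers) / 60;
}

const agentMonthly = {
  low: monthlyMinutes(agentSecondsPerSupplier.low, suppliersPerMonth),
  high: monthlyMinutes(agentSecondsPerSupplier.high, suppliersPerMonth),
};
const manualMonthly = {
  low: monthlyMinutes(manualMinutesPerSupplier.low * 60, suppliersPerMonth),
  high: monthlyMinutes(manualMinutesPerSupplier.high * 60, suppliersPerMonth),
};

console.log({ stepBounds, agentMonthly, manualMonthly });
```

For 40 suppliers that works out to 10 to 20 minutes of unattended agent time per month, against 80 to 200 minutes of someone clicking through portals by hand.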

Why DOM Beats Pixels

You might ask why we don't use vision models to process screenshots, especially now that GPT-4V and Claude handle images well. We tried both. DOM wins on three dimensions.

Precision. Working with the DOM means knowing exactly which element to interact with. No coordinate ambiguity, no risk of clicking the wrong pixel. The LLM says "click the element with id='download-btn'" and the extension clicks exactly that element.

Speed. A simplified DOM is a few kilobytes of text. A screenshot is hundreds of kilobytes or more. Processing text is faster and cheaper. When you're pulling invoices from 40 suppliers, the difference between one second and five seconds per step compounds.

Reliability. DOM elements carry semantic meaning. A button labeled "Download" is unambiguously a download button. In a screenshot, visual elements can be ambiguous: is that a button or decoration? Is that text a label or a heading?
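
To give a feel for what a "simplified DOM" can look like, here is a minimal sketch. It is an assumption-heavy illustration rather than the extension's real extractor: `RawNode` is a plain-object stand-in for a DOM node so the code runs outside a browser, and the `SimplifiedElement` shape and the list of interactive tags are hypothetical.

```typescript
// Plain-object stand-in for a DOM node (assumption: lets the sketch run without a browser).
interface RawNode {
  tag: string;
  attrs?: Partial<Record<string, string>>;
  text?: string;
  children?: RawNode[];
}

// Hypothetical shape of what the model sees for each interactive element.
interface SimplifiedElement {
  ref: string;   // the element id if present, otherwise a positional path
  role: string;  // tag name, e.g. "button" or "input"
  label: string; // visible text, aria-label, or placeholder
}

const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

function simplifyDom(root: RawNode): SimplifiedElement[] {
  const found: SimplifiedElement[] = [];
  const visit = (node: RawNode, path: string): void => {
    if (INTERACTIVE_TAGS.has(node.tag)) {
      const attrs = node.attrs ?? {};
      found.push({
        ref: attrs["id"] ?? path,
        role: node.tag,
        label: (node.text || attrs["aria-label"] || attrs["placeholder"] || "").trim(),
      });
    }
    // Scripts, wrappers, and styling containers contribute nothing themselves.
    (node.children ?? []).forEach((child, i) => visit(child, `${path}/${child.tag}[${i}]`));
  };
  visit(root, root.tag);
  return found;
}

// Usage: a login page with a generated class name and an analytics script.
const loginPage: RawNode = {
  tag: "body",
  children: [
    {
      tag: "div",
      attrs: { class: "css-1x9f3k" },
      children: [
        { tag: "input", attrs: { id: "email", placeholder: "Email" } },
        { tag: "input", attrs: { type: "password", placeholder: "Password" } },
        { tag: "button", text: " Sign In " },
      ],
    },
    { tag: "script", text: "window.analytics = {}" },
  ],
};
const skeleton = simplifyDom(loginPage);
console.log(JSON.stringify(skeleton, null, 2));
```

The whole page reduces to three entries. The generated class name and the script never reach the model, and each entry carries a reference the extension can act on exactly.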

We fall back to vision when the DOM is obfuscated or when critical information is rendered in canvas elements. But for 90% of supplier portals, DOM-first is the correct approach.
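
One way to picture that fallback decision is a simple check on what the DOM pass found. The sketch below is purely illustrative: the `DomSummary` fields and the 0.5 threshold are assumptions, not the heuristics used in production.

```typescript
// Hypothetical summary of what the DOM pass found on the current page.
interface DomSummary {
  interactiveCount: number; // interactive elements found
  labeledCount: number;     // of those, how many carry a readable label
  canvasCount: number;      // <canvas> elements on the page
}

type ObservationMode = "dom" | "vision";

function chooseObservationMode(summary: DomSummary, minLabeledRatio = 0.5): ObservationMode {
  // Nothing to act on in the DOM but a canvas is present: the UI is probably drawn there.
  if (summary.interactiveCount === 0) {
    return summary.canvasCount > 0 ? "vision" : "dom";
  }
  // Most controls have no readable label: treat the DOM as obfuscated.
  const labeledRatio = summary.labeledCount / summary.interactiveCount;
  return labeledRatio < minLabeledRatio ? "vision" : "dom";
}

const ordinaryPortal = chooseObservationMode({ interactiveCount: 12, labeledCount: 11, canvasCount: 0 });
const canvasPortal = chooseObservationMode({ interactiveCount: 0, labeledCount: 0, canvasCount: 1 });
const obfuscatedPortal = chooseObservationMode({ interactiveCount: 10, labeledCount: 2, canvasCount: 0 });
console.log({ ordinaryPortal, canvasPortal, obfuscatedPortal });
```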

Your Credentials Never Leave Your Machine

This is where alarm bells go off: "You're letting an AI control my browser while I'm logged into financial portals?"

Yes. And it is more secure than the alternative.

The traditional approach to automated invoice collection requires handing your supplier portal credentials to a third-party service. That service stores your passwords on their servers, logs in on your behalf from their infrastructure, downloads invoices through their systems. Your credentials live on someone else's servers. One breach exposes thousands of businesses' supplier portal access.

Vturfu inverts this. The extension runs locally. Your credentials never leave your browser. The LLM receives a sanitized DOM structure, not your cookies, not your session tokens, not your passwords. It sees "there is a password field" and responds "type the password." The typing happens locally. The password travels from your password manager to your browser, the same path it takes when you type manually.

No credential is ever transmitted to our servers. No session token is exfiltrated. Even if our API were compromised, an attacker would get DOM skeletons, not credentials.
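
The split between what stays local and what is sent out can be sketched in a few lines. This is a hedged illustration of the idea, not the extension's real code: `LocalField`, `RemoteField`, `RemoteAction`, and the `localSecrets` object standing in for a password manager are all hypothetical names.

```typescript
// What the extension can see locally for one form field.
interface LocalField {
  ref: string;
  role: string;
  label: string;
  inputType: string | null;
  value: string; // live value in the page; must never leave the machine
}

// What is sent to the model: structure only, no values.
interface RemoteField {
  ref: string;
  role: string;
  label: string;
  inputType: string | null;
  filled: boolean;
}

function sanitizeForModel(fields: LocalField[]): RemoteField[] {
  return fields.map(({ ref, role, label, inputType, value }) => ({
    ref,
    role,
    label,
    inputType,
    filled: value.length > 0,
  }));
}

// The model names which secret to type, never the secret itself.
type SecretName = "email" | "password";
type RemoteAction =
  | { kind: "click"; ref: string }
  | { kind: "type_secret"; ref: string; secret: SecretName };

// Runs only inside the extension. `localSecrets` is a stand-in for the password manager.
function resolveLocally(
  action: RemoteAction,
  localSecrets: Record<SecretName, string>,
): { ref: string; text: string } | null {
  if (action.kind !== "type_secret") return null;
  return { ref: action.ref, text: localSecrets[action.secret] };
}

// Usage: the outbound payload contains the field, not what was typed into it.
const outbound = JSON.stringify(
  sanitizeForModel([
    { ref: "pwd", role: "input", label: "Password", inputType: "password", value: "hunter2" },
  ]),
);
const keystrokes = resolveLocally(
  { kind: "type_secret", ref: "pwd", secret: "password" },
  { email: "test@example.com", password: "hunter2" },
);
console.log(outbound, keystrokes !== null);
```

The design choice is that the model's answer is a placeholder ("type the password into `pwd`") and the substitution happens on the client, so a compromised API would see field structure and nothing more.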

We designed this architecture deliberately. The invoice collection problem requires access to authenticated sessions. The only responsible way to handle that is keeping authentication entirely on the client side.

Patent FR2500423

We filed French patent FR2500423 in 2023, covering the method of using LLMs to control browser navigation for document retrieval.

The technique of feeding DOM state to an LLM and executing its recommended actions for invoice retrieval is not an obvious combination. It brings together local execution for security, DOM simplification for efficiency, and goal-directed LLM control for generalization across arbitrary web UIs. We wanted to document priority and protect the specific application.

The patent covers three things: extracting and simplifying DOM state for LLM consumption, the action loop where an LLM directs browser actions based on page state, and applying this technique to financial document retrieval from authenticated web portals.

Since filing, every major AI lab has launched browser-control products. That validates the approach. It also means the patent carries real defensive value. We are not trying to prevent anyone from building browser agents. We are ensuring that our specific application to invoice retrieval is protected.

What Broke on the Way to Production

Getting from "the LLM can navigate a login page" to "this works for 500,000 businesses" meant solving problems invisible in any demo.

Portal diversity is staggering. Some portals use React with dynamically generated class names that change on every deploy. Some use iframes nested three levels deep. Some require two-factor authentication via SMS. Some show invoices in paginated tables, others in infinite scroll lists, others as individual detail pages you must open one by one.

We built a classification layer. Before the LLM starts navigating, it categorizes the portal type and selects the right strategy. A paginated table portal gets a different action sequence than an infinite scroll portal. This reduces LLM calls and increases reliability.
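
As a rough illustration of the idea, a classification layer can map a few page signals to a portal type and a matching set of strategy hints. The rule-based classifier below is a stand-in chosen for the example: the `PortalSignals` fields, the rules, and the hint strings are all assumptions.

```typescript
type PortalKind = "paginated_table" | "infinite_scroll" | "detail_pages";

// Hypothetical signals read from the simplified DOM of the invoices page.
interface PortalSignals {
  hasTableRows: boolean;
  hasNextPageControl: boolean;
  growsOnScroll: boolean;
}

function classifyPortal(signals: PortalSignals): PortalKind {
  if (signals.hasTableRows && signals.hasNextPageControl) return "paginated_table";
  if (signals.growsOnScroll) return "infinite_scroll";
  return "detail_pages";
}

// Strategy hints passed along with the goal, so the model starts with a plan.
const STRATEGY_HINTS: Record<PortalKind, string[]> = {
  paginated_table: ["find the row for the target month", "click its download link", "otherwise go to the next page"],
  infinite_scroll: ["scroll until the target month appears", "click its download link"],
  detail_pages: ["open the invoice detail page", "click download", "go back to the list"],
};

function planFor(signals: PortalSignals): { kind: PortalKind; hints: string[] } {
  const kind = classifyPortal(signals);
  return { kind, hints: STRATEGY_HINTS[kind] };
}

const tablePlan = planFor({ hasTableRows: true, hasNextPageControl: true, growsOnScroll: false });
const scrollPlan = planFor({ hasTableRows: false, hasNextPageControl: false, growsOnScroll: true });
console.log(tablePlan.kind, scrollPlan.kind);
```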

Error recovery was the hardest engineering problem. What happens when the LLM clicks the wrong button? When a page loads slowly and the DOM snapshot captures an intermediate state? When a portal throws up a CAPTCHA? We built a state machine that tracks the agent's progress, detects loops, and either retries with a different strategy or escalates to manual intervention.
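
A stripped-down version of that kind of tracker might look like the following. It is a sketch under stated assumptions: the three states, the fingerprint-plus-action key, and the default limits are invented for the example and are far simpler than a production state machine.

```typescript
type TrackerState = "navigating" | "retrying" | "escalated";

// Detects "same action on the same page state" loops. Limits are illustrative.
function createProgressTracker(maxRepeats = 2, maxStrategySwitches = 1) {
  const seen = new Map<string, number>();
  let switches = 0;
  let state: TrackerState = "navigating";

  return {
    // Call once per loop iteration, before executing the action.
    record(domFingerprint: string, actionKey: string): TrackerState {
      if (state === "escalated") return state;
      const key = `${domFingerprint}|${actionKey}`;
      const count = (seen.get(key) ?? 0) + 1;
      seen.set(key, count);
      if (count <= maxRepeats) return state;

      // The agent keeps doing the same thing and the page is not changing.
      if (switches < maxStrategySwitches) {
        switches += 1;
        seen.clear(); // fresh start with a different strategy
        state = "retrying";
      } else {
        state = "escalated"; // hand over to manual intervention
      }
      return state;
    },
  };
}

// Usage: the agent clicks the same button on an unchanged page seven times.
const tracker = createProgressTracker();
const observedStates: TrackerState[] = [];
for (let i = 0; i < 7; i++) {
  observedStates.push(tracker.record("page-hash-a1", "click:download-btn"));
}
console.log(observedStates.join(" > "));
```

With the default limits, the third identical attempt triggers a strategy switch and the sixth escalates, after which the tracker stays escalated.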

After 18 months of iteration: 85% first-attempt success rate across arbitrary supplier portals. With retry logic, 92%. The remaining 8% are portals with CAPTCHAs, hardware-token 2FA, or UIs so broken that humans struggle with them too.
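
A quick calculation shows what those two percentages imply. The rates are the ones quoted above; the 40-supplier month is borrowed from earlier in the article, and the variable names are only for illustration.

```typescript
const firstAttemptPct = 85;
const withRetryPct = 92;

// Of the portals that fail on the first attempt, how many does retry logic rescue?
const failedFirstPct = 100 - firstAttemptPct;          // 15 points fail first time
const rescuedPct = withRetryPct - firstAttemptPct;     // 7 points recovered by retries
const retryRecoveryRate = rescuedPct / failedFirstPct; // 7 / 15

// What that means for a business with 40 supplier portals.
const supplierCount = 40;
const expectedFirstAttempt = (supplierCount * firstAttemptPct) / 100;
const expectedManual = (supplierCount * (100 - withRetryPct)) / 100;

console.log({
  retryRecoveryRate: retryRecoveryRate.toFixed(2),
  expectedFirstAttempt,
  expectedManual,
});
```

In other words, retries recover a little under half of the initial failures, and a 40-supplier business should expect roughly three portals a month to still need a human.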

Those numbers are honest. We are not at 100% and may never be. But 92% automated across any supplier portal on the internet, with no pre-built integration, is a category difference from what existed before.

Live at Scale: Qonto's Invoice Operator

In 2024, we partnered with Qonto to build their Invoice Operator feature, working alongside OpenAI. Qonto serves over 500,000 European SMBs. Their users had exactly the problem we'd been solving.

The integration was clean because of the architecture. The Chrome extension runs in the user's browser. Qonto's backend specifies which suppliers to target. The extension navigates, downloads, and uploads invoices directly to Qonto. No credentials pass through Qonto's servers or ours.

Scale exposed new edge cases, new portal types, new failure modes. Each one made the system better. That feedback loop, running across hundreds of thousands of real users hitting real supplier portals, is something you cannot replicate in a lab.

The Thesis

Every business in Europe manually downloads invoices from supplier portals every month. APIs will never cover the long tail of suppliers. The only architecture that generalizes is an AI agent that navigates the browser like a human, running locally so credentials stay safe.

We built it. We patented it. We shipped it to 500,000+ businesses.

Browser-controlling AI agents are now a major focus for OpenAI, Google, and Anthropic. They are building general-purpose tools. We built the specific application first: the one that requires financial-grade security, handles the messiest web UIs on the internet, and solves a problem that costs businesses billions of hours per year.

The web was built for humans. Sometimes the best way to automate it is to build something that reads pages the way humans do. The difference is that our agent does it in 30 seconds, never gets bored, and never accidentally saves the invoice in the wrong folder.

Maxime Champoux

CEO & co-founder, Well

Maxime is the CEO and co-founder of Well. He built Well to rebuild finance around AI-native data, not spreadsheets.
