Browser Automation for Finance: How We Built an LLM-Controlled Chrome Extension

Maxime Champoux · February 17, 2026 · 12 min read

Over 60% of financial institutions still rely on web portals that offer no API. Account aggregation, payment initiation, loan comparison: these workflows live behind login screens designed for human eyes and human hands. The industry response has been screen scraping: fragile scripts that break every time a bank updates its CSS. There is a better way, and it starts with letting a language model read the page.

The Problem With Scripted Automation

Selenium and Playwright are the workhorses of browser automation. They operate on a simple contract: you write a script that says "find element with ID login-button, click it, wait for selector .dashboard." This works until it does not. A bank redesigns its login page. A new cookie consent banner appears. The element ID changes from login-button to btn-signin. Your script fails silently or throws an error at 3 AM.

Financial automation compounds this fragility. Banks rotate their UIs frequently, sometimes A/B testing different flows on different users. A script written for Crédit Agricole's portal in January may not survive February. Multiply that by dozens of institutions across multiple countries, and you are running a maintenance treadmill that consumes engineering time without producing value.

The deterministic approach assumes the path is known in advance. For financial portals, it rarely is.

Seeing the Page as a Language Model Sees It

The core technical insight behind our Chrome extension is straightforward: instead of scripting a fixed path through a web page, serialize the page's DOM into a simplified representation and let a language model decide what to do next.

The serialization step matters. A raw DOM dump of a modern banking portal can run to hundreds of thousands of tokens. We strip it down. The extension walks the DOM tree and extracts only interactive elements and visible text: buttons, input fields, links, labels, error messages. Each element gets a stable reference ID. The output looks less like HTML and more like a structured description of what a human would see.

[ref:1] Input "Email address" (empty)
[ref:2] Input "Password" (empty)
[ref:3] Button "Sign in"
[ref:4] Link "Forgot password?"
[ref:5] Text "Welcome to your account portal"

This compressed representation typically fits in a few hundred tokens. The language model receives it along with a task description ("log in to this bank account") and returns a sequence of actions: type into ref:1, type into ref:2, click ref:3. The extension executes these actions in the browser, observes the resulting page state, serializes the new DOM, and loops.
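The format shown above can be produced by a small rendering function. What follows is a minimal sketch, not the extension's actual code: the `PageElement` shape and the function names are assumptions made for illustration, and the function takes already-extracted element descriptors instead of walking a live DOM.

```typescript
// Hypothetical descriptor for one element that survived DOM filtering.
type ElementKind = "Input" | "Button" | "Link" | "Text";

interface PageElement {
  kind: ElementKind;
  label: string;  // visible text, placeholder, or associated label
  value?: string; // current value, only meaningful for inputs
}

// Render one element as a line such as: [ref:1] Input "Email address" (empty)
function renderElement(ref: number, el: PageElement): string {
  const base = `[ref:${ref}] ${el.kind} "${el.label}"`;
  if (el.kind !== "Input") {
    return base;
  }
  // Report only whether the field is filled, never its contents.
  return `${base} (${el.value ? "filled" : "empty"})`;
}

// Assign sequential reference IDs in document order and join the lines.
function serializeElements(elements: PageElement[]): string {
  return elements.map((el, i) => renderElement(i + 1, el)).join("\n");
}
```

Sequential numbering is the simplest possible ID scheme and is only stable within one capture. Reporting "filled" or "empty" in place of the value keeps typed secrets out of the text sent to the model.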

The loop is the key architectural choice. Rather than planning an entire multi-page flow upfront, the system operates one step at a time. It acts, observes, acts again. This makes it inherently adaptive. If the bank adds a new interstitial screen ("Accept our updated terms of service"), the model sees it, reads the button text, and clicks through. No script update required.
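In code, the act-observe cycle reduces to a short loop. The sketch below is illustrative only: the `AgentDeps` interface, the `"done"` and `"give_up"` sentinel actions, and the step limit are assumptions, and the three dependencies are injected so the loop can be exercised without a browser or a model.

```typescript
// Hypothetical shapes; the extension's real types are not shown in this article.
interface AgentStep {
  state: string;     // serialized page as the model saw it
  actions: string[]; // what the model chose to do
}

interface AgentDeps {
  capture: () => Promise<string>; // serialize the current page
  decide: (task: string, past: AgentStep[], state: string) => Promise<string[]>;
  execute: (action: string) => Promise<void>; // perform one browser action
}

type TaskOutcome = "done" | "cannot_proceed" | "step_limit";

async function runTask(
  task: string,
  deps: AgentDeps,
  maxSteps: number = 20
): Promise<{ outcome: TaskOutcome; steps: AgentStep[] }> {
  const steps: AgentStep[] = [];
  for (let i = 0; i < maxSteps; i++) {
    // Observe the page, then ask the model for the next actions.
    const state = await deps.capture();
    const actions = await deps.decide(task, steps, state);
    steps.push({ state, actions });
    for (const action of actions) {
      // Assumed sentinels: the model signals completion or that it is stuck.
      if (action === "done") return { outcome: "done", steps };
      if (action === "give_up") return { outcome: "cannot_proceed", steps };
      await deps.execute(action);
    }
  }
  // Safety net so a confused model cannot loop forever.
  return { outcome: "step_limit", steps };
}
```

An unexpected interstitial needs no special branch here: it is simply one more iteration in which `decide` sees a different page.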

Patent FR2500423 covers this method: the combination of DOM serialization, LLM-driven action selection, and the observe-act loop applied to web-based financial workflows.

Inside the Action Loop

The extension's runtime follows a cycle that repeats until the task is complete or the model signals it cannot proceed.

**Step 1: Capture.** The content script traverses the visible DOM and builds the simplified representation. It filters out hidden elements, decorative images, and script tags. Interactive elements get priority. The capture also includes page metadata: current URL, page title, any visible alerts or modals.
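The filtering rules in the capture step could look like the sketch below. It is an illustration under stated assumptions: `RawNode` and `PageSnapshot` are hypothetical record types, the visibility and decorative flags are assumed to be computed by the content script beforehand, and "priority" is interpreted here as listing interactive elements first.

```typescript
// Hypothetical record a content script might collect for each DOM node.
interface RawNode {
  tag: string;          // lowercase tag name
  text: string;         // visible text or label
  visible: boolean;     // assumed result of a style and bounding-box check
  interactive: boolean; // button, input, link, and similar
  decorative?: boolean; // for example an image with empty alt text
}

interface PageSnapshot {
  url: string;
  title: string;
  alerts: string[];
  elements: RawNode[]; // interactive elements first
}

const SKIPPED_TAGS = ["script", "style", "noscript"];

function buildSnapshot(url: string, title: string, alerts: string[], nodes: RawNode[]): PageSnapshot {
  const kept = nodes.filter(
    (n) =>
      n.visible &&
      !n.decorative &&
      SKIPPED_TAGS.indexOf(n.tag) === -1 &&
      // Keep unlabeled interactive elements, drop empty text nodes.
      (n.interactive || n.text.trim() !== "")
  );
  // Stable partition: document order is preserved within each group.
  const interactive = kept.filter((n) => n.interactive);
  const passive = kept.filter((n) => !n.interactive);
  return { url, title, alerts, elements: interactive.concat(passive) };
}
```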

**Step 2: Decide.** The serialized state goes to the language model along with the conversation history (prior states and actions) and the original task instruction. The model returns one or more actions in a structured format.
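One plausible way to assemble that input is as a chat-style message list. The sketch below assumes the common system/user/assistant role convention and a hypothetical prompt wording; neither is confirmed as what the extension sends.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface PastStep {
  state: string;     // serialized page at that step
  actions: string[]; // actions the model returned for it
}

function buildMessages(task: string, past: PastStep[], currentState: string): ChatMessage[] {
  const messages: ChatMessage[] = [
    // Hypothetical instruction text; the task stays fixed for the whole session.
    { role: "system", content: `You control a browser. Task: ${task}\nReply with one action per line.` },
  ];
  // Replay history as alternating observations and decisions.
  for (const step of past) {
    messages.push({ role: "user", content: step.state });
    messages.push({ role: "assistant", content: step.actions.join("\n") });
  }
  // The newest observation always comes last.
  messages.push({ role: "user", content: currentState });
  return messages;
}
```

Replaying prior states as user turns and prior actions as assistant turns is what lets the model tell that it has already entered the email and is now looking at the password prompt.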

**Step 3: Execute.** The extension translates each action into browser events. A "click ref:3" becomes a simulated click on the corresponding DOM element. A "type ref:1 'user@email.com'" dispatches keyboard events to the input field. A "wait 2000" pauses execution. A "scroll down" adjusts the viewport.
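Before any event is dispatched, the action text has to become something typed. The parser below is a sketch that covers only the four action shapes quoted above; the exact grammar and the `AgentAction` type are assumptions.

```typescript
type AgentAction =
  | { type: "click"; ref: number }
  | { type: "type"; ref: number; text: string }
  | { type: "wait"; ms: number }
  | { type: "scroll"; direction: "up" | "down" };

// Returns null for anything outside the known grammar, so unrecognized
// model output is never executed.
function parseAction(line: string): AgentAction | null {
  const s = line.trim();
  let m = /^click ref:(\d+)$/.exec(s);
  if (m) return { type: "click", ref: Number(m[1]) };

  m = /^type ref:(\d+) '(.*)'$/.exec(s);
  if (m) return { type: "type", ref: Number(m[1]), text: m[2] };

  m = /^wait (\d+)$/.exec(s);
  if (m) return { type: "wait", ms: Number(m[1]) };

  m = /^scroll (up|down)$/.exec(s);
  if (m) return { type: "scroll", direction: m[1] === "up" ? "up" : "down" };

  return null;
}
```

Rejecting unknown input keeps the executor's surface small: only four verbs can ever reach the page.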

**Step 4: Observe.** After execution, the extension waits for the page to settle (network requests complete, DOM mutations stop), then captures the new state. This new state feeds back into Step 2.
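"Settled" can be defined as no pending requests plus a quiet period with no activity. The tracker below is a sketch of that definition; the class name and the quiet-period threshold are assumptions, and timestamps are passed in explicitly so the logic is independent of any browser API.

```typescript
class SettleTracker {
  private quietMs: number;
  private pendingRequests: number = 0;
  private lastActivityMs: number;

  constructor(quietMs: number, nowMs: number) {
    this.quietMs = quietMs;
    this.lastActivityMs = nowMs;
  }

  requestStarted(nowMs: number): void {
    this.pendingRequests++;
    this.lastActivityMs = nowMs;
  }

  requestFinished(nowMs: number): void {
    this.pendingRequests = Math.max(0, this.pendingRequests - 1);
    this.lastActivityMs = nowMs;
  }

  domMutated(nowMs: number): void {
    this.lastActivityMs = nowMs;
  }

  // Settled means nothing is in flight and nothing has happened for quietMs.
  isSettled(nowMs: number): boolean {
    return this.pendingRequests === 0 && nowMs - this.lastActivityMs >= this.quietMs;
  }
}
```

In a content script, `domMutated` would typically be called from a `MutationObserver` callback with `Date.now()`. How request starts and finishes are detected depends on the extension's permissions and is left out here.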

The conversation history gives the model memory across steps. It knows it already entered the email on the previous screen and is now looking at the password prompt. This stateful awareness is what separates the approach from simple element-matching heuristics.

Multi-Step Authentication

Login flows are where this architecture earns its value. Modern bank logins are rarely a single username-and-password screen. A typical flow might look like this:

1. Enter email address, click "Next"
2. Enter password on a new screen
3. Receive an SMS code, enter it in a third screen
4. Approve a push notification on a mobile app
5. Dismiss a "Download our app" interstitial
6. Arrive at the account dashboard

A scripted approach needs to anticipate and code each of these steps. If the bank adds step 5, the script breaks. The LLM-controlled approach handles it naturally. The model sees "Download our app" with a "Skip" or "No thanks" button, recognizes it as an interstitial, and clicks past it.

The extension manages credentials through a local vault. When the model's action references a credential field, the extension injects the value directly. The model never sees the actual password. It requests "type ref:2 with stored credential 'password'" and the extension handles substitution locally. This is a deliberate security boundary: the language model orchestrates the flow but never handles sensitive data.
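The boundary can be made concrete with two functions: one produces the text the model and the logs see, the other produces the keystrokes, and only the second touches the vault. This is a sketch with hypothetical names (`StoredCredentialRequest`, `describeRequest`, `resolveRequest`), and a plain in-memory object stands in for the encrypted vault.

```typescript
// What the model asks for: a field reference and the NAME of a credential.
interface StoredCredentialRequest {
  ref: number;
  credentialName: string;
}

// Safe to log and to send back to the model: contains no secret.
function describeRequest(req: StoredCredentialRequest): string {
  return `type ref:${req.ref} with stored credential '${req.credentialName}'`;
}

// Runs only inside the extension; the result is typed into the page
// and must never be added to the conversation history.
function resolveRequest(
  req: StoredCredentialRequest,
  vault: Record<string, string>
): { ref: number; text: string } {
  if (!Object.prototype.hasOwnProperty.call(vault, req.credentialName)) {
    throw new Error(`No stored credential named '${req.credentialName}'`);
  }
  return { ref: req.ref, text: vault[req.credentialName] };
}
```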

For two-factor authentication, the system adapts to what it observes. If it sees an SMS code input, it can wait for the user to provide the code (the extension surfaces a notification). If the 2FA is app-based push approval, it detects the "Waiting for approval" screen and polls until the page state changes.
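Polling for a push approval amounts to re-capturing the page until it differs from the waiting screen, with a timeout. The sketch below assumes a hypothetical `waitForPageChange` helper; `capture` and `sleep` are injected, and the interval and timeout values are up to the caller.

```typescript
interface PollOptions {
  intervalMs: number;
  timeoutMs: number;
}

// Resolves with the new page state, or null if nothing changed in time.
async function waitForPageChange(
  capture: () => Promise<string>,
  sleep: (ms: number) => Promise<void>,
  options: PollOptions
): Promise<string | null> {
  const initial = await capture();
  let waitedMs = 0;
  while (waitedMs < options.timeoutMs) {
    await sleep(options.intervalMs);
    waitedMs += options.intervalMs;
    const current = await capture();
    if (current !== initial) {
      return current; // the approval went through, or an error screen appeared
    }
  }
  return null; // caller decides whether to notify the user or give up
}
```

In a browser, `sleep` could be `(ms) => new Promise((resolve) => setTimeout(resolve, ms))`. Comparing full serialized states is crude, since any changing text (a countdown, for instance) would count as a change.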

The CAPTCHA Problem

CAPTCHAs exist specifically to block automation, so any honest discussion of browser automation must address them directly.

The extension categorizes CAPTCHAs into three tiers. Simple checkbox CAPTCHAs ("I am not a robot") are handled by clicking the checkbox element, which the model identifies in the DOM. Image-selection CAPTCHAs ("Select all traffic lights") can sometimes be processed through vision capabilities, though success rates vary by complexity. Advanced CAPTCHAs, such as reCAPTCHA v3's behavioral scoring or challenges that require sustained mouse movement patterns, cannot be reliably automated.

When the extension encounters a CAPTCHA it cannot solve, it pauses and notifies the user. The human solves the CAPTCHA manually, and the extension resumes automated operation on the next page. This hybrid model is pragmatic: full automation where possible, human intervention where necessary, and a clean handoff protocol between the two.
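The three tiers and the handoff rule map onto a small decision function. This is a sketch with assumed names (`CaptchaKind`, `CaptchaStrategy`); how a CAPTCHA is detected and classified on the page is out of scope here.

```typescript
type CaptchaKind = "checkbox" | "image_selection" | "behavioral";
type CaptchaStrategy = "click_checkbox" | "attempt_with_vision" | "pause_for_user";

function chooseCaptchaStrategy(kind: CaptchaKind, visionAvailable: boolean): CaptchaStrategy {
  switch (kind) {
    case "checkbox":
      return "click_checkbox";
    case "image_selection":
      // Vision may or may not succeed; without it, go straight to the user.
      return visionAvailable ? "attempt_with_vision" : "pause_for_user";
    case "behavioral":
      return "pause_for_user";
  }
}

// After an automated attempt, resume if the challenge is gone,
// otherwise hand off to the user instead of retrying blindly.
function afterAutomatedAttempt(captchaStillPresent: boolean): "resume" | "pause_for_user" {
  return captchaStillPresent ? "pause_for_user" : "resume";
}
```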

In practice, many financial portals use lighter verification methods than consumer sites. Some rely on device fingerprinting or SMS codes rather than visual CAPTCHAs. The extension's adaptive approach handles these without special-case logic.

Security: Nothing Leaves the Browser

The security model is the architectural decision that makes this viable for finance. Every component that touches credentials runs locally.

The Chrome extension executes in the user's browser. DOM serialization happens in a content script with access only to the active tab. Credentials are stored in the browser's local storage, encrypted at rest. The LLM inference can run against a local model or a cloud API, but even in the cloud case, the model receives only the simplified DOM. It never sees raw credentials, session tokens, or cookies.

No remote server orchestrates the browser. No proxy intercepts traffic. No credentials traverse third-party infrastructure. The user's browser talks directly to their bank, exactly as it would during manual use. The extension simply automates the human's side of that conversation.

This is a meaningful distinction from server-side aggregation services, which historically required users to hand over their banking credentials to a third party. The regulatory environment in Europe, shaped by PSD2 and the deprecation of screen scraping in favor of dedicated APIs, reflects the security concerns with that model. Our approach sidesteps the issue entirely: credentials stay on the user's machine, and the automation layer has no more access than the user sitting at their keyboard.

The local execution model also simplifies compliance. There is no data processing agreement needed for credential handling because no third party processes credentials. The extension is a tool the user runs, not a service that accesses their accounts on their behalf.

How This Differs From Scripted Approaches

The comparison to Selenium and Playwright deserves specificity.

**Maintenance burden.** A scripted connector to a single bank requires ongoing maintenance as the UI evolves. An LLM-controlled approach absorbs UI changes automatically, as long as the page's intent remains legible. A button that changes from "Login" to "Sign in" or moves from the top-right to a centered position does not break the flow.

**Coverage speed.** Writing a new scripted connector takes days of reverse-engineering a portal's flow. The LLM-controlled approach can often handle a new portal on first attempt, given only a task description and credentials. When it fails, the failure is typically on a specific step that can be debugged from the conversation log.

**Determinism.** Scripted approaches are deterministic. Given the same page, they always take the same action. LLM-controlled automation introduces non-determinism: the model might choose different actions on different runs. For financial workflows, this is managed through action validation: the extension checks that the model's proposed action is sensible. Clicking a "Transfer $10,000" button triggers a confirmation prompt, not silent execution. Every action is logged for audit.
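A validation gate of this kind can sit between the model's output and the executor. The sketch below is an assumption about how such a check might look: the keyword pattern, the `Verdict` values, and the type names are all illustrative, not the extension's actual rules.

```typescript
interface ProposedAction {
  type: "click" | "type";
  ref: number;
}

interface KnownElement {
  ref: number;
  label: string;
}

type Verdict = "allow" | "needs_confirmation" | "reject";

// Hypothetical list of labels that suggest money is about to move.
const SENSITIVE_LABEL = /\b(transfer|pay|send|wire)\b/i;

function validateAction(action: ProposedAction, elements: KnownElement[]): Verdict {
  const target = elements.find((e) => e.ref === action.ref);
  if (!target) {
    return "reject"; // the model referenced an element that is not on the page
  }
  if (action.type === "click" && SENSITIVE_LABEL.test(target.label)) {
    return "needs_confirmation"; // surface a prompt before executing
  }
  return "allow";
}
```

Because the gate runs in deterministic code, the same proposed action on the same page always gets the same verdict, whatever the model does between runs.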

**Error handling.** Scripts fail hard when they encounter unexpected states. The LLM approach fails soft: the model sees an error message ("Invalid password"), recognizes it, and can retry or report the failure with context. This produces better error diagnostics than a stack trace from a missing CSS selector.

**Scope of adaptation.** A scripted connector is binary: it either works or it does not. There is no graceful degradation. The LLM approach operates on a spectrum. If a bank changes its entire authentication system, the model may still navigate it successfully because it reads the page as a human would. If it encounters a completely novel pattern it has never seen in training data, it may fail on that step but can still describe what it sees, giving engineers immediate diagnostic information rather than a cryptic selector error.

The tradeoff is latency. Each step requires an LLM inference call, which adds hundreds of milliseconds to seconds depending on the model and infrastructure. For interactive trading where milliseconds matter, this is a non-starter. For account aggregation that runs overnight, it is irrelevant.

The Patent and What It Protects

Patent FR2500423 covers the method of using a language model to control browser navigation for financial data retrieval. Specifically, it claims the combination of three elements: the DOM-to-text serialization optimized for interactive elements, the action-loop architecture where the model observes and acts iteratively, and the application to financial services where credentials are handled locally without third-party transmission.

The patent does not claim browser automation generally, nor does it claim LLM-based agents broadly. It claims a specific method applied to a specific domain with a specific security architecture. This specificity is intentional. General browser automation patents would face extensive prior art challenges. The intersection of LLM-controlled navigation, financial data access, and local credential isolation is a narrower and more defensible position.

Conversation Logs as Debugging Tools

A side effect of the observe-act architecture is that every automation session produces a readable conversation log. Each entry contains the serialized DOM state, the model's reasoning, and the action taken. When something goes wrong, an engineer can read the log like a narrative: "The model saw the password field, entered the credential, clicked Sign In, saw an error message saying 'Account locked,' and stopped."
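A log entry with those three fields, plus a function that turns a session into a readable narrative, might look like the sketch below. The `SessionLogEntry` shape and the output format are assumptions made for illustration.

```typescript
interface SessionLogEntry {
  step: number;
  state: string;     // serialized DOM the model saw
  reasoning: string; // the model's stated reasoning
  action: string;    // the action it took
}

// One line per step, readable top to bottom as a story of the session.
function narrate(entries: SessionLogEntry[]): string {
  return entries.map((e) => `Step ${e.step}: ${e.reasoning} -> ${e.action}`).join("\n");
}

// The page as it looked when the session ended: the first thing
// to read when a run has failed.
function lastObservation(entries: SessionLogEntry[]): string | null {
  return entries.length > 0 ? entries[entries.length - 1].state : null;
}
```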

Compare this to debugging a Selenium script failure. The typical output is a stack trace pointing to a line where find_element(By.ID, 'submit-btn') raised NoSuchElementException. The engineer then needs to manually load the page, inspect the DOM, and figure out what changed. With conversation logs, the failure context is already captured. The model's last observation tells you exactly what the page looked like when things went wrong.

This has practical implications for support operations. When a user reports that their bank connection failed, the support team can review the conversation log without needing to reproduce the issue. The log shows precisely where the flow diverged from expectations.

What This Engineering Enables

The technical architecture described here is the foundation for financial data access that does not depend on banks providing APIs. In markets where open banking regulation lags (much of Africa, Southeast Asia, Latin America), web portals remain the only interface. Even in Europe, where PSD2 mandates APIs, coverage is incomplete and API reliability varies.

An LLM-controlled browser extension turns any web portal into a programmable interface. The user's browser becomes the API layer. The language model provides the adaptability that scripted approaches lack. And the local execution model provides the security that server-side scraping cannot.

The engineering is not theoretical. The extension runs in production, navigating real banking portals, handling real authentication flows, and retrieving real financial data. The loop that serializes a DOM into five lines of text, sends it to a model, and executes the response is where the value lives.

It is a small loop. It solves a large problem.

Maxime Champoux

CEO & co-founder, Well

Maxime is the CEO and co-founder of Well. He built Well to rebuild finance around AI-native data, not spreadsheets.
