The Enrichment Waterfall: Multi-Provider Data Fusion at Scale

Maxime Champoux | January 10, 2026 | 11 min read

CRM records decay at roughly 30% per year. Job titles change, companies move offices, revenue figures shift with every quarter. If you are building a B2B product that depends on company data, you face an uncomfortable choice: trust a single data provider and accept its blind spots, or stitch together multiple sources and deal with the chaos that follows.

We chose the second path. This is the engineering story behind Well's enrichment waterfall, the system that turns a single click into 23 populated fields by querying six data providers in sequence, scoring each result for confidence, and fusing everything into one clean record.

The Problem With Single-Provider Enrichment

Most CRM enrichment tools work like a phone book lookup. You pass in a company domain, the provider returns what it has, and you paste the result into your record. Simple enough, until you notice the gaps.

No single provider covers every field reliably. Provider A might have strong revenue estimates for North American companies but return nothing for European SMBs. Provider B nails employee counts but has stale address data. Provider C specializes in technographic data but ignores financials entirely.

We tested five providers against a benchmark set of 500 companies. The best single provider covered 71% of fields accurately. The worst hit 34%. None of them broke 80% on their own. The math was clear: relying on one source meant leaving data on the table for every record.

Designing the Waterfall

A waterfall architecture is not a new idea. DNS resolution uses one. Authentication systems use one. The pattern is straightforward: try the first source, and if it fails or returns a low-quality result, try the next.

What made our implementation specific was the field-level granularity. We don't run the waterfall per record. We run it per field, per record. A single company enrichment might query Provider A for revenue, accept that result, then fall through to Provider C for tech stack because Providers A and B returned nothing useful.

The provider priority order is not static. We set it based on field type and geographic region. For revenue data on US companies, a financial data provider sits at the top. For employee counts in Europe, a commercial registry goes first. These rankings emerged from months of measuring provider accuracy against known-good datasets.
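To make the idea concrete, a priority table of this kind could be keyed by field and region, as in the minimal sketch below. The provider names, field names, and region codes are placeholders chosen for illustration, not Well's actual configuration.

```python
# Hypothetical priority table keyed by (field, region).
# Provider names are placeholders, not real vendors.
PRIORITY_ORDER = {
    ("revenue", "US"): ["financial_data", "commercial_registry", "web_scraper"],
    ("employee_count", "EU"): ["commercial_registry", "financial_data", "web_scraper"],
}

# Fallback used when no specific ranking exists for a field/region pair.
DEFAULT_ORDER = ["financial_data", "commercial_registry", "web_scraper"]


def providers_for(field: str, region: str) -> list[str]:
    """Return the providers to try, best first, for this field and region."""
    return PRIORITY_ORDER.get((field, region), DEFAULT_ORDER)
```

Keeping the ranking in data rather than in code means it can be re-tuned as accuracy measurements change, without touching the waterfall itself.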

Here is the simplified logic:

for each field in record:
    for each provider in priority_order[field][region]:
        result = query(provider, company_identifier)
        score = compute_confidence(result, field)
        if score > 0.85:
            accept(result)
            break
        elif score >= 0.50:
            flag_for_review(result)
            continue  # try next provider, might get higher confidence
        else:
            ignore(result)
            continue
    if no accepted result and flagged results exist:
        use highest-confidence flagged result

The outer loop runs 23 times, once for each field. The inner loop runs up to six times, once per provider. In the worst case, a single enrichment triggers 23 × 6 = 138 API calls. In practice, early accepts and provider skips bring the median down to around 40.
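For readers who prefer something executable, the inner loop for a single field can be sketched in Python as follows. This is an illustration under assumptions, not the production code: `enrich_field`, the `query` and `score` callables, and the returned status strings are hypothetical names, and only the 0.85 and 0.50 thresholds come from the logic above.

```python
from typing import Callable, Optional

ACCEPT_THRESHOLD = 0.85  # above this: accept and stop
REVIEW_THRESHOLD = 0.50  # at or above this: keep as a review candidate


def enrich_field(
    providers: list[str],
    query: Callable[[str], Optional[object]],
    score: Callable[[object], float],
) -> tuple[str, Optional[object], int]:
    """Run the waterfall for one field.

    Returns (status, value, calls_made), where status is
    "accepted", "review", or "empty".
    """
    flagged: list[tuple[float, object]] = []
    calls = 0
    for provider in providers:
        result = query(provider)
        calls += 1
        if result is None:
            continue  # provider had nothing for this field
        confidence = score(result)
        if confidence > ACCEPT_THRESHOLD:
            return "accepted", result, calls  # early accept ends the loop
        if confidence >= REVIEW_THRESHOLD:
            flagged.append((confidence, result))
        # below REVIEW_THRESHOLD: ignore and try the next provider
    if flagged:
        best = max(flagged, key=lambda pair: pair[0])
        return "review", best[1], calls
    return "empty", None, calls


# Toy example: provider "a" has nothing, "b" is medium confidence, "c" is high.
answers = {"a": None, "b": 12_000_000, "c": 15_000_000}
scores = {12_000_000: 0.72, 15_000_000: 0.91}
status, value, calls = enrich_field(["a", "b", "c"], answers.get, scores.get)
```

In the toy example the third provider clears the accept threshold, so the field is accepted after three calls and any remaining providers are never queried.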

Confidence Scoring: The Core of the System

Every result that comes back from a provider gets a confidence score between 0 and 1. This score is not something the providers give us. We compute it ourselves, because provider self-reported confidence is unreliable when it exists at all.

The scoring function weighs several signals:

Recency. A revenue figure from a filing dated three months ago scores higher than one from 18 months ago. We apply a time-decay curve that drops sharply after 12 months.

Source type. A company name pulled from an official registry scores higher than one scraped from a LinkedIn page. We maintain a source-type ranking per field that reflects how authoritative each origin tends to be.

Internal consistency. If a provider returns an employee count of 50 but a revenue figure of $2 billion, something is off. We run cross-field plausibility checks that penalize results where the numbers don't add up relative to industry benchmarks.

Historical accuracy. We track each provider's hit rate per field over time. If Provider B has been wrong about tech stack data for the last 200 queries, its confidence gets a persistent discount for that field. This feedback loop means the system improves its own scoring without manual intervention.
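One plausible way to combine these four signals is a weighted sum, sketched below. Everything numeric here is an assumption made for illustration: the logistic shape of the recency curve, its steepness, and the weights are invented, not the values used in production. Only the idea of a decay that drops sharply around 12 months comes from the description above.

```python
import math


def recency_score(age_months: float, midpoint: float = 12.0, steepness: float = 2.0) -> float:
    """Logistic decay: near 1 for fresh data, falling sharply around `midpoint`."""
    # Clamp the exponent so very old records cannot overflow math.exp.
    exponent = min((age_months - midpoint) / steepness, 700.0)
    return 1.0 / (1.0 + math.exp(exponent))


def compute_confidence(
    age_months: float,
    source_rank: float,          # 0..1, how authoritative the origin is for this field
    plausibility: float,         # 0..1, from cross-field consistency checks
    historical_hit_rate: float,  # 0..1, provider's track record for this field
) -> float:
    """Combine the four signals into one score in [0, 1]."""
    # Hypothetical weights; they sum to 1 so the result stays in range.
    weights = {"recency": 0.25, "source": 0.25, "plausibility": 0.20, "history": 0.30}
    score = (
        weights["recency"] * recency_score(age_months)
        + weights["source"] * source_rank
        + weights["plausibility"] * plausibility
        + weights["history"] * historical_hit_rate
    )
    return max(0.0, min(1.0, score))


fresh = compute_confidence(3, 0.9, 0.9, 0.9)   # 3-month-old filing
stale = compute_confidence(18, 0.9, 0.9, 0.9)  # same signals, 18 months old
```

With these made-up weights, the three-month-old figure scores about 0.92 and the 18-month-old one about 0.69, so recency alone can move a result from one side of an acceptance threshold to the other.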

The three confidence bands were chosen after analyzing six months of human review data. Above 0.85, our ops team accepted the enrichment result 97% of the time. Between 0.50 and 0.85, they accepted 62% of the time. Below 0.50, the acceptance rate dropped to 11%. Those three bands naturally divided the data into accept, review, and reject.

The Fusion Layer

When multiple providers return data for the same field and none clears the 0.85 threshold, you end up with competing answers. Provider A says revenue is $12M, Provider B says $15M, Provider C says $10M. Which one do you use?

Our fusion layer handles this with a weighted-average approach for numerical fields and a voting mechanism for categorical ones.

For revenue, employee count, and similar numbers, we take the confidence-weighted mean. If Provider A (confidence 0.72) says $12M and Provider B (confidence 0.81) says $15M, the fused result is (0.72 × 12 + 0.81 × 15) / (0.72 + 0.81) ≈ $13.6M, pulled slightly toward Provider B's higher-confidence figure. For categorical fields like industry classification or company type, we use majority voting weighted by confidence. If two providers say "SaaS" and one says "Consulting" at similar confidence, SaaS wins.
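Both fusion rules fit in a few lines of Python. The sketch below is a minimal illustration; the function names and the choice to sum confidences per label are assumptions rather than a description of the production code.

```python
from collections import defaultdict


def fuse_numeric(candidates: list[tuple[float, float]]) -> float:
    """Confidence-weighted mean of (value, confidence) pairs."""
    total_weight = sum(conf for _, conf in candidates)
    if total_weight == 0:
        raise ValueError("at least one candidate needs non-zero confidence")
    return sum(value * conf for value, conf in candidates) / total_weight


def fuse_categorical(candidates: list[tuple[str, float]]) -> str:
    """Pick the label with the largest summed confidence."""
    votes: dict[str, float] = defaultdict(float)
    for label, conf in candidates:
        votes[label] += conf
    return max(votes, key=votes.get)


# Revenue in millions of dollars, using the two providers from the text.
revenue = fuse_numeric([(12.0, 0.72), (15.0, 0.81)])
industry = fuse_categorical([("SaaS", 0.60), ("SaaS", 0.55), ("Consulting", 0.80)])
```

Here `revenue` comes out at roughly 13.6 and `industry` is "SaaS", because the two SaaS votes sum to 1.15 against 0.80 for Consulting. Note that under confidence weighting a single very confident provider can outvote two weak ones, which is the intended behavior of this sketch.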

The fused result gets its own confidence score, derived from the spread and agreement among sources. Three providers agreeing within 10% produces a high-confidence fused result, often clearing 0.85 even when no individual provider did. Three providers disagreeing wildly produces a low score, which flags it for human review.

This fusion step is where the waterfall pays for itself. The best individual providers top out around 70% field coverage. After fusion, we hit 94%.

Handling Provider Failures

Data providers go down. They rate-limit you. They change their API response format without notice. They return 200 OK with an empty body. Building a multi-provider system means building for failure as a default state.

Our approach rests on three mechanisms:

Circuit breakers. Each provider connection has a circuit breaker that opens after three consecutive failures within a five-minute window. When the circuit opens, we skip that provider entirely for all in-flight enrichments and try the next one in the waterfall. The circuit attempts a half-open probe every 60 seconds, sending a single test query to check if the provider has recovered.

Response validation. Before we even score a result, we validate its structure. Does the response contain the expected fields? Are the data types correct? Is the revenue figure a number and not the string "N/A"? We built a schema validator per provider that catches malformed responses before they enter the scoring pipeline. About 3% of successful API responses fail validation and get treated as if the provider returned nothing.

Timeout budgets. The entire enrichment operation for one record has a ten-second budget. Individual provider calls get two seconds each. If a provider is slow, we don't wait. We move to the next one. This means the user clicking the enrich button never waits more than ten seconds, regardless of how many providers are having a bad day.
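The circuit breaker described above can be sketched as a small state machine. This is a minimal single-threaded illustration, not the production implementation: the class and method names are hypothetical, and the injectable `clock` parameter is there only to make the behavior easy to test. The defaults mirror the numbers given above (three consecutive failures, a five-minute window, a probe every 60 seconds).

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures inside `window_s` seconds."""

    def __init__(
        self,
        max_failures: int = 3,
        window_s: float = 300.0,
        probe_interval_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.probe_interval_s = probe_interval_s
        self._clock = clock
        self._failures: list[float] = []          # timestamps of consecutive failures
        self._last_probe: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        """True if the provider may be called now."""
        if self._last_probe is None:
            return True
        now = self._clock()
        if now - self._last_probe >= self.probe_interval_s:
            self._last_probe = now  # half-open: let exactly one probe through
            return True
        return False

    def record_success(self) -> None:
        """Any success resets the failure streak and closes the circuit."""
        self._failures.clear()
        self._last_probe = None

    def record_failure(self) -> None:
        now = self._clock()
        if self._last_probe is not None:
            return  # a failed probe leaves the circuit open
        recent = [t for t in self._failures if now - t <= self.window_s]
        self._failures = recent + [now]
        if len(self._failures) >= self.max_failures:
            self._last_probe = now  # open the circuit
```

The waterfall would call `allow_request()` before each provider query and skip to the next provider when it returns `False`, then report the outcome with `record_success()` or `record_failure()`.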

Normalization: The Unglamorous Necessity

Six providers return data in six formats. One gives revenue in thousands, another in full figures, a third in local currency. One returns employee count as a range ("50-100"), another as an integer, a third as a string with commas. Address formats vary wildly across providers and countries.

The normalization layer sits between raw API responses and the scoring engine. It converts everything into our canonical format before confidence scoring begins. This sounds simple, but the normalization code accounts for roughly 40% of the enrichment system's total codebase. Edge cases accumulate. A German company's "GmbH" might appear as "GMBH," "Gmbh," "G.m.b.H.," or get dropped entirely. Currency conversions need daily exchange rates. Address parsing needs country-specific logic for postal codes, states versus provinces, and abbreviation expansion.
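A few of these cases are easy to illustrate. The helpers below are hypothetical and deliberately narrow: they assume the canonical format is an integer headcount and revenue in full US dollars, and they collapse an employee range to its midpoint, which is a simplification made for this sketch rather than a documented rule.

```python
import re
from typing import Optional, Union

# Matches "GMBH", "Gmbh", "G.m.b.H." and similar at a word boundary.
_GMBH = re.compile(r"\bG\.?\s*m\.?\s*b\.?\s*H\.?(?=\s|$)", re.IGNORECASE)


def normalize_company_name(name: str) -> str:
    """Rewrite spelling variants of the German legal suffix as 'GmbH'."""
    # A suffix that was dropped entirely cannot be recovered here.
    return _GMBH.sub("GmbH", name).strip()


def parse_employee_count(raw: Union[int, str]) -> Optional[int]:
    """Accept 75, "1,250", or a range like "50-100" (midpoint is an assumption)."""
    if isinstance(raw, int):
        return raw
    text = raw.replace(",", "").strip()
    match = re.fullmatch(r"(\d+)\s*-\s*(\d+)", text)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        return (low + high) // 2
    return int(text) if text.isdigit() else None


def normalize_revenue(amount: float, unit_multiplier: int, rate_to_usd: float) -> float:
    """Convert a provider figure to full US dollars.

    `unit_multiplier` is 1_000 when the provider reports in thousands;
    `rate_to_usd` is the day's exchange rate for the provider's currency.
    """
    return amount * unit_multiplier * rate_to_usd
```

Even this small sample hints at why the layer grows: each helper handles one field for a handful of formats, and every new provider or country adds variants that need their own tests.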

We considered using an off-the-shelf normalization library for addresses, but none handled the intersection of B2B company data with international formats well enough. We ended up writing our own, which we now maintain as an internal package with roughly 1,200 test cases covering 40 countries.

The One-Click Experience

From the user's perspective, none of this complexity is visible. They see a company record in Well with empty or outdated fields. They click a button labeled "Enrich." A progress indicator runs for a few seconds. Then the fields fill in, some marked green (high confidence, auto-accepted), some marked yellow (medium confidence, needs review), and some left empty (no provider returned anything usable).

The yellow-flagged fields show the competing values from different providers, along with their confidence scores. The user picks the one that looks right, or enters their own value. That human decision feeds back into our historical accuracy tracking, making future confidence scores more precise.

This feedback loop matters more than it might seem. Over the first six months of operation, the percentage of fields landing in the yellow "review" band dropped from 23% to 14%. The system learned which providers to trust for which fields, and the waterfall order adjusted accordingly.
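One simple way to turn review decisions into a per-provider, per-field track record is a smoothed acceptance rate, sketched below. The class name, the prior of 0.5, and the prior weight of 10 are all assumptions for illustration; the article does not specify how the historical accuracy signal is computed.

```python
from collections import defaultdict


class AccuracyTracker:
    """Tracks how often each provider's value is accepted, per field."""

    def __init__(self, prior: float = 0.5, prior_weight: int = 10) -> None:
        self.prior = prior                # assumed hit rate before any evidence
        self.prior_weight = prior_weight  # how many reviews the prior is "worth"
        self._hits: dict[tuple[str, str], int] = defaultdict(int)
        self._total: dict[tuple[str, str], int] = defaultdict(int)

    def record_review(self, provider: str, field: str, accepted: bool) -> None:
        """Log one human decision on a value this provider supplied."""
        key = (provider, field)
        self._total[key] += 1
        if accepted:
            self._hits[key] += 1

    def hit_rate(self, provider: str, field: str) -> float:
        """Smoothed acceptance rate, pulled toward the prior when data is scarce."""
        key = (provider, field)
        hits = self._hits[key] + self.prior * self.prior_weight
        total = self._total[key] + self.prior_weight
        return hits / total


tracker = AccuracyTracker()
for _ in range(10):
    tracker.record_review("provider_b", "tech_stack", accepted=True)
rate = tracker.hit_rate("provider_b", "tech_stack")
```

After ten accepted reviews the smoothed rate moves from 0.5 to 0.75. The smoothing keeps a provider with only a handful of reviews from swinging to 0 or 1 on thin evidence.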

Why Not Just Use AI to Merge Everything?

A fair question. Why build deterministic waterfall logic when you could throw all provider responses into a language model and ask it to pick the best one?

We tested this. The results were surprisingly poor for structured data. LLMs are good at reasoning about text, but when you ask them whether $12M or $15M is the more likely revenue figure for a 50-person SaaS company, they hallucinate plausible-sounding reasoning that has no grounding in the actual data quality. They cannot evaluate source recency or provider track records, because those are statistical properties of our pipeline, not facts present in the prompt.

We do use ML in one place: the cross-field plausibility checks. A lightweight model trained on our verified records flags combinations that look improbable (high revenue, low employee count, wrong industry classification). But the core waterfall logic stays deterministic. When a field is wrong, we can trace exactly which provider supplied it, what its confidence score was, and why it was selected. That traceability disappears inside a language model.

The Moat Nobody Talks About

Data enrichment looks like a commodity from the outside. Dozens of vendors sell company data APIs. But the quality gap between "having data" and "having the right data at the right confidence level" is where defensibility lives.

Every enrichment cycle generates labeled training data. Every human review sharpens the confidence model. Every new provider added to the waterfall increases coverage in ways that a single-provider competitor cannot replicate without building the same fusion infrastructure. Six months of production usage means tens of thousands of human-validated data points feeding the scoring engine. A competitor starting today would need to run the same volume through the same review process to reach equivalent accuracy.

This compounds. The team with better data quality gets higher user trust, which drives more usage, which generates more review data, which improves accuracy further. It is a slow flywheel, but it is real, and it is hard to shortcut.

What We Learned

Building this system took four months for the initial version and another three months of tuning before the accuracy numbers stabilized. A few lessons stand out.

Start with two providers, not six. We launched with two providers and added the rest incrementally. Each new provider required its own API integration, normalization rules, and confidence calibration. Adding them one at a time let us measure the marginal improvement each one contributed. Two of our six providers account for 60% of accepted results. The other four fill gaps that are narrow but real.

Measure provider accuracy independently. We maintain a benchmark set of 200 companies with manually verified data. Every week, we run all providers against this set and update our accuracy metrics. Without this ground truth, confidence scoring would drift.

Human review data is gold. Every time a user corrects a yellow-flagged field, we get a labeled training sample for free. This data drives the feedback loop that improves confidence scoring over time. We invested heavily in making the review interface fast and frictionless, because the quality of the system depends on users actually completing reviews instead of ignoring them.

Normalization is half the work. We underestimated this by a factor of three. If you are building a multi-provider system, budget accordingly.

The enrichment waterfall is not the most visible feature in Well. Users see a button and filled-in fields. But it sits at the intersection of data engineering, API reliability, and product design in a way that determines the quality of every record in the system. When the data is right, every feature downstream works better. When it is wrong, nothing else matters.

That is why we built the waterfall instead of picking a single provider and calling it done. The marginal effort of adding each source is high. The marginal improvement in data quality compounds across every record, every query, and every decision a user makes based on that data.

Maxime Champoux

CEO & co-founder, Well

Maxime is the CEO and co-founder of Well. He built Well to rebuild finance around AI-native data, not spreadsheets.
