Building in Public: Our Design System Journey from Chaos to Consistency

Maxime Champoux | February 2, 2026 | 11 min read

Six months into building Well, we had 47 different shades of gray in our codebase. Not a design choice. Just entropy.

Most startups accumulate this kind of drift. You ship fast, copy-paste a hex code, nudge padding by 2 pixels because it "looks better," and move on. Nobody notices until everyone notices. For us, the tipping point came during a design review in January 2026 when we put three screens side by side and realized our primary button had four different border radii across the app.

We are a team of four engineers. Deciding to build a design system meant one of us would stop shipping features for two weeks. At a startup where every week counts, that is a real opportunity cost. This is the story of how we made that decision, what we got wrong, and what we learned.

The mess we made

Well launched in mid-2025 without a design system. This was the right call. At that stage, we needed to validate the product, not architect a component library. We used Tailwind, kept things loose, and moved fast.

By month three, the cracks showed. Different developers had different instincts about spacing. One of us preferred gap-4, another liked gap-5. Our color palette technically had a primary blue, but three variations had crept in through one-off adjustments. We had two different modal implementations, each with its own animation timing.

None of this was catastrophic. Users did not complain. But compounding inconsistency has a hidden cost: every new screen takes longer because you are making micro-decisions that should already be settled. "Which gray for secondary text?" becomes a question you answer differently depending on the day.

We tracked our velocity over those months. Feature completion time was creeping up, not dramatically, but the trend was clear. New screens that once took a day were taking a day and a half. Multiply that across the dozens of screens we planned for Q1 2026, and the math starts to matter.

The redesign forced the issue

In February 2026, we committed to a visual redesign of Well. The goals were practical: better information hierarchy, improved accessibility contrast ratios, and a more cohesive feel. We wrote about the design system itself in a previous post. This article is about the process of building it: the decisions, the wrong turns, and what the experience taught us about infrastructure investment at a small scale.

The redesign was the forcing function. We could either apply new visual decisions on top of the existing mess, or we could build the foundation first and apply the redesign systematically. We chose the foundation.

That choice came with a two-week commitment. With four engineers, dedicating one person to design system work for two weeks means giving up 25% of your engineering capacity for half a month, or 12.5% of a full month's output. We debated this for two days before committing. The argument that won: doing the redesign without a system would mean doing it again in three months when the drift returned.

Shadow tokens: the idea that almost worked too well

Our implementation approach centered on design tokens, which are standardized variables for colors, spacing, typography, and other visual properties. Specifically, we used what we call shadow tokens: tokens that reference other tokens rather than containing raw values.

Here is what that looks like in practice. Instead of defining a button background as #2563EB (a specific blue), we define it as color.action.primary, which references color.brand.blue.500, which references the actual hex value. That is a three-level chain: a semantic token, a brand token, and the raw value.

The power of this approach is flexibility. Want to change the entire app's primary color? Change one value, and it cascades everywhere. Want a dark mode? Create an alternative mapping at the middle layer, and every component adapts without being touched.
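A minimal sketch of how such a reference chain could be resolved, in TypeScript. The brace syntax for references, the button.background and color.brand.blue.300 tokens, and the dark-mode hex value are assumptions for illustration; only color.action.primary, color.brand.blue.500, and #2563EB come from the description above.

```typescript
// Token name -> raw value, or a "{other.token}" reference (assumed syntax).
type TokenMap = Record<string, string>;

const REFERENCE = /^\{(.+)\}$/;

const lightTokens: TokenMap = {
  "color.brand.blue.500": "#2563EB",
  "color.action.primary": "{color.brand.blue.500}",
  "button.background": "{color.action.primary}", // hypothetical component token
};

// Follow references until a raw value is reached, guarding against cycles.
function resolveToken(name: string, tokens: TokenMap, seen: string[] = []): string {
  if (seen.includes(name)) {
    throw new Error(`Circular token reference: ${[...seen, name].join(" -> ")}`);
  }
  const value = tokens[name];
  if (value === undefined) {
    throw new Error(`Unknown token: ${name}`);
  }
  const target = REFERENCE.exec(value)?.[1];
  return target !== undefined ? resolveToken(target, tokens, [...seen, name]) : value;
}

// Dark mode remaps only the middle layer; components keep asking for
// button.background and never change.
const darkTokens: TokenMap = {
  ...lightTokens,
  "color.brand.blue.300": "#93C5FD", // assumed value
  "color.action.primary": "{color.brand.blue.300}",
};

const lightButton = resolveToken("button.background", lightTokens); // "#2563EB"
const darkButton = resolveToken("button.background", darkTokens); // "#93C5FD"
```

The cycle guard matters more than it looks: once tokens can point at other tokens, a careless rename can create a loop that would otherwise recurse forever.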

The problem is that we took this idea too far.

The overengineering mistake

Our first token architecture had 340 tokens. For a product with roughly 30 screens at the time. That is more than 11 tokens per screen, which should have been an obvious red flag.

We had tokens for things like spacing.card.inner.horizontal.compact and color.text.secondary.on-surface.elevated. Each token was precisely named and logically structured. It was also impossible to remember and painful to use.

The root cause was a classic engineering trap: optimizing for theoretical flexibility over practical usability. We designed the token system to handle every conceivable future scenario (multiple themes, density modes, platform-specific overrides) before we had evidence that we would need any of that.

Three days into implementation, the engineer working on the system came to a standup and said: "I cannot remember my own token names." That was the moment we knew we had a problem.

Ruthless simplification

We spent a day auditing every token against actual usage. The result: we cut the system from 340 tokens to 82. A 76% reduction.

The tokens that survived followed a simple test: does this token get used in more than three components? If not, it was either too specific (merge it with a more general token) or too abstract (nobody was actually using it).
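That test is easy to automate once you have a usage index. The sketch below assumes you can already produce a map from token name to the components that reference it (for example from a grep over the codebase); the index shape and the sample data are hypothetical.

```typescript
// Token name -> names of components that reference it (hypothetical data).
type UsageIndex = Record<string, string[]>;

// A token survives only if used in MORE than this many distinct components.
const MIN_COMPONENTS = 3;

function auditTokens(usage: UsageIndex): { keep: string[]; review: string[] } {
  const keep: string[] = [];
  const review: string[] = [];
  for (const [token, components] of Object.entries(usage)) {
    // Count distinct components so repeated uses in one file do not inflate the score.
    const distinct = new Set(components).size;
    (distinct > MIN_COMPONENTS ? keep : review).push(token);
  }
  return { keep, review };
}

const audit = auditTokens({
  "color.action.primary": ["Button", "Input", "Modal", "Card"],
  "spacing.card.inner.horizontal.compact": ["Card"],
  "color.text.secondary.on-surface.elevated": ["Modal", "Modal", "Card"],
});
// audit.keep   -> ["color.action.primary"]
// audit.review -> the two narrowly used tokens, candidates for merging or removal
```

Tokens in the review bucket still need a human decision: merge into a more general token, or delete outright.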

Some specific cuts:

Spacing went from 24 tokens to 8. We eliminated the distinction between "inner" and "outer" spacing and dropped "compact" variants entirely. Four base spacing values (4, 8, 16, 24 pixels) plus four contextual aliases (tight, normal, relaxed, loose) covered everything we needed.
Colors went from 89 tokens to 31. We collapsed the three-level reference chain into two levels for most cases. color.action.primary still exists, but it now maps directly to a value rather than going through an intermediate brand layer.
Typography went from 18 tokens to 9. We had separate tokens for font size, line height, and letter spacing. We merged them into composite tokens like type.body and type.heading.large that define all three properties together.

The simplified system took one more day to implement. Total time from start to functional design system: eight working days, not the ten we had budgeted. The extra two days went back into feature work.
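To make the composite typography tokens concrete, here is a rough sketch of what one could look like. The token names type.body and type.heading.large are from the list above; the field names and every numeric value are placeholders, not our production values.

```typescript
// One composite token bundles the three properties that used to be separate tokens.
interface TypeToken {
  fontSizePx: number;
  lineHeight: number; // unitless multiplier
  letterSpacingEm: number;
}

// Placeholder values for illustration only.
const typography: Record<string, TypeToken> = {
  "type.body": { fontSizePx: 16, lineHeight: 1.5, letterSpacingEm: 0 },
  "type.heading.large": { fontSizePx: 32, lineHeight: 1.25, letterSpacingEm: -0.01 },
};

// Expand a composite token into the CSS declarations it stands for.
function toCssDeclarations(tokenName: string): Record<string, string> {
  const token = typography[tokenName];
  if (token === undefined) {
    throw new Error(`Unknown typography token: ${tokenName}`);
  }
  return {
    "font-size": `${token.fontSizePx}px`,
    "line-height": String(token.lineHeight),
    "letter-spacing": `${token.letterSpacingEm}em`,
  };
}

const bodyCss = toCssDeclarations("type.body");
// { "font-size": "16px", "line-height": "1.5", "letter-spacing": "0em" }
```

The design point is that an engineer picks one name and gets three coordinated values, so font size and line height cannot drift apart.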

What to build vs. what to borrow

The component library question ran parallel to the token work. With four engineers, building every component from scratch is not just slow, it is irresponsible. At the same time, third-party component libraries bring their own maintenance burden and rarely match your design language exactly.

Our approach: build the primitives, compose with libraries, customize the skin.

We built five foundational components ourselves: Button, Input, Card, Modal, and Layout containers. These are the components that appear on virtually every screen and that carry the most visual identity. Getting these right meant every screen would feel cohesive even if secondary components came from elsewhere.

For everything else (date pickers, dropdowns, tooltips, data tables) we used Radix UI primitives and styled them with our token system. Radix gives us accessible, unstyled building blocks. Our tokens give them our visual language. The combination let us ship a complete component set in days rather than weeks.

One non-obvious benefit of this split: when a Radix component does not quite work for our needs, replacing it with a custom implementation is straightforward because the styling layer is already ours. We have done this twice so far, both times without touching any consuming code.
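One common way to give unstyled primitives a visual language is to expose tokens as CSS custom properties and reference those in your styles. The sketch below shows that general technique, not necessarily how our own styling layer is wired; the function names are hypothetical. The two token values are the ones mentioned in this post.

```typescript
// "color.action.primary" -> "--color-action-primary"
function toCssVariableName(token: string): string {
  return `--${token.replace(/\./g, "-")}`;
}

// Render a flat token map as a CSS rule that declares one custom property per token.
function toCssVariables(tokens: Record<string, string>, selector = ":root"): string {
  const lines = Object.entries(tokens).map(
    ([name, value]) => `  ${toCssVariableName(name)}: ${value};`
  );
  return `${selector} {\n${lines.join("\n")}\n}`;
}

const rootCss = toCssVariables({
  "color.action.primary": "#2563EB",
  "spacing.normal": "16px",
});
// :root {
//   --color-action-primary: #2563EB;
//   --spacing-normal: 16px;
// }
```

Because a component's stylesheet only ever mentions var(--color-action-primary), swapping the underlying primitive for a custom implementation leaves the styling contract untouched.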

The integration week

Rolling out a design system to an existing codebase is its own challenge. We chose a screen-by-screen migration rather than a big-bang replacement, which let us continue shipping features during the transition.

The process was mechanical but tedious. For each screen:

Replace hardcoded colors with token references
Replace custom spacing with token values
Swap ad-hoc components for library components
Visual diff against the redesign mockups
Fix the inevitable edge cases

We processed about five screens per day across the team. The entire migration took six working days, with the final day dedicated to catching inconsistencies we had introduced during the migration itself. The irony of introducing new inconsistencies while building a consistency system was not lost on us.
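The first step, replacing hardcoded colors with token references, lends itself to a small script. This is a sketch under assumptions: it targets CSS-like source text, emits CSS custom properties, and uses a hand-written mapping in which only #2563EB and color.action.primary come from this post. Anything it cannot map is reported rather than guessed.

```typescript
// Lowercase hex -> token name. The gray entry is hypothetical.
const colorToToken: Record<string, string> = {
  "#2563eb": "color.action.primary",
  "#6b7280": "color.text.secondary",
};

// Matches 6-digit or 3-digit hex colors such as #2563EB or #fff.
const HEX_COLOR = /#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b/g;

function replaceHardcodedColors(
  source: string,
  mapping: Record<string, string>
): { output: string; unmapped: string[] } {
  const unmapped = new Set<string>();
  const output = source.replace(HEX_COLOR, (hex) => {
    const token = mapping[hex.toLowerCase()];
    if (token === undefined) {
      unmapped.add(hex); // leave unknown colors alone and flag them for review
      return hex;
    }
    return `var(--${token.replace(/\./g, "-")})`;
  });
  return { output, unmapped: Array.from(unmapped) };
}

const migrated = replaceHardcodedColors(
  "color: #2563EB; background: #ABCDEF;",
  colorToToken
);
// migrated.output   -> "color: var(--color-action-primary); background: #ABCDEF;"
// migrated.unmapped -> ["#ABCDEF"]
```

The unmapped list is the useful part: each entry is either a drifted shade to fold into an existing token or a genuine gap in the palette.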

The conversation nobody wants to have

Allocating engineering time to infrastructure at a four-person startup requires a conversation that feels uncomfortable. Someone has to advocate for invisible work while the rest of the team is shipping things users can see.

We handled this by framing the design system as a feature, not as tech debt cleanup. The deliverable was not "better code" but "faster future shipping." That framing mattered because it tied the investment to a measurable outcome. We committed to tracking feature velocity before and after, which created accountability for the decision.

The engineer who led the design system work described it as the most isolated two weeks of their time at Well. Feature work generates visible progress: pull requests, demos, user feedback. Infrastructure work generates none of that until it is done. We underestimated the psychological cost of that isolation, and in retrospect, pairing on the system for at least a few days would have been worth the additional capacity cost.

One tactical lesson: we presented the design system to the full team at the end of week one, before it was finished. This served two purposes. It gave the team visibility into the work, and it surfaced usability issues early. The feedback from that session is what triggered the token simplification. If we had waited until the system was "done," we would have shipped 340 tokens and spent additional time scaling back.

Measuring the payoff

We are now three weeks past the design system launch. The numbers:

Feature completion time dropped by 33%. New screens that averaged 12 hours before the design system now average 8 hours. The savings come almost entirely from eliminated micro-decisions. Engineers pick tokens from a constrained set instead of eyeballing values.

Design review cycles shortened. Before the system, design reviews averaged 2.3 rounds of feedback per screen, usually catching spacing or color inconsistencies. Post-system, we average 1.1 rounds. The system catches the inconsistencies before a human needs to.

Onboarding signal (early). We have not onboarded a new engineer since the system launched, so this is projected. But our documentation now includes a component playground where you can see every available component with its variants. Previous onboarding involved reading through existing screens and pattern-matching. The playground should compress that considerably.

The two-week investment paid for itself within the first two weeks of use, measured purely on feature velocity. The less quantifiable benefits (visual coherence, reduced cognitive load during development, faster design-to-code translation) compound over time.
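For anyone who wants to run the same arithmetic on their own numbers, here is a small sketch. The before and after figures are the ones quoted in this post; the 8-hour working day is an assumption, and the break-even estimate counts only the eight days of system work, not the migration.

```typescript
// Percentage drop from a "before" value to an "after" value.
function percentReduction(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

// Number of new screens needed before the hours saved cover the hours invested.
function breakEvenScreens(investmentHours: number, hoursSavedPerScreen: number): number {
  return Math.ceil(investmentHours / hoursSavedPerScreen);
}

const HOURS_PER_DAY = 8; // assumption, not a figure from this post

const screenTimeDrop = percentReduction(12, 8); // about 33.3%
const reviewRoundDrop = percentReduction(2.3, 1.1); // about 52.2%
const tokenCountDrop = percentReduction(340, 82); // about 75.9%

const investmentHours = 8 * HOURS_PER_DAY; // eight working days of one engineer
const screensToBreakEven = breakEvenScreens(investmentHours, 12 - 8); // 16
```

Under those assumptions the system work is repaid after 16 new screens; how quickly that happens depends entirely on how many screens the team ships per week.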

What we would do differently

Three things, in order of how much they cost us:

Start with fewer tokens, then add. Our instinct was to design a complete system upfront. The better approach is to start with the minimum viable token set (we estimate around 50 tokens for a product our size) and add tokens only when you find yourself repeating a value across multiple components. Let usage drive the architecture, not speculation.
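"Let usage drive the architecture" can be made mechanical: scan for raw values that recur across components and treat those as token candidates. A minimal sketch, assuming you have already extracted the raw style values per component; the input shape, the threshold, and the sample data are hypothetical.

```typescript
// Component name -> raw style values found in it (hypothetical extraction result).
type ComponentStyles = Record<string, string[]>;

// Return raw values that appear in at least `minComponents` distinct components,
// in order of first appearance.
function findTokenCandidates(styles: ComponentStyles, minComponents = 2): string[] {
  const componentsByValue = new Map<string, Set<string>>();
  for (const [component, values] of Object.entries(styles)) {
    for (const value of values) {
      const users = componentsByValue.get(value) ?? new Set<string>();
      users.add(component);
      componentsByValue.set(value, users);
    }
  }
  const candidates: string[] = [];
  componentsByValue.forEach((users, value) => {
    if (users.size >= minComponents) candidates.push(value);
  });
  return candidates;
}

const candidates = findTokenCandidates({
  Button: ["16px", "#2563EB"],
  Card: ["16px", "12px"],
  Modal: ["16px", "#2563EB"],
});
// candidates -> ["16px", "#2563EB"]; "12px" appears in one component and stays a one-off
```

A value that shows up once stays a literal. It earns a token name only when a second component needs it.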

Migrate incrementally from day one. We waited until the token system was "complete" before starting migration. Even going screen by screen, that compressed the whole transition into one stressful, error-prone stretch. A better approach: define your first ten tokens, migrate the most common patterns immediately, and expand the system as you go.

Document decisions, not just tokens. Our initial documentation listed every token with its value. What it lacked was the reasoning: why spacing.normal is 16px and not 12px, why we chose a two-level color hierarchy instead of three. When we revisit decisions months from now, the "what" will be obvious from the code. The "why" will not.

The meta-observation

Building a design system with four engineers is an exercise in resource allocation under constraint. Every hour spent on infrastructure is an hour not spent on features. Every shortcut on infrastructure creates future drag on features. There is no formula for the right balance; there is only the discipline to watch the signals and act when the cost of not investing exceeds the cost of investing.

For us, that inflection point came at month six. For larger teams with more parallel workstreams, it might come sooner. For solo founders, it might never come, and that is fine. The worst outcome is building infrastructure you do not need yet, closely followed by not building infrastructure you needed last month.

We chose to share this process because "building in public" has become a phrase that often means curated success stories. The reality of building anything is that you make mistakes, overengineer things, and learn by undoing your own work. Our design system is better because our first version was too complicated. That is not a failure of planning. That is how building works.

The 47 shades of gray are gone. We have 6 now. And we know exactly why each one exists.

Maxime is the CEO and co-founder of Well. He built Well to rebuild finance around AI-native data, not spreadsheets.
