How We Built AI Response Modality: Teaching an LLM to Choose Text, Grid, or Chart

Maxime Champoux | February 24, 2026 | 10 min read

Most AI products treat presentation as an afterthought. The model generates an answer, the frontend renders it as text, and everyone moves on. But when you're building a data product — where a user might ask "show me revenue by quarter" or "why did churn spike in March" — the format of the answer matters as much as the content.

At Well, we spent weeks on a problem that sounds trivial: how should the AI decide whether to respond with a paragraph of text, a data grid, or a chart?

The answer taught us something about LLMs that applies far beyond formatting.

Three Rendering Paths

Every query that hits Well's AI layer routes to one of three rendering paths.

Text handles explanations, causal reasoning, and narrative answers. Ask "why did customer acquisition cost increase last quarter" and you get prose — a structured explanation that walks through contributing factors.

Grid handles tabular data, ranked lists, and structured comparisons. Ask "show me our top 10 customers by revenue" and you get a sortable table with columns, not a paragraph describing each customer.

Chart handles trends, distributions, and visual comparisons. Ask "how has monthly recurring revenue changed over the past year" and you get a line chart, not twelve numbers listed in a sentence.

The distinction feels obvious when you read these examples. It was not obvious to build.

The Rules Temptation

The first instinct — and the one most engineering teams follow — is to build a rules engine.

The logic seems straightforward. If the query contains "list" or "top N," render a grid. If it contains "trend" or "over time," render a chart. If neither matches, default to text.
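As a rough illustration, a first pass at that logic might look like the sketch below. The regex patterns and the name `choose_format_by_rules` are assumptions made for this example, not the actual rules file.

```python
import re

# Hypothetical keyword patterns, chosen to mirror the rules described above.
GRID_PATTERN = re.compile(r"\blist\b|\btop \d+\b", re.IGNORECASE)
CHART_PATTERN = re.compile(r"\btrend\b|\bover time\b", re.IGNORECASE)


def choose_format_by_rules(query: str) -> str:
    """Pick a rendering path from keywords alone, defaulting to text."""
    if GRID_PATTERN.search(query):
        return "grid"
    if CHART_PATTERN.search(query):
        return "chart"
    return "text"
```

Anything that does not contain one of the keywords falls through to text, whatever the user actually meant.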

We built this. It took about two days. It handled roughly 60% of queries correctly.

The remaining 40% exposed the problem with rules: natural language is not keyword-shaped. "What's happening with our enterprise accounts" could need any of the three formats depending on context. A user asking "break down Q3 performance" might want a grid of metrics, a chart comparing months, or a text explanation of what drove the numbers. The phrase "break down" tells you nothing about format. It tells you about intent, and intent requires understanding.

We added more rules. We handled synonyms, built regex patterns, layered in heuristics for question structure. The rules file grew to several hundred lines. Accuracy crept up to around 75%. Every edge case we fixed introduced a new one.

Two weeks in, the rules engine was the most-edited file in the codebase. That is usually a sign you are solving the wrong problem.

Letting the LLM Decide

The alternative was to let the language model itself choose the rendering format. This felt risky. LLMs hallucinate facts — could we trust one to make a UX decision?

The argument for trying it was simple: format selection is a reasoning task. Given a query, you need to understand what the user wants, consider what kind of data will answer it, and pick the presentation that makes that data most useful. LLMs are good at exactly this kind of contextual reasoning. Rules engines are not.

We added a section to the system prompt that described the three modalities, when each is appropriate, and provided examples. Not abstract descriptions. Concrete query-and-format pairs.

The prompt section looked roughly like this:

You will respond in one of three formats: text, grid, or chart. Choose based on the user's intent and the shape of the data.

Use TEXT when the answer requires explanation, reasoning, or narrative. Example: "Why did conversion rates drop in February?" → text.

Use GRID when the answer is a list, table, or structured comparison of discrete items. Example: "Show me all deals closed this quarter" → grid.

Use CHART when the answer involves trends over time, distributions, or visual comparisons between categories. Example: "How has pipeline value changed month over month?" → chart.

When ambiguous, prefer text. The user can always request a different format.

That last line matters. It establishes a default and acknowledges that the system will sometimes be wrong.
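The same default can be enforced in application code. The sketch below assumes the model is asked to return a JSON object with a `format` field; that wire format, and the name `parse_modality`, are assumptions for illustration only.

```python
import json

VALID_FORMATS = {"text", "grid", "chart"}
DEFAULT_FORMAT = "text"  # mirrors "when ambiguous, prefer text"


def parse_modality(raw_response: str) -> str:
    """Extract the model's chosen format, falling back to text if unusable."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return DEFAULT_FORMAT
    if not isinstance(payload, dict):
        return DEFAULT_FORMAT
    # Hypothetical field name; normalize case and whitespace before checking.
    chosen = str(payload.get("format", "")).strip().lower()
    return chosen if chosen in VALID_FORMATS else DEFAULT_FORMAT
```

Treating malformed or unexpected output as text keeps a bad model response from breaking the page.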

What 90% Accuracy Looks Like

With the prompt-based approach, the LLM picks the correct format roughly 90% of the time. That number comes from internal testing across several hundred real queries from beta users.

The failures in the remaining 10% follow a pattern. Most involve ambiguous queries where even a human would hesitate. "How are we doing" could be a text summary, a grid of KPIs, or a chart of trends. The model tends to default to text for these, which is usually acceptable if not optimal.

A smaller set of failures involves the model over-indexing on the chart format. Time-related words like "monthly" or "quarterly" sometimes trigger a chart even when the user is asking a causal question. "Why did our quarterly numbers look off" should probably be text, but the model occasionally renders a chart of quarterly values. It answers a different question: a valid one, but not the one asked.

We handle this with an override. A small control in the UI lets users switch between formats after the response loads. The data is already available; switching format only requires re-rendering, not re-querying.
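A minimal sketch of what re-rendering without re-querying can look like, assuming the response is stored once in a format-agnostic shape. The `ResponseData` fields and the `render` function are hypothetical names, and the chart branch assumes the first two columns map to the x and y axes.

```python
from dataclasses import dataclass


@dataclass
class ResponseData:
    """Format-agnostic result: the same stored rows back every view."""
    columns: list[str]
    rows: list[list[object]]
    summary: str


def render(data: ResponseData, fmt: str) -> dict:
    """Build a view description from stored data; no new query is issued."""
    if fmt == "text":
        return {"type": "text", "body": data.summary}
    if fmt == "grid":
        return {"type": "grid", "columns": data.columns, "rows": data.rows}
    if fmt == "chart":
        # Assumption: first column is the x-axis, second column is the y-axis.
        return {
            "type": "chart",
            "x": [row[0] for row in data.rows],
            "y": [row[1] for row in data.rows],
        }
    raise ValueError(f"unknown format: {fmt}")
```

Switching formats then amounts to calling `render` again on the same `ResponseData` with a different `fmt`.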

This override serves a dual purpose. It fixes the 10% of wrong choices, and it generates training signal. Every override is a label: "for this query, the user preferred grid over text." We log these to evaluate prompt changes and, eventually, to fine-tune.
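One possible shape for that log, sketched under the assumption that each override is recorded as a small event. The `OverrideEvent` fields and `summarize_overrides` are illustrative names, not the actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class OverrideEvent:
    """One label: the model chose model_format, the user switched to user_format."""
    query: str
    model_format: str
    user_format: str


def summarize_overrides(events: list[OverrideEvent]) -> Counter:
    """Count (model choice, user choice) pairs to show where the prompt misfires."""
    return Counter((e.model_format, e.user_format) for e in events)
```

A high count for a pair such as `("chart", "text")` would point at the over-indexing on charts described above.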

Performance Trade-offs

The three paths are not equally expensive to render.

Text is fastest. The LLM streams tokens, the frontend renders them incrementally. Users see the first words within a second.

Grid is nearly as fast. The model outputs structured data, usually JSON or a markdown table, and the frontend parses it into an interactive table. The parsing adds a few milliseconds. Functionally instant.

Chart adds meaningful latency. After the model produces the data, the frontend needs to determine chart type, configure axes, handle edge cases in the data shape, and render the visualization. On average, charts take 1.5 to 2 seconds longer than text from query to fully rendered response.

We pre-built a library of chart templates for common query patterns. Revenue over time gets a line chart. Category comparisons get a bar chart. Proportional breakdowns get a donut chart. This avoids the cost of dynamic chart configuration for the most frequent cases.

The library covers about 70% of chart responses. The remaining 30% use a general-purpose charting pipeline that takes longer but handles novel data shapes. We are gradually expanding the template library based on usage patterns.
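The template-first lookup can be sketched as a dictionary with a fallback. The data-shape keys and the `pick_chart` function are assumptions for illustration; only the three chart pairings come from the description above.

```python
# Hypothetical mapping from a detected data shape to a pre-built template.
CHART_TEMPLATES = {
    "time_series": "line",              # e.g. revenue over time
    "category_comparison": "bar",
    "proportional_breakdown": "donut",
}


def pick_chart(data_shape: str) -> tuple[str, bool]:
    """Return (chart kind, used_template); unknown shapes go to the slow path."""
    template = CHART_TEMPLATES.get(data_shape)
    if template is not None:
        return template, True
    # Fallback: the general-purpose pipeline configures the chart dynamically.
    return "dynamic", False
```

Logging the second element of the result is one way to see which shapes miss the library most often and deserve a template next.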

The Prompt as Product Surface

Here is the part that changed how we think about LLMs in product development.

The system prompt that guides modality selection is, functionally, a product specification. It encodes UX decisions — what kind of answer a user wants given certain kinds of questions. When we update it, we are not tweaking an algorithm. We are redefining how the product behaves.

This makes the prompt a product surface in the same way a design system is a product surface. It is opinionated, it shapes user experience, and it needs to be maintained with the same care as any user-facing code.

Traditional software encodes product decisions in branching logic. If-else trees, state machines, configuration files. These are explicit and auditable. You can trace exactly why the system made a decision.

Prompt-based decisions are different. You can see the instructions you gave the model, but you cannot trace the exact reasoning path it took for a given query. This is uncomfortable for engineers accustomed to deterministic systems. It requires a different kind of quality assurance — statistical rather than logical, evaluated on distributions rather than edge cases.

We run a test suite of 200 queries with expected formats. After any prompt change, we run the suite and check that accuracy stays above 88%. If it drops, the change does not ship. This is closer to how you'd test a machine learning model than how you'd test a feature flag.
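A minimal sketch of such a gate, assuming each test case is a (query, expected format) pair and that `choose_format` wraps the prompt under test. Both function names are hypothetical, and treating exactly 88% as passing is an assumption.

```python
from typing import Callable

ACCURACY_THRESHOLD = 0.88  # the release gate described above


def modality_accuracy(
    cases: list[tuple[str, str]], choose_format: Callable[[str], str]
) -> float:
    """Fraction of queries where the chosen format matches the expected one."""
    if not cases:
        raise ValueError("test suite is empty")
    correct = sum(1 for query, expected in cases if choose_format(query) == expected)
    return correct / len(cases)


def should_ship(
    cases: list[tuple[str, str]], choose_format: Callable[[str], str]
) -> bool:
    """A prompt change ships only if suite accuracy stays at or above the gate."""
    return modality_accuracy(cases, choose_format) >= ACCURACY_THRESHOLD
```

The check passes or fails on the aggregate score, so an individual query is allowed to regress as long as the distribution holds.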

The Deeper Lesson

The initial question — rules or LLM for format selection — turned out to be the wrong framing.

The real question is: where in your product should decisions be deterministic, and where should they be probabilistic?

Deterministic decisions work when the input space is small and well-defined. Authentication should be deterministic. Permissions should be deterministic. Pricing calculations should be deterministic.

Probabilistic decisions work when the input space is large, ambiguous, and context-dependent. Format selection from natural language input is exactly this kind of problem. So are tone adjustment, content summarization, and query interpretation.

The mistake most teams make is trying to force probabilistic problems into deterministic solutions. The rules engine we built was not bad engineering. It was a deterministic solution to a probabilistic problem. It could never work well enough because the problem space — natural language queries mapped to presentation formats — is inherently fuzzy.

Going the other direction is equally dangerous. Making authentication probabilistic ("the LLM thinks this looks like a valid session") would be absurd. The skill is in classification: knowing which decisions belong in which category.

For us, format selection was the first domain we moved from deterministic to probabilistic. It will not be the last. As we build more AI-native features, the boundary between "hard-coded logic" and "model-decided behavior" keeps shifting. The product roadmap is increasingly about deciding where that boundary sits for each feature.

What We Would Do Differently

If we rebuilt the modality system from scratch, three things would change.

First, we would skip the rules engine entirely. The two weeks spent building and iterating on rules taught us why rules fail for this problem, but we could have learned that lesson in two days with a quick prototype. The sunk cost of the rules engine delayed the prompt-based approach.

Second, we would build the override mechanism from day one. We shipped the initial LLM-based version without user overrides, assuming 90% accuracy was sufficient. Users told us otherwise. The 10% of wrong format choices were disproportionately annoying because users had no recourse. Adding overrides later required reworking the response data model to support format-agnostic data storage.

Third, we would invest earlier in the chart template library. Dynamic chart generation is the main source of latency complaints, and templating the common cases was straightforward. We delayed it because it felt like optimization before we had proven the feature. In retrospect, chart latency was a barrier to user trust in the entire modality system. Fast charts made users more willing to accept the AI's format choices across the board.

Implications for AI Product Design

Every AI product that presents information faces this problem. Search engines, analytics tools, coding assistants, customer support bots — all of them make implicit or explicit decisions about how to format responses.

Most default to text because text is the easiest to generate and the hardest to get wrong. But defaulting to text is itself a product decision, and often the wrong one. When a user asks for data and receives a paragraph, they have to do the work of parsing numbers from prose. That is a cost you have pushed onto the user.

The format of an AI response is not decoration. It is part of the answer. A chart of revenue over twelve months communicates trajectory in a way that twelve numbers in a sentence cannot. A grid of customer records lets users sort and scan in a way that prose descriptions prevent.

Treating format selection as an intelligence problem — something the AI reasons about, not something a rules engine prescribes — produces better results. Not perfect results. Better ones. And with override mechanisms and feedback loops, better results that improve over time.

The medium is part of the message. The AI should be smart enough to know that.

Maxime Champoux

CEO & co-founder, Well

Maxime is the CEO and co-founder of Well. He built Well to rebuild finance around AI-native data, not spreadsheets.
