Engineering Digital
Transformation with
Intelligence
We transform businesses through software innovation and intelligent systems. We enable digital transformation by combining deep engineering capabilities with a strong understanding of business, operational, and industry-specific realities.
Building a GenAI MVP: Five Gaps Between Working Prototype and Shippable Product
27 April 2026.
Reading Time: 4 Minutes.
Across the GenAI platforms we’ve audited in 2026, the same five gaps appear in almost every codebase: no prompt caching, no tool use, no structured output, no retry on streaming, no cost ceiling. None of these are exotic. All are easier to build in early than retrofit under production load. Miss them and your team is building Spaghetti AI.
Take a representative case from 2026: a GenAI platform built to ingest market informational signals and synthesize them into actionable intelligence for strategic decision making.
Claude Opus at the center. FastAPI backend. Next.js 15 frontend. On paper, the stack looked modern: Claude API plus LangGraph for orchestration, PostgreSQL with pgvector for retrieval, Redis for cache, Server Sent Events for streaming.
Then the code review told a different story.
When the architecture diagram doesn't match the implementation
The architecture document listed LangGraph as the orchestration layer. The codebase had zero LangGraph imports.
Every Claude call was a single-shot messages.create with the entire context pre-baked into the prompt. No multi-step reasoning. No subagents. No tool use.
This is the most common pattern across the GenAI platforms audited in 2026 platforms claiming agentic architecture. It deserves a name: prompt-stuffing masquerading as an agent. It’s a tactical version of the AI Execution Gap the distance between what your architecture diagram claims and what the code actually does.
This kind of MVP typically lands at 65% ready the standard midpoint for any GenAI build at the prototype-to-product transition. Core CRUD works. Analysis returns useful outputs. What’s missing is everything that turns a prototype into an operatable system.
The five gaps that show up everywhere
- No prompt caching The app tracked cache_read_tokens in its billing ledger but never sent cache_control headers in any actual API call. Every request re-processed the same 400–800 token system prompt from scratch.
At 10,000 requests a day, that’s three-figure daily waste and a latency tax on every user interaction.
The fix: Fifteen lines of code. Mark the stable system prompt as cacheable. Cheapest win in GenAI engineering. - No tool use
Signals were concatenated into the prompt as plain text. Claude received a wall of context and was expected to reason over it in one pass. If a signal was missing, Claude had no way to ask for it.
Tool use flips this. Give Claude a get_regulatory_data(region) tool and let it decide what to pull.
Result: More targeted outputs, shorter prompts, lower costs. - No structured output Every synthesis prompt ended with “Return ONLY valid JSON.” A try/except JSONDecodeError silently swallowed failures.
This is a flag for “one model update away from an outage.” Use the model’s native structured-output features. Don’t bolt string-parsing onto hope. - No retry or partial-result recovery on streams
Streaming looked great in the demo. In production, a mid-stream network hiccup dropped the connection, lost the partial output, and showed the user an error. No retry. No resume. No graceful degradation.
Budget for the failure modes on day one. - No cost ceiling
The platform measured token usage but didn’t enforce limits. Any authenticated user could trigger unlimited synthesis calls.
One runaway loop. One malicious user. One bad test script. The Anthropic bill becomes a board-level conversation. Meter, alert, and cap. In that order. Before launch.
What we got right
Three patterns held up well.
A single Claude chokepoint. Every call routed through one _call_claude() method. Uniform logging, error handling, token accounting. Boring, effective, and makes retrofitting features like caching or retries a one-file change.
Skills as orthogonal injection. A skills registry let non-engineers tune the model’s behavior (tone, domain focus, output format) without a code deploy. Skills were defined as data, resolved at request time.
Signal providers as a plugin pattern. Each signal source implemented the same interface. Adding a new source was mechanical. This is the shape tool use should take.
The playbook: best practices for GenAI MVPs
Start from the API, not the architecture diagram. Get one prompt working end-to-end with four fundamentals: prompt caching on the stable system prompt, structured output (JSON mode or tool use), streaming with retry and partial-result handling, and token accounting wired to a dashboard. Without these, you have a prototype with a production URL — not an MVP.
Build one chokepoint, then extend it. Route every model call through a single function. Put caching, retries, logging, cost tracking, and circuit breaking there. You’ll thank yourself the first time you need to swap models or investigate a cost spike.
Prefer tool use over prompt-stuffing. If your prompt exceeds a page of context, Claude is reading things it doesn’t need. Convert data-fetching into tools. Let the model pull on demand. Shorter prompts, better answers, lower cost.
Treat prompts as code. Version them. Diff them. Review them. A prompt change is a behavior change — it deserves the same rigor as a migration.
Put a ceiling on spend before launch. Per-user rate limits. Per-org token budgets. Alerts on anomalous usage. Circuit breakers that disable gracefully instead of draining the account. Do this before the first external user sees the product.
Match the claim to the code. If your architecture doc says “agentic,” the code should contain an agent. If it says “LangGraph orchestration,” there should be a graph. Aspirational documentation is a recruiting liability and an onboarding landmine.
Evals are not optional (but they can start small). You don’t need an eval harness on day one. You do need a shared doc of twenty golden inputs and their expected outputs. Run it before every prompt change. Cheapest quality gate you’ll build.
The 65% pattern
65% is the midpoint across the GenAI platforms audited in 2026. Core product works. Users get value. Architecture claims are aspirational. Cost control is missing. Observability is limited. Retry logic isn’t wired. Streaming resilience isn’t built.
None of these gaps are exotic. All are easier to add early than to retrofit under load. The difference between a GenAI prototype and a GenAI product is almost entirely in these five decisions.
The takeaway
Building with Claude is fast. Building well with Claude requires engineering discipline plus a handful of LLM-specific practices that are easy to skip and expensive to skip:
- Prompt caching
- Tool use
- Structured output
- Streaming recovery
- Cost ceilings
Treat those five as table stakes. Keep your architecture claims honest. Your MVP will spend less time apologizing and more time delivering the intelligence your users came for.
Frequently asked
What is the 65% readiness gap?
The typical MVP we audit has working core functionality but lacks prompt caching, tool use, structured output, retry logic, and cost ceilings. Each is small individually; collectively they’re the difference between a demo and a shippable product.
Is prompt-stuffing really that bad?
At small scale, no. At production scale, yes. It eats cache opportunities, burns tokens on context the model doesn’t need, and creates the same coupling patterns that produce Spaghetti AI in larger systems.
How do you enforce cost ceilings without breaking UX?
Per-user token budgets with graceful degradation when a user hits the ceiling, the system returns a cached response or a simpler model, not an error. Per-org circuit breakers handle the anomaly cases.


