From AI MVP to production: what it actually takes

Software engineering | Zegaware engineering | 23 June 2026 | 11 min read

Last updated: 23 June 2026

A minimum viable product (MVP) built with artificial intelligence (AI) becomes production-ready when its security, authorisation, tests, error handling, observability, performance and architecture are brought to a standard a senior engineer will sign off. The demo proves the idea works. Turning it into a product means hardening what is sound, replacing what is not, and making it safe to operate under real load.

The gap between a demo and a system you can operate

AI coding tools are genuinely good at getting to a working minimum viable product (MVP). They turn a clear description into running software in hours, and the result often demonstrates the core idea convincingly. That is real value, and it is why so many teams now start here.

A demo and an operable system are measured differently, though. A demo runs the happy path, on one machine, for one cooperative user, with clean data. Production runs many users at once, accepts hostile input, fails partially, touches real money and real data, and has to keep running for years while a team changes it.

The benchmarks make the gap concrete. Carnegie Mellon University's SUSVIBES study found that agent-written code was roughly 61% functionally correct but only 10.5% secure, and that over 80% of the working solutions still contained a critical vulnerability [1]. In other words, "it works" and "it is safe to operate" are separate measurements, and AI output scores far higher on the first than on the second.

None of this means the MVP is wasted. It means the MVP is a strong first draft of a product, not the product itself. The honest question is not whether AI-built code is acceptable, a topic we cover in "Is AI-generated code safe to ship?", but what specific work stands between the draft and something you can put in front of real users.

What "production-ready" actually means

Production-ready is not a feeling or a launch date. It is a set of properties you can check, each answering a question an operator will eventually ask. Six matter most.

Property	The question it answers	How we check it
Security and authorisation	Can only the right people do the right things?	Authentication on every route, authorisation per action, secrets out of the code, input validated
Tests that assert behaviour	Will we know when something breaks?	Tests that fail on wrong output, not only on a thrown error
Error handling and observability	Does the system tell the truth about what it is doing?	Structured logs, metrics, traces, alerts on real failure
Performance under real load	Does it hold up with real users and real data?	Load tested at expected concurrency and data volume
Maintainable architecture	Can a team change it safely?	Clear boundaries, low duplication, code the team can explain
Operability	Can we run it day to day?	Repeatable deploys, configuration, backups, rollback, migrations

Security and authorisation

Every action is authenticated and authorised, secrets stay out of the source code, and untrusted input is validated before use. The Open Worldwide Application Security Project (OWASP) ranks Broken Access Control first in its Top 10 2021, up from fifth place, and reports that 94% of applications were tested for some form of broken access control [9]. Veracode found that 45% of AI-generated code contains at least one OWASP Top 10 vulnerability, with no improvement across two years, and with the code failing 85% of cross-site scripting tests and 87% of log injection tests [2]. Veracode summarised the position bluntly: "The productivity revolution is here. The security revolution isn't" [2].
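To make the two failure categories Veracode names concrete, here is a minimal sketch of output encoding and log neutralising, assuming a Python service. The helper names `render_comment` and `safe_log_value` are hypothetical. Only `html.escape` is a real standard-library call, and a real application would normally lean on its framework's templating and logging rather than hand-rolled helpers.

```python
import html


def render_comment(text: str) -> str:
    """Escape untrusted text before placing it in HTML (guards against cross-site scripting)."""
    return f"<p>{html.escape(text)}</p>"


def safe_log_value(text: str) -> str:
    """Neutralise line breaks so untrusted text cannot forge extra log lines (log injection)."""
    return text.replace("\r", "\\r").replace("\n", "\\n")


# Hostile input of the kind a demo never sees.
print(render_comment("<script>alert(1)</script>"))
print(safe_log_value("alice\nERROR admin login succeeded"))
```

The point of the sketch is the placement: the untrusted value is treated at the boundary where it is used, not trusted because it arrived through the application's own form.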

Tests that assert behaviour

A real test fails when the behaviour is wrong, not merely when the code throws an error. Coverage percentages mean little if the assertions only check that a function returned something at all. The purpose of a test suite is to tell you, quickly and honestly, that a change broke something it should not have.
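The difference is easiest to see side by side. The sketch below is illustrative only: `apply_discount`, `buggy_discount` and the two check functions are hypothetical names, and the bug is invented for the example. The hollow check accepts both implementations; the asserting check rejects the broken one.

```python
from typing import Callable

Discount = Callable[[int, int], int]


def apply_discount(price_pence: int, percent: int) -> int:
    """Return the price after a percentage discount, rounded down to whole pence."""
    return price_pence * (100 - percent) // 100


def buggy_discount(price_pence: int, percent: int) -> int:
    """Invented bug: returns the discount amount instead of the discounted price."""
    return price_pence * percent // 100


def hollow_check(fn: Discount) -> bool:
    # Only proves the function returned something without raising.
    return fn(1000, 20) is not None


def asserting_check(fn: Discount) -> bool:
    # Pins the expected values, including the 0% and 100% boundaries.
    return fn(1000, 20) == 800 and fn(1000, 0) == 1000 and fn(1000, 100) == 0


for name, fn in [("apply_discount", apply_discount), ("buggy_discount", buggy_discount)]:
    print(name, "hollow:", hollow_check(fn), "asserting:", asserting_check(fn))
```

Both functions would show the same line coverage under the hollow check, which is why coverage alone says little about the safety net.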

Error handling and observability

A production system tells the truth about what it is doing through logs, metrics, traces and alerts. This matters because failures are often silent: Arize's field analysis of production failures found systems reporting "success" on a Hypertext Transfer Protocol (HTTP) 200 status while the actual outcome was wrong [6]. A failure that cannot be seen cannot be fixed.
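One common way to make logs queryable is to emit one JSON object per event. The following is a minimal sketch using Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper and the convention of passing `extra={"fields": {...}}` are assumptions made for illustration, not a prescribed format.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Assumed convention: callers pass extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream) -> logging.Logger:
    """Attach a single JSON handler writing to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stands in for stdout or a log shipper
log = build_logger("orders", buffer)
log.info("order_placed", extra={"fields": {"order_id": 42, "outcome": "charged"}})
print(buffer.getvalue().strip())
```

Recording the business outcome (`outcome` here) alongside the event is what lets an alert fire on a wrong result rather than only on a non-200 status.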

Performance under real load

Production-ready software holds its response times with realistic concurrency and data volume, not just with a single test record. CodeRabbit's comparison of AI-authored and human-authored changes found performance issues appearing 1.42 times more often in AI-authored code [3]. Load is where shortcuts surface.

Maintainable architecture

A maintainable system is one a team can still change safely a year later. A 2025 study analysing AI-generated code found that it tends to be simpler but more repetitive, and more prone to unused constructs [8]. Repetition and dead code are not cosmetic: they raise the cost and the risk of every future change.

Operability

Operable software can be deployed, configured, backed up, rolled back, monitored and migrated without heroics. This is the property MVPs most often skip entirely, because a demo never needs a second environment or a rollback plan.

The specific work of productionising an AI-built MVP

When our senior engineers review an AI-built MVP, the same findings recur. The Vibe Code Audit exists to surface them systematically, and we describe the pattern in detail in "What a vibe code audit actually finds". The seven findings below account for most of the productionisation work.

Recurring finding	Why it is dangerous in production	Evidence
Exposed secrets	Keys in code or git history grant direct access	AI-assisted commits leak secrets at 3.2% versus a 1.5% baseline [4]
Missing authorisation	Users can act on data that is not theirs	Broken Access Control is the top OWASP risk [9]
Hollow tests	A green pipeline with no real safety net	AI-authored changes carry about 1.7 times more issues [3]
Silent failures	Bad outcomes are reported as success	Systems return HTTP 200 while the outcome is wrong [6]
Architectural drift	Repetition and dead code make change risky	AI code is simpler but more repetitive, with more unused constructs [8]
Hallucinated dependencies	Imports of packages that do not exist invite supply-chain attack	19.7% of AI samples referenced a non-existent package [5]
Performance collapse under load	Works for one user, fails for a thousand	Performance issues appear 1.42 times more often [3]

Exposed secrets. GitGuardian's 2026 report found that commits made with AI assistance leaked secrets at 3.2%, against a 1.5% baseline, set within a wider total of 28.65 million new secrets pushed to public repositories during 2025 [4]. A key in source control is a key an attacker can find.
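A typical remedy is to load secrets from the environment and fail fast when one is missing. The sketch below assumes a Python service; `require_secret`, `MissingSecret` and the variable name `PAYMENTS_API_KEY` are hypothetical. Moving a key out of the code does not remove it from git history, so a key that was ever committed should also be rotated.

```python
import os


class MissingSecret(RuntimeError):
    """Raised at startup when a required secret is absent."""


def require_secret(name: str) -> str:
    """Read a secret from the environment, refusing to start without it."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise MissingSecret(f"environment variable {name} is not set")
    return value


# For demonstration only: in a real deployment the platform or a secrets
# manager sets this variable, never the source code.
os.environ["PAYMENTS_API_KEY"] = "demo-value"
api_key = require_secret("PAYMENTS_API_KEY")
print("secret loaded, length", len(api_key))
```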

Missing authorisation. Routes that check who you are but never check what you are allowed to do. This is the Broken Access Control category OWASP places first [9], and it is the finding with the widest blast radius, because it exposes other users' data rather than just breaking one feature.
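In code, the missing piece is usually a single ownership check between loading a record and returning it. A minimal sketch, with hypothetical `User` and `Invoice` types and an in-memory dictionary standing in for the database:

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """The caller is authenticated but not allowed to perform this action."""


@dataclass(frozen=True)
class User:
    id: int
    is_admin: bool = False


@dataclass(frozen=True)
class Invoice:
    id: int
    owner_id: int
    total_pence: int


# Stand-in for a database table.
INVOICES = {
    1: Invoice(id=1, owner_id=10, total_pence=5000),
    2: Invoice(id=2, owner_id=20, total_pence=7500),
}


def can_view(user: User, invoice: Invoice) -> bool:
    """Authorisation rule: owners and admins only."""
    return user.is_admin or invoice.owner_id == user.id


def get_invoice(current_user: User, invoice_id: int) -> Invoice:
    invoice = INVOICES[invoice_id]
    # Authentication told us who the caller is; this line checks what they may do.
    if not can_view(current_user, invoice):
        raise Forbidden(f"user {current_user.id} may not view invoice {invoice_id}")
    return invoice


print(get_invoice(User(id=10), 1).total_pence)
```

Without the `can_view` check, user 10 could read invoice 2 simply by changing the identifier in the request.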

Hollow tests. A green pipeline that proves very little: tests that execute the code without asserting the result. CodeRabbit found AI-authored changes carried roughly 1.7 times more issues than human-authored ones [3], and a hollow suite is exactly what hides them.

Silent failures. Errors caught and discarded, so a broken operation still returns a success code. Arize documented systems returning an HTTP 200 status while the real outcome was wrong [6]. In production this means problems are discovered by customers, not by the team.
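The pattern is short enough to show in full. In this sketch `charge_card` is a hypothetical payment call that always fails, so that both handlers exercise their error path; the response dictionaries and status codes are illustrative.

```python
import logging

logger = logging.getLogger("orders")


class PaymentError(Exception):
    """Raised when the payment provider rejects a charge."""


def charge_card(amount_pence: int) -> None:
    """Hypothetical payment call that always fails, to exercise the error path."""
    raise PaymentError("card declined")


def place_order_silent(amount_pence: int) -> dict:
    try:
        charge_card(amount_pence)
    except Exception:
        pass  # error caught and discarded
    return {"status": 200, "ok": True}  # reports success although nothing was charged


def place_order_visible(amount_pence: int) -> dict:
    try:
        charge_card(amount_pence)
    except PaymentError as exc:
        # Record the failure and tell the caller the truth.
        logger.error("payment failed", extra={"amount_pence": amount_pence, "reason": str(exc)})
        return {"status": 402, "ok": False, "error": "payment_failed"}
    return {"status": 200, "ok": True}


print(place_order_silent(5000))
print(place_order_visible(5000))
```

Catching the specific exception, rather than a bare `Exception`, also means an unexpected error type still surfaces instead of being mislabelled as a payment problem.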

Architectural drift. Near-duplicate code, slightly different each time, alongside unused constructs left behind [8]. Forrester predicts that 75% of technology decision-makers will see technical debt rise to a moderate or high level of severity by 2026, and names AI development as an accelerant [7]. Drift is how that debt accumulates in practice.
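A rough first pass at spotting drift can be automated. The sketch below, which is an assumption about tooling rather than a description of any particular audit, groups Python functions whose parameters and bodies are structurally identical, using the standard `ast` module. It only catches exact structural copies under different names; near-duplicates that differ slightly need fuzzier comparison and human review.

```python
import ast
from collections import defaultdict


def duplicate_functions(source: str) -> list[list[str]]:
    """Group function names whose arguments and bodies have identical syntax trees."""
    groups: dict[str, list[str]] = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.dump omits line numbers by default, so copies at different
            # positions in the file produce the same fingerprint.
            fingerprint = ast.dump(node.args) + "".join(ast.dump(stmt) for stmt in node.body)
            groups[fingerprint].append(node.name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


SAMPLE = '''
def user_total(rows):
    return sum(r["amount"] for r in rows)

def customer_total(rows):
    return sum(r["amount"] for r in rows)

def average(rows):
    return sum(r["amount"] for r in rows) / len(rows)
'''

print(duplicate_functions(SAMPLE))
```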

Hallucinated dependencies. Imports of packages that do not exist. Spracklen and colleagues, presenting at USENIX Security 2025, found that 19.7% of AI code samples referenced a non-existent package, and that 43% of those hallucinations recurred across repeated runs [5]. A recurring fake name is a supply-chain risk, because an attacker can register the name an assistant keeps inventing.
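A cheap early check is to list the imports in a file that do not resolve in the current environment. This sketch uses the standard `ast` and `importlib.util` modules; the function name `unresolved_imports` and the sample package name are hypothetical. It only shows that a name is not installed locally. It does not prove that a package which does install from a public registry is the one the author intended, so pinned and reviewed dependency lists are still needed.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level imported module names that cannot be found locally."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports, which refer to the project itself.
            names.add(node.module.split(".")[0])
    return sorted(name for name in names if importlib.util.find_spec(name) is None)


SAMPLE = """
import json
from os import path
import totally_made_up_pkg_xyz
"""

print(unresolved_imports(SAMPLE))
```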

Performance that collapses under load. Code that is correct for one record and quadratic for a thousand: unbounded queries, missing pagination, no caching. The 1.42 times higher rate of performance issues CodeRabbit measured [3] tends to stay hidden until real traffic arrives.
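Bounding a query is often the first performance fix. The sketch below shows one approach, keyset pagination with a capped page size, using an in-memory SQLite database from the standard library. The `customers` table, the `fetch_page` function and the limit of 100 are assumptions for illustration.

```python
import sqlite3

MAX_PAGE_SIZE = 100  # assumed cap; choose one that suits the endpoint


def fetch_page(conn: sqlite3.Connection, after_id: int = 0, limit: int = 50) -> list[tuple[int, str]]:
    """Return at most MAX_PAGE_SIZE rows with id greater than after_id."""
    limit = max(1, min(limit, MAX_PAGE_SIZE))  # never trust a caller-supplied limit
    return conn.execute(
        "SELECT id, name FROM customers WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(1, 251)],
)

first = fetch_page(conn, limit=1000)              # request is capped at 100 rows
second = fetch_page(conn, after_id=first[-1][0])  # continue after the last id seen
print(len(first), first[-1][0], len(second), second[0][0])
```

Paging on the primary key rather than on an offset keeps each query's cost roughly constant as the table grows, because the database can seek straight to `after_id`.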

How to decide what to keep, harden, or rebuild

Not every line needs the same treatment, and rebuilding everything would throw away the MVP's main benefit. We sort the code into three buckets, and the sorting itself is most of a review. The same judgement underlies "How to review AI-generated code".

Signal in the code	Verdict
Correct, readable, low-risk, already covered by tests that assert	Keep
Right shape but missing authorisation, validation, tests or logging	Harden
Security-critical, performance-critical, built on hallucinated dependencies, or code nobody can explain	Rebuild

The deciding factors are blast radius (what breaks, and who is affected, if this fails), data sensitivity, how often the code changes, and how well the team understands it. Most AI-built code lands in "harden": the shape is right, and the work is to add authorisation, validation, real tests, error handling and logging. Rebuild is reserved for security-critical and performance-critical paths, anything resting on hallucinated dependencies, and code nobody can explain. Keep is for correct, readable, low-risk code already covered by tests that assert, and there is usually more of it than a nervous first look suggests.
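The table above can be read as a small decision rule. The sketch below is a literal encoding of it for illustration only: the `Signals` fields and the `triage` function are hypothetical names, and in practice the inputs are judgements a reviewer makes, not flags a tool can set.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    KEEP = "keep"
    HARDEN = "harden"
    REBUILD = "rebuild"


@dataclass(frozen=True)
class Signals:
    """Reviewer's assessment of one module or path."""
    security_critical: bool = False
    performance_critical: bool = False
    hallucinated_dependencies: bool = False
    team_can_explain: bool = True
    correct: bool = True
    readable: bool = True
    low_risk: bool = False
    has_asserting_tests: bool = False


def triage(s: Signals) -> Verdict:
    # Rebuild signals are checked first, because any one of them overrides the rest.
    if (s.security_critical or s.performance_critical
            or s.hallucinated_dependencies or not s.team_can_explain):
        return Verdict.REBUILD
    if s.correct and s.readable and s.low_risk and s.has_asserting_tests:
        return Verdict.KEEP
    # Right shape, but missing authorisation, validation, tests or logging.
    return Verdict.HARDEN


print(triage(Signals(low_risk=True, has_asserting_tests=True)))
print(triage(Signals(low_risk=True)))
print(triage(Signals(security_critical=True, has_asserting_tests=True)))
```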

An honest account of cost and sequence

Honesty about cost is part of the engineering. Productionisation is a real but bounded second pass, and far cheaper than a from-scratch rebuild, because the product shape, the data model and the happy-path behaviour already exist. The temptation is to ship the MVP as it stands and fix things later, but that is usually a false economy: unaddressed security gaps and silent failures compound, and the technical debt Forrester describes only grows as traffic arrives [7].

Sequence matters as much as effort. In our engagements we work in this order:

1. Review first. You cannot plan what you have not measured, so the audit comes before any change.
2. Close the security gaps. Secrets, authorisation and input validation come first, because they are the most expensive to get wrong and the hardest to retrofit once more code depends on the current shape.
3. Make failure visible. Error handling, logging and monitoring go in next, so that every later change is observed rather than guessed at.
4. Prove behaviour with tests that assert, so that hardening does not quietly break what already works.
5. Harden performance against realistic load, once behaviour is locked in by tests.
6. Pay down architectural debt where it blocks the next change, not as an end in itself.

On price, we do not publish a single rate for this work, because the scope depends entirely on what the review finds. What we do commit to is clear prices in writing, agreed per engagement, with the audit itself bounded so that the first step carries no open-ended risk. Every piece of work is signed off by a named senior engineer, which is the same accountability we apply to the review that precedes it.

Frequently asked questions

How long does it take to make an AI-built MVP production-ready?

It depends on the size of the minimum viable product and how security-sensitive it is, but the work is bounded and usually a fraction of the original build, because the product shape already exists. A review scopes it precisely. The largest costs are normally authorisation, tests that assert behaviour and observability, rather than new features.

Can I keep building features on the MVP instead of hardening it?

You can, but unaddressed security gaps and silent failures compound as you add code, and new features inherit the old flaws. Forrester expects technical debt to rise to moderate or high severity for most technology decision-makers by 2026, with AI development named as an accelerant [7]. A short review tells you what is safe to build on first.

Is AI-generated code safe to use in production?

Yes, once it has been reviewed and hardened to the same standard as any other code. The origin of the code matters less than whether it has been verified. Benchmarks show high functional correctness but low security before review, with over 80% of working AI solutions still carrying a critical vulnerability [1]. Treat it as a strong draft, not a finished product.

What is the difference between an MVP and production-ready software?

A minimum viable product proves an idea works on the happy path for a few users. Production-ready software is safe to operate at scale: authenticated, authorised, tested, observable, performant under load and maintainable by a team. The distance between the two is engineering work, and not usually a full rewrite.

Productionise your MVP

Zegaware productionises AI-built MVPs. If you have a working prototype and need it made safe to operate at scale, our Bespoke Software engagements take an MVP from demo to production: closing the security gaps, adding tests that assert behaviour, making failure visible, and hardening performance against real load, all signed off by a named senior engineer. The MVP got you a strong start. The next step is making it a product.

Sources

[1] Songwen Zhao et al., "Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks" (SUSVIBES benchmark), Carnegie Mellon University, arXiv:2512.03262, 2026. https://arxiv.org/abs/2512.03262
[2] Veracode, Spring 2026 GenAI Code Security Update, 24 March 2026. https://www.veracode.com/blog/spring-2026-genai-code-security/
[3] CodeRabbit, State of AI vs Human Code Generation Report, 17 December 2025. https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report
[4] GitGuardian, The State of Secrets Sprawl 2026, 17 March 2026. https://blog.gitguardian.com/the-state-of-secrets-sprawl-2026/
[5] Joseph Spracklen et al., "We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs", USENIX Security 2025, arXiv:2406.10279. https://arxiv.org/abs/2406.10279
[6] Aryan Kargwal, "Why AI Agents Break: A Field Analysis of Production Failures", Arize AI, 29 January 2026. https://arize.com/blog/common-ai-agent-failures/
[7] Forrester, "Predictions 2025: Technology And Security" (press release), 22 October 2024. https://www.forrester.com/press-newsroom/forrester-predictions-2025-tech-security/
[8] Domenico Cotroneo et al., "Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity", arXiv:2508.21634, August 2025. https://arxiv.org/abs/2508.21634
[9] OWASP, "A01:2021 Broken Access Control", OWASP Top 10 2021. https://owasp.org/Top10/A01_2021-Broken_Access_Control/
