Most AI coding fails for a simple reason: people ask for too much in one shot, get something half-right, and then stack more guesses on top of it.
That works for toy apps. It breaks fast on real systems.
I’m building shipdemos.com, and the workflow that actually worked was not “vibe code harder.” It was: turn AI into a strict execution system.
Not a chatbot. Not a brainstorming partner. A system.
Here’s the process.
The basic idea
I stopped asking the model to “build the feature.”
Instead, I made it do this:
- analyze the system
- write a roadmap in markdown
- compress that roadmap into small executable phases
- implement one phase at a time
- audit each phase after implementation
- run break-tests and second-pass audits to catch what the first pass missed
That changed everything.
The result was better code, fewer fake completions, fewer hidden regressions, and way better accuracy in the parts of the system that matter.
In my case, that meant the full pipeline:
- PR diff ingestion
- DOM crawling
- step generation
- validation
- execution
- repair loop
- rendering
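One stage in that pipeline is worth sketching: the repair loop. The useful property is the bound — a failed step gets a fixed number of repair attempts instead of retrying forever. This is an illustrative shape, not the actual shipdemos implementation; `execute` and `repair` stand in for the real step runner and fixer.

```python
# Illustrative repair-loop shape (not the real shipdemos code):
# run each step, and on failure hand it to a repair function for a
# bounded number of attempts instead of looping indefinitely.
def run_with_repair(execute, repair, steps, max_attempts=2):
    results = []
    for step in steps:
        ok, attempt = False, 0
        while attempt < max_attempts and not ok:
            ok, output = execute(step)
            if not ok:
                step = repair(step, output)  # produce a fixed step
            attempt += 1
        results.append((step, ok))
    return results
```

A step that cannot be repaired within the budget fails loudly in the results instead of silently reporting success.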
Step 1: Make the AI write the roadmap first
The first useful thing I had Cursor do was not code.
```
Perform a deep comparative codebase analysis of this codebase.

1. Identify core architectural patterns, design choices, and key modules in both codebases.
2. Compare accuracy-related logic (e.g., detection, inference, processing pipelines) and highlight differences.
3. Evaluate algorithms, model usage, and data handling approaches impacting accuracy.
4. Benchmark or infer which implementation yields better accuracy and why.
5. Detect gaps, inefficiencies, or outdated practices in shipvideo-engine.
6. Highlight where Glimpse uses superior techniques or optimizations.
7. Suggest concrete improvements for shipvideo-engine with technical justification.
8. Estimate impact of each improvement (accuracy, latency, scalability).
9. Provide implementation-level guidance (code patterns, libraries, refactors).
10. Deliver a prioritized action plan with highest ROI changes first.
```
I had it inspect the existing codebase and write a real roadmap in markdown. Not a vague “next steps” list. An actual implementation document with:
- phases
- exact files
- data structures
- risks
- verification steps
- rollout order
That matters because once the roadmap exists, you stop arguing with the model mid-build about what it “meant.”
The roadmap becomes the reference point.
For me, that roadmap covered things like:
- unifying DOM schema across crawler, extractor, validator, and generator
- adding multi-route DOM crawling
- replacing brittle navigation detection
- adding live selector validation
- smarter diff budgeting
- hardening the repair loop
- wiring screenshot/render config correctly
All in markdown. Versioned. Readable. Reviewable.
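One of those items, smarter diff budgeting, can be sketched concretely: spend a fixed size budget on UI-relevant files first instead of truncating the raw diff at an arbitrary offset. The function and field names below are invented for illustration, not the real implementation.

```python
# Illustrative diff budgeting: keep UI files first, within a size
# budget, rather than cutting the raw diff at a byte offset.
def budget_diff(files, budget, is_ui_file):
    # UI files sort first (False < True), so they claim the budget first.
    ordered = sorted(files, key=lambda f: not is_ui_file(f["path"]))
    kept, used = [], 0
    for f in ordered:
        size = len(f["patch"])
        if used + size > budget:
            continue  # skip files that would blow the budget
        kept.append(f)
        used += size
    return kept
```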
Step 2: Compress the roadmap into executable phases
```
Create a detailed implementation roadmap, named EXECUTION_PLAN.md, as a
Markdown file for the following architectural improvements.

Requirements:
- Break into phases (Phase 1, Phase 2, ...)
- Each phase must be independently implementable and testable
- For each phase include:
  - Goal (what problem it fixes)
  - Exact files to modify
  - Data structures to introduce/change
  - Step-by-step implementation tasks
  - Risks / edge cases
  - How to verify correctness (tests or manual checks)

Constraints:
- Prioritize highest ROI changes first
- Do NOT combine unrelated refactors in one phase
- Keep phases small enough to complete in a single PR
- Maintain backward compatibility unless explicitly stated

Focus areas to include:
1. DOM schema unification
2. Multi-route DOM crawler (bounded BFS)
3. Selector validation improvements (uniqueness enforcement)
4. Navigation detection redesign
5. Diff processing improvements (smart truncation)
6. Repair loop robustness
7. Removal of dead execution paths
8. Config + trigger discipline improvements

Output:
- A clean, structured .md roadmap ready to execute phase-by-phase
```
Big plans are useless if they’re too wide to implement cleanly.
So the next move was to compress the roadmap into a smaller execution plan with independently shippable phases.
Each phase had only:
- goal
- files to change
- key tasks
- success criteria
That sounds minor, but it’s huge.
Because once the plan is compressed, the AI has less room to wander.
Instead of saying “improve the crawler,” I could say:
- implement only Phase 3
- touch only these files
- don’t pre-build future phases
- keep behavior stable unless the phase explicitly changes it
- stop after completion
That is the difference between controlled progress and a messy rewrite.
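For reference, a compressed phase entry in this style might look like the following. The contents are invented for illustration, not taken from the actual EXECUTION_PLAN.md:

```markdown
## Phase 3: Multi-route DOM crawler (bounded BFS)

Goal: ground step generation in more than just the homepage DOM.

Files: dom_crawler.py

Key tasks:
- crawl internal links breadth-first from the seed route
- enforce a hard page budget
- always keep the homepage snapshot

Success criteria:
- snapshot contains elements from every crawled route
- crawl stops at the budget, with no infinite loops
```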
Step 3: Use strict execution constraints
```
Implement Phase 3 from EXECUTION_PLAN.md

Execution constraints (STRICT):
* EXECUTION_PLAN.md is the ONLY source of truth.
* ACCURACY_IMPROVEMENT_ROADMAP.md may be used ONLY for edge cases or clarification.
* If there is any conflict, follow EXECUTION_PLAN.md and report it.

Rules:
1. Implement ONLY Phase 3.
2. Do NOT modify files outside this phase.
3. Do NOT anticipate future phases.
4. Keep changes minimal and scoped.
5. Preserve existing behavior unless explicitly required.

Output:
* Files modified
* Full diffs
* Why each change was made (mapped to Phase 3 tasks)
* Any assumptions or ambiguities
* Any risks introduced

Validation:
* No broken imports or circular dependencies
* No dead or unused code
* Code runs correctly

Stop after completion. Do not proceed further.
```
This was probably the biggest unlock.
I used a reusable Execution Prompt that forced the model into phase-by-phase work.
The important part was not the wording. It was the constraints.
Things like:
- implement only one phase at a time
- do not modify files outside phase scope
- do not anticipate future phases
- preserve existing behavior unless required
- prefer minimal changes
- validate imports, consistency, and runnability
- stop after the phase
That sounds obvious. It is not how most people use AI.
Most people ask for the destination. I started forcing the model to respect the path.
That made the code much more stable.
Step 4: Audit after every phase
```
Audit Phase 6 implementation for hidden inconsistencies:

1. Verify all screenshot paths use full_page flag consistently
2. Verify record_spend behavior:
   * is it called once or multiple times per invocation?
   * is token usage accurately tracked across fallback?
3. Verify viewport consistency:
   * step_runner viewport vs render.py viewport
   * ensure a single source of truth

Output:
* exact inconsistencies found
* required fixes
* affected files
```
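The viewport check in that prompt points at the standard fix: one config object that both the step runner and the renderer import, instead of duplicated literals. A minimal sketch, with invented names and values:

```python
# Hypothetical single source of truth for render settings; the runner
# and renderer both import RENDER_CONFIG instead of hard-coding
# viewport dimensions in two places that can drift apart.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: settings can't be mutated mid-run
class RenderConfig:
    viewport_width: int = 1280
    viewport_height: int = 720
    full_page: bool = True

RENDER_CONFIG = RenderConfig()
```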
After each phase, I did not just move on.
I ran an Audit Prompt against the exact files that changed.
That audit checked:
- was the phase actually implemented as written?
- are contracts consistent between modules?
- are there silent accuracy issues?
- are there capped lists, bad assumptions, dead branches, or partial implementations?
This caught a lot.
Examples:
- schema mismatches between dom_crawler and step_generation
- routes being crawled but not represented cleanly in the returned snapshot
- valid selectors being rejected because quote normalization was missing
- homepage context being dropped when seed routes filled the crawl budget
- list caps silently hiding important DOM elements
- runtime validation disagreeing with generation-time validation
A single “done” pass would not have caught that stuff.
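The quote-normalization bug deserves a concrete sketch: `button[name='save']` and `button[name="save"]` select the same element, but naive string comparison rejects one of them. A deliberately simplified fix (this ignores escaped and nested quotes, and is not the real validator):

```python
# Simplified sketch: normalize single-quoted attribute values to
# double quotes before comparing selectors for equality.
import re

def normalize_selector(selector: str) -> str:
    return re.sub(r"'([^']*)'", r'"\1"', selector)

def selectors_match(a: str, b: str) -> bool:
    return normalize_selector(a) == normalize_selector(b)
```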
Step 5: Run break-tests on purpose
```
Now try to break the system:

* Find scenarios where the pipeline produces incorrect or misleading demo steps
* Find cases where the system silently fails but reports success
* Identify where grounding fails and the LLM guesses instead
* Identify worst-case PRs that this system would handle poorly

Be aggressive and critical.
```
Then I started doing something I think more people should do with AI: intentionally try to break the implementation.
I used a separate Break-Test Prompt to simulate ugly cases like:
- large PR diffs
- multi-route UI changes
- lazy-loaded elements
- auth-protected routes
- navigation changes mid-flow
- rendering with mixed screenshot dimensions
This is where a lot of fake confidence dies.
For example, one of the biggest improvements came from realizing arbitrary caps were destroying grounding quality.
The crawler was only keeping tiny slices of buttons, links, inputs, and test IDs. That meant the model was generating steps from incomplete evidence.
Removing or expanding those limits improved accuracy more than a lot of “clever” prompting.
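In code terms, that fix was mechanical: replace hard-coded slices like `elements[:10]` with explicit, overridable budgets, so a cap is a visible decision rather than a silent default. Names and limits below are invented for illustration:

```python
# Illustrative: make element caps explicit and optional instead of
# silently slicing away grounding evidence with hard-coded limits.
DEFAULT_LIMITS = {"buttons": None, "links": None, "inputs": 50}

def cap_elements(elements, kind, limits=DEFAULT_LIMITS):
    limit = limits.get(kind)  # None means "keep everything"
    return elements if limit is None else elements[:limit]
```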
Same with multi-route crawl. If the system only sees the homepage, it will hallucinate selectors for routes it never visited. That’s not an LLM problem. That’s a grounding problem.
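The multi-route crawl itself is the bounded BFS the roadmap named. A minimal sketch, where `get_links` stands in for whatever link extraction the real crawler does:

```python
# Minimal bounded-BFS crawl sketch (names illustrative): visit routes
# breadth-first from the seed and stop at a hard page budget.
from collections import deque

def crawl_routes(seed, get_links, max_pages=10):
    seen, visited = {seed}, []
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        route = queue.popleft()
        visited.append(route)
        for link in get_links(route):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Because the seed is visited first, the homepage can never be pushed out of the snapshot by deeper routes — the "homepage context dropped" failure mode from the audit.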
Step 6: Do a second-pass audit
Even after audit plus break-test, I still ran a Second-Pass Audit Prompt.
Why?
Because the first audit usually catches obvious implementation mistakes. The second pass catches system-level inconsistencies.
Things like:
- trigger logic duplicated in two places
- config values loaded but never actually used
- script-first path lagging behind stepwise hardening
- route validation using a narrower runtime view than the original crawl
- re-anchor failure not aborting execution after navigation
That second pass changed outcomes a lot.
Without it, I would have shipped code that looked good in isolation but still had bad handoffs between modules.
The mental model that actually works
If you want to copy this workflow, this is the model:
- Pass 1: map the system
- Pass 2: turn the map into phases
- Pass 3: implement one phase under hard constraints
- Pass 4: audit the implementation
- Pass 5: try to break it
- Pass 6: audit the system again after the break-test
That is the loop.
Not “prompt once and pray.”
Why single-pass AI fails
Single-pass AI usually fails for four reasons:
- it changes too much at once
- it hides contract mismatches between modules
- it treats “syntactically valid” as “correct”
- it never pressure-tests the result
That was very obvious in this project.
A lot of the worst issues were not crashes. They were silent accuracy failures:
- wrong selectors that looked valid
- good routes rejected later by another module
- useful DOM elements dropped by arbitrary caps
- old assumptions surviving after architecture changes
Those are exactly the bugs a single-pass workflow misses.
Why phase isolation matters
Phase isolation is not just project management.
It’s how you keep AI from smearing half-finished ideas across the codebase.
When I forced the model to implement only one phase at a time, a few good things happened:
- diffs got smaller
- validation got easier
- audits got sharper
- regressions were easier to trace
- roadmap intent stayed intact
It also made review much less emotional. I wasn’t looking at a giant AI rewrite. I was looking at one small architectural move at a time.
Why module contracts matter more than prompts
Prompting matters. Contracts matter more.
A lot of the work in this project was not “make the model smarter.” It was making the system internally consistent.
Examples:
- DOM schema had to match across crawler, extractor, validator, and generator
- route authority had to mean the same thing at generation time and execution time
- screenshot settings had to flow from config to runner to renderer
- diff prioritization had to share the same UI file classifier as trigger logic
If those contracts drift, better prompts don’t save you.
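Concretely, the cheapest way to stop contract drift is a single schema module that every stage imports. The field names below are illustrative, not the real schema:

```python
# Illustrative shared schema: crawler, extractor, validator, and
# generator all import these types, so the contract lives in one place
# instead of being re-declared (and re-drifted) per module.
from typing import TypedDict

class DomElement(TypedDict):
    selector: str
    tag: str
    text: str
    route: str

class DomSnapshot(TypedDict):
    routes: list[str]
    elements: list[DomElement]
```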
How I used GPT-5.4 vs Claude Sonnet 4.6
I didn’t use one model for everything.
I used GPT-5.4 for the heavy system work:
- roadmap creation
- execution-plan compression
- phase-by-phase implementation
- cross-file refactors
- detailed audits
It was good at following hard constraints and keeping long architectural context straight.
I used Claude Sonnet more like a second brain:
- alternate audit passes
- cross-checking reasoning
- pressure-testing whether a phase was actually complete
- catching issues from a different angle
That split worked well.
One model built. The other challenged.
That’s a much better setup than asking one model to both implement and approve its own work.
The practical takeaway
If you’re building something real with AI, don’t ask it to build the whole thing in one pass.
Make it do this instead:
- write the roadmap in markdown
- compress it into small executable phases
- implement one phase at a time under strict rules
- audit every phase
- run break-tests deliberately
- do a second-pass audit before you trust the result
Treat prompts as reusable operating procedures:
- Execution Prompt
- Audit Prompt
- Break-Test Prompt
- Second-Pass Audit Prompt
That is the workflow.
The point is not to make AI feel magical. The point is to make it reliable enough to matter.
That’s how I’m building shipdemos.com, a product that generates demo videos from GitHub pull requests. The idea is simple: you push code, and a video appears on your PR showing what changed. No manual recording, no Loom, no “can you show me what this does?” in Slack.
What’s Next?
Launching soon! You can sign up for the waitlist or suggest features you’d like to see.
👉 Check it out — Shipdemos.com
