Most AI coding fails for a simple reason: people ask for too much in one shot, get something half-right, and then stack more guesses on top of it.
That works for toy apps. It breaks fast on real systems.
I’m building shipdemos.com, and the workflow that actually worked was not “vibe code harder.” It was: turn AI into a strict execution system.
Not a chatbot. Not a brainstorming partner. A system.
Here’s the process.
The basic idea
I stopped asking the model to “build the feature.”
Instead, I made it do this:
- analyze the system
- write a roadmap in markdown
- compress that roadmap into small executable phases
- implement one phase at a time
- audit each phase after implementation
- run break-tests and second-pass audits to catch what the first pass missed
That changed everything.
The result was better code, fewer fake completions, fewer hidden regressions, and way better accuracy in the parts of the system that matter.
In my case, that meant the full pipeline:
- PR diff ingestion
- DOM crawling
- step generation
- validation
- execution
- repair loop
- rendering
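One stage in that pipeline is worth sketching: the repair loop. The useful property is the bound — a failed step gets a fixed number of repair attempts instead of retrying forever. This is an illustrative shape, not the actual shipdemos implementation; `execute` and `repair` stand in for the real step runner and fixer.

```python
# Illustrative repair-loop shape (not the real shipdemos code):
# run each step, and on failure hand it to a repair function for a
# bounded number of attempts instead of looping indefinitely.
def run_with_repair(execute, repair, steps, max_attempts=2):
    results = []
    for step in steps:
        ok, attempt = False, 0
        while attempt < max_attempts and not ok:
            ok, output = execute(step)
            if not ok:
                step = repair(step, output)  # produce a fixed step
            attempt += 1
        results.append((step, ok))
    return results
```

A step that cannot be repaired within the budget fails loudly in the results instead of silently reporting success.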
Step 1: Make the AI write the roadmap first
The first useful thing I had Cursor do was not code.
```
Perform a deep comparative codebase analysis of this codebase.

1. Identify core architectural patterns, design choices, and key modules in both codebases.
2. Compare accuracy-related logic (e.g., detection, inference, processing pipelines) and highlight differences.
3. Evaluate algorithms, model usage, and data handling approaches impacting accuracy.
4. Benchmark or infer which implementation yields better accuracy and why.
5. Detect gaps, inefficiencies, or outdated practices in shipvideo-engine.
6. Highlight where Glimpse uses superior techniques or optimizations.
7. Suggest concrete improvements for shipvideo-engine with technical justification.
8. Estimate impact of each improvement (accuracy, latency, scalability).
9. Provide implementation-level guidance (code patterns, libraries, refactors).
10. Deliver a prioritized action plan with highest ROI changes first.
```
I had it inspect the existing codebase and write a real roadmap in markdown. Not a vague “next steps” list. An actual implementation document with:
- phases
- exact files
- data structures
- risks
- verification steps
- rollout order
That matters because once the roadmap exists, you stop arguing with the model mid-build about what it “meant.”
The roadmap becomes the reference point.
For me, that roadmap covered things like:
- unifying DOM schema across crawler, extractor, validator, and generator
- adding multi-route DOM crawling
- replacing brittle navigation detection
- adding live selector validation
- smarter diff budgeting
- hardening the repair loop
- wiring screenshot/render config correctly
All in markdown. Versioned. Readable. Reviewable.
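One of those items, smarter diff budgeting, can be sketched concretely: spend a fixed size budget on UI-relevant files first instead of truncating the raw diff at an arbitrary offset. The function and field names below are invented for illustration, not the real implementation.

```python
# Illustrative diff budgeting: keep UI files first, within a size
# budget, rather than cutting the raw diff at a byte offset.
def budget_diff(files, budget, is_ui_file):
    # UI files sort first (False < True), so they claim the budget first.
    ordered = sorted(files, key=lambda f: not is_ui_file(f["path"]))
    kept, used = [], 0
    for f in ordered:
        size = len(f["patch"])
        if used + size > budget:
            continue  # skip files that would blow the budget
        kept.append(f)
        used += size
    return kept
```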
Step 2: Compress the roadmap into executable phases
```
Create a detailed implementation roadmap, named EXECUTION_PLAN.md, as a
Markdown file for the following architectural improvements.

Requirements:
- Break into phases (Phase 1, Phase 2, ...)
- Each phase must be independently implementable and testable
- For each phase include:
  - Goal (what problem it fixes)
  - Exact files to modify
  - Data structures to introduce/change
  - Step-by-step implementation tasks
  - Risks / edge cases
  - How to verify correctness (tests or manual checks)

Constraints:
- Prioritize highest ROI changes first
- Do NOT combine unrelated refactors in one phase
- Keep phases small enough to complete in a single PR
- Maintain backward compatibility unless explicitly stated

Focus areas to include:
1. DOM schema unification
2. Multi-route DOM crawler (bounded BFS)
3. Selector validation improvements (uniqueness enforcement)
4. Navigation detection redesign
5. Diff processing improvements (smart truncation)
6. Repair loop robustness
7. Removal of dead execution paths
8. Config + trigger discipline improvements

Output:
- A clean, structured .md roadmap ready to execute phase-by-phase
```
Big plans are useless if they’re too wide to implement cleanly.
So the next move was to compress the roadmap into a smaller execution plan with independently shippable phases.
Each phase had only:
- goal
- files to change
- key tasks
- success criteria
That sounds minor, but it’s huge.
Because once the plan is compressed, the AI has less room to wander.
Instead of saying “improve the crawler,” I could say:
- implement only Phase 3
- touch only these files
- don’t pre-build future phases
- keep behavior stable unless the phase explicitly changes it
- stop after completion
That is the difference between controlled progress and a messy rewrite.
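For reference, a compressed phase entry in this style might look like the following. The contents are invented for illustration, not taken from the actual EXECUTION_PLAN.md:

```markdown
## Phase 3: Multi-route DOM crawler (bounded BFS)

Goal: ground step generation in more than just the homepage DOM.

Files: dom_crawler.py

Key tasks:
- crawl internal links breadth-first from the seed route
- enforce a hard page budget
- always keep the homepage snapshot

Success criteria:
- snapshot contains elements from every crawled route
- crawl stops at the budget, with no infinite loops
```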
Step 3: Use strict execution constraints
```
Implement Phase 3 from EXECUTION_PLAN.md

Execution constraints (STRICT):
* EXECUTION_PLAN.md is the ONLY source of truth.
* ACCURACY_IMPROVEMENT_ROADMAP.md may be used ONLY for edge cases or clarification.
* If there is any conflict, follow EXECUTION_PLAN.md and report it.

Rules:
1. Implement ONLY Phase 3.
2. Do NOT modify files outside this phase.
3. Do NOT anticipate future phases.
4. Keep changes minimal and scoped.
5. Preserve existing behavior unless explicitly required.

Output:
* Files modified
* Full diffs
* Why each change was made (mapped to Phase 3 tasks)
* Any assumptions or ambiguities
* Any risks introduced

Validation:
* No broken imports or circular dependencies
* No dead or unused code
* Code runs correctly

Stop after completion. Do not proceed further.
```
This was probably the biggest unlock.
I used a reusable Execution Prompt that forced the model into phase-by-phase work.
The important part was not the wording. It was the constraints.
Things like:
- implement only one phase at a time
- do not modify files outside phase scope
- do not anticipate future phases
- preserve existing behavior unless required
- prefer minimal changes
- validate imports, consistency, and runnability
- stop after the phase
That sounds obvious. It is not how most people use AI.
Most people ask for the destination. I started forcing the model to respect the path.
That made the code much more stable.
Step 4: Audit after every phase
```
Audit Phase 6 implementation for hidden inconsistencies:

1. Verify all screenshot paths use full_page flag consistently
2. Verify record_spend behavior:
   * is it called once or multiple times per invocation?
   * is token usage accurately tracked across fallback?
3. Verify viewport consistency:
   * step_runner viewport vs render.py viewport
   * ensure a single source of truth

Output:
* exact inconsistencies found
* required fixes
* affected files
```
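The viewport check in that prompt points at the standard fix: one config object that both the step runner and the renderer import, instead of duplicated literals. A minimal sketch, with invented names and values:

```python
# Hypothetical single source of truth for render settings; the runner
# and renderer both import RENDER_CONFIG instead of hard-coding
# viewport dimensions in two places that can drift apart.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: settings can't be mutated mid-run
class RenderConfig:
    viewport_width: int = 1280
    viewport_height: int = 720
    full_page: bool = True

RENDER_CONFIG = RenderConfig()
```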
After each phase, I did not just move on.
I ran an Audit Prompt against the exact files that changed.
That audit checked:
- was the phase actually implemented as written?
- are contracts consistent between modules?
- are there silent accuracy issues?
- are there capped lists, bad assumptions, dead branches, or partial implementations?
This caught a lot.
Examples:
- schema mismatches between dom_crawler and step_generation
- routes being crawled but not represented cleanly in the returned snapshot
- valid selectors being rejected because quote normalization was missing
- homepage context being dropped when seed routes filled the crawl budget
- list caps silently hiding important DOM elements
- runtime validation disagreeing with generation-time validation
A single “done” pass would not have caught that stuff.
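The quote-normalization bug deserves a concrete sketch: `button[name='save']` and `button[name="save"]` select the same element, but naive string comparison rejects one of them. A deliberately simplified fix (this ignores escaped and nested quotes, and is not the real validator):

```python
# Simplified sketch: normalize single-quoted attribute values to
# double quotes before comparing selectors for equality.
import re

def normalize_selector(selector: str) -> str:
    return re.sub(r"'([^']*)'", r'"\1"', selector)

def selectors_match(a: str, b: str) -> bool:
    return normalize_selector(a) == normalize_selector(b)
```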
Step 5: Run break-tests on purpose
```
Now try to break the system:

* Find scenarios where the pipeline produces incorrect or misleading demo steps
* Find cases where the system silently fails but reports success
* Identify where grounding fails and the LLM guesses instead
* Identify worst-case PRs that this system would handle poorly

Be aggressive and critical.
```
Then I started doing something I think more people should do with AI: intentionally try to break the implementation.
I used a separate Break-Test Prompt to simulate ugly cases like:
- large PR diffs
- multi-route UI changes
- lazy-loaded elements
- auth-protected routes
- navigation changes mid-flow
- rendering with mixed screenshot dimensions
This is where a lot of fake confidence dies.
For example, one of the biggest improvements came from realizing arbitrary caps were destroying grounding quality.
The crawler was only keeping tiny slices of buttons, links, inputs, and test IDs. That meant the model was generating steps from incomplete evidence.
Removing or expanding those limits improved accuracy more than a lot of “clever” prompting.
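In code terms, that fix was mechanical: replace hard-coded slices like `elements[:10]` with explicit, overridable budgets, so a cap is a visible decision rather than a silent default. Names and limits below are invented for illustration:

```python
# Illustrative: make element caps explicit and optional instead of
# silently slicing away grounding evidence with hard-coded limits.
DEFAULT_LIMITS = {"buttons": None, "links": None, "inputs": 50}

def cap_elements(elements, kind, limits=DEFAULT_LIMITS):
    limit = limits.get(kind)  # None means "keep everything"
    return elements if limit is None else elements[:limit]
```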
Same with multi-route crawl. If the system only sees the homepage, it will hallucinate selectors for routes it never visited. That’s not an LLM problem. That’s a grounding problem.
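The multi-route crawl itself is the bounded BFS the roadmap named. A minimal sketch, where `get_links` stands in for whatever link extraction the real crawler does:

```python
# Minimal bounded-BFS crawl sketch (names illustrative): visit routes
# breadth-first from the seed and stop at a hard page budget.
from collections import deque

def crawl_routes(seed, get_links, max_pages=10):
    seen, visited = {seed}, []
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        route = queue.popleft()
        visited.append(route)
        for link in get_links(route):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Because the seed is visited first, the homepage can never be pushed out of the snapshot by deeper routes — the "homepage context dropped" failure mode from the audit.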
Step 6: Do a second-pass audit
Even after audit plus break-test, I still ran a Second-Pass Audit Prompt.
Why?
Because the first audit usually catches obvious implementation mistakes. The second pass catches system-level inconsistencies.
Things like:
- trigger logic duplicated in two places
- config values loaded but never actually used
- script-first path lagging behind stepwise hardening
- route validation using a narrower runtime view than the original crawl
- re-anchor failure not aborting execution after navigation
That second pass changed outcomes a lot.
Without it, I would have shipped code that looked good in isolation but still had bad handoffs between modules.
The mental model that actually works
If you want to copy this workflow, this is the model:
- Pass 1: map the system
- Pass 2: turn the map into phases
- Pass 3: implement one phase under hard constraints
- Pass 4: audit the implementation
- Pass 5: try to break it
- Pass 6: audit the system again after the break-test
That is the loop.
Not “prompt once and pray.”
Why single-pass AI fails
Single-pass AI usually fails for four reasons:
- it changes too much at once
- it hides contract mismatches between modules
- it treats “syntactically valid” as “correct”
- it never pressure-tests the result
That was very obvious in this project.
A lot of the worst issues were not crashes. They were silent accuracy failures:
- wrong selectors that looked valid
- good routes rejected later by another module
- useful DOM elements dropped by arbitrary caps
- old assumptions surviving after architecture changes
Those are exactly the bugs a single-pass workflow misses.
Why phase isolation matters
Phase isolation is not just project management.
It’s how you keep AI from smearing half-finished ideas across the codebase.
When I forced the model to implement only one phase at a time, a few good things happened:
- diffs got smaller
- validation got easier
- audits got sharper
- regressions were easier to trace
- roadmap intent stayed intact
It also made review much less emotional. I wasn’t looking at a giant AI rewrite. I was looking at one small architectural move at a time.
Why module contracts matter more than prompts
Prompting matters. Contracts matter more.
A lot of the work in this project was not “make the model smarter.” It was making the system internally consistent.
Examples:
- DOM schema had to match across crawler, extractor, validator, and generator
- route authority had to mean the same thing at generation time and execution time
- screenshot settings had to flow from config to runner to renderer
- diff prioritization had to share the same UI file classifier as trigger logic
If those contracts drift, better prompts don’t save you.
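Concretely, the cheapest way to stop contract drift is a single schema module that every stage imports. The field names below are illustrative, not the real schema:

```python
# Illustrative shared schema: crawler, extractor, validator, and
# generator all import these types, so the contract lives in one place
# instead of being re-declared (and re-drifted) per module.
from typing import TypedDict

class DomElement(TypedDict):
    selector: str
    tag: str
    text: str
    route: str

class DomSnapshot(TypedDict):
    routes: list[str]
    elements: list[DomElement]
```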
How I used GPT-5.4 vs Claude Sonnet 4.6
I didn’t use one model for everything.
I used GPT-5.4 for the heavy system work:
- roadmap creation
- execution-plan compression
- phase-by-phase implementation
- cross-file refactors
- detailed audits
It was good at following hard constraints and keeping long architectural context straight.
I used Claude Sonnet more like a second brain:
- alternate audit passes
- cross-checking reasoning
- pressure-testing whether a phase was actually complete
- catching issues from a different angle
That split worked well.
One model built. The other challenged.
That’s a much better setup than asking one model to both implement and approve its own work.
The practical takeaway
If you’re building something real with AI, don’t ask it to build the whole thing in one pass.
Make it do this instead:
- write the roadmap in markdown
- compress it into small executable phases
- implement one phase at a time under strict rules
- audit every phase
- run break-tests deliberately
- do a second-pass audit before you trust the result
Treat prompts as reusable operating procedures:
- Execution Prompt
- Audit Prompt
- Break-Test Prompt
- Second-Pass Audit Prompt
That is the workflow.
The point is not to make AI feel magical. The point is to make it reliable enough to matter.
That’s how I’m building shipdemos.com, a product that generates demo videos from GitHub pull requests. The idea is simple: you push code, and a video appears on your PR showing what changed. No manual recording, no Loom, no “can you show me what this does?” in Slack.
What’s Next?
Launching soon! You can sign up for the waitlist or suggest features you’d like to see.
👉 Check it out — Shipdemos.com
