I'm Building a Thing That Records Your PRs Automatically (I Think?)

I'm building ShipDemos.com, a tool that automatically generates demo videos from GitHub pull requests. The idea is simple: you push code, a video appears on your PR showing what changed. No manual recording, no Loom, no "can you show me what this does?" in Slack.

I've spent weeks going in circles on this, and I'm writing this to clarify what I actually know versus what I'm still guessing at.


The Problem

Engineering teams waste absurd amounts of time on demo videos. Not because videos are bad (they're actually really useful) but because the workflow is broken.

Here's what happens now: An engineer ships a feature. Product manager asks "can you show me what this looks like?" Engineer opens Loom, fumbles through recording, edits out the dead space, posts the link. Takes 10-15 minutes. Multiply that by 20-50 PRs per month and you're burning 3-8 hours per engineer on repetitive recording.

The worst part: most teams just skip it. They ship features with text-only changelogs because nobody has time to record demos. Sales teams bug engineers for feature walkthroughs. Customer success scrambles to create training materials. It's all inefficient.

I kept thinking: this should be automatic. The code changed, the preview environment deployed, we know what page to look at. Why are humans manually recording this?


What I'm Building

ShipDemos.com hooks into your GitHub repository and preview deployments (Vercel, Netlify, whatever). When you open a PR, it:

  1. Detects what changed (analyzes the diff)
  2. Navigates to your preview environment
  3. Records the relevant screens
  4. Stitches them into a video
  5. Posts it as a PR comment
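The steps above can be sketched as one pipeline. Everything below is a hypothetical stand-in (the real steps call Playwright and FFmpeg); it only shows the control flow, not the actual implementation:

```python
# Hypothetical sketch of the PR-to-video pipeline. Every helper is a stub
# for the real step (diff analysis, browser recording, video stitching).

def analyze_diff(diff: str) -> list[str]:
    """Placeholder heuristic: pull added route paths out of a diff."""
    return sorted({line.split('"')[1] for line in diff.splitlines()
                   if line.startswith('+') and 'path="' in line})

def handle_pr_event(diff: str, preview_url: str) -> dict:
    routes = analyze_diff(diff)                     # 1. detect what changed
    pages = [f"{preview_url}{r}" for r in routes]   # 2. screens to visit
    clips = [f"clip:{p}" for p in pages]            # 3. record each screen (stub)
    video = "+".join(clips)                         # 4. stitch clips (FFmpeg stub)
    return {"body": f"Demo video: {video}"}         # 5. PR comment payload
```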

The core technical stack is:

  • GitHub webhooks for PR events
  • Playwright for browser automation
  • FFmpeg for video rendering
  • Cloudflare R2 for storage
  • LLMs for classification (which demo template to use)

The product has three modes, which I keep flip-flopping between:

Mode 1: Template-based - You define demo flows once (e.g., "billing changes = navigate to /billing, click export, screenshot"). System matches PRs to templates via LLM classification. High accuracy (85-90%) but requires 30 minutes of setup.

Mode 2: Instruction-based - Developers write demo steps in PR descriptions like a checklist. System parses and executes them. Zero setup, but requires per-PR effort (defeats the purpose).

Mode 3: Autonomous agents - Use vision LLMs (Claude computer-use) to explore the app and figure out what to demo. Truly automatic but expensive ($0.30/video vs $0.02) and slower (3 min vs 30 sec).

I'm currently leaning toward Mode 1 with Mode 3 as a premium tier.
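For concreteness, here's roughly what a Mode 1 template could look like, using the billing example above. The schema is entirely hypothetical; the point is that teams define this once and the classifier only ever picks a name from the list:

```python
# Hypothetical Mode 1 template schema. Defined once per team; the LLM
# classifier just has to map each PR to one of these keys.
TEMPLATES = {
    "billing": {
        "match_hint": "billing, invoices, payments, export",
        "steps": [
            {"action": "goto",       "target": "/billing"},
            {"action": "click",      "target": "text=Export"},
            {"action": "screenshot", "target": "page"},
        ],
    },
    "default": {
        # Safe fallback when nothing else matches: just show the home page.
        "match_hint": "anything else",
        "steps": [
            {"action": "goto",       "target": "/"},
            {"action": "screenshot", "target": "page"},
        ],
    },
}
```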


What I've Figured Out So Far

Code diffs are too low-level to reliably infer UI behavior

I tried building heuristics: parse the diff, extract routes, find button text, detect modals. It works for simple cases (new page = just screenshot it) but fails spectacularly for interactive features. A PR that adds a button opening a modal? The diff shows <button onClick={openModal}>Export</button> but gives zero indication that clicking it matters or what the modal contains.

Success rate with pure heuristics: ~60%. Not good enough.
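Here's a sketch of what those heuristics look like, and why they top out: regexes can find the Export button in the diff, but learn nothing about the modal behind it. The patterns and helper names are mine, not the real implementation:

```python
import re

# Sketch of the diff heuristics described above: extract added routes and
# button labels. It finds *that* a button exists, but says nothing about
# what clicking it does -- exactly where the approach breaks down.

ROUTE_RE = re.compile(r'path="([^"]+)"')
BUTTON_RE = re.compile(r'<button[^>]*>([^<]+)</button>')

def extract_ui_hints(diff: str) -> dict:
    # Only look at added lines (skip the '+++ b/file' header line).
    added = [l[1:] for l in diff.splitlines()
             if l.startswith('+') and not l.startswith('+++')]
    text = "\n".join(added)
    return {
        "routes": ROUTE_RE.findall(text),
        "buttons": BUTTON_RE.findall(text),
    }

hints = extract_ui_hints('+<button onClick={openModal}>Export</button>')
# We learn there's an "Export" button; we have no idea a modal opens.
```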

Templates solve the reliability problem but create a setup problem

If users define flows once, accuracy jumps to 85-90%. But now you're asking customers to spend 30-60 minutes configuring templates. That's friction. The counter-argument: they save 10+ hours per month after that, so ROI is obvious. But humans are irrational about upfront costs.

I tested this framing: "Spend 30 minutes now, save 10 hours per month forever" vs "Fully automatic, zero setup." The second sounds better even if it's less accurate. This is a positioning problem, not a technical one.

LLM classification works surprisingly well

I feed the LLM: PR title, changed files, and a list of available templates. It picks the right one 80-85% of the time. The confidence scoring helps too: if confidence is low, we can skip the video or use a safe fallback.

Example:

PR: "Add CSV export to billing page"
Files: BillingPage.tsx, ExportButton.tsx
Available templates: [billing, dashboard, settings, default]
LLM output: "billing" (confidence: 0.91)

This works because the LLM isn't inventing steps - it's just matching patterns. Low hallucination risk.

Preview deployments are non-negotiable

You can't record demos from production because the new feature isn't there yet. You can't record from localhost because that requires the engineer's machine to be running. Preview deployments (Vercel, Netlify) are the only viable option.

This shrinks the addressable market. Not every company has preview infrastructure. But the companies that do are exactly the ones shipping fast enough to need this product. So it's actually good filtering.

Cursor just validated the entire thesis

Cursor (the AI coding tool) just launched a feature where their AI agents write code, test it, and generate demo videos automatically. This is huge validation that "automatic demo videos" is a real need. But they're solving it for code their agent wrote. I'm solving it for code anyone wrote, in any editor.

Their market: Cursor users using AI agents (~5K developers).

My market: any team shipping PRs (~1M+ developers). (Yes, I know TAM isn't this easy to measure.)

We're not competing. They proved demand exists.


What Still Doesn't Work

I don't know if template setup is acceptable friction

Some days I think "30 minutes of setup is nothing, teams spend that on tooling all the time." Other days I think "if it's not zero-setup, nobody will use it." I genuinely don't know which is true until real customers try it.

The vision LLM approach (Mode 3) could be zero-setup, but it's 15x more expensive and 6x slower. Is that worth it? Maybe for premium customers. Maybe not at all.

The accuracy ceiling might be lower than I think

Even with templates, you hit edge cases:

  • Feature requires authentication (how does the system log in?)
  • Feature requires specific data state (empty vs populated)
  • Feature is multi-step with branching logic
  • Feature is only visible to certain user roles

I can handle some of this (test accounts, seed data), but not all of it. The theoretical max accuracy might be 85%, not 95%. I don't know yet.
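Some of these edge cases can at least be detected up front and turned into a skip-or-record decision, in the same spirit as the low-confidence fallback. The flags and the split between "handled" and "unhandled" below are hypothetical:

```python
# Hypothetical pre-flight check: rather than record a wrong video, detect
# the edge cases listed above and skip when we can't satisfy them.

HANDLED = {"auth", "seed_data"}  # e.g. seeded test account covers login
UNHANDLED = {"role_gated", "branching_flow", "special_data_state"}

def should_record(required: set[str]) -> bool:
    """Record only if every requirement of this PR is one we can satisfy."""
    return required <= HANDLED

should_record({"auth"})                # -> True: test account covers login
should_record({"auth", "role_gated"})  # -> False: skip rather than ship a bad video
```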

I'm not sure about the AI coding threat

If 70% of code gets written by AI agents in 2-3 years, and those agents generate their own demos (like Cursor does), then my market shrinks by 70%. That's a real risk.

Counter-argument: even if agents write code, humans will still edit it. Multiple people contribute to PRs. Bugs get fixed post-merge. The "final state" of what shipped is different from what the agent initially built. So there's still a need for post-PR documentation.

But I'm not 100% confident in this reasoning.

I don't know if this is a point product or a documentation platform

Am I selling "automatic demo videos" (narrow, competes with Loom)?

Or am I selling "PR documentation automation" (broader, competes with Notion/Confluence)?

The second positioning might be more defensible long-term. Aggregate demos from all sources (agents, manual uploads, my automation), curate them, distribute them to customers. Become the changelog platform.

But that's much bigger scope.

If This Fails

There's a real chance this doesn't work.

Maybe template setup is too much friction and nobody bothers. Maybe 85% accuracy isn't good enough and one bad video tanks trust. Maybe AI coding takes off faster than I expect and agents just handle this automatically. Maybe the market is smaller than I think: teams without preview deployments, teams that don't care about demos, teams unwilling to pay $149/month for this.

I could spend 6 months building this and end up with 2 paying customers.

But even if it fails, I'll learn something useful:

  • Whether LLM classification actually works in production at scale
  • What the real accuracy ceiling is for automated UI recording
  • If template-based systems can compete with "magic" zero-setup promises
  • How preview deployment infrastructure actually works across different platforms
  • Whether vision LLMs are ready for production or still too expensive/slow

And honestly? Building this is fun. Writing browser automation that intelligently clicks through UIs, debugging why FFmpeg drops frames, figuring out how to make LLMs reliably classify PRs without hallucinating: this is the kind of problem-solving I enjoy.

Worst case: I build something interesting, write about what I learned, and move on to the next idea with better intuition about automation and AI tooling.

Best case: I solve a real problem for engineering teams and build a sustainable business.

Either way, I'm shipping something this week.

If you want to follow along or try this on your own team, check out ShipDemos.com.