Automated Demo Video Generation: The Pipeline I Built

Demo videos convert. Everyone knows this. Wyzowl's latest data shows 89% of people say watching a video convinced them to buy a product or service, and in my own experience, product pages with embedded demo videos convert at roughly 2.4x the rate of pages with only static images and text. The math is not subtle.

And yet most businesses have maybe three demo videos, all slightly outdated, recorded by someone who no longer works there. The reason is simple: making a demo video is a pain in the ass. So I built an automated demo video generation pipeline that produces them without me touching a screen recorder. Every step — scripting, recording, narration, compositing, quality review — runs autonomously. This article breaks down exactly how it works.

Demo Videos Are the Highest-Leverage Sales Asset Nobody Makes Enough Of

The Math on Demo Videos vs. Other Content

A blog post takes me about 45 minutes with my AI-assisted content pipeline. A social media post takes 10. A demo video? Before I built this system, a single two-minute demo took 4-8 hours. That includes writing the script, setting up the environment, doing 3-5 takes because I clicked the wrong thing or stumbled over a line, editing out the mistakes, adding narration, exporting, and uploading.

The ROI per video is high. But the production cost per video is also high, which means most businesses treat demo videos like a quarterly project instead of what they should be — a standard artifact that ships with every feature, every release, every sales conversation.

I've seen the same pattern with clients. A payments startup I worked with had one demo video from 18 months ago. Their product had changed completely. A labor compliance SaaS client had zero video content — their entire sales motion was live Zoom demos, which meant every prospect required a salesperson's time.

Why the Bottleneck Is Production, Not Strategy

Nobody needs to be convinced that demo videos work. The bottleneck is purely production. Screen recording is manual, linear, and unforgiving. One mistake means you start over or spend time editing. Narration requires either a good mic setup and a quiet room, or paying a voiceover artist and waiting days.

This is a velocity problem. When the cost of producing something is high, you produce less of it. When you produce less of it, you underinvest in the highest-converting content format available. I got tired of that tradeoff and decided to engineer my way out of it.

The Full Pipeline: Five Tools, Zero Manual Recording

Architecture Overview

The pipeline runs as a single command. Here's the sequence:

Architecture diagram of the automated demo video generation pipeline showing five sequential stages: Claude for scripting, Playwright for browser recording, Cartesia for AI narration, FFmpeg for video compositing, and Gemini Vision for automated QA review, with a retry loop from QA back to earlier stages Five-Stage Pipeline Architecture

Claude writes the narration script and generates Playwright automation code
Playwright executes every browser interaction — clicks, scrolls, form fills — and records the screen
Cartesia generates natural-sounding narration audio from the script
FFmpeg composites the screen recording with the narration track, adds title cards and transitions
Gemini Vision watches the final video, evaluates quality, and either approves it or triggers a retry

Five tools. Zero manual recording. The output is a polished MP4 ready to publish.

Why Each Tool Exists in the Chain

I could have tried to find one tool that does everything. That's the instinct most people have — look for an all-in-one solution. But I've written before about why I use multiple AI models in production, and the principle applies here too: each component should be best-in-class at its specific job.

Claude is the strongest scriptwriter available. Playwright is the most reliable browser automation framework — it handles modern SPAs, shadow DOMs, and dynamic content better than Selenium or Puppeteer. Cartesia produces narration that actually sounds human at a fraction of ElevenLabs' cost. FFmpeg is the industry standard for video manipulation. And Gemini 2.5 Pro's multimodal vision can literally watch a video frame by frame and tell you what's wrong with it.

A monolithic tool would be mediocre at all five jobs. This chain is excellent at each one.

Step 1: Claude Writes the Script and the Automation Code

Script Structure That Works for Demos

The input is a prompt describing what to demo, who the audience is, and the tone. Something like: "Demo the project creation flow for a technical PM audience. Professional tone, no fluff, under 90 seconds."

Claude outputs two artifacts simultaneously. The first is a narration script with timing markers. It looks roughly like this: an intro sentence, then a sequence of action-narration pairs where each action ("Click 'New Project'") has a corresponding narration line ("From the dashboard, we'll create a new project") with a pause duration between segments.

The second artifact is Playwright automation code that mirrors every action in the script. The key insight — and the thing that took the most iteration to get right — is that both artifacts are generated together in the same context window. Claude understands that when the narration says "enter a project name," the Playwright code needs to type into a specific input field at a specific moment.

Generating Playwright Selectors From the Script

The system prompt is structured so Claude knows it's writing for a visual medium. I provide it with a simplified DOM description of the target application — not the full HTML, but the key interactive elements with their selectors. Claude maps narration beats to specific selectors and generates the automation code with timing data baked in.

Does it always work perfectly? No. Claude sometimes generates selectors that don't exist or have changed since the DOM snapshot was taken. About 30% of first attempts have at least one selector issue. That's fine. The QA step catches it, and the pipeline retries with corrected context. Acknowledging failure modes and building around them is more honest — and more effective — than pretending everything works on the first try.

Step 2: Playwright Records While Cartesia Narrates

Browser Automation That Looks Human

Playwright runs in headed mode (you need a visible browser to record the screen) and executes every interaction from the generated script. But raw browser automation looks terrible in a demo — instant clicks, teleporting cursors, text appearing all at once. Nobody watches that and thinks "this is a real product experience."

Side-by-side comparison of raw browser automation behavior versus humanized automation showing four key differences: cursor movement using Bézier curves instead of teleporting, typing at realistic speed with random delays, hover pauses before clicks, and smooth scrolling instead of instant jumps Humanization Layers in Browser Automation

So the automation includes humanization layers. Mouse movements follow Bézier curves instead of teleporting. Typing happens at 85-120 WPM with slight randomization between keystrokes. Clicks have a 200-400ms hover pause before they fire. Scrolling is smooth, not instant. These are small details, but they're the difference between a demo that feels like watching a real person and one that feels like watching a bot.

The viewport is locked at 1920x1080 for consistent output. Playwright's built-in video recording captures the screen at 30fps. The recording starts 2 seconds before the first action and ends 3 seconds after the last — giving the final video clean head and tail frames.

Narration That Doesn't Sound Like a Robot

Cartesia generates the narration as a separate audio track. I chose Cartesia over ElevenLabs for three reasons: the voice quality is comparable, the latency is lower for batch generation, and the cost is roughly 40% less per minute of audio. When you're generating dozens of videos, that cost difference matters.

The narration comes back with precise timing metadata — I know exactly how long each segment takes in milliseconds. This data feeds into the FFmpeg step.

FFmpeg then composites everything. The video track from Playwright, the audio track from Cartesia, any title cards or lower-third overlays I've defined in the template, and simple transitions between sections. The FFmpeg command chain handles resolution normalization, audio leveling, and export to H.264 MP4. The output is a complete, ready-to-publish video file.

Step 3: Gemini Vision Watches the Video and Grades It

What the QA Agent Checks

This is the step that makes the pipeline autonomous instead of merely automated. Most "automated" video systems still require a human to watch the output and verify it's good. I don't watch these videos before they publish. Gemini does.

Infographic showing the four quality control criteria evaluated by Gemini Vision AI: action completion verification, narration sync accuracy within 1-second tolerance, error state detection, and technical quality checks, with 70% first-pass and 95%+ post-retry success rates Gemini Vision QA Evaluation Criteria

Gemini 2.5 Pro's multimodal capabilities accept video input. The QA agent samples frames at key moments (aligned with the timing markers from the script) and evaluates four criteria:

Action completion: Did every UI action described in the script visibly happen on screen? If the script says "click New Project," Gemini checks that the New Project modal or page actually appeared.
Sync accuracy: Is the narration aligned with the visual actions within a 1-second tolerance? A narrator saying "watch as the dashboard loads" while the screen shows a completely different page is an instant fail.
Error detection: Are there any visible error states, loading spinners that persisted too long, broken layouts, or console error overlays? Gemini flags these.
Technical quality: Correct resolution, no black frames, proper aspect ratio, audio levels within acceptable range.

This follows the same pattern I use across every AI system I build — AI that rejects its own bad work is the difference between automation you can trust and automation you have to babysit.

Pass/Fail Criteria and Auto-Retry

Each criterion gets a pass or fail. Any single fail triggers a retry. The pipeline is smart enough to retry at the appropriate step. A timing sync issue? Rerun just the Playwright recording with adjusted delays. A selector failure where a button was never clicked? Go back to Claude with the error context and regenerate the automation code. A narration issue? Regenerate just the Cartesia audio.

First-pass success rate sits around 70%. Most failures are timing sync issues — the narration runs slightly ahead of or behind the visual action. These almost always resolve on the second attempt with adjusted pause durations. Full pipeline failures (where the script logic itself is wrong) happen maybe 5-8% of the time and require a complete regeneration.

After retry, the overall success rate is above 95%.

Results: What This Actually Saves

Time and Cost Per Video

Before: 4-8 hours per demo video. After: 12-15 minutes end-to-end, zero human involvement.

Before and after comparison showing manual demo video production taking 4-8 hours and costing $150-400 per video versus the automated pipeline taking 12-15 minutes at $0.30-$0.80 per video, representing approximately 97% cost reduction Before vs After Production Metrics

Cost per video breaks down to roughly $0.30-$0.80 in API calls. Claude for the script and automation code is the biggest chunk. Cartesia narration is a few cents per minute of audio. Gemini QA is minimal. FFmpeg is free.

What Changes When Demo Videos Cost Nothing

When demo videos are essentially free, you stop thinking about them as a project and start treating them as a standard build artifact. Every feature ships with a demo. Every product release gets a video changelog. Onboarding walkthroughs stay current because regenerating them after a UI change takes 15 minutes instead of a full afternoon.

The most interesting use case is personalized sales demos. Feed a prospect's company name and use case into the script prompt, and the pipeline generates a demo that references their specific workflow. I've used this for an AI SDR platform I built — each outbound sequence includes a personalized product walkthrough. Try doing that manually for 50 prospects a week.

Each video used to be a project. Now it's a CI/CD step.

When to Build a Pipeline vs. When to Just Hit Record

If you make two demo videos a year, use Loom. Seriously. This pipeline is overbuilt for low-volume use.

Decision flowchart helping readers determine whether to build an automated demo video pipeline based on production volume, release frequency, and personalization needs, with recommendations ranging from using Loom for low volume to building a full pipeline for high-frequency personalized demo generation Decision Framework: Build Pipeline vs. Just Record

It makes sense when you ship frequently and need updated demos for every release. When you have multiple products or features that each need dedicated walkthroughs. When you want personalized demos in your sales motion. Or when you're building a platform where customers expect self-serve video documentation.

The upfront build cost was real. This is custom Python, prompt engineering, FFmpeg pipeline work, and QA agent configuration. It's not a weekend project. But once it exists, the marginal cost of each video approaches zero, and the system compounds — every improvement to one component improves every video the pipeline produces from that point forward.

This is the kind of system I build for businesses — not the obvious wins that everyone's already chasing, but the workflow infrastructure that eliminates entire categories of manual work permanently.

Thinking About AI for Your Business?

If this kind of thinking resonates — building systems that eliminate bottlenecks rather than just talking about AI strategy — I'd like to hear what you're working on. I do free 30-minute discovery calls where we look at your operations, identify where AI could actually move the needle, and figure out whether it makes sense to work together.

No pitch deck. No slides. Just a real conversation about what's possible.

Book a Discovery Call