How to Build an AI Video Review Loop with Gemini (Practical Guide)

Use Gemini's multimodal capabilities to automatically review AI-generated videos. Score quality, catch errors, and iterate — all without watching a single frame yourself. Based on our ReportCast production pipeline.

TL;DR: You can use Gemini's multimodal capabilities to automatically review AI-generated videos — scoring visual quality, audio sync, text accuracy, and pacing on a 1-10 scale. We built a 4-iteration loop for our ReportCast video pipeline that improved output scores from 6.5 to 8.0 without a single human review. Gemini caught lip sync issues, wrong aspect ratios, and audio overlap that we would have missed on first watch. This article covers the architecture, exact prompts, scoring criteria, and how to wire it into your own video pipeline.

Why Video Needs Automated QA

If you're generating video with AI — Remotion, Runway, Kling, HeyGen, or any programmatic pipeline — you already know the problem: every render is a coin flip.

The audio might be 200ms off. A title card might overlap the speaker. The aspect ratio might be 16:9 when the client asked for 9:16. The background music might drown the narrator at the 30-second mark.

Human QA catches these issues. But watching every render takes time, and when you're iterating 4-8 times per video, the review bottleneck kills your shipping speed.

We hit this building ReportCast — an AI-generated video product that turns market research reports into short-form video with AI journalists. A 52-second video with two speakers, data visualizations, and branded overlays has at least 15 things that can go wrong per render.

So we built a review loop. Gemini watches the video, scores it, tells us what's broken, and we feed that back into the next render. Four iterations later: 6.5 → 8.0, no human in the loop.

The Architecture

The review loop has four components:

Render → Extract Frames → Gemini Review → Score + Feedback → Re-render
   ↑                                              |
   └──────────────────────────────────────────────┘
                    (repeat until score ≥ 7.5)

Step 1: Render the Video

Your video pipeline produces an MP4. In our case, that's Remotion rendering a React composition with:

  • ElevenLabs voice tracks (two different voices for two journalists)
  • Gemini Imagen-generated visuals for data points
  • Branded title cards and lower thirds
  • Background music at -18dB under voice

The render itself takes 30-90 seconds depending on complexity.

Step 2: Extract Key Frames

You don't send the entire video to Gemini. You extract representative frames at critical moments:

```bash
# Extract frames at key timestamps
ffmpeg -i output.mp4 \
  -vf "select='eq(n,0)+eq(n,30)+eq(n,90)+eq(n,150)'" \
  -vsync vfr frame_%03d.png

# The select filter has no "last" variable, so grab the final frame separately
ffmpeg -sseof -1 -i output.mp4 -update 1 frame_last.png
```

For a 52-second video at 30fps, we extract:

  • Frame 0 (opening title)
  • Frame 30 (1 second — first speaker appears)
  • Frame 90 (3 seconds — transition)
  • Frame 150 (5 seconds — second speaker)
  • Last frame (closing card)

Plus the full audio track as a separate file.
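If your videos vary in length or frame rate, the frame indices are easy to compute instead of hard-coding. A minimal helper along these lines (the function name and default timestamps are our own illustration, not part of the pipeline above):

```python
def frames_to_extract(duration_s, fps=30, timestamps=(0, 1, 3, 5)):
    """Map key timestamps (in seconds) to frame indices, always including the final frame."""
    last = int(duration_s * fps) - 1
    indices = {min(int(t * fps), last) for t in timestamps}
    indices.add(last)
    return sorted(indices)

print(frames_to_extract(52))  # [0, 30, 90, 150, 1559]
```

The resulting indices plug straight into the ffmpeg select expression.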

Step 3: Gemini Multimodal Review

This is where it gets interesting. Gemini 2.5 Pro can process images and audio in a single prompt. You send:

  • The extracted frames (as images)
  • The audio track
  • A structured scoring rubric
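As a sketch, here's one way to assemble that request for the Gemini REST API: frames and audio go in as base64 `inline_data` parts, followed by the rubric as text. Treat the MIME types and endpoint as assumptions to verify against the current API docs:

```python
import base64

def build_review_payload(frame_paths, audio_path, rubric):
    """Build a generateContent request body: frames + audio as inline parts, then the rubric."""
    parts = []
    for path in frame_paths:
        with open(path, "rb") as f:
            parts.append({"inline_data": {
                "mime_type": "image/png",
                "data": base64.b64encode(f.read()).decode()}})
    with open(audio_path, "rb") as f:
        parts.append({"inline_data": {
            "mime_type": "audio/mp3",
            "data": base64.b64encode(f.read()).decode()}})
    parts.append({"text": rubric})
    return {"contents": [{"parts": parts}]}

# POST the payload to:
#   https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent
# with your API key in the x-goog-api-key header.
```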

Here's the core review prompt:

```
You are a professional video QA reviewer. Review this AI-generated
video based on the frames and audio provided.

Score each dimension 1-10:

  1. VISUAL QUALITY
  • Are title cards readable and properly positioned?
  • Do speaker visuals match the audio timing?
  • Are transitions smooth between segments?
  • Is the aspect ratio consistent throughout?
  2. AUDIO QUALITY
  • Is the narration clear and properly paced?
  • Does background music stay below voice level?
  • Are there any audio artifacts, pops, or cuts?
  • Do speaker transitions sound natural?
  3. CONTENT ACCURACY
  • Do data visualizations match the narrated numbers?
  • Are company names and figures displayed correctly?
  • Is the closing CTA present and readable?
  4. PACING & FLOW
  • Does the video feel rushed or too slow?
  • Are pauses between speakers appropriate?
  • Does the total runtime match the target (45-60s)?

For each dimension, provide:

  • Score (1-10)
  • Specific issues found (with timestamps if possible)
  • Suggested fix for each issue

Overall score = weighted average:
Visual 30% + Audio 30% + Content 25% + Pacing 15%

If overall score >= 7.5: PASS
If overall score < 7.5: FAIL with specific fixes needed
```
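If you have Gemini return the four per-dimension scores as JSON, the weighted average at the bottom of the rubric is trivial to verify locally (the field names here are our assumption):

```python
# Weights per the rubric: Visual 30% + Audio 30% + Content 25% + Pacing 15%
WEIGHTS = {"visual": 0.30, "audio": 0.30, "content": 0.25, "pacing": 0.15}

def overall_score(scores):
    """Weighted average of the four dimension scores."""
    return round(sum(scores[dim] * w for dim, w in WEIGHTS.items()), 2)

def verdict(scores, threshold=7.5):
    return "PASS" if overall_score(scores) >= threshold else "FAIL"

print(overall_score({"visual": 7, "audio": 8, "content": 6, "pacing": 7}))  # 7.05
```

Computing the score yourself, rather than trusting the model's arithmetic, avoids a class of silent errors.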

Step 4: Parse and Re-render

Gemini returns structured feedback. The key is parsing it into actionable render parameters:

```python
# Pseudocode for the feedback loop
def review_loop(video_path, max_iterations=4, target_score=7.5):
    for i in range(max_iterations):
        frames = extract_frames(video_path)
        audio = extract_audio(video_path)

        review = gemini_review(frames, audio)
        score = review['overall_score']

        print(f"Iteration {i+1}: {score}/10")

        if score >= target_score:
            return video_path, review

        # Apply fixes based on feedback
        render_params = translate_feedback_to_params(review)
        video_path = re_render(render_params)

    return video_path, review  # Return best effort
```

The translate_feedback_to_params function is where domain knowledge matters. When Gemini says "audio overlap at 28-32s", you need to map that to a specific Remotion parameter — maybe increasing the gap between speaker segments by 500ms.
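What that mapping can look like, sketched for a Remotion-style setup — the issue types and prop names here are invented for illustration; yours will depend on your composition:

```python
def translate_feedback_to_params(review, base_params=None):
    """Turn structured review issues into updated render props (illustrative mapping)."""
    params = dict(base_params or {})
    for issue in review.get("issues", []):
        kind = issue.get("type")
        if kind == "audio_overlap":
            # widen the gap between speaker segments
            params["speakerGapMs"] = params.get("speakerGapMs", 0) + 500
        elif kind == "wrong_aspect_ratio":
            # force the vertical format the client asked for
            params["width"], params["height"] = 1080, 1920
        elif kind == "music_too_loud":
            # duck the background music further below the voice track
            params["musicDb"] = params.get("musicDb", -18) - 6
    return params
```

The value of keeping this mapping explicit is that each Gemini finding becomes a deterministic, reviewable parameter change rather than a free-form re-prompt.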

What Gemini Actually Catches

After running this loop on 20+ videos, here's what Gemini consistently detects:

Catches Reliably (8/10+ detection rate)

  • Wrong aspect ratio — if you render 16:9 but frames show 9:16, it flags immediately
  • Missing text elements — title cards, lower thirds, CTAs that didn't render
  • Audio level imbalance — when music drowns the narrator
  • Abrupt transitions — hard cuts where there should be fades
  • Mismatched data — when a chart shows "$2.3M" but the narrator says "$2.5M"

Catches Sometimes (5-7/10 detection rate)

  • Lip sync issues — requires good frame extraction timing
  • Pacing problems — subjective, but Gemini gives reasonable feedback on "too rushed" segments
  • Brand consistency — catches wrong colors or fonts if you describe the expected brand in the prompt

Misses Often (< 5/10 detection rate)

  • Subtle audio artifacts — compression glitches, micro-pops
  • Emotional tone — whether the narrator sounds engaged vs flat
  • Cultural appropriateness — visual metaphors that might not work in certain markets

Our Production Results

Four iterations on a ReportCast demo video:

| Iteration | Overall Score | Issues Found | Key Fix Applied |
|---|---|---|---|
| 1 | 6.5 | Wrong aspect ratio, audio overlap at 28s, missing closing CTA | Fixed render dimensions, added 500ms gap, restored CTA component |
| 2 | 7.0 | Speaker B audio slightly louder than A, transition at 15s too abrupt | Normalized audio levels, added 300ms crossfade |
| 3 | 6.5 | Regression: new crossfade caused visual glitch on data chart | Removed crossfade on chart segment, kept for speaker transitions |
| 4 | 8.0 | Minor: closing card could hold 1s longer | Accepted; above threshold, shipped |

Total time: ~12 minutes (3 min per render + review cycle). Without the loop, a human would spend 15-20 minutes watching and noting issues, plus back-and-forth with the developer to fix them.

Cost per review cycle: ~$0.05 (Gemini Pro with 5 images + audio ≈ 2K input tokens + 500 output tokens).

Cost for 4 iterations: $0.20.

Compare that to a freelance video reviewer at $30-50/hour.

How to Wire This Into Your Pipeline

If You Use Remotion

Remotion already outputs MP4 programmatically. Add a post-render step:

  1. Render → npx remotion render src/index.ts CompositionName output.mp4
  2. Extract frames → ffmpeg command from above
  3. Call Gemini API with frames + audio
  4. Parse response → adjust Remotion props
  5. Re-render with updated props
  6. Repeat until PASS

If You Use Runway / HeyGen / Other

Same pattern, but you can't automatically re-render with adjusted params (those tools are GUI-first). Instead, use the review loop to generate a specific revision brief that you paste into the tool's regeneration prompt.

If You're Building From Scratch

Start with the frame extraction + Gemini review steps. Even without automated re-rendering, getting structured QA feedback in 30 seconds instead of watching the full video yourself is a massive time save.

The Prompt Engineering That Matters

Three things that dramatically improved review quality:

1. Be Specific About Your Format

Generic: "Review this video for quality."

Specific: "This is a 52-second vertical video (1080x1920) with two AI journalists discussing China EV market data. Speaker A (female voice) covers the first 25 seconds. Speaker B (male voice) covers seconds 26-48. Closing card runs from 48-52."

The more context Gemini has about what the video should look like, the better it spots deviations.

2. Use a Numbered Rubric

Free-form reviews give free-form (unusable) feedback. A numbered rubric with specific dimensions forces Gemini to evaluate systematically. Our 4-dimension rubric (visual, audio, content, pacing) consistently produces actionable feedback.

3. Ask for Timestamps

"Audio overlap" is vague. "Audio overlap at 28-32 seconds where Speaker A's final sentence overlaps Speaker B's intro" is fixable. Always ask Gemini to reference specific timestamps in its feedback.

Limitations

This approach doesn't replace human creative review. It replaces technical QA — the boring, repeatable checks that catch production errors before they reach the client.

You still need a human to decide:

  • Does the video feel right emotionally?
  • Is the creative direction working?
  • Would the target audience engage with this?

But for "is the render technically correct?" — Gemini handles it at $0.05 per review, 30 seconds per cycle, and catches things humans miss because we get tired after the third watch.

Frequently Asked Questions

Can Gemini review a full video or just frames?

Gemini's API can also ingest full video files via the Files API, but this loop sends key frames and the audio track separately — it's cheaper and lets you control exactly which moments get reviewed. For most QA purposes, 5-10 frames plus the full audio gives sufficient coverage. Frame selection at transition points catches 80%+ of visual issues.

How much does AI video review cost?

About $0.05 per review cycle with Gemini Pro (5 images + audio). A full 4-iteration loop costs ~$0.20. Compare that to $30-50/hour for a human reviewer who needs 15-20 minutes per video.

What video formats work with this approach?

Any format that ffmpeg can process — MP4, MOV, WebM, AVI. The review loop works with the extracted frames and audio, not the container format. Your pipeline can use Remotion, Runway, HeyGen, ffmpeg compositing, or any other tool.

How accurate is AI video review compared to human review?

For technical QA (aspect ratio, audio levels, missing elements, data accuracy), Gemini matches or exceeds human detection rates — 80%+ on the issues that matter most. For subjective quality (emotional tone, creative direction, audience fit), humans are still significantly better. The optimal approach is AI for technical QA, human for creative review.

Can this work for social media content at scale?

Yes — this is where it shines most. If you're producing 10-50 short-form videos per week, manual QA becomes a full-time job. The automated loop handles technical review in seconds per video, and you only need human eyes on the creative aspects.


We built ReportCast — an AI video pipeline that turns market research into short-form video with AI journalists. The review loop described here is part of our production stack. If you're building video automation and want help with the QA layer, book a call.
