Use Gemini's multimodal capabilities to automatically review AI-generated videos. Score quality, catch errors, and iterate — all without watching a single frame yourself. Based on our ReportCast production pipeline.

TL;DR: You can use Gemini's multimodal capabilities to automatically review AI-generated videos — scoring visual quality, audio sync, text accuracy, and pacing on a 1-10 scale. We built a 4-iteration loop for our ReportCast video pipeline that improved output scores from 6.5 to 8.0 without a single human review. Gemini caught lip sync issues, wrong aspect ratios, and audio overlap that we would have missed on first watch. This article covers the architecture, exact prompts, scoring criteria, and how to wire it into your own video pipeline.
If you're generating video with AI — Remotion, Runway, Kling, HeyGen, or any programmatic pipeline — you already know the problem: every render is a coin flip.
The audio might be 200ms off. A title card might overlap the speaker. The aspect ratio might be 16:9 when the client asked for 9:16. The background music might drown the narrator at the 30-second mark.
Human QA catches these issues. But watching every render takes time, and when you're iterating 4-8 times per video, the review bottleneck kills your shipping speed.
We hit this building ReportCast — an AI-generated video product that turns market research reports into short-form video with AI journalists. A 52-second video with two speakers, data visualizations, and branded overlays has at least 15 things that can go wrong per render.
So we built a review loop. Gemini watches the video, scores it, tells us what's broken, and we feed that back into the next render. Four iterations later: 6.5 → 8.0, no human in the loop.
The review loop has four components:
```
Render → Extract Frames → Gemini Review → Score + Feedback → Re-render
   ↑                                                             │
   └─────────────────────────────────────────────────────────────┘
                     (repeat until score ≥ 7.5)
```
Your video pipeline produces an MP4. In our case, that's Remotion rendering a React composition with two speakers, data visualizations, and branded overlays.
The render itself takes 30-90 seconds depending on complexity.
You don't send the entire video to Gemini. You extract representative frames at critical moments:
```bash
# Extract frames at key timestamps (frame numbers assume a 30fps render)
ffmpeg -i output.mp4 -vf "select='eq(n,0)+eq(n,30)+eq(n,90)+eq(n,150)'" \
  -vsync vfr frame_%03d.png
# ffmpeg's select filter has no "last" variable, so grab the final frame separately:
# seek to 1s before the end and keep overwriting until the last frame wins
ffmpeg -sseof -1 -i output.mp4 -update 1 frame_last.png
```
For a 52-second video at 30fps, that yields five frames at key moments, plus the full audio track as a separate file.
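The extraction step is easy to wrap in Python. A minimal sketch using subprocess and ffmpeg — the helper names and the frame-selection fractions are illustrative, not from our production code:

```python
import subprocess
from pathlib import Path

def key_frame_numbers(duration_s: float, fps: int = 30) -> list:
    """Pick representative frame indices: start, early, middle, late, and final frame."""
    total = int(duration_s * fps)
    # Fractions chosen to land near typical transition points (illustrative).
    return sorted({0, total // 10, total // 2, (total * 3) // 4, total - 1})

def extract_frames_and_audio(video: str, out_dir: str, duration_s: float, fps: int = 30) -> list:
    """Extract the selected frames plus the full audio track with ffmpeg."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    select = "+".join(f"eq(n,{n})" for n in key_frame_numbers(duration_s, fps))
    subprocess.run(
        ["ffmpeg", "-y", "-i", video, "-vf", f"select='{select}'",
         "-vsync", "vfr", f"{out_dir}/frame_%03d.png"],
        check=True,
    )
    # Audio as a separate file for Gemini's audio input.
    subprocess.run(["ffmpeg", "-y", "-i", video, "-vn", f"{out_dir}/audio.mp3"], check=True)
    return sorted(Path(out_dir).glob("frame_*.png"))
```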
This is where it gets interesting. Gemini 2.5 Pro can process images and audio in a single prompt, so you send the extracted frames, the separate audio track, and context about what the video should contain.
Here's the core review prompt:
```
You are a professional video QA reviewer. Review this AI-generated
video based on the frames and audio provided.

Score each dimension 1-10:

1. VISUAL QUALITY
   - Are title cards readable and properly positioned?
   - Do speaker visuals match the audio timing?
   - Are transitions smooth between segments?
   - Is the aspect ratio consistent throughout?

2. AUDIO QUALITY
   - Is the narration clear and properly paced?
   - Does background music stay below voice level?
   - Are there any audio artifacts, pops, or cuts?
   - Do speaker transitions sound natural?

3. CONTENT ACCURACY
   - Do data visualizations match the narrated numbers?
   - Are company names and figures displayed correctly?
   - Is the closing CTA present and readable?

4. PACING & FLOW
   - Does the video feel rushed or too slow?
   - Are pauses between speakers appropriate?
   - Does the total runtime match the target (45-60s)?

For each dimension, provide:
- Score (1-10)
- Specific issues found (with timestamps if possible)
- Suggested fix for each issue

Overall score = weighted average:
Visual 30% + Audio 30% + Content 25% + Pacing 15%

If overall score >= 7.5: PASS
If overall score < 7.5: FAIL with specific fixes needed
```
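One way to send that request with the google-generativeai Python SDK — a sketch under assumptions, not our exact production code; the inline-bytes part format follows the SDK's documented shape, and the model name is taken from the article:

```python
def build_review_parts(prompt, frames, audio):
    """Assemble the multimodal request: rubric text, then PNG frames, then the audio track."""
    parts = [prompt]
    for png in frames:
        parts.append({"mime_type": "image/png", "data": png})
    parts.append({"mime_type": "audio/mp3", "data": audio})
    return parts

def gemini_review(prompt, frame_paths, audio_path):
    """Send the review request. Imports are lazy so build_review_parts stays dependency-free."""
    from pathlib import Path
    import google.generativeai as genai  # pip install google-generativeai
    # Assumes an API key is configured (GOOGLE_API_KEY env var or genai.configure()).
    frames = [Path(p).read_bytes() for p in frame_paths]
    audio = Path(audio_path).read_bytes()
    model = genai.GenerativeModel("gemini-2.5-pro")
    return model.generate_content(build_review_parts(prompt, frames, audio)).text
```

Asking for the review as JSON (or setting a JSON response schema) makes the next step — parsing scores and issues — far more reliable.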
Gemini returns structured feedback. The key is parsing it into actionable render parameters:
```python
# Pseudocode for the feedback loop
def review_loop(video_path, max_iterations=4, target_score=7.5):
    for i in range(max_iterations):
        frames = extract_frames(video_path)
        audio = extract_audio(video_path)
        review = gemini_review(frames, audio)
        score = review['overall_score']
        print(f"Iteration {i+1}: {score}/10")
        if score >= target_score:
            return video_path, review
        # Apply fixes based on feedback
        render_params = translate_feedback_to_params(review)
        video_path = re_render(render_params)
    return video_path, review  # Return best effort
```
The translate_feedback_to_params function is where domain knowledge matters. When Gemini says "audio overlap at 28-32s", you need to map that to a specific Remotion parameter — maybe increasing the gap between speaker segments by 500ms.
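A sketch of that mapping, with the rubric's weighted average included. The issue keywords and the render parameter names (`speakerGapMs`, `musicVolumeDb`, etc.) are hypothetical — yours will depend on your composition:

```python
WEIGHTS = {"visual": 0.30, "audio": 0.30, "content": 0.25, "pacing": 0.15}

def overall_score(scores):
    """Weighted average per the rubric: Visual 30% + Audio 30% + Content 25% + Pacing 15%."""
    return round(sum(scores[k] * w for k, w in WEIGHTS.items()), 2)

def translate_feedback_to_params(review):
    """Map Gemini's issue strings onto render parameters (illustrative prop names)."""
    params = {}
    for issue in review.get("issues", []):
        text = issue.lower()
        if "overlap" in text:
            # Widen the gap between speaker segments.
            params["speakerGapMs"] = params.get("speakerGapMs", 0) + 500
        if "aspect ratio" in text:
            # Force the vertical render dimensions.
            params["width"], params["height"] = 1080, 1920
        if "music" in text and ("drown" in text or "loud" in text):
            params["musicVolumeDb"] = -6
        if "cta" in text or "closing card" in text:
            params["showClosingCta"] = True
    return params
```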
After running this loop on 20+ videos, the issues Gemini consistently detects include audio overlap between speakers, wrong aspect ratios, background music drowning the narration, abrupt transitions, and missing overlay elements like the closing CTA.
Four iterations on a ReportCast demo video:
| Iteration | Overall Score | Issues Found | Key Fix Applied |
|---|---|---|---|
| 1 | 6.5 | Wrong aspect ratio, audio overlap at 28s, missing closing CTA | Fixed render dimensions, added 500ms gap, restored CTA component |
| 2 | 7.0 | Speaker B audio slightly louder than A, transition at 15s too abrupt | Normalized audio levels, added 300ms crossfade |
| 3 | 6.5 | Regression — new crossfade caused visual glitch on data chart | Removed crossfade on chart segment, kept for speaker transitions |
| 4 | 8.0 | Minor: closing card could hold 1s longer | Accepted — above threshold, shipped |
Total time: ~12 minutes (3 min per render + review cycle). Without the loop, a human would spend 15-20 minutes watching and noting issues, plus back-and-forth with the developer to fix them.
Cost per review cycle: ~$0.05 (Gemini Pro with 5 images + audio ≈ 2K input tokens + 500 output tokens).
Cost for 4 iterations: $0.20.
Compare that to a freelance video reviewer at $30-50/hour.
Remotion already outputs MP4 programmatically. Add a post-render step:

1. Render the composition: `npx remotion render src/index.ts CompositionName output.mp4`
2. Extract frames and audio with the ffmpeg command from above
3. Send them to Gemini with the review prompt

For GUI-first tools (Runway, Kling, HeyGen), the pattern is the same, but you can't automatically re-render with adjusted params. Instead, use the review loop to generate a specific revision brief that you paste into the tool's regeneration prompt.
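Wiring the re-render into Python is mostly command construction. A sketch — the entry point and composition name echo the article's example command, and `--props` is Remotion's flag for passing input props to the composition:

```python
import json
import subprocess

def remotion_render_cmd(entry, composition, out, props=None):
    """Build the Remotion CLI invocation; --props lets the QA loop pass adjusted parameters."""
    cmd = ["npx", "remotion", "render", entry, composition, out]
    if props:
        cmd += ["--props", json.dumps(props)]
    return cmd

def render(entry="src/index.ts", composition="CompositionName", out="output.mp4", props=None):
    """Run the render; raises CalledProcessError if Remotion fails."""
    subprocess.run(remotion_render_cmd(entry, composition, out, props), check=True)
    return out
```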
Start with the frame extraction + Gemini review steps. Even without automated re-rendering, getting structured QA feedback in 30 seconds instead of watching the full video yourself is a massive time save.
Three things that dramatically improved review quality:
**1. Give context about the intended output.** Generic: "Review this video for quality." Specific: "This is a 52-second vertical video (1080x1920) with two AI journalists discussing China EV market data. Speaker A (female voice) covers the first 25 seconds. Speaker B (male voice) covers seconds 26-48. Closing card runs from 48-52." The more context Gemini has about what the video should look like, the better it spots deviations.
**2. Use a structured scoring rubric.** Free-form reviews give free-form (unusable) feedback. A numbered rubric with specific dimensions forces Gemini to evaluate systematically. Our 4-dimension rubric (visual, audio, content, pacing) consistently produces actionable feedback.
**3. Demand timestamps.** "Audio overlap" is vague. "Audio overlap at 28-32 seconds where Speaker A's final sentence overlaps Speaker B's intro" is fixable. Always ask Gemini to reference specific timestamps in its feedback.
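Once the feedback contains timestamps, pulling them out is a small regex job. A sketch — the pattern covers the "28s" / "28-32 seconds" phrasings we ask Gemini for, but you should tune it to the response format you actually get:

```python
import re

# Matches "28s", "28-32s", "at 28 seconds", "28-32 seconds" in free-text feedback.
TIMESTAMP_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:-\s*(\d+(?:\.\d+)?))?\s*s(?:ec(?:onds?)?)?\b",
    re.IGNORECASE,
)

def find_timestamps(feedback):
    """Return (start, end) second ranges mentioned in an issue description."""
    spans = []
    for m in TIMESTAMP_RE.finditer(feedback):
        start = float(m.group(1))
        end = float(m.group(2)) if m.group(2) else start  # single timestamps become a point range
        spans.append((start, end))
    return spans
```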
This approach doesn't replace human creative review. It replaces technical QA — the boring, repeatable checks that catch production errors before they reach the client.
You still need a human to decide whether the tone fits the brand, whether the creative direction works, and whether the video will land with its intended audience.
But for "is the render technically correct?" — Gemini handles it at $0.05 per review, 30 seconds per cycle, and catches things humans miss because we get tired after the third watch.
**Can Gemini watch the video directly?** We don't send the full video — sending every frame is far more token-heavy than a handful of stills. We extract key frames and the audio track separately. For most QA purposes, 5-10 frames plus full audio gives sufficient coverage, and frame selection at transition points catches 80%+ of visual issues.
**What does a review cost?** About $0.05 per review cycle with Gemini Pro (5 images + audio). A full 4-iteration loop costs ~$0.20. Compare that to $30-50/hour for a human reviewer who needs 15-20 minutes per video.
**What video formats does this work with?** Any format that ffmpeg can process — MP4, MOV, WebM, AVI. The review loop works with the extracted frames and audio, not the container format. Your pipeline can use Remotion, Runway, HeyGen, ffmpeg compositing, or any other tool.
**Is Gemini as accurate as a human reviewer?** For technical QA (aspect ratio, audio levels, missing elements, data accuracy), Gemini matches or exceeds human detection rates — 80%+ on the issues that matter most. For subjective quality (emotional tone, creative direction, audience fit), humans are still significantly better. The optimal approach is AI for technical QA, human for creative review.
**Does this scale for high-volume production?** Yes — this is where it shines most. If you're producing 10-50 short-form videos per week, manual QA becomes a full-time job. The automated loop handles technical review in seconds per video, and you only need human eyes on the creative aspects.
We built ReportCast — an AI video pipeline that turns market research into short-form video with AI journalists. The review loop described here is part of our production stack. If you're building video automation and want help with the QA layer, book a call.