Gemini 3 Flash Agentic Vision - The Sandwich Architecture for pixel-perfect UI generation. Phase 1: SURVEYOR measures layout BEFORE generation (grids, spacing, colors). Phase 2: QA TESTER verifies AFTER render (SSIM, diff regions, auto-fix). "Measure twice, cut once" - generator gets hard data, not guesses. Use when: video-to-code, image-to-code, UI verification, layout measurement, pixel-perfect generation, SSIM comparison, auto-fix suggestions.
Installation

After installing, this skill will be available to your AI coding assistant. Verify installation:

npx agent-skills-cli list

Skill Instructions
name: agentic-vision
description: |
  Gemini 3 Flash Agentic Vision - The Sandwich Architecture for pixel-perfect UI generation.
  Phase 1: SURVEYOR measures layout BEFORE generation (grids, spacing, colors).
  Phase 2: QA TESTER verifies AFTER render (SSIM, diff regions, auto-fix).
  "Measure twice, cut once" - the generator gets hard data, not guesses.
  Use when: video-to-code, image-to-code, UI verification, layout measurement, pixel-perfect generation, SSIM comparison, auto-fix suggestions.
user-invocable: true
Agentic Vision - The Sandwich Architecture
Version: 1.0.0
Last Updated: 2026-01-30
What is Agentic Vision?
Agentic Vision in Gemini 3 Flash converts image understanding from a static act into an agentic process. It combines visual reasoning with Code Execution.
Think → Act → Observe loop:
1. THINK: Analyze image, formulate plan
2. ACT: Generate and execute Python code (crop, measure, annotate)
3. OBSERVE: Process results, refine understanding
Key capability: Instead of "guessing" padding is p-4, it MEASURES and returns 24px.
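As a sketch of what such a measurement looks like inside the sandbox, padding can be found by scanning inward from an element's edge until non-background pixels appear. The function and the synthetic image below are illustrative, not the actual Surveyor code:

```python
import numpy as np

def measure_left_padding(img: np.ndarray, background: int = 0) -> int:
    """Count background-colored columns from the left edge up to the
    first column containing any foreground pixel (grayscale assumed)."""
    for x in range(img.shape[1]):
        if np.any(img[:, x] != background):
            return x
    return img.shape[1]  # entire image is background

# Synthetic 100px-wide card: content starts at x=24, so padding is 24px
card = np.zeros((50, 100), dtype=np.uint8)
card[:, 24:90] = 255
print(measure_left_padding(card))  # 24
```

This is why the Surveyor can report 24px instead of guessing between `p-4` and `p-6`.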
The Sandwich Architecture
REPLAY "SANDWICH" ARCHITECTURE

┌──────────┐
│  Video   │───────────────────────────────┐
│  Input   │                               ▼
└────┬─────┘           ┌───────────────────────────┐
     │                 │   PHASE 1: THE SURVEYOR   │
     │                 │   (Agentic Vision Flash)  │
     │                 ├───────────────────────────┤
     │                 │  1. Measure Grids (px)    │
     │                 │  2. Extract Colors (hex)  │
     │                 │  3. Map Layout (JSON)     │ ◀── KEY
     │                 └─────────────┬─────────────┘
     ▼                               ▼
┌──────────────┐       ┌───────────────────────────┐
│ Gemini 3 Pro │◀──────│    Architecture Specs     │
│  (Code Gen)  │       │     (Hard Data JSON)      │
└──────┬───────┘       └───────────────────────────┘
       │
       ▼
┌──────────────┐   ┌──────────────────────────────────┐
│ Render View  │──▶│      PHASE 2: THE QA TESTER      │
└──────────────┘   │      (Agentic Vision Flash)      │
                   ├──────────────────────────────────┤
                   │  1. Compare Original vs Render   │
                   │  2. "Spot the difference" (SSIM) │
                   │  3. Auto-fix suggestions         │
                   └────────────────┬─────────────────┘
                                    │
                                    ▼
                     ┌──────────────────────┐
                     │ FINAL PIXEL-PERFECT  │
                     │      COMPONENT       │
                     └──────────────────────┘
Phase 1: THE SURVEYOR
Measures layout BEFORE code generation.
API Endpoint
POST /api/survey/measure
{
imageBase64: string, // Base64 encoded frame
mimeType?: string, // default: 'image/png'
useParallel?: boolean, // default: true (faster)
includePromptFormat?: boolean // Include formatted prompt for generator
}
Response
{
success: true,
measurements: {
imageDimensions: { width: 1920, height: 1080 },
grid: { columns: 12, gap: "24px" },
spacing: {
sidebarWidth: "256px",
navHeight: "64px",
cardPadding: "24px",
sectionGap: "48px",
containerPadding: "32px"
},
colors: {
background: "#0f172a",
surface: "#1e293b",
primary: "#6366f1",
text: "#ffffff",
textMuted: "#94a3b8",
border: "#334155"
},
typography: {
h1: "48px",
h2: "32px",
body: "16px",
small: "14px"
},
components: [
{ type: "sidebar", bbox: {...}, confidence: 0.95 }
],
confidence: 0.91
},
promptFormat: "... formatted for code generator ..."
}
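The `promptFormat` string is produced by the backend, but the idea can be sketched as a small formatter that turns measured spacing into exact Tailwind arbitrary-value classes. The helper and class mapping below are hypothetical, not part of the library:

```python
def spacing_to_tailwind(spacing: dict) -> list[str]:
    """Map measured pixel values to Tailwind arbitrary-value classes
    so the generator emits p-[24px] instead of guessing p-4 or p-6."""
    mapping = {
        "sidebarWidth": "w-[{}]",
        "navHeight": "h-[{}]",
        "cardPadding": "p-[{}]",
        "sectionGap": "gap-[{}]",
        "containerPadding": "px-[{}]",
    }
    return [mapping[k].format(v) for k, v in spacing.items() if k in mapping]

spacing = {"sidebarWidth": "256px", "navHeight": "64px", "cardPadding": "24px"}
print(spacing_to_tailwind(spacing))  # ['w-[256px]', 'h-[64px]', 'p-[24px]']
```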
Code Usage
import { runParallelSurveyor, formatSurveyorDataForPrompt } from '@/lib/agentic-vision';
// 1. Run Surveyor on video frame
const { measurements } = await runParallelSurveyor(frameBase64, 'image/png');
// 2. Inject into code generator prompt
const prompt = `
${SYSTEM_PROMPT}
${formatSurveyorDataForPrompt(measurements)}
Generate code based on the video above.
`;
// 3. Generator uses EXACT values: p-[24px] not p-4
Phase 2: THE QA TESTER
Verifies generated UI AFTER render.
API Endpoint
POST /api/verify/diff
{
originalImageBase64: string, // Original frame from video
generatedImageBase64: string, // Screenshot of generated code
mimeType?: string, // default: 'image/png'
quickCheck?: boolean, // Only SSIM, skip full analysis
includeReport?: boolean // Include formatted text report
}
Response
{
success: true,
verification: {
ssimScore: 0.94,
overallAccuracy: "94%",
verdict: "needs_fixes", // "pass" | "needs_fixes" | "major_issues"
issues: [
{
type: "spacing",
severity: "medium",
location: "card padding",
description: "Card padding is 16px, should be 24px",
expected: "24px",
actual: "16px"
}
],
autoFixSuggestions: [
{
selector: ".card",
property: "padding",
suggestedValue: "24px",
confidence: 0.85
}
]
},
  report: "✅ QA VERIFICATION REPORT..."
}
Verdict Rules
| Verdict | Condition |
|---|---|
| pass | SSIM >= 0.95 AND no high-severity issues |
| needs_fixes | SSIM >= 0.85 AND <= 3 high-severity issues |
| major_issues | SSIM < 0.85 OR > 3 high-severity issues |
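The rules above map directly to code; a minimal sketch (the function name is illustrative):

```python
def verdict(ssim_score: float, high_severity_count: int) -> str:
    """Apply the verdict rules: pass / needs_fixes / major_issues."""
    if ssim_score >= 0.95 and high_severity_count == 0:
        return "pass"
    if ssim_score >= 0.85 and high_severity_count <= 3:
        return "needs_fixes"
    return "major_issues"

print(verdict(0.96, 0))  # pass
print(verdict(0.94, 1))  # needs_fixes
print(verdict(0.80, 0))  # major_issues
```

Note the ordering: a high SSIM score with even one high-severity issue falls through to `needs_fixes`, matching the "AND no high severity issues" condition for a pass.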
Enabling Code Execution
Agentic Vision requires the codeExecution tool in the Gemini API:
import { GoogleGenAI } from '@google/genai';
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
model: 'gemini-3-flash',
contents: [
{ text: prompt },
{ inlineData: { data: imageBase64, mimeType: 'image/png' } }
],
config: {
tools: [{ codeExecution: {} }] // <-- CRITICAL
}
});
// Response contains:
// - executableCode: { code: "Python code..." }
// - codeExecutionResult: { outcome: "OUTCOME_OK", output: "JSON result" }
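When parsing that response, the generated code and its sandbox output live in separate content parts. A sketch of extracting them, treating the response as plain dicts with the field names shown above:

```python
def extract_execution(parts: list[dict]) -> dict:
    """Pull the generated Python source and its sandbox output
    out of a list of response content parts."""
    result = {"code": None, "output": None}
    for part in parts:
        if "executableCode" in part:
            result["code"] = part["executableCode"]["code"]
        if "codeExecutionResult" in part:
            result["output"] = part["codeExecutionResult"]["output"]
    return result

# Example parts list mirroring the response shape above
parts = [
    {"text": "Measuring padding..."},
    {"executableCode": {"code": "print(24)"}},
    {"codeExecutionResult": {"outcome": "OUTCOME_OK", "output": "24\n"}},
]
print(extract_execution(parts))
```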
Available Python Libraries in Sandbox
# Data Science
import numpy as np
import pandas as pd
from scipy import ndimage
from sklearn import preprocessing
# Image Processing
from PIL import Image
from skimage import filters, measure, transform
from skimage.metrics import structural_similarity as ssim
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Utilities
import io
import json
Technical Considerations
1. Coordinate Normalization
Gemini may rescale images internally. Always request BOTH:
- Normalized coordinates (0.0-1.0)
- Image dimensions for backend rescaling
def normalize_bbox(x, y, w, h, img_width, img_height):
return {
"x": x / img_width,
"y": y / img_height,
"width": w / img_width,
"height": h / img_height
}
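On the backend, normalized boxes are mapped back to real pixels with the inverse transform; a sketch:

```python
def denormalize_bbox(bbox: dict, img_width: int, img_height: int) -> dict:
    """Inverse of normalize_bbox: recover pixel coordinates
    from normalized 0.0-1.0 values."""
    return {
        "x": round(bbox["x"] * img_width),
        "y": round(bbox["y"] * img_height),
        "width": round(bbox["width"] * img_width),
        "height": round(bbox["height"] * img_height),
    }

# A normalized navbar box on a 1920x1080 frame
box = {"x": 0.0, "y": 0.05, "width": 0.1333, "height": 0.0593}
print(denormalize_bbox(box, 1920, 1080))  # x=0, y=54, width=256, height=64
```

Rounding to whole pixels here is a deliberate choice: CSS values like `256px` must be integers, and sub-pixel residue from normalization is noise, not signal.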
2. Parallel Execution for Speed
Run color sampling and spacing measurement in parallel:
const [colors, spacing] = await Promise.all([
surveyColors(frame), // Fast
surveySpacing(frame) // Heavier CV
]);
// Time reduced by ~50%
3. SSIM with scikit-image
Use industry-standard SSIM calculation:
from skimage.metrics import structural_similarity as ssim

# img1 and img2 must be numpy arrays of identical shape;
# for color images, pass channel_axis=-1 (scikit-image >= 0.19)
score, diff_image = ssim(img1, img2, full=True)
# score: 0.0 (completely different) to 1.0 (identical)
# diff_image: per-pixel difference map for locating diff regions
Integration with Replay Pipeline
Before (Without Surveyor)
Video → Gemini Pro "guesses" → p-4 or p-6? → 3-5 iterations
After (With Sandwich Architecture)
Video → Surveyor MEASURES → padding: 24px → Generator EXECUTES → 1-2 iterations
Result: the generator starts from measured values instead of guesses, so the first render lands far closer to the target and typical iterations drop from 3-5 to 1-2.
File Structure
lib/agentic-vision/
├── index.ts       # Main exports
├── types.ts       # TypeScript interfaces
├── prompts.ts     # Surveyor & QA prompts
├── surveyor.ts    # Phase 1 implementation
└── qa-tester.ts   # Phase 2 implementation

app/api/
├── survey/measure/route.ts   # Surveyor endpoint
└── verify/diff/route.ts      # QA Tester endpoint
Quick Start
// Full pipeline with Agentic Vision
// 1. PHASE 1: Measure before generation
const surveyResult = await fetch('/api/survey/measure', {
method: 'POST',
body: JSON.stringify({
imageBase64: videoFrame,
includePromptFormat: true
})
});
const { measurements, promptFormat } = await surveyResult.json();
// 2. Generate code with HARD DATA
const codeResult = await generateWithConstraints(video, promptFormat);
// 3. Render and screenshot
const screenshot = await renderAndCapture(codeResult.code);
// 4. PHASE 2: Verify
const qaResult = await fetch('/api/verify/diff', {
method: 'POST',
body: JSON.stringify({
originalImageBase64: videoFrame,
generatedImageBase64: screenshot
})
});
const { verification } = await qaResult.json();
// 5. Check result
if (verification.verdict === 'pass') {
  console.log('✅ Pixel-perfect!');
} else {
  console.log('⚠️ Apply fixes:', verification.autoFixSuggestions);
}