From Generators to Generalists: Evidence that Video Models Are Zero‑Shot Learners and Reasoners

Structured abstract

Background. Large language models (LLMs) moved NLP from many task‑specific systems to generalist foundation models using simple primitives: web‑scale data and generative training objectives. The paper asks whether modern video generators are following the same trajectory for vision, focusing on Veo 3 (with comparisons to Veo 2).

Objective. To test whether a single video model, prompted only with an image (first frame) and text, can perceive, model, manipulate, and reason about visual scenes in zero‑shot settings, i.e., without task‑specific training or adapters.

Methods.
Black‑box prompting. The authors use Vertex AI’s public Veo 2/3 endpoints (720p, 24 FPS, ~8 s per clip), which include an LLM‑based prompt rewriter; they treat rewriter+generator as one system. For several reasoning tasks, a standalone LLM cannot solve the task from the image alone, suggesting visual (not purely linguistic) competence. Quantitatively, 7 tasks are benchmarked; qualitatively, 62 tasks cover the perception→modeling→manipulation→reasoning hierarchy. In total, 18,384 videos are generated (17,640 quantitative + 744 qualitative). Both best‑frame and last‑frame scores are reported, because the model tends to keep animating after it has reached a solution (Table 1; Figs. 3–9; pp. 5–9, 31).
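The difference between best‑frame and last‑frame evaluation can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function name `best_and_last_frame_scores` and the `score_frame` callback are hypothetical, and the per‑frame scores in the usage note are made up.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def best_and_last_frame_scores(
    frames: Sequence[Frame],
    score_frame: Callable[[Frame], float],
) -> Tuple[float, float]:
    """Return (best-frame score, last-frame score) for one generated clip.

    `score_frame` is a hypothetical task-specific metric (for example an
    edge F-score or an IoU against ground truth) applied to a single frame.
    """
    if not frames:
        raise ValueError("a clip must contain at least one frame")
    scores = [score_frame(frame) for frame in frames]
    # Best-frame: an optimistic upper bound over the whole clip.
    # Last-frame: the pre-specified target, available without ground truth.
    return max(scores), scores[-1]
```

For example, if a clip's per‑frame scores were 0.2, 0.6, 0.9, 0.7, the best‑frame score would be 0.9 and the last‑frame score 0.7: the model reached a good answer and then drifted away from it.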

Results. Veo 3 shows strong zero‑shot breadth:
• Perception: Edge detection OIS pass@10 ≈ 0.77 best‑frame (last‑frame ≈ 0.74), clearly exceeding Veo 2 (0.57/0.51) though below SOTA (0.90). Many “false positives” reflect real edges missing from the ground truth (Fig. 3 p. 5; Fig. 60 p. 31).
• Segmentation: Class‑agnostic, scene‑wide segmentation achieves mIoU ≈ 0.74 best‑frame with a green background vs 0.66 with white; last‑frame ≈ 0.56 (Fig. 4 p. 6).
• Manipulation: Object extraction (animal counting/alignment) reaches 93% pass@10 on the last frame, far above Veo 2’s chance‑level results (Fig. 5 p. 6); a small human study favors Veo 3 over Veo 2 on edit fidelity and precision (Fig. 6 p. 7).
• Reasoning: Maze solving improves markedly from Veo 2 to Veo 3; on 5×5 grids, pass@10 is 78% vs 14% (Fig. 7 p. 7). For visual symmetry, Veo 3 strongly outperforms Veo 2 and an image‑editor baseline across the “shapes” and “random” splits; prompt wording can swing pass@1 by 40–64 points (Fig. 8 p. 8; Table 2 p. 40). On visual analogies, Veo 3 is strong on color (95% pass@1) and resize (67%), but below chance on reflect (29%) and rotate (19%), indicating systematic geometric biases (Fig. 9 p. 8; Fig. 61 p. 38).
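To make the segmentation metric above concrete, here is a minimal sketch of intersection‑over‑union (IoU) and its mean over mask pairs. It assumes masks are given as sets of pixel coordinates and that predicted and ground‑truth masks have already been paired up; how the paper matches masks is not described in this summary, and the names `iou` and `mean_iou` are illustrative.

```python
from typing import Iterable, Set, Tuple

Pixel = Tuple[int, int]


def iou(pred: Set[Pixel], truth: Set[Pixel]) -> float:
    """Intersection-over-union of two masks given as sets of (row, col)."""
    union = pred | truth
    if not union:
        # Two empty masks agree perfectly (a convention, assumed here).
        return 1.0
    return len(pred & truth) / len(union)


def mean_iou(pairs: Iterable[Tuple[Set[Pixel], Set[Pixel]]]) -> float:
    """Average IoU over already-matched (prediction, ground-truth) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one mask pair")
    return sum(iou(p, t) for p, t in pairs) / len(pairs)
```

Two masks that share one of three total pixels have an IoU of 1/3; averaging that with a perfect match gives an mIoU of 2/3.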

Limitations. (i) Black‑box composition with an LLM rewriter complicates attribution. (ii) Reported numbers are lower bounds because performance is highly prompt‑sensitive, and “best‑frame” scores can exceed practical “last‑frame” reliability (pp. 5, 9–10, 40). (iii) Weaknesses remain in metric geometry, symbolic tasks, and physically constrained interactions (Sec. D, pp. 42–46).

Conclusions. As LLMs did for language, video generators are beginning to look like general‑purpose vision systems: one prompted model covers many tasks and shows early signs of visual reasoning via a chain‑of‑frames (CoF) process (Figs. 1–2, pp. 2–3). While specialists remain stronger per task, the breadth, the Veo 2→Veo 3 gains, and the test‑time compute headroom (pass@10 ≫ pass@1) suggest a plausible path to vision foundation models.

Key results (bulleted)

• Breadth across the vision stack: 62 qualitative tasks span Perception→Modeling→Manipulation→Reasoning (Fig. 1 p. 2; Fig. 2 p. 3).
• Edge detection (zero‑shot): Veo 3 OIS ~0.77 best‑frame; detailed edges often exceed dataset annotations (Figs. 3 & 60, pp. 5, 31).
• Segmentation (class‑agnostic): mIoU ~0.74 best‑frame with a green background; prompt context matters (Fig. 4 p. 6).
• Object extraction: 93% pass@10 on last‑frame, with Veo 2 near chance (Fig. 5 p. 6).
• Editing quality: Human raters favor Veo 3 over Veo 2 on fidelity and precision (Fig. 6 p. 7).
• Maze solving: 78% pass@10 on 5×5 grids (Veo 2: 14%), robust to irregular layouts where baselines fail (Fig. 7 p. 7).
• Visual symmetry: Large margin over Veo 2 and an image‑editor baseline; prompt wording swings pass@1 by 40–64 points (Fig. 8 p. 8; Table 2 p. 40).
• Analogies: Strong on color/resize; below chance on reflect/rotate, with majority vote getting worse as k grows, which is evidence of systematic bias (Figs. 9 & 61, pp. 8, 38).
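Why would majority voting get worse with more samples? A simplified calculation shows the effect. The sketch below assumes a two‑outcome model (each sample is independently "correct" with probability p, otherwise it lands on a single competing wrong answer) and odd k so there are no ties. This is an illustrative simplification, not the paper's evaluation setup, and `majority_correct_probability` is a hypothetical name.

```python
from math import comb


def majority_correct_probability(p_correct: float, k: int) -> float:
    """Probability that the correct answer wins a majority vote over k samples.

    Assumes independent samples and only two possible answers, so this is
    the upper tail of a Binomial(k, p_correct) distribution.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    needed = k // 2 + 1  # strict majority
    return sum(
        comb(k, i) * p_correct**i * (1.0 - p_correct) ** (k - i)
        for i in range(needed, k + 1)
    )


if __name__ == "__main__":
    for p in (0.29, 0.95):
        row = [round(majority_correct_probability(p, k), 3) for k in (1, 3, 5, 9)]
        print(p, row)
```

Under these assumptions, a per‑sample accuracy below 0.5 (such as the 29% reported for reflect) makes the vote less accurate as k grows, while a high per‑sample accuracy (such as 95% for color) makes it more accurate. Extra samples amplify whichever answer the model prefers, so a systematic bias toward a wrong answer is reinforced rather than averaged out.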

Expanded discussion

1) Perception beyond the training objective

The paper shows zero‑shot super‑resolution, denoising, deblurring, low‑light enhancement, keypoints, edge detection, and segmentation without bespoke heads or fine‑tuning (Figs. 10–16, pp. 17–18; Fig. 3 p. 5; Fig. 4 p. 6). Particularly telling is the edge‑map audit (Fig. 60 p. 31): many “false positives” are actually legitimate edges (e.g., tire treads), exposing dataset limitations rather than model hallucination. These results suggest that generative video training internalizes general‑purpose visual primitives, analogous to emergent zero‑shot classification in image diffusion models.

2) Modeling the physical world

Under “Modeling,” Veo 3 renders flammability, rigid/soft body behavior, air resistance, buoyancy, optical refraction/reflection, additive/subtractive color mixing, category abstractions, part‑whole parsing, and state memory under camera motion (Figs. 21–31, pp. 19–23). For example, a glass sphere inverts the background (refraction; Fig. 27 p. 22), and a bottle cap floats while a rock sinks (Fig. 24 p. 21). While not a full physics engine, these proto‑physics capabilities support later manipulation and reasoning.

3) Manipulation as controllable imagination

Veo 3 performs background removal, style transfer, colorization, inpainting/outpainting, text manipulation, scene composition, novel‑view synthesis, 3D re‑posing, and even professional portraitization (Figs. 32–43, pp. 23–26). The quantitative object extraction test provides a clean, verifiable success metric (93% pass@10 last‑frame; Fig. 5 p. 6), and a small user study shows that Veo 3 edits are both more faithful and more precise than Veo 2’s (Fig. 6 p. 7).

4) Reasoning by “chain‑of‑frames” (CoF)

Because a video solution must unfold step by step, the model can “think by doing.” The paper demonstrates tree BFS, graph traversal, sequence completion, symmetry completion, tool use, toy Sudoku, water puzzles, and maze/navigation (Figs. 48–59, pp. 28–30). Mazes (Fig. 7 p. 7) and symmetry (Fig. 8 p. 8) are especially diagnostic, since they require spatial planning and constraint satisfaction. However, analogical reasoning exposes coordinate‑frame weaknesses (reflect/rotate; Fig. 9 p. 8; Fig. 61 p. 38).
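Maze solving is attractive as a benchmark because a solution can be verified mechanically. The sketch below shows one way such a check might look, assuming the path traced in the video has already been extracted as a sequence of grid cells. The grid encoding ('#' for wall, '.' for free cell) and the function `is_valid_maze_path` are assumptions for illustration, not the paper's verifier.

```python
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]  # (row, col)


def is_valid_maze_path(
    grid: List[str], path: Sequence[Cell], start: Cell, goal: Cell
) -> bool:
    """Check that `path` runs from start to goal through free cells only.

    The path must stay inside the grid, avoid '#' cells, and move exactly
    one step up, down, left, or right at a time (no jumps or diagonals).
    """
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == "#":
            return False
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return False
    return True
```

A check like this rejects the two failure modes one would expect from a generative model: cutting through walls and teleporting between cells.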

5) Evaluation choices matter: best‑frame vs last‑frame; pass@k

The authors report both best‑frame (upper bound) and last‑frame (pre‑specified target) scores because Veo often continues animating after solving the task (pp. 5, 9). Across tasks, pass@10 ≫ pass@1, indicating that test‑time compute strategies (sampling + selection) could lift practical reliability without retraining (Figs. 3–9, pp. 5–9).
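For readers unfamiliar with pass@k, the following sketch computes the standard unbiased estimator used in the code‑generation literature: given n samples of which c are correct, it estimates the probability that at least one of k randomly drawn samples succeeds. Whether the paper uses exactly this estimator is not stated in this summary, so treat it as an assumption; `pass_at_k` is an illustrative name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n attempts with c successes.

    pass@k = 1 - C(n - c, k) / C(n, k): one minus the probability that
    all k samples drawn without replacement are failures.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the result is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 attempts and a single success, pass@1 is 0.1 but pass@10 is 1.0. That gap is why a large pass@10 versus pass@1 difference signals headroom for sampling‑and‑selection strategies, provided a reliable way to pick the successful sample exists.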

6) Where it fails—and why that’s informative

Failures cluster around metric geometry (depth/normals; Figs. 62–63, p. 42), following force/trajectory annotations (Fig. 64 p. 42), symbolic constraints (word searches, equations; Figs. 67, 69, pp. 43–44), and contact‑rich physics (collisions, breaking glass, constrained motion, cloth folding; Figs. 72–77, pp. 44–46). These failures point to data/objective gaps and control‑interface limits in current models.

7) Economics and trajectory

The paper argues that, as with LLMs, inference costs are falling quickly, making generalists economically attractive over time (pp. 9–10). Combined with the Veo 2→3 improvements and CoF behavior, this supports the thesis that video models are on a path to becoming vision foundation models (Figs. 1–2, pp. 2–3).

Author

Deep Intellica Team
