V-JEPA 2

Video Joint Embedding Predictive Architecture for physical world understanding

Overview

V-JEPA 2 (Video Joint Embedding Predictive Architecture) learns representations of the physical world from video through self-supervised learning. It predicts future states of a video in a learned embedding space rather than reconstructing pixels.
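To make the idea of predicting without pixel-level reconstruction concrete, here is a minimal NumPy sketch of a loss computed in embedding space. It is an illustration under stated assumptions, not the V-JEPA 2 implementation: the linear "encoders", the mean pooling, the L1 distance, the moving-average target update, and every name (encode, latent_prediction_loss, ema_update) are hypothetical stand-ins chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames, weights):
    """Toy encoder (assumption): flatten each frame and project it to an embedding."""
    flat = frames.reshape(frames.shape[0], -1)   # (T, H*W)
    return np.tanh(flat @ weights)               # (T, D)

def latent_prediction_loss(video, context_len, enc_w, target_w, pred_w):
    """Compare predicted and target embeddings; pixels are never reconstructed."""
    context, future = video[:context_len], video[context_len:]
    ctx_emb = encode(context, enc_w).mean(axis=0)    # pooled context embedding, (D,)
    predicted = ctx_emb @ pred_w                     # predicted future embedding, (D,)
    # In training, the target would be treated as a constant (no gradient through it).
    target = encode(future, target_w).mean(axis=0)   # embedding of the real future, (D,)
    return np.abs(predicted - target).mean()         # L1 distance in embedding space

def ema_update(target_w, enc_w, momentum=0.99):
    """Assumed target update: slow moving average of the context encoder weights."""
    return momentum * target_w + (1.0 - momentum) * enc_w

# Hypothetical sizes: 8 frames of 4x4 "pixels", 6-dimensional embeddings.
video = rng.standard_normal((8, 4, 4))
enc_w = rng.standard_normal((16, 6)) * 0.1
target_w = enc_w.copy()
pred_w = rng.standard_normal((6, 6)) * 0.1

loss = latent_prediction_loss(video, context_len=5, enc_w=enc_w,
                              target_w=target_w, pred_w=pred_w)
target_w = ema_update(target_w, enc_w)
print(float(loss))
```

The point of the sketch is the shape of the objective: the error is measured between two embedding vectors, so the model is not asked to account for every pixel of the future frames.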

Key Features

Self-supervised video representation learning
Physical world understanding
No pixel-level reconstruction needed
Scales to large video datasets
Transfer to robotics and embodied AI tasks