V-JEPA 2
Video Joint Embedding Predictive Architecture for physical world understanding
Overview
V-JEPA 2 (Video Joint Embedding Predictive Architecture) learns representations of the physical world from video through self-supervised learning. Instead of reconstructing pixels, it predicts future video states in a learned embedding space.
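To make "prediction without pixel-level reconstruction" concrete, here is a minimal NumPy sketch of a joint-embedding predictive objective. It is a toy illustration, not the V-JEPA 2 implementation: the linear encoders, the L1 loss, the EMA target update, and every name and dimension (`encode`, `latent_prediction_loss`, `ema_update`, `PATCH_DIM`, `EMBED_DIM`) are assumptions made for this example. It also assumes the context and future clips contain the same number of patches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, chosen only for illustration.
PATCH_DIM, EMBED_DIM = 48, 16


def encode(patches, weights):
    """Toy encoder: (num_patches, PATCH_DIM) -> (num_patches, EMBED_DIM)."""
    return np.tanh(patches @ weights)


def latent_prediction_loss(context, future, enc_w, target_w, pred_w):
    """Mean L1 distance between predicted and target embeddings."""
    context_emb = encode(context, enc_w)   # embed the observed patches
    predicted = context_emb @ pred_w       # predict embeddings of the future patches
    target = encode(future, target_w)      # targets are embeddings, never pixels
    return float(np.mean(np.abs(predicted - target)))


def ema_update(target_w, enc_w, momentum=0.99):
    """Assumed scheme: target weights slowly track the online encoder."""
    return momentum * target_w + (1.0 - momentum) * enc_w


# Random stand-ins for flattened video patches (8 patches per clip).
enc_w = rng.normal(scale=0.1, size=(PATCH_DIM, EMBED_DIM))
target_w = enc_w.copy()
pred_w = rng.normal(scale=0.1, size=(EMBED_DIM, EMBED_DIM))
context = rng.normal(size=(8, PATCH_DIM))
future = rng.normal(size=(8, PATCH_DIM))

loss = latent_prediction_loss(context, future, enc_w, target_w, pred_w)
target_w = ema_update(target_w, enc_w)
```

The point of the sketch is that the loss compares two embeddings, so the model is never asked to reproduce every pixel of the future frames. In a real training loop, gradients would flow only through the online encoder and the predictor, with the target encoder updated separately.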
Key Features
- Self-supervised video representation learning
- Physical world understanding
- Prediction in embedding space, with no pixel-level reconstruction
- Scales to large video datasets
- Transfer to robotics and embodied AI tasks