V-JEPA 2

Video Joint Embedding Predictive Architecture for physical world understanding

Overview

V-JEPA 2 (Video Joint Embedding Predictive Architecture) learns representations of the physical world from video through self-supervised learning. Rather than reconstructing pixels, it predicts future states of video in a learned embedding space.
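
To make the idea of predicting in embedding space concrete, here is a minimal NumPy sketch of a latent prediction objective. It is an illustration under stated assumptions, not the V-JEPA 2 implementation: the linear "encoders", the function names (`encode`, `latent_prediction_loss`, `ema_update`), the L1 loss, the momentum-averaged target weights, and all sizes are hypothetical stand-ins chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(tokens, weights):
    """Toy encoder: a linear map plus tanh (stand-in for a video transformer)."""
    return np.tanh(tokens @ weights)

def latent_prediction_loss(tokens, mask, pos, enc_w, target_w, pred_w):
    """Mean L1 distance between predicted and target embeddings of masked tokens."""
    # Context branch: encode only the visible (unmasked) tokens.
    context = encode(tokens[~mask], enc_w)
    # Summarise the visible context, then condition on each masked position.
    summary = context.mean(axis=0)
    predicted = (summary + pos[mask]) @ pred_w
    # Target branch: encode the full clip and keep the masked positions.
    # In real training no gradient would flow through this branch.
    target = encode(tokens, target_w)[mask]
    # The loss compares embeddings; pixels are never reconstructed.
    return float(np.abs(predicted - target).mean())

def ema_update(target_w, enc_w, momentum=0.99):
    """Move the target weights slowly toward the context encoder weights."""
    return momentum * target_w + (1.0 - momentum) * enc_w

# Hypothetical sizes: 8 video tokens, 16 input features, 4 embedding dimensions.
num_tokens, in_dim, emb_dim = 8, 16, 4
tokens = rng.standard_normal((num_tokens, in_dim))
pos = rng.standard_normal((num_tokens, emb_dim))       # positional embeddings
enc_w = rng.standard_normal((in_dim, emb_dim)) * 0.1   # context encoder weights
pred_w = rng.standard_normal((emb_dim, emb_dim)) * 0.1 # predictor weights
target_w = enc_w.copy()                                # target starts as a copy

mask = np.zeros(num_tokens, dtype=bool)
mask[[2, 5, 6]] = True                                 # tokens hidden from the context

loss = latent_prediction_loss(tokens, mask, pos, enc_w, target_w, pred_w)
target_w = ema_update(target_w, enc_w)
print(f"latent prediction loss: {loss:.4f}")
```

The point of the sketch is the shape of the objective: the error is measured between embeddings of the hidden regions, so the model is not asked to account for every pixel-level detail.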

Key Features

  • Self-supervised video representation learning
  • Physical world understanding
  • No pixel-level reconstruction needed
  • Scales to large video datasets
  • Transfers to robotics and embodied AI tasks