# Learning from Human Videos
A practical path to scaling embodied intelligence: leverage internet-scale video, learn general representations, then align to robot control.
## Why video is both promising and difficult
Human video contains rich interaction signals (hands, objects, intent), but it typically lacks the robot-specific action labels needed for control. This makes direct supervised policy learning hard at web scale.
- Cheap to scale: video is plentiful and diverse
- Hard to supervise: no direct robot action labels
- Domain mismatch: embodiment and viewpoint vary widely
## A scalable bridge: latent action representations
Latent action representations aim to encode frame-to-frame changes into an embedding that captures action-relevant information and transfers across embodiments, so a model can be pretrained on human video and later aligned to robot control.
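As a concrete illustration, here is a minimal sketch of one common instantiation: an inverse-dynamics encoder that compresses a frame pair into a latent action, paired with a forward model that must reconstruct the next frame from that latent. The class names, architecture sizes, and reconstruction objective below are illustrative assumptions (written in PyTorch), not a specific published method.

```python
# Minimal latent-action sketch (hypothetical names; PyTorch assumed).
# An inverse-dynamics encoder maps a frame pair to a latent action z,
# and a forward model reconstructs the next frame from (frame_t, z);
# training both end-to-end forces z to capture the action-relevant change.
import torch
import torch.nn as nn


class LatentActionEncoder(nn.Module):
    """Encodes the change between two consecutive RGB frames into a latent action."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=4, stride=2, padding=1),  # 6 = two stacked RGB frames
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frame_t, frame_t1):
        return self.net(torch.cat([frame_t, frame_t1], dim=1))


class ForwardModel(nn.Module):
    """Predicts the next frame from the current frame and the latent action."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU()
        )
        self.inject = nn.Linear(latent_dim, 64)  # broadcast z into the image features
        self.decode = nn.Conv2d(64, 3, kernel_size=3, padding=1)

    def forward(self, frame_t, z):
        h = self.encode(frame_t) + self.inject(z)[:, :, None, None]
        return self.decode(h)


def latent_action_loss(encoder, forward_model, frame_t, frame_t1):
    """Reconstruction objective: z must explain the frame-to-frame change."""
    z = encoder(frame_t, frame_t1)
    pred_t1 = forward_model(frame_t, z)
    return nn.functional.mse_loss(pred_t1, frame_t1)
```

Because this objective needs only raw video, the same encoder can be trained on human and robot clips alike, and the latent action serves as a shared interface that downstream policies can be aligned to.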
If you’re evaluating these representations, start with the probing-based benchmark: LARYBench.
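Probing-based evaluation generally means freezing the pretrained encoder and fitting a small readout (often linear) from its embeddings to ground-truth robot actions on a labeled subset. The sketch below shows that generic recipe; it does not reflect LARYBench's actual interface, and the dataset handling and hyperparameters are assumptions for illustration.

```python
# Generic linear-probing sketch (not LARYBench's API; data layout is hypothetical):
# freeze the pretrained latent-action encoder, fit a linear probe from its
# embeddings to ground-truth actions, and report held-out error.
import torch
import torch.nn as nn


@torch.no_grad()
def embed_pairs(encoder, frames_t, frames_t1, batch_size=256):
    """Compute frozen latent-action embeddings for a labeled set of frame pairs."""
    encoder.eval()
    chunks = []
    for i in range(0, len(frames_t), batch_size):
        chunks.append(encoder(frames_t[i:i + batch_size], frames_t1[i:i + batch_size]))
    return torch.cat(chunks)


def linear_probe(train_z, train_actions, test_z, test_actions, epochs=100, lr=1e-2):
    """Fit a linear map z -> action and return held-out MSE."""
    probe = nn.Linear(train_z.shape[1], train_actions.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(probe(train_z), train_actions)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return nn.functional.mse_loss(probe(test_z), test_actions).item()
```

A low probe error on held-out robot data suggests the frozen embedding already carries action-relevant information, without confounding the comparison with full policy fine-tuning.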