# Learning from Human Videos
A practical path to scaling embodied intelligence: leverage internet-scale video, learn general representations, then align to robot control.
## Why video is both promising and difficult
Human video contains rich interaction signals (hands, objects, intent), but it typically lacks the robot-specific action labels needed for control. This makes direct supervised policy learning hard at web scale.
- Cheap to scale: video is plentiful and diverse
- Hard to supervise: no direct robot action labels
- Domain mismatch: embodiment and viewpoint vary widely
## A scalable bridge: latent action representations
Latent action representations aim to encode frame-to-frame changes into an embedding that captures action-relevant information and transfers across embodiments, so a model can be pretrained on human video and later aligned to robot control.
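As a concrete illustration, here is a minimal sketch of one common instantiation: an inverse-dynamics encoder that compresses a frame pair into a latent action, paired with a forward model that must reconstruct the next frame from that latent. The class names, architecture sizes, and reconstruction objective below are illustrative assumptions (written in PyTorch), not a specific published method.

```python
# Minimal latent-action sketch (hypothetical names; PyTorch assumed).
# An inverse-dynamics encoder maps a frame pair to a latent action z,
# and a forward model reconstructs the next frame from (frame_t, z);
# training both end-to-end forces z to capture the action-relevant change.
import torch
import torch.nn as nn


class LatentActionEncoder(nn.Module):
    """Encodes the change between two consecutive RGB frames into a latent action."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=4, stride=2, padding=1),  # 6 = two stacked RGB frames
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frame_t, frame_t1):
        return self.net(torch.cat([frame_t, frame_t1], dim=1))


class ForwardModel(nn.Module):
    """Predicts the next frame from the current frame and the latent action."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU()
        )
        self.inject = nn.Linear(latent_dim, 64)  # broadcast z into the image features
        self.decode = nn.Conv2d(64, 3, kernel_size=3, padding=1)

    def forward(self, frame_t, z):
        h = self.encode(frame_t) + self.inject(z)[:, :, None, None]
        return self.decode(h)


def latent_action_loss(encoder, forward_model, frame_t, frame_t1):
    """Reconstruction objective: z must explain the frame-to-frame change."""
    z = encoder(frame_t, frame_t1)
    pred_t1 = forward_model(frame_t, z)
    return nn.functional.mse_loss(pred_t1, frame_t1)
```

Because this objective needs only raw video, the same encoder can be trained on human and robot clips alike, and the latent action serves as a shared interface that downstream policies can be aligned to.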
If you’re evaluating these representations, start with the probing-based benchmark: LARYBench.
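Probing-based evaluation generally means freezing the pretrained encoder and fitting a small readout (often linear) from its embeddings to ground-truth robot actions on a labeled subset. The sketch below shows that generic recipe; it does not reflect LARYBench's actual interface, and the dataset handling and hyperparameters are assumptions for illustration.

```python
# Generic linear-probing sketch (not LARYBench's API; data layout is hypothetical):
# freeze the pretrained latent-action encoder, fit a linear probe from its
# embeddings to ground-truth actions, and report held-out error.
import torch
import torch.nn as nn


@torch.no_grad()
def embed_pairs(encoder, frames_t, frames_t1, batch_size=256):
    """Compute frozen latent-action embeddings for a labeled set of frame pairs."""
    encoder.eval()
    chunks = []
    for i in range(0, len(frames_t), batch_size):
        chunks.append(encoder(frames_t[i:i + batch_size], frames_t1[i:i + batch_size]))
    return torch.cat(chunks)


def linear_probe(train_z, train_actions, test_z, test_actions, epochs=100, lr=1e-2):
    """Fit a linear map z -> action and return held-out MSE."""
    probe = nn.Linear(train_z.shape[1], train_actions.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(probe(train_z), train_actions)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return nn.functional.mse_loss(probe(test_z), test_actions).item()
```

A low probe error on held-out robot data suggests the frozen embedding already carries action-relevant information, without confounding the comparison with full policy fine-tuning.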