Latent Action Representation
A compact representation that captures “what changed between frames” in a way that can support robot control and generalize across embodiments.
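As a toy sketch of the idea, the snippet below maps a pair of consecutive frames to a compact latent. Everything here is an illustrative assumption: a real latent action model learns the compression (e.g. with an inverse-dynamics or quantized-autoencoder objective), whereas this sketch just projects the frame difference with a fixed random matrix.

```python
import numpy as np

def latent_action(frame_t: np.ndarray, frame_t1: np.ndarray,
                  proj: np.ndarray) -> np.ndarray:
    """Compress 'what changed between frames' into a low-dim latent.

    Illustrative only: `proj` stands in for a learned encoder.
    """
    diff = (frame_t1 - frame_t).ravel()        # the change signal
    z = proj @ diff                            # compress to low dim
    return z / (np.linalg.norm(z) + 1e-8)      # unit-normalize

rng = np.random.default_rng(0)
H, W, D = 8, 8, 16                             # tiny frames, 16-dim latent
proj = rng.standard_normal((D, H * W))         # stand-in for a trained model
f0, f1 = rng.random((H, W)), rng.random((H, W))
z = latent_action(f0, f1, proj)
print(z.shape)                                 # (16,)
```

Note that identical frames yield a zero latent: with no visual change, there is no action signal to extract.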
Why it matters
Internet-scale human video is abundant, but robots typically need expensive action labels. Latent action representations aim to extract a reusable action signal from pure visual sequences so that learning can scale with data, much as large vision encoders scale with images.
- Bridges modalities: from pixels to action-relevant features
- Improves transfer: less tied to a single robot’s control interface
- Enables scaling: supports pretraining on unlabeled video
How to evaluate representation quality
A common approach is probing: keep the representation model frozen and train a lightweight head to predict action targets. This tests whether the embedding itself contains usable action information, without conflating representation quality with the capacity of a heavy downstream policy.
- Semantic probing: classify atomic/composite actions
- Control probing: regress end-effector pose trajectories (MSE)
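The control-probing recipe above can be sketched as follows. All shapes, the synthetic data, and the choice of a linear least-squares head are illustrative assumptions, not a prescribed benchmark protocol; the key property is that the encoder producing the embeddings is never updated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "representation": pretend these are embeddings from a fixed
# latent action model (hypothetical shapes, for illustration only).
N, D_emb, D_act = 200, 32, 7                 # samples, embed dim, action dim
Z = rng.standard_normal((N, D_emb))          # frozen embeddings
W_true = rng.standard_normal((D_emb, D_act))
A = Z @ W_true + 0.01 * rng.standard_normal((N, D_act))  # action targets

# Lightweight probe: a linear head fit by least squares on a train
# split, evaluated by MSE on a held-out split. Only the head is fit.
train, test = slice(0, 150), slice(150, N)
W_hat, *_ = np.linalg.lstsq(Z[train], A[train], rcond=None)
mse = float(np.mean((Z[test] @ W_hat - A[test]) ** 2))
print(f"control-probe MSE: {mse:.4f}")
```

A low probe MSE indicates that action-relevant information is linearly decodable from the frozen embedding; a semantic probe would follow the same pattern with a classification head and accuracy instead of MSE.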
See: LARYBench overview.