Latent Action Representation

A compact representation that captures “what changed between frames” in a way that can support robot control and generalize across embodiments.
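One minimal way to picture this is an encoder that maps a pair of consecutive frames to a small vector summarizing the change between them. The sketch below is purely illustrative: the random-projection "encoder", the frame size, and the 16-d latent dimension are all hypothetical stand-ins for a learned model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 64x64 grayscale frames, 16-d latent action.
FRAME_DIM = 64 * 64
LATENT_DIM = 16

# Stand-in frame encoder: a fixed random projection. A real system
# would use a learned vision encoder here.
W_enc = rng.standard_normal((FRAME_DIM, 256)) / np.sqrt(FRAME_DIM)
# Head mapping the change in features to a compact latent action.
W_act = rng.standard_normal((256, LATENT_DIM)) / np.sqrt(256)

def encode_frame(frame: np.ndarray) -> np.ndarray:
    return np.tanh(frame.reshape(-1) @ W_enc)

def latent_action(frame_t: np.ndarray, frame_t1: np.ndarray) -> np.ndarray:
    """Compress 'what changed' between two consecutive frames."""
    delta = encode_frame(frame_t1) - encode_frame(frame_t)
    return delta @ W_act

a = latent_action(rng.random((64, 64)), rng.random((64, 64)))
print(a.shape)  # (16,)
```

Note that identical frames yield a zero latent, reflecting "no change"; the representation depends only on the difference in encoded features, not on either frame alone.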

Why it matters

Internet-scale human video is abundant, but robots typically need expensive action labels. Latent action representations aim to extract a reusable action signal from pure visual sequences so that learning can scale with data, much as large vision encoders scale with images.

  • Bridges modalities: from pixels to action-relevant features
  • Improves transfer: less tied to a single robot’s control interface
  • Enables scaling: supports pretraining on unlabeled video

How to evaluate representation quality

A common approach is probing: keep the representation model fixed and train a lightweight head to predict action targets. This tests whether the embedding contains usable action information without conflating it with a heavy downstream policy.

  • Semantic probing: classify atomic/composite actions
  • Control probing: regress end-effector pose trajectories (MSE)

See: LARYBench overview.