Latent Action Representation

A compact representation that captures “what changed between frames” in a way that can support robot control and generalize across embodiments.
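One minimal way to picture this is an encoder that maps a pair of consecutive frames to a small vector summarizing the change between them. The sketch below is purely illustrative: the random-projection "encoder", the frame size, and the 16-d latent dimension are all hypothetical stand-ins for a learned model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 64x64 grayscale frames, 16-d latent action.
FRAME_DIM = 64 * 64
LATENT_DIM = 16

# Stand-in frame encoder: a fixed random projection. A real system
# would use a learned vision encoder here.
W_enc = rng.standard_normal((FRAME_DIM, 256)) / np.sqrt(FRAME_DIM)
# Head mapping the change in features to a compact latent action.
W_act = rng.standard_normal((256, LATENT_DIM)) / np.sqrt(256)

def encode_frame(frame: np.ndarray) -> np.ndarray:
    return np.tanh(frame.reshape(-1) @ W_enc)

def latent_action(frame_t: np.ndarray, frame_t1: np.ndarray) -> np.ndarray:
    """Compress 'what changed' between two consecutive frames."""
    delta = encode_frame(frame_t1) - encode_frame(frame_t)
    return delta @ W_act

a = latent_action(rng.random((64, 64)), rng.random((64, 64)))
print(a.shape)  # (16,)
```

Note that identical frames yield a zero latent, reflecting "no change"; the representation depends only on the difference in encoded features, not on either frame alone.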

Why it matters

Internet-scale human video is abundant, but robots typically need expensive action labels. Latent action representations aim to extract a reusable action signal from pure visual sequences so that learning can scale with data, much as large vision encoders scale with images.

  • Bridges modalities: from pixels to action-relevant features
  • Improves transfer: less tied to a single robot’s control interface
  • Enables scaling: supports pretraining on unlabeled video

How to evaluate representation quality

A common approach is probing: keep the representation model fixed and train a lightweight head to predict action targets. This tests whether the embedding contains usable action information without conflating it with a heavy downstream policy.

  • Semantic probing: classify atomic/composite actions
  • Control probing: regress end-effector pose trajectories (MSE)

See: LARYBench overview.