Vision-Language-Action (VLA)

VLA models map visual observations and language instructions to robot actions. The key challenge is generalization: new scenes, new tasks, and new embodiments.

The core bottleneck

The internet provides abundant video, but robots require executable action labels. This creates a gap between “what we can observe” (pixels) and “what we must output” (robot controls). Latent action representations provide a scalable intermediate target: they capture frame-to-frame dynamics in a form that is not tied to any one robot's action space (a minimal sketch follows the list below).

  • Data bottleneck: high-quality robot action data is expensive to collect
  • Representation bottleneck: robot-specific action spaces limit cross-embodiment transfer
  • Scaling bottleneck: manual supervision does not scale like web data
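
To make the latent-action idea concrete, the sketch below learns a latent z by predicting the next frame from the current frame and z, so z is forced to encode what changed between frames rather than any robot-specific command. Module names, feature sizes, and the use of pre-extracted frame features are illustrative assumptions, not LARYBench's actual code.

```python
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Sketch: infer a latent "action" z from consecutive frames (assumed feature vectors)."""
    def __init__(self, frame_dim: int = 512, latent_dim: int = 32):
        super().__init__()
        # Encoder: compress the frame pair into a small latent action z.
        self.action_encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Forward model: predict the next frame from the current frame and z.
        self.forward_model = nn.Sequential(
            nn.Linear(frame_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, frame_dim),
        )

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor):
        z = self.action_encoder(torch.cat([frame_t, frame_t1], dim=-1))
        pred_t1 = self.forward_model(torch.cat([frame_t, z], dim=-1))
        return z, pred_t1

# Training objective: reconstruct the next frame, so z must carry whatever
# changed between frames -- an embodiment-agnostic stand-in for an action.
model = LatentActionModel()
frame_t, frame_t1 = torch.randn(8, 512), torch.randn(8, 512)
z, pred = model(frame_t, frame_t1)
loss = nn.functional.mse_loss(pred, frame_t1)
loss.backward()
```

Because the objective only needs raw video, it can be trained on web-scale footage without any robot action labels, which is what makes it a candidate bridge across the three bottlenecks above.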

How LARYBench helps

LARYBench uses lightweight probing heads to test whether an embedding z extracted from visual sequences contains the information needed for both semantic actions and low-level control. This lets researchers compare representation families under a consistent protocol.
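
A plausible shape of that probing protocol is sketched below: the embedding is frozen, and two small heads are trained on top of it, one classifying a semantic action label and one regressing low-level control targets. The head sizes, label spaces (20 skills, 7-DoF actions), and loss choices are assumptions for illustration, not LARYBench's actual interface.

```python
import torch
import torch.nn as nn

class ProbeHeads(nn.Module):
    """Lightweight probes trained on top of a frozen embedding z (dimensions assumed)."""
    def __init__(self, z_dim: int = 256, num_skills: int = 20, action_dim: int = 7):
        super().__init__()
        # Semantic probe: classify the high-level skill from z.
        self.semantic_head = nn.Linear(z_dim, num_skills)
        # Control probe: regress low-level action targets from z.
        self.control_head = nn.Linear(z_dim, action_dim)

    def forward(self, z: torch.Tensor):
        return self.semantic_head(z), self.control_head(z)

def probe_loss(heads, z, skill_labels, actions):
    # z is detached: the representation under test is never fine-tuned,
    # so probe performance reflects what the embedding already contains.
    skill_logits, pred_actions = heads(z.detach())
    return (nn.functional.cross_entropy(skill_logits, skill_labels)
            + nn.functional.mse_loss(pred_actions, actions))

heads = ProbeHeads()
z = torch.randn(16, 256)              # frozen embeddings from the model under test
skills = torch.randint(0, 20, (16,))  # semantic action labels
actions = torch.randn(16, 7)          # low-level control targets
loss = probe_loss(heads, z, skills, actions)
loss.backward()
```

Keeping the heads small and the backbone frozen is what makes comparisons across representation families meaningful: differences in probe performance can be attributed to the embeddings rather than to head capacity or fine-tuning budget.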