LARYBench

Latent Action Representation Yielding Benchmark — a systematic benchmark aiming to be an “ImageNet for embodied action representations”.

What problem does it solve?

Modern Vision-Language-Action (VLA) systems face a core mismatch: the internet has massive human video data, but robot learning typically requires costly, precisely labeled robot actions. LARYBench evaluates whether a latent action model can extract a general, embodiment-agnostic action representation from visual sequences—without relying on end-to-end task success rates that entangle policy and representation.

  • Representation-first evaluation: measure the quality of z directly with shallow probing heads
  • Cross-embodiment: multiple robot morphologies and human first-person data
  • Multi-granularity actions: from low-level control to high-level semantics
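To make the representation-first idea concrete, here is a minimal sketch of the interface such an evaluation assumes: a latent action model maps a pair of frames to a fixed-size embedding z. The class name, method name, and the random-projection "encoder" are all illustrative stand-ins, not part of LARYBench itself.

```python
import numpy as np

# Illustrative stand-in: a latent action model maps a frame pair to a
# fixed-size embedding z. A real model would use a learned encoder; here
# a random linear projection of the frame difference plays that role.
class LatentActionModel:
    def __init__(self, z_dim=128, frame_shape=(64, 64, 3), seed=0):
        self.z_dim = z_dim
        d = int(np.prod(frame_shape))
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((d, z_dim)) / np.sqrt(d)

    def encode(self, frame_t, frame_t1):
        # The "action" signal is taken from the change between frames.
        delta = (frame_t1 - frame_t).reshape(-1)
        return delta @ self.proj

model = LatentActionModel()
f0 = np.zeros((64, 64, 3), dtype=np.float32)
f1 = np.ones((64, 64, 3), dtype=np.float32)
z = model.encode(f0, f1)
print(z.shape)
```

Any encoder exposing this frame-pair-to-z mapping, whether a specialized latent action model or a general vision backbone, can then be scored by the same probing heads.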

Benchmark setup (high level)

Given an image pair or short video sequence, a latent action model produces an embedding z. LARYBench evaluates z via:

  • Semantic action classification: a probing head predicts atomic/composite action classes
  • Proprioceptive action regression: a lightweight decoder predicts end-effector pose trajectories; evaluated with MSE
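The two probes above can be sketched as follows. This is a hedged illustration of the general protocol (frozen embeddings, shallow heads), not LARYBench's actual harness: the data is synthetic, and the heads shown (a least-squares regression head and a nearest-class-mean classifier) are simple stand-ins for whatever shallow heads the benchmark specifies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: frozen embeddings z for 200 clips, plus targets.
Z = rng.standard_normal((200, 32))                            # 32-dim latents
W_true = rng.standard_normal((32, 7))
poses = Z @ W_true + 0.01 * rng.standard_normal((200, 7))     # pose targets
labels = (Z[:, 0] > 0).astype(int)                            # action classes

# Proprioceptive regression probe: linear head fit by least squares,
# scored with MSE. Only the head is trained; Z stays frozen.
W_hat, *_ = np.linalg.lstsq(Z, poses, rcond=None)
mse = float(np.mean((Z @ W_hat - poses) ** 2))

# Semantic classification probe: nearest-class-mean head, scored by accuracy.
means = np.stack([Z[labels == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((Z[:, None, :] - means[None]) ** 2).sum(-1), axis=1)
acc = float((pred == labels).mean())

print(f"probe MSE={mse:.4f}, probe acc={acc:.2f}")
```

Because the heads are shallow and cheap to fit, differences in probe scores can be attributed to the quality of z rather than to policy learning.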

This design makes it possible to compare representation families (specialized LAMs vs. general vision encoders) under a consistent protocol.

What’s included?

  • Scale: 1M+ labeled clips (1000+ hours total)
  • Action taxonomy: 151 action types organized from fine- to coarse-grained levels
  • Modalities: image pairs, motion trajectories, and video segments
  • Diversity: first/third-person, real/sim, single-arm and bimanual platforms