LARYBench
Latent Action Representation Yielding Benchmark — a systematic benchmark aiming to be an “ImageNet for embodied action representations”.
What problem does it solve?
Modern Vision-Language-Action (VLA) systems face a core data mismatch: the internet offers massive amounts of human video, but robot learning typically requires costly, precisely labeled robot actions. LARYBench evaluates whether a latent action model can extract a general, embodiment-agnostic action representation from visual sequences alone, without relying on end-to-end task success rates that entangle policy quality with representation quality.
- Representation-first evaluation: measure the quality of the latent embedding z directly with shallow probing heads
- Cross-embodiment: multiple robot morphologies and human first-person data
- Multi-granularity actions: from low-level control to high-level semantics
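To make the "representation-first" idea concrete, a latent action model can be thought of as a function from an observation pair to an embedding z. The sketch below is purely illustrative: the class name `LatentActionEncoder`, the method `encode_pair`, and the random-projection "model" are hypothetical stand-ins, not part of LARYBench or any evaluated system.

```python
import numpy as np

class LatentActionEncoder:
    """Hypothetical latent action model: maps a pair of observations to an
    embodiment-agnostic latent z. A real model would be a trained network;
    this stub projects the flattened frame difference through a fixed matrix."""

    def __init__(self, obs_dim: int, z_dim: int = 32, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((obs_dim, z_dim)) / np.sqrt(obs_dim)

    def encode_pair(self, obs_t: np.ndarray, obs_t1: np.ndarray) -> np.ndarray:
        # The latent should capture what *changed* between frames, i.e. the action.
        delta = (obs_t1 - obs_t).astype(np.float64).ravel()
        return delta @ self.proj

enc = LatentActionEncoder(obs_dim=64 * 64 * 3)
z = enc.encode_pair(np.zeros((64, 64, 3)), np.ones((64, 64, 3)))
print(z.shape)  # (32,)
```

Any model exposing this kind of interface, whether a specialized LAM or a general vision encoder, can then be probed under the same protocol.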
Benchmark setup (high level)
Given an image pair or short video sequence, a latent action model produces an embedding z. LARYBench evaluates z via:
- Semantic action classification: probing head predicts atomic/composite action classes
- Proprioceptive action regression: a lightweight decoder predicts end-effector pose trajectories from z; evaluated with mean squared error (MSE)
This design makes it possible to compare representation families (specialized LAMs vs. general vision encoders) under a consistent protocol.
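The protocol boils down to: freeze the encoder, fit a shallow head on its embeddings, and score the head on held-out data. A minimal sketch of the regression probe on synthetic data follows; the data, the ridge-regularized closed-form probe, and all names here are illustrative assumptions, not the official LARYBench harness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen embeddings z (stand-ins for a latent action model's outputs) and
# targets: here, 7-d end-effector poses for the regression probe.
Z = rng.standard_normal((500, 32))                  # 500 clips, 32-d latent
W_true = rng.standard_normal((32, 7))               # hidden linear map to pose
poses = Z @ W_true + 0.01 * rng.standard_normal((500, 7))

def fit_linear_probe(Z, Y, lam=1e-3):
    """Shallow probe = ridge-regularized linear head, fit in closed form."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ Y)

W = fit_linear_probe(Z[:400], poses[:400])          # train split
pred = Z[400:] @ W                                  # held-out split
mse = float(np.mean((pred - poses[400:]) ** 2))     # regression metric
print(f"held-out pose MSE: {mse:.4f}")
```

The semantic classification probe works the same way, with a shallow classification head and accuracy in place of MSE. Because the head is deliberately weak, differences in the score reflect the representation rather than the probe.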
What’s included?
- Scale: 1M+ labeled clips (1000+ hours total)
- Action taxonomy: 151 action types organized from fine-grained to coarse granularity
- Modalities: image pairs, motion trajectories, and video segments
- Diversity: first/third-person, real/sim, single-arm and bimanual platforms
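The modality and diversity axes above imply a per-clip record combining frames, labels, and optional trajectories. The dataclass below is a hypothetical schema for such a record; the field names and encodings are assumptions, and LARYBench's actual on-disk format may differ.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class ActionClip:
    """Hypothetical per-clip record reflecting the listed modalities."""
    frames: np.ndarray        # (T, H, W, 3) video segment; T=2 for an image pair
    action_label: str         # one of the benchmark's 151 action types
    viewpoint: str            # "first_person" or "third_person"
    source: str               # "real" or "sim"
    embodiment: str           # e.g. "single_arm", "bimanual", "human"
    ee_trajectory: Optional[np.ndarray] = None  # (T, 7) end-effector poses, if any

clip = ActionClip(
    frames=np.zeros((2, 64, 64, 3), dtype=np.uint8),
    action_label="pick",
    viewpoint="first_person",
    source="real",
    embodiment="single_arm",
)
print(clip.action_label, clip.frames.shape)
```

Keeping trajectories optional lets the same record type cover human first-person video, where proprioceptive labels are typically absent.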