LARYBench

Latent Action Representation Yielding Benchmark — a systematic benchmark aiming to be an “ImageNet for embodied action representations”.

What problem does it solve?

Modern Vision-Language-Action (VLA) systems face a core mismatch: the internet has massive human video data, but robot learning typically requires costly, precisely labeled robot actions. LARYBench evaluates whether a latent action model can extract a general, embodiment-agnostic action representation from visual sequences—without relying on end-to-end task success rates that entangle policy and representation.

  • Representation-first evaluation: measure the quality of z directly with shallow probing heads
  • Cross-embodiment: multiple robot morphologies and human first-person data
  • Multi-granularity actions: from low-level control to high-level semantics
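To make the representation-first idea concrete, here is a minimal sketch of the interface such an evaluation assumes: a latent action model maps a pair of frames to a fixed-size embedding z. The class name, method name, and the random-projection "encoder" are all illustrative stand-ins, not part of LARYBench itself.

```python
import numpy as np

# Illustrative stand-in: a latent action model maps a frame pair to a
# fixed-size embedding z. A real model would use a learned encoder; here
# a random linear projection of the frame difference plays that role.
class LatentActionModel:
    def __init__(self, z_dim=128, frame_shape=(64, 64, 3), seed=0):
        self.z_dim = z_dim
        d = int(np.prod(frame_shape))
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((d, z_dim)) / np.sqrt(d)

    def encode(self, frame_t, frame_t1):
        # The "action" signal is taken from the change between frames.
        delta = (frame_t1 - frame_t).reshape(-1)
        return delta @ self.proj

model = LatentActionModel()
f0 = np.zeros((64, 64, 3), dtype=np.float32)
f1 = np.ones((64, 64, 3), dtype=np.float32)
z = model.encode(f0, f1)
print(z.shape)
```

Any encoder exposing this frame-pair-to-z mapping, whether a specialized latent action model or a general vision backbone, can then be scored by the same probing heads.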

Benchmark setup (high level)

Given an image pair or short video sequence, a latent action model produces an embedding z. LARYBench evaluates z via:

  • Semantic action classification: a probing head predicts atomic/composite action classes
  • Proprioceptive action regression: a lightweight decoder predicts end-effector pose trajectories; evaluated with MSE
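The two probes above can be sketched as follows. This is a hedged illustration of the general protocol (frozen embeddings, shallow heads), not LARYBench's actual harness: the data is synthetic, and the heads shown (a least-squares regression head and a nearest-class-mean classifier) are simple stand-ins for whatever shallow heads the benchmark specifies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: frozen embeddings z for 200 clips, plus targets.
Z = rng.standard_normal((200, 32))                            # 32-dim latents
W_true = rng.standard_normal((32, 7))
poses = Z @ W_true + 0.01 * rng.standard_normal((200, 7))     # pose targets
labels = (Z[:, 0] > 0).astype(int)                            # action classes

# Proprioceptive regression probe: linear head fit by least squares,
# scored with MSE. Only the head is trained; Z stays frozen.
W_hat, *_ = np.linalg.lstsq(Z, poses, rcond=None)
mse = float(np.mean((Z @ W_hat - poses) ** 2))

# Semantic classification probe: nearest-class-mean head, scored by accuracy.
means = np.stack([Z[labels == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((Z[:, None, :] - means[None]) ** 2).sum(-1), axis=1)
acc = float((pred == labels).mean())

print(f"probe MSE={mse:.4f}, probe acc={acc:.2f}")
```

Because the heads are shallow and cheap to fit, differences in probe scores can be attributed to the quality of z rather than to policy learning.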

This design makes it possible to compare representation families (specialized LAMs vs. general vision encoders) under a consistent protocol.

What’s included?

  • Scale: 1M+ labeled clips (1000+ hours total)
  • Action taxonomy: 151 action types organized from fine- to coarse-grained levels
  • Modalities: image pairs, motion trajectories, and video segments
  • Diversity: first/third-person, real/sim, single-arm and bimanual platforms