Embodied AI Benchmarks
Benchmarks for robotics and VLA models should test generalization across tasks, scenes, and embodiments—without conflating representation and policy.
What makes evaluation hard?
Many embodied benchmarks report end-to-end task success, but that mixes multiple factors: perception, representation learning, planning, control, and reward shaping. If you want to improve a model’s action representation, you need a way to measure it directly.
- Task success is entangled: policies can compensate for weak representations (or vice versa)
- Embodiment matters: robot action spaces differ, making cross-platform comparison difficult
- Long-tail generalization: rare actions and unseen scenes can dominate real-world performance
Representation benchmarks (probing-based)
A probing benchmark freezes the representation and trains only a lightweight head (e.g., a linear probe) for a specific prediction target. This isolates whether the embedding itself encodes the information needed for semantics and control, independent of policy quality.
- Semantic action classification: accuracy over action classes
- Action regression: mean squared error (MSE) over predicted end-effector pose trajectories
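The two metrics above can be sketched with a closed-form linear probe. This is a minimal NumPy sketch on synthetic data; the `fit_linear_probe` helper, dimensions, and dataset are illustrative assumptions, not any benchmark's actual API.

```python
import numpy as np

def fit_linear_probe(Z, Y, ridge=1e-3):
    """Ridge-regression head on frozen embeddings: W = (Z'Z + lam*I)^-1 Z'Y."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + ridge * np.eye(d), Z.T @ Y)

def regression_mse(Z, Y, W):
    """Action-regression metric: MSE between predicted and target poses."""
    return float(np.mean((Z @ W - Y) ** 2))

def classification_accuracy(Z, labels, W):
    """Semantic-action metric: argmax over a one-hot-trained linear head."""
    return float(np.mean(np.argmax(Z @ W, axis=1) == labels))

rng = np.random.default_rng(0)

# Action regression: embeddings that linearly encode a 6-DoF end-effector pose.
Z = rng.normal(size=(512, 32))                      # frozen 32-dim embeddings
W_true = rng.normal(size=(32, 6))                   # hypothetical ground-truth map
Y = Z @ W_true + 0.01 * rng.normal(size=(512, 6))   # noisy pose targets
W = fit_linear_probe(Z[:400], Y[:400])
print("held-out pose MSE:", regression_mse(Z[400:], Y[400:], W))

# Semantic action classification: 4 action classes with class-mean structure.
labels = rng.integers(0, 4, size=512)
Zc = rng.normal(size=(512, 32)) + np.eye(4)[labels] @ rng.normal(size=(4, 32))
Wc = fit_linear_probe(Zc[:400], np.eye(4)[labels[:400]])
print("held-out accuracy:", classification_accuracy(Zc[400:], labels[400:], Wc))
```

Low held-out MSE and high held-out accuracy indicate the embedding linearly exposes pose and action-class information; a strong policy score with a weak probe score would suggest the policy is compensating for the representation.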
LARYBench uses this probing approach for latent action representations; see its overview for details.