Embodied AI Benchmarks

Benchmarks for robotics and VLA models should test generalization across tasks, scenes, and embodiments—without conflating representation and policy.

What makes evaluation hard?

Many embodied benchmarks report end-to-end task success, but that mixes multiple factors: perception, representation learning, planning, control, and reward shaping. If you want to improve a model’s action representation, you need a way to measure it directly.

  • Task success is entangled: policies can compensate for weak representations (or vice versa)
  • Embodiment matters: robot action spaces differ, making cross-platform comparison difficult
  • Long-tail generalization: rare actions and unseen scenes can dominate real-world performance

Representation benchmarks (probing-based)

A probing benchmark evaluates a frozen representation by training a lightweight head (e.g., a linear layer) on top of it for a specific prediction task. Because the encoder is held fixed, this isolates whether the embedding itself encodes the information needed for semantic understanding and control, independent of policy quality.

  • Semantic action classification: accuracy over action classes
  • Action regression: MSE for end-effector pose trajectories
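The two probes above can be sketched in a few lines. The snippet below is a minimal illustration, not LARYBench's actual protocol: it uses synthetic stand-ins for frozen encoder embeddings, a least-squares linear head as the lightweight probe, and a hypothetical 7-DoF end-effector pose target. All array names and dimensions are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen embeddings from a pretrained encoder;
# in practice these would come from the model under evaluation.
n, d, n_classes = 600, 32, 4
labels = rng.integers(0, n_classes, size=n)
class_means = rng.normal(size=(n_classes, d))
embeddings = class_means[labels] + 0.5 * rng.normal(size=(n, d))

# Split into probe-train and held-out probe-test sets.
train, test = np.arange(n) < 480, np.arange(n) >= 480

def fit_linear_probe(X, Y):
    """Least-squares linear head (with bias) on top of frozen features."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def predict(X, W):
    return np.hstack([X, np.ones((len(X), 1))]) @ W

# 1) Semantic action classification: one-hot targets, report accuracy.
onehot = np.eye(n_classes)[labels]
W_cls = fit_linear_probe(embeddings[train], onehot[train])
pred = predict(embeddings[test], W_cls).argmax(axis=1)
accuracy = (pred == labels[test]).mean()

# 2) Action regression: continuous targets (here a hypothetical
#    7-DoF end-effector pose linearly related to the embedding), report MSE.
A = rng.normal(size=(d, 7))
poses = embeddings @ A + 0.1 * rng.normal(size=(n, 7))
W_reg = fit_linear_probe(embeddings[train], poses[train])
mse = ((predict(embeddings[test], W_reg) - poses[test]) ** 2).mean()

print(f"probe accuracy: {accuracy:.3f}, pose MSE: {mse:.4f}")
```

Keeping the probe this small is the point: if a linear head already recovers action classes and poses from the frozen embedding, the information is present in the representation rather than supplied by a powerful downstream policy.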

LARYBench applies this probing approach to latent action representations; see its overview for details.