- Data scaling: a large-scale teleoperation dataset covering 10 tasks across 4 dexterous hands (Ability, Inspire, X-Hand1, Paxini DexH13), totaling ~2M state-action pairs.
- Cross-hand latent actions: an unsupervised latent autoencoder learns a unified action representation that can be decoded into different hand joint spaces.
- VLA integration: XL-VLA plugs the latent action space into a standard VLA architecture; at each step the model predicts one latent action chunk and decodes it into hand-specific joint trajectories (see the sketch after this list).
- Zero-shot generalization: stronger transfer to unseen hand–task combinations compared to raw-joint baselines.
- Cross-embodiment scaling: with unified latent actions, performance keeps improving as demonstrations from additional hand embodiments are added, mirroring the gains from scaling up data for a single hand.
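
To make the second and third bullets concrete, here is a minimal PyTorch-style sketch of a cross-hand latent action autoencoder and the decode step a VLA policy would use. All class names, layer sizes, joint counts, and the reconstruction objective are illustrative assumptions, not the actual XL-VLA implementation.

```python
# A minimal sketch, assuming PyTorch; names, dimensions, and the loss below
# are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn

# Hypothetical joint counts for the four hands (placeholder values).
HAND_DOF = {"ability": 10, "inspire": 12, "xhand1": 12, "dexh13": 13}
LATENT_DIM = 16  # assumed size of the shared latent action space


class CrossHandLatentAutoencoder(nn.Module):
    """Hand-specific encoders/decoders around one shared latent action space."""

    def __init__(self, latent_dim: int = LATENT_DIM):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dof, 64), nn.ReLU(), nn.Linear(64, latent_dim))
            for name, dof in HAND_DOF.items()
        })
        self.decoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, dof))
            for name, dof in HAND_DOF.items()
        })

    def encode(self, hand: str, joints: torch.Tensor) -> torch.Tensor:
        return self.encoders[hand](joints)

    def decode(self, hand: str, z: torch.Tensor) -> torch.Tensor:
        return self.decoders[hand](z)


def reconstruction_loss(model, hand, joints):
    """Unsupervised objective: reconstruct each hand's joints via the shared latent."""
    z = model.encode(hand, joints)
    return nn.functional.mse_loss(model.decode(hand, z), joints)


# At policy time, the VLA head would predict a chunk of latent actions per step,
# which the decoder maps into any hand's joint space, e.g.:
model = CrossHandLatentAutoencoder()
chunk = torch.randn(8, LATENT_DIM)    # one predicted latent action chunk
traj = model.decode("inspire", chunk)  # hand-specific joint trajectory
print(traj.shape)                      # torch.Size([8, 12])
```

The point of the design, as described in the bullets, is that only the lightweight per-hand decoders are embodiment-specific: the policy predicts in the unified latent space, so one VLA head can drive any of the hands.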