Overview

XL-VLA is a vision-language-action framework for dexterous manipulation that learns an embodiment-invariant latent action space shared across diverse dexterous hands. This enables training a single policy on pooled multi-hand demonstrations and transferring it to new hand–task configurations with minimal adaptation effort.

  • Data scaling: large-scale teleoperation dataset with 10 tasks across 4 dexterous hands (Ability, Inspire, X-Hand1, Paxini DexH13), totaling ~2M state-action pairs.
  • Cross-hand latent actions: an unsupervised latent autoencoder learns a unified action representation that can be decoded into different hand joint spaces.
  • VLA integration: XL-VLA plugs the latent action space into a standard VLA architecture to predict one latent action chunk per step and decode it into hand-specific joint trajectories.
  • Zero-shot generalization: stronger transfer to unseen hand–task combinations compared to raw-joint baselines.
  • Cross-embodiment scaling: with unified latent actions, performance improves as demonstrations from additional hand embodiments are pooled, mirroring the gains from scaling data on a single hand.

Summary Video

Some videos are accelerated.

Method

XL-VLA learns a shared latent action space across dexterous hands and trains a single VLA policy to predict latent action chunks that are decoded into each hand’s joint trajectories.

  • Per-hand encoders/decoders: map between raw joint actions and a shared latent space.
  • Latent chunk prediction: the VLA predicts a single latent vector per step instead of tokenizing high-rate dexterous control in raw joint space.
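The two components above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the linear per-hand decoders stand in for learned networks, and the latent dimension, chunk length, and DoF counts are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 16  # hypothetical shared latent size
CHUNK_LEN = 8    # hypothetical action-chunk horizon

# Illustrative per-hand joint dimensions (actual DoF counts differ)
HAND_DOF = {"ability": 10, "inspire": 12, "xhand1": 12, "dexh13": 13}

# Hypothetical per-hand linear decoders (learned networks in practice)
decoders = {h: rng.normal(size=(d, LATENT_DIM)) / np.sqrt(LATENT_DIM)
            for h, d in HAND_DOF.items()}

def vla_policy(obs):
    """Stand-in for the VLA backbone: maps an observation to one
    latent action chunk (CHUNK_LEN x LATENT_DIM) per step."""
    return rng.normal(size=(CHUNK_LEN, LATENT_DIM))

def decode_chunk(hand, latent_chunk):
    # (CHUNK_LEN, LATENT_DIM) -> (CHUNK_LEN, hand DoF)
    return latent_chunk @ decoders[hand].T

obs = rng.normal(size=(64,))            # placeholder observation
chunk = vla_policy(obs)                 # one latent chunk per step
traj = decode_chunk("inspire", chunk)   # hand-specific joint trajectory
assert traj.shape == (CHUNK_LEN, HAND_DOF["inspire"])
```

Because the policy predicts in the shared latent space, the same predicted chunk can be decoded into any hand's joint space by swapping the decoder.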

Tasks & Dataset

We design 10 manipulation tasks and collect large-scale teleoperation demonstrations across multiple dexterous hand embodiments.

Abbrev. | Task               | Description (from paper)
PF      | Prepare Fruits     | Put the banana and orange on the green board for cutting.
SC      | Stack Cans         | Stack the cheese can on top of the salt.
SoC     | Sort Cans          | Put the tomato can and the cheese can into the container.
HB      | Hand over Bottle   | Hand over the white bottle from right hand to left hand.
RL      | Re-organize Lemons | Put the yellow lemon and the green lime into the bowl.
PS      | Pour Sauce         | Pour mustard sauce into the meat can.
RB      | Re-arrange Boxes   | Keep the table organized by re-arranging the two boxes.
PuS     | Push Sugar         | Push the sugar boxes together.
PoS     | Pour Sugar         | Add sugar to the starfruit.
PC      | Push Cans          | Push the two tomato cans together.

Zero-shot Generalization

We evaluate our policy on novel hand–task combinations that are unseen during training.

Unitree G1 Demos

We co-train on humanoid (Unitree G1) data together with tabletop data, and find that co-training with the shared latent action space outperforms training with raw actions.

xArm Demos

xArm tabletop arms also exhibit cross-embodiment scaling with the latent action space, similar to the G1 results.

Latent Retargeting

Below we show sequences decoded from the latent action space and retargeted to multiple target embodiments.
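Conceptually, retargeting here means decoding one shared latent sequence into several hands' joint spaces. The sketch below assumes hypothetical linear decoders and illustrative DoF counts; the actual system uses learned decoders.

```python
import numpy as np

rng = np.random.default_rng(2)
LATENT_DIM = 16  # hypothetical shared latent size
SEQ_LEN = 30     # length of the latent sequence to retarget

# Illustrative target embodiments and DoF counts
TARGETS = {"xhand1": 12, "inspire": 12, "dexh13": 13}

# Hypothetical linear decoders standing in for the learned per-hand decoders
decoders = {h: rng.normal(size=(d, LATENT_DIM)) / np.sqrt(LATENT_DIM)
            for h, d in TARGETS.items()}

# One shared latent trajectory ...
latent_seq = rng.normal(size=(SEQ_LEN, LATENT_DIM))

# ... decoded into every target embodiment's joint space
retargeted = {h: latent_seq @ W.T for h, W in decoders.items()}
for h, d in TARGETS.items():
    assert retargeted[h].shape == (SEQ_LEN, d)
```

The same latent sequence thus yields a joint trajectory per embodiment, which is what the videos below visualize.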

X-Hand1 (3x)

Inspire (3x)

Paxini DexH13 (3x)

All (3x w/ Rerun)

BibTeX


@article{jiang2026cross,
  title={Cross-Hand Latent Representation for Vision-Language-Action Models},
  author={Jiang, Guangqi and Liang, Yutong and Ye, Jianglong and Huang, Jia-Yang and Jing, Changwei and Duan, Rocky and Abbeel, Pieter and Wang, Xiaolong and Zou, Xueyan},
  journal={arXiv preprint},
  year={2026}
}