Overview

XL-VLA is a vision-language-action framework for dexterous manipulation that learns an embodiment-invariant latent action space shared across diverse dexterous hands. This enables training a single policy on pooled multi-hand demonstrations and transferring it to new hand–task configurations with minimal adaptation effort.

  • Data scaling: large-scale teleoperation dataset with 10 tasks across 4 dexterous hands (Ability, Inspire, X-Hand1, Paxini DexH13), totaling ~2M state-action pairs.
  • Cross-hand latent actions: an unsupervised latent autoencoder learns a unified action representation that can be decoded into different hand joint spaces.
  • VLA integration: XL-VLA plugs the latent action space into a standard VLA architecture to predict one latent action chunk per step and decode it into hand-specific joint trajectories.
  • Zero-shot generalization: stronger transfer to unseen hand–task combinations compared to raw-joint baselines.
  • Cross-embodiment scaling: with unified latent actions, performance improves as demonstrations from additional hand embodiments are pooled, mirroring the gains from scaling data on a single hand.

Summary Video

Some videos are accelerated.

Method

XL-VLA learns a shared latent action space across dexterous hands and trains a single VLA policy to predict latent action chunks that are decoded into each hand’s joint trajectories.

  • Per-hand encoders/decoders: map between raw joint actions and a shared latent space.
  • Latent chunk prediction: the VLA predicts a single latent vector per step instead of tokenizing high-rate dexterous control in raw joint space.
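The two components above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the linear per-hand decoders stand in for learned networks, and the latent dimension, chunk length, and DoF counts are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 16  # hypothetical shared latent size
CHUNK_LEN = 8    # hypothetical action-chunk horizon

# Illustrative per-hand joint dimensions (actual DoF counts differ)
HAND_DOF = {"ability": 10, "inspire": 12, "xhand1": 12, "dexh13": 13}

# Hypothetical per-hand linear decoders (learned networks in practice)
decoders = {h: rng.normal(size=(d, LATENT_DIM)) / np.sqrt(LATENT_DIM)
            for h, d in HAND_DOF.items()}

def vla_policy(obs):
    """Stand-in for the VLA backbone: maps an observation to one
    latent action chunk (CHUNK_LEN x LATENT_DIM) per step."""
    return rng.normal(size=(CHUNK_LEN, LATENT_DIM))

def decode_chunk(hand, latent_chunk):
    # (CHUNK_LEN, LATENT_DIM) -> (CHUNK_LEN, hand DoF)
    return latent_chunk @ decoders[hand].T

obs = rng.normal(size=(64,))            # placeholder observation
chunk = vla_policy(obs)                 # one latent chunk per step
traj = decode_chunk("inspire", chunk)   # hand-specific joint trajectory
assert traj.shape == (CHUNK_LEN, HAND_DOF["inspire"])
```

Because the policy predicts in the shared latent space, the same predicted chunk can be decoded into any hand's joint space by swapping the decoder.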

Tasks & Dataset

We design 10 manipulation tasks and collect large-scale teleoperation demonstrations across multiple dexterous hand embodiments.

Abbrev. | Task               | Description (from paper)
PF      | Prepare Fruits     | Put the banana and orange on the green board for cutting.
SC      | Stack Cans         | Stack the cheese can on top of the salt.
SoC     | Sort Cans          | Put the tomato can and the cheese can into the container.
HB      | Hand over Bottle   | Hand over the white bottle from right hand to left hand.
RL      | Re-organize Lemons | Put the yellow lemon and the green lime into the bowl.
PS      | Pour Sauce         | Pour mustard sauce into the meat can.
RB      | Re-arrange Boxes   | Keep the table organized by re-arranging the two boxes.
PuS     | Push Sugar         | Push the sugar boxes together.
PoS     | Pour Sugar         | Add sugar to the starfruit.
PC      | Push Cans          | Push the two tomato cans together.

Zero-shot Generalization

We evaluate our policy on novel hand–task combinations that are unseen during training.

Unitree G1 Demos

We co-train on humanoid (Unitree G1) data together with tabletop data, and find that co-training with the shared latent action space outperforms training with raw actions.

xArm Demos

xArm tabletop arms also exhibit cross-embodiment scaling with the latent action space, similar to the G1 results.

Latent Retargeting

Below we show sequences decoded from the latent action space and retargeted to multiple target embodiments.
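Conceptually, retargeting here means decoding one shared latent sequence into several hands' joint spaces. The sketch below assumes hypothetical linear decoders and illustrative DoF counts; the actual system uses learned decoders.

```python
import numpy as np

rng = np.random.default_rng(2)
LATENT_DIM = 16  # hypothetical shared latent size
SEQ_LEN = 30     # length of the latent sequence to retarget

# Illustrative target embodiments and DoF counts
TARGETS = {"xhand1": 12, "inspire": 12, "dexh13": 13}

# Hypothetical linear decoders standing in for the learned per-hand decoders
decoders = {h: rng.normal(size=(d, LATENT_DIM)) / np.sqrt(LATENT_DIM)
            for h, d in TARGETS.items()}

# One shared latent trajectory ...
latent_seq = rng.normal(size=(SEQ_LEN, LATENT_DIM))

# ... decoded into every target embodiment's joint space
retargeted = {h: latent_seq @ W.T for h, W in decoders.items()}
for h, d in TARGETS.items():
    assert retargeted[h].shape == (SEQ_LEN, d)
```

The same latent sequence thus yields a joint trajectory per embodiment, which is what the videos below visualize.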

X-Hand1 (3x)

Inspire (3x)

Paxini DexH13 (3x)

All (3x w/ Rerun)

BibTeX


@article{jiang2026cross,
  title={Cross-Hand Latent Representation for Vision-Language-Action Models},
  author={Jiang, Guangqi and Liang, Yutong and Ye, Jianglong and Huang, Jia-Yang and Jing, Changwei and Duan, Rocky and Abbeel, Pieter and Wang, Xiaolong and Zou, Xueyan},
  journal={arXiv preprint},
  year={2026}
}