AINA 🪞: Dexterity from Smart Lenses
Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations

¹New York University  ²Meta

We present AINA, a framework for learning multi-fingered policies from in-the-wild human data collected with smart glasses, without requiring any robot data (including online corrections or simulation).

Abstract

Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community. Achieving this would mark significant progress toward generalizable robot manipulation in human environments, as it would reduce the reliance on labor-intensive robot interaction data collection. Despite substantial efforts, progress toward this goal has been bottlenecked by the embodiment gap between humans and robots, as well as by difficulties in extracting relevant contextual and motion cues that enable learning of autonomous policies from in-the-wild human videos. We claim that with simple yet sufficiently powerful hardware for obtaining human data and our proposed framework AINA, we are now one significant step closer to achieving this dream.

AINA enables learning multi-fingered policies from data collected by anyone, anywhere, and in any environment using Aria Gen 2 glasses. These glasses are lightweight and portable, feature a high-resolution RGB camera, provide accurate on-board 3D head and hand poses, and offer a wide stereo view that can be leveraged for depth estimation of the scene. This setup enables the learning of 3D point-based policies for multi-fingered hands that are robust to background changes and can be deployed directly without requiring any robot data (including online corrections, reinforcement learning, or simulation).

Overview

The workflow is as follows: a human wears the Aria Gen 2 glasses and collects in-the-wild demonstrations on any surface with arbitrary backgrounds (left), then records a single demonstration in the robot deployment space (middle), after which point-based policies are trained and deployed directly on the robot (right). With an average of just 15 minutes of human video collection effort, AINA is able to train autonomous robot policies.

Collected In-The-Wild Human Data

We show example trajectories of the limited in-the-wild human data we collect with Aria glasses for training AINA robot policies. These are natural kitchen / office / lab scenes outside the robot workspace.

Policy Rollouts

Autonomous policy rollouts from AINA across multiple manipulation tasks. These policies are trained using only human data.

Method

1. Data Processing

Given a single in-the-wild demonstration, we first apply 2D object tracking to each frame using language-prompted off-the-shelf computer vision models. Next, we rectify the images from the front two SLAM cameras and estimate the scene depth with FoundationStereo. Finally, we unproject the 2D object tracks using the estimated depth to obtain 3D object tracks aligned with the on-board hand poses.
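The final unprojection step can be sketched as follows. This is a minimal illustration assuming a standard pinhole camera model; the function name, argument layout, and the use of nearest-pixel depth lookup are our own simplifications, not the paper's exact implementation.

```python
import numpy as np

def unproject_tracks(tracks_2d, depth, K):
    """Lift 2D pixel tracks into 3D camera-frame points using a depth map.

    tracks_2d: (N, 2) array of (u, v) pixel coordinates
    depth:     (H, W) depth map in meters (e.g. from stereo matching)
    K:         (3, 3) pinhole intrinsics matrix
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = tracks_2d[:, 0], tracks_2d[:, 1]
    # Nearest-pixel depth lookup at each tracked location
    z = depth[v.astype(int), u.astype(int)]
    # Invert the pinhole projection: x = (u - cx) * z / fx, etc.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)  # (N, 3) camera-frame points
```

A track at the principal point with depth z maps to (0, 0, z), which is a quick sanity check for the intrinsics convention.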

2. Human Demonstration Domain Alignment

In AINA, we use a single in-scene human demonstration as an anchor to align all in-the-wild human demonstrations. Specifically, we shift the object and hand points using the demonstrations’ center of mass, and we align their orientations by applying a rotation around the hands’ gravity axis. Here, we illustrate the importance of both steps and show how the alignment appears when either the shifting or rotation is omitted. In all the videos below, the red points (light: hand, dark: object) represent the in-the-wild demonstration being transformed, while the blue points (light: hand, dark: object) represent the in-scene demonstration. The in-scene demonstration remains the same across all videos, whereas the transformed in-the-wild demonstrations vary depending on the alignment method.
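The two alignment steps above (center-of-mass shifting, then rotation about the gravity axis) can be sketched as below. This is a minimal sketch under our own assumptions: the rotation angle `theta` is taken as given (in AINA it would come from aligning the hands' orientations), and the function signature is illustrative, not the paper's code.

```python
import numpy as np

def align_in_the_wild(points_wild, points_anchor, theta):
    """Align an in-the-wild demo's points to the in-scene anchor demo.

    points_wild:   (N, 3) hand + object points from an in-the-wild demo
    points_anchor: (M, 3) hand + object points from the in-scene demo
    theta:         rotation (radians) about the gravity (z) axis,
                   e.g. estimated from the hands' orientations
    """
    # Step 1: shift using the demonstrations' centers of mass
    com_wild = points_wild.mean(axis=0)
    com_anchor = points_anchor.mean(axis=0)
    # Step 2: rotate about the gravity axis
    c, s = np.cos(theta), np.sin(theta)
    R_z = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return (points_wild - com_wild) @ R_z.T + com_anchor
```

Omitting the shift corresponds to dropping the center-of-mass terms; omitting the rotation corresponds to `theta = 0`, which reproduces the misalignment failure modes shown in the videos below.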

AINA

We showcase how the alignment looks when both shifting and rotation are applied.

No Shifting and Rotation

We illustrate how the alignment appears when neither shifting nor rotation is applied. As shown, although all objects lie on a similar plane, their positions are significantly misaligned.

No Rotation

We illustrate how the alignment appears when only shifting is applied. As shown, because the world frame is assigned randomly during in-the-wild data collection, the rotation of the hand points cannot be predicted, and their orientation may be significantly misaligned. In this case, the hand is fully rotated, and the object positions appear swapped.

3. Policy Learning

After this extraction, given a history of object and fingertip keypoints, we first pass them through a vector-neuron MLP to obtain a latent representation for each point. These latents are then fed into a transformer encoder as separate tokens. However, since only the fingertips have correspondence across different demonstrations, we apply learned positional encodings only to the fingertip tokens. The transformer encoder's output is passed through an MLP to predict a future trajectory of points. Finally, the loss is computed as the MSE between the ground-truth and predicted fingertip points, and the entire system is trained end-to-end with this objective.
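The key design choice here, adding positional encodings only to the fingertip tokens, can be sketched as follows. This is a schematic numpy illustration, not the trained model: the random arrays stand in for the vector-neuron MLP latents and the transformer, and all dimensions and names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                        # latent dimension per point token
n_fingertips, n_object = 5, 20

# Per-point latents (stand-ins for the vector-neuron MLP outputs)
fingertip_tok = rng.normal(size=(n_fingertips, D))
object_tok = rng.normal(size=(n_object, D))

# Learned positional encodings are added only to fingertip tokens,
# since only fingertips correspond across different demonstrations;
# object points have no fixed ordering, so they get none.
pos_enc = rng.normal(size=(n_fingertips, D))
tokens = np.concatenate([fingertip_tok + pos_enc, object_tok], axis=0)

# Stand-in for: transformer encoder -> MLP -> future fingertip points
pred = tokens[:n_fingertips, :3]
target = rng.normal(size=(n_fingertips, 3))
mse_loss = np.mean((pred - target) ** 2)  # objective trained end-to-end
```

Because the loss is computed only on fingertip points, the object tokens act purely as context for the prediction rather than as supervised targets.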

Experiments

1. Comparison to Baselines

TABLE I: Performance comparison of policies trained with different datasets. Comparisons are done in similar deployment scenarios, with a minimum of 10 trials each.

Method                             Toaster Press   Toy Picking
In-Scene Only                           3/10           1/10
In-The-Wild Only                        0/10           0/10
In-Scene Transform & In-The-Wild        0/10           1/10
In-Scene Training & In-The-Wild         6/10           2/10
AINA                                   13/15          13/15

TABLE II: Comparison of the success rates of AINA to policies trained with RGB images as input.

Method                             Oven Opening   Drawer Opening
Masked BAKU                            6/15            1/15
Masked BAKU with History               0/15            0/15
AINA                                  12/15           11/15

2. Height Experiments

We test how AINA performs when the height of the operation surface changes for two of our tasks. For each height level, we collect an additional in-scene human demonstration and train new policies. We observe that, with minimal human effort, AINA generalizes well to different heights.

Height 1

Success Rate = 5/10

Height 2

Success Rate = 6/10

Height 3

Success Rate = 2/10

3. Object Generalization Experiments

We test the object generalization capabilities of AINA by deploying trained policies on different objects for three of our tasks, segmenting each new object with new text prompts. We observe that while AINA performs relatively well, it fails when the object's shape and weight are significantly different. The text prompts used are shown in the title of each video.

Popcorn Package, Bowl

Success Rate = 1/10

Toy, Bowl

Success Rate = 2/10

BibTeX

@misc{guzey2025dexteritysmartlensesmultifingered,
      title={Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations}, 
      author={Irmak Guzey and Haozhi Qi and Julen Urain and Changhao Wang and Jessica Yin and Krishna Bodduluri and Mike Lambeta and Lerrel Pinto and Akshara Rai and Jitendra Malik and Tingfan Wu and Akash Sharma and Homanga Bharadhwaj},
      year={2025},
      eprint={2511.16661},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2511.16661}, 
    }