Learning multi-fingered robot policies from humans performing
daily tasks in natural environments has long been a grand goal
in the robotics community. Achieving this would mark significant
progress toward generalizable robot manipulation in human environments,
as it would reduce the reliance on labor-intensive robot interaction data collection.
Despite substantial efforts, progress toward this goal has been bottlenecked
by the embodiment gap between humans and robots, as well as by the difficulty
of extracting the contextual and motion cues needed to learn
autonomous policies from in-the-wild human videos.
We claim that with simple yet sufficiently capable hardware
for collecting human data, combined with our proposed framework AINA, we are now one significant step closer to
realizing this goal.
AINA enables learning multi-fingered policies from data collected by anyone,
anywhere, and in any environment using Aria Gen 2 glasses.
These glasses are lightweight and portable, feature a high-resolution
RGB camera, provide accurate on-board 3D head and hand poses,
and offer a wide stereo view that can be leveraged for depth
estimation of the scene. This setup enables the learning of 3D
point-based policies for multi-fingered hands that are robust to
background changes and can be deployed directly, without any
robot data, online corrections, reinforcement learning, or simulation.